Descriptive Statistics

Descriptive statistics is the branch of statistics that involves summarizing and describing data using numerical calculations and graphical displays. It transforms raw numbers into meaningful insights that help us understand patterns, trends, and variability.

This guide covers key definitions, organizing data with histograms and frequency tables, measures of center and spread, five-number summaries, box plots, outlier detection, memory aids, and a practice quiz.

1. Introduction

Have you ever looked at a huge table of numbers and felt overwhelmed? That is where descriptive statistics comes in: it organizes, presents, and interprets data in a way that is easy to understand.

Understanding descriptive statistics is crucial because it allows us to make sense of information, identify patterns and trends, communicate findings effectively, and inform decisions using data.

Picture This

Imagine you are a coach for a high school track team. You have just timed all 50 runners in the 100-meter dash. How do you figure out who the fastest runners are, what the typical time for your team is, or how spread out the abilities are? You use descriptive statistics: calculate the average time, find the fastest and slowest times, measure how spread out the times are, and create a histogram to visualize the distribution.

Why It Matters

Make Sense of Information

Turn complex datasets into simple, understandable summaries that reveal the story behind the numbers.

Identify Patterns & Trends

Spot what is common, what is unusual, and how things are changing over time.

Communicate Findings

Share insights with others using clear graphs, charts, and numerical summaries.

Inform Decisions

Use data to make better choices in business, science, government, and everyday life.

2. Key Definitions

Data

A collection of facts, such as numbers, words, measurements, observations, or descriptions of things.

Population

The entire group of individuals or instances about whom we want to draw conclusions. It is the whole pie.

Sample

A subset of the population from which data is actually collected. It is a slice of the pie, used to learn about the whole.

Parameter

A numerical characteristic that describes a population. Think P for Parameter, P for Population.

Statistic

A numerical characteristic that describes a sample. Think S for Statistic, S for Sample.

Variable

A characteristic or attribute that can be measured or observed for each individual in a population or sample.

Quantitative Variable

A variable that can be measured numerically (e.g., height, age, number of siblings).

Qualitative Variable

A variable that describes a characteristic using categories or labels (e.g., hair color, gender).

Frequency

The number of times a particular value or category appears in a dataset.

Outlier

An observation point that is distant from other observations -- an unusually high or low value compared to the rest.

Distribution

The pattern showing how frequently each value or range of values occurs in a dataset.

3. Organizing Data

Before we can analyze data, we often need to organize it in a way that makes patterns more visible. Two key tools are frequency tables and histograms.

Frequency Tables

A frequency table lists all the categories or values of a variable and the number of times each occurs. For quantitative data, values are often grouped into intervals or bins.

Example: Test Score Distribution

Score Interval	Frequency
50 -- 59	2
60 -- 69	5
70 -- 79	12
80 -- 89	8
90 -- 100	3
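
A frequency table like this one can be built by counting how many values fall in each interval. The sketch below is only an illustration: the raw scores are made up (the guide gives only the finished table), and the function name `frequency_table` is hypothetical.

```python
def frequency_table(values, intervals):
    """Count how many values fall in each inclusive (low, high) interval."""
    counts = {interval: 0 for interval in intervals}
    for value in values:
        for low, high in intervals:
            if low <= value <= high:
                counts[(low, high)] += 1
                break  # each value belongs to at most one interval
    return counts


# Hypothetical raw scores, invented for illustration only.
scores = [55, 62, 68, 71, 74, 77, 79, 83, 88, 95]
intervals = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 100)]

table = frequency_table(scores, intervals)
for (low, high), count in table.items():
    print(f"{low}-{high}\t{count}")
```

The frequencies should always add up to the number of values, which is a quick check that no score was missed or counted twice.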

Histograms & Distribution Shapes

A histogram is a powerful way to visualize the distribution of quantitative data. The horizontal axis shows data values (often in intervals) and the vertical axis shows frequency. Unlike bar charts, the bars in a histogram touch each other to indicate continuous data.
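
As a rough text-only sketch of the idea, the frequencies from the test score table above can be drawn as rows of symbols, one row per interval. A real histogram would use touching vertical bars; the helper name `text_histogram` is an assumption made for this example.

```python
# Frequencies copied from the test score table in this guide.
frequencies = {
    "50-59": 2,
    "60-69": 5,
    "70-79": 12,
    "80-89": 8,
    "90-100": 3,
}


def text_histogram(freqs, symbol="#"):
    """Return one line per interval, with bar length equal to the frequency."""
    width = max(len(label) for label in freqs)
    return [
        f"{label:<{width}} | {symbol * count} ({count})"
        for label, count in freqs.items()
    ]


for line in text_histogram(frequencies):
    print(line)
```

The longest row is the 70-79 interval, which is where the peak of the distribution sits.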

Symmetric

Left and right sides are approximate mirror images. Mean and median are roughly equal.

Skewed Left

Tail extends to the left. A few unusually low values pull the mean below the median.

Skewed Right

Tail extends to the right. A few unusually high values pull the mean above the median.

4. Measures of Center

These statistics tell us about the "center" or typical value of a dataset. The three main measures are the mean, median, and mode.

[Figure: Number line comparing the positions of mean, median, and mode in symmetric and skewed distributions]

Mean (Arithmetic Average)

The mean is the sum of all values divided by the number of values. It is the most common measure of center but is sensitive to outliers.

x̄ = Σxᵢ / n

Sum of all data values divided by the number of values in the sample.

Median (Middle Value)

The median is the middle value when data is arranged in order. If n is odd, it is the single middle value. If n is even, it is the average of the two middle values. The median is resistant to outliers.

Mode (Most Frequent)

The mode is the value that appears most frequently. A dataset can be unimodal, bimodal, multimodal, or have no mode. The mode works for both quantitative and qualitative data.

When to Use Each

Mean

Best for symmetric distributions with no extreme outliers. Uses all data points.

Median

Best for skewed distributions or data with significant outliers. Resistant to extremes.

Mode

Useful for categorical data or finding the most common value. Works for any data type.
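
To see why the median is called resistant, it can help to add one extreme value to a small dataset and compare. The salaries below are hypothetical numbers (in thousands) invented for this sketch.

```python
from statistics import mean, median

# Hypothetical salaries in thousands, invented for illustration.
salaries = [30, 32, 35, 36, 38]
with_outlier = salaries + [250]  # one unusually high value

print(mean(salaries), median(salaries))          # 34.2 35
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 70.2, median is 35.5
```

One extreme value roughly doubles the mean, while the median barely moves.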

Worked Example: Mean, Median, Mode

Dataset: 10 quiz scores (out of 20)

Sorted: 10, 12, 14, 15, 16, 17, 18, 18, 19, 20

Mean

x̄ = 159 / 10

= 15.9

Median

n = 10 (even)

(16 + 17) / 2

= 16.5

Mode

18 appears twice

Mode = 18
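
The same three results can be checked with Python's standard `statistics` module. This is just one convenient way to verify the hand calculation.

```python
from statistics import mean, median, multimode

scores = [10, 12, 14, 15, 16, 17, 18, 18, 19, 20]

print(mean(scores))       # 15.9
print(median(scores))     # 16.5
print(multimode(scores))  # [18]
```

`multimode` returns a list because a dataset can have more than one mode.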

5. Measures of Spread

These statistics tell us how spread out or dispersed the data values are. A dataset can have the same center but very different spreads.

Range

Range = Maximum - Minimum

The simplest measure of spread. Easy to calculate but highly sensitive to outliers.

Interquartile Range (IQR)

IQR = Q3 - Q1

Measures the spread of the middle 50% of the data. Resistant to outliers.

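Q1 (first quartile) is the median of the lower half of the data (25th percentile). Q3 (third quartile) is the median of the upper half (75th percentile). When n is odd, do not include the overall median in either half. This is the convention used throughout this guide; other textbooks and software may use slightly different quartile rules.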

Variance & Standard Deviation

[Figure: Data points on a number line with arrows showing individual distances from the mean, visualizing standard deviation]

Sample Variance

s² = Σ(xᵢ - x̄)² / (n - 1)

Roughly the average squared distance from the mean. Dividing by n - 1 rather than n gives an unbiased estimate of the population variance.

Sample Standard Deviation

s = √[Σ(xᵢ - x̄)² / (n - 1)]

Square root of variance. Same units as data, easier to interpret.
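
The effect of dividing by n - 1 can be explored with a small simulation. This is a sketch under stated assumptions: the population is hypothetical (random draws from a normal distribution with a fixed seed), and the sample size and number of trials are arbitrary choices.

```python
import random
from statistics import pvariance, variance

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 draws from a normal distribution.
population = [random.gauss(50, 10) for _ in range(10_000)]
true_variance = pvariance(population)

n = 5           # small sample size, where the difference is easy to see
trials = 2_000  # number of repeated samples

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = random.sample(population, n)
    divide_by_n.append(pvariance(sample))         # divides by n
    divide_by_n_minus_1.append(variance(sample))  # divides by n - 1

avg_n = sum(divide_by_n) / trials
avg_n_minus_1 = sum(divide_by_n_minus_1) / trials

print(f"population variance:       {true_variance:.1f}")
print(f"average when dividing by n:     {avg_n:.1f}")
print(f"average when dividing by n - 1: {avg_n_minus_1:.1f}")
```

Averaged over many samples, the divide-by-n version comes out too small by a factor of (n - 1) / n, while the n - 1 version lands close to the population variance.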

Worked Example: Standard Deviation

Data: 10, 12, 14, 15, 16, 17, 18, 18, 19, 20 | Mean (x̄) = 15.9 | n = 10

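Step 1: Calculate deviations from the mean (xᵢ - x̄): -5.9, -3.9, -1.9, -0.9, 0.1, 1.1, 2.1, 2.1, 3.1, 4.1

Step 2: Square each deviation: 34.81, 15.21, 3.61, 0.81, 0.01, 1.21, 4.41, 4.41, 9.61, 16.81

Step 3: Sum the squared deviations = 34.81 + 15.21 + 3.61 + 0.81 + 0.01 + 1.21 + 4.41 + 4.41 + 9.61 + 16.81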

Σ(xᵢ - x̄)² = 90.9

s² = 90.9 / 9 = 10.1

s = √10.1 ≈ 3.178

6. Five-Number Summary & Box Plots

The five-number summary provides a concise description of a quantitative distribution using five key values.

Min: smallest value

Q1: 25th percentile

Median: 50th percentile

Q3: 75th percentile

Max: largest value

Worked Example: Five-Number Summary

Sorted data: 10, 12, 14, 15, 16, 17, 18, 18, 19, 20

Minimum = 10

Lower half: 10, 12, 14, 15, 16 → Q1 = 14

Median = (16 + 17) / 2 = 16.5

Upper half: 17, 18, 18, 19, 20 → Q3 = 18

Maximum = 20

Five-Number Summary: 10, 14, 16.5, 18, 20
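
The median-of-halves rule used in this example can be sketched as a short function. The name `five_number_summary` is hypothetical, and the function assumes at least two values. Library quartile functions may follow other conventions and give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)  # always sort first
    n = len(data)
    half = n // 2
    lower = data[:half]    # values below the middle position
    # When n is odd, skip the overall median so it is in neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


print(five_number_summary([10, 12, 14, 15, 16, 17, 18, 18, 19, 20]))
# (10, 14, 16.5, 18, 20)
```

The input does not need to be sorted in advance, because the function sorts a copy before splitting it.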

Constructing a Box Plot

[Figure: Step-by-step construction of a box plot showing minimum, Q1, median, Q3, maximum, and how whiskers and outlier points are drawn]

1. Draw a number line that covers the full range of your data.
2. Draw a box from Q1 to Q3.
3. Draw a line inside the box at the median.
4. Draw whiskers from Q1 to the minimum and from Q3 to the maximum (or to the most extreme non-outlier values).
5. Plot outliers individually as dots beyond the whiskers.

Example: Box Plot Values

Sorted data: 12, 15, 18, 22, 25, 28, 30, 35, 42

Statistic	Value
Min	12
Q1	16.5
Median	25
Q3	32.5
Max	42
IQR	16
Lower Fence	-7.5
Upper Fence	56.5
Outliers	0
Mean	25.22

7. Identifying Outliers

Outliers are identified using the 1.5 × IQR rule. Any value outside the calculated boundaries is considered a suspected outlier.

Lower Bound

Q1 - (1.5 × IQR)

Any value below this is a suspected outlier.

Upper Bound

Q3 + (1.5 × IQR)

Any value above this is a suspected outlier.

Worked Example: Outlier Detection

Using our dataset: Q1 = 14, Q3 = 18, IQR = 4

Lower Bound = 14 - (1.5 × 4) = 14 - 6 = 8

Upper Bound = 18 + (1.5 × 4) = 18 + 6 = 24

Data: 10, 12, 14, 15, 16, 17, 18, 18, 19, 20

Any values below 8? No. Any values above 24? No.

No outliers in this dataset.
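
The 1.5 × IQR check is mechanical enough to sketch in a few lines. The function names `iqr_fences` and `find_outliers` are hypothetical, and the quartiles are passed in directly from the worked example rather than recomputed.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the k × IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall outside the 1.5 × IQR fences."""
    lower, upper = iqr_fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


data = [10, 12, 14, 15, 16, 17, 18, 18, 19, 20]

print(iqr_fences(14, 18))           # (8.0, 24.0)
print(find_outliers(data, 14, 18))  # []
```

A value exactly equal to a fence is not flagged, because the rule only counts values strictly below the lower bound or strictly above the upper bound.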

[Figure: Real-world dataset displayed using multiple representations: frequency table, dot plot, histogram, and box plot side by side]

8. Memory Aids

Mnemonic

"P for Population, P for Parameter; S for Sample, S for Statistic."

Helps remember that parameters describe populations and statistics describe samples.

Concept Phrase

"SKEW is where the TAIL is."

If the tail goes right, it is skewed right. If the tail goes left, it is skewed left. The mean is pulled in the direction of the tail.

Mnemonic

"Min-Q1-Med-Q3-Max"

Think of it like a journey: start at the Minimum, pass the first quarter mark (Q1), hit the Median halfway point, reach the three-quarter mark (Q3), and arrive at the Maximum.

Concept Phrase

"IQR is the Inner Quarter Range."

A memory hook only: IQR actually stands for interquartile range. It helps you remember that the IQR covers the middle 50% of the data, between the 25th and 75th percentiles.

Concept Phrase

"Standard Deviation is the Square Root of the Variance."

Helps remember the relationship and that standard deviation is in the original units of the data.

Decision Rule

"Median for Skewed, Mean for Symmetric."

When data is skewed or has outliers, use the median and IQR. When data is symmetric, use the mean and standard deviation.

9. Common Mistakes

Confusing mean and median

Using the mean when the data is heavily skewed or contains outliers, leading to a misleading representation of the center. Remember: median for skewed, mean for symmetric.

Incorrect outlier identification

Forgetting the 1.5 × IQR rule or applying it incorrectly. Always calculate Q1 - 1.5(IQR) and Q3 + 1.5(IQR) as your boundaries.

Misreading histogram shapes

Confusing a histogram with a bar chart (histograms are for quantitative data with touching bars). Also, misidentifying skewness -- the "tail" determines the direction of skew, not the peak.

Calculation errors for quartiles

When n is odd, including the median in both the lower and upper halves when calculating Q1 and Q3. The overall median should not be included in either half.

Forgetting n - 1 for sample standard deviation

Dividing by n instead of n - 1 for sample standard deviation or variance, which tends to underestimate the variability of the population. This is a crucial distinction between sample and population formulas.

Not sorting data first

Attempting to find the median or quartiles without first sorting the data in ascending order. Unless the data happens to be sorted already, this gives incorrect results.

Misinterpreting standard deviation

Thinking a large standard deviation means the data values are all large, rather than meaning they are widely spread out from the mean.

Ignoring context

Presenting numerical summaries or graphs without discussing what they mean in the context of the problem. Always relate your findings back to the real-world scenario.

Quick Revision Summary

✓ Descriptive statistics summarizes and describes data using numbers and graphs.
✓ Parameters describe populations; statistics describe samples.
✓ Variables can be quantitative (numerical) or qualitative (categorical).
✓ Histograms display distribution shape: symmetric, skewed left, or skewed right.
✓ Mean (average), median (middle, resistant to outliers), and mode (most frequent) measure center.
✓ Range, IQR (resistant to outliers), variance, and standard deviation measure spread.
✓ The five-number summary is Min, Q1, Median, Q3, Max -- visualized with a box plot.
✓ Outliers are identified using the 1.5 × IQR rule: below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
✓ For skewed data, prefer median and IQR. For symmetric data, prefer mean and standard deviation.
✓ Always sort data first before calculating median, quartiles, or the five-number summary.

Frequently Asked Questions

What is the difference between a histogram and a bar chart?

A histogram is used for quantitative (numerical) data, where the bars represent intervals of numbers and touch each other to show continuous data. A bar chart is used for qualitative (categorical) data, where each bar represents a distinct category and the bars typically do not touch.

When should I use the mean versus the median?

Use the mean for data that is relatively symmetric and does not have significant outliers, since it uses all data points. Use the median for skewed distributions or data that contains outliers, as the median is resistant to extreme values and better represents the "typical" value.

Why do we divide by n - 1 instead of n for sample standard deviation?

Dividing by n - 1 (the degrees of freedom) instead of n when calculating the sample variance gives an unbiased estimate of the population variance. If we divided by n, the sample variance would tend to underestimate the true population variance.

What does a large standard deviation tell me?

A large standard deviation means that the data points in your dataset are, on average, far away from the mean. This indicates a high degree of variability or spread in the data. Conversely, a small standard deviation means data points are clustered closely around the mean.

How do I handle outliers once I have identified them?

First, investigate whether the outlier is a data entry error, measurement error, or a legitimate but unusual value. Always report the presence of outliers and their potential impact. Depending on the context, you might remove erroneous values, analyze the data with and without the outlier, or use resistant measures like the median and IQR that are less affected by extreme values.

Practice Quiz

Test your understanding by answering each question.

1. Which measure of center is best for data with outliers?

2. What does the IQR represent?

3. A histogram shows:

4. The mode is:

5. Standard deviation measures:

6. Q3 is also called:

7. An outlier is defined as:

8. The five-number summary includes:

9. A right-skewed distribution has:

10. Variance is:

Final Study Advice

1. Always sort your data before calculating any positional statistics (median, quartiles, percentiles).
2. Choose the right measure of center: mean for symmetric data, median for skewed data or data with outliers.
3. Practice drawing box plots from five-number summaries -- they appear frequently on exams.
4. Remember to divide by n - 1 (not n) when calculating sample variance and standard deviation.
5. Always interpret your results in context -- relate numbers back to what the data represents in the real world.
