Descriptive statistics: mean, median, mode, variance & standard deviation

A recruiter once told a candidate the team's average tenure was eight years, which sounded like a place people stayed. It was true. It was also the average of three twenty-year veterans and five people who'd quit inside a year. The number wasn't lying. It just wasn't answering the question the candidate actually had.

The quick version

Mean, median and mode are three different answers to "what's typical?", and on a skewed dataset (salaries, deal sizes, response times) they pull apart. The gap between them is information, not noise.
Variance and standard deviation measure spread: how far values sit from the average. Two teams can share a mean and live in completely different worlds.
A single summary number is a compression, and compression loses things. The safe habit: never trust one statistic until you've seen the shape of the data behind it.
The move that beats memorising formulas: plot it first. A two-minute chart catches what the summary buries.

The idea in depth: three answers to "what's typical?"

"Average" sounds like one idea. It's at least three. The mean adds everything up and divides by the count. The median is the middle value when you line the data up in order. The mode is the value that appears most often. On a tidy, symmetric dataset they roughly agree, which is why people use the words loosely. On the data leaders actually deal with, pay, revenue per customer, support-ticket times, deal sizes, they don't agree, and the disagreement is the point.

The reason is skew. A handful of huge values (one whale customer, one founder's salary, one three-day outage) drags the mean toward the extreme while the median barely moves. The median asks "who's in the middle?"; the mean asks "if we pooled it all and shared equally, what would each get?" Those are different questions, and a long tail makes the answers diverge.
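A minimal Python sketch of that divergence. The revenue figures and variable names are made up for illustration; they are not drawn from any real customer list.

```python
from statistics import mean, median, mode

# Hypothetical monthly revenue per customer, in dollars (illustrative only).
typical_customers = [100, 100, 120, 130, 150, 160, 180]
with_whale = typical_customers + [20_000]  # one "whale" customer joins the list

for label, data in [("without whale", typical_customers), ("with whale", with_whale)]:
    print(f"{label}: mean={mean(data):.0f}, median={median(data):.0f}, mode={mode(data)}")
```

Adding the single whale lifts the mean from roughly 134 to more than 2,600, while the median only moves from 130 to 140 and the mode stays at 100.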

This isn't a modern footnote. The Belgian statistician Adolphe Quetelet built much of nineteenth-century social science on the idea of l'homme moyen, the "average man", treating the mean of a human trait as a real, representative value rather than just a calculation. As the statistics educator Arthur Bakker documents (Journal of Statistics Education, 2003), it took centuries for the average to graduate from a tool for estimating one true quantity (the diameter of the moon, say) into a stand-in for a whole population. That promotion was powerful and quietly dangerous: the moment a single number represents a group, everyone the number doesn't fit becomes invisible.

So the move is: when someone quotes "the average," ask which one, and ask for the median alongside it. If the mean and median are close, the data is fairly symmetric and either is fine. If they're far apart, you're looking at a skewed distribution, and the median is usually the more honest summary of the typical case. (The recruiter's "eight-year average" had a median closer to one.) For a fuller treatment of why the middle and the tails behave so differently, see the Toolkit on distributions, percentiles & quartiles.

flowchart TD
    A(["You're handed an 'average'"]) --> B{"Mean and median<br/>close together?"}
    B -->|"Yes, fairly symmetric"| C(["Either is fine to quote"])
    B -->|"No, far apart"| D(["Data is skewed:<br/>prefer the median"])
    D --> E(["Ask: what's in the tail<br/>pulling the mean?"])

A 30-second triage for any "average" that lands on your desk. Source: Leaders Loop.
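The triage can be sketched as a small Python function. The text gives no numeric definition of "close together", so the `tolerance` parameter, the function name and the sample figures below are all assumptions for illustration.

```python
from statistics import mean, median

def triage_average(values, tolerance=0.10):
    """Label data 'symmetric' or 'skewed' from the gap between mean and median.

    `tolerance` is a hypothetical cut-off (gap as a fraction of the median);
    it is an assumption of this sketch, so tune it to your own data.
    """
    m, med = mean(values), median(values)
    gap = abs(m - med)
    scale = abs(med) if med != 0 else 1.0  # avoid dividing by zero
    return "symmetric" if gap / scale <= tolerance else "skewed"

days_to_ship = [9, 10, 10, 11]   # hypothetical, tightly clustered
deal_sizes = [5, 6, 7, 8, 90]    # hypothetical, one large deal in the tail

print(triage_average(days_to_ship))
print(triage_average(deal_sizes))
```

A "skewed" result is the cue to quote the median and go looking for whatever sits in the tail.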

The idea in depth: the average tells you the centre, the spread tells you the risk

Centre is only half the story. Imagine two delivery teams that both average ten days to ship a feature. In one, every feature lands between nine and eleven days. In the other, half land in three days and half take seventeen. Same mean. Wildly different team to manage, plan around, or promise a customer. The thing that separates them is spread, and that's what variance and standard deviation measure.

Variance is the average of the squared distances from the mean. The squaring is deliberate: it stops positive and negative gaps cancelling out, and it punishes big deviations harder than small ones. The catch is that squaring also mangles the units (square days, square dollars), which nobody can intuit. So we take the square root and get the standard deviation (SD), which lives back in the original units. An SD of "two days" means a typical feature lands about two days either side of the average. An SD of "seven days" on the same mean means you can't promise anyone anything.
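To make the arithmetic concrete, here is a short Python sketch that computes variance and SD by hand for the two delivery teams. The individual ship times are assumed values chosen to match the description (both teams average ten days), and the function name is illustrative.

```python
from math import sqrt

def variance_and_sd(values):
    """Population variance (mean squared distance from the mean) and its square root."""
    centre = sum(values) / len(values)
    variance = sum((v - centre) ** 2 for v in values) / len(values)
    return variance, sqrt(variance)

# Hypothetical days-to-ship; both teams average exactly ten days.
steady_team = [9, 10, 11, 9, 10, 11]
swingy_team = [3, 17, 3, 17, 3, 17]

for name, days in [("steady", steady_team), ("swingy", swingy_team)]:
    var, sd = variance_and_sd(days)
    print(f"{name}: variance={var:.2f} square days, SD={sd:.2f} days")
```

The steady team's SD comes out under one day and the swingy team's at exactly seven. This sketch divides by the number of values (the population form); when the data is a sample from something larger, the usual convention is to divide by one less than the count.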

The mean tells you where the centre is. The standard deviation tells you whether the centre is a promise or a coin toss.

So the move is: stop quoting averages naked. "Average handling time is six minutes" is half a fact; "six minutes, give or take four" is a decision you can act on. When you compare two options with the same average, the one with the smaller standard deviation is the more predictable bet, which matters enormously for anything you're putting in front of a customer or a board.

The idea in depth: the honest limitation, a summary is a compression, and compression loses things

Here's the part most explainers skip. Every descriptive statistic is a compression of many numbers into one, and compression always discards something. The discipline is knowing what.

The sharpest demonstration is more than fifty years old. In 1973 the statistician Francis Anscombe published four small datasets in The American Statistician (Anscombe, 1973, Vol. 27, No. 1, pp. 17–21) that share almost every summary statistic to two or three decimals: the same mean for x and y, the same variance, the same correlation, the same regression line. By the numbers they're identical. Plot them and they're four completely different pictures: one tidy linear relationship, one curve, one straight line wrecked by a single outlier, one near-vertical stack saved by one stray point. Anscombe's argument was that statistics is a craft requiring judgement and a look at the data, not a vending machine you feed numbers into.
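The claim is easy to check. The Python sketch below uses the quartet's values as they are commonly reproduced from Anscombe's paper (worth verifying against the original before relying on them); the helper name `pearson` and the dictionary layout are illustrative choices.

```python
from math import sqrt
from statistics import mean, variance

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    print(f"{name:>3}: mean_x={mean(xs):.2f} mean_y={mean(ys):.2f} "
          f"var_x={variance(xs):.2f} var_y={variance(ys):.2f} r={pearson(xs, ys):.2f}")
```

Every row prints the same figures to two decimals except the y variance, which differs only in the third significant digit. Nothing in the output hints at the curve, the outlier or the vertical stack.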

flowchart LR
    A(["11 data points"]) --> B(["Mean of x"])
    A --> C(["Mean of y"])
    A --> D(["Variance"])
    A --> E(["Correlation"])
    B --> F(["Identical summary<br/>for all 4 datasets"])
    C --> F
    D --> F
    E --> F
    F --> G(["...yet 4 totally<br/>different shapes"])
    G --> H(["Lesson: plot it<br/>before you trust it"])

Anscombe's quartet: four datasets, one identical set of summaries, four different realities. Source: Leaders Loop, after Anscombe (1973).

The point still has teeth. In 2017, Justin Matejka and George Fitzmaurice of Autodesk Research generalised it with the "Datasaurus Dozen" ("Same Stats, Different Graphs," CHI 2017): a set of datasets that match on mean, standard deviation and correlation to two decimal places, one of which, plotted, is unmistakably a dinosaur. Identical summaries, a cartoon reptile hiding in the numbers.

So the move is: treat summary statistics as the headline, never the article. Before you make a decision on a mean or an SD, look at the distribution: a histogram, a box plot, even a quick scatter. It takes two minutes and it catches the outlier, the second hump, the dinosaur. The honest limitation of descriptive statistics is that they describe; they don't explain, and they can't show you a shape you didn't ask to see.
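Even without a charting tool, a rough look at the shape is a few lines of Python. This is a sketch only: the function name, the bin width and the handling times are assumptions, and it handles integer values only.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Print a bar of '#' marks per bin, including empty bins, and return the counts.

    Assumes integer values; a sketch, not a replacement for a plotting library.
    """
    bins = Counter((v // bin_width) * bin_width for v in values)
    for lower in range(min(bins), max(bins) + bin_width, bin_width):
        print(f"{lower:>3}-{lower + bin_width - 1:<3}| {'#' * bins[lower]}")
    return bins

# Hypothetical handling times in minutes, with two separate humps.
handling_minutes = [4, 5, 5, 6, 6, 6, 7, 22, 23, 24, 24, 25]
bins = text_histogram(handling_minutes, bin_width=5)

average = sum(handling_minutes) / len(handling_minutes)
print(f"mean = {average:.1f} minutes")
```

With these numbers the mean lands at about 13 minutes, inside a bin that contains no tickets at all. The bars show the two humps that the single figure blurs together.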

A worked example: the "fast support team" that wasn't

The figures below are illustrative, chosen to show the mechanics. Say you run customer support and your dashboard reports a mean first-response time of 42 minutes across the week, comfortably under your one-hour target. Good news, by the headline.

Then you pull the underlying tickets. Most are answered in 10–15 minutes. But a cluster of overnight tickets sat for 6–8 hours before the morning shift arrived. Those few extreme values inflate the mean: the median response time is actually 12 minutes, and the mode, the most common single outcome, is around 10. The mean of 42 describes almost nobody's experience: it overstates the typical daytime wait more than threefold, and it hides the overnight customers who waited most of a working day.

Now look at spread. The standard deviation is large, well over an hour, which is the real signal. An SD bigger than the mean itself is the dashboard telling you "the average is hiding a tail." You don't need a faster team. You need overnight coverage, or an honest auto-reply that resets expectations until someone's awake. The mean said there was nothing to fix, and if you had gone looking anyway it would have pointed you at speed; the median, mode and SD together told you it was a coverage problem. Same data, completely different management response, and you only see it because you refused to trust the single number on the tile.
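The same mechanics in Python, using an invented week of 100 tickets. The individual values are assumptions constructed only to reproduce the illustrative figures above (mean 42, median 12, mode 10); they are not real support data.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical first-response times in minutes for one week (100 tickets).
daytime = [10] * 30 + [11] * 12 + [12] * 15 + [13] * 12 + [14] * 12 + [15] * 12
overnight = [380, 420, 440, 450, 454, 460, 480]  # roughly 6 to 8 hours each
tickets = daytime + overnight

print("tickets:", len(tickets))
print("mean   :", mean(tickets))
print("median :", median(tickets))
print("mode   :", mode(tickets))
print("SD     :", round(pstdev(tickets), 1))
print("share of tickets over one hour:", len(overnight) / len(tickets))
```

Seven overnight tickets out of a hundred are enough to pull the mean to 42 minutes and push the standard deviation to roughly 110 minutes, while the median and mode stay where the daytime work actually sits.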

Frequently asked questions

Mean or median, which should I actually quote?

If the data is roughly symmetric (mean and median close), either is fine and the mean is the convention. If it's skewed (pay, deal sizes, response times, anything with a long tail), the median is the more honest "typical" figure, because a few extremes don't drag it around. When in doubt, quote both; the gap between them tells your audience how skewed things are.

What's the difference between variance and standard deviation?

They measure the same thing, spread, but standard deviation is the square root of variance, which puts it back in the original units (minutes, dollars, days). That makes the SD the one you can actually interpret and the one to quote. Variance is mostly a step on the way to it, useful when you're combining or comparing spreads mathematically.

Do I really need the mode? It feels like the forgotten one.

For continuous numbers (revenue, time) the mode is often unhelpful, because exact values rarely repeat. It earns its keep with categories and counts: the most common support reason, the most-picked plan, the most frequent error code. For "what happens most often?" the mode is the right tool, and for unordered categories it's the only one of the three that applies at all.
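A brief Python sketch of the mode on categorical data. The ticket reasons are invented for illustration.

```python
from collections import Counter
from statistics import mode

# Hypothetical support-ticket reasons: categories, so no mean can be computed.
reasons = ["billing", "login", "billing", "shipping", "login", "billing"]

top_reason = mode(reasons)                 # the single most common category
top_two = Counter(reasons).most_common(2)  # the two most common, with counts

print(top_reason)
print(top_two)
```

`Counter.most_common` is often the more useful of the two in practice, because it shows how far ahead the leader is rather than just naming it.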

How big a standard deviation is "too big"?

There's no universal threshold; it depends on the mean and the decision. A useful instinct is to compare the SD to the mean: an SD that's a large fraction of the mean signals a process you can't reliably predict or promise. The real answer is to look at the distribution rather than chase a magic ratio.
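The SD-to-mean ratio is usually called the coefficient of variation. A minimal Python sketch, with assumed days-to-ship figures and an illustrative function name:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """SD divided by the mean: spread expressed as a fraction of the centre.

    Only meaningful when the mean is positive and the data cannot go below zero.
    """
    return pstdev(values) / mean(values)

# Hypothetical days-to-ship for two teams with the same ten-day mean.
steady_team = [9, 10, 11, 9, 10, 11]
swingy_team = [3, 17, 3, 17, 3, 17]

print(f"steady: {coefficient_of_variation(steady_team):.2f}")
print(f"swingy: {coefficient_of_variation(swingy_team):.2f}")
```

The steady team's spread is under a tenth of its mean; the swingy team's is seven tenths. The ratio ranks the two, but where to draw the line between them is still a judgement about the decision at hand.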

If summaries can mislead, why use them at all?

Because you can't eyeball ten thousand rows, and a good summary communicates instantly. The fix isn't to abandon mean, median and SD; it's to pair them with a quick look at the shape. Summaries are how you talk about data; the plot is how you check you're not being fooled. You need both.

Related in the Toolkit

Data types (discrete/continuous, categorical/ordinal): which statistic is even valid depends on what kind of data you have (you can't take the mean of job titles).
Distributions, percentiles & quartiles: the natural next step, covering the shape behind the summary and how to read tails and the middle.
Correlation vs causation: descriptive stats describe one variable; correlation describes two, with its own famous traps.
Regression (linear, non-linear, logistic): where Anscombe's quartet bites hardest, because a regression line is its own kind of summary.
Statistical significance (p-values, t-scores, chi-square): moving from "what does this sample look like" to "can I trust the difference I'm seeing?"
First principles vs heuristics vs analogical reasoning: "always plot it first" is exactly the kind of heuristic that beats a formula.
Reversible vs irreversible decisions: how much you should care about spread depends on whether you can undo the call.
Jobs-to-be-Done & needs research: a reminder that the right average answers a real question someone is asking.

Where to go next

Charles Wheelan, Naked Statistics (Norton, 2013): the friendliest serious introduction; its chapter on descriptive statistics nails why the mean-versus-median choice changes the story.
F. J. Anscombe, "Graphs in Statistical Analysis," The American Statistician (1973): the original four-dataset paper; short, and the most quoted argument for plotting before you summarise.
Matejka & Fitzmaurice, "Same Stats, Different Graphs," CHI 2017: the Datasaurus Dozen; identical summaries, a dinosaur hiding in the data. Open-access paper, code and visuals.
Hans Rosling, "200 Countries, 200 Years, 4 Minutes" (BBC, 2010): four minutes on why showing the data beats quoting a number; the gold standard for making distributions speak.