Survey & sampling design: a leader's field guide

In 1936 the most respected poll in America asked ten million people who would win the presidency, heard back from nearly two and a half million, and got the answer spectacularly wrong. A young upstart named George Gallup asked about fifty thousand and got it right. The lesson has not aged: how you choose who to ask matters more than how many you ask.

The quick version

A survey makes two leaps of faith: that the people who answered represent everyone you care about, and that their answers measure what you think they do. Most survey disasters live in one of those two gaps.
A small, randomly drawn sample beats a huge, self-selected one. Size buys precision; random selection buys the right to generalise at all.
Question wording is not a tidying-up step. Leading words, agree/disagree scales and vague terms can move your result by double digits, before a single respondent lies to you.
Treat your survey like a small experiment: define the population, draw the sample fairly, pilot the questions, and report how confident you actually are.

The idea in depth: two gaps, not one

The most useful map of survey error comes from Robert Groves and colleagues, whose textbook Survey Methodology (2nd ed., 2009) organises the whole field around what they call total survey error. Their framework splits error into two parallel branches. One is representation: does your sample stand in for the population, and here the villains are coverage (who could you reach at all), sampling (who you drew), and nonresponse (who actually answered). The other is measurement: does each answer mean what you think, and here the villains are the construct you're after, the questions you wrote, and how you processed the responses.

That two-branch picture is the single most important thing a leader can carry from the academic literature. Most of us obsess over sample size and ignore everything else. But size only narrows random noise; it does nothing for a sample that is systematically tilted. Here's the drill before you launch: write one sentence for each branch, "this sample represents ___" and "these answers measure ___", and attack whichever sentence you can't finish honestly.

flowchart TD
    A(["Total survey error"]) --> B("Representation: are these the right people?")
    A --> C("Measurement: do the answers mean what I think?")
    B --> B1("Coverage, who could I reach?")
    B --> B2("Sampling, who did I draw?")
    B --> B3("Nonresponse, who replied?")
    C --> C1("Construct, am I asking about the right thing?")
    C --> C2("Wording, does the question bias the answer?")
    C --> C3("Processing, did I clean and code it well?")

The total-survey-error framework (after Groves et al., 2009): every survey is two leaps of faith at once. Leaders Loop

Representation: why random beats big

The 1936 Literary Digest debacle is the textbook case, and it earns the cliché. The magazine built its list from telephone directories, club rosters and its own subscribers, comfortable, car-owning Americans in the depths of the Depression. It predicted Alf Landon would beat Franklin Roosevelt 57 to 43. Roosevelt won in a landslide, roughly 61 to 37. Peverill Squire's later post-mortem in Public Opinion Quarterly ("Why the 1936 Literary Digest Poll Failed," 1988) showed the deeper rot was nonresponse: even among those mailed, the people who bothered to return the ballot differed sharply from those who didn't. Gallup, meanwhile, built a sample to mirror the electorate by region, age, sex and income, and beat the giant with a fraction of the numbers.

The principle underneath is probability sampling: if every member of your population has a known, non-zero chance of being selected, then, and only then, can you calculate how much uncertainty your estimate carries and attach a margin of error to it. As Pew Research Center's "Methods 101" explainer puts it, random selection is what lets a sample of a thousand speak for a nation. So stop chasing volume. A list you can draw from fairly, every customer, employee or account, with a known chance of selection, is worth more than a viral link the loudest fill in. Convenience samples (whoever clicks, whoever's in the room) describe those people and cannot honestly be projected onto anyone else.

Measurement: the question is the instrument

Even a perfect sample fails if the question is loaded. Howard Schuman and Stanley Presser's classic Questions and Answers in Attitude Surveys (1981) ran split-sample experiments and found that small changes in wording, order and format routinely shifted results by ten points or more. Two findings travel especially well. First, acquiescence bias: people tend to agree. Ask "Do you agree that the new process is working?" and an agree/disagree scale quietly inflates the yes column, a pattern still being documented today, including work in Political Analysis showing acquiescence can inflate estimates of fringe beliefs. Second, question framing: putting one side's reasoning inside the question, "Given rising costs, do you support the change?", tilts the answer before respondents have thought.

Size buys precision. Random selection buys the right to generalise at all.

So the move is: replace agree/disagree items with construct-specific scales ("How well is the new process working? Not at all … Extremely well"), present both sides when an issue is contested ("Some people think A, others think B, which is closer to your view?"), and pilot every survey on five colleagues out loud before it ships. Don Dillman's Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method (4th ed., 2014) is the practitioner's bible here: it treats the whole respondent experience, the invitation, the order, the visual layout, the follow-ups, as part of the instrument, not decoration. This connects directly to validity, reliability and bias: a question that measures the wrong thing consistently is reliably wrong.

The honest limitation: response rates are collapsing

Here is what the textbooks can't fully fix. Telephone survey response rates tracked by Pew fell from 36% in 1997 to about 6% by 2018. When 94% of people decline, the comforting maths of random sampling rests on a fragile assumption, that the few who answer resemble the many who don't. The American Association for Public Opinion Research's evaluation of the 2016 US election polls concluded that nonresponse and the resulting need to weight the data (correcting an overrepresentation of college graduates) drove much of the polling miss. So the move is: treat a low response rate as a red flag, not a footnote. Compare the people who answered against what you already know about the whole population, tenure, region, role, age, and if they diverge, say so out loud rather than presenting the number as gospel.

A worked example: the engagement survey that lied

A regional operations director runs an annual engagement survey across 600 frontline staff. She emails a link, leaves it open for two weeks, and 180 people respond, a respectable-looking 30%. The headline: "82% agree they feel supported by their manager." She is about to take that to the board as a win.

Two problems, one in each branch. On representation: the link was self-selected, and a quick check (illustrative figures) shows night-shift and warehouse staff, the two groups with the thinnest margins and the most turnover, make up 40% of the workforce but only 12% of respondents. The 82% is real, but it mostly describes day-shift office staff. On measurement: "Do you agree you feel supported by your manager?" is a textbook acquiescence trap, and it bundles two ideas (supported, and by your manager) into one.

The fix costs nothing. Instead of an open link, she draws a stratified sample, everyone on night shift and in the warehouse, plus a random slice of day staff, and sends supervisors to collect responses on a tablet during handover, lifting those groups' response rate. She rewrites the item to "How supported do you feel by your manager day-to-day? (Not at all … Extremely)" and adds an open box. The new number lands at 64%, lower, less flattering, and finally true. It also points somewhere: the gap is concentrated on night shift, which is exactly where she can now act. A worse survey would have sent her to celebrate; a better one sent her to the warehouse.

flowchart LR
    A(["Define the population"]) --> B(["Choose a sampling frame you can draw from fairly"])
    B --> C(["Draw a probability or stratified sample"])
    C --> D(["Pilot the questions out loud"])
    D --> E(["Field it; push for response from under-represented groups"])
    E --> F(["Check who answered vs. the whole population"])
    F --> G(["Report the estimate with its uncertainty"])

A do-it-this-week survey workflow. The two checks that matter most, a fair frame, and a who-answered audit, bookend the rest. Leaders Loop

Frequently asked questions

How big does my sample actually need to be?

Smaller than you think, if it's random. For a yes/no question across a large population, roughly 400 random responses give a margin of error near ±5 points, and about 1,000 gets you to ±3, which is why national polls hover around that size. Precision improves with the square root of the sample, so doubling size doesn't halve error. Spend your effort on who answers, not on chasing thousands.

Isn't a bigger sample always more accurate?

No, that's the exact trap the Literary Digest fell into. A large sample drawn from a biased frame just gives you a very precise wrong answer. Size shrinks random noise; it does nothing about systematic tilt. A representative 500 beats a lopsided 50,000.

What's the difference between random and stratified sampling?

Simple random sampling gives everyone an equal chance. Stratified sampling first divides the population into groups (shifts, regions, segments) and samples within each, useful when some groups are small but important, because it guarantees they show up. The worked example above is stratified: it deliberately over-samples night shift so that group isn't drowned out.

Can I trust an open-link survey on our intranet or social channels?

For a temperature check or to gather ideas, fine. For a number you'll put in front of a board or use to allocate budget, no, it's a convenience sample that describes whoever chose to click, and you can't honestly generalise from it. If you must use one, report it as "among respondents" and never as "our staff" or "our customers."

My response rate is low. Is the data worthless?

Not necessarily, but it carries more risk. A low rate matters most when the people who answered differ from those who didn't on the thing you're measuring. The practical safeguard is a who-answered audit: compare respondents to the full population on a few known attributes and weight or caveat accordingly. Honesty about uncertainty beats false confidence.

Related in the Toolkit

Qualitative vs quantitative vs mixed methods, when a survey is the wrong tool, and what to reach for instead.
Interview & ethnographic techniques, for the "why" behind the numbers a survey gives you.
Experiment design (RCTs, A/B testing, quasi-experiments), when you need cause, not just description.
Validity, reliability & bias in research, the deeper theory of why measurement goes wrong.
Jobs-to-be-Done & needs research, framing what to even ask about before you write questions.
First principles vs heuristics vs analogical reasoning, how to reason past a misleading headline number.
Reversible vs irreversible decisions, how much survey rigour a decision actually warrants.
Descriptive statistics (mean, median, mode, variance, SD), making sense of the responses once they're in.

Where to go next

Dillman, Smyth & Christian, Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method (2014), the most practical book on actually fielding a survey people will answer well.
Pew Research Center, "Methods 101: Random Sampling" (video), a clear, two-and-a-half-minute explainer of why random selection is the whole game.
Pew Research Center, "Response rates in telephone surveys have resumed their decline" (2019), the sobering data on collapsing response rates and what it means.
AAPOR, "An Evaluation of 2016 Election Polls in the United States" (2017), a rigorous, readable autopsy of how good sampling still goes wrong.