The Magic Trick of Statistical Significance
Four reasons "statistically significant" doesn't mean what you think.
I would trust 30 people chosen at random over 500 who signed up for a survey panel. That sounds wrong. It's not. The 30 actually represent the population. The 500 represent the kind of person who joins survey panels for gift cards. Those are very different groups. This is one of four reasons "statistically significant" doesn't mean what most people think it means. The math is always fine. Everything before the math is where the problems live.
A few years back I was in a conference room on the 40-something floor of a building in Midtown Manhattan. Glass everywhere. The coffee was better than anything I have at home. The people around the table managed enough money to qualify as a small economy, and every one of them had the kind of math background where they'd finish your sentence if you paused too long near a number. I was presenting research. The differences between groups were obvious. Big gaps. Clear story.
First question. "Are the differences statistically significant?"
I said yes. The room relaxed. Laptops half-closed. On to the next thing.
Walking back through the lobby I remember thinking I'd just watched brilliant quantitative people ask exactly the wrong question. They wanted to know if the math checked out. The math was fine. It's always fine. It was everything that happened before the math that deserved scrutiny, and nobody asked about any of it.
Most people hear about statistical significance in a college course they were required to take. If the p-value is below .05, the result is real. Above .05, it's not. Feels like a cheat code for truth. What nobody mentions is that .05 isn't a law of physics. Ronald Fisher picked that number in the 1920s because it seemed reasonable. A guy made a judgment call and we built a cathedral of credibility on top of it.
Significance testing tells you whether a result might just be a coincidence. For that purpose it works. That's also where its job ends.
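If you want to see the ritual at its smallest, here's a rough Python sketch of what the test actually computes. The group sizes and means are made up; the point is that the p-value answers one narrow question: how surprising would a gap this big be if the two groups really came from the same place?

```python
# A minimal sketch of what "p < .05" actually checks, with made-up numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups drawn from the SAME distribution: any gap between them is coincidence.
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=50, scale=10, size=100)
coincidence = stats.ttest_ind(group_a, group_b)

# Two groups with a genuine difference in means.
group_c = rng.normal(loc=50, scale=10, size=100)
group_d = rng.normal(loc=55, scale=10, size=100)
genuine = stats.ttest_ind(group_c, group_d)

print(f"identical populations: p = {coincidence.pvalue:.3f}")  # usually well above .05
print(f"genuinely different:   p = {genuine.pvalue:.4f}")      # usually far below .05
```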
Here are four ways the whole thing breaks down, none of which show up in the final report.
1. Who's actually in the data.
The math assumes you pulled a random sample from the population you care about. Random means every person had an equal chance of being selected. Not the willing ones. Not the ones whose email addresses you had. Everyone. In practice this almost never happens.
This is the real reason election polls miss. It's not bad math. It's that the kind of person who picks up an unknown number, stays on the line, and answers twenty minutes of questions about politics is a fundamentally different animal than the kind of person who doesn't. How you got the sample matters more than how big it is, and most people have that backwards. You will never see that caveat in a headline that says "new study of 5,000 people finds..."
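If the 30-versus-500 claim still sounds wrong, here's a toy simulation with made-up numbers. The response model is purely hypothetical: assume people who hold the opinion are three times as likely to join a survey panel as people who don't. Noise averages out. Bias doesn't.

```python
# A toy illustration (hypothetical numbers) of a random 30 versus a self-selected 500.
import numpy as np

rng = np.random.default_rng(1)

TRUE_RATE = 0.40   # hypothetical: 40% of the population holds opinion X
REPS = 10_000      # simulate many polls of each kind

# Hypothetical response model: people who hold opinion X are three times as likely
# to join a survey panel, so the panel's composition is skewed:
#   P(X | joined panel) = 0.40 * 3 / (0.40 * 3 + 0.60 * 1) ~= 0.667
PANEL_RATE = TRUE_RATE * 3 / (TRUE_RATE * 3 + (1 - TRUE_RATE))

random_30 = rng.binomial(30, TRUE_RATE, size=REPS) / 30      # unbiased but noisy
panel_500 = rng.binomial(500, PANEL_RATE, size=REPS) / 500   # precise but skewed

print(f"true rate                  : {TRUE_RATE:.3f}")
print(f"random n=30, mean estimate : {random_30.mean():.3f}")   # centered on the truth
print(f"panel n=500, mean estimate : {panel_500.mean():.3f}")   # consistently near .67
print(f"random n=30, mean |error|  : {np.abs(random_30 - TRUE_RATE).mean():.3f}")
print(f"panel n=500, mean |error|  : {np.abs(panel_500 - TRUE_RATE).mean():.3f}")
```

The bigger sample gives you a tighter estimate of the wrong thing.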
2. The questions create the findings.
Ask Americans "Do you support increased funding for assistance to the poor?" and you get broad support. Ask the same people "Do you support expanding welfare programs?" and the numbers fall off a cliff. Same concept. One word changed. Both pass a significance test. They can't both be right.
This goes deeper than wording. A company surveys employee engagement. Scores come back high. Champagne. But the survey wasn't anonymous and everyone knew their manager could see the results. You didn't measure engagement. You measured self-preservation. The number passes every statistical test you throw at it. What it actually represents has nothing to do with what it claims.
3. Test enough things and something will hit.
If you run twenty tests at the .05 threshold, the odds are roughly two in three that at least one looks like a real finding even when nothing is going on. That's just math.
What happens in practice is a researcher has a dataset and a hypothesis that didn't work out. The deadline is real. So they start slicing. Men versus women. Young versus old. Region by region. Eventually something crosses the magic line and the paper gets written as if that was the plan from the start.
This is why a study tells you coffee prevents cancer in March and causes it in September. Different slicing, different "significant" result. Both published. Both potentially meaningless.
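A quick simulation makes the twenty-tests arithmetic concrete. Everything below is noise by construction: two subgroups drawn from the same distribution, tested twenty times, with the whole fishing trip repeated a couple thousand times to see how often something "significant" turns up anyway.

```python
# A rough sketch of "slice until something hits": twenty tests on pure noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

N_TESTS = 20     # twenty subgroup comparisons, all on data with no real effect
REPS = 2_000     # repeat the whole fishing expedition many times
trips_with_a_catch = 0

for _ in range(REPS):
    found_something = False
    for _ in range(N_TESTS):
        a = rng.normal(size=50)   # both subgroups come from the same distribution,
        b = rng.normal(size=50)   # so any "finding" here is a false positive
        if stats.ttest_ind(a, b).pvalue < 0.05:
            found_something = True
    if found_something:
        trips_with_a_catch += 1

# With twenty independent tests, 1 - 0.95**20 is about 64%: most fishing trips
# land at least one "significant" result even though nothing is going on.
print(f"trips with at least one hit: {trips_with_a_catch / REPS:.0%}")
```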
4. Significant doesn't mean it matters.
These are completely different ideas wearing similar outfits.
With a big enough sample, trivially small effects become significant. You've seen the drug commercials. "Clinically proven" in big letters. Then the fine print shows the drug helped 3 more people out of every 1,000 compared to a placebo. Statistically significant? Sure. Worth the side effects, the cost, the change in your daily routine? That's a completely different question, and the significance test has nothing to say about it.
Significance tells you an effect probably isn't zero. It doesn't tell you the effect is worth getting out of bed for.
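To put numbers on the drug-commercial example: hold a made-up 3-in-1,000 improvement fixed and change nothing but the sample size. The placebo and drug rates below are invented for illustration, and a chi-square test stands in for whatever analysis a real trial would use.

```python
# A sketch of "significant" versus "meaningful": the same tiny effect at two sample sizes.
from scipy import stats

def p_value(drug_rate, placebo_rate, n_per_arm):
    """Chi-square test on a 2x2 table of improved / not-improved counts."""
    drug    = [round(drug_rate * n_per_arm),    round((1 - drug_rate) * n_per_arm)]
    placebo = [round(placebo_rate * n_per_arm), round((1 - placebo_rate) * n_per_arm)]
    _, p, _, _ = stats.chi2_contingency([drug, placebo])
    return p

PLACEBO = 0.010   # hypothetical: 10 in 1,000 improve on the placebo
DRUG    = 0.013   # hypothetical: 13 in 1,000 improve on the drug (3 more per 1,000)

print(f"n = 1,000 per arm  : p = {p_value(DRUG, PLACEBO, 1_000):.3g}")    # not significant
print(f"n = 100,000 per arm: p = {p_value(DRUG, PLACEBO, 100_000):.3g}")  # wildly significant
# The effect is 3 in 1,000 in both rows. Only the p-value moved.
```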
So what should you ask instead?
When that room in Midtown asked "is it significant?" what they really wanted to know was "can we trust this enough to act on it?" That depends on who was in the sample, how the questions were written, whether the analyst ran one test or forty, and how big the effect actually is. None of that lives inside a p-value.
The better question is simpler. Is this difference meaningful? Big enough to matter. Observed in a group that looks like the people you care about. Measured in a way you'd defend out loud.
If someone shows you a significant finding and can't tell you who was in the sample, how the questions were worded, and how many tests they ran before finding it, they're showing you the last card in a magic trick and asking you not to wonder about the other fifty-one.
Respect the tool. Interrogate the process. Ask if the difference is meaningful.
The Great Zandini Sees:
Statistical significance tells you the math worked. It has nothing to say about whether anything before the math was solid. That's where the problems live.