Why Error Bars and Probabilities Are Science’s Strength | Decoding Uncertainty

Here’s something I think we don’t say out loud enough—even to each other: science doesn’t prove things. It quantifies how uncertain we are. 

That’s the game. That’s the strength.

I know you already know that. But let’s be honest—outside of theory, even we professionals fall into “certainty traps.” We want clean conclusions. Journal editors reward them. 

Our audiences expect them. 

But at its core, science isn’t about being sure; it’s about being honest about how unsure we are, and doing that in a mathematically principled way.

This isn’t some philosophical hedge—this is a feature, not a bug. Confidence intervals, margins of error, posterior distributions… these aren’t signs that our models are weak. 

They’re signs that our thinking is strong enough to accommodate uncertainty.

And ironically, the more confident someone seems without these tools? The less scientific they usually are.

Let’s start with the confidence interval.

Confidence Intervals Are Not What Most People Think

Let’s talk about confidence intervals. Because I genuinely believe they’re among the most misunderstood constructs in science, full stop.

And I’m not just talking about laypeople here—I’m talking about researchers, analysts, even tenured faculty who use confidence intervals all the time, but read them as if they’re Bayesian credible intervals. You’ve seen it. You might’ve even done it. I have, especially early on.

The most common mistake? Saying something like, “There’s a 95% chance the true mean is in this interval.” That’s false. In the frequentist framework, a 95% confidence interval means that if we repeated this procedure a bazillion times, 95% of the resulting intervals would contain the true parameter. It’s a property of the method, not of the specific interval you just got.
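
If that “property of the method” framing feels abstract, here’s a minimal simulation sketch in Python. All the numbers (true mean, noise, sample size) are invented; the point is just that the 95% describes the long-run hit rate of the procedure, not the particular interval in front of you.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sigma, n, reps = 10.0, 2.0, 30, 10_000   # invented setup

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, size=n)
    # Standard t-interval for the mean of this one sample
    lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))
    covered += (lo <= true_mean <= hi)

print(f"Share of intervals containing the true mean: {covered / reps:.3f}")  # ≈ 0.95
```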

But here’s the part I find really interesting: we often act as if it is a probability statement, and sometimes that works surprisingly well. So why does that ambiguity persist?


The Bayesian Shadow of Confidence Intervals

One reason is that, in many practical situations—especially with lots of data and reasonably uninformative priors—a Bayesian credible interval and a frequentist confidence interval will look pretty similar. So even if the interpretation is philosophically wrong, it doesn’t wreck the conclusions.
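
To see that agreement concretely, here’s a tiny sketch of the best-behaved case: a normal mean with known noise and a flat prior, where the central 95% credible interval and the 95% confidence interval coincide. All numbers are invented; the claim is only that the overlap is real in this regime, not that it holds in general.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sigma, n = 2.0, 200                        # known noise sd, decent sample size
data = rng.normal(5.0, sigma, size=n)
xbar, se = data.mean(), sigma / np.sqrt(n)

# Frequentist 95% confidence interval (known sigma, so a z-interval)
conf = stats.norm.interval(0.95, loc=xbar, scale=se)

# Bayesian 95% credible interval under a flat prior on the mean:
# the posterior is Normal(xbar, se^2), so the interval is the same
cred = stats.norm.ppf([0.025, 0.975], loc=xbar, scale=se)

print("confidence:", np.round(conf, 3))
print("credible:  ", np.round(cred, 3))
```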

But there’s another layer: people want probabilistic intuition, not long-run frequencies. It’s more natural to think, “I’m 95% sure the effect is between these two values,” than to think about what would happen in a hypothetical infinite world of replications.

We’ve created a tool that behaves like a Bayesian object—but only some of the time—and then we hand it to people without telling them when it breaks. That’s risky.


Real Consequences – The Replication Crisis

This isn’t just academic nitpicking—it has real downstream consequences. Take the replication crisis in psychology and biomedicine. One of the big contributors? Overconfidence in findings whose intervals were narrow but unstable.

A classic example is the misreading of p-values and confidence intervals in early social priming studies. Researchers would report a small effect with a tight interval, interpret it as definitive, and move on. But follow-ups revealed those effects weren’t replicable: the original studies were underpowered, and sampling variability (plus selection for significant results) made the reported intervals look misleadingly tight.
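
Here’s a toy simulation of that dynamic, with invented effect sizes and sample sizes: when power is low, the estimates that clear the significance bar systematically overstate the true effect, which is exactly the instability the replications exposed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, sd, n, reps = 0.2, 1.0, 20, 5_000    # small effect, small groups

sig_estimates = []
for _ in range(reps):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(true_effect, sd, n)
    _, p = stats.ttest_ind(b, a)
    if p < 0.05:                                  # only "significant" results get reported
        sig_estimates.append(b.mean() - a.mean())

print(f"power:                        {len(sig_estimates) / reps:.2f}")
print(f"average 'significant' effect: {np.mean(sig_estimates):.2f}  (true effect = {true_effect})")
```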

If they had instead said, “We observed this effect with this interval, but the uncertainty is large, and here’s how that might affect replication,” we might’ve avoided some of that mess.


Metascience Agrees

Recent studies back this up. For instance, Hoekstra et al. (2014) asked researchers to interpret a confidence interval, and most gave incorrect, probability-based answers. These weren’t undergrads—they were working scientists. The authors found that even with reminders, over 50% of participants still misunderstood the interpretation of CIs.

That’s not a user error. That’s a UX problem in statistical communication.


So What Do We Do With That?

I’m not suggesting we throw out confidence intervals—they’re incredibly useful. But maybe we need to treat them with more suspicion when sample sizes are small or assumptions are fragile.

Better yet, let’s stop pretending they’re objective. Every interval has modeling assumptions baked in. Every interval is conditional. And every interval is vulnerable to the same issues as p-values: misinterpretation, miscommunication, and misuse.

So when someone says, “Here’s a 95% CI,” my gut reaction now is: 95% of what? Conditional on what? Do I actually believe the model assumptions?

Confidence intervals don’t give us certainty—they force us to confront the limits of our inference. And that’s exactly why they’re powerful.

Thinking in Probabilities Changes Everything

Let’s pause for a second and just acknowledge something wild: thinking probabilistically is not natural for most humans—including, at times, people who build statistical models for a living. Our brains like stories, binaries, and certainties. Probabilities feel slippery. Like hedging.

But once you embrace probabilistic thinking, really internalize it, it changes how you see data, models, and even decision-making. This isn’t just about using probability distributions or writing priors—it’s about seeing the world as gradients of belief, not black-and-white truth.

And it’s way more powerful than people give it credit for.

So let me walk through five areas where probabilistic thinking isn’t just useful—it’s transformative. These aren’t “tips.” These are conceptual upgrades that shift how you do science.


1. Model Calibration > Point Forecasts

I’m just going to say it: we focus way too much on point estimates. They’re seductive. They feel like answers. But in any model that’s probabilistic in nature—say, a weather model or a financial forecast—calibration is what actually matters.

Imagine your model predicts a 70% chance of rain tomorrow, and over time, every time it says 70%, it rains 70% of the time. That’s a calibrated model. You can trust its probability statements.

Now contrast that with a model that always says “Yes” or “No” about rain. Sure, maybe it gets it right a bunch of the time. But can you act on its uncertainty? Not really. Probabilistic models let you plan better because they let you plan under risk.

This is why weather models and insurance risk models are evaluated using things like Brier scores or reliability diagrams—not just RMSE.
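
Here’s a minimal sketch of that kind of evaluation, using synthetic forecasts from a world set up to be perfectly calibrated. The Brier score and the per-bin reliability check are the real tools; the sample size and forecast range are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
p_forecast = rng.uniform(0.05, 0.95, n)       # forecast probabilities of rain
rained = rng.random(n) < p_forecast           # outcomes in a perfectly calibrated world

# Brier score: mean squared error of the probability forecasts (lower is better)
brier = np.mean((p_forecast - rained) ** 2)
print(f"Brier score: {brier:.3f}")

# Crude reliability check: within each forecast bin, did it rain about that often?
edges = np.linspace(0, 1, 11)
which_bin = np.digitize(p_forecast, edges) - 1
for b in range(10):
    mask = which_bin == b
    print(f"forecast {edges[b]:.1f}-{edges[b + 1]:.1f}: observed rate {rained[mask].mean():.2f}")
```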


2. Epistemic Humility in Model Comparison

Here’s something Bayesian model comparison taught me: you can have two models that fit the data well, but still differ wildly in their implications.

And that’s not just about the likelihood—it’s about priors, structures, assumptions, all the hidden stuff we usually take for granted. AIC, BIC, Bayes factors… they all give you ways to say, “this model is better,” but none of them can tell you that your entire model class isn’t missing something important.

So now when I do model comparison, I don’t just look at scores—I look at variance across posterior predictive distributions. 

I ask: How sensitive are my conclusions to the model structure? 

That’s epistemic humility. It’s not about finding the model—it’s about knowing when your models disagree and why.
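
Here’s a deliberately stripped-down, non-Bayesian stand-in for that habit: fit two structurally different models to the same invented data and look at how far apart their predictions land where it matters. The disagreement is a crude lower bound on model uncertainty that no within-model score will show you.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 1.0, x.size)   # invented data

# Two plausible structures fit to identical data
linear = np.polynomial.Polynomial.fit(x, y, 1)
quadratic = np.polynomial.Polynomial.fit(x, y, 2)

x_new = 14.0                                   # a mild extrapolation
preds = {"linear": linear(x_new), "quadratic": quadratic(x_new)}

for name, pred in preds.items():
    print(f"{name:9s} prediction at x={x_new}: {pred:.1f}")
print(f"structural disagreement: {abs(preds['linear'] - preds['quadratic']):.1f}")
```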


3. Propagating Uncertainty in Hierarchical Models

If you’ve worked with multilevel or hierarchical models, you already know the magic: they let you share strength across groups through partial pooling and model latent structure.

But here’s the part people often gloss over: these models don’t just handle complexity—they propagate uncertainty more honestly.

When I model something like school performance across districts, a classical fixed-effect model might treat each school as independent. But a hierarchical model says: wait, let’s account for the fact that these schools are similar in some ways—and let’s push that uncertainty up and down the levels.

This becomes huge in meta-analysis, policy modeling, even sports analytics. Hierarchical Bayesian models don’t just give us “better estimates.” They give us a more accurate picture of our uncertainty, especially when data is sparse.
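
Here’s the pooling idea in miniature: a sketch of the classic normal-normal shrinkage estimator, with an assumed between-school spread and noise level. A full hierarchical Bayesian model would estimate those quantities and propagate their uncertainty too; this just shows the pull on small, noisy groups.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented example: exam-score means for schools of very different sizes
n_students = np.array([5, 12, 40, 200, 8])
true_means = rng.normal(70, 5, size=n_students.size)
obs_means = np.array([rng.normal(m, 10, k).mean() for m, k in zip(true_means, n_students)])

sigma2 = 10.0**2 / n_students      # sampling variance of each observed school mean
tau2 = 5.0**2                      # assumed between-school variance
grand = np.average(obs_means, weights=1 / (sigma2 + tau2))

# Partial pooling: the noisier a school's mean, the harder it gets pulled to the grand mean
shrinkage = tau2 / (tau2 + sigma2)
pooled = grand + shrinkage * (obs_means - grand)

for k, raw, post in zip(n_students, obs_means, pooled):
    print(f"n={k:>3}  raw mean={raw:6.1f}  partially pooled={post:6.1f}")
```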


4. Prior Sensitivity as a Feature, Not a Flaw

I used to be nervous about prior sensitivity analyses. Like it was some gotcha that proved my model was fragile.

Now? 

I see it as a chance to learn about the boundaries of my belief system.

The truth is, all data analysis is conditional on assumptions. 

When you vary the priors in a Bayesian model and see how your posterior changes, you’re doing something incredibly important: you’re stress-testing your conclusions.

We do this in other sciences all the time—varying parameters in physical simulations, for instance. 

Why should stats be different? 

Running a few alternative priors and looking at posterior robustness is a way of saying: Here’s where I’m confident, and here’s where I’m not.
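
A minimal sketch of what that looks like in practice: a conjugate Beta-Binomial model with made-up data and three different priors. If the posterior intervals barely move, great; if they don’t agree, that disagreement is itself the finding.

```python
from scipy import stats

successes, trials = 7, 20                        # invented data

priors = {
    "flat Beta(1, 1)":         (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "skeptical Beta(2, 8)":    (2.0, 8.0),
}

for name, (a, b) in priors.items():
    posterior = stats.beta(a + successes, b + trials - successes)
    lo, hi = posterior.ppf([0.025, 0.975])
    print(f"{name:25s} mean {posterior.mean():.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```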


5. Probabilistic Causality and Counterfactuals

Last one, and this one’s big: causal inference has come a long way from “correlation is not causation.” The current wave—particularly in the Bayesian and potential outcomes frameworks—is all about assigning probabilities to counterfactuals.

Think about that: we’re now saying things like, “There’s a 70% probability that this person wouldn’t have relapsed if they’d received Treatment A instead of B.” That’s wild! But also incredibly useful for policy.

It’s not perfect, of course—we rely on identifiability assumptions, ignorability, etc. But it means we’re moving beyond binary yes/no answers to probabilistic statements about alternative worlds. 

And that’s a whole new level of nuance.
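
To show the flavor (and only the flavor), here’s a toy simulation with invented relapse probabilities. The shared latent draw that couples the two potential outcomes is a strong, untestable assumption; without something like it, real analyses can often only bound this kind of counterfactual probability.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000

# Invented relapse probabilities under each treatment. The shared uniform draw u
# is the assumption doing the heavy lifting: it couples the two potential outcomes.
p_relapse_A, p_relapse_B = 0.30, 0.55
u = rng.random(n)
relapse_A = u < p_relapse_A
relapse_B = u < p_relapse_B

# Among people who relapsed under B, how many would NOT have relapsed under A?
prob = np.mean(~relapse_A[relapse_B])
print(f"P(no relapse under A | relapse under B) ≈ {prob:.2f}")
```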


Probabilistic thinking isn’t just a technical choice—it’s a worldview. And once you get comfortable living in uncertainty, everything sharpens. 

You start asking better questions. 

You stop pretending that 95% certainty is the same as truth.

The Margin of Error Is Political Now

Let’s talk about a different kind of uncertainty—not the kind that lives in equations, but the kind that lives in headlines.

You’ve probably seen it: a political poll says Candidate X is ahead by 3 points, “within the margin of error.” Then talking heads jump in to spin that as a win, a loss, or a toss-up—whatever suits the narrative.

This isn’t just statistical illiteracy. It’s the weaponization of uncertainty.


The Polling Problem

Polls are maybe the most public-facing statistical tool we have. They’re everywhere. And they’re often misunderstood—even by people who report them.

Quick refresher: if a poll shows Candidate A ahead by 3 points with a margin of error of ±4%, that lead is statistically indistinguishable from zero. (The ±4% applies to each candidate’s share; the margin of error on the lead itself is roughly twice as large.) The lead could be real… or not. That uncertainty should prompt caution.
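
Here’s the arithmetic, with an invented sample size chosen to give that familiar ±4% on each candidate’s share. The margin of error on the lead itself comes out nearly twice as large, which is why a 3-point gap tells you so little.

```python
import numpy as np

n = 600                                  # invented poll size
p_a, p_b = 0.475, 0.445                  # shares implying a 3-point lead
z = 1.96

# Margin of error usually quoted: a single share, worst case p = 0.5
moe_share = z * np.sqrt(0.5 * 0.5 / n)

# Margin of error on the lead itself (difference of two shares from one poll)
var_lead = (p_a * (1 - p_a) + p_b * (1 - p_b) + 2 * p_a * p_b) / n
moe_lead = z * np.sqrt(var_lead)

print(f"MOE on each share: ±{100 * moe_share:.1f} points")
print(f"MOE on the lead:   ±{100 * moe_lead:.1f} points  (observed lead: 3.0 points)")
```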

But that’s not what happens. Instead, polling data gets read as a scoreboard. “Candidate A is winning!” they say—because nuance doesn’t sell.

What gets lost is the idea that margin of error is just sampling error, and that real uncertainty is even bigger when you consider things like response bias, non-coverage, and modeling assumptions. 

The interval they report? 

That’s the optimistic part.


The COVID Forecast Fiasco

Another great (and painful) example: COVID-19 case forecasts in 2020.

Early in the pandemic, models projected everything from 60,000 to 2 million deaths. People panicked. Then some folks said, “See? The models were wrong!”

But that criticism missed the point: those weren’t predictions. They were conditional projections. Most models said, “If we do X, we might see Y.” But policymakers and media often treated that as “We will see Y.”

This gap between probabilistic modeling and deterministic messaging hurt trust in science. It made modelers seem flaky. 

But the real issue was how badly we communicated what uncertainty means.


Uncertainty ≠ Weakness

Here’s the crux: in science, acknowledging uncertainty is a sign of rigor. In politics or public discourse, it’s often seen as a liability.

We need to push back on that. A model with wide intervals isn’t bad—it’s honest. A narrow confidence interval built on unrealistic assumptions? 

That’s way more dangerous.

And yet, we’re often pressured to deliver clean answers. “What’s the best estimate?” “Is the effect real?” “Will this work?”

The problem isn’t uncertainty—it’s how uncomfortable we are admitting that we don’t know for sure.


Communicating Uncertainty Better

So what do we do? Well, here’s what’s worked for me:

  • Visualize intervals, not just means. Show people the full distribution (there’s a small plotting sketch after this list).
  • Use scenario language. “If X happens, then Y is likely.”
  • Admit the assumptions. Be upfront about limitations.
  • Emphasize uncertainty early, not as a footnote.
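
For the first point, here’s the kind of minimal plot I mean: estimates shown with their intervals and a reference line for “no effect,” using invented numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented point estimates and 95% interval half-widths for three scenarios
labels = ["Scenario A", "Scenario B", "Scenario C"]
means = np.array([1.8, 0.4, 2.6])
halfwidths = np.array([0.5, 1.9, 0.7])

fig, ax = plt.subplots(figsize=(5, 2.5))
y_pos = np.arange(len(labels))
ax.errorbar(means, y_pos, xerr=halfwidths, fmt="o", capsize=4)
ax.axvline(0, color="grey", linestyle="--", linewidth=1)   # "no effect" reference line
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel("Estimated effect (95% interval)")
fig.tight_layout()
plt.show()
```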

Ultimately, we need a language that balances honesty with clarity. Not just for ethics—but because misleading certainty erodes trust faster than honest doubt ever will.

All Uncertainty Is Not Created Equal

Okay, final deep dive: not all uncertainty is the same. And if we want to talk seriously about modeling, inference, and decision-making, we need a better taxonomy.

Let me walk you through five distinct types of uncertainty we face in science. They’re not just semantics—they change how we think, model, and communicate.


1. Aleatoric vs. Epistemic Uncertainty

This one’s foundational.

  • Aleatoric uncertainty = randomness inherent in the system (e.g., dice rolls, quantum events).
  • Epistemic uncertainty = lack of knowledge (e.g., unknown parameters, incomplete models).

The key difference? 

Aleatoric uncertainty can’t be reduced with more data. Epistemic can. But in real systems—like climate modeling or disease spread—we usually have both.

And if we model epistemic uncertainty as aleatoric, we’re essentially saying, “This is just noise,” when it’s actually ignorance. That’s a big mistake.
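
Here’s a cartoon of how the split shows up in a predictive model, via the law of total variance and invented numbers: the spread between ensemble members stands in for epistemic uncertainty, the observation noise for aleatoric.

```python
import numpy as np

rng = np.random.default_rng(9)

# Pretend we have an ensemble of plausible models (e.g., posterior draws of a mean).
ensemble_means = rng.normal(3.0, 0.8, size=500)   # disagreement between members
noise_var = 1.5**2                                # assumed irreducible observation noise

epistemic = ensemble_means.var()     # shrinks as data accumulate
aleatoric = noise_var                # does not
total = epistemic + aleatoric        # law of total variance (with constant noise)

print(f"epistemic variance: {epistemic:.2f}")
print(f"aleatoric variance: {aleatoric:.2f}")
print(f"total predictive:   {total:.2f}")
```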


2. Model Uncertainty

Model misspecification, structural assumptions, choice of functional forms—this is the elephant in the room for a lot of statistical inference.

We often behave as if our model is “correct” and focus only on parameter uncertainty. But in practice, model uncertainty is usually larger than the parameter uncertainty we report.

Tools like Bayesian model averaging, ensemble modeling, or stacking help here. But even then, we rarely propagate model uncertainty fully into our decisions. We should.
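
As one lightweight stand-in for those tools, here’s a sketch using Akaike weights to average predictions across candidate models; the AIC values and predictions are made up. Stacking and full Bayesian model averaging do more, but the flavor is the same: no single model gets to speak alone.

```python
import numpy as np

# Invented AIC values and point predictions for three candidate models
aic = np.array([210.3, 212.1, 218.9])
preds = np.array([4.2, 5.1, 3.6])

# Akaike weights: relative plausibility of each model given its AIC
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()

averaged = np.sum(weights * preds)
print("model weights:            ", np.round(weights, 3))
print("model-averaged prediction:", round(float(averaged), 2))
```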


3. Sampling Error vs. Measurement Error

Another underappreciated one. Sampling error is about who we got in the dataset. Measurement error is about how well we measured them.

In social science and public health, measurement error can be brutal—think self-reported drug use, income, or symptoms. And yet, many models treat inputs as clean.

Accounting for both types separately (e.g., through latent variable models, validation studies) is crucial if you want valid inferences.
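
The classic demonstration of why this matters: add noise to a predictor and watch the regression slope attenuate toward zero. Everything here is simulated, but the shrinkage factor (the reliability ratio) is the standard textbook result.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 50_000

true_x = rng.normal(0, 1, n)
y = 0.8 * true_x + rng.normal(0, 1, n)        # true slope = 0.8
noisy_x = true_x + rng.normal(0, 1, n)        # predictor measured with error

slope_clean = np.polyfit(true_x, y, 1)[0]
slope_noisy = np.polyfit(noisy_x, y, 1)[0]

# Classical attenuation: slope shrinks by the reliability ratio var(x) / var(x_observed)
print(f"slope with clean x: {slope_clean:.2f}")
print(f"slope with noisy x: {slope_noisy:.2f}  (expected ≈ {0.8 * 1 / (1 + 1):.2f})")
```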


4. Ontological Uncertainty

This one’s more philosophical, but bear with me.

Ontological uncertainty is when you’re not even sure what variables matter—or whether the system is well-defined.

Think about early-stage research on consciousness, dark matter, or social tipping points. We can’t assign probabilities to these yet. We don’t have the right model class. We’re still trying to figure out what the world even is.

This kind of uncertainty resists quantification—and that’s okay. But we need to name it when it shows up.


5. Communicative Uncertainty

Lastly: the uncertainty that gets lost between the model and the audience.

It doesn’t matter how elegant your probabilistic model is—if you can’t communicate its assumptions, its confidence, and its limitations, you’re adding uncertainty, not removing it.

This is where we circle back to trust, clarity, and even design. Numbers aren’t enough. We need metaphors, visualizations, and better defaults for conveying the fuzziness in our thinking.


Final Thoughts

So here’s where I land after all this: embracing uncertainty isn’t a compromise—it’s the core of good science.

Confidence intervals, margins of error, posterior probabilities—they’re not weak. They’re precise expressions of our ignorance. And in a world that’s messy, complex, and always changing, I’d rather have calibrated humility than false certainty.

If our job is to describe the world, then admitting what we don’t know—and quantifying it carefully—isn’t a failure. It’s the whole point.

Let’s keep doing that work. And let’s keep getting better at saying, “Here’s what the data say… and here’s what they don’t.”