Statistics 101 for Evaluation

Evidence and Impact Module 2025-26

Andreas Varotsis


“Understanding uncertainty is the foundation of credible impact measurement.”

Learning Goals

By the end of this session, you’ll be able to:

  • Build a mental model of uncertainty, inference, and design trade-offs
  • Correctly interpret p-values, confidence intervals, and power
  • Understand what a regression coefficient actually tells you
  • Recognize when frequentist tools work — and when Bayesian thinking helps

Today’s Journey (90 min)

  1. Distributions & Law of Large Numbers (18 min)
  2. p-Values, Significance & Confidence Intervals (17 min)
  3. Power & Sample Size (16 min)
  4. Regression & Linear Modelling (24 min)
  5. Bayesian Zoom-Out (10 min)
  6. Wrap & Q/A (5 min)

Each section: Scenario → Math + Intuition → Tool Demo → Case Questions

Part 1: Distributions & Law of Large Numbers

Duration: 18 minutes

The Problem: Noise vs Signal

Scenario: A council launches a redesigned service form.

  • Week 1 (before): Daily completion rates bounce between 38% and 56%
  • Week 2 (after): Daily completion rates bounce between 42% and 58%

Stakeholder question: “Did the redesign work?”

Your challenge: Explain why variation alone doesn’t prove anything

What Is a Distribution?

A distribution describes how outcomes vary across repeated observations.

Example: Daily completion rates over 30 days

  • Mean (μ): average completion rate (e.g., 45%)
  • Variance (σ²): how spread out the rates are
  • Standard deviation (σ): typical distance from the mean (e.g., 8pp)

Key insight: Even with no change, you’ll see variation day-to-day!

The Law of Large Numbers (LLN)

As sample size grows, the sample mean (\(\bar{x}\)) converges to the true mean (μ).

Formula: \[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \xrightarrow{n \to \infty} \mu\]

Plain English: With more data, your estimate gets more stable and accurate.
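
A minimal R sketch of the LLN in action (the true completion rate of 45% is assumed purely for illustration):

  # Simulate Yes/No form completions with a true completion rate of 45%
  set.seed(42)
  p_true <- 0.45
  outcomes <- rbinom(5000, size = 1, prob = p_true)

  # Running estimate after 1, 2, ..., 5000 observations
  running_mean <- cumsum(outcomes) / seq_along(outcomes)

  # Early estimates bounce around; later ones settle near p_true
  running_mean[c(10, 100, 1000, 5000)]
  plot(running_mean, type = "l", xlab = "Number of observations",
       ylab = "Estimated completion rate")
  abline(h = p_true, lty = 2)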

Central Limit Theorem (Sneak Peek)

Even if individual outcomes are weird, averages tend to look Normal.

This lets us:

  • Calculate confidence intervals
  • Run hypothesis tests
  • Make probabilistic statements about our estimates

Visual: Show sampling distribution getting narrower and more Normal as n increases

Binary Outcomes: Special Case

For Yes/No outcomes (completed the form, clicked the link, attended the meeting):

Bernoulli distribution:

  • Each person has probability \(p\) of success
  • For a single outcome \(X\): \(E[X] = p\), \(\text{Var}(X) = p(1-p)\)
  • For the sample proportion: \(\text{Var}(\hat{p}) = \frac{p(1-p)}{n}\) (uncertainty shrinks as \(n\) grows!)

Key implication: The more people in your sample, the more confident you are about the true completion rate.

Graphs: Visualizing Variation

Three key visualizations:

  1. Histogram of daily completions (before/after periods)
  2. Sampling distribution of the mean for n=20 vs n=200
  3. Binomial → Normal approximation overlay

The sampling distribution shows us what repeated samples would look like — this is where uncertainty comes from!
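
A short R sketch of visualization 2 (parameters assumed: true rate 45%, 5,000 simulated samples per sample size):

  # Sampling distribution of the mean completion rate for two sample sizes
  set.seed(1)
  p_true <- 0.45
  sims <- 5000
  means_n20  <- replicate(sims, mean(rbinom(20,  1, p_true)))
  means_n200 <- replicate(sims, mean(rbinom(200, 1, p_true)))

  # The spread of the estimate shrinks roughly in proportion to 1/sqrt(n)
  sd(means_n20)    # close to sqrt(0.45 * 0.55 / 20)  ≈ 0.11
  sd(means_n200)   # close to sqrt(0.45 * 0.55 / 200) ≈ 0.035

  hist(means_n20,  breaks = 30, col = rgb(0, 0, 1, 0.4), xlim = c(0.2, 0.7),
       main = "Sampling distribution of the mean", xlab = "Sample mean")
  hist(means_n200, breaks = 30, col = rgb(1, 0, 0, 0.4), add = TRUE)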

Interactive Demo: Sampling Distribution Explorer

Key insight: Notice how n=200 (coral) is ~2× narrower than n=50 (blue). This demonstrates √n scaling: 4× more data → 2× less uncertainty.

Case Study Prompts

Question 1: If baseline completion is ~45%, what run length (days of data) stabilizes your weekly estimate within ±2pp most of the time?

Question 2: You observe +4pp improvement after the redesign. When is this just noise vs real change? What sample size changes that answer?
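
One hedged way into Question 1, as a back-of-envelope in R. The ±2pp target and 45% baseline come from the prompt; the daily volume of 100 submissions is an assumption to replace with the council's real figures:

  # Responses needed before a 95% interval is within roughly ±2pp
  p <- 0.45
  margin <- 0.02
  n_needed <- ceiling(1.96^2 * p * (1 - p) / margin^2)
  n_needed   # about 2,400 responses in total

  # Translate into days under the assumed volume of ~100 submissions/day
  daily_volume <- 100
  ceiling(n_needed / daily_volume)   # roughly 24 days at that volume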

Key Takeaway: Part 1

Variation is normal. Without understanding the distribution and sample size, you can’t distinguish signal from noise.

\[\text{Uncertainty} \propto \frac{1}{\sqrt{n}}\]

More data → less uncertainty → stronger conclusions

Part 2: p-Values, Significance & Confidence Intervals

Duration: 17 minutes

The Problem: Did It Actually Work?

Scenario: Two outreach emails (A vs B) invite residents to a community safety survey.

  • Email A (control): Standard invitation → 12.3% response rate
  • Email B (treatment): Personalized invitation → 15.8% response rate

Stakeholder question: “Email B is clearly better, right?”

Your challenge: Is this difference real or could it be random chance?

The Null Hypothesis (H₀)

Null hypothesis (H₀): There is no real difference between A and B.

\[H_0: p_A = p_B\]

Alternative hypothesis (H₁): There is a real difference.

\[H_1: p_A \neq p_B\]

Our test asks: “If H₀ were true, how surprising is what we observed?”

The Test Statistic

What we observe: Difference in sample proportions

\[\hat{p}_B - \hat{p}_A = 0.158 - 0.123 = 0.035\]

Translation: Email B had a 3.5 percentage point higher response rate.

But: Is +3.5pp a lot? Depends on:

  • Sample sizes (\(n_A\) and \(n_B\))
  • Baseline variability
  • What we’d expect from random chance

Standard Error & p-Values

Standard Error measures uncertainty in our estimate:

\[SE(\hat{p}_B - \hat{p}_A) = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}\]

p-value: If the null were true, how often would we see a difference this large or larger?

  • p < 0.05: “Statistically significant” (common threshold)
  • p = 0.03: Only a 3% chance of seeing a difference this large or larger if the emails were really identical
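
In R, prop.test runs this comparison directly. The response rates come from the scenario; the sample sizes (1,000 emails per arm) are invented for illustration:

  # Two-proportion test for the email A/B example (sample sizes assumed)
  n_A <- 1000; n_B <- 1000
  x_A <- round(0.123 * n_A)   # responses to Email A
  x_B <- round(0.158 * n_B)   # responses to Email B

  prop.test(x = c(x_B, x_A), n = c(n_B, n_A))
  # Output includes a p-value and a 95% CI for the difference in rates

  # The standard error from the formula above
  sqrt(0.123 * 0.877 / n_A + 0.158 * 0.842 / n_B)
  # ≈ 0.0155, so the observed 0.035 difference sits ~2.3 SEs from zero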

Type I and Type II Errors

                     H₀ True (no effect)    H₁ True (real effect)
Reject H₀            Type I Error (α)       ✅ Correct
Fail to Reject H₀    ✅ Correct             Type II Error (β)

α (alpha): False positive rate (typically 5%)
β (beta): False negative rate
Power: 1 - β (typically aim for 80%)

Confidence Intervals: Better Than p-Values

95% Confidence Interval for the difference:

\[\text{Estimate} \pm 1.96 \times SE\]

Example: Email B uplift = 3.5pp with 95% CI [0.2pp, 6.8pp]

Interpretation: We’re 95% confident the true effect is between 0.2pp and 6.8pp.

Why better than p-values? Shows magnitude and precision, not just “significant/not significant.”

Formula Detail: Two-Proportion Test

For large samples, the 95% CI is approximately:

\[(\hat{p}_B - \hat{p}_A) \pm 1.96 \times \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}\]

Better alternatives for small n:

  • Agresti-Coull CI (adds pseudo-observations)
  • Score interval (inverts the test)

Graph: Visualizing p-Values

Null distribution of the test statistic:

  • Bell curve centered at 0 (no difference)
  • Observed difference marked with a vertical line
  • Shaded tail area = p-value

Visual: The further your observed difference from zero, the smaller the tail area (lower p-value)

Interactive Demo: A/B Testing Simulator

Key insight: Use the 4 sliders to explore how sample size, effect size, alpha, and number of trials affect Type I error (null scenario, coral) and statistical power (alternative scenario, blue).

The Danger of p-Hacking

What is p-hacking?

Testing multiple hypotheses but only reporting the “significant” ones.

The problem: With α = 0.05 and no real effects at all, chance alone produces about 1 false positive per 20 tests!

Common forms:

  • Testing many outcomes, reporting only “significant” ones
  • Analyzing by multiple subgroups (age, gender, location, device…)
  • Stopping data collection when p < 0.05
  • Trying different statistical methods until one “works”

Result: Published findings that won’t replicate

Interactive Demo: P-Hacking Simulator

Key insight: Use the 4 sliders to explore how testing multiple hypotheses inflates false positives. Even when there are NO real effects, you’ll find “significant” results by pure chance!
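
A minimal R version of the simulator's null scenario (all numbers are arbitrary, and by construction there are no real effects):

  # 20 outcomes, no true effects: how many look "significant" by chance?
  set.seed(7)
  n_tests <- 20
  n_per_arm <- 500

  p_values <- replicate(n_tests, {
    control   <- rbinom(n_per_arm, 1, 0.40)   # same true rate in both arms
    treatment <- rbinom(n_per_arm, 1, 0.40)
    t.test(treatment, control)$p.value        # simple mean comparison per outcome
  })

  sum(p_values < 0.05)   # on average ~1 false positive per run; sometimes more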

Preventing p-Hacking: Pre-Registration

The gold standard: Pre-register your analysis plan

Before collecting data:

  1. Specify ONE primary outcome
  2. Define your analysis plan (method, covariates, sample size)
  3. Register it publicly (OSF, AsPredicted, clinical trials registry)

Why this works:

  • Removes researcher degrees of freedom
  • Makes deviations transparent
  • Increases trust in findings
  • Prevents fooling yourself

Example: Clinical trials must pre-register to prevent selective reporting

Dealing with Multiple Comparisons

If you must test multiple outcomes:

  • Bonferroni correction: Divide α by number of tests
    • Testing 20 outcomes? Use α = 0.05/20 = 0.0025
  • False Discovery Rate (FDR): More powerful alternative (Benjamini-Hochberg)
  • Report ALL tests, not just significant ones
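
Both corrections are one line of R with p.adjust (the vector of p-values below is invented to illustrate):

  # Hypothetical p-values from several secondary outcomes
  p_raw <- c(0.003, 0.020, 0.047, 0.180, 0.410)

  p.adjust(p_raw, method = "bonferroni")   # each p multiplied by the number of tests (capped at 1)
  p.adjust(p_raw, method = "BH")           # Benjamini-Hochberg false discovery rate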

Red flags to watch for:

  • “We found X worked for [oddly specific subgroup]”
  • No pre-registered analysis plan
  • Only reporting significant results

Bottom line: p-hacking is easy to do accidentally. Pre-registration prevents it.

Case Study Prompts

Question 1: For an observed uplift of +3.8pp with 95% CI [−0.4pp, +8.0pp], how would you brief a stakeholder?

Question 2: Your trial reports p = 0.047 once, but 8 secondary outcomes were also tested. What does “significant” mean now?

Key Takeaway: Part 2

p-values tell you if an effect is surprising under the null. Confidence intervals tell you how big it might be. Always report both.

Good reporting:

  • Effect size ✅
  • Confidence interval ✅
  • p-value (optional) ✅
  • Sample size ✅
  • Method ✅

Part 3: Power & Sample Size

Duration: 16 minutes

The Problem: Planning Ahead

Scenario: You plan an SMS reminder to reduce missed appointments at a community clinic.

  • Current rate: 42% of people miss appointments
  • Your goal: Reduce to 39% (−3pp improvement)

Ops lead asks: “How many people per arm do we need to reliably detect this?”

Your challenge: Design a study with enough power to detect real effects.

What Is Power?

Statistical power: Probability you’ll detect a real effect if it exists.

\[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ is true})\]

Common target: 80% power (i.e., β = 20%)

Tradeoffs:

  • Higher power → need larger sample size
  • Smaller effects → need larger sample size
  • Lower α → need larger sample size

The Power-Sample Size Relationship

Rule of thumb: Detecting small uplifts in binary outcomes needs large n.

Approximate formula (two-proportion test, equal group sizes):

\[n \approx \frac{(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)})^2}{(p_B-p_A)^2}\]

Don’t memorize! Use a calculator or tool. But note:

  • \(n\) grows with \(1/(\text{effect size})^2\) → half the effect = 4× the sample
  • \(n\) grows with baseline variance → more variable outcomes need more data
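
In R, the built-in power.prop.test does this calculation; the inputs below are the clinic scenario from this part (42% → 39%, 80% power, α = 0.05):

  # Sample size per arm to detect a drop from 42% to 39% missed appointments
  power.prop.test(p1 = 0.42, p2 = 0.39, power = 0.80, sig.level = 0.05)
  # n in the output is per group: roughly 4,000+ people per arm

  # Flipped round: with 600 per arm fixed, what difference is detectable?
  power.prop.test(n = 600, p1 = 0.42, power = 0.80, sig.level = 0.05)
  # Solves for p2: the detectable effect is much larger than 3pp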

Effect Size: MDE Thinking

Minimum Detectable Effect (MDE): Smallest effect your study can reliably detect.

Key question: “What effect size would actually matter for policy/practice?”

Example:

  • 1pp improvement in appointment attendance: probably not worth scaling
  • 5pp improvement: worth considering
  • 10pp improvement: definitely worth scaling

Design principle: Match your MDE to your practical significance threshold.

Graphs: Power Curves

Two useful visualizations:

  1. Power curve vs n (for fixed MDE)
    • Shows: how power increases as you add more people
    • Typical shape: S-curve (steep in the middle, flat at extremes)
  2. MDE curve vs n (for fixed power)
    • Shows: what effect sizes you can detect for a given n
    • Typical shape: hyperbola (diminishing returns to adding people)

Interactive Demo: Power & Sample Size Calculator

Key insight: Move the slider to see how MDE affects required sample size. Smaller effects (1pp) need huge samples; larger effects (10pp) are detectable with modest samples.

Case Study Prompts

Question 1: With baseline 42% missed appointments and desired MDE of −3pp (to 39%) at 80% power, what n per arm is needed?

Question 2: If you can only recruit 1,200 people total (600 per arm), what MDE becomes realistic at 80% power? Is this still meaningful?

Practical Constraints: What If n Is Fixed?

Reality: Often you CAN’T get more people (budget, time, eligible population).

Options:

  1. Accept lower power → risk false negatives, report this
  2. Accept larger MDE → can only detect big effects
  3. Reduce variance → better measurement, blocking, stratification
  4. Increase α → accept more false positives (rarely done)
  5. Use a more powerful design → within-subjects, stepped wedge

Key Takeaway: Part 3

Power determines whether you can reliably detect real effects. Design your sample size for the minimum effect that would change your decision.

\[\text{More power} \Leftrightarrow \text{Larger n OR Larger effect OR Lower variance}\]

Always report: “This study can detect effects ≥ Xpp with 80% power.”

Part 4: Regression & Linear Modelling

Duration: 24 minutes

The Problem: Confounding

Scenario: You roll out posters + emails to promote a community consultation.

Complication:

  • Younger residents (18-35) were more exposed to the campaign
  • Younger residents also use mobile devices more
  • Both age AND mobile usage might affect participation

Challenge: What’s the adjusted effect of the campaign, controlling for age and device?

What Is Regression?

Regression models the relationship between:

  • An outcome variable (Y): participation rate, completion rate, etc.
  • One or more predictor variables (X): treatment, age, device, etc.

Ordinary Least Squares (OLS) finds the line that best fits your data:

\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 \text{Mobile}_i + \varepsilon_i\]

β₁ = effect of treatment, holding age and device constant
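
A sketch of how this model might be fitted in R; the data frame and variable names (df, participated, treatment, age, mobile) are placeholders rather than a real dataset:

  # OLS on a 0/1 outcome (a linear probability model)
  fit <- lm(participated ~ treatment + age + mobile, data = df)
  summary(fit)

  # With a binary outcome, heteroskedasticity-robust SEs are advisable
  lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC2"))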

Interpreting Coefficients

Example output:

Variable     Coefficient   Std Error   p-value
Intercept    0.28          0.03        <0.001
Treatment    0.029         0.014       0.038
Age          -0.002        0.001       0.045
Mobile       0.045         0.018       0.012

β₁ = 0.029: The campaign increased participation by 2.9 percentage points, adjusting for age and device.

Regression as Adjusted Means

Intuition: Regression is just comparing groups AFTER adjusting for other factors.

Visual: Imagine two histograms (treatment vs control), but you’ve matched them on age and device first.

β₁ is the difference in average outcomes after this matching.

Math detail: OLS is equivalent to weighted averages, where weights ensure balance on covariates.

Key Assumptions

For OLS to give reliable results:

  1. Linearity (in parameters): the outcome is modelled as an additive, linear combination of the predictors
  2. Exogeneity: ε is uncorrelated with X (no unmeasured confounders)
  3. No perfect collinearity: predictors aren’t exact copies of each other
  4. Homoskedasticity: variance of ε doesn’t depend on X

Practical tip: Use robust standard errors (or clustered SEs) to relax #4.

Clustering & Robust Standard Errors

Problem: Outcomes within the same ward, school, or household are correlated.

Solution: Use cluster-robust standard errors by group.

Example: If 20 schools each recruit 50 families:

  • Don’t treat all 1,000 families as independent
  • Cluster by school → SEs will be larger (more conservative)

R code: m <- lm(...); lmtest::coeftest(m, vcov = sandwich::vcovCL(m, cluster = ~school_id))

Diagnostics: Residual Plots

Check your assumptions visually:

  1. Residuals vs Fitted: Should be randomly scattered
    • Pattern → model misspecification (try transformations or interactions)
  2. Q-Q plot: Should be roughly a straight line
    • Deviations → non-Normal errors (often okay with large n due to CLT)
  3. Leverage plot: Identify influential observations
    • High leverage + large residual → outlier that affects estimates
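
In R, the first two of these (plus a leverage view) come straight from the base plot method for an lm fit, using the placeholder model from earlier:

  # 1 = residuals vs fitted, 2 = Normal Q-Q, 5 = residuals vs leverage
  plot(fit, which = c(1, 2, 5))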

R² and Model Fit

R²: Proportion of variance in Y explained by the model.

  • R² = 0.15 → model explains 15% of variation
  • R² = 0.80 → model explains 80% of variation

Important: R² ≠ whether your model is good!

  • High R² doesn’t mean causal
  • Low R² can still have reliable β estimates
  • Focus on β₁ (your treatment effect) and its SE, not R²

Interactions: When Effects Vary

Sometimes the treatment effect differs by subgroup:

\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 (\text{Treatment} \times \text{Age}) + \varepsilon_i\]

β₃: How treatment effect changes with age

Example: If β₃ < 0, campaign is more effective for younger residents.
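
The same placeholder model from earlier with the interaction added; in R, treatment * age expands to both main effects plus their product (assuming treatment is coded 0/1):

  # Treatment effect allowed to vary with age
  fit_int <- lm(participated ~ treatment * age + mobile, data = df)
  summary(fit_int)

  # The treatment effect at a given age is then
  # coef(fit_int)["treatment"] + coef(fit_int)["treatment:age"] * age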

Case Study Prompts

Question 1: Unadjusted uplift is +5.2pp; adjusted β₁ is +2.9pp (robust SE 1.4pp). How do you report this to stakeholders?

Question 2: Adding Age² improves R² but leaves β₁ similar. What does that suggest about confounding vs functional form?

When Regression Isn’t Enough

Regression assumes:

  • No unmeasured confounders
  • Linear-additive effects (or specified interactions)
  • Correct functional form

If these fail, consider:

  • Instrumental Variables (Part 1 slides, quasi-experimental)
  • Difference-in-Differences (before/after × treatment/control)
  • Regression Discontinuity (exploit thresholds)
  • Propensity Score Matching (balance observables first)

Key Takeaway: Part 4

Regression estimates treatment effects while adjusting for confounders. Always interpret coefficients in context, check assumptions, and use robust/clustered SEs when appropriate.

Report: “The campaign increased participation by βpp (95% CI: [X, Y]), adjusting for age and device.”

Part 5: Bayesian Zoom-Out

Duration: 10 minutes

The Limitation of Frequentism

Frequentist approach:

  • Assumes a true fixed parameter (e.g., treatment effect)
  • Makes probability statements about data (p-values, CIs)
  • Can’t say “95% probability effect is positive”

Bayesian approach:

  • Treats parameters as uncertain (have distributions)
  • Makes probability statements about parameters
  • Can say “95% probability effect is between X and Y”

The Core Insight: Bayes’ Theorem

Update beliefs based on evidence:

\[P(\theta | \text{Data}) = \frac{P(\text{Data} | \theta) \times P(\theta)}{P(\text{Data})}\]

Plain English:

  • Prior: \(P(\theta)\) = what we believed before seeing data
  • Likelihood: \(P(\text{Data} | \theta)\) = how consistent data is with each possible θ
  • Posterior: \(P(\theta | \text{Data})\) = updated belief after seeing data

Toy Example: Beta-Binomial

Scenario: Prior evidence from other councils suggests outreach emails usually increase response by 0-5pp.

Model:

  • Prior: Response rate \(p \sim \text{Beta}(a, b)\) (flexible distribution on [0,1])
  • Data: \(x\) responses out of \(n\) emails
  • Posterior: \(p | \text{Data} \sim \text{Beta}(a+x, b+n-x)\)

Result: Direct probability distribution over the response rate!
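
A hedged R sketch of the conjugate update; the prior Beta(12, 88) and the data (158 responses out of 1,000 emails) are invented for illustration:

  # Prior: Beta(12, 88), i.e. a prior belief that the response rate is around 12%
  a <- 12; b <- 88

  # Data: x responses out of n emails (made-up numbers)
  x <- 158; n <- 1000

  # Posterior is Beta(a + x, b + n - x)
  a_post <- a + x
  b_post <- b + n - x

  a_post / (a_post + b_post)               # posterior mean
  qbeta(c(0.025, 0.975), a_post, b_post)   # 95% credible interval for p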

Prior Selection: Art or Science?

Types of priors:

  1. Uninformative: Flat, let data speak (e.g., Beta(1,1))
  2. Weakly informative: Regularize, prevent extreme estimates
  3. Informative: Based on past studies, theory, expert judgment

Best practice:

  • Be transparent about prior choice
  • Run sensitivity analysis: how do results change with different priors?
  • In civic tech: use priors from similar interventions if available

Credible Intervals vs Confidence Intervals

95% Confidence Interval (frequentist):

  • “If we repeated the study many times, 95% of intervals would contain the true parameter”
  • Statement about the procedure, not this particular interval

95% Credible Interval (Bayesian):

  • “There’s a 95% probability the parameter is in this interval, given our data and prior”
  • Direct statement about the parameter

Visual: Show prior, likelihood, posterior curves; credible interval is 95% of posterior mass.

When Bayesian Thinking Helps

Situations where Bayes shines:

  • Small samples: Incorporate prior knowledge to improve estimates
  • Sequential testing: Update beliefs as data accumulates
  • Decision analysis: Compute probability of meeting a threshold
  • Complex models: Hierarchy, missing data (MCMC handles these well)

Example: “What’s the probability the campaign increases response by at least 3pp?” → Integrate posterior above 3pp.
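
A minimal Monte Carlo version of that calculation, using flat Beta(1, 1) priors on each arm and invented counts (123/1,000 responses for control, 158/1,000 for the campaign):

  # Posterior draws for each arm's response rate (flat priors, made-up data)
  set.seed(3)
  draws_A <- rbeta(100000, 1 + 123, 1 + 1000 - 123)
  draws_B <- rbeta(100000, 1 + 158, 1 + 1000 - 158)

  # Posterior probability the uplift is at least 3 percentage points
  mean(draws_B - draws_A >= 0.03)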

Case Study Prompts

Question 1: With a skeptical prior centered at +1pp and observed +3.2pp uplift (n=400 per arm), does your posterior still support a meaningful positive effect?

Question 2: How does doubling the prior sample weight (stronger prior) change conclusions vs using a flat prior?

Key Takeaway: Part 5

Bayesian inference lets you make direct probability statements about parameters, incorporating prior knowledge. It’s especially useful for small samples, sequential testing, and decision analysis.

Frequentist: “Given a true parameter, what’s the probability of this data?”
Bayesian: “Given this data, what’s the probability distribution of the parameter?”

Part 6: Wrap & Q/A

Duration: 5 minutes

Cross-Cutting Pitfalls

Common mistakes in evaluation:

  1. Multiple comparisons without correction → inflated false positive rate
  2. p-hacking → testing many things, only reporting “significant” ones
  3. Regression to the mean → target worst performers, they improve anyway
  4. Measurement drift → definition changes over time
  5. Ignoring clustering → SEs too small, false positives

Prevention: Pre-register analysis plans, use robust methods, be transparent.

Good Practice Checklist

Design phase:

  • Pre-register primary outcomes and analysis plan
  • Calculate required sample size (power analysis)
  • Plan for covariates to adjust for

Analysis phase:

  • Report effect sizes with confidence intervals
  • Use robust/clustered SEs where appropriate
  • Check diagnostics (residual plots, balance checks)
  • Don’t p-hack (stick to pre-specified analyses)

Reporting phase:

  • Transparent about limitations
  • Share data and code where possible
  • Plain-language interpretation
  • Acknowledge what you DON’T know

Tools & Resources

Interactive tools from today:

  • Sampling Distribution Explorer (understand sampling distributions)
  • A/B Testing Simulator (Type I vs Type II errors)
  • Power & Sample Size Calculator (sample size planning)

Further reading:

  • Gelman & Hill, Regression and Other Stories
  • McElreath, Statistical Rethinking (Bayesian)
  • Gerber & Green, Field Experiments
  • mySociety Research Methods documentation

Exit Tickets

Before you leave, please answer:

  1. Write a one-sentence interpretation of a 95% confidence interval for a treatment effect of +3.5pp, CI [0.2pp, 6.8pp].

  2. Name one design change to increase statistical power without inflating α.

Q&A

Open floor for questions on any part of the session.

Common questions:

  • “When should I use Bayesian vs frequentist?”
  • “How do I deal with small sample sizes?”
  • “What if I can’t randomize?”
  • “How do I explain this to non-technical stakeholders?”

Key Takeaways: Statistics 101

1. Distributions & LLN: Variation is normal. More data → less uncertainty.

2. p-values & CIs: Effect size + confidence interval > binary significance.

3. Power: Design for the minimum effect that matters. Don’t run underpowered studies.

4. Regression: Adjust for confounders, check assumptions, use robust SEs.

5. Bayesian: Direct probability statements, incorporate prior knowledge.

“Good statistics makes good evaluation possible. Good evaluation makes good decisions possible.”