Part 1: Distributions & Law of Large Numbers
Duration: 18 minutes
The Problem: Noise vs Signal
Scenario: A council launches a redesigned service form.
Week 1 (before): Daily completion rates bounce between 38% and 56%
Week 2 (after): Daily completion rates bounce between 42% and 58%
Stakeholder question: “Did the redesign work?”
Your challenge: Explain why variation alone doesn’t prove anything
This is THE foundational problem in evaluation. Everyone sees numbers go up and down. How do you know it’s real change vs random noise?
Ask the room: “How would you respond to the stakeholder? What would you need to know?”
Common answers: more time, more data, baseline comparison. All correct! We’re going to formalize that intuition.
What Is a Distribution?
A distribution describes how outcomes vary across repeated observations.
Example: Daily completion rates over 30 days
Mean (μ): average completion rate (e.g., 45%)
Variance (σ²): how spread out the rates are
Standard deviation (σ): typical distance from the mean (e.g., 8pp)
Key insight: Even with no change, you’ll see variation day-to-day!
Draw a histogram on the board if possible. Show actual data: maybe 30 days of completion rates.
Emphasize: the distribution tells you what “normal variation” looks like. If you don’t know the distribution, you can’t tell signal from noise.
Math note: σ (sigma) is measured in the same units as your data. So if measuring %, then σ is in percentage points.
The Law of Large Numbers (LLN)
As sample size grows, the sample mean (\(\bar{x}\) ) converges to the true mean (μ).
Formula: \[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \xrightarrow{n \to \infty} \mu\]
Plain English: With more data, your estimate gets more stable and accurate.
This is why “wait and collect more data” is usually good advice!
Example: If you flip a fair coin 10 times, you might get 7 heads (70%). Flip 1,000 times? You’ll be very close to 50%.
For our form example: measuring completion rate over 1 day is noisy. Over 30 days? Much more stable.
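A minimal R sketch of the LLN in action, using assumed numbers (true completion rate 45%, 200 form attempts per day): the running mean of daily completion rates settles onto the true rate as days accumulate.

```r
# Law of Large Numbers: the running mean of daily completion rates
# stabilises as more days of data accumulate.
# Assumed for illustration: true rate 45%, 200 form attempts per day.
set.seed(42)
true_rate <- 0.45
attempts  <- 200
days      <- 90

daily_rate   <- rbinom(days, size = attempts, prob = true_rate) / attempts
running_mean <- cumsum(daily_rate) / seq_len(days)

round(daily_rate[1:5], 3)                 # single days bounce around 45%
round(running_mean[c(1, 7, 30, 90)], 3)   # the cumulative estimate settles down
```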
The LLN doesn’t say HOW FAST you converge - that depends on variance. Which brings us to…
Central Limit Theorem (Sneak Peek)
Even if individual outcomes are weird, averages tend to look Normal.
This lets us:
Calculate confidence intervals
Run hypothesis tests
Make probabilistic statements about our estimates
Visual: Show sampling distribution getting narrower and more Normal as n increases
Don’t dive too deep into CLT proof, but show the magic: even if you’re sampling from a bizarre distribution, the AVERAGE of many samples looks bell-curved.
This is why n matters so much. Small n = wide, uncertain estimates. Large n = narrow, precise estimates.
For binary outcomes (like “did they complete the form?”), the Binomial distribution approximates Normal when n is large enough.
Binary Outcomes: Special Case
For Yes/No outcomes (completed the form, clicked the link, attended the meeting):
Bernoulli distribution:
Each person has probability \(p\) of success (a Bernoulli trial)
The sample proportion \(\hat{p} = x/n\) estimates \(p\): \(E[\hat{p}] = p\)
\(\text{Var}(\hat{p}) = \frac{p(1-p)}{n}\) (variance decreases with sample size!)
Key implication: The more people in your sample, the more confident you are about the true completion rate.
This is the workhorse distribution for civic tech evaluation. Most of our outcomes are binary:
Did they use the platform?
Did they respond to the survey?
Did they attend the meeting?
Note the formula: variance is HIGHEST when p=0.5 (maximum uncertainty), and decreases toward 0 as p approaches 0 or 1.
Also note: variance decreases with 1/n, so the standard error decreases with 1/√n. To cut uncertainty (the SE) in half, you need 4× the sample size.
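A quick numerical check of those two notes, using the standard-error formula for a sample proportion (a sketch, not tied to any particular dataset):

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# Uncertainty peaks at p = 0.5 ...
round(se_prop(p = c(0.1, 0.3, 0.5, 0.7, 0.9), n = 100), 3)

# ... and quadrupling n only halves the SE.
se_prop(p = 0.5, n = c(100, 400, 1600))
```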
Graphs: Visualizing Variation
Three key visualizations:
Histogram of daily completions (before/after periods)
Sampling distribution of the mean for n=20 vs n=200
Binomial → Normal approximation overlay
The sampling distribution shows us what repeated samples would look like — this is where uncertainty comes from!
Walk through each graph type:
Histogram: Shows the raw variation in your data. Are there outliers? Is it symmetric?
Sampling distribution: If you ran this experiment 1000 times, what would you get? This is theoretical but crucial for understanding CIs and p-values later.
Binomial-Normal: For large n, the Binomial distribution (discrete) looks like a Normal distribution (continuous). This makes math easier.
If you have time, sketch these on the board. The sampling distribution is the most important - it’s where all inference comes from.
Interactive Demo: Sampling Distribution Explorer
Key insight: Notice how n=200 (coral) is ~2× narrower than n=50 (blue). This demonstrates √n scaling: 4× more data → 2× less uncertainty.
Key teaching moment: Have people make predictions BEFORE you change n.
“If we go from n=50 to n=200, will the sampling distribution be: a) 4× narrower, b) unchanged, or c) 2× narrower (the square root of 4)?”
Answer is (c)! Standard error decreases with 1/√n.
The visualization shows this directly - students can see the histogram narrow as they increase sample size.
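If the interactive demo isn’t available, a small simulation (assuming a true completion rate of 45%) shows the same √n scaling: the spread of estimates at n=200 is about half that at n=50.

```r
# Sampling distribution of a completion-rate estimate at two sample sizes.
# Assumed true rate: 45%.
set.seed(1)
sim_means <- function(n, true_p = 0.45, reps = 10000) {
  rbinom(reps, size = n, prob = true_p) / n
}

means_50  <- sim_means(50)
means_200 <- sim_means(200)

sd(means_50)                  # spread of estimates with n = 50
sd(means_200)                 # spread with n = 200
sd(means_50) / sd(means_200)  # ratio is about 2, i.e. sqrt(200/50)

hist(means_50,  breaks = 40, col = rgb(0, 0, 1, 0.4),
     main = "Sampling distributions", xlab = "Estimated completion rate")
hist(means_200, breaks = 40, col = rgb(1, 0.5, 0.3, 0.4), add = TRUE)
```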
Case Study Prompts
Question 1: If baseline completion is ~45%, what run length (days of data) stabilizes your weekly estimate within ±2pp most of the time?
Question 2: You observe +4pp improvement after the redesign. When is this just noise vs real change? What sample size changes that answer?
These are the questions we’ll be able to answer by the end of the session!
Q1 is about planning: how long do you need to run your measurement period?
Q2 is about inference: given what you observed, what can you conclude?
Don’t answer these fully now - we need confidence intervals (next section) and power calculations (section 3).
But note the connection: both depend on understanding the sampling distribution and how it relates to n.
Key Takeaway: Part 1
Variation is normal. Without understanding the distribution and sample size, you can’t distinguish signal from noise.
\[\text{Uncertainty} \propto \frac{1}{\sqrt{n}}\]
More data → less uncertainty → stronger conclusions
Reinforce: this is why anecdotes and small pilots are dangerous for decision-making. They’re valuable for learning and iteration, but not for inference.
The formula is a simplification but useful: to cut uncertainty in half, you need 4× more data.
Transition: “Now we know how to think about distributions and samples. Next question: how do we formally test if an observed difference is real?”
Part 2: p-Values, Significance & Confidence Intervals
Duration: 17 minutes
The Problem: Did It Actually Work?
Scenario: Two outreach emails (A vs B) invite residents to a community safety survey.
Email A (control): Standard invitation → 12.3% response rate
Email B (treatment): Personalized invitation → 15.8% response rate
Stakeholder question: “Email B is clearly better, right?”
Your challenge: Is this difference real or could it be random chance ?
This is where most evaluation reports go wrong. They see a difference and declare victory.
But: what if you had slightly different people in each group? What if it was just a lucky week?
Ask the room: “What would make you more or less confident this is real?” - Sample size - Size of the difference - How variable the responses are
All correct! We’re about to formalize this.
The Null Hypothesis (H₀)
Null hypothesis (H₀): There is no real difference between A and B.
\[H_0: p_A = p_B\]
Alternative hypothesis (H₁): There is a real difference.
\[H_1: p_A \neq p_B\]
Our test asks: “If H₀ were true, how surprising is what we observed?”
The null hypothesis is your skeptical starting point. It’s the “prove it to me” position.
This might feel backward! We want to prove B is better, but we start by assuming it’s NOT better.
Why? Because we can calculate what random chance looks like under the null. Then we see if our data is inconsistent with that.
Key philosophical point: we never “prove” the alternative. We only reject or fail to reject the null.
The Test Statistic
What we observe: Difference in sample proportions
\[\hat{p}_B - \hat{p}_A = 0.158 - 0.123 = 0.035\]
Translation: Email B had a 3.5 percentage point higher response rate.
But: Is +3.5pp a lot? Depends on:
Sample sizes (n_A and n_B)
Baseline variability
What we’d expect from random chance
The “hat” symbol (^) means “estimated from data” vs true population value.
Emphasize: the raw difference alone tells us nothing about statistical significance. A difference of 3.5pp might be: - Strong evidence of a real effect if you have 10,000 people per group - Easily explained by chance if you have 50 people per group
We need to standardize this difference relative to its uncertainty. That’s what the standard error does.
Standard Error & p-Values
Standard Error measures uncertainty in our estimate:
\[SE(\hat{p}_B - \hat{p}_A) = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}\]
p-value: If the null were true, how often would we see a difference this large or larger?
p < 0.05: “Statistically significant” (common threshold)
p = 0.03: Only a 3% chance of seeing a difference this large or larger if the emails were really identical
Standard Error is the standard deviation of your estimate. It comes from the sampling distribution we discussed in Part 1.
The formula looks scary but it’s just: “combine the uncertainty from group A and group B.”
p-value interpretation is tricky! It’s NOT: ❌ “Probability the null is true” ❌ “Probability Email B doesn’t work”
It IS: ✅ “Probability of seeing data this extreme if the null were true”
Common mistake: treating p=0.049 as “significant” and p=0.051 as “not significant” - the difference between these is trivial!
Type I and Type II Errors
|  | H₀ is true (no real effect) | H₀ is false (real effect) |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✅ Correct |
| Fail to reject H₀ | ✅ Correct | Type II Error (β) |
α (alpha): False positive rate (typically 5%)
β (beta): False negative rate
Power: 1 - β (typically aim for 80%)
This is the error budget in hypothesis testing.
Type I error (α): Saying it works when it doesn’t. False alarm. This is what p<0.05 controls.
Type II error (β): Saying it doesn’t work when it does. Missed opportunity.
In civic tech, which is worse? - Type I: You scale a program that doesn’t work → waste resources - Type II: You abandon a program that works → miss impact
There’s always a tradeoff! Can’t minimize both simultaneously with fixed n.
We’ll talk about power more in Part 3.
Confidence Intervals: Better Than p-Values
95% Confidence Interval for the difference:
\[\text{Estimate} \pm 1.96 \times SE\]
Example: Email B uplift = 3.5pp with 95% CI [0.2pp, 6.8pp]
Interpretation: We’re 95% confident the true effect is between 0.2pp and 6.8pp.
Why better than p-values? Shows magnitude and precision , not just “significant/not significant.”
CIs are criminally underused! They tell you much more than a p-value.
The CI says: “Here’s a range of plausible true effects, given our data.”
Note: The CI excludes zero (barely), which is consistent with p<0.05. But the CI tells us HOW MUCH the effect might be.
Key interpretation note: “95% confident” means “if we repeated this study many times, 95% of our CIs would contain the true effect.” It does NOT mean “95% chance the true effect is in this range” - that’s Bayesian thinking (Part 5).
Better reporting: “Email B increased response by 3.5pp (95% CI: 0.2pp to 6.8pp, p=0.037, n=800 per group)”
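A sketch of the whole calculation for the email example, assuming n = 800 per group (as in the reporting line above). Because the group sizes are an assumption and the slide figures are rounded, it won’t exactly reproduce the quoted p = 0.037, but it lands close.

```r
# Two-proportion comparison: SE, z, p-value and 95% CI for Email A vs B.
# Assumed group sizes: 800 per arm.
n_A <- 800; phat_A <- 0.123
n_B <- 800; phat_B <- 0.158

uplift <- phat_B - phat_A                        # 0.035 = 3.5pp
se <- sqrt(phat_A * (1 - phat_A) / n_A +
           phat_B * (1 - phat_B) / n_B)

z  <- uplift / se
p  <- 2 * pnorm(-abs(z))                         # two-sided p-value
ci <- uplift + c(-1, 1) * 1.96 * se              # 95% confidence interval

round(c(uplift = uplift, se = se, z = z, p = p), 4)
round(ci, 3)   # roughly 0.1pp to 6.9pp
```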
Graph: Visualizing p-Values
Null distribution of the test statistic:
Bell curve centered at 0 (no difference)
Observed difference marked with a vertical line
Shaded tail area = p-value
Visual: The further your observed difference from zero, the smaller the tail area (lower p-value)
Draw this on the board if possible! It’s the most intuitive way to understand p-values.
The null distribution shows: “If emails A and B were really the same, here’s the range of differences we’d see just from random sampling.”
Your observed difference is WAY out in the tail? Low p-value → probably not chance.
Your observed difference is near the center? High p-value → could easily be chance.
Pro tip: always show the distribution, not just the p-value. Helps people understand what “significant” means.
Interactive Demo: A/B Testing Simulator
Key insight: Use the 4 sliders to explore how sample size, effect size, alpha, and number of trials affect Type I error (null scenario, coral) and statistical power (alternative scenario, blue).
This is a powerful teaching tool. The simulation shows:
When the null is TRUE (trueUplift = 0): only ~5% of trials are “significant” (Type I error rate)
When there IS an effect: power determines how often you detect it. With small n, even real effects often show p>0.05.
CIs: 95% should contain the true effect ~95% of the time
Let people experiment: “What happens if you double the sample size? Set effect to zero?”
Key insight: with small n, even real effects often show p>0.05 (Type II error). This motivates Part 3 on power.
The Danger of p-Hacking
What is p-hacking?
Testing multiple hypotheses but only reporting the “significant” ones.
The problem: With α = 0.05, pure chance gives you 1 false positive per 20 tests!
Common forms:
Testing many outcomes, reporting only “significant” ones
Analyzing by multiple subgroups (age, gender, location, device…)
Stopping data collection when p < 0.05
Trying different statistical methods until one “works”
Result: Published findings that won’t replicate
This is one of the most important methodological problems in science and evaluation.
The math is simple: if you test 20 null hypotheses at α=0.05, you EXPECT one false positive just by chance.
Example scenario: “We tested our civic tech app on:
5 different outcomes (completion, engagement, satisfaction, referrals, time-on-task)
4 demographic groups (young/old × male/female)
3 devices (mobile/tablet/desktop)
= 60 possible comparisons!”
If you report only the 3 that were “significant,” you’re misleading everyone.
Real-world impact: - Interventions that don’t work get scaled - Resources wasted - Trust in evaluation eroded - Replication crisis
Prevention is key: pre-registration!
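A sketch of the arithmetic behind that expectation: 20 A/B comparisons in which the null is true by construction (both arms drawn from the same 40% rate, an assumed number), tested at α = 0.05.

```r
# p-hacking by simulation: 20 tests, no real effect anywhere.
set.seed(7)
one_null_test <- function(n = 200) {
  a <- rbinom(1, n, 0.40)              # control successes
  b <- rbinom(1, n, 0.40)              # "treatment" successes -- same true rate
  prop.test(c(a, b), c(n, n))$p.value
}

p_values <- replicate(20, one_null_test())
sum(p_values < 0.05)   # on average, 20 * 0.05 = 1 false positive
```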
Interactive Demo: P-Hacking Simulator
Key insight: Use the 4 sliders to explore how testing multiple hypotheses inflates false positives. Even when there are NO real effects, you’ll find “significant” results by pure chance!
This visualization powerfully demonstrates p-hacking in action.
Key observations to point out:
Every single test is NULL - there’s no real effect anywhere. Both groups come from the same distribution.
With α=0.05 and 20 tests, we EXPECT 1 false positive (20 × 0.05 = 1). Run it multiple times and you’ll see this average holds.
The “significant” results (blue dots below the red line) are purely chance findings.
If you only reported these significant results, you’d be p-hacking! You’d claim “The program increased engagement for young males using mobile” or whatever, when it was just random noise.
Bonferroni correction: If you MUST test 20 hypotheses, divide α by 20. So use α=0.0025 instead of 0.05. This controls the family-wise error rate.
Real-world parallel: “We tested our civic tech app and found it worked! (for women over 60 using tablets on Thursdays)” - this is how p-hacking manifests.
Prevention: Pre-register ONE primary outcome before collecting data.
Preventing p-Hacking: Pre-Registration
The gold standard: Pre-register your analysis plan
Before collecting data:
Specify ONE primary outcome
Define your analysis plan (method, covariates, sample size)
Register it publicly (OSF, AsPredicted, clinical trials registry)
Why this works:
Removes researcher degrees of freedom
Makes deviations transparent
Increases trust in findings
Prevents fooling yourself
Example: Clinical trials must pre-register to prevent selective reporting
Pre-registration is the gold standard for credible evaluation.
Practical steps:
1. Before starting, write down: “Our primary outcome is X. We’ll test it using Y method. Sample size is Z.”
2. Register this at OSF.io or AsPredicted.org (takes 5 minutes)
3. Stick to the plan! If you do exploratory analyses, label them as such.
Why this works: - Removes researcher degrees of freedom - Makes p-hacking transparent (if you deviate from plan, it’s visible) - Increases trust in your findings - Prevents you from fooling yourself
Real example: Clinical trials MUST pre-register. Why? Because pharmaceutical companies were testing dozens of outcomes and only reporting the “positive” ones.
Civic tech should adopt these standards!
Exploratory analysis is fine - just label it: “Pre-registered primary outcome: no effect. Exploratory analysis suggests effect for subgroup X, but this needs confirmation in a new study.”
Dealing with Multiple Comparisons
If you must test multiple outcomes:
Bonferroni correction: Divide α by number of tests
Testing 20 outcomes? Use α = 0.05/20 = 0.0025
False Discovery Rate (FDR): More powerful alternative (Benjamini-Hochberg)
Report ALL tests, not just significant ones
Red flags to watch for:
“We found X worked for [oddly specific subgroup]”
No pre-registered analysis plan
Only reporting significant results
Bottom line: p-hacking is easy to do accidentally. Pre-registration prevents it.
Multiple comparisons correction:
Bonferroni: very conservative; divide α by the number of tests. If testing 20 hypotheses at α=0.05, use α=0.0025 for each test. This controls the family-wise error rate.
Benjamini-Hochberg (FDR): less conservative; controls the expected proportion of false discoveries rather than the probability of any false discovery.
Both are implemented in standard statistical software (R, Python, Stata).
When to use: - If you have ONE pre-specified primary outcome → no correction needed - If you’re testing multiple secondary outcomes → use correction OR clearly label as exploratory - If doing post-hoc subgroup analyses → definitely label as exploratory, needs replication
Red flags in papers/reports: - “We tested many things and found this one significant result in this specific subgroup” - No mention of how many tests were run - Suspiciously specific findings that weren’t pre-specified
Best practice: Pre-register ONE primary outcome. If you do exploratory analyses, be transparent: “Pre-registered outcome showed no effect. Exploratory analysis suggests effect for subgroup X, but this is hypothesis-generating and needs confirmation.”
Bottom line: p-hacking is easy to do accidentally. Pre-registration prevents it.
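Both corrections are one line in base R. The p-values here are hypothetical, standing in for 8 secondary outcomes:

```r
# Multiple-comparison adjustments with p.adjust().
p_raw <- c(0.003, 0.021, 0.047, 0.090, 0.150, 0.320, 0.610, 0.880)  # hypothetical

p.adjust(p_raw, method = "bonferroni")  # controls the family-wise error rate
p.adjust(p_raw, method = "BH")          # controls the false discovery rate
```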
Case Study Prompts
Question 1: For an observed uplift of +3.8pp with 95% CI [−0.4pp, +8.0pp], how would you brief a stakeholder?
Question 2: Your trial reports p = 0.047 once, but 8 secondary outcomes were also tested. What does “significant” mean now?
Q1: This is about communication. The CI includes zero (just barely), so p is a little above 0.05 (not “significant” at the conventional threshold).
Good brief: “Email B increased response by 3.8pp, but we can’t rule out no effect (CI includes negative values). We’d need more data to be confident.”
Bad brief: “Email B didn’t work” or “The increase was 3.8pp so it worked”
Q2: This is the multiple comparisons problem! With 8 tests, even if nothing works, you’d expect ~0.4 false positives (8 × 0.05).
Solution: Bonferroni correction (divide α by number of tests) or pre-specify ONE primary outcome.
This is why p-hacking is dangerous: if you test enough things, something will be “significant” by chance.
Key Takeaway: Part 2
p-values tell you if an effect is surprising under the null. Confidence intervals tell you how big it might be. Always report both.
Good reporting:
Effect size ✅
Confidence interval ✅
p-value (optional) ✅
Sample size ✅
Method ✅
Emphasize: move away from “significant/not significant” binary thinking.
Effect size matters! A “significant” effect might be too small to care about. A “non-significant” effect might be important but underpowered.
Best practice: pre-register your primary outcome and analysis plan. This prevents p-hacking and makes your inference credible.
Transition: “We can now test if an effect is real. But how do we design studies to DETECT effects reliably? That’s about power…”
Part 3: Power & Sample Size
Duration: 16 minutes
The Problem: Planning Ahead
Scenario: You plan an SMS reminder to reduce missed appointments at a community clinic.
Current rate: 42% of people miss appointments
Your goal: Reduce to 39% (−3pp improvement)
Ops lead asks: “How many people per arm do we need to reliably detect this?”
Your challenge: Design a study with enough power to detect real effects.
This is the planning phase of evaluation. Too often, people run pilots with 50 people and wonder why results are inconclusive.
The answer depends on: 1. How big an effect you expect (Minimum Detectable Effect) 2. How much uncertainty you can tolerate (α) 3. How reliably you want to detect real effects (power = 1-β)
Ask the room: “What do you think determines sample size?” Common answers: budget, time, effect size. All correct!
What Is Power?
Statistical power: Probability you’ll detect a real effect if it exists.
\[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ is true})\]
Common target: 80% power (i.e., β = 20%)
Tradeoffs:
Higher power → need larger sample size
Smaller effects → need larger sample size
Lower α → need larger sample size
Power is your insurance against Type II errors (false negatives).
80% power means: if there IS a real effect, you’ll detect it 80% of the time.
Why not 90% or 95% power? Because sample size grows fast: going from 80% to 90% power needs roughly a third more participants, and going to 95% needs roughly two thirds more.
Analogy: Power is like the strength of your microscope. With low power (small n), you can only see BIG effects. With high power (large n), you can detect subtle effects.
In civic tech, low power is expensive! You run a pilot, find “no effect,” but actually the effect was there - you just couldn’t see it.
The Power-Sample Size Relationship
Rule of thumb: Detecting small uplifts in binary outcomes needs large n .
Approximate formula (two-proportion test, equal group sizes):
\[n \approx \frac{(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)})^2}{(p_B-p_A)^2}\]
Don’t memorize! Use a calculator or tool. But note:
\(n\) grows with \(1/(\text{effect size})^2\) → half the effect = 4× the sample
\(n\) grows with baseline variance → more variable outcomes need more data
This formula looks intimidating! The key insights:
Effect size appears in the denominator, SQUARED. So tiny effects need huge samples.
The z-scores (1.96 for α=0.05, 0.84 for 80% power) are constants from the Normal distribution.
Baseline variance matters: outcomes near 50% have highest variance, so need more data than outcomes near 0% or 100%.
Example: to detect a 3pp effect (42% → 39%) with 80% power and two-sided α=0.05: - n ≈ 4,200 per arm (≈8,400 total)
That’s a lot! Many civic tech pilots are underpowered.
In practice: use an online calculator or R package (pwr, WebPower).
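In base R, power.prop.test does this calculation directly. A sketch for the clinic example (42% vs 39%, two-sided α = 0.05, 80% power); the second call shows how quickly the required n falls for a larger effect.

```r
# Sample size per arm for the SMS reminder example.
power.prop.test(p1 = 0.42, p2 = 0.39, sig.level = 0.05, power = 0.80)
# n is reported per arm -- roughly 4,200 people in each group.

# A 10pp effect (42% -> 32%) needs only a few hundred per arm:
power.prop.test(p1 = 0.42, p2 = 0.32, sig.level = 0.05, power = 0.80)
```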
Effect Size: MDE Thinking
Minimum Detectable Effect (MDE): Smallest effect your study can reliably detect.
Key question: “What effect size would actually matter for policy/practice?”
Example:
1pp improvement in appointment attendance: probably not worth scaling
5pp improvement: worth considering
10pp improvement: definitely worth scaling
Design principle: Match your MDE to your practical significance threshold .
This is where stats meets policy! Don’t just design for statistical significance.
Ask: “What’s the smallest effect that would change our decision?”
If you need ≥5pp to justify scaling the SMS program, but your study can only detect ≥10pp (underpowered), then you might miss important effects.
Conversely, if you power for 1pp but would never scale for <5pp, you’re oversampling.
MDE depends on: - Sample size (larger n → smaller MDE) - Baseline variance - Desired power - α level
Trade off: precision vs feasibility. Sometimes you can only afford n=500, so MDE is fixed. Be transparent about this!
Graphs: Power Curves
Two useful visualizations:
Power curve vs n (for fixed MDE)
Shows: how power increases as you add more people
Typical shape: S-curve (steep in the middle, flat at extremes)
MDE curve vs n (for fixed power)
Shows: what effect sizes you can detect for a given n
Typical shape: hyperbola (diminishing returns to adding people)
Draw these on the board or show pre-made graphs.
Power curve: - Low n: low power (e.g., 20%) - Moderate n: power crosses 80% threshold - High n: diminishing returns (90% → 95% power needs many more people)
MDE curve (at the 42% baseline): - With n=200 per arm, you can only detect effects of roughly 14pp - With n=800 per arm, roughly 7pp - With n=3,200 per arm, roughly 3.5pp
Key insight: there’s no magic n. It depends on what effect size matters to you.
Show where your planned study sits on these curves.
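A sketch of the MDE curve using the normal-approximation formula (assumptions: equal arms, two-sided α = 0.05, 80% power, variance taken at the 42% baseline):

```r
# Minimum detectable effect for a two-arm comparison of proportions.
mde <- function(n_per_arm, p_base, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) *
    sqrt(2 * p_base * (1 - p_base) / n_per_arm)
}

round(mde(c(200, 800, 3200), p_base = 0.42), 3)
# ~0.138, 0.069, 0.035 -- i.e. ~14pp, ~7pp, ~3.5pp detectable effects
```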
Interactive Demo: Power & Sample Size Calculator
Key insight: Move the slider to see how MDE affects required sample size. Smaller effects (1pp) need huge samples; larger effects (10pp) are detectable with modest samples.
This tool should be the go-to for planning evaluations.
Walk through a live example: 1. Start with ambitious MDE (1pp) → note huge n required 2. Relax to realistic MDE (3pp) → more feasible n 3. Show what happens if you adjust power from 80% to 90% → note increase in n
Key insight: The relationship is quadratic. To detect half the effect, you need 4× the sample.
Emphasize: power analysis happens BEFORE data collection, not after.
Post-hoc power analysis (calculating power after seeing results) is problematic - it’s circular reasoning.
Students can use this for their own evaluation planning!
Case Study Prompts
Question 1: With baseline 42% missed appointments and desired MDE of −3pp (to 39%) at 80% power, what n per arm is needed?
Question 2: If you can only recruit 1,200 people total (600 per arm), what MDE becomes realistic at 80% power? Is this still meaningful?
Q1: Approximately n ≈ 4,200 per arm (≈8,400 total) for 80% power, two-sided α=0.05.
This is often surprising to people! “That many?” Yes. Small effects need large samples.
Q2: With n=600 per arm, your MDE is roughly 8pp at the 42% baseline (it shrinks if the baseline rate is further from 50%).
Is that meaningful? Depends on context: - If only a large (≥8pp) improvement would justify scaling → you’re fine - If you need to detect 3-5pp effects → underpowered
Be honest about limitations. Don’t run an underpowered study and conclude “no effect” when you simply couldn’t detect a real effect.
Better to acknowledge: “Our study can detect effects of roughly 8pp or more, but not smaller effects.”
Practical Constraints: What If n Is Fixed?
Reality: Often you CAN’T get more people (budget, time, eligible population).
Options:
Accept lower power → risk false negatives, report this
Accept larger MDE → can only detect big effects
Reduce variance → better measurement, blocking, stratification
Increase α → accept more false positives (rarely done)
Use a more powerful design → within-subjects, stepped wedge
This is the real world! Sample size isn’t always flexible.
Option 3 is underused: if you can reduce outcome variance, you effectively increase power.
Examples of variance reduction: - Stratify by high-risk vs low-risk (then analyze by strata) - Use paired/matched designs - Control for baseline covariates in regression (Part 4) - Use more precise outcome measures
Option 5: stepped wedge designs give you more power by using each unit as its own control.
Key message: do the power analysis early, so you know your constraints. Don’t wait until after data collection to realize you were underpowered.
Key Takeaway: Part 3
Power determines whether you can reliably detect real effects. Design your sample size for the minimum effect that would change your decision.
\[\text{More power} \Leftrightarrow \text{Larger n OR Larger effect OR Lower variance}\]
Always report: “This study can detect effects ≥ Xpp with 80% power.”
Power analysis is not optional! It’s the difference between: - A credible evaluation that informs decisions - A pilot that’s “inconclusive” because it was underpowered
Emphasize: underpowered studies waste everyone’s time and resources. You spend money collecting data that can’t answer your question.
Funders increasingly expect power calculations in proposals. Be ready to justify your sample size.
Transition: “We can now design studies and test for effects. But what if we need to control for confounders? That’s where regression comes in…”
Part 4: Regression & Linear Modelling
Duration: 24 minutes
The Problem: Confounding
Scenario: You roll out posters + emails to promote a community consultation.
Complication:
Younger residents (18-35) were more exposed to the campaign
Younger residents also use mobile devices more
Both age AND mobile usage might affect participation
Challenge: What’s the adjusted effect of the campaign, controlling for age and device?
This is the situation where simple before/after or treatment/control comparisons break down.
You can’t just compare people who saw the campaign vs those who didn’t, because they differ in OTHER ways (age, device).
Regression lets you estimate the effect of one variable while “holding constant” other variables.
Ask: “Why might younger people participate more, even without the campaign?” Answers: more familiar with online platforms, more time, different interests, etc.
If we don’t control for age, we might attribute all the difference to the campaign when some is really about age.
What Is Regression?
Regression models the relationship between:
An outcome variable (Y): participation rate, completion rate, etc.
One or more predictor variables (X): treatment, age, device, etc.
Ordinary Least Squares (OLS) finds the line that best fits your data:
\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 \text{Mobile}_i + \varepsilon_i\]
β₁ = effect of treatment, holding age and device constant
OLS is a fancy name for “line of best fit.” It minimizes the sum of squared errors.
The equation is just: outcome = intercept + effects of various factors + random error
Key insight: each β coefficient is the effect of THAT variable, assuming all other variables stay the same.
β₁ is what we care about (treatment effect). β₂ and β₃ are “nuisance parameters” we include to avoid bias from confounding.
This extends t-tests and ANOVAs. It’s the workhorse of applied evaluation.
Regression ≠ causation! It only gives causal estimates if you have good design (RCT, natural experiment, or credible adjustment strategy).
Interpreting Coefficients
Example output:
| Variable | Coefficient | Std. Error | p-value |
|---|---|---|---|
| Intercept | 0.28 | 0.03 | <0.001 |
| Treatment | 0.029 | 0.014 | 0.038 |
| Age | -0.002 | 0.001 | 0.045 |
| Mobile | 0.045 | 0.018 | 0.012 |
β₁ = 0.029: The campaign increased participation by 2.9 percentage points , adjusting for age and device.
Walk through each row:
Intercept (0.28): baseline participation for someone age=0, not treated, not on mobile. Often not interpretable literally (no one is age 0), but needed for the equation.
Treatment (0.029): This is our target! The effect of the campaign after accounting for confounders. It’s smaller than the unadjusted difference, suggesting some of the raw difference was due to age/device.
Age (-0.002): Each additional year of age is associated with 0.2pp lower participation. So a 10-year difference → 2pp lower participation.
Mobile (0.045): Using mobile increases participation by 4.5pp compared to desktop.
p-values: All are <0.05, so “statistically significant” - but remember Part 2, we care more about effect size and CI!
Standard errors: Measure uncertainty in each coefficient. Use these to construct CIs.
Regression as Adjusted Means
Intuition: Regression is just comparing groups AFTER adjusting for other factors.
Visual: Imagine two histograms (treatment vs control), but you’ve matched them on age and device first.
β₁ is the difference in average outcomes after this matching.
Math detail: OLS is equivalent to weighted averages, where weights ensure balance on covariates.
This is the most intuitive way to think about regression: it’s a fancy way of comparing apples to apples.
Without adjustment: - Treatment group is younger, more mobile → participates more - Some of this is due to treatment, some due to age/mobile
With adjustment: - We estimate: “For people of the SAME age, using the SAME device, how much does treatment matter?”
This is why randomization is powerful: in RCTs, treatment and control are already balanced (on average), so you don’t need adjustment. But in observational studies or when randomization isn’t perfect, adjustment helps.
Limitation: regression only adjusts for variables you INCLUDE. If there’s an unmeasured confounder (e.g., prior interest in the topic), your estimate can still be biased.
Key Assumptions
For OLS to give reliable results:
Linearity (in parameters): relationships can be additive
Exogeneity: ε is uncorrelated with X (no unmeasured confounders)
No perfect collinearity: predictors aren’t exact copies of each other
Homoskedasticity: variance of ε doesn’t depend on X
Practical tip: Use robust standard errors (or clustered SEs) to relax #4.
These assumptions sound technical, but they’re about whether regression will give you the right answer.
Linearity: The model is linear in the βs, not necessarily in the Xs. You can include Age² or log(Income) - that’s fine. But interactions (e.g., Treatment × Age) must be specified explicitly as extra terms.
Exogeneity: This is the big one! If treatment is assigned based on unobserved factors (e.g., more motivated people opt in), then β₁ will be biased. This is why randomization or natural experiments are so valuable.
No multicollinearity: If Age and YearsOfEducation are nearly identical, regression can’t tell them apart. Coefficients get unstable. Solution: drop one or combine them.
Homoskedasticity: In practice, often violated (e.g., variance is higher for younger people). Robust SEs fix this without changing point estimates.
In civic tech evaluations, #2 is usually the concern. Always ask: “What confounders might I be missing?”
Clustering & Robust Standard Errors
Problem: Outcomes within the same ward, school, or household are correlated.
Solution: Use cluster-robust standard errors by group.
Example: If 20 schools each recruit 50 families:
Don’t treat all 1,000 families as independent
Cluster by school → SEs will be larger (more conservative)
R code: lm(...) %>% sandwich::vcovCL(cluster = ~school_id)
This is a common mistake in civic tech evaluation: ignoring clustering.
Why it matters: if outcomes within schools are similar (because of school-level factors like leadership, demographics), then you don’t have 1,000 independent observations - you have 20 semi-independent clusters.
Ignoring clustering makes your SEs too small → p-values too low → false positives.
Examples of clustering: - Participants from the same council area - Households (people in the same house are similar) - Repeated measures (same person measured multiple times)
Rule: if your design has clustering, ALWAYS use clustered SEs.
Software support: R (sandwich, lmtest, fixest), Stata (vce(cluster)), Python (statsmodels).
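A minimal sketch of the clustered-SE workflow in R, on simulated data for the 20-schools example (the school effects and the 0.3 “treatment effect” are made-up illustration values; requires the sandwich and lmtest packages).

```r
library(sandwich)
library(lmtest)

# 20 schools, 50 families each; treatment assigned at the school level.
set.seed(3)
n_schools <- 20; per_school <- 50
school_id <- rep(seq_len(n_schools), each = per_school)

school_effect <- rnorm(n_schools, sd = 0.5)[school_id]        # shared within school
treated <- rep(rbinom(n_schools, 1, 0.5), each = per_school)  # school-level treatment
y <- 1 + 0.3 * treated + school_effect + rnorm(n_schools * per_school)

fit <- lm(y ~ treated)

coeftest(fit)                                             # naive SEs (too small here)
coeftest(fit, vcov. = vcovCL(fit, cluster = school_id))   # cluster-robust SEs
```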
Diagnostics: Residual Plots
Check your assumptions visually:
Residuals vs Fitted: Should be randomly scattered
Pattern → model misspecification (try transformations or interactions)
Q-Q plot: Should be roughly a straight line
Deviations → non-Normal errors (often okay with large n due to CLT)
Leverage plot: Identify influential observations
High leverage + large residual → outlier that affects estimates
Residual = observed outcome - predicted outcome. These are the “errors” ε in your model.
Plot 1 (Residuals vs Fitted): - Random scatter → good - Funnel shape → heteroskedasticity (use robust SEs) - Curved pattern → maybe need Age² or log transform - Clusters of outliers → investigate data quality
Plot 2 (Q-Q plot): - Straight line → errors are approximately Normal - Heavy tails → some extreme values (might need robust regression) - Skewness → consider transforming Y
Plot 3 (Leverage): - Points with high leverage “pull” the regression line - Check: Are these data errors? Valid but unusual observations? - Sensitivity analysis: re-run without them, see if conclusions change
Never skip diagnostics! They often reveal data issues or model problems.
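These plots come straight out of R’s plot() method for lm objects; shown here on the built-in mtcars data purely as an illustration.

```r
# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```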
R² and Model Fit
R²: Proportion of variance in Y explained by the model.
R² = 0.15 → model explains 15% of variation
R² = 0.80 → model explains 80% of variation
Important: R² ≠ whether your model is good!
High R² doesn’t mean causal
Low R² can still have reliable β estimates
Focus on β₁ (your treatment effect) and its SE, not R²
R² is overrated! It measures fit, not validity.
You can have: - R²=0.05 but β₁ is precisely estimated and credible (e.g., many unmeasured factors affect Y, but treatment effect is clear) - R²=0.90 but β₁ is biased due to confounding
In social science, R² is often low (0.10-0.30) because human behavior has many unmeasured causes. That’s okay!
What matters: 1. Is β₁ estimated with reasonable precision (small SE)? 2. Have you controlled for the main confounders? 3. Do your diagnostics look okay?
Don’t chase high R² by adding irrelevant predictors. Focus on the causal question.
Interactions: When Effects Vary
Sometimes the treatment effect differs by subgroup:
\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 (\text{Treatment} \times \text{Age}) + \varepsilon_i\]
β₃: How treatment effect changes with age
Example: If β₃ < 0, campaign is more effective for younger residents.
Interactions let you test: “Does the campaign work better for some groups than others?”
Interpretation: - β₁: effect for Age=0 (not usually meaningful on its own) - β₃: change in effect per unit of Age
Example: β₁=8pp, β₃=-0.1pp/year - At age 20: effect ≈ 8 - 0.1×20 = 6pp - At age 60: effect ≈ 8 - 0.1×60 = 2pp
This is called “heterogeneous treatment effects” or “effect modification.”
When to include interactions: - Pre-specified hypothesis (e.g., “we expect campaigns to work better for young people”) - Exploratory analysis (but be transparent about multiple testing)
Avoid: testing dozens of interactions and only reporting “significant” ones. That’s p-hacking.
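A sketch of fitting and reading an interaction, on simulated data built to echo the example above (8pp effect at age 0, shrinking by 0.1pp per year; all numbers assumed).

```r
# Treatment x Age interaction in a linear probability model.
set.seed(11)
n <- 5000
age       <- sample(18:80, n, replace = TRUE)
treatment <- rbinom(n, 1, 0.5)
p_true    <- 0.25 + 0.0005 * age + treatment * (0.08 - 0.001 * age)
y         <- rbinom(n, 1, p_true)

fit_int <- lm(y ~ treatment * age)   # expands to treatment + age + treatment:age
round(summary(fit_int)$coefficients, 4)
# The treatment:age coefficient should come out near -0.001 (-0.1pp per year).
```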
Case Study Prompts
Question 1: Unadjusted uplift is +5.2pp; adjusted β₁ is +2.9pp (robust SE 1.4pp). How do you report this to stakeholders?
Question 2: Adding Age² improves R² but leaves β₁ similar. What does that suggest about confounding vs functional form ?
Q1: Good reporting emphasizes the adjusted estimate, because it’s more credible (controls for confounders).
“After accounting for differences in age and device usage between groups, the campaign increased participation by 2.9 percentage points (95% CI: 0.1pp to 5.7pp). This is a meaningful improvement, though smaller than the unadjusted difference of 5.2pp, which partially reflected that younger people (who participate more) were more exposed to the campaign.”
Be transparent about adjustment and why it matters!
Q2: This suggests Age² helps predict participation (better fit) but isn’t a confounder (doesn’t bias β₁).
Confounding: a variable associated with both treatment and outcome → biases estimates if omitted.
Functional form: a nonlinear relationship in the outcome model → affects fit but not necessarily bias.
Adding Age² improved the model’s ability to predict Y (higher R²) but didn’t change the treatment effect estimate much → Age was already capturing the confounding, Age² is just refinement.
When Regression Isn’t Enough
Regression assumes:
No unmeasured confounders
Linear-additive effects (or specified interactions)
Correct functional form
If these fail, consider:
Instrumental Variables (Part 1 slides, quasi-experimental)
Difference-in-Differences (before/after × treatment/control)
Regression Discontinuity (exploit thresholds)
Propensity Score Matching (balance observables first)
Regression is powerful but not magic. It only controls for variables you include.
If there’s an unmeasured confounder (e.g., motivation, prior interest), regression can’t save you. You need better design (RCT) or a natural experiment.
Quick overview of alternatives:
IV: Use an external factor (instrument) that affects treatment but not outcome directly. Example: distance to clinic affects attendance, use as instrument for treatment uptake.
DiD: Compare treatment group’s change to control group’s change. Controls for time-invariant confounders.
RD: If treatment is assigned based on a threshold (e.g., age 18, income <£20k), people just above/below are similar. Compare them.
PSM: Estimate probability of treatment (propensity score), match treated/control units with similar scores, then compare.
These are advanced topics (2-hour workshop each!), but know they exist.
Key Takeaway: Part 4
Regression estimates treatment effects while adjusting for confounders. Always interpret coefficients in context, check assumptions, and use robust/clustered SEs when appropriate.
Report: “The campaign increased participation by βpp (95% CI: [X, Y]), adjusting for age and device.”
Regression is your workhorse for observational evaluations and for improving precision in experiments.
Best practices: 1. Pre-specify which covariates to include (avoid p-hacking) 2. Always check diagnostics 3. Use clustered SEs if there’s any grouping 4. Report adjusted estimates with CIs 5. Be honest about potential unmeasured confounders
Don’t oversell: “We found an effect after adjusting for X, Y, Z, but can’t rule out confounding from unmeasured factors.”
Transition: “So far we’ve used frequentist inference - p-values, CIs, power. But there’s another paradigm: Bayesian. Let’s zoom out…”
Part 5: Bayesian Zoom-Out
Duration: 10 minutes
The Limitation of Frequentism
Frequentist approach:
Assumes a true fixed parameter (e.g., treatment effect)
Makes probability statements about data (p-values, CIs)
Can’t say “95% probability effect is positive”
Bayesian approach:
Treats parameters as uncertain (have distributions)
Makes probability statements about parameters
Can say “95% probability effect is between X and Y”
This is a paradigm shift! Frequentist and Bayesian answer different questions.
Frequentist CI: “If we repeated this study many times, 95% of CIs would contain the true effect.” - Statement about a procedure, not about THIS particular CI - Can’t say “95% chance the effect is in this interval”
Bayesian Credible Interval: “Given the data, there’s a 95% probability the effect is in this interval.” - Direct probability statement about the parameter - Intuitively what people THINK frequentist CIs mean!
Neither is “right” or “wrong” - they’re answering different questions. Use the one that matches your goals.
The Core Insight: Bayes’ Theorem
Update beliefs based on evidence:
\[P(\theta | \text{Data}) = \frac{P(\text{Data} | \theta) \times P(\theta)}{P(\text{Data})}\]
Plain English:
Prior: \(P(\theta)\) = what we believed before seeing data
Likelihood: \(P(\text{Data} | \theta)\) = how consistent data is with each possible θ
Posterior: \(P(\theta | \text{Data})\) = updated belief after seeing data
Bayes’ theorem is just: prior belief + new evidence → updated belief.
This is how humans naturally think! “I thought the campaign would help by ~2pp (prior). Data shows 3.8pp. Now I believe it’s probably around 3-4pp (posterior).”
The formula looks scary but it’s intuitive: - Start with prior (accumulated knowledge, theory, past studies) - Multiply by likelihood (how much does THIS data support each possible effect size?) - Normalize (technical detail to make it a proper probability)
The prior is controversial: where does it come from? Subjectivity? - Subjective: expert judgment, theory - Objective: past data from similar contexts - Weakly informative: regularization, prevent overfitting
In practice: if you have lots of data, the prior doesn’t matter much (data overwhelms it). If you have little data, the prior has more influence.
Toy Example: Beta-Binomial
Scenario: Prior evidence from other councils suggests outreach emails usually increase response by 0-5pp.
Model:
Prior: Response rate \(p \sim \text{Beta}(a, b)\) (flexible distribution on [0,1])
Data: \(x\) responses out of \(n\) emails
Posterior: \(p | \text{Data} \sim \text{Beta}(a+x, b+n-x)\)
Result: Direct probability distribution over the response rate!
Beta-Binomial is the simplest Bayesian model for proportions. Perfect for A/B tests.
Beta distribution: - Parameters a, b control shape - Beta(1,1) = uniform prior (know nothing) - Beta(20,80) = skeptical prior centered around 20/(20+80)=20% - Beta(45,55) = optimistic prior centered around 45%
Conjugacy: Beta prior + Binomial likelihood → Beta posterior (analytically tractable, no simulation needed)
Example: - Prior: Beta(45, 55) → expect ~45% response - Data: 58 responses out of 100 emails → 58% observed - Posterior: Beta(45+58, 55+42) = Beta(103, 97) → expect ~52% response
The posterior is a compromise between prior and data. With more data, data dominates. With little data, prior matters.
Credible interval: 95% of posterior mass → direct interpretation “95% probability true response rate is between X% and Y%”
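The worked example above in a few lines of R; qbeta gives the credible interval and pbeta the posterior probability of clearing a threshold.

```r
# Beta-Binomial update: prior Beta(45, 55), data 58 responses out of 100 emails.
a_post <- 45 + 58          # 103
b_post <- 55 + (100 - 58)  # 97

a_post / (a_post + b_post)               # posterior mean, ~0.515
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% credible interval
1 - pbeta(0.50, a_post, b_post)          # P(true response rate > 50% | data)
```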
Prior Selection: Art or Science?
Types of priors:
Uninformative: Flat, let data speak (e.g., Beta(1,1))
Weakly informative: Regularize, prevent extreme estimates
Informative: Based on past studies, theory, expert judgment
Best practice:
Be transparent about prior choice
Run sensitivity analysis: how do results change with different priors?
In civic tech: use priors from similar interventions if available
Priors are often misunderstood. They’re not “bias” - they’re accumulated knowledge.
When to use which:
Uninformative: - First study in a new area - Very large dataset - Want to match frequentist results
Weakly informative: - Prevent overfitting (especially with small data) - Rule out extreme/implausible values (e.g., “response rate can’t be 99%”)
Informative: - Rich past data from similar contexts (meta-analysis) - Theory makes strong predictions - Sequential trials (posterior from trial 1 = prior for trial 2)
Criticism: “Priors are subjective!” Response: So are modeling choices in frequentist analysis (which covariates to include, transformations, etc.). At least Bayesian analysis is transparent about assumptions.
Sensitivity analysis is key: show results under multiple priors. If they’re similar → robust. If they differ → data is weak, prior matters, be honest about uncertainty.
Credible Intervals vs Confidence Intervals
95% Confidence Interval (frequentist):
“If we repeated the study many times, 95% of intervals would contain the true parameter”
Statement about the procedure , not this particular interval
95% Credible Interval (Bayesian):
“There’s a 95% probability the parameter is in this interval, given our data and prior”
Direct statement about the parameter
Visual: Show prior, likelihood, posterior curves; credible interval is 95% of posterior mass.
This is the key practical difference.
Frequentist CI: - Correct interpretation is tortured: “In the long run, if we sampled repeatedly…” - What people want to say: “95% chance effect is in this range” - What you can say: “Procedure has 95% coverage”
Bayesian CI (credible interval): - Says exactly what people want: “95% probability effect is in this range” - Conditional on your prior and data
Example: - Frequentist: 95% CI [0.2pp, 6.8pp] → “if we repeated the study many times, 95% of intervals constructed this way would contain the true effect” - Bayesian: 95% CrI [0.5pp, 6.5pp] → “95% probability the effect is between 0.5pp and 6.5pp”
For decision-making, Bayesian is more intuitive! “Should we scale the program?” → “Yes, there’s a 97% probability it increases participation”
Caveat: Bayesian interpretation is conditional on your prior being reasonable. If your prior is terrible, posterior is misleading.
When Bayesian Thinking Helps
Situations where Bayes shines:
Small samples: Incorporate prior knowledge to improve estimates
Sequential testing: Update beliefs as data accumulates
Decision analysis: Compute probability of meeting a threshold
Complex models: Hierarchy, missing data (MCMC handles these well)
Example: “What’s the probability the campaign increases response by at least 3pp?” → Integrate posterior above 3pp.
Bayesian methods are especially useful when:
Small n: You run a pilot with 200 people. Frequentist analysis is very uncertain. Bayesian can borrow strength from past studies via prior.
Sequential: You run a trial in waves (Phase 1 → Phase 2 → Phase 3). Bayesian lets you update beliefs after each phase. Frequentist has issues with “peeking” (multiplicity).
Decision: Stakeholder asks “What’s the probability the effect is large enough to justify scaling?” Bayesian directly computes this. Frequentist can’t (only gives p-values/CIs).
Complex models: Hierarchical models (e.g., effects vary by ward), missing data imputation, measurement error - all easier in Bayesian framework (use Stan, JAGS, PyMC).
Downsides of Bayesian: - Computationally harder (MCMC takes time) - Requires prior specification (can be controversial) - Less familiar to reviewers/stakeholders (educational burden)
In civic tech: Bayesian is growing but still less common than frequentist. Use when it adds value (small n, sequential, decision-focused). Document your approach carefully.
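A sketch of the threshold question (“probability the uplift is at least 3pp”) using Monte Carlo draws from two Beta posteriors. The counts (98/800 vs 126/800) and the flat Beta(1, 1) priors are assumptions chosen to echo the email example.

```r
# P(uplift >= 3pp | data): simulate from the posterior of each arm.
set.seed(99)
draws <- 100000

post_A <- rbeta(draws, 1 + 98,  1 + 800 - 98)    # arm A: 98/800 responses
post_B <- rbeta(draws, 1 + 126, 1 + 800 - 126)   # arm B: 126/800 responses

uplift <- post_B - post_A
mean(uplift > 0)       # posterior probability of any positive effect
mean(uplift >= 0.03)   # posterior probability of an effect of at least 3pp
```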
Case Study Prompts
Question 1: With a skeptical prior centered at +1pp and observed +3.2pp uplift (n=400 per arm), does your posterior still support a meaningful positive effect?
Question 2: How does doubling the prior sample weight (stronger prior) change conclusions vs using a flat prior?
Q1: This tests whether evidence overcomes skepticism.
Skeptical prior: centered at +1pp with a modest spread (weak prior belief in an effect). Data: +3.2pp observed. Posterior: will shift toward the data, probably with a 95% CrI above 0pp.
With n=400 per arm, data has moderate strength. Posterior will be a compromise: maybe 2pp to 2.5pp central estimate, 95% CrI [0.5pp, 4.5pp].
Interpretation: “Despite a skeptical prior, the data provides strong evidence for a positive effect (98% probability effect > 0pp, 85% probability effect > 1pp).”
Q2: Prior weight = effective sample size in the prior.
Weak prior: Beta(10,10) = like having 20 prior observations Strong prior: Beta(20,20) = like having 40 prior observations
With small dataset (n=100), strong prior has more influence. With large dataset (n=1000), prior hardly matters.
Show this: posterior with weak prior ≈ MLE. Posterior with strong prior is “shrunk” toward prior mean.
Trade-off: strong priors prevent overfitting (good for prediction) but might bias estimates (bad for unbiased causal inference).
Key Takeaway: Part 5
Bayesian inference lets you make direct probability statements about parameters, incorporating prior knowledge. It’s especially useful for small samples, sequential testing, and decision analysis.
Frequentist: “Given a true parameter, what’s the probability of this data?”
Bayesian: “Given this data, what’s the probability distribution of the parameter?”
Both paradigms have strengths:
Frequentist:
✅ Well-established, familiar to reviewers
✅ No prior specification needed
✅ Null hypothesis testing is standard
❌ p-values are confusing
❌ Can’t make direct probability statements
❌ Doesn’t incorporate prior knowledge
Bayesian:
✅ Intuitive probability statements
✅ Natural for decision-making
✅ Can incorporate prior information
❌ Computationally harder
❌ Prior choice can be controversial
❌ Less familiar (more explanation needed)
In practice: Use frequentist for standard evaluations where n is large and methods are established. Use Bayesian when you have good priors, small n, or need direct decision probabilities.
Many modern analyses combine: frequentist for primary analysis (transparent, standard) + Bayesian for sensitivity/decision analysis (adds value).
Part 6: Wrap & Q/A
Duration: 5 minutes
Cross-Cutting Pitfalls
Common mistakes in evaluation:
Multiple comparisons without correction → inflated false positive rate
p-hacking → testing many things, only reporting “significant” ones
Regression to the mean → target worst performers, they improve anyway
Measurement drift → definition changes over time
Ignoring clustering → SEs too small, false positives
Prevention: Pre-register analysis plans, use robust methods, be transparent.
Quick reminders of things we’ve touched on:
Multiple comparisons: If you test 20 outcomes, expect 1 false positive even if nothing works. Solutions: Bonferroni correction, FDR control, pre-specify primary outcome.
p-hacking: “We tested A/B emails, SMS vs call, 3 message variants, young vs old, mobile vs desktop… only the SMS to young people on mobile was significant!” → likely false positive. Prevention: pre-analysis plan.
Regression to the mean: “We targeted the 10 worst-performing schools. After our intervention, 8 improved!” → some would have improved anyway (regression to mean). Need control group.
Measurement drift: “Response rates increased!” → but you changed the definition of “response” halfway through. Keep definitions consistent.
Clustering: We covered this in Part 4. Don’t treat clustered data as independent.
Emphasize: these are ALL avoidable with good design and transparent reporting.
Good Practice Checklist
✅ Design phase:
Pre-register primary outcomes and analysis plan
Calculate required sample size (power analysis)
Plan for covariates to adjust for
✅ Analysis phase:
Report effect sizes with confidence intervals
Use robust/clustered SEs where appropriate
Check diagnostics (residual plots, balance checks)
Don’t p-hack (stick to pre-specified analyses)
✅ Reporting phase:
Transparent about limitations
Share data and code where possible
Plain-language interpretation
Acknowledge what you DON’T know
This is your takeaway checklist. Print it, share it, use it for every evaluation.
Design: Upfront investment in planning pays off. Don’t start collecting data without a clear plan.
Analysis: Follow best practices. Use modern statistical methods (robust SEs, pre-registration, sensitivity analyses).
Reporting: Transparency builds credibility. Don’t hide limitations or negative results.
The civic tech community benefits from shared learning. When you evaluate something, share your methods and data (when ethical). Helps everyone improve.
Meta-point: Good evaluation is iterative. First evaluation might be imperfect - that’s okay! Learn, document what you’d change, do better next time.
Exit Tickets
Before you leave, please answer:
Write a one-sentence interpretation of a 95% confidence interval for a treatment effect of +3.5pp, CI [0.2pp, 6.8pp].
Name one design change to increase statistical power without inflating α.
These are quick checks for understanding. Collect on paper or online form.
Q1 tests: Do they understand CIs?
Good answer: “We’re 95% confident the true treatment effect is between 0.2pp and 6.8pp.”
Okay answer: “There’s a 95% chance the effect is between 0.2pp and 6.8pp” (slightly wrong - that’s Bayesian, but shows intuition)
Bad answer: “The effect is significant” (misses the point)
Q2 tests: Do they understand power?
Good answers: - Increase sample size - Increase effect size (better intervention) - Reduce outcome variance (better measurement, stratification) - Use a more powerful design (within-subjects, matched pairs)
Bad answers: - Increase α (technically works but defeats the purpose) - “Make p-value smaller” (confuses power with significance)
Use responses to identify concepts to revisit in follow-up sessions.
Q&A
Open floor for questions on any part of the session.
Common questions:
“When should I use Bayesian vs frequentist?”
“How do I deal with small sample sizes?”
“What if I can’t randomize?”
“How do I explain this to non-technical stakeholders?”
Budget 5-10 minutes for Q&A. Encourage all questions, no matter how basic.
Prepared answers:
Bayesian vs frequentist: - Default to frequentist (more standard, easier to explain) - Use Bayesian when: small n, strong priors, need decision probabilities
Small sample sizes: - Be honest about uncertainty (wide CIs) - Don’t over-interpret - Consider Bayesian with informative priors - Or just collect more data
Can’t randomize: - Use quasi-experimental methods (DiD, RD, IV - see main eval slides) - Regression with good covariates - Be very careful about causal claims - Sensitivity analyses
Non-technical stakeholders: - Focus on effect size and CI, not p-values - Use visuals (graphs, not tables) - Plain language: “increased by Xpp” not “β=X, p<0.05” - Tell the story: what does this mean for our mission?
If time allows, work through 1-2 questions in depth. Use the board.
Key Takeaways: Statistics 101
1. Distributions & LLN: Variation is normal. More data → less uncertainty.
2. p-values & CIs: Effect size + confidence interval > binary significance.
3. Power: Design for the minimum effect that matters. Don’t run underpowered studies.
4. Regression: Adjust for confounders, check assumptions, use robust SEs.
5. Bayesian: Direct probability statements, incorporate prior knowledge.
“Good statistics makes good evaluation possible. Good evaluation makes good decisions possible.”
Final summary. Reiterate the main points:
Statistics isn’t about formulas - it’s about thinking clearly under uncertainty.
Key mindset shifts: - From “is it significant?” to “how big is the effect?” - From “p<0.05” to “here’s the range of plausible effects” - From “ignore uncertainty” to “quantify and report uncertainty” - From “one test” to “pre-registered plan”
You don’t need to be a statistician to do good evaluation. But you do need: 1. Clear questions 2. Appropriate methods 3. Honest reporting 4. Humility about what you don’t know
These principles apply whether you’re evaluating a civic tech tool, a policy intervention, or your own organization’s work.
Thank everyone. Share slides and tools. Encourage follow-up questions via email/Slack.