Part 1: Distributions & Law of Large Numbers
Duration: 18 minutes
The Problem: Noise vs Signal
Scenario: A council launches a redesigned service form.
Week 1 (before): Daily completion rates bounce between 38% and 56%
Week 2 (after): Daily completion rates bounce between 42% and 58%
Stakeholder question: “Did the redesign work?”
Your challenge: Explain why variation alone doesn’t prove anything
This is THE foundational problem in evaluation. Everyone sees numbers go up and down. How do you know it’s real change vs random noise?
Ask the room: “How would you respond to the stakeholder? What would you need to know?”
Common answers: more time, more data, baseline comparison. All correct! We’re going to formalize that intuition.
What Is a Distribution?
A distribution describes how outcomes vary across repeated observations.
Example: Daily completion rates over 30 days
Mean (μ): average completion rate (e.g., 45%)
Variance (σ²): how spread out the rates are
Standard deviation (σ): typical distance from the mean (e.g., 8pp)
Key insight: Even with no change, you’ll see variation day-to-day!
Draw a histogram on the board if possible. Show actual data: maybe 30 days of completion rates.
Emphasize: the distribution tells you what “normal variation” looks like. If you don’t know the distribution, you can’t tell signal from noise.
Math note: σ (sigma) is measured in the same units as your data. So if measuring %, then σ is in percentage points.
The Law of Large Numbers (LLN)
As sample size grows, the sample mean (\(\bar{x}\) ) converges to the true mean (μ).
Formula: \[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \xrightarrow{n \to \infty} \mu\]
Plain English: With more data, your estimate gets more stable and accurate.
This is why “wait and collect more data” is usually good advice!
Example: If you flip a fair coin 10 times, you might get 7 heads (70%). Flip 1,000 times? You’ll be very close to 50%.
For our form example: measuring completion rate over 1 day is noisy. Over 30 days? Much more stable.
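A minimal R sketch of the LLN in action, using assumed numbers (true completion rate 45%, 200 form attempts per day): the running mean of daily completion rates settles onto the true rate as days accumulate.

```r
# Law of Large Numbers: the running mean of daily completion rates
# stabilises as more days of data accumulate.
# Assumed for illustration: true rate 45%, 200 form attempts per day.
set.seed(42)
true_rate <- 0.45
attempts  <- 200
days      <- 90

daily_rate   <- rbinom(days, size = attempts, prob = true_rate) / attempts
running_mean <- cumsum(daily_rate) / seq_len(days)

round(daily_rate[1:5], 3)                 # single days bounce around 45%
round(running_mean[c(1, 7, 30, 90)], 3)   # the cumulative estimate settles down
```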
The LLN doesn’t say HOW FAST you converge - that depends on variance. Which brings us to…
Central Limit Theorem (Sneak Peek)
Even if individual outcomes are weird, averages tend to look Normal.
This lets us:
Calculate confidence intervals
Run hypothesis tests
Make probabilistic statements about our estimates
Visual: Show sampling distribution getting narrower and more Normal as n increases
Don’t dive too deep into CLT proof, but show the magic: even if you’re sampling from a bizarre distribution, the AVERAGE of many samples looks bell-curved.
This is why n matters so much. Small n = wide, uncertain estimates. Large n = narrow, precise estimates.
For binary outcomes (like “did they complete the form?”), the Binomial distribution approximates Normal when n is large enough.
Binary Outcomes: Special Case
For Yes/No outcomes (completed the form, clicked the link, attended the meeting):
Bernoulli distribution:
Each person has probability \(p\) of success (a Bernoulli trial)
The sample proportion \(\hat{p} = x/n\) estimates \(p\): \(E[\hat{p}] = p\)
\(\text{Var}(\hat{p}) = \frac{p(1-p)}{n}\) (variance decreases with sample size!)
Key implication: The more people in your sample, the more confident you are about the true completion rate.
This is the workhorse distribution for civic tech evaluation. Most of our outcomes are binary:
Did they use the platform?
Did they respond to the survey?
Did they attend the meeting?
Note the formula: variance is HIGHEST when p=0.5 (maximum uncertainty), and decreases toward 0 as p approaches 0 or 1.
Also note: variance decreases with 1/n, so the standard error decreases with 1/√n. To cut uncertainty (the SE) in half, you need 4× the sample size.
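A quick numerical check of those two notes, using the standard-error formula for a sample proportion (a sketch, not tied to any particular dataset):

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# Uncertainty peaks at p = 0.5 ...
round(se_prop(p = c(0.1, 0.3, 0.5, 0.7, 0.9), n = 100), 3)

# ... and quadrupling n only halves the SE.
se_prop(p = 0.5, n = c(100, 400, 1600))
```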
Graphs: Visualizing Variation
Three key visualizations:
Histogram of daily completions (before/after periods)
Sampling distribution of the mean for n=20 vs n=200
Binomial → Normal approximation overlay
The sampling distribution shows us what repeated samples would look like — this is where uncertainty comes from!
Walk through each graph type:
Histogram: Shows the raw variation in your data. Are there outliers? Is it symmetric?
Sampling distribution: If you ran this experiment 1000 times, what would you get? This is theoretical but crucial for understanding CIs and p-values later.
Binomial-Normal: For large n, the Binomial distribution (discrete) looks like a Normal distribution (continuous). This makes math easier.
If you have time, sketch these on the board. The sampling distribution is the most important - it’s where all inference comes from.
Interactive Demo: Sampling Distribution Explorer
Key insight: Notice how n=200 (coral) is ~2× narrower than n=50 (blue). This demonstrates √n scaling: 4× more data → 2× less uncertainty.
Key teaching moment: Have people make predictions BEFORE you change n.
“If we go from n=50 to n=200, will the sampling distribution be: a) 4× narrower, b) unchanged, or c) 2× narrower (the square root of 4)?”
Answer is (c)! Standard error decreases with 1/√n.
The visualization shows this directly - students can see the histogram narrow as they increase sample size.
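If the interactive demo isn’t available, a small simulation (assuming a true completion rate of 45%) shows the same √n scaling: the spread of estimates at n=200 is about half that at n=50.

```r
# Sampling distribution of a completion-rate estimate at two sample sizes.
# Assumed true rate: 45%.
set.seed(1)
sim_means <- function(n, true_p = 0.45, reps = 10000) {
  rbinom(reps, size = n, prob = true_p) / n
}

means_50  <- sim_means(50)
means_200 <- sim_means(200)

sd(means_50)                  # spread of estimates with n = 50
sd(means_200)                 # spread with n = 200
sd(means_50) / sd(means_200)  # ratio is about 2, i.e. sqrt(200/50)

hist(means_50,  breaks = 40, col = rgb(0, 0, 1, 0.4),
     main = "Sampling distributions", xlab = "Estimated completion rate")
hist(means_200, breaks = 40, col = rgb(1, 0.5, 0.3, 0.4), add = TRUE)
```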
Case Study Prompts
Question 1: If baseline completion is ~45%, what run length (days of data) stabilizes your weekly estimate within ±2pp most of the time?
Question 2: You observe +4pp improvement after the redesign. When is this just noise vs real change? What sample size changes that answer?
These are the questions we’ll be able to answer by the end of the session!
Q1 is about planning: how long do you need to run your measurement period?
Q2 is about inference: given what you observed, what can you conclude?
Don’t answer these fully now - we need confidence intervals (next section) and power calculations (section 3).
But note the connection: both depend on understanding the sampling distribution and how it relates to n.
Key Takeaway: Part 1
Variation is normal. Without understanding the distribution and sample size, you can’t distinguish signal from noise.
\[\text{Uncertainty} \propto \frac{1}{\sqrt{n}}\]
More data → less uncertainty → stronger conclusions
Reinforce: this is why anecdotes and small pilots are dangerous for decision-making. They’re valuable for learning and iteration, but not for inference.
The formula is a simplification but useful: to cut uncertainty in half, you need 4× more data.
Transition: “Now we know how to think about distributions and samples. Next question: how do we formally test if an observed difference is real?”
Part 2: p-Values, Significance & Confidence Intervals
Duration: 17 minutes
The Problem: Did It Actually Work?
Scenario: Two outreach emails (A vs B) invite residents to a community safety survey.
Email A (control): Standard invitation → 12.3% response rate
Email B (treatment): Personalized invitation → 15.8% response rate
Stakeholder question: “Email B is clearly better, right?”
Your challenge: Is this difference real or could it be random chance ?
This is where most evaluation reports go wrong. They see a difference and declare victory.
But: what if you had slightly different people in each group? What if it was just a lucky week?
Ask the room: “What would make you more or less confident this is real?” - Sample size - Size of the difference - How variable the responses are
All correct! We’re about to formalize this.
The Null Hypothesis (H₀)
Null hypothesis (H₀): There is no real difference between A and B.
\[H_0: p_A = p_B\]
Alternative hypothesis (H₁): There is a real difference.
\[H_1: p_A \neq p_B\]
Our test asks: “If H₀ were true, how surprising is what we observed?”
The null hypothesis is your skeptical starting point. It’s the “prove it to me” position.
This might feel backward! We want to prove B is better, but we start by assuming it’s NOT better.
Why? Because we can calculate what random chance looks like under the null. Then we see if our data is inconsistent with that.
Key philosophical point: we never “prove” the alternative. We only reject or fail to reject the null.
The Test Statistic
What we observe: Difference in sample proportions
\[\hat{p}_B - \hat{p}_A = 0.158 - 0.123 = 0.035\]
Translation: Email B had a 3.5 percentage point higher response rate.
But: Is +3.5pp a lot? Depends on:
Sample sizes (n_A and n_B)
Baseline variability
What we’d expect from random chance
The “hat” symbol (^) means “estimated from data” vs true population value.
Emphasize: the raw difference alone tells us nothing about statistical significance. A difference of 3.5pp might be: - Strong evidence of a real effect if you have 10,000 people per group - Easily explained by chance if you have 50 people per group
We need to standardize this difference relative to its uncertainty. That’s what the standard error does.
Standard Error & p-Values
Standard Error measures uncertainty in our estimate:
\[SE(\hat{p}_B - \hat{p}_A) = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}\]
p-value: If the null were true, how often would we see a difference this large or larger?
p < 0.05: “Statistically significant” (common threshold)
p = 0.03: Only a 3% chance of seeing a difference this large or larger if the emails were really identical
Standard Error is the standard deviation of your estimate. It comes from the sampling distribution we discussed in Part 1.
The formula looks scary but it’s just: “combine the uncertainty from group A and group B.”
p-value interpretation is tricky! It’s NOT: ❌ “Probability the null is true” ❌ “Probability Email B doesn’t work”
It IS: ✅ “Probability of seeing data this extreme if the null were true”
Common mistake: treating p=0.049 as “significant” and p=0.051 as “not significant” - the difference between these is trivial!
Type I and Type II Errors
|  | H₀ is true (no real effect) | H₀ is false (real effect) |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✅ Correct |
| Fail to reject H₀ | ✅ Correct | Type II Error (β) |
α (alpha): False positive rate (typically 5%)
β (beta): False negative rate
Power: 1 - β (typically aim for 80%)
This is the error budget in hypothesis testing.
Type I error (α): Saying it works when it doesn’t. False alarm. This is what p<0.05 controls.
Type II error (β): Saying it doesn’t work when it does. Missed opportunity.
In civic tech, which is worse? - Type I: You scale a program that doesn’t work → waste resources - Type II: You abandon a program that works → miss impact
There’s always a tradeoff! Can’t minimize both simultaneously with fixed n.
We’ll talk about power more in Part 3.
Confidence Intervals: Better Than p-Values
95% Confidence Interval for the difference:
\[\text{Estimate} \pm 1.96 \times SE\]
Example: Email B uplift = 3.5pp with 95% CI [0.2pp, 6.8pp]
Interpretation: We’re 95% confident the true effect is between 0.2pp and 6.8pp.
Why better than p-values? Shows magnitude and precision , not just “significant/not significant.”
CIs are criminally underused! They tell you much more than a p-value.
The CI says: “Here’s a range of plausible true effects, given our data.”
Note: The CI excludes zero (barely), which is consistent with p<0.05. But the CI tells us HOW MUCH the effect might be.
Key interpretation note: “95% confident” means “if we repeated this study many times, 95% of our CIs would contain the true effect.” It does NOT mean “95% chance the true effect is in this range” - that’s Bayesian thinking (Part 5).
Better reporting: “Email B increased response by 3.5pp (95% CI: 0.2pp to 6.8pp, p=0.037, n=800 per group)”
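A sketch of the whole calculation for the email example, assuming n = 800 per group (as in the reporting line above). Because the group sizes are an assumption and the slide figures are rounded, it won’t exactly reproduce the quoted p = 0.037, but it lands close.

```r
# Two-proportion comparison: SE, z, p-value and 95% CI for Email A vs B.
# Assumed group sizes: 800 per arm.
n_A <- 800; phat_A <- 0.123
n_B <- 800; phat_B <- 0.158

uplift <- phat_B - phat_A                        # 0.035 = 3.5pp
se <- sqrt(phat_A * (1 - phat_A) / n_A +
           phat_B * (1 - phat_B) / n_B)

z  <- uplift / se
p  <- 2 * pnorm(-abs(z))                         # two-sided p-value
ci <- uplift + c(-1, 1) * 1.96 * se              # 95% confidence interval

round(c(uplift = uplift, se = se, z = z, p = p), 4)
round(ci, 3)   # roughly 0.1pp to 6.9pp
```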
Graph: Visualizing p-Values
Null distribution of the test statistic:
Bell curve centered at 0 (no difference)
Observed difference marked with a vertical line
Shaded tail area = p-value
Visual: The further your observed difference from zero, the smaller the tail area (lower p-value)
Draw this on the board if possible! It’s the most intuitive way to understand p-values.
The null distribution shows: “If emails A and B were really the same, here’s the range of differences we’d see just from random sampling.”
Your observed difference is WAY out in the tail? Low p-value → probably not chance.
Your observed difference is near the center? High p-value → could easily be chance.
Pro tip: always show the distribution, not just the p-value. Helps people understand what “significant” means.
Interactive Demo: A/B Testing Simulator
Key insight: Use the 4 sliders to explore how sample size, effect size, alpha, and number of trials affect Type I error (null scenario, coral) and statistical power (alternative scenario, blue).
This is a powerful teaching tool. The simulation shows:
When the null is TRUE (trueUplift = 0): only ~5% of trials are “significant” (Type I error rate)
When there IS an effect: power determines how often you detect it. With small n, even real effects often show p>0.05.
CIs: 95% should contain the true effect ~95% of the time
Let people experiment: “What happens if you double the sample size? Set effect to zero?”
Key insight: with small n, even real effects often show p>0.05 (Type II error). This motivates Part 3 on power.
The Danger of p-Hacking
What is p-hacking?
Testing multiple hypotheses but only reporting the “significant” ones.
The problem: With α = 0.05, pure chance gives you 1 false positive per 20 tests!
Common forms:
Testing many outcomes, reporting only “significant” ones
Analyzing by multiple subgroups (age, gender, location, device…)
Stopping data collection when p < 0.05
Trying different statistical methods until one “works”
Result: Published findings that won’t replicate
This is one of the most important methodological problems in science and evaluation.
The math is simple: if you test 20 null hypotheses at α=0.05, you EXPECT one false positive just by chance.
Example scenario: “We tested our civic tech app on:
5 different outcomes (completion, engagement, satisfaction, referrals, time-on-task)
4 demographic groups (young/old × male/female)
3 devices (mobile/tablet/desktop)
= 60 possible comparisons!”
If you report only the 3 that were “significant,” you’re misleading everyone.
Real-world impact: - Interventions that don’t work get scaled - Resources wasted - Trust in evaluation eroded - Replication crisis
Prevention is key: pre-registration!
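A sketch of the arithmetic behind that expectation: 20 A/B comparisons in which the null is true by construction (both arms drawn from the same 40% rate, an assumed number), tested at α = 0.05.

```r
# p-hacking by simulation: 20 tests, no real effect anywhere.
set.seed(7)
one_null_test <- function(n = 200) {
  a <- rbinom(1, n, 0.40)              # control successes
  b <- rbinom(1, n, 0.40)              # "treatment" successes -- same true rate
  prop.test(c(a, b), c(n, n))$p.value
}

p_values <- replicate(20, one_null_test())
sum(p_values < 0.05)   # on average, 20 * 0.05 = 1 false positive
```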
Interactive Demo: P-Hacking Simulator
Key insight: Use the 4 sliders to explore how testing multiple hypotheses inflates false positives. Even when there are NO real effects, you’ll find “significant” results by pure chance!
This visualization powerfully demonstrates p-hacking in action.
Key observations to point out:
Every single test is NULL - there’s no real effect anywhere. Both groups come from the same distribution.
With α=0.05 and 20 tests, we EXPECT 1 false positive (20 × 0.05 = 1). Run it multiple times and you’ll see this average holds.
The “significant” results (blue dots below the red line) are purely chance findings.
If you only reported these significant results, you’d be p-hacking! You’d claim “The program increased engagement for young males using mobile” or whatever, when it was just random noise.
Bonferroni correction: If you MUST test 20 hypotheses, divide α by 20. So use α=0.0025 instead of 0.05. This controls the family-wise error rate.
Real-world parallel: “We tested our civic tech app and found it worked! (for women over 60 using tablets on Thursdays)” - this is how p-hacking manifests.
Prevention: Pre-register ONE primary outcome before collecting data.
Preventing p-Hacking: Pre-Registration
The gold standard: Pre-register your analysis plan
Before collecting data:
Specify ONE primary outcome
Define your analysis plan (method, covariates, sample size)
Register it publicly (OSF, AsPredicted, clinical trials registry)
Why this works:
Removes researcher degrees of freedom
Makes deviations transparent
Increases trust in findings
Prevents fooling yourself
Example: Clinical trials must pre-register to prevent selective reporting
Pre-registration is the gold standard for credible evaluation.
Practical steps:
1. Before starting, write down: “Our primary outcome is X. We’ll test it using Y method. Sample size is Z.”
2. Register this at OSF.io or AsPredicted.org (takes 5 minutes)
3. Stick to the plan! If you do exploratory analyses, label them as such.
Why this works: - Removes researcher degrees of freedom - Makes p-hacking transparent (if you deviate from plan, it’s visible) - Increases trust in your findings - Prevents you from fooling yourself
Real example: Clinical trials MUST pre-register. Why? Because pharmaceutical companies were testing dozens of outcomes and only reporting the “positive” ones.
Civic tech should adopt these standards!
Exploratory analysis is fine - just label it: “Pre-registered primary outcome: no effect. Exploratory analysis suggests effect for subgroup X, but this needs confirmation in a new study.”
Dealing with Multiple Comparisons
If you must test multiple outcomes:
Bonferroni correction: Divide α by number of tests
Testing 20 outcomes? Use α = 0.05/20 = 0.0025
False Discovery Rate (FDR): More powerful alternative (Benjamini-Hochberg)
Report ALL tests, not just significant ones
Red flags to watch for:
“We found X worked for [oddly specific subgroup]”
No pre-registered analysis plan
Only reporting significant results
Bottom line: p-hacking is easy to do accidentally. Pre-registration prevents it.
Multiple comparisons correction:
Bonferroni: very conservative; divide α by the number of tests. If testing 20 hypotheses at α=0.05, use α=0.0025 for each test. This controls the family-wise error rate.
Benjamini-Hochberg (FDR): less conservative; controls the expected proportion of false discoveries rather than the probability of any false discovery.
Both are implemented in standard statistical software (R, Python, Stata).
When to use: - If you have ONE pre-specified primary outcome → no correction needed - If you’re testing multiple secondary outcomes → use correction OR clearly label as exploratory - If doing post-hoc subgroup analyses → definitely label as exploratory, needs replication
Red flags in papers/reports: - “We tested many things and found this one significant result in this specific subgroup” - No mention of how many tests were run - Suspiciously specific findings that weren’t pre-specified
Best practice: Pre-register ONE primary outcome. If you do exploratory analyses, be transparent: “Pre-registered outcome showed no effect. Exploratory analysis suggests effect for subgroup X, but this is hypothesis-generating and needs confirmation.”
Bottom line: p-hacking is easy to do accidentally. Pre-registration prevents it.
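Both corrections are one line in base R. The p-values here are hypothetical, standing in for 8 secondary outcomes:

```r
# Multiple-comparison adjustments with p.adjust().
p_raw <- c(0.003, 0.021, 0.047, 0.090, 0.150, 0.320, 0.610, 0.880)  # hypothetical

p.adjust(p_raw, method = "bonferroni")  # controls the family-wise error rate
p.adjust(p_raw, method = "BH")          # controls the false discovery rate
```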
Case Study Prompts
Question 1: For an observed uplift of +3.8pp with 95% CI [−0.4pp, +8.0pp], how would you brief a stakeholder?
Question 2: Your trial reports p = 0.047 once, but 8 secondary outcomes were also tested. What does “significant” mean now?
Q1: This is about communication. The CI includes zero (just barely), so p is a little above 0.05 (not “significant” at the conventional threshold).
Good brief: “Email B increased response by 3.8pp, but we can’t rule out no effect (CI includes negative values). We’d need more data to be confident.”
Bad brief: “Email B didn’t work” or “The increase was 3.8pp so it worked”
Q2: This is the multiple comparisons problem! With 8 tests, even if nothing works, you’d expect ~0.4 false positives (8 × 0.05).
Solution: Bonferroni correction (divide α by number of tests) or pre-specify ONE primary outcome.
This is why p-hacking is dangerous: if you test enough things, something will be “significant” by chance.
Key Takeaway: Part 2
p-values tell you if an effect is surprising under the null. Confidence intervals tell you how big it might be. Always report both.
Good reporting:
Effect size ✅
Confidence interval ✅
p-value (optional) ✅
Sample size ✅
Method ✅
Emphasize: move away from “significant/not significant” binary thinking.
Effect size matters! A “significant” effect might be too small to care about. A “non-significant” effect might be important but underpowered.
Best practice: pre-register your primary outcome and analysis plan. This prevents p-hacking and makes your inference credible.
Transition: “We can now test if an effect is real. But how do we design studies to DETECT effects reliably? That’s about power…”
Part 3: Power & Sample Size
Duration: 16 minutes
The Problem: Planning Ahead
Scenario: You plan an SMS reminder to reduce missed appointments at a community clinic.
Current rate: 42% of people miss appointments
Your goal: Reduce to 39% (−3pp improvement)
Ops lead asks: “How many people per arm do we need to reliably detect this?”
Your challenge: Design a study with enough power to detect real effects.
This is the planning phase of evaluation. Too often, people run pilots with 50 people and wonder why results are inconclusive.
The answer depends on: 1. How big an effect you expect (Minimum Detectable Effect) 2. How much uncertainty you can tolerate (α) 3. How reliably you want to detect real effects (power = 1-β)
Ask the room: “What do you think determines sample size?” Common answers: budget, time, effect size. All correct!
What Is Power?
Statistical power: Probability you’ll detect a real effect if it exists.
\[\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ is true})\]
Common target: 80% power (i.e., β = 20%)
Tradeoffs:
Higher power → need larger sample size
Smaller effects → need larger sample size
Lower α → need larger sample size
Power is your insurance against Type II errors (false negatives).
80% power means: if there IS a real effect, you’ll detect it 80% of the time.
Why not 90% or 95% power? Because sample size grows fast: going from 80% to 90% power needs roughly a third more participants, and going to 95% needs roughly two thirds more.
Analogy: Power is like the strength of your microscope. With low power (small n), you can only see BIG effects. With high power (large n), you can detect subtle effects.
In civic tech, low power is expensive! You run a pilot, find “no effect,” but actually the effect was there - you just couldn’t see it.
The Power-Sample Size Relationship
Rule of thumb: Detecting small uplifts in binary outcomes needs large n .
Approximate formula (two-proportion test, equal group sizes):
\[n \approx \frac{(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)})^2}{(p_B-p_A)^2}\]
Don’t memorize! Use a calculator or tool. But note:
\(n\) grows with \(1/(\text{effect size})^2\) → half the effect = 4× the sample
\(n\) grows with baseline variance → more variable outcomes need more data
This formula looks intimidating! The key insights:
Effect size appears in the denominator, SQUARED. So tiny effects need huge samples.
The z-scores (1.96 for α=0.05, 0.84 for 80% power) are constants from the Normal distribution.
Baseline variance matters: outcomes near 50% have highest variance, so need more data than outcomes near 0% or 100%.
Example: to detect a 3pp effect (42% → 39%) with 80% power and two-sided α=0.05: - n ≈ 4,200 per arm (≈8,400 total)
That’s a lot! Many civic tech pilots are underpowered.
In practice: use an online calculator or R package (pwr, WebPower).
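In base R, power.prop.test does this calculation directly. A sketch for the clinic example (42% vs 39%, two-sided α = 0.05, 80% power); the second call shows how quickly the required n falls for a larger effect.

```r
# Sample size per arm for the SMS reminder example.
power.prop.test(p1 = 0.42, p2 = 0.39, sig.level = 0.05, power = 0.80)
# n is reported per arm -- roughly 4,200 people in each group.

# A 10pp effect (42% -> 32%) needs only a few hundred per arm:
power.prop.test(p1 = 0.42, p2 = 0.32, sig.level = 0.05, power = 0.80)
```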
Effect Size: MDE Thinking
Minimum Detectable Effect (MDE): Smallest effect your study can reliably detect.
Key question: “What effect size would actually matter for policy/practice?”
Example:
1pp improvement in appointment attendance: probably not worth scaling
5pp improvement: worth considering
10pp improvement: definitely worth scaling
Design principle: Match your MDE to your practical significance threshold .
This is where stats meets policy! Don’t just design for statistical significance.
Ask: “What’s the smallest effect that would change our decision?”
If you need ≥5pp to justify scaling the SMS program, but your study can only detect ≥10pp (underpowered), then you might miss important effects.
Conversely, if you power for 1pp but would never scale for <5pp, you’re oversampling.
MDE depends on: - Sample size (larger n → smaller MDE) - Baseline variance - Desired power - α level
Trade off: precision vs feasibility. Sometimes you can only afford n=500, so MDE is fixed. Be transparent about this!
Graphs: Power Curves
Two useful visualizations:
Power curve vs n (for fixed MDE)
Shows: how power increases as you add more people
Typical shape: S-curve (steep in the middle, flat at extremes)
MDE curve vs n (for fixed power)
Shows: what effect sizes you can detect for a given n
Typical shape: hyperbola (diminishing returns to adding people)
Draw these on the board or show pre-made graphs.
Power curve: - Low n: low power (e.g., 20%) - Moderate n: power crosses 80% threshold - High n: diminishing returns (90% → 95% power needs many more people)
MDE curve (at the 42% baseline): - With n=200 per arm, you can only detect effects of roughly 14pp - With n=800 per arm, roughly 7pp - With n=3,200 per arm, roughly 3.5pp
Key insight: there’s no magic n. It depends on what effect size matters to you.
Show where your planned study sits on these curves.
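A sketch of the MDE curve using the normal-approximation formula (assumptions: equal arms, two-sided α = 0.05, 80% power, variance taken at the 42% baseline):

```r
# Minimum detectable effect for a two-arm comparison of proportions.
mde <- function(n_per_arm, p_base, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power)) *
    sqrt(2 * p_base * (1 - p_base) / n_per_arm)
}

round(mde(c(200, 800, 3200), p_base = 0.42), 3)
# ~0.138, 0.069, 0.035 -- i.e. ~14pp, ~7pp, ~3.5pp detectable effects
```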
Interactive Demo: Power & Sample Size Calculator
Key insight: Move the slider to see how MDE affects required sample size. Smaller effects (1pp) need huge samples; larger effects (10pp) are detectable with modest samples.
This tool should be the go-to for planning evaluations.
Walk through a live example: 1. Start with ambitious MDE (1pp) → note huge n required 2. Relax to realistic MDE (3pp) → more feasible n 3. Show what happens if you adjust power from 80% to 90% → note increase in n
Key insight: The relationship is quadratic. To detect half the effect, you need 4× the sample.
Emphasize: power analysis happens BEFORE data collection, not after.
Post-hoc power analysis (calculating power after seeing results) is problematic - it’s circular reasoning.
Students can use this for their own evaluation planning!
Case Study Prompts
Question 1: With baseline 42% missed appointments and desired MDE of −3pp (to 39%) at 80% power, what n per arm is needed?
Question 2: If you can only recruit 1,200 people total (600 per arm), what MDE becomes realistic at 80% power? Is this still meaningful?
Q1: Approximately n ≈ 4,200 per arm (≈8,400 total) for 80% power, two-sided α=0.05.
This is often surprising to people! “That many?” Yes. Small effects need large samples.
Q2: With n=600 per arm, your MDE is roughly 8pp at the 42% baseline (it shrinks if the baseline rate is further from 50%).
Is that meaningful? Depends on context: - If only a large (≥8pp) improvement would justify scaling → you’re fine - If you need to detect 3-5pp effects → underpowered
Be honest about limitations. Don’t run an underpowered study and conclude “no effect” when you simply couldn’t detect a real effect.
Better to acknowledge: “Our study can detect effects of roughly 8pp or more, but not smaller effects.”
Practical Constraints: What If n Is Fixed?
Reality: Often you CAN’T get more people (budget, time, eligible population).
Options:
Accept lower power → risk false negatives, report this
Accept larger MDE → can only detect big effects
Reduce variance → better measurement, blocking, stratification
Increase α → accept more false positives (rarely done)
Use a more powerful design → within-subjects, stepped wedge
This is the real world! Sample size isn’t always flexible.
Option 3 is underused: if you can reduce outcome variance, you effectively increase power.
Examples of variance reduction: - Stratify by high-risk vs low-risk (then analyze by strata) - Use paired/matched designs - Control for baseline covariates in regression (Part 4) - Use more precise outcome measures
Option 5: stepped wedge designs give you more power by using each unit as its own control.
Key message: do the power analysis early, so you know your constraints. Don’t wait until after data collection to realize you were underpowered.
Key Takeaway: Part 3
Power determines whether you can reliably detect real effects. Design your sample size for the minimum effect that would change your decision.
\[\text{More power} \Leftrightarrow \text{Larger n OR Larger effect OR Lower variance}\]
Always report: “This study can detect effects ≥ Xpp with 80% power.”
Power analysis is not optional! It’s the difference between: - A credible evaluation that informs decisions - A pilot that’s “inconclusive” because it was underpowered
Emphasize: underpowered studies waste everyone’s time and resources. You spend money collecting data that can’t answer your question.
Funders increasingly expect power calculations in proposals. Be ready to justify your sample size.
Transition: “We can now design studies and test for effects. But what if we need to control for confounders? That’s where regression comes in…”
Part 4: Regression & Linear Modelling
Duration: 24 minutes
The Problem: Confounding
Scenario: You roll out posters + emails to promote a community consultation.
Complication:
Younger residents (18-35) were more exposed to the campaign
Younger residents also use mobile devices more
Both age AND mobile usage might affect participation
Challenge: What’s the adjusted effect of the campaign, controlling for age and device?
This is the situation where simple before/after or treatment/control comparisons break down.
You can’t just compare people who saw the campaign vs those who didn’t, because they differ in OTHER ways (age, device).
Regression lets you estimate the effect of one variable while “holding constant” other variables.
Ask: “Why might younger people participate more, even without the campaign?” Answers: more familiar with online platforms, more time, different interests, etc.
If we don’t control for age, we might attribute all the difference to the campaign when some is really about age.
What Is Regression?
Regression models the relationship between:
An outcome variable (Y): participation rate, completion rate, etc.
One or more predictor variables (X): treatment, age, device, etc.
Ordinary Least Squares (OLS) finds the line that best fits your data:
\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 \text{Mobile}_i + \varepsilon_i\]
β₁ = effect of treatment, holding age and device constant
OLS is a fancy name for “line of best fit.” It minimizes the sum of squared errors.
The equation is just: outcome = intercept + effects of various factors + random error
Key insight: each β coefficient is the effect of THAT variable, assuming all other variables stay the same.
β₁ is what we care about (treatment effect). β₂ and β₃ are “nuisance parameters” we include to avoid bias from confounding.
This extends t-tests and ANOVAs. It’s the workhorse of applied evaluation.
Regression ≠ causation! It only gives causal estimates if you have good design (RCT, natural experiment, or credible adjustment strategy).
Interpreting Coefficients
Example output:
| Variable | Coefficient | Std. Error | p-value |
|---|---|---|---|
| Intercept | 0.28 | 0.03 | <0.001 |
| Treatment | 0.029 | 0.014 | 0.038 |
| Age | -0.002 | 0.001 | 0.045 |
| Mobile | 0.045 | 0.018 | 0.012 |
β₁ = 0.029: The campaign increased participation by 2.9 percentage points , adjusting for age and device.
Walk through each row:
Intercept (0.28): baseline participation for someone age=0, not treated, not on mobile. Often not interpretable literally (no one is age 0), but needed for the equation.
Treatment (0.029): This is our target! The effect of the campaign after accounting for confounders. It’s smaller than the unadjusted difference, suggesting some of the raw difference was due to age/device.
Age (-0.002): Each additional year of age is associated with 0.2pp lower participation. So a 10-year difference → 2pp lower participation.
Mobile (0.045): Using mobile increases participation by 4.5pp compared to desktop.
p-values: All are <0.05, so “statistically significant” - but remember Part 2, we care more about effect size and CI!
Standard errors: Measure uncertainty in each coefficient. Use these to construct CIs.
Regression as Adjusted Means
Intuition: Regression is just comparing groups AFTER adjusting for other factors.
Visual: Imagine two histograms (treatment vs control), but you’ve matched them on age and device first.
β₁ is the difference in average outcomes after this matching.
Math detail: OLS is equivalent to weighted averages, where weights ensure balance on covariates.
This is the most intuitive way to think about regression: it’s a fancy way of comparing apples to apples.
Without adjustment: - Treatment group is younger, more mobile → participates more - Some of this is due to treatment, some due to age/mobile
With adjustment: - We estimate: “For people of the SAME age, using the SAME device, how much does treatment matter?”
This is why randomization is powerful: in RCTs, treatment and control are already balanced (on average), so you don’t need adjustment. But in observational studies or when randomization isn’t perfect, adjustment helps.
Limitation: regression only adjusts for variables you INCLUDE. If there’s an unmeasured confounder (e.g., prior interest in the topic), your estimate can still be biased.
Key Assumptions
For OLS to give reliable results:
Linearity (in parameters): relationships can be additive
Exogeneity: ε is uncorrelated with X (no unmeasured confounders)
No perfect collinearity: predictors aren’t exact copies of each other
Homoskedasticity: variance of ε doesn’t depend on X
Practical tip: Use robust standard errors (or clustered SEs) to relax #4.
These assumptions sound technical, but they’re about whether regression will give you the right answer.
Linearity: The model is linear in the βs, not necessarily in the Xs. You can include Age² or log(Income) - that’s fine. But interactions (e.g., Treatment × Age) must be specified explicitly as extra terms.
Exogeneity: This is the big one! If treatment is assigned based on unobserved factors (e.g., more motivated people opt in), then β₁ will be biased. This is why randomization or natural experiments are so valuable.
No multicollinearity: If Age and YearsOfEducation are nearly identical, regression can’t tell them apart. Coefficients get unstable. Solution: drop one or combine them.
Homoskedasticity: In practice, often violated (e.g., variance is higher for younger people). Robust SEs fix this without changing point estimates.
In civic tech evaluations, #2 is usually the concern. Always ask: “What confounders might I be missing?”
Clustering & Robust Standard Errors
Problem: Outcomes within the same ward, school, or household are correlated.
Solution: Use cluster-robust standard errors by group.
Example: If 20 schools each recruit 50 families:
Don’t treat all 1,000 families as independent
Cluster by school → SEs will be larger (more conservative)
R code: lm(...) %>% sandwich::vcovCL(cluster = ~school_id)
This is a common mistake in civic tech evaluation: ignoring clustering.
Why it matters: if outcomes within schools are similar (because of school-level factors like leadership, demographics), then you don’t have 1,000 independent observations - you have 20 semi-independent clusters.
Ignoring clustering makes your SEs too small → p-values too low → false positives.
Examples of clustering: - Participants from the same council area - Households (people in the same house are similar) - Repeated measures (same person measured multiple times)
Rule: if your design has clustering, ALWAYS use clustered SEs.
Software support: R (sandwich, lmtest, fixest), Stata (vce(cluster)), Python (statsmodels).
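A minimal sketch of the clustered-SE workflow in R, on simulated data for the 20-schools example (the school effects and the 0.3 “treatment effect” are made-up illustration values; requires the sandwich and lmtest packages).

```r
library(sandwich)
library(lmtest)

# 20 schools, 50 families each; treatment assigned at the school level.
set.seed(3)
n_schools <- 20; per_school <- 50
school_id <- rep(seq_len(n_schools), each = per_school)

school_effect <- rnorm(n_schools, sd = 0.5)[school_id]        # shared within school
treated <- rep(rbinom(n_schools, 1, 0.5), each = per_school)  # school-level treatment
y <- 1 + 0.3 * treated + school_effect + rnorm(n_schools * per_school)

fit <- lm(y ~ treated)

coeftest(fit)                                             # naive SEs (too small here)
coeftest(fit, vcov. = vcovCL(fit, cluster = school_id))   # cluster-robust SEs
```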
Diagnostics: Residual Plots
Check your assumptions visually:
Residuals vs Fitted: Should be randomly scattered
Pattern → model misspecification (try transformations or interactions)
Q-Q plot: Should be roughly a straight line
Deviations → non-Normal errors (often okay with large n due to CLT)
Leverage plot: Identify influential observations
High leverage + large residual → outlier that affects estimates
Residual = observed outcome - predicted outcome. These are the “errors” ε in your model.
Plot 1 (Residuals vs Fitted): - Random scatter → good - Funnel shape → heteroskedasticity (use robust SEs) - Curved pattern → maybe need Age² or log transform - Clusters of outliers → investigate data quality
Plot 2 (Q-Q plot): - Straight line → errors are approximately Normal - Heavy tails → some extreme values (might need robust regression) - Skewness → consider transforming Y
Plot 3 (Leverage): - Points with high leverage “pull” the regression line - Check: Are these data errors? Valid but unusual observations? - Sensitivity analysis: re-run without them, see if conclusions change
Never skip diagnostics! They often reveal data issues or model problems.
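These plots come straight out of R’s plot() method for lm objects; shown here on the built-in mtcars data purely as an illustration.

```r
# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```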
R² and Model Fit
R²: Proportion of variance in Y explained by the model.
R² = 0.15 → model explains 15% of variation
R² = 0.80 → model explains 80% of variation
Important: R² ≠ whether your model is good!
High R² doesn’t mean causal
Low R² can still have reliable β estimates
Focus on β₁ (your treatment effect) and its SE, not R²
R² is overrated! It measures fit, not validity.
You can have: - R²=0.05 but β₁ is precisely estimated and credible (e.g., many unmeasured factors affect Y, but treatment effect is clear) - R²=0.90 but β₁ is biased due to confounding
In social science, R² is often low (0.10-0.30) because human behavior has many unmeasured causes. That’s okay!
What matters: 1. Is β₁ estimated with reasonable precision (small SE)? 2. Have you controlled for the main confounders? 3. Do your diagnostics look okay?
Don’t chase high R² by adding irrelevant predictors. Focus on the causal question.
Interactions: When Effects Vary
Sometimes the treatment effect differs by subgroup:
\[y_i = \beta_0 + \beta_1 \text{Treatment}_i + \beta_2 \text{Age}_i + \beta_3 (\text{Treatment} \times \text{Age}) + \varepsilon_i\]
β₃: How treatment effect changes with age
Example: If β₃ < 0, campaign is more effective for younger residents.
Interactions let you test: “Does the campaign work better for some groups than others?”
Interpretation: - β₁: effect for Age=0 (not usually meaningful on its own) - β₃: change in effect per unit of Age
Example: β₁=8pp, β₃=-0.1pp/year - At age 20: effect ≈ 8 - 0.1×20 = 6pp - At age 60: effect ≈ 8 - 0.1×60 = 2pp
This is called “heterogeneous treatment effects” or “effect modification.”
When to include interactions: - Pre-specified hypothesis (e.g., “we expect campaigns to work better for young people”) - Exploratory analysis (but be transparent about multiple testing)
Avoid: testing dozens of interactions and only reporting “significant” ones. That’s p-hacking.
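A sketch of fitting and reading an interaction, on simulated data built to echo the example above (8pp effect at age 0, shrinking by 0.1pp per year; all numbers assumed).

```r
# Treatment x Age interaction in a linear probability model.
set.seed(11)
n <- 5000
age       <- sample(18:80, n, replace = TRUE)
treatment <- rbinom(n, 1, 0.5)
p_true    <- 0.25 + 0.0005 * age + treatment * (0.08 - 0.001 * age)
y         <- rbinom(n, 1, p_true)

fit_int <- lm(y ~ treatment * age)   # expands to treatment + age + treatment:age
round(summary(fit_int)$coefficients, 4)
# The treatment:age coefficient should come out near -0.001 (-0.1pp per year).
```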
Case Study Prompts
Question 1: Unadjusted uplift is +5.2pp; adjusted β₁ is +2.9pp (robust SE 1.4pp). How do you report this to stakeholders?
Question 2: Adding Age² improves R² but leaves β₁ similar. What does that suggest about confounding vs functional form ?
Q1: Good reporting emphasizes the adjusted estimate, because it’s more credible (controls for confounders).
“After accounting for differences in age and device usage between groups, the campaign increased participation by 2.9 percentage points (95% CI: 0.1pp to 5.7pp). This is a meaningful improvement, though smaller than the unadjusted difference of 5.2pp, which partially reflected that younger people (who participate more) were more exposed to the campaign.”
Be transparent about adjustment and why it matters!
Q2: This suggests Age² helps predict participation (better fit) but isn’t a confounder (doesn’t bias β₁).
Confounding: a variable associated with both treatment and outcome → biases estimates if omitted.
Functional form: a nonlinear relationship in the outcome model → affects fit but not necessarily bias.
Adding Age² improved the model’s ability to predict Y (higher R²) but didn’t change the treatment effect estimate much → Age was already capturing the confounding, Age² is just refinement.
When Regression Isn’t Enough
Regression assumes:
No unmeasured confounders
Linear-additive effects (or specified interactions)
Correct functional form
If these fail, consider:
Instrumental Variables (Part 1 slides, quasi-experimental)
Difference-in-Differences (before/after × treatment/control)
Regression Discontinuity (exploit thresholds)
Propensity Score Matching (balance observables first)
Regression is powerful but not magic. It only controls for variables you include.
If there’s an unmeasured confounder (e.g., motivation, prior interest), regression can’t save you. You need better design (RCT) or a natural experiment.
Quick overview of alternatives:
IV: Use an external factor (instrument) that affects treatment but not outcome directly. Example: distance to clinic affects attendance, use as instrument for treatment uptake.
DiD: Compare treatment group’s change to control group’s change. Controls for time-invariant confounders.
RD: If treatment is assigned based on a threshold (e.g., age 18, income <£20k), people just above/below are similar. Compare them.
PSM: Estimate probability of treatment (propensity score), match treated/control units with similar scores, then compare.
These are advanced topics (2-hour workshop each!), but know they exist.
Key Takeaway: Part 4
Regression estimates treatment effects while adjusting for confounders. Always interpret coefficients in context, check assumptions, and use robust/clustered SEs when appropriate.
Report: “The campaign increased participation by βpp (95% CI: [X, Y]), adjusting for age and device.”
Regression is your workhorse for observational evaluations and for improving precision in experiments.
Best practices: 1. Pre-specify which covariates to include (avoid p-hacking) 2. Always check diagnostics 3. Use clustered SEs if there’s any grouping 4. Report adjusted estimates with CIs 5. Be honest about potential unmeasured confounders
Don’t oversell: “We found an effect after adjusting for X, Y, Z, but can’t rule out confounding from unmeasured factors.”
Transition: “So far we’ve used frequentist inference - p-values, CIs, power. But there’s another paradigm: Bayesian. Let’s zoom out…”
Part 5: Bayesian Zoom-Out
Duration: 10 minutes
The Limitation of Frequentism
Frequentist approach:
Assumes a true fixed parameter (e.g., treatment effect)
Makes probability statements about data (p-values, CIs)
Can’t say “95% probability effect is positive”
Bayesian approach:
Treats parameters as uncertain (have distributions)
Makes probability statements about parameters
Can say “95% probability effect is between X and Y”
This is a paradigm shift! Frequentist and Bayesian answer different questions.
Frequentist CI: “If we repeated this study many times, 95% of CIs would contain the true effect.” - Statement about a procedure, not about THIS particular CI - Can’t say “95% chance the effect is in this interval”
Bayesian Credible Interval: “Given the data, there’s a 95% probability the effect is in this interval.” - Direct probability statement about the parameter - Intuitively what people THINK frequentist CIs mean!
Neither is “right” or “wrong” - they’re answering different questions. Use the one that matches your goals.
The Core Insight: Bayes’ Theorem
Update beliefs based on evidence:
\[P(\theta | \text{Data}) = \frac{P(\text{Data} | \theta) \times P(\theta)}{P(\text{Data})}\]
Plain English:
Prior: \(P(\theta)\) = what we believed before seeing data
Likelihood: \(P(\text{Data} | \theta)\) = how consistent data is with each possible θ
Posterior: \(P(\theta | \text{Data})\) = updated belief after seeing data
Bayes’ theorem is just: prior belief + new evidence → updated belief.
This is how humans naturally think! “I thought the campaign would help by ~2pp (prior). Data shows 3.8pp. Now I believe it’s probably around 3-4pp (posterior).”
The formula looks scary but it’s intuitive: - Start with prior (accumulated knowledge, theory, past studies) - Multiply by likelihood (how much does THIS data support each possible effect size?) - Normalize (technical detail to make it a proper probability)
The prior is controversial: where does it come from? Subjectivity? - Subjective: expert judgment, theory - Objective: past data from similar contexts - Weakly informative: regularization, prevent overfitting
In practice: if you have lots of data, the prior doesn’t matter much (data overwhelms it). If you have little data, the prior has more influence.
Toy Example: Beta-Binomial
Scenario: Prior evidence from other councils suggests outreach emails usually increase response by 0-5pp.
Model:
Prior: Response rate \(p \sim \text{Beta}(a, b)\) (flexible distribution on [0,1])
Data: \(x\) responses out of \(n\) emails
Posterior: \(p | \text{Data} \sim \text{Beta}(a+x, b+n-x)\)
Result: Direct probability distribution over the response rate!
Beta-Binomial is the simplest Bayesian model for proportions. Perfect for A/B tests.
Beta distribution: - Parameters a, b control shape - Beta(1,1) = uniform prior (know nothing) - Beta(20,80) = skeptical prior centered around 20/(20+80)=20% - Beta(45,55) = optimistic prior centered around 45%
Conjugacy: Beta prior + Binomial likelihood → Beta posterior (analytically tractable, no simulation needed)
Example: - Prior: Beta(45, 55) → expect ~45% response - Data: 58 responses out of 100 emails → 58% observed - Posterior: Beta(45+58, 55+42) = Beta(103, 97) → expect ~52% response
The posterior is a compromise between prior and data. With more data, data dominates. With little data, prior matters.
Credible interval: 95% of posterior mass → direct interpretation “95% probability true response rate is between X% and Y%”
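The worked example above in a few lines of R; qbeta gives the credible interval and pbeta the posterior probability of clearing a threshold.

```r
# Beta-Binomial update: prior Beta(45, 55), data 58 responses out of 100 emails.
a_post <- 45 + 58          # 103
b_post <- 55 + (100 - 58)  # 97

a_post / (a_post + b_post)               # posterior mean, ~0.515
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% credible interval
1 - pbeta(0.50, a_post, b_post)          # P(true response rate > 50% | data)
```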
Prior Selection: Art or Science?
Types of priors:
Uninformative: Flat, let data speak (e.g., Beta(1,1))
Weakly informative: Regularize, prevent extreme estimates
Informative: Based on past studies, theory, expert judgment
Best practice:
Be transparent about prior choice
Run sensitivity analysis: how do results change with different priors?
In civic tech: use priors from similar interventions if available
Priors are often misunderstood. They’re not “bias” - they’re accumulated knowledge.
When to use which:
Uninformative: - First study in a new area - Very large dataset - Want to match frequentist results
Weakly informative: - Prevent overfitting (especially with small data) - Rule out extreme/implausible values (e.g., “response rate can’t be 99%”)
Informative: - Rich past data from similar contexts (meta-analysis) - Theory makes strong predictions - Sequential trials (posterior from trial 1 = prior for trial 2)
Criticism: “Priors are subjective!” Response: So are modeling choices in frequentist analysis (which covariates to include, transformations, etc.). At least Bayesian analysis is transparent about assumptions.
Sensitivity analysis is key: show results under multiple priors. If they’re similar → robust. If they differ → data is weak, prior matters, be honest about uncertainty.
Credible Intervals vs Confidence Intervals
95% Confidence Interval (frequentist):
“If we repeated the study many times, 95% of intervals would contain the true parameter”
Statement about the procedure , not this particular interval
95% Credible Interval (Bayesian):
“There’s a 95% probability the parameter is in this interval, given our data and prior”
Direct statement about the parameter
Visual: Show prior, likelihood, posterior curves; credible interval is 95% of posterior mass.
This is the key practical difference.
Frequentist CI: - Correct interpretation is tortured: “In the long run, if we sampled repeatedly…” - What people want to say: “95% chance effect is in this range” - What you can say: “Procedure has 95% coverage”
Bayesian CI (credible interval): - Says exactly what people want: “95% probability effect is in this range” - Conditional on your prior and data
Example: - Frequentist: 95% CI [0.2pp, 6.8pp] → “if we repeated the study many times, 95% of intervals constructed this way would contain the true effect” - Bayesian: 95% CrI [0.5pp, 6.5pp] → “95% probability the effect is between 0.5pp and 6.5pp”
For decision-making, Bayesian is more intuitive! “Should we scale the program?” → “Yes, there’s a 97% probability it increases participation”
Caveat: Bayesian interpretation is conditional on your prior being reasonable. If your prior is terrible, posterior is misleading.
When Bayesian Thinking Helps
Situations where Bayes shines:
Small samples: Incorporate prior knowledge to improve estimates
Sequential testing: Update beliefs as data accumulates
Decision analysis: Compute probability of meeting a threshold
Complex models: Hierarchy, missing data (MCMC handles these well)
Example: “What’s the probability the campaign increases response by at least 3pp?” → Integrate posterior above 3pp.
Bayesian methods are especially useful when:
Small n: You run a pilot with 200 people. Frequentist analysis is very uncertain. Bayesian can borrow strength from past studies via prior.
Sequential: You run a trial in waves (Phase 1 → Phase 2 → Phase 3). Bayesian lets you update beliefs after each phase. Frequentist has issues with “peeking” (multiplicity).
Decision: Stakeholder asks “What’s the probability the effect is large enough to justify scaling?” Bayesian directly computes this. Frequentist can’t (only gives p-values/CIs).
Complex models: Hierarchical models (e.g., effects vary by ward), missing data imputation, measurement error - all easier in Bayesian framework (use Stan, JAGS, PyMC).
Downsides of Bayesian: - Computationally harder (MCMC takes time) - Requires prior specification (can be controversial) - Less familiar to reviewers/stakeholders (educational burden)
In civic tech: Bayesian is growing but still less common than frequentist. Use when it adds value (small n, sequential, decision-focused). Document your approach carefully.
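A sketch of the threshold question (“probability the uplift is at least 3pp”) using Monte Carlo draws from two Beta posteriors. The counts (98/800 vs 126/800) and the flat Beta(1, 1) priors are assumptions chosen to echo the email example.

```r
# P(uplift >= 3pp | data): simulate from the posterior of each arm.
set.seed(99)
draws <- 100000

post_A <- rbeta(draws, 1 + 98,  1 + 800 - 98)    # arm A: 98/800 responses
post_B <- rbeta(draws, 1 + 126, 1 + 800 - 126)   # arm B: 126/800 responses

uplift <- post_B - post_A
mean(uplift > 0)       # posterior probability of any positive effect
mean(uplift >= 0.03)   # posterior probability of an effect of at least 3pp
```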
Case Study Prompts
Question 1: With a skeptical prior centered at +1pp and observed +3.2pp uplift (n=400 per arm), does your posterior still support a meaningful positive effect?
Question 2: How does doubling the prior sample weight (stronger prior) change conclusions vs using a flat prior?
Q1: This tests whether evidence overcomes skepticism.
Skeptical prior: centered at +1pp with a modest spread (weak prior belief in an effect). Data: +3.2pp observed. Posterior: will shift toward the data, probably with a 95% CrI above 0pp.
With n=400 per arm, data has moderate strength. Posterior will be a compromise: maybe 2pp to 2.5pp central estimate, 95% CrI [0.5pp, 4.5pp].
Interpretation: “Despite a skeptical prior, the data provides strong evidence for a positive effect (98% probability effect > 0pp, 85% probability effect > 1pp).”
Q2: Prior weight = effective sample size in the prior.
Weak prior: Beta(10,10) = like having 20 prior observations Strong prior: Beta(20,20) = like having 40 prior observations
With small dataset (n=100), strong prior has more influence. With large dataset (n=1000), prior hardly matters.
Show this: posterior with weak prior ≈ MLE. Posterior with strong prior is “shrunk” toward prior mean.
Trade-off: strong priors prevent overfitting (good for prediction) but might bias estimates (bad for unbiased causal inference).
Key Takeaway: Part 5
Bayesian inference lets you make direct probability statements about parameters, incorporating prior knowledge. It’s especially useful for small samples, sequential testing, and decision analysis.
Frequentist: “Given a true parameter, what’s the probability of this data?”
Bayesian: “Given this data, what’s the probability distribution of the parameter?”
Both paradigms have strengths:
Frequentist:
✅ Well-established, familiar to reviewers
✅ No prior specification needed
✅ Null hypothesis testing is standard
❌ p-values are confusing
❌ Can’t make direct probability statements
❌ Doesn’t incorporate prior knowledge
Bayesian:
✅ Intuitive probability statements
✅ Natural for decision-making
✅ Can incorporate prior information
❌ Computationally harder
❌ Prior choice can be controversial
❌ Less familiar (more explanation needed)
In practice: Use frequentist for standard evaluations where n is large and methods are established. Use Bayesian when you have good priors, small n, or need direct decision probabilities.
Many modern analyses combine: frequentist for primary analysis (transparent, standard) + Bayesian for sensitivity/decision analysis (adds value).
Part 6: Wrap & Q/A
Duration: 5 minutes
Cross-Cutting Pitfalls
Common mistakes in evaluation:
Multiple comparisons without correction → inflated false positive rate
p-hacking → testing many things, only reporting “significant” ones
Regression to the mean → target worst performers, they improve anyway
Measurement drift → definition changes over time
Ignoring clustering → SEs too small, false positives
Prevention: Pre-register analysis plans, use robust methods, be transparent.
Quick reminders of things we’ve touched on:
Multiple comparisons: If you test 20 outcomes, expect 1 false positive even if nothing works. Solutions: Bonferroni correction, FDR control, pre-specify primary outcome.
p-hacking: “We tested A/B emails, SMS vs call, 3 message variants, young vs old, mobile vs desktop… only the SMS to young people on mobile was significant!” → likely false positive. Prevention: pre-analysis plan.
Regression to the mean: “We targeted the 10 worst-performing schools. After our intervention, 8 improved!” → some would have improved anyway (regression to mean). Need control group.
Measurement drift: “Response rates increased!” → but you changed the definition of “response” halfway through. Keep definitions consistent.
Clustering: We covered this in Part 4. Don’t treat clustered data as independent.
Emphasize: these are ALL avoidable with good design and transparent reporting.
Good Practice Checklist
✅ Design phase:
Pre-register primary outcomes and analysis plan
Calculate required sample size (power analysis)
Plan for covariates to adjust for
✅ Analysis phase:
Report effect sizes with confidence intervals
Use robust/clustered SEs where appropriate
Check diagnostics (residual plots, balance checks)
Don’t p-hack (stick to pre-specified analyses)
✅ Reporting phase:
Transparent about limitations
Share data and code where possible
Plain-language interpretation
Acknowledge what you DON’T know
This is your takeaway checklist. Print it, share it, use it for every evaluation.
Design: Upfront investment in planning pays off. Don’t start collecting data without a clear plan.
Analysis: Follow best practices. Use modern statistical methods (robust SEs, pre-registration, sensitivity analyses).
Reporting: Transparency builds credibility. Don’t hide limitations or negative results.
The civic tech community benefits from shared learning. When you evaluate something, share your methods and data (when ethical). Helps everyone improve.
Meta-point: Good evaluation is iterative. First evaluation might be imperfect - that’s okay! Learn, document what you’d change, do better next time.
Exit Tickets
Before you leave, please answer:
Write a one-sentence interpretation of a 95% confidence interval for a treatment effect of +3.5pp, CI [0.2pp, 6.8pp].
Name one design change to increase statistical power without inflating α.
These are quick checks for understanding. Collect on paper or online form.
Q1 tests: Do they understand CIs?
Good answer: “We’re 95% confident the true treatment effect is between 0.2pp and 6.8pp.”
Okay answer: “There’s a 95% chance the effect is between 0.2pp and 6.8pp” (slightly wrong - that’s Bayesian, but shows intuition)
Bad answer: “The effect is significant” (misses the point)
Q2 tests: Do they understand power?
Good answers: - Increase sample size - Increase effect size (better intervention) - Reduce outcome variance (better measurement, stratification) - Use a more powerful design (within-subjects, matched pairs)
Bad answers: - Increase α (technically works but defeats the purpose) - “Make p-value smaller” (confuses power with significance)
Use responses to identify concepts to revisit in follow-up sessions.
Q&A
Open floor for questions on any part of the session.
Common questions:
“When should I use Bayesian vs frequentist?”
“How do I deal with small sample sizes?”
“What if I can’t randomize?”
“How do I explain this to non-technical stakeholders?”
Budget 5-10 minutes for Q&A. Encourage all questions, no matter how basic.
Prepared answers:
Bayesian vs frequentist: - Default to frequentist (more standard, easier to explain) - Use Bayesian when: small n, strong priors, need decision probabilities
Small sample sizes: - Be honest about uncertainty (wide CIs) - Don’t over-interpret - Consider Bayesian with informative priors - Or just collect more data
Can’t randomize: - Use quasi-experimental methods (DiD, RD, IV - see main eval slides) - Regression with good covariates - Be very careful about causal claims - Sensitivity analyses
Non-technical stakeholders: - Focus on effect size and CI, not p-values - Use visuals (graphs, not tables) - Plain language: “increased by Xpp” not “β=X, p<0.05” - Tell the story: what does this mean for our mission?
If time allows, work through 1-2 questions in depth. Use the board.
Key Takeaways: Statistics 101
1. Distributions & LLN: Variation is normal. More data → less uncertainty.
2. p-values & CIs: Effect size + confidence interval > binary significance.
3. Power: Design for the minimum effect that matters. Don’t run underpowered studies.
4. Regression: Adjust for confounders, check assumptions, use robust SEs.
5. Bayesian: Direct probability statements, incorporate prior knowledge.
“Good statistics makes good evaluation possible. Good evaluation makes good decisions possible.”
Final summary. Reiterate the main points:
Statistics isn’t about formulas - it’s about thinking clearly under uncertainty.
Key mindset shifts: - From “is it significant?” to “how big is the effect?” - From “p<0.05” to “here’s the range of plausible effects” - From “ignore uncertainty” to “quantify and report uncertainty” - From “one test” to “pre-registered plan”
You don’t need to be a statistician to do good evaluation. But you do need: 1. Clear questions 2. Appropriate methods 3. Honest reporting 4. Humility about what you don’t know
These principles apply whether you’re evaluating a civic tech tool, a policy intervention, or your own organization’s work.
Thank everyone. Share slides and tools. Encourage follow-up questions via email/Slack.