Abstract
We study the effect of AI coding tool adoption on developer productivity using two complementary empirical designs. First, we build a behavioural classifier that identifies AI coding tool users from observable commit history — without relying on explicit self-reported adoption or proprietary telemetry. The classifier achieves cross-validated AUC of 0.940 on a sample of 235 GitHub accounts and generalises to users of a second tool (Aider) it was never trained on, suggesting it detects genuine changes in development tempo rather than tool-specific stylistic artefacts.
Second, we use the classifier in two causal designs. An account-level difference-in-differences on 235 accounts (33 confirmed adopters, 202 controls) finds large, statistically significant effects: AI adopters increase commits per active week by 13.1 (p < 0.001) and reduce inter-commit hours by 275 (p < 0.001) relative to controls, consistent with AI assistance reducing friction in the development loop. A country-level panel regression across 20 countries (2022–2024) finds no significant effect (coef = −4.91, SE = 6.13, p = 0.43), a null result we attribute primarily to measurement noise in country-level productivity aggregates rather than the absence of an underlying effect.
The classifier methodology is a contribution independent of the productivity findings: it demonstrates that AI tool adoption can be detected at scale from public commit behaviour, opening possibilities for non-survey measurement of AI adoption across the developer population.
Working paper. Code and data: github.com/AndreasThinks/ai-productivity-analysis. A more accessible write-up is available as a blog post.
Introduction
The rapid diffusion of AI coding assistants since late 2022 has prompted widespread speculation about their effects on software developer productivity. Measuring these effects empirically is difficult: AI tool usage is largely invisible in public data, selection into adoption is severe, and the appropriate unit of analysis is contested.
This difficulty is illustrated by recent aggregate analyses. Gallagher and Dimmendaal (2026) examine PyPI package creation and update rates and find no broad productivity boom attributable to AI tools — the exception being packages about AI, which they attribute to increased funding flows rather than developer productivity gains. Our results are consistent with their aggregate null, but suggest the picture is more nuanced at the individual level.
This paper addresses the measurement problem directly. We construct a behavioural classifier that identifies AI coding tool users from observable signals in public GitHub commit history — temporal patterns, commit cadence, message structure — without requiring self-reported adoption data or proprietary telemetry. We validate the classifier on a held-out set and on users of a second tool (Aider) the classifier was not trained on, establishing that it detects general AI-assisted coding behaviour rather than tool-specific patterns.
We then deploy the classifier in two causal designs. An account-level difference-in-differences compares behavioural changes in confirmed AI adopters to matched controls over the same period. A country-level panel regression uses per-country classifier-derived adoption rates as an instrument for the productivity effect at national aggregates.
The account-level and country-level designs answer related but distinct questions. The account-level design asks whether individual developers who adopt AI tools change their behaviour. The country-level design asks whether countries with higher aggregate adoption rates show higher productivity growth — less subject to selection bias but more exposed to measurement noise.
We find strong evidence for the account-level effect and a null result at the country level, which we interpret as consistent with a real individual-level effect that is too small, or the adoption window too short, to detect reliably in national aggregates.
Data
GitHub Archive
All data derive from GitHub Archive, a public record of GitHub activity events available from 2011 onward. We use three samples:
Classifier training sample. 12 hourly windows spanning November 2024, January 2025, and March 2025, yielding approximately 380,000 unique active developer accounts. From this pool we identify ground-truth positive accounts using explicit repository artefacts: CLAUDE.md files, .claude/ directories, or Co-Authored-By: Claude commit trailers. We scrape full commit and pull request history for each account via the GitHub REST API.
Productivity panel. 9 quarterly hourly windows from Q4 2022 through Q4 2024, sampling 500 active developers per window. User profile locations are mapped to ISO 3166-1 alpha-2 country codes and productivity metrics are aggregated by country and quarter (347 country-quarter observations, 54 countries).
Population scoring sample. 887 GitHub accounts with parseable location fields, scored by the trained classifier to yield per-country AI adoption rates. Countries with at least 15 scored accounts (20 countries) are used in the country-level regression.
Ground Truth Labels
Positives are confirmed via GitHub Code Search (filename:CLAUDE.md) and GH Archive co-author trailer scan. We assign marker_confidence = high to co-author trailer accounts (adoption timestamp known) and marker_confidence = low to Code Search accounts. Of 33 positives in the final training set, 25 are high-confidence.
Negatives are randomly sampled from GH Archive, filtered to accounts with commit activity in both pre-period (Jan 2022 – Dec 2023) and post-period (Jan 2024+) and zero AI markers across full history.
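The marker logic can be sketched as a small helper. This is illustrative only (function and field names are hypothetical; the actual pipeline scrapes full history via the GitHub REST API and Code Search):

```python
import re

# Co-author trailer lines as they appear in commit message bodies.
TRAILER_RE = re.compile(r"^Co-Authored-By:\s*Claude", re.IGNORECASE | re.MULTILINE)
MARKER_FILE = "CLAUDE.md"
MARKER_DIR = ".claude/"

def label_account(commit_messages, touched_paths):
    """Return (is_positive, marker_confidence) from explicit repo artefacts.

    Trailer accounts get 'high' confidence (the adoption timestamp is known
    from the first trailer commit); file/path markers alone get 'low'.
    """
    has_trailer = any(TRAILER_RE.search(msg) for msg in commit_messages)
    has_path_marker = any(
        p == MARKER_FILE or p.startswith(MARKER_DIR) for p in touched_paths
    )
    if has_trailer:
        return True, "high"
    if has_path_marker:
        return True, "low"
    return False, None
```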
Summary Statistics

| | Variable | AI mean | AI SD | Ctrl mean | Ctrl SD |
|---:|:-------------------------|----------:|--------:|------------:|----------:|
| 1 | Pre-period commits | 81.9 | 85.4 | 79.4 | 94.3 |
| 2 | Post-period commits | 54.0 | 53.4 | 92.8 | 132.0 |
| 3 | Pre active weeks | 7.8 | 8.1 | 13.2 | 11.1 |
| 4 | Post active weeks | 3.8 | 5.2 | 17.7 | 16.7 |
| 5 | Pre commits/active week | 13.7 | 13.6 | 6.6 | 4.7 |
| 6 | Post commits/active week | 23.5 | 24.6 | 5.3 | 4.1 |
| 7 | Pre inter-commit hours | 281.1 | 664.6 | 180.5 | 206.2 |
| 8 | Post inter-commit hours | 57.7 | 125.9 | 324.9 | 373.0 |
Behavioural Classifier
Design Rationale
The central methodological challenge is identifying AI tool users without relying on explicit markers (rare, biased toward power users) or surveys (costly, subject to recall bias). We exploit the hypothesis that AI assistance changes how developers work — reducing friction in the commit loop — in ways detectable from public commit histories.
Critical design constraint: explicit artefacts used to identify ground truth cannot also be classifier features. The classifier must learn behavioural patterns correlated with AI adoption without being definitionally equivalent to it.
Features
We extract 43 behavioural features per account across three categories:
- Message and documentation (15 features): commit message length, multiline fraction, conventional commit fraction, test mentions, PR body length, PR body rate
- Temporal and activity patterns (15 features): active weeks, commits per active week, inter-commit hours, burst commit fraction
- Temporal change features (15 features, Δ = post − pre): difference in each of the above between pre and post periods
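A minimal sketch of how the temporal and change features might be computed from commit timestamps (illustrative function names, not the paper's code):

```python
from datetime import datetime
from statistics import mean

def tempo_features(timestamps):
    """Temporal features from a chronologically sorted list of commit datetimes."""
    weeks = {ts.isocalendar()[:2] for ts in timestamps}  # distinct (year, week) pairs
    active_weeks = len(weeks)
    commits_per_active_week = len(timestamps) / active_weeks if active_weeks else 0.0
    # Gaps between consecutive commits, in hours.
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(timestamps, timestamps[1:])]
    inter_commit_hours = mean(gaps) if gaps else 0.0
    return {
        "active_weeks": active_weeks,
        "commits_per_active_week": commits_per_active_week,
        "inter_commit_hours": inter_commit_hours,
    }

def delta_features(pre, post):
    """Change features: post-period value minus pre-period value per feature."""
    return {f"delta_{k}": post[k] - pre[k] for k in pre}
```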
Performance
| | Model | CV AUC | Ablation AUC | Drop |
|---:|:--------------------|:--------------|---------------:|-------:|
| 1 | Logistic Regression | 0.906 ± 0.060 | 0.896 | -0.010 |
| 2 | Random Forest | 0.940 ± 0.054 | 0.909 | -0.031 |
| 3 | Gradient Boosting | 0.898 ± 0.097 | 0.890 | -0.008 |
Ablation: all message/documentation features removed (21 of 43 features).
Random Forest selected as primary model.
The Random Forest achieves CV AUC of 0.940 ± 0.054. The writing-style ablation (removing all 21 message and documentation features) drops AUC by only 3.1 points to 0.909, confirming that the classifier detects genuine changes in development tempo rather than Claude’s distinctive commit message style.
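The CV-plus-ablation protocol can be reproduced in miniature on synthetic data. Everything below is invented for illustration (the column layout, labels, and signal structure are stand-ins); only the mechanics — cross-validated AUC on the full feature matrix, then again with the message/documentation block dropped — mirror the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 43 features per account; treat the first 21 columns
# as the message/documentation block to be ablated.
X = rng.normal(size=(200, 43))
# Label depends only on a temporal-style column, so the ablation should
# cost little AUC -- the pattern the paper's ablation is testing for.
y = (X[:, 21] + 0.3 * rng.normal(size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
full_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
# Writing-style ablation: drop the 21 message/documentation columns.
ablated_auc = cross_val_score(clf, X[:, 21:], y, cv=5, scoring="roc_auc").mean()
```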
Cross-Tool Generalisation


The classifier, trained exclusively on Claude Code ground truth, assigns Aider users a mean score of 0.727: statistically indistinguishable from the Claude training positives (Mann-Whitney p = 0.065) and far above negative controls (p < 0.0001). 80.6% of Aider accounts score above the 0.5 decision threshold, versus 90.9% of Claude positives and 0% of controls.
This cross-tool generalisation is the key validity result: the classifier detects general AI-assisted coding behaviour, not Claude-specific stylistic artefacts. The independent variable in the causal analysis is therefore interpretable as a measure of AI-assisted coding broadly.
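The group comparisons use standard Mann-Whitney U tests. A sketch on synthetic scores (the beta distributions and sample sizes are stand-ins, not the paper's data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Synthetic classifier scores for three groups: Claude training positives,
# held-out Aider users, and negative controls (sizes are illustrative).
claude_scores = rng.beta(8, 2, size=33)
aider_scores = rng.beta(8, 2, size=36)    # drawn from a similar distribution
control_scores = rng.beta(2, 8, size=100)

# Aider vs Claude positives: similar distributions, expect a non-tiny p.
_, p_vs_claude = mannwhitneyu(aider_scores, claude_scores)
# Aider vs controls: well separated, expect a tiny p.
_, p_vs_controls = mannwhitneyu(aider_scores, control_scores)

# Share of Aider accounts past the 0.5 decision threshold.
frac_above_threshold = (aider_scores > 0.5).mean()
```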
Causal Designs
Account-Level Difference-in-Differences
Estimator. For each outcome Y we estimate:
\[\Delta Y_i = \alpha + \beta \cdot \text{Treatment}_i + \gamma \cdot Y^{\text{pre}}_i + \varepsilon_i\]
where \(\Delta Y_i\) is the within-account change, Treatment\(_i = 1\) for AI adopters, and \(Y^{\text{pre}}_i\) controls for baseline differences (Angrist-Pischke regression adjustment). Standard errors are HC3 robust. The coefficient \(\beta\) estimates the average treatment effect on the treated.
Identifying assumption. Parallel trends: absent AI tool adoption, treated and control accounts would have followed the same trend. Significant pre-period differences between groups indicate selection; the regression adjustment partially but not fully addresses this.
Country-Level Panel Regression
IV construction. For each country c with at least 15 scored accounts, the AI adoption variable is:
- Pre-2024: pct_ai_users = 0 (pre-launch baseline for all countries)
- 2024: pct_ai_users = mean post-period classifier score for country c
This gives cross-country variation in 2024 treatment intensity with pre-treatment held at zero.
Estimator. PanelOLS with country and time fixed effects, clustered SE by country. DV: log(commits_per_dev + 1).
Three specifications: (A) Oxford Insights AI Readiness Index as IV (Phase 1 baseline); (B) global mean classifier score in 2024 (broken time proxy, reported for reference); (C) per-country classifier scores from population sample (primary).
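Specification C can be sketched as a two-way fixed-effects regression. The dummy-variable OLS below is numerically equivalent to PanelOLS with entity and time effects; the data are a synthetic null panel and all names are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
countries = [f"C{i:02d}" for i in range(20)]
quarters = [str(q) for q in pd.period_range("2022Q1", "2024Q4", freq="Q")]

rows = []
for c in countries:
    adoption = rng.uniform(0.06, 0.11)  # per-country classifier score, 2024 only
    for q in quarters:
        pct_ai = adoption if q.startswith("2024") else 0.0  # zero pre-launch
        y = np.log1p(rng.poisson(30)) + rng.normal(0.0, 0.1)  # null DGP
        rows.append({"country": c, "quarter": q, "pct_ai_users": pct_ai, "y": y})
df = pd.DataFrame(rows)

# Country and quarter dummies give the two-way fixed effects;
# standard errors are clustered by country.
fit = smf.ols("y ~ pct_ai_users + C(country) + C(quarter)", data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": df["country"].astype("category").cat.codes},
)
coef = fit.params["pct_ai_users"]
```

Because pct_ai_users varies across countries within 2024, it is not collinear with the time fixed effects; a global 2024 mean (Spec B) would be, which is why that specification is reported for reference only.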
Results
Per-Country AI Adoption Rates


Mean classifier-derived adoption rates range from 6.3% (Italy) to 10.7% (Netherlands), with a cross-country standard deviation of 1.4 percentage points. English-speaking and northern European countries show the highest adoption rates; East Asian and southern European countries the lowest. The narrow cross-country range (4.4 pp) is a key limitation of the country-level analysis.
Account-Level Diff-in-Diff


| | Outcome | Coef | SE | 95% CI | p | Sig |
|---:|:-----------------------|---------:|:---------|:---------------------|:-------|:------|
| 1 | Commits / active week | 13.073 | (3.073) | [7.049, 19.096] | <0.001 | *** |
| 2 | Inter-commit hours | -275.258 | (37.647) | [-349.046, -201.471] | <0.001 | *** |
| 3 | Active weeks | -11.253 | (1.714) | [-14.611, -7.894] | <0.001 | *** |
| 4 | Message length (chars) | 54.259 | (26.125) | [3.055, 105.464] | 0.0378 | * |
| 5 | PR has body (frac) | 0.322 | (0.086) | [0.153, 0.490] | 0.0002 | *** |
| 6 | Test co-write rate | 0.144 | (0.062) | [0.023, 0.265] | 0.0196 | * |
| 7 | Conventional commits | 0.076 | (0.049) | [-0.021, 0.173] | 0.1228 | |
*** p<0.001 ** p<0.01 * p<0.05 | HC3 robust SE
Commits per active week increases by 13.1 (p < 0.001) and inter-commit hours decreases by 275 (p < 0.001). The combination of fewer active weeks (−11.3, p < 0.001) with higher commit intensity per session is consistent with AI assistance concentrating productive output into shorter, more intense coding sessions.
Results are robust to restricting the treated group to 25 high-confidence accounts: point estimates are larger (commits/week +15.7, inter-commit hours −335) and significance is maintained throughout.
Country-Level Panel Regression
| | Spec | IV | N | Countries | Coef | SE | p |
|---:|:------------------------|:-----------------------------|----:|------------:|:-------|:------|------:|
| 1 | A — Phase 1 Baseline | Oxford Insights AI Readiness | 88 | 51 | 0.067 | 0.090 | 0.462 |
| 2 | B — Time proxy (broken) | Global mean score 2024 | 111 | 52 | — | — | 0.998 |
| 3 | C — Primary | Per-country classifier score | 59 | 20 | -4.911 | 6.135 | 0.429 |
DV: log(commits_per_dev + 1). Country + time FE. Clustered SE by country.
Spec B collinear with time FE — reported for reference only.
Regression C (the primary specification) yields coef = −4.91 (SE = 6.13, p = 0.43), not statistically significant. The null is attributable to (a) measurement noise in the outcome variable (median 2 developers per country-year), (b) narrow cross-country IV variation (4.4 pp range), and (c) the compressed one-year post-treatment window.
Discussion
Reconciling Account-Level and Country-Level Results
The strong account-level effects alongside a country-level null are consistent under three mechanisms, likely operating jointly.
Measurement noise. The productivity panel has median 2 located developers per country-year. The individual-level effect (+13 commits/active week) is large on a within-person comparison; across 2 developers in a country-year, variance dominates.
Narrow IV variation. The 4.4 pp cross-country range in adoption rates gives the panel regression limited power to identify the slope even absent outcome noise.
Short post-treatment horizon. Country-level productivity effects of technology adoption typically take several years to manifest, as diffusion propagates beyond early adopters. The account-level effect is visible immediately at the individual level; country aggregates require broader diffusion than had occurred by 2024.
These results are broadly consistent with Gallagher and Dimmendaal (2026), who find no broad PyPI productivity boom. Our individual-level result adds nuance: the effect is real but concentrated among confirmed adopters, and too diluted by uneven adoption to show up in ecosystem-wide metrics within the first two years.
The Classifier as a Measurement Tool
A finding independent of the productivity question: AI coding tool adoption is detectable from public commit behaviour at AUC 0.940, generalising across tools. This is a scalable, non-survey measure applicable retroactively to any period for which GitHub data is available, with no direct inspection of message content required (AUC 0.909 using activity features only).
Limitations
- Selection bias. Treated accounts identified via explicit markers are likely power users, not representative of average adopters. Effect sizes are plausibly upper bounds.
- Parallel trends. Pre-period differences between treated and controls indicate selection; regression adjustment is partial.
- Temporal confound. Post-2024 behaviour reflects both adoption and tool capability improvements.
- Location coverage. ~21% of GitHub users have parseable location fields, skewing toward more active developers.
- Panel thinness. Median 2 developers per country-year in the productivity panel.
Conclusion
We make two contributions. First, a behavioural classifier for AI coding tool adoption that achieves CV AUC 0.940, generalises across tools, and requires only public commit data. Second, an account-level diff-in-diff finding large, robust effects of AI tool adoption on development tempo and documentation quality, alongside a country-level null we attribute to measurement limitations rather than absence of an underlying effect.
The most productive direction for future work is a larger productivity panel (5,000+ users per quarterly window) combined with a longer post-period (2024–2026), which would substantially increase power to test whether individual-level effects aggregate to national productivity measures.
References
Angrist, J. and Pischke, J.S. (2009). Mostly Harmless Econometrics. Princeton University Press.
Gallagher, A. and Dimmendaal, R. (2026). So where are all the AI apps? Answer.AI. https://www.answer.ai/posts/2026-03-12-so-where-are-all-the-ai-apps.html
Oxford Insights (2023). Government AI Readiness Index 2023. Oxford Insights.