When fewer commits don’t mean less work

AI coding tools appear to change commit behaviour, which makes commit counts a shaky productivity metric.

AI · research · data-science

Published May 7, 2026

I have been vibe coding a lot. Like, a lot a lot… and it’s been leaving me feeling a bit conflicted. I’ve been committing hard, and feeling tremendously productive. But I also get this weird flash of existential dread when I catch myself furiously hitting AGAIN AGAIN AGAIN like I’m 3 Mai Tais deep at a Vegas slot machine, until I suddenly find myself staring at a codebase I’ve never seen, like some groggy father waking up from a thirty-year coma to kids he barely recognises.

I feel like a goddamn engineering hero. But am I?

The research has left me confused, and a teeny bit sceptical. The big daddy of AI coding research, METR, found that AI tools can slow experienced developers down on mature codebases… but those are experienced programmers working on codebases they know, which is obviously nothing like me. Then Answer.AI published their analysis, showing that despite all the excitement, PyPI productivity really did not seem to be shifting much… and that one threw me for a spin. I am coding more. I am releasing more stuff. So what the hell is going on?

So I thought I’d check the obvious source of truth on what we are all actually doing: GitHub. I scraped an awful lot of code, and looked for how people who use AI coding tools differ.

Here’s my working theory: AI really is making me commit hard. In fact, it has me committing furiously, more frequently, and in tighter bursts. But that does not necessarily mean I am producing more. It may mean AI is changing how I work, in a way that is extremely seductive. It is a sneaky little trap, and I worry I’ve fallen headfirst into it.

So I tried to figure out what makes coders who use AI different. Do they code longer? Do they commit more? Do they work in tighter bursts?

It turns out the pattern is not just “more commits” or “fewer commits”. It is a rhythm change. AI adopters move toward more concentrated coding sessions: more commits per active week, shorter gaps between commits, but fewer active weeks overall.
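For concreteness, those three rhythm metrics fall straight out of raw commit timestamps. Here is a minimal sketch, assuming a `commits` DataFrame with one row per commit and columns `account` and `ts`; all the names are stand-ins, not the real pipeline.

```python
import pandas as pd

def rhythm_metrics(commits: pd.DataFrame) -> pd.DataFrame:
    commits = commits.sort_values(["account", "ts"])
    commits["week"] = commits["ts"].dt.to_period("W")
    # Hours between consecutive commits by the same account.
    commits["gap_h"] = (
        commits.groupby("account")["ts"].diff().dt.total_seconds() / 3600
    )
    out = commits.groupby("account").agg(
        n_commits=("ts", "size"),
        active_weeks=("week", "nunique"),
        median_gap_h=("gap_h", "median"),
    )
    # "Tighter bursts": more commits packed into each active week.
    out["commits_per_active_week"] = out["n_commits"] / out["active_weeks"]
    return out
```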

Account-level before/after behaviour. Points show group means; whiskers show bootstrapped 95% confidence intervals over accounts.
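If you are wondering about the whiskers, the bootstrap is the boring kind: something like this, assuming `deltas` is a NumPy array holding one before/after change per account.

```python
import numpy as np

def bootstrap_ci(deltas: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Resample accounts with replacement; the CI is over group means.
    means = [
        rng.choice(deltas, size=deltas.size, replace=True).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(means, [2.5, 97.5])
```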

I don’t think this means we are automatically more productive. I think we are working differently, and that difference is tricking our brains.

Picking out the vibe coders

To study this at scale, I built a classifier that assigns each GitHub account a rough probability of AI coding-tool adoption from public behavioural signals.

The training sample is 276 GitHub accounts: 74 confirmed AI adopters, identified through CLAUDE.md files or Co-Authored-By: Claude trailers, and 202 controls with active commit histories before and after late 2023 and no AI markers. The classifier uses 43 behavioural features: commit cadence, message length and structure, PR body patterns, and inter-commit timing. It explicitly excludes the labelling artefacts themselves.
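The labelling step itself is nothing fancy. Roughly this, where the marker strings are the real ones from above and the function and its inputs are hypothetical:

```python
# Marker-based labelling: these markers are only used for labels,
# never as classifier features.
def is_confirmed_ai_adopter(commit_messages: list[str], repo_files: list[str]) -> bool:
    has_claude_md = any(path.endswith("CLAUDE.md") for path in repo_files)
    has_trailer = any("Co-Authored-By: Claude" in msg for msg in commit_messages)
    return has_claude_md or has_trailer
```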

The random forest model I ended up with classifies accounts correctly around 94% of the time. Remove all markers from the content (so it’s working purely from patterns of activity) and it still gets it right about 91% of the time.
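In sklearn terms, the fit looks roughly like this. The `X` and `y` below are random stand-ins shaped like the real sample; the ~94% figure comes from the real behavioural features, not from anything you could reproduce with this snippet.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: 276 accounts x 43 behavioural features,
# 74 adopters and 202 controls, as in the training sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(276, 43))
y = np.array([1] * 74 + [0] * 202)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```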

Thankfully, Claude Code is not the only tool that leaves obvious traces. Aider does too. When I run the classifier on users of Aider, a tool it was never trained on, it flags them as AI-tool users 73% of the time, compared to just 3% for people not using any AI coding tools.

So whatever the classifier is detecting, it is not just Claude-specific style. It looks like a broader “this developer’s behaviour looks AI-assisted” signal.

If you want the methodology in detail, the paper is here. For this post, the key point is simpler: I can assign an approximate AI-adoption score to public GitHub accounts, then ask whether higher adoption lines up with different activity metrics.

So what about countries?

So, we know at an individual level that AI users seem to work differently. But what about entire countries? If all of China has indeed gone OpenClaw-crazy, are they pushing endless agentic products? I was hoping this might give me a signal about what was going on.

It turns out it is just as puzzling.

With per-account AI scores in hand, I aggregated them by country. Take the 4,824 scored accounts, group them by country, and use the mean classifier score as a country-level AI adoption measure.
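The aggregation itself is a one-liner, assuming a `scores` DataFrame with illustrative column names:

```python
# One row per scored account: `account`, `country`, and `ai_score`
# (the classifier probability). Mean score per country is the
# country-level adoption measure.
country_adoption = (
    scores.groupby("country")["ai_score"]
    .agg(mean_ai_score="mean", n_accounts="size")
    .reset_index()
)
```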

This is the weaker part of the analysis, and I want to be upfront about that. Country-level GitHub Archive data is noisy. The 2024 cross-section is thin. The adoption measure inherits whatever biases the classifier has. So I do not read this as clean evidence that AI adoption causes country-level productivity changes.

I read it as a warning sign about measurement.

For the dependent variables, I used a GitHub Archive panel from 2022 to 2024 with two common activity measures: commits per active developer and pull requests per active developer. If both were clean productivity metrics, you would expect them to broadly tell the same story.

They do not.
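For reference, the two measures come out of an event-level panel roughly like this; the schema below is a simplification of what GH Archive actually provides.

```python
# Sketch of the two dependent variables, assuming `events` has columns
# `country`, `year`, `actor`, and `kind` ("commit" or "pr").
panel = events.groupby(["country", "year"]).agg(
    devs=("actor", "nunique"),
    commits=("kind", lambda k: (k == "commit").sum()),
    prs=("kind", lambda k: (k == "pr").sum()),
)
panel["commits_per_dev"] = panel["commits"] / panel["devs"]
panel["prs_per_dev"] = panel["prs"] / panel["devs"]
```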

2024 country-level cross-section. Each circle is a country, sized by number of developers in the GH Archive panel. Lines are unweighted OLS fits. Treat this as an illustrative aggregate puzzle, not clean causal evidence.

On the left: commits per developer in 2024, plotted against the country’s mean classifier-derived AI adoption score. Higher adoption, lower commits. The slope tilts down.

On the right: pull requests per developer, same countries, same year, same adoption measure. The slope is slightly positive and statistically indistinguishable from zero.
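Both fits are plain unweighted OLS, as the figure caption says. In code, roughly, with `df` as the country-level frame from above and illustrative column names:

```python
import statsmodels.formula.api as smf

m_commits = smf.ols("commits_per_dev ~ mean_ai_score", data=df).fit()
m_prs = smf.ols("prs_per_dev ~ mean_ai_score", data=df).fit()
print(m_commits.params["mean_ai_score"])  # negative slope
print(m_prs.params["mean_ai_score"])      # ~zero, slightly positive
```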

Same countries. Same year. Same adoption measure. Different activity metric, different story.

That is the point. I do not think the country result is strong enough to carry a claim about national productivity. But it is exactly what you would expect to see if commit counts are partly measuring workflow granularity rather than output.

Coefficient on the country-level AI adoption rate across illustrative specifications. Whiskers are ±1 standard error. Commits estimates cluster on the negative side; PR estimates straddle zero.

So, does any of this make sense?

So, you’re confused. I am too. But I think I can make some educated guesses about what is going on here.

AI tooling changes how you work, but not necessarily how much you ship

This might just be because we are looking at individuals. Maybe the truly transformative agentic work is already happening somewhere else, higher up the stack, in workflows that GitHub commits will never capture properly.

But beware the siren temptress of Claude Code YOLO. No matter her dulcet singing tones, you are not necessarily the hot-shit programmer she is making you out to be.

The behavioural evidence says something subtler: AI users work in a different rhythm. More concentrated bursts. Shorter gaps between commits. Fewer active weeks. That can feel like momentum, and maybe sometimes it is. But it can also make the old metrics lie.

Agentic coders really are built different

The classifier may be picking up something other than AI adoption. A pre-period placebo test on the training data showed that 6 of 8 features differ significantly between confirmed AI adopters and controls before AI tools existed in their current form. The classifier captures some combination of “uses AI tools” and “is the sort of person who does vibe coding”.
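If you want the flavour of that placebo check: compare pre-period feature distributions between the two groups. A sketch, using Mann-Whitney U as one reasonable choice of test rather than necessarily the one in the paper:

```python
from scipy.stats import mannwhitneyu

# `pre` holds pre-2023 feature values per account, with a binary
# `adopter` column; `feature_cols` is a hypothetical list of
# feature column names.
for feat in feature_cols:
    a = pre.loc[pre["adopter"] == 1, feat]
    b = pre.loc[pre["adopter"] == 0, feat]
    _, p = mannwhitneyu(a, b)
    if p < 0.05:
        print(f"{feat}: already differs pre-AI (p={p:.3f})")
```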

I think these might just be very different populations.

Some developers might commit less frequently per unit of work for reasons that have nothing to do with AI. Some may be more deliberate. Some may be more senior. Some may just have different habits. I think it is going to take a while longer before we can separate those things properly.

So, here we are.

I should add: there is loads of stuff I am not sure of. This is a noisy, self-selective, observational analysis. The data is far from ideal, and generalising from Claude Code to Aider may tell me diddly squat about agentic coding more broadly. But I kind of think it does tell us something.

Claude Code might be making us feel like heroes. I am not actually sure that it is making individual programmers push out features any faster. It sure feels like it though, and that is one damn appealing trap to fall into.

The next version of this needs cleaner task-level outcomes: PR merge time, CI success, issue resolution, feature ship time. Until then, I am treating commit counts as evidence about workflow shape, not productivity.


If you have seen this commits/PRs split in your own data, I would like to compare notes. The repo for the analysis is here on GitHub, and the formal paper version lives over here.

An awful lot of the analysis for this work was done by my Hermes Agent. I’m mostly confident, but also, if I’ve cocked it up, I’m very sorry.