What the Diff Already Knows
I had a number on my dashboard that tracked whether my agents’ output held - files committed in one session and re-edited within a week in a later session, weighted by edit frequency. It was the most useful signal in my agentic workflow tracking, and it was blunt enough to bother me.
The commit diffs already contained everything I needed to sharpen it. I just hadn’t thought to read them.
My monitoring system already syncs commits from GitHub - it pulls the metadata, ties commits to clusters of agentic coding sessions, and feeds the dashboard that tracks the churn metrics from the previous post. When a new commit arrives, the system has the diff. And the diff is the evidence. An LLM reading it cold - with no memory of the decisions that produced the code - can see what changed, infer the intent from the structure of the changes, and classify the nature of the work. The classification is a natural extension of a sync pipeline that already exists.
The churn metric I built first did one thing well - it surfaced files that kept getting re-edited across sessions within a short window. But it had specific blind spots that became harder to ignore the more I used it.
It couldn’t distinguish planned iteration from failure. A feature that legitimately spans two sessions and a specification that didn’t hold both look identical - same file, edited again within a week. The metric treated a documentation edit and a rework edit identically, because it only saw files, not intent. And it said nothing about the kind of work being done. I could see that rework was happening. I couldn’t see what else was happening alongside it - how much testing accompanied the feature work, how much was proactive restructuring versus reactive correction.
The churn metric could detect rework. It couldn’t tell me anything else about what was happening in the commit.
When the sync pipeline encounters a new commit, it fetches the diff and classifies the work across seven categories: feature, fix, refactor, rework, test, docs, ops. The taxonomy is a first pass - chosen quickly and likely to evolve as the data reveals which distinctions carry weight and which don’t. Percentages sum to 100%, rounded to the nearest 5%, weighted by effort and complexity rather than line count. Only categories with non-zero values get stored.
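The rounding rules interact: naive rounding to the nearest 5% can leave a total of 95% or 105%. One way to keep the invariant is a largest-remainder pass over 5% units. A minimal sketch - the function name and the raw-score input shape are assumptions, not the pipeline's actual code:

```python
def normalize_classification(raw: dict[str, float]) -> dict[str, int]:
    """Scale raw model scores to percentages that sum to exactly 100,
    rounded to the nearest 5%, dropping zero categories.
    A sketch of the stated rules, not the actual implementation."""
    total = sum(raw.values())
    if total == 0:
        return {}
    # Work in 5% units: 20 units == 100%.
    scaled = {k: v / total * 20 for k, v in raw.items()}
    floors = {k: int(s) for k, s in scaled.items()}
    shortfall = 20 - sum(floors.values())
    # Largest-remainder: hand the leftover 5% units to the biggest fractions,
    # so the total still sums to 100 after rounding.
    by_remainder = sorted(raw, key=lambda k: scaled[k] - floors[k], reverse=True)
    for k in by_remainder[:shortfall]:
        floors[k] += 1
    return {k: u * 5 for k, u in floors.items() if u > 0}
```

Keeping the sum exact matters mostly for the dashboard: stacked composition charts get visibly ragged when commits total 95% or 105%.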
A straightforward feature commit produces a classification like this:
feat: add session duration tracking to dashboard
→ feature=85% test=15%
A commit that revises yesterday’s implementation after realizing the interface didn’t fit:
fix(api): restructure session grouping logic
→ rework=60% fix=25% test=15%
Work composition across all seven categories over the past few months:
The most important design decision is the boundary between rework and refactor. Rework means revising something recently written - the specification was undercooked, the context was insufficient, or the approach didn’t survive contact with the next session’s requirements. Refactor means proactive restructuring of code that already works and is stable - improving the structure for what comes next, not correcting what came before. They look identical in a diff. They mean opposite things about the health of the process.
The classifying agent draws that line from the diff alone - it wasn’t there when the code was written, so it has no sunk-cost attachment to calling something “refactor” when it’s actually rework. A human tagging commits after the fact is doing retrospective categorization from memory, with all the rationalization that implies. The agent has no memory to distort. It reads the diff, sees that yesterday’s interface got restructured, checks git history for recency, and classifies accordingly.
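The recency check reduces to a small decision rule: restructuring of recently written code is rework, restructuring of stable code is refactor. A sketch under assumptions - the seven-day threshold (chosen here to match the churn window) and the function names are mine, and real last-written timestamps would come from git history rather than a dict:

```python
from datetime import datetime, timedelta

# Assumed threshold, mirroring the one-week churn window from the post.
RECENCY_WINDOW = timedelta(days=7)

def rework_or_refactor(last_written: dict[str, datetime],
                       commit_time: datetime) -> str:
    """Classify a restructuring commit by the recency of the files it touches.

    last_written maps each touched file to when its current content was
    last substantially written (e.g. derived from git log). A sketch of
    the decision rule, not the classifier's actual logic.
    """
    recent = [f for f, t in last_written.items()
              if commit_time - t <= RECENCY_WINDOW]
    # Revising something recently written is rework; restructuring
    # stable code is proactive refactoring.
    return "rework" if recent else "refactor"
```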
The classifications aren’t ground truth - they’re one model’s reading of structural evidence, and I haven’t tested how consistent they are across runs or model versions. But the bar isn’t perfection. It’s whether the signal is sharper than what churn alone provides, and so far the answer is obviously yes.
Both signals now live on the same dashboard - the old churn metrics alongside the new classification - because they’re measuring different things about the same commits. Churn gives a backward-looking behavioral signal - did the same files get re-edited within a window? The classification gives a forward-looking intentional signal - what kind of work does the diff represent?
When they agree - low churn paired with low reported rework - confidence is high. The output is holding and the agent is classifying it that way. When they diverge, that’s where the interesting questions live. High reported rework but low churn might mean the rework happened within a single session, invisible to the file-level metric. Low reported rework but high churn might mean the agent is classifying planned iteration as feature work when the file-level signal says otherwise.
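The four quadrants can be made explicit. A sketch - the readings are the interpretations above, while what counts as "high" for each signal is a threshold left to the caller:

```python
def read_signals(churn_high: bool, rework_high: bool) -> str:
    """Map the churn/classification quadrants to their readings.
    The wording follows the post's interpretation; the boolean
    thresholds are assumptions left to the caller."""
    if not churn_high and not rework_high:
        return "holding: output is stable and classified that way"
    if churn_high and rework_high:
        return "agreement: rework is happening and both signals see it"
    if rework_high:
        return "rework within a single session, invisible to file-level churn"
    return "possible planned iteration misclassified as feature work"
```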
The classification data is a few months old now - enough to see patterns in the noise, not enough to draw firm conclusions. But the parallel period is itself the point. Running both metrics against the same commit history builds a cross-validation baseline. When enough data accumulates, the places where the two signals agree and disagree will tell me something about the reliability of each.
An earlier version embedded the classification in the commit message itself, but my orchestration plugin - which spawns agents across worktrees - wasn’t using the commit skill. Every orchestrated commit was missing the classification. The coverage problem pointed to where it actually belongs: in the system that cares about the metrics, not in the commit path.
The richer signal goes beyond rework. Testing percentage becomes a trackable metric over time - how much testing accompanies feature work in practice, not in aspiration. The feature-to-fix ratio across a project’s lifecycle tells a story about whether early architecture decisions are holding or accumulating correction debt. Docs investment, which churn couldn’t see at all, becomes visible.
The testing composition is the one I’m watching most closely - whether a consistent testing allocation in commits correlates with lower downstream rework. If it does, that’s a signal that can feed back into the agents’ work prompts. Not a mandate - a calibration.
Over a project’s lifecycle, the composition should shift - feature-heavy early, testing and refactor growing as the codebase matures, rework spiking mid-project when specifications are still being discovered and declining as the planning process learns what level of detail holds up.
File-level churn against classified rework over the same period - their divergences reveal what each signal misses:
Every commit becomes a data point about the process, not just the product. The diff already knew that - it just needed a reader with no memory.