What the Numbers Show

Tags: AI-assisted · AI engineering methodology

I’ve been tracking my machine time. Not in the abstract - literal machine hours, measured by hooks in my Claude Code setup that fire events to an API whenever a session starts, a tool executes, or a session ends. A git integration ties commits back to clusters of sessions on the same repo. The whole thing is maybe 150 lines of bash and an API endpoint, took an afternoon to build, and runs quietly in the background while I work.
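The shape of that pipeline can be sketched roughly. Claude Code hooks pass event JSON (including `session_id`, `cwd`, and `hook_event_name`) to a command on stdin; a handler only needs to reshape that into whatever the tracking API stores. This is a minimal TypeScript sketch, not the author's actual 150 lines of bash - the `TrackedEvent` shape and the endpoint are assumptions.

```typescript
// Hypothetical event shape the tracking API might store.
type TrackedEvent = {
  kind: "session_start" | "tool_use" | "session_end";
  sessionId: string;
  repo: string; // working directory, used later to tie commits back to sessions
  at: string;   // ISO timestamp
};

// Reshape the hook's stdin payload into a tracked event.
// The input fields mirror what Claude Code hooks provide on stdin.
function toTrackedEvent(
  hookPayload: { session_id: string; hook_event_name: string; cwd: string },
  now: Date = new Date(),
): TrackedEvent {
  const kind =
    hookPayload.hook_event_name === "SessionStart"
      ? "session_start"
      : hookPayload.hook_event_name === "SessionEnd"
        ? "session_end"
        : "tool_use";
  return {
    kind,
    sessionId: hookPayload.session_id,
    repo: hookPayload.cwd,
    at: now.toISOString(),
  };
}

// The hook command itself would read stdin and POST the result, e.g.:
//   const event = toTrackedEvent(JSON.parse(stdinText));
//   await fetch("https://api.example.com/events", {
//     method: "POST", body: JSON.stringify(event) });
```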

The ratio of machine hours to wall hours across my projects has been trending toward 1.5 - meaning for every hour of clock time on a project, the machines are running for about ninety minutes. It’s a parallelism indicator, not a production measure - it says nothing about what those machine hours produce, only that multiple agents are working simultaneously.
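Concretely, the ratio falls out of two different ways of summing session intervals: machine hours count overlapping sessions twice, wall hours count only the union of the intervals. A minimal sketch, with illustrative names rather than the actual implementation:

```typescript
type Interval = { start: number; end: number }; // hours on some common clock

// Machine hours: total session time, counting overlaps multiply.
function machineHours(sessions: Interval[]): number {
  return sessions.reduce((sum, s) => sum + (s.end - s.start), 0);
}

// Wall hours: length of the union of intervals - clock time during
// which at least one session was running.
function wallHours(sessions: Interval[]): number {
  if (sessions.length === 0) return 0;
  const sorted = [...sessions].sort((a, b) => a.start - b.start);
  let total = 0;
  let curStart = sorted[0].start;
  let curEnd = sorted[0].end;
  for (const s of sorted.slice(1)) {
    if (s.start > curEnd) {
      total += curEnd - curStart; // close the current merged interval
      curStart = s.start;
      curEnd = s.end;
    } else {
      curEnd = Math.max(curEnd, s.end); // extend it
    }
  }
  return total + (curEnd - curStart);
}

// Two overlapping one-hour sessions: 2 machine hours over 1.5 wall hours.
const sessions: Interval[] = [
  { start: 9.0, end: 10.0 },
  { start: 9.5, end: 10.5 },
];
console.log(machineHours(sessions) / wallHours(sessions)); // ≈ 1.33
```

A single session at a time keeps the two sums equal, pinning the ratio at 1.0; only genuinely concurrent sessions can lift it above that.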

And the ratio understates the actual leverage, because each machine-hour is itself already some multiple of what I’d accomplish alone. But estimating that multiple honestly is harder than it sounds. I’m a Scala engineer - fifteen years on the JVM - building fullstack TypeScript because the models are fluent in it. I’ve written virtually no TypeScript by hand. The leverage multiplier for routine scaffolding isn’t just large, it’s undefined - I don’t have a denominator. Even for architecture work closer to my experience, the per-session leverage is real and the 1.5 machine:wall ratio compounds on top of it. The precise number isn’t the point.

Sometimes I’m at the keyboard during that hour, working alongside the agents. Sometimes I’m not there at all - an expedition plan kicked off before lunch, agents executing across worktrees while I’m playing tennis or homeschooling my son. That felt like progress when I first noticed it. It still does. But the number hides more than it reveals.

[Charts: Wall vs Machine Time (daily hours over past 60 days); Machine:Wall Ratio (7-day trailing ratio, active days only)]

Right now I have five Claude Code terminal sessions open - some running, some waiting for my next instruction. That itself was an evolution. When I started, I ran one session at a time. Over months I learned to parallelize my own attention, hopping between projects and between modules within a project, maintaining progressive flow in each thread. This is a different kind of parallelism from the orchestrated handoff - it’s my concurrency, not just the machines’.

The bursts where the machine:wall ratio spikes above 2 happen when I fully hand off orchestrated execution of plans that were produced in earlier planning sessions. But the steady-state ratio around 1.5 reflects something subtler: multiple sessions running simultaneously because I’ve learned to hold multiple threads of work in my head at once.

On a given day I might have three projects in flight, each getting that kind of leverage. But even this dissolves under scrutiny, because one of those projects might be in a planning phase where everything is single-threaded and the ratio sits near 1:1, while another is in burst execution with agents working across git worktrees and the ratio spikes.

And planning - the work that produces the most parallelism downstream - is inherently sequential. Two hours designing the architecture and decomposing the work, then four agents execute simultaneously. The planning shows up as low throughput on any dashboard. The execution shows up as the spike. Neither tells the full story alone, and if I were watching only the ratio, I’d be rewarding the wrong phase of the work.


Dan Shapiro published a framework earlier this year that maps AI adoption across five levels - from “spicy autocomplete” up to the “dark factory” where no human writes or reviews code. I recognized my own trajectory in the middle levels, and the framework gave language to something I’d been circling: every level feels complete from the inside. Shapiro says this explicitly, and in conversations with engineering leaders I’ve seen the same thing - teams describing themselves as further along than their workflows suggest. Self-assessment is demonstrably unreliable. The question isn’t how productive someone feels. It’s what the data says about how they’re actually working.


There’s a cycle in the data I think of as planning-to-burst. In weeks where I spend more wall time reading, exploring, and writing specifications - sessions heavy on Read and Explore tools with few edits - the following sessions tend to be dense with parallel execution and high commit output. The planning sessions look slow on any throughput metric. They’re the highest-leverage hours of the week. An engineer who’s learned to invest in the planning phase before touching implementation is working differently from one who starts editing immediately, even if their machine:wall ratios are similar.

Session concurrency is a clear signal. When I started with Claude Code, all my work was a single terminal session - one agent, iterating together. Over months, I’ve shifted toward orchestration patterns that spawn multiple Claude Code sessions across git worktrees, each tracked independently. Four sessions running in parallel on the same project, each with its own event stream, each producing commits.

That shift - from working with AI to orchestrating AI - maps directly onto Shapiro’s progression from the middle levels toward the upper ones, and it shows up in the data as concurrent sessions that the API clusters into session groups. The measurement granularity is the terminal session, not the individual tool call, and that turns out to be the right level - it captures the deliberate parallelism that reflects how someone has decomposed the work, not the incidental parallelism happening inside the agent runtime.
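Clustering sessions into groups is essentially interval merging per repository: sessions on the same repo whose time ranges overlap (or nearly touch) share a group. A sketch under assumed names - the gap tolerance and data model are illustrative, not the API's actual logic:

```typescript
type Session = { id: string; repo: string; start: number; end: number };

// Assign each session a group id: same repo, overlapping or
// near-adjacent intervals (within gapHours) land in the same group.
function groupSessions(
  sessions: Session[],
  gapHours = 0.5,
): Map<string, number> {
  const groups = new Map<string, number>();
  let nextGroup = 0;

  // Bucket by repo, since concurrency across repos is a different signal.
  const byRepo = new Map<string, Session[]>();
  for (const s of sessions) {
    const list = byRepo.get(s.repo) ?? [];
    list.push(s);
    byRepo.set(s.repo, list);
  }

  for (const list of byRepo.values()) {
    list.sort((a, b) => a.start - b.start);
    let groupEnd = -Infinity;
    let groupId = -1;
    for (const s of list) {
      if (s.start > groupEnd + gapHours) {
        groupId = nextGroup++; // gap too large: start a new group
        groupEnd = s.end;
      } else {
        groupEnd = Math.max(groupEnd, s.end); // extend the current group
      }
      groups.set(s.id, groupId);
    }
  }
  return groups;
}
```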


The data suggests a progression. A ratio at 1.0 means one session at a time - the agent as pair programmer, which is where Shapiro estimates about 90% of self-described AI-native developers still operate. When concurrent sessions start appearing and the ratio lifts above 1.0, the developer is directing agents rather than writing alongside one. Further along, planning phases lengthen and execution happens in bursts across parallel sessions - the developer’s primary output becomes specifications and judgment, not code.


The progression above is entirely about throughput and parallelism. A team could look good on every one of those signals while producing code that keeps getting reworked. The machine:wall ratio says nothing about whether the work sticks.

Code churn is the usual way to measure rework - files modified and then modified again within a short window. But calendar-based churn is a blunt instrument in agentic workflows because it can’t distinguish planned multi-session work from rework. A file committed in one session and revised in a follow-up session the next day might mean the first session’s output didn’t hold - the specification was undercooked, the context was insufficient, or the task decomposition was wrong. Or it might mean the feature legitimately spans two sessions.

I started tracking whether files committed in one session got re-edited in a later session within a week - a rough proxy for whether the first session’s output actually held. Not every cross-session re-edit is rework - legitimate features span multiple sessions - but persistent revision of recently produced files is a signal worth watching. I want that number minimal and trending down - as models improve, as the harness gets tighter, as the planning process learns what level of specification produces output that survives to the next session.

Even that metric needed its own iteration. My first version treated all cross-session edits equally - a file touched once in a follow-up session counted the same as one touched seven times. But those are different signals. A single re-edit might be a legitimate extension; seven re-edits means the specification didn’t hold. Refining the metric to weight by edit frequency sharpened the signal, and the process of refining it was itself a case of the rework the metric was trying to measure - caused not by the agents but by my own insufficient research into what I was actually trying to capture.
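The frequency-weighted version can be sketched as follows, assuming each commit records its session, timestamp, and touched files - the data model and names are mine, not the actual pipeline's:

```typescript
type Commit = { session: string; time: number; files: string[] }; // time in days

// For each file first committed in some session, count edits from
// *later sessions* within windowDays; average re-edit count per file.
// Same-session follow-up edits don't count - only cross-session revision.
function reworkScore(commits: Commit[], windowDays = 7): number {
  const firstSeen = new Map<string, { session: string; time: number }>();
  const reEdits = new Map<string, number>();

  const sorted = [...commits].sort((a, b) => a.time - b.time);
  for (const c of sorted) {
    for (const f of c.files) {
      const first = firstSeen.get(f);
      if (!first) {
        firstSeen.set(f, { session: c.session, time: c.time });
        reEdits.set(f, 0);
      } else if (
        c.session !== first.session &&
        c.time - first.time <= windowDays
      ) {
        // Weight by frequency: each later-session edit adds to the count,
        // so one extension and seven revisions read as different signals.
        reEdits.set(f, (reEdits.get(f) ?? 0) + 1);
      }
    }
  }

  const files = [...firstSeen.keys()];
  if (files.length === 0) return 0;
  const total = files.reduce((s, f) => s + (reEdits.get(f) ?? 0), 0);
  return total / files.length; // average re-edits per produced file
}
```

An unweighted variant would cap each file's contribution at 1; letting the count accumulate is what separates a legitimate two-session feature from a specification that didn't hold.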

A rising machine:wall ratio paired with flat or declining session rework is the signal of genuine progress - more parallel execution, and the output holds up. A rising ratio paired with rising rework means the parallelism is producing work that doesn’t survive contact with reality. Agents stepping on each other, specifications undercooked, review too thin to catch problems before they compound. That pattern looks productive on any throughput dashboard. The rework data shows it isn’t.

The type of work complicates the signal further. I categorize my work into expeditions, excursions, and errands - ranging from heavily planned multi-session projects to quick iterative fixes. Errand-level work naturally produces more churn because the thrashing is the solution-crafting process, not a failure of planning. It’s the expedition and excursion work - where I invest in decomposition before burst execution - where rework signals something actually went wrong upstream. The metric doesn’t distinguish between these categories yet, and until it does, the interpretation requires context I carry in my head.

But the most useful thing the rework data has surfaced so far isn’t about the agents at all - it’s about me. Watching the numbers and tracing rework instances back to the sessions that produced them gives me something I didn’t have before - a feedback loop. I care more about planning now because I can see its real impact downstream, both when I invest in it and when I skip it.

The planning-to-burst cycle should predict this - sessions starting with thorough decomposition producing lower-rework output downstream, sessions that skip planning and go straight to parallel execution producing high throughput and high rework. I don’t have enough data yet to confirm the pattern, but it’s the hypothesis the measurement is designed to test.

[Chart: More Agents, Less Rework (7-day trailing ratio and session-relative rework rate)]

In a team context, these behavioral signals look less like a dashboard and more like a conversation prompt. The shifts are gradual - a machine:wall ratio lifting from 1.0 to 1.2 over three weeks as someone starts running concurrent sessions for the first time. Planning share increasing as the investment in specification starts to precede execution. An engineer whose edit-loop frequency is high - the same files getting revised repeatedly within a session - might be at a different point in the transition than one whose sessions are heavy on planning tools and light on edits. Session rework rate layered on top of the ratio sharpens the picture further - two engineers with identical machine:wall ratios look very different when one has 8% rework and the other has 25%. The data doesn’t diagnose anything, but it makes patterns visible that would otherwise stay hidden behind the same commit log.

An engineer whose ratio has sat at 1.0 for weeks might not have made the shift from writing alongside the agent to directing it - or might be deep in a planning phase that hasn’t reached execution yet. The same number means different things depending on where someone is in their work. The measurement doesn’t replace the conversation. It gives the conversation something concrete to start from.

The tools landscape is moving fast enough that today’s instrumentation might not apply to tomorrow’s harness. The measurement itself is scaffolding - useful during the specific window where a team is learning to work differently, not permanent infrastructure. What sticks after the scaffolding comes down isn’t the dashboard. It’s the instincts that developed while the data was making the transition visible.


Measurement can’t capture the moment someone’s relationship to the code shifts - when they stop thinking of themselves as the person who writes it and start thinking of themselves as the person who specifies what should exist. That shift is internal and gradual and doesn’t map cleanly to any metric. But its shadows are in the data. The numbers don’t cause the shift. They’re how you notice it’s underway.