<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Mark Schaake</title><description>Thoughts on building software with AI, the shifting economics of tech work, and how people think about politics and culture.</description><link>https://www.markschaake.com/</link><item><title>Eval: eforge harness backends</title><link>https://www.markschaake.com/posts/eval-eforge-harness-backends/</link><guid isPermaLink="true">https://www.markschaake.com/posts/eval-eforge-harness-backends/</guid><description>Two eforge backends under identical model routing. Token cost replicates across runs. Behavioral metrics don&apos;t - and a post written after one run would have been wrong.</description><pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h2&gt;Setup&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://eforge.build&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Eforge&lt;/a&gt; is an open source agentic build system. It sizes work, decomposes plans, spawns agents, arbitrates review, and delegates execution to a configurable backend. Backend choice is open - the Claude Agent SDK, pi against any supported provider, or any other harness implementation that conforms to eforge&apos;s backend contract.&lt;/p&gt;
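&lt;p&gt;To make the seam concrete, here is a minimal sketch of what a backend contract of that shape could look like. The interface and field names are illustrative assumptions, not eforge&apos;s actual API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of a harness backend seam - names are illustrative,
// not eforge&apos;s real contract. The orchestrator owns sizing, decomposition,
// and review; the backend only executes one agent task against whatever
// harness it wraps.
interface AgentTask {
  role: string;       // e.g. &apos;planner&apos;, &apos;builder&apos;, &apos;reviewer&apos;
  model: string;      // model id the orchestrator routed to this role
  prompt: string;
  workspace: string;  // isolated worktree the agent may modify
}

interface AgentResult {
  output: string;
  tokens: { input: number; output: number; cacheRead: number };
  costUsd: number;
}

interface HarnessBackend {
  runAgent(task: AgentTask): Promise&amp;lt;AgentResult&amp;gt;;
}
&lt;/code&gt;&lt;/pre&gt;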
&lt;p&gt;For this evaluation, two variants were configured and run against the same scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/eforge-build/eval/blob/6dda3b2e8369ca2b15c8626fca2d58a8e6a8cdb0/variants.yaml#L6-L8&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;code&gt;claude-sdk&lt;/code&gt;&lt;/a&gt;:&lt;/strong&gt; backend set to the Claude Agent SDK.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://github.com/eforge-build/eval/blob/6dda3b2e8369ca2b15c8626fca2d58a8e6a8cdb0/variants.yaml#L66-L77&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;code&gt;anthropic-api&lt;/code&gt;&lt;/a&gt;:&lt;/strong&gt; backend set to pi with the Anthropic API as the model provider.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The orchestration layer, the per-agent model assignments, and the scenarios were held constant across both variants. The backend was the only intended difference.&lt;/p&gt;
&lt;p&gt;Runs were executed via &lt;a href=&quot;https://github.com/eforge-build/eval&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge&apos;s eval framework&lt;/a&gt;, where each scenario is defined in &lt;a href=&quot;https://github.com/eforge-build/eval/blob/6dda3b2e8369ca2b15c8626fca2d58a8e6a8cdb0/scenarios.yaml&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;code&gt;scenarios.yaml&lt;/code&gt;&lt;/a&gt;. Two scenarios were used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/eforge-build/eval/blob/6dda3b2e8369ca2b15c8626fca2d58a8e6a8cdb0/scenarios.yaml#L11-L23&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;strong&gt;&lt;code&gt;todo-api-errand-health-check&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; (errand scale): add a &lt;code&gt;/health&lt;/code&gt; endpoint to an Express API.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/eforge-build/eval/blob/6dda3b2e8369ca2b15c8626fca2d58a8e6a8cdb0/scenarios.yaml#L50-L59&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;&lt;strong&gt;&lt;code&gt;workspace-api-excursion-engagement&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; (excursion scale): build an engagement module with reactions, threads, and pins across four parallel plans.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The errand scenario was run three times; the excursion scenario twice. Each run executed against both variants:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./run.sh todo-api-errand-health-check --variants claude-sdk,anthropic-api
./run.sh workspace-api-excursion-engagement --variants claude-sdk,anthropic-api
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model routing (both variants):&lt;/strong&gt; &lt;code&gt;claude-opus-4-6&lt;/code&gt; for every build stage, &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; for the PRD validator.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measurement:&lt;/strong&gt; per-run totals for tokens, cost, and cache hit rate, plus per-stage metrics (reviewer accept rate, stage execution, resolver invocations, gap-closer substance) derived from eforge&apos;s monitor-DB and per-run variant analyses.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All five runs were executed on 2026-04-14. All five passed validation on both variants. &lt;a href=&quot;https://github.com/eforge-build/eval/tree/206cb797f86372de16af65b75f405ada6ddcd49c/analyses/2026-04-14-harness-backends&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Per-run variant analyses&lt;/a&gt; are published alongside the eval framework.&lt;/p&gt;
&lt;h2&gt;Token cost replicated&lt;/h2&gt;
&lt;p&gt;Claude-sdk used more tokens than pi on every run, at both scales.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;[Chart: claude-sdk to pi token ratio for each of the five runs]&lt;/em&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Pi tokens&lt;/th&gt;
&lt;th&gt;Claude-sdk tokens&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;th&gt;Pi cost&lt;/th&gt;
&lt;th&gt;Claude-sdk cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;errand&lt;/td&gt;
&lt;td&gt;167k&lt;/td&gt;
&lt;td&gt;406k&lt;/td&gt;
&lt;td&gt;2.43×&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;errand&lt;/td&gt;
&lt;td&gt;167k&lt;/td&gt;
&lt;td&gt;352k&lt;/td&gt;
&lt;td&gt;2.11×&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;$0.61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;errand&lt;/td&gt;
&lt;td&gt;159k&lt;/td&gt;
&lt;td&gt;379k&lt;/td&gt;
&lt;td&gt;2.38×&lt;/td&gt;
&lt;td&gt;$0.39&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;excursion&lt;/td&gt;
&lt;td&gt;6.74M&lt;/td&gt;
&lt;td&gt;9.32M&lt;/td&gt;
&lt;td&gt;1.38×&lt;/td&gt;
&lt;td&gt;$10.52&lt;/td&gt;
&lt;td&gt;$11.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;excursion&lt;/td&gt;
&lt;td&gt;4.92M&lt;/td&gt;
&lt;td&gt;12.10M&lt;/td&gt;
&lt;td&gt;2.46×&lt;/td&gt;
&lt;td&gt;$8.42&lt;/td&gt;
&lt;td&gt;$14.32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The token ratio ranges from 1.38× to 2.46×; the cost ratio from 1.12× to 1.70×. Both remain above one on every run, across both scales. Claude-sdk&apos;s cache hit rate was higher than pi&apos;s on every run (86-93% against pi&apos;s 73-90%), but raw input volume was large enough that caching efficiency did not close the gap. Output tokens showed a similar pattern: on errand runs pi produced 4-5k output tokens while claude-sdk produced 8k.&lt;/p&gt;
&lt;h2&gt;Claude-sdk silently dispatches a second model&lt;/h2&gt;
&lt;p&gt;Both excursion runs showed a model-usage pattern on the claude-sdk planner that eforge did not configure. The planner, routed to &lt;code&gt;claude-opus-4-6&lt;/code&gt;, also ran &lt;code&gt;claude-haiku-4-5&lt;/code&gt; in the same turn.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Planner opus input&lt;/th&gt;
&lt;th&gt;Planner haiku input&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;294k&lt;/td&gt;
&lt;td&gt;349k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;168k&lt;/td&gt;
&lt;td&gt;507k&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The haiku tokens appear inside the planner&apos;s &lt;code&gt;modelUsage&lt;/code&gt; block, not as a separate agent in the run manifest. The mechanism isn&apos;t directly observable from the token data - the structure is consistent with sub-agent dispatch, but this evaluation didn&apos;t verify that.&lt;/p&gt;
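&lt;p&gt;As a concrete illustration, the run-4 record reduces to something like this - a hedged reconstruction with assumed field names, where only the numbers and the two-models-in-one-block structure come from the data:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hedged reconstruction of the run-4 planner usage record (field names
// are assumptions). The finding is the shape: two models inside one
// agent&apos;s usage block, with no second agent in the run manifest.
const plannerUsage = {
  agent: &apos;planner&apos;,
  routedModel: &apos;claude-opus-4-6&apos;,
  modelUsage: {
    &apos;claude-opus-4-6&apos;: { inputTokens: 294_000 },
    &apos;claude-haiku-4-5&apos;: { inputTokens: 349_000 }, // never configured by eforge
  },
};
&lt;/code&gt;&lt;/pre&gt;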
&lt;p&gt;Pi exhibited no equivalent behavior. When eforge routed opus to the planner, pi executed on opus alone.&lt;/p&gt;
&lt;p&gt;&quot;Same models on both sides&quot; is technically wrong. Claude-sdk ran a second model that eforge never configured, and the orchestration layer above it had no visibility into that.&lt;/p&gt;
&lt;h2&gt;Behavioral metrics did not replicate&lt;/h2&gt;
&lt;p&gt;After run 4, the behavioral picture looked decisive. Five dimensions favored pi:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reviewer accept rate: pi 57%, claude-sdk 28%&lt;/li&gt;
&lt;li&gt;Plan-evaluator arbitration: pi ran it, claude-sdk skipped&lt;/li&gt;
&lt;li&gt;Gap-closer substance: pi identified a defective idempotent cascade-hook registration affecting pins, reactions, and threads; claude-sdk flagged &lt;code&gt;package.json&lt;/code&gt; indentation and a type assertion&lt;/li&gt;
&lt;li&gt;Merge-conflict-resolver: pi 1 invocation, claude-sdk 2 across four parallel worktrees&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&quot;review&quot;&lt;/code&gt; evaluator verdict class: emitted by claude-sdk only&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each dimension is documented in the &lt;a href=&quot;https://github.com/eforge-build/eval/blob/206cb797f86372de16af65b75f405ada6ddcd49c/analyses/2026-04-14-harness-backends/run-4-excursion-20-06-55.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;run-4 variant analysis&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Run 5 did not replicate any of them.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;[Chart: reviewer accept rate reversal between runs 4 and 5]&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reviewer accept rate:&lt;/strong&gt; pi dropped to 33%, claude-sdk rose to 62%. Complete reversal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plan-evaluator:&lt;/strong&gt; on this run claude-sdk executed the plan-evaluator and pi skipped it. Reversal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gap-closer:&lt;/strong&gt; neither variant identified a substantive bug on run 5.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge-conflict-resolver:&lt;/strong&gt; not invoked on either variant. The run-4 invocation counts had no analogue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;&quot;review&quot;&lt;/code&gt; verdict class:&lt;/strong&gt; emitted on both variants (pi 8 of 27 verdicts, claude-sdk 2 of 26). The exclusivity observed on run 4 did not hold.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Composer scope-sizing across the three errand runs showed a related instability. Pi&apos;s composer sized the errand correctly as &lt;code&gt;errand&lt;/code&gt; on run 2 but over-scoped it as &lt;code&gt;excursion&lt;/code&gt; on runs 1 and 3. Claude-sdk&apos;s composer sized all three as &lt;code&gt;excursion&lt;/code&gt;. The pattern - pi sometimes correct, claude-sdk consistently over-scoped - held across three errand runs, but a single additional sample in either direction could change the characterization.&lt;/p&gt;
&lt;h2&gt;Interpretation&lt;/h2&gt;
&lt;p&gt;Two findings replicated across five runs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pi consumes fewer tokens than claude-sdk on the same scenario under identical model routing (1.38-2.46×).&lt;/li&gt;
&lt;li&gt;The claude-sdk backend runs &lt;code&gt;claude-haiku-4-5&lt;/code&gt; inside the planner turn - a pattern consistent with sub-agent dispatch, though the mechanism wasn&apos;t verified; pi does not. Observed on both excursion runs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Behavioral metrics on a single excursion-scale scenario did not replicate across two samples. The direction of every observed difference reversed or washed out when the scenario was re-run. The &lt;a href=&quot;https://github.com/eforge-build/eval/blob/206cb797f86372de16af65b75f405ada6ddcd49c/analyses/2026-04-14-harness-backends/run-5-excursion-21-27-12.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;run-5 variant analysis&lt;/a&gt; notes this directly: any single-run comparison at this scale should be read as one sample, not a backend verdict.&lt;/p&gt;
&lt;p&gt;A harness-comparison post published on the basis of run 4 alone would have reported five behavioral advantages for pi that were not reproducible. A reversed post based on run 5 alone would have reported the same pattern in the other direction.&lt;/p&gt;
&lt;h2&gt;Also replicated across all five runs&lt;/h2&gt;
&lt;p&gt;No MCP tool calls or skill invocations were recorded on either variant. The observed token gap cannot be attributed to capability that claude-sdk exposed and pi did not.&lt;/p&gt;
&lt;h2&gt;One-run observations worth flagging&lt;/h2&gt;
&lt;p&gt;On run 5, claude-sdk&apos;s plan-reviewer consumed $1.24 on 1.14M input tokens. Pi&apos;s plan-reviewer on the same PRD consumed $0.29 on 94k input tokens. The 4× spend differential was not present at that magnitude on run 4. One excursion run isn&apos;t enough to separate signal from the run-to-run variance the behavioral metrics already showed.&lt;/p&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Dollar figures reflect API pricing on both variants. The claude-sdk backend draws on a Claude subscription when one is present, so for subscription users claude-sdk&apos;s real-world cost is capped by the subscription rather than billed per token. The token ratio holds; the cost ratio does not necessarily survive.&lt;/li&gt;
&lt;li&gt;Two scenarios, one per scale. Neither is representative of engineering workloads in general.&lt;/li&gt;
&lt;li&gt;n=3 at errand scale, n=2 at excursion scale. Behavioral claims require larger samples.&lt;/li&gt;
&lt;li&gt;All runs executed on a single day. Shared API-side conditions cannot be ruled out as a factor in any observed metric.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;&quot;review&quot;&lt;/code&gt; verdict class is not currently counted in &lt;code&gt;result.review.accepted&lt;/code&gt; or &lt;code&gt;result.review.rejected&lt;/code&gt;. Metrics that depend on those totals may understate review activity on runs where &lt;code&gt;&quot;review&quot;&lt;/code&gt; verdicts are emitted.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Token and cost differentials between the two harness configurations are stable at this sample size. Behavioral differentials are not. Claims about which harness produces higher-quality decisions under identical model routing require more runs than this evaluation contains.&lt;/p&gt;
</content:encoded></item><item><title>The Open Source Alternative</title><link>https://www.markschaake.com/posts/the-open-source-alternative/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-open-source-alternative/</guid><description>Factory AI ships Missions. I&apos;d already built the open source version - from a different direction, with a different answer to what quality means.</description><pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I didn&apos;t set out to build an alternative to anything. I was automating my own workflow - the manual orchestration of plan, build, review, evaluate that I&apos;d been running by hand across Claude Code sessions for months. Each piece got automated, then the coordination between pieces, and at some point I had a multi-agent build system with blind review, dependency-aware parallel orchestration, and a daemon that runs overnight.&lt;/p&gt;
&lt;p&gt;Then Factory AI shipped &lt;a href=&quot;https://factory.ai/news/missions&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Missions&lt;/a&gt; - a long-horizon orchestration system that decomposes large projects into milestones, assigns work to parallel workers, and runs autonomously for hours to days. $200 a month, backed by $50M from Sequoia, NEA, and NVIDIA. The overlap with &lt;a href=&quot;https://eforge.run&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge&apos;s&lt;/a&gt; &lt;a href=&quot;/posts/expedition-excursion-errand/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;expedition mode&lt;/a&gt; was hard to miss - same problem space, same multi-agent decomposition, same &quot;hand off and come back later&quot; model.&lt;/p&gt;
&lt;p&gt;The convergence validates the problem space. The divergence is where it gets interesting.&lt;/p&gt;
&lt;h2&gt;Different bets about quality&lt;/h2&gt;
&lt;p&gt;Both systems solve &quot;how do I build something large without being present?&quot; But they make opposite bets about where quality breaks down.&lt;/p&gt;
&lt;p&gt;Factory validates by asking &lt;em&gt;does it work?&lt;/em&gt; Their &lt;a href=&quot;https://factory.ai/news/missions&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Missions announcement&lt;/a&gt; describes validation workers that &quot;launch the application, navigate through flows, check that pages render correctly, and flag visual or functional issues.&quot; Quality means functional completeness - the output runs, the UI renders, the state transitions work. Their &lt;a href=&quot;https://docs.factory.ai/cli/features/missions&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;documentation&lt;/a&gt; lists &quot;How do you maximize correctness?&quot; as an open research question, which says something about where the architecture is focused and where it&apos;s still searching.&lt;/p&gt;
&lt;p&gt;eforge validates by asking &lt;em&gt;is it correct?&lt;/em&gt; A separate agent reviews the code blind - zero context from the builder, no access to the builder&apos;s reasoning, no ability to rationalize implementation choices. Then an evaluator applies per-hunk verdicts, accepting changes that strictly improve the code and rejecting anything that alters intent. Anthropic&apos;s engineering team &lt;a href=&quot;https://www.anthropic.com/engineering/harness-design-long-running-apps&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;independently validated this pattern&lt;/a&gt; - solo agents approve their own work, adversarial evaluation dramatically improves quality.&lt;/p&gt;
&lt;p&gt;These aren&apos;t the same question. A codebase can pass every functional test while accumulating structural drift - wrong abstractions, subtle coupling, decisions that compound into maintenance debt but never surface in a UI flow. Factory catches what breaks. eforge catches what drifts.&lt;/p&gt;
&lt;p&gt;My experience building with agents over the past couple of years is that the code &lt;em&gt;usually works&lt;/em&gt;. Frontier models are good at making things function. What they&apos;re less reliable at is staying faithful to a plan across a large coordinated change - holding the architectural intent through thirty files without quietly reinterpreting, simplifying, or drifting from the spec. That&apos;s the failure mode blind review is engineered to catch. Not &quot;does this code run&quot; but &quot;does this code do what was planned, the way it was planned.&quot;&lt;/p&gt;
&lt;p&gt;For an engineer who spent real time on a careful plan - thinking through every module boundary, every data flow, every tradeoff - the question that matters is fidelity. Did the build execute the plan as specified, or did the agent decide it knew better? eforge&apos;s review pipeline is built around that question. The blind reviewer can&apos;t rationalize the builder&apos;s choices because it never saw them. The per-hunk evaluator rejects changes that alter intent even if they&apos;d technically work. The goal is minimizing drift and slop - keeping the output coherent with the plan, not just functional.&lt;/p&gt;
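&lt;p&gt;A rough sketch of the shape of that per-hunk step - the type names are mine, not eforge&apos;s API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative sketch of a per-hunk verdict, assuming the semantics
// described above: accept only what strictly improves the code, reject
// anything that alters intent - even if it would work.
type Verdict =
  | { kind: &apos;accept&apos; }
  | { kind: &apos;reject&apos;; reason: string };

interface HunkEvaluation {
  hunkId: string;
  verdict: Verdict;
}
&lt;/code&gt;&lt;/pre&gt;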
&lt;h2&gt;Different ideas about who&apos;s using it&lt;/h2&gt;
&lt;p&gt;Factory&apos;s framing is explicit: the user is a project manager. From their announcement - &quot;the skillset for working with Missions looks less like traditional debugging and more like project management of agents, where you monitor a team of workers, unblocking them when they get stuck, redirecting them when priorities change.&quot; The human manages a fleet. The platform handles the engineering.&lt;/p&gt;
&lt;p&gt;That&apos;s a coherent model, and it works for enterprise teams that need to multiply throughput across an organization. But it&apos;s not the only model.&lt;/p&gt;
&lt;p&gt;eforge is built for a developer who wants to stay close to the work. &lt;a href=&quot;/posts/the-flinch/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;The planning is the engineering&lt;/a&gt; - every architectural choice, every module boundary, every acceptance criterion thought through before anything gets built. The build phase &lt;a href=&quot;/posts/the-handoff/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;runs in the background&lt;/a&gt;, but the person who planned it understands the codebase at the level of someone who wrote it by hand - because the planning demands that depth.&lt;/p&gt;
&lt;p&gt;The difference isn&apos;t capability - it&apos;s proximity. Factory abstracts you above the engineering. eforge keeps you in it, automating the execution but not the judgment. The review architecture exists to protect the investment the engineer made in the plan - careful planning should produce faithful execution, not a reinterpretation filtered through an agent&apos;s preferences. A tool for someone who wants the leverage of autonomous builds without giving up control over what gets built.&lt;/p&gt;
&lt;h2&gt;Plans belong in version control&lt;/h2&gt;
&lt;p&gt;Factory&apos;s plan lives in Mission Control - a dashboard where you iterate with an orchestrator, approve the plan, and monitor execution. eforge commits plans to git - &lt;code&gt;orchestration.yaml&lt;/code&gt;, module plans, dependency graphs, all committed as the build runs and cleaned up when it&apos;s done. The full planning history is in version control, diffable and traceable, without cluttering the working tree.&lt;/p&gt;
&lt;p&gt;This is a design choice, not a limitation. When a build goes wrong, I can diff the plan against what was built and see exactly where the divergence happened. When a queued build starts, the planner re-assesses against the current codebase - if earlier builds changed the architecture, later plans adapt before executing rather than building against stale assumptions.&lt;/p&gt;
&lt;h2&gt;Open source, embeddable, free&lt;/h2&gt;
&lt;p&gt;eforge is &lt;a href=&quot;https://github.com/eforge-build/eforge&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache 2.0&lt;/a&gt;. The tool I needed didn&apos;t exist, and the thing I built should be useful to others.&lt;/p&gt;
&lt;p&gt;It installs as a Claude Code plugin, a &lt;a href=&quot;https://pi.dev/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Pi&lt;/a&gt; extension, or standalone via npm. Two backends - Claude Agent SDK and Pi, which connects to &lt;a href=&quot;/posts/what-a-cheaper-model-actually-costs/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;over 20 LLM providers&lt;/a&gt;. The engine exposes an &lt;code&gt;AsyncGenerator&amp;lt;EforgeEvent&amp;gt;&lt;/code&gt; - it&apos;s a library, not just a CLI. It can live inside CI pipelines, get called from other agents, embed in custom tooling. Factory is a platform you log into. eforge is a component you compose with.&lt;/p&gt;
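&lt;p&gt;Embedding it is roughly a for-await loop. A minimal sketch, assuming a hypothetical entry point and event fields - only the &lt;code&gt;AsyncGenerator&amp;lt;EforgeEvent&amp;gt;&lt;/code&gt; shape is stated above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical embedding - runBuild and the event fields are placeholder
// names; only the async-generator shape comes from the engine&apos;s stated API.
import { runBuild } from &apos;eforge&apos;;

for await (const event of runBuild({ plan: &apos;./plan.md&apos; })) {
  if (event.type === &apos;stage:complete&apos;) {
    console.log(event.stage, &apos;finished&apos;);
  }
}
&lt;/code&gt;&lt;/pre&gt;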
&lt;p&gt;The intent is a tool that feels free to use - not free as in &quot;free tier before the paywall&quot; but genuinely open, configurable, yours to run however you want. Agent-assisted setup through the plugin or extension means configuration takes seconds, not documentation-diving. The complexity is there if you want it - model selection down to individual agent roles, hooks, MCP servers - with sane defaults.&lt;/p&gt;
&lt;h2&gt;Different tools for different people&lt;/h2&gt;
&lt;p&gt;Factory is an enterprise platform - $50M in funding, deep integrations with Linear and Sentry and PagerDuty, computer-use validation, skill accumulation across builds. It&apos;s built for organizations that need to multiply throughput across teams.&lt;/p&gt;
&lt;p&gt;eforge is built for an individual developer who wants to drive. The tool amplifies judgment, it doesn&apos;t accumulate its own. The developer holds the patterns, understands the codebase, owns the plan. eforge executes faithfully and verifies rigorously - that&apos;s the scope, and it&apos;s deliberate.&lt;/p&gt;
&lt;p&gt;I&apos;ve been using eforge daily for three weeks, building real features across multiple projects - including eforge itself. It&apos;s young and rough around the edges. I built it with care, and I plan to keep building it.&lt;/p&gt;
</content:encoded></item><item><title>The Flinch</title><link>https://www.markschaake.com/posts/the-flinch/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-flinch/</guid><description>Everyone flinches at the moment they&apos;re supposed to let go of the code. The skill that matters now is planning well enough to trust the handoff.</description><pubDate>Sat, 04 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve had a few conversations recently with people building in the agentic development space, and in each one, at some point, there&apos;s a flinch. One person gravitates back toward the IDE. Another hears about a workflow where no human reviews pull requests and says his team wouldn&apos;t go for that. The reasons vary but the discomfort is the same, and it&apos;s upstream of the tools entirely.&lt;/p&gt;
&lt;h2&gt;The actual bottleneck&lt;/h2&gt;
&lt;p&gt;The conventional workflow is: quick plan, get into the code, iterate fast. This works when the human is writing the code - course-correction happens in real time because the person executing is also the person thinking. Planning can be loose because execution is tight.&lt;/p&gt;
&lt;p&gt;Agentic development inverts this. The agent handles execution. The human handles planning. And the quality of that planning becomes the single biggest constraint on what comes out the other side.&lt;/p&gt;
&lt;p&gt;With &lt;a href=&quot;https://eforge.build&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge&lt;/a&gt;, I spend real time on a plan before any code gets written - not because the tool is slow, but because the plan has to be &lt;em&gt;that good&lt;/em&gt;. Every architectural choice, every module boundary, every acceptance criterion needs to be thought through to the point where I&apos;m confident handing it off to parallel agents who will build it without me watching. I read every line Claude produces during planning. I push back, ask for alternatives, challenge assumptions.&lt;/p&gt;
&lt;p&gt;It&apos;s closer to architecture review than coding - except the architecture review &lt;em&gt;is&lt;/em&gt; the work. Going that deep into the planning keeps me close to the codebase in a way that isn&apos;t so different from writing the code myself. I know the module boundaries, the data flows, the tradeoffs - not because I typed the implementation, but because I thought through every decision that shaped it.&lt;/p&gt;
&lt;p&gt;Even thirty minutes of careful, concentrated planning before a line of code exists feels wasteful when the instinct is to start building immediately. But with agents that can execute an entire module in parallel - and do it well if the spec is clear - the leverage is overwhelmingly in the planning.&lt;/p&gt;
&lt;h2&gt;Where the advantage actually is&lt;/h2&gt;
&lt;p&gt;The advantage I have here isn&apos;t technical skill - it&apos;s practice. Close to a year of agentic engineering, starting with skills and subagent frameworks I built myself and eventually building eforge out of that work. The whole time, building the instinct for what a plan needs to contain before it&apos;s ready to hand off. Early on I made the same mistakes: plans that were too vague, specs that left too much to interpretation, assumptions I didn&apos;t realize I was making. The agents would build something reasonable but wrong in ways that took longer to fix than if the plan had been right.&lt;/p&gt;
&lt;p&gt;The reps changed that. Planning now feels like the primary work - the thing that matters most, not the step before the thing that matters. But getting there took discipline that runs against every instinct built up from decades of coding.&lt;/p&gt;
&lt;p&gt;A friend I was talking with about this put it well: the gap isn&apos;t in the models or the tools, it&apos;s in the trust-building - the &lt;em&gt;planner&apos;s&lt;/em&gt; ability to produce something complete enough that autonomous execution is safe. Handing off execution and trusting the output will be right - that only happens after enough confidence in the handoff builds up, and it takes time. For me it was Opus 4.5 and 4.6 that got me there - the model was consistently good enough that I stopped needing to check every line of code and started trusting the plan-to-execution loop.&lt;/p&gt;
&lt;h2&gt;How I actually work now&lt;/h2&gt;
&lt;p&gt;The build phase &lt;a href=&quot;/posts/the-handoff&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;already runs in the background&lt;/a&gt; - while eforge builds one thing, I&apos;m planning the next, sometimes in Claude&apos;s plan mode, sometimes in &lt;a href=&quot;https://pi.dev/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Pi&lt;/a&gt; using an &lt;a href=&quot;https://github.com/eforge-build/eforge/tree/main/packages/pi-eforge&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge planning extension&lt;/a&gt; I built. The bar I hold each plan to isn&apos;t &quot;does this sound right&quot; but &quot;could someone who&apos;s never seen this codebase execute this and produce something I&apos;d ship.&quot; If five different agents could build it independently and produce roughly the same result, the plan is ready.&lt;/p&gt;
&lt;p&gt;Context-switching between plans is its own skill to develop, but the alternative is sitting idle while agents build, and that&apos;s wasted leverage.&lt;/p&gt;
&lt;h2&gt;The transition is the skill&lt;/h2&gt;
&lt;p&gt;Most AI developer tools are still built around human quality gates - a person reviewing code mid-development, a person approving pull requests. Those checkpoints made sense when the human was the only one who could catch problems. In my own work, both the writing and reviewing of code have moved to agents - I stay in the plan. Whether that&apos;s where everyone ends up, I don&apos;t know, but for this workflow the checkpoints feel like artifacts of an earlier constraint.&lt;/p&gt;
&lt;p&gt;The code is a downstream artifact. And the skill that determines whether that works - given models smart enough to execute well - is the ability to think clearly enough about what needs to be built that an autonomous system can build it right.&lt;/p&gt;
</content:encoded></item><item><title>What a Cheaper Model Actually Costs</title><link>https://www.markschaake.com/posts/what-a-cheaper-model-actually-costs/</link><guid isPermaLink="true">https://www.markschaake.com/posts/what-a-cheaper-model-actually-costs/</guid><description>The policy that governs my primary dev workflow lives in an X reply. That&apos;s its own kind of vendor risk.</description><pubDate>Sat, 04 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I was mid-plan when the Anthropic announcement dropped - Claude Code subscriptions would no longer cover third-party harnesses. Tools like OpenClaw, which run continuous automated builds on Claude&apos;s infrastructure, now require separate pay-as-you-go billing. The subtext was straightforward: flat-rate pricing doesn&apos;t survive automated systems that run 24/7.&lt;/p&gt;
&lt;p&gt;My first question was whether &lt;a href=&quot;https://eforge.run&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge&lt;/a&gt; fell into the same bucket. Within hours, Boris Cherny - the creator of Claude Code - confirmed in an X reply that personal local tools wrapping the Claude Code harness, headless mode, and Agent SDK are still covered under subscriptions. eforge uses the Agent SDK to spin up Claude Code instances under my Max subscription - so it&apos;s probably fine. For now.&lt;/p&gt;
&lt;p&gt;But &quot;probably fine&quot; is doing a lot of work in that sentence. The clarification came from an X reply, not official documentation. Boris himself said he&apos;s &quot;working on improving clarity&quot; - which means the clarity doesn&apos;t exist yet. The policy that governs whether my primary development workflow keeps working is, at this moment, one person&apos;s tweet. That&apos;s its own kind of vendor risk, even when the answer is favorable.&lt;/p&gt;
&lt;p&gt;This is why I added a second backend to eforge last week. &lt;a href=&quot;https://pi.dev/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Pi&lt;/a&gt; connects to over 15 LLM providers including local models. I built the backend as a vendor lock-in hedge. The Anthropic announcement didn&apos;t break anything - but it confirmed that the hedge was the right call.&lt;/p&gt;
&lt;h2&gt;The three costs of a cheaper model&lt;/h2&gt;
&lt;p&gt;The conversation about model pricing usually stops at the sticker price - cost per million tokens. That&apos;s the wrong number to optimize for. In an agentic build system, there are three distinct costs, and only one of them shows up on the invoice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Per-token cost&lt;/strong&gt; is what the provider charges. Opus is expensive. GPT 5.4 is cheaper. Local models like Google&apos;s Gemma 4 are free. Right now the pool of models I actually trust for this work is small - Opus, Sonnet, GPT 5 - but as more cross that threshold, per-token price becomes the tiebreaker.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Correction overhead&lt;/strong&gt; is the hidden multiplier. I ran the same eval build across Opus and GPT 5.4 this week - Opus completed it in about four minutes, GPT 5.4 in three and a half. I also tried Gemma 4 locally, not as a serious alternative but to see how eforge&apos;s guardrails held up against a much smaller model. Despite roughly six times the token generation speed, it took over seven minutes - the speed advantage entirely consumed by rework. More correction passes, more review cycles, more back-and-forth before reaching a valid state. A model that&apos;s cheaper per token but triggers more review cycles isn&apos;t saving money. It&apos;s hiding the cost in a different line item.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build failure&lt;/strong&gt; is the catastrophic case. Not just more expensive than expected - a total loss. With Opus and a well-written plan, I&apos;ve rarely seen an eforge build fail because the model drifted too far during execution. The final validation pass recovers from minor issues, and the build lands. I had an hour-and-forty-five-minute Opus build yesterday that completed successfully - a large coordinated change across many files. The time didn&apos;t matter. What mattered was that I handed it off, moved on to other planning work, and came back to a successful result.&lt;/p&gt;
&lt;p&gt;The Gemma 4 experiment showed the extreme version of this - the same small eval succeeded once in 7 minutes, then failed the next attempt at 19 minutes. Same prompt, same eval, completely different outcomes. With Opus, a four-minute eval build lands every time, and so does a two-hour build. The confidence scales because the model holds coherence. That reliability gap is what eforge needs closed before smaller models become viable.&lt;/p&gt;
&lt;h2&gt;What the workflow actually requires&lt;/h2&gt;
&lt;p&gt;The way I work with eforge is simple: I plan carefully, I hand off, I move to planning the next thing. The build runs in the background. I trust it to succeed. That trust is the entire value of the system - it&apos;s what lets me stay in the planning layer instead of dropping back into implementation mode. And the trust isn&apos;t blind - it comes from two places: the quality of the model executing the build, and the quality of the plan I wrote before handing it off. Both matter. A great model with a sloppy plan will drift. A detailed plan with a weak model will still collapse. The combination is what makes the workflow reliable.&lt;/p&gt;
&lt;p&gt;The moment builds start failing unpredictably, I&apos;m pulled back into monitoring - and the &quot;free&quot; local model is actually costing me flow state.&lt;/p&gt;
&lt;p&gt;This reframes what &quot;model quality&quot; means in a harness context. The question isn&apos;t &quot;can this model write decent code?&quot; - it&apos;s whether it can hold architectural coherence across a thirty-file build while an adversarial reviewer pushes back on every hunk, and do it reliably enough that I never think about whether it will land. That&apos;s the difference between a model that demos well and a model that builds well. Most models can do the first. Only a handful of frontier models consistently do the second.&lt;/p&gt;
&lt;h2&gt;The dependency is real&lt;/h2&gt;
&lt;p&gt;eforge doesn&apos;t need the &lt;em&gt;best&lt;/em&gt; model. It needs a model above a quality floor - good enough that the adversarial reviewer is catching edge cases and style issues, not rewriting fundamentals. When the builder&apos;s output is structurally sound, the review layer refines it. When the builder&apos;s output drifts - wrong abstractions, misunderstood intent, conversational filler where precise code should be - the reviewer isn&apos;t refining, it&apos;s salvaging.&lt;/p&gt;
&lt;p&gt;That floor isn&apos;t fixed - it depends on the complexity of the build, the clarity of the plan, the number of files in play. But for the kind of work I do with eforge - large coordinated builds across many files - the floor is high. Right now, a small cluster of frontier models clear it: Opus, GPT 5, likely a couple of others. Before models at this level existed, the workflow wasn&apos;t possible - not difficult or expensive, just not possible.&lt;/p&gt;
&lt;p&gt;The pool is growing, but it&apos;s still small. The idea of eforge is &lt;em&gt;dependent&lt;/em&gt; on frontier intelligence in a way I can&apos;t architect around. I can make the model swappable. I can&apos;t make the methodology work with a model that isn&apos;t good enough.&lt;/p&gt;
&lt;h2&gt;Building for a shift that&apos;s visible but not here yet&lt;/h2&gt;
&lt;p&gt;That dependency is real, but it doesn&apos;t have to mean locked into one provider.&lt;/p&gt;
&lt;p&gt;The competitive dynamics are moving fast - Anthropic, OpenAI, Google, Meta, Mistral all racing on the same capabilities, each leap matched within months. That race drives per-token prices down and quality up simultaneously. What Opus costs today will buy something better in six months.&lt;/p&gt;
&lt;p&gt;The architectural response is to make the model a swappable component. eforge&apos;s Pi backend makes each agent role independently configurable, and the eval infrastructure is what makes it durable - if I can programmatically measure whether a model crosses the quality floor for each role, model selection becomes a config change backed by data, not a gut decision. The eval suite is where the durability lives, not the model it happens to evaluate.&lt;/p&gt;
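&lt;p&gt;A sketch of what per-role configuration of that kind can look like - the field names are illustrative assumptions, not eforge&apos;s actual schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative per-role routing - the point is the shape, not the names.
// Swapping a role&apos;s model becomes a one-line change defended by eval
// results rather than gut feel.
const roles = {
  planner:   { backend: &apos;pi&apos;, model: &apos;claude-opus-4-6&apos; },
  builder:   { backend: &apos;pi&apos;, model: &apos;claude-opus-4-6&apos; },
  reviewer:  { backend: &apos;pi&apos;, model: &apos;claude-opus-4-6&apos; },
  validator: { backend: &apos;pi&apos;, model: &apos;claude-sonnet-4-6&apos; },
};
&lt;/code&gt;&lt;/pre&gt;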
&lt;p&gt;Anthropic&apos;s policy change is a nudge toward something that was going to happen anyway. Flat-rate subscriptions for automated agentic usage were a transitional artifact. The future is per-token pricing, and the question is whether competitive pressure drives those prices down fast enough that the workflow stays accessible.&lt;/p&gt;
&lt;p&gt;I think it will. The trajectory from Gemma 2 to Gemma 4 is steep. A future 70B model tuned for sustained multi-file implementation could close the gap on the metrics that actually matter for this workflow. Local execution with enough unified memory would make the marginal cost of a build effectively zero.&lt;/p&gt;
&lt;p&gt;The architecture needs to be ready for that moment. Provider-agnostic backends, per-agent model configuration, eval-driven quality gates - these decisions cost nothing if only one model is good enough today. They pay off the moment a second one crosses the floor. The planning layer is the same regardless - I wrote about that model-independence in &lt;a href=&quot;/posts/the-flinch/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;The Flinch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That&apos;s the bet - not on any specific model, but on the market producing what the workflow needs at a price the workflow can sustain. I&apos;d rather be ready to swap than stuck rewriting when the economics shift.&lt;/p&gt;
</content:encoded></item><item><title>The Hill</title><link>https://www.markschaake.com/posts/the-hill/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-hill/</guid><description>Information asymmetry, the bitter lesson, and what stays durable when models keep getting smarter.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A friend was describing what it must be like inside a frontier AI company - &quot;a wonderful place to be sitting on a hill and looking at everyone else trying to climb it.&quot; They&apos;d be working with next-gen models before public release, already knowing which current friction is temporary while the rest of us place bets blind.&lt;/p&gt;
&lt;p&gt;This is the information asymmetry that everyone outside those companies lives with. We&apos;re investing in tooling, building integrations, designing architectures - and someone inside already knows which of those investments will be obsolete in a quarter. Not because they&apos;re smarter, but because they&apos;ve seen what&apos;s coming.&lt;/p&gt;
&lt;p&gt;Yesterday someone &lt;a href=&quot;https://news.ycombinator.com/item?id=47584540&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;discovered that Claude Code&apos;s entire source code&lt;/a&gt; had shipped in an npm package - 500,000 lines, nearly 2,000 files, dozens of feature flags for capabilities that are fully built but haven&apos;t shipped. I looked at it, along with everyone else - some wanting to understand how things work under the hood, but a lot of the energy focused on the feature flags. What&apos;s coming next. What bets are about to pay off or stop making sense.&lt;/p&gt;
&lt;p&gt;I watch it playing out around me. People grabbing at opportunities, building fast, trying to find the angle that holds. Some will land on something durable. Many won&apos;t - they&apos;ll burn through savings or enthusiasm and quietly step back, and the industry will look completely different in a year. It&apos;s a discouraging dynamic to sit with, and I&apos;m not immune to it.&lt;/p&gt;
&lt;h2&gt;The bitter lesson&lt;/h2&gt;
&lt;p&gt;The AI community calls it &quot;the bitter lesson&quot; - as models get smarter, human-added complexity increasingly gets in the way. Simpler works best. &lt;a href=&quot;https://www.youtube.com/watch?v=hV5_XSEBZNg&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Nate&apos;s recent video&lt;/a&gt; lays this out well: &quot;Every AI system in production contains an invisible layer of workarounds for the last model&apos;s weaknesses, and most teams stopped seeing them as workarounds a long time ago.&quot;&lt;/p&gt;
&lt;p&gt;The core question he poses: is this instruction here because the model needs it, or because I needed the model to need it?&lt;/p&gt;
&lt;p&gt;The people most at risk aren&apos;t those who can&apos;t use AI - it&apos;s those whose value was built on &lt;em&gt;compensating for AI&apos;s limitations&lt;/em&gt;. Elaborate prompt scaffolding, complex retrieval pipelines, workaround architecture that outlived the limitation it was built for. Each model upgrade erodes that layer.&lt;/p&gt;
&lt;h2&gt;Where I sit&lt;/h2&gt;
&lt;p&gt;AI has written all my code for over six months now, and I feel secure for now. Deep systems experience translates directly to driving AI effectively - knowing which architectural choices are good or bad, course-correcting before bad decisions compound. There&apos;s consulting work as far as I can see.&lt;/p&gt;
&lt;p&gt;But I&apos;m also completely consumed by eforge right now - I&apos;m betting on it. Not casually - it&apos;s where my time goes, where my energy goes, and I think about what happens if it all turns out to be for nothing. When models internalize deep systems expertise - not just syntax but judgment about distributed systems, data modeling, failure modes - the value I bring narrows, and the tool I&apos;m building may narrow with it.&lt;/p&gt;
&lt;h2&gt;Building for the fade&lt;/h2&gt;
&lt;p&gt;So &lt;a href=&quot;https://eforge.build&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;eforge&lt;/a&gt; forces me to look ahead. &lt;a href=&quot;/posts/the-handoff&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;The build phase already runs in the background&lt;/a&gt; - eforge is meant to be &lt;a href=&quot;/posts/the-layer-above&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;the layer above&lt;/a&gt; - but that layer needs to constantly evolve upward. Some structure is load-bearing: orchestration, topological merge order, the blind review architecture. Some is compensatory scaffolding I expect to remove as models improve. The hard part is knowing which is which, and having the discipline to let go of the compensatory parts even when I built them myself.&lt;/p&gt;
&lt;p&gt;Overprompting is the clearest example - all the careful wording that accommodates a model&apos;s current quirks but won&apos;t matter once the next version drops. Same with treating code review as a procedural checklist rather than thin scaffolding around model judgment, or building an opinionated abstraction layer where a shim would do. &quot;Am I adding this because the model needs it, or because I&apos;m uncomfortable letting go?&quot;&lt;/p&gt;
&lt;p&gt;One bet I&apos;m more confident in: staying flexible on providers. eforge supports &lt;a href=&quot;https://github.com/badlogic/pi-mono/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;pi&lt;/a&gt; as a multi-provider backend specifically because locking into one vendor&apos;s API when the landscape shifts this fast seems like exactly the kind of compensatory decision that ages badly. Maybe eforge itself becomes the tool people use, maybe it&apos;s an inspiration for whatever comes next, maybe the value is just in the design principles it forces me to confront. I don&apos;t know which.&lt;/p&gt;
&lt;h2&gt;What stays&lt;/h2&gt;
&lt;p&gt;Models can solve problems, and they&apos;re getting better at it fast. What they don&apos;t do is decide which problems matter, whether the solution is actually right, or when to change direction entirely. I don&apos;t have a clean name for that role. It&apos;s not &quot;problem solver&quot; because the models are taking over more of that space. It&apos;s closer to judgment - the ability to evaluate, steer, and know when something is off before the cost compounds.&lt;/p&gt;
&lt;p&gt;Currently that judgment requires implementation intuition - deep enough understanding of underlying choices to know which ones are good. That requirement may fade as models improve, and I want to believe the judgment itself stays human.&lt;/p&gt;
</content:encoded></item><item><title>The Handoff</title><link>https://www.markschaake.com/posts/the-handoff/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-handoff/</guid><description>Abstracting away the build phase: learning to plan full time.</description><pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;AI has written 100% of my code for over six months now - that part is settled. What&apos;s been shifting is everything else - the orchestration, the oversight, the gap between expressing what I want and trusting that it gets built correctly.&lt;/p&gt;
&lt;p&gt;I&apos;ve been classifying every commit across six projects by what kind of work it represents - feature, fix, refactor, rework, test, docs, ops. The weekly breakdown tells a story the aggregates can&apos;t.&lt;/p&gt;
&lt;p&gt;The docs category in that breakdown is planning and methodology artifacts - architecture documents, module decompositions, scope assessments, verification criteria. It starts as a sliver and by the final week accounts for 42% of measured effort - not because the planning expanded, but because everything around it automated away. None of it was written by hand.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The shift happened in stages over the past few months, each one noticeable as it arrived.&lt;/p&gt;
&lt;p&gt;Once agents could read a plan and execute, the quality of the output tracked the quality of the plan, not the quality of any individual prompt. I started building skills and plugins that automated more of the coordination - planning workflows, review cycles, scope assessments. Each layer removed friction - the builds themselves ran hands-free once the plans were compiled and launched - but I was still the one invoking the skills to compile plans, kicking off the builds, managing worktrees, merging branches. The coordination was partially automated but still required manual orchestration, and there was no clean way to queue work - a build might run thirty minutes or more, and while I could start planning the next thing, I couldn&apos;t hand that plan off until the previous build finished. I was almost at the point of full handoff, but not quite.&lt;/p&gt;
&lt;p&gt;The plans themselves didn&apos;t get simpler - they&apos;re still detailed, still low-level enough that I can verify Claude has thought through the full architectural impact before anything gets built. I need to see enough specifics to know it isn&apos;t assuming too much. What changed is the form - less implementation specification, more intent specification - one tells the machine what to type, the other tells it what to solve. I expect the fidelity I need will drop as models improve, but right now getting the details right in the plan is where the engineering happens.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Last week I started building &lt;a href=&quot;https://eforge.run&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;the layer above&lt;/a&gt; all of it - a tool that takes a plan at whatever fidelity I provide, assesses the complexity of the buildout against the existing codebase, and handles everything from there. Simple intent gets a simple execution path, complex intent gets compiled into a dependency graph of sub-plans that run in parallel across isolated workspaces, each one independently reviewed by parallel agents that have no knowledge of how the code was written - evaluating only whether it&apos;s correct.&lt;/p&gt;
&lt;p&gt;That&apos;s the part that changed how I work. I express the intent, hand it off, and move on to the next thing. Multiple builds run concurrently while I&apos;m already planning the next feature or thinking through the next architectural question. The build phase - the part that used to be the job - runs in the background. I check the results when they&apos;re done, and the independent review cycle catches the things I&apos;d catch in a code review if I were paying close attention.&lt;/p&gt;
&lt;p&gt;I no longer think about the build phase, and the early evidence suggests the output is better for it. The tool doesn&apos;t get tired and skip a review step. It doesn&apos;t decide a verification pass isn&apos;t worth the time at the end of a long session. After a week of dogfooding, the builds coming out of the automated pipeline are already more consistent than what my manually-orchestrated work produced the week before - not because the methodology changed, but because it actually gets followed every time.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Handing off the build doesn&apos;t mean handing off the thinking - it means the thinking is all that&apos;s left. The skills that matter now are architectural, not mechanical - the ability to think clearly about systems above the code and express that thinking in a form machines can execute reliably. The tool works &lt;em&gt;because&lt;/em&gt; I understand the systems I&apos;m asking it to build - how the components interact, where the failure modes live, what tradeoffs matter for this specific context. An intent spec written without that understanding produces exactly the kind of confidently wrong output that gives AI-generated code its reputation.&lt;/p&gt;
&lt;p&gt;I don&apos;t know yet whether the tool I&apos;ve been building is a product or just my own workflow made portable, but the pattern underneath it feels durable - build quality as a function of intent quality, not engineering effort.&lt;/p&gt;
</content:encoded></item><item><title>What the Diff Already Knows</title><link>https://www.markschaake.com/posts/what-the-diff-already-knows/</link><guid isPermaLink="true">https://www.markschaake.com/posts/what-the-diff-already-knows/</guid><description>An LLM reading a git diff cold can classify the nature of the work - feature, rework, refactor - with a granularity that file-level metrics can&apos;t touch.</description><pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;I had a number on my dashboard that tracked whether my agents&apos; output held - files committed in one session and re-edited within a week in a later session, weighted by edit frequency. It was the most useful signal in my &lt;a href=&quot;/posts/what-the-numbers-show&quot;&gt;agentic workflow tracking&lt;/a&gt;, and it was blunt enough to bother me.&lt;/p&gt;
&lt;p&gt;The commit diffs already contained everything I needed to sharpen it. I just hadn&apos;t thought to read them.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;My monitoring system already syncs commits from GitHub - it pulls the metadata, ties commits to clusters of agentic coding sessions, and feeds the dashboard that tracks the churn metrics from the previous post. When a new commit arrives, the system has the diff. And the diff is the evidence. An LLM reading it cold - with no memory of the decisions that produced the code - can see what changed, infer the intent from the structure of the changes, and classify the nature of the work. The classification is a natural extension of a sync pipeline that already exists.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The churn metric I built first did one thing well - it surfaced files that kept getting re-edited across sessions within a short window. But it had specific blindnesses that became harder to ignore the more I used it.&lt;/p&gt;
&lt;p&gt;It couldn&apos;t distinguish planned iteration from failure. A feature that legitimately spans two sessions and a specification that didn&apos;t hold both look identical - same file, edited again within a week. The metric treated a documentation edit and a rework edit identically, because it only saw files, not intent. And it said nothing about the &lt;em&gt;kind&lt;/em&gt; of work being done. I could see that rework was happening. I couldn&apos;t see what else was happening alongside it - how much testing accompanied the feature work, how much was proactive restructuring versus reactive correction.&lt;/p&gt;
&lt;p&gt;The churn metric could detect rework. It couldn&apos;t tell me anything else about what was happening in the commit.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;When the sync pipeline encounters a new commit, it fetches the diff and classifies the work across seven categories: feature, fix, refactor, rework, test, docs, ops. The taxonomy is a first pass - chosen quickly and likely to evolve as the data reveals which distinctions carry weight and which don&apos;t. Percentages sum to 100%, rounded to the nearest 5%, weighted by effort and complexity rather than line count. Only categories with non-zero values get stored.&lt;/p&gt;
&lt;p&gt;A straightforward feature commit produces a classification like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;feat: add session duration tracking to dashboard
→ feature=85% test=15%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A commit that revises yesterday&apos;s implementation after realizing the interface didn&apos;t fit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fix(api): restructure session grouping logic
→ rework=60% fix=25% test=15%
&lt;/code&gt;&lt;/pre&gt;
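&lt;p&gt;The storage rules are mechanical enough to sketch. Assuming the classifier returns raw percentages per category, a normalization pass rounds to the nearest 5%, drops zeros, and keeps the total at exactly 100 - how the rounding drift gets absorbed below is an invented detail:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type Category =
  &apos;feature&apos; | &apos;fix&apos; | &apos;refactor&apos; | &apos;rework&apos; | &apos;test&apos; | &apos;docs&apos; | &apos;ops&apos;;

// Round to the nearest 5%, drop zero buckets, and nudge the largest
// bucket so the stored values still sum to exactly 100.
function normalize(raw: Record&lt;Category, number&gt;): Partial&lt;Record&lt;Category, number&gt;&gt; {
  const rounded = (Object.entries(raw) as [Category, number][])
    .map(([k, v]) =&gt; [k, Math.round(v / 5) * 5] as [Category, number])
    .filter(([, v]) =&gt; v &gt; 0);
  const total = rounded.reduce((sum, [, v]) =&gt; sum + v, 0);
  if (rounded.length &gt; 0 &amp;&amp; total !== 100) {
    rounded.sort((a, b) =&gt; b[1] - a[1]);
    rounded[0][1] += 100 - total; // absorb drift in the biggest bucket
  }
  return Object.fromEntries(rounded);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fed raw values like 83% feature and 17% test, it stores &lt;code&gt;feature=85% test=15%&lt;/code&gt; - the first example above.&lt;/p&gt;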
&lt;p&gt;Work composition across all seven categories over the past few months:&lt;/p&gt;
&lt;p&gt;&amp;lt;WorkCompositionChart /&amp;gt;&lt;/p&gt;
&lt;p&gt;The most important design decision is the boundary between rework and refactor. Rework means revising something recently written - the specification was undercooked, the context was insufficient, or the approach didn&apos;t survive contact with the next session&apos;s requirements. Refactor means proactive restructuring of code that already works and is stable - improving the structure for what comes next, not correcting what came before. They look identical in a diff. They mean opposite things about the health of the process.&lt;/p&gt;
&lt;p&gt;The classifying agent draws that line from the diff alone - it wasn&apos;t there when the code was written, so it has no sunk-cost attachment to calling something &quot;refactor&quot; when it&apos;s actually rework. A human tagging commits after the fact is doing retrospective categorization from memory, with all the rationalization that implies. The agent has no memory to distort. It reads the diff, sees that yesterday&apos;s interface got restructured, checks git history for recency, and classifies accordingly.&lt;/p&gt;
&lt;p&gt;The classifications aren&apos;t ground truth - they&apos;re one model&apos;s reading of structural evidence, and I haven&apos;t tested how consistent they are across runs or model versions. But the bar isn&apos;t perfection. It&apos;s whether the signal is sharper than what churn alone provides, and so far the answer is obviously yes.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Both signals now live on the same dashboard - the old churn metrics alongside the new classification - because they&apos;re measuring different things about the same commits. Churn gives a backward-looking behavioral signal - did the same files get re-edited within a window? The classification gives a forward-looking intentional signal - what kind of work does the diff represent?&lt;/p&gt;
&lt;p&gt;When they agree - low churn paired with low reported rework - confidence is high. The output is holding and the agent is classifying it that way. When they diverge, that&apos;s where the interesting questions live. High reported rework but low churn might mean the rework happened within a single session, invisible to the file-level metric. Low reported rework but high churn might mean the agent is classifying planned iteration as feature work when the file-level signal says otherwise.&lt;/p&gt;
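&lt;p&gt;The quadrant logic is trivial once both signals hang off the same commit cluster - the only real decisions are the thresholds, which are placeholders here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface Signals {
  churn: number;     // weighted cross-session re-edit score (behavioral)
  reworkPct: number; // classified rework share of the commit (intentional)
}

// The cutoffs are invented for illustration; the divergent quadrants
// are where the interesting questions live.
function quadrant(s: Signals): string {
  const highChurn = s.churn &gt; 0.2;
  const highRework = s.reworkPct &gt; 20;
  if (highChurn === highRework) {
    return highChurn ? &apos;agree: rework confirmed&apos; : &apos;agree: output is holding&apos;;
  }
  return highRework
    ? &apos;diverge: rework within a single session?&apos;
    : &apos;diverge: planned iteration read as feature work?&apos;;
}
&lt;/code&gt;&lt;/pre&gt;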
&lt;p&gt;The classification data is a few months old now - enough to see patterns in the noise, not enough to draw firm conclusions. But the parallel period is itself the point. Running both metrics against the same commit history builds a cross-validation baseline. When enough data accumulates, the places where the two signals agree and disagree will tell me something about the reliability of each.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;An earlier version embedded the classification in the commit message itself, but my orchestration plugin - which spawns agents across worktrees - wasn&apos;t using the commit skill. Every orchestrated commit was missing the classification. The coverage problem pointed to where it actually belongs: in the system that cares about the metrics, not in the commit path.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The richer signal goes beyond rework. Testing percentage becomes a trackable metric over time - how much testing accompanies feature work in practice, not in aspiration. The feature-to-fix ratio across a project&apos;s lifecycle tells a story about whether early architecture decisions are holding or accumulating correction debt. Docs investment, which churn couldn&apos;t see at all, becomes visible.&lt;/p&gt;
&lt;p&gt;The testing composition is the one I&apos;m watching most closely - whether a consistent testing allocation in commits correlates with lower downstream rework. If it does, that&apos;s a signal that can feed back into the agents&apos; work prompts. Not a mandate - a calibration.&lt;/p&gt;
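&lt;p&gt;The test for that correlation is ordinary statistics once enough periods accumulate - pair each period&apos;s testing share with the rework rate that follows it and look for a negative coefficient. A sketch, with invented numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Pearson correlation; a negative r between testing share and the
// *following* period&apos;s rework rate would support the calibration idea.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const mean = (a: number[]) =&gt; a.reduce((s, v) =&gt; s + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i &lt; n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

const testingShare = [15, 10, 25, 20];  // % of commit effort, invented
const nextReworkPct = [30, 35, 18, 22]; // the following period, invented
console.log(pearson(testingShare, nextReworkPct)); // strongly negative here
&lt;/code&gt;&lt;/pre&gt;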
&lt;p&gt;Over a project&apos;s lifecycle, the composition should shift - feature-heavy early, testing and refactor growing as the codebase matures, rework spiking mid-project when specifications are still being discovered and declining as the planning process learns what level of detail holds up.&lt;/p&gt;
&lt;p&gt;File-level churn against classified rework over the same period - their divergences reveal what each signal misses:&lt;/p&gt;
&lt;p&gt;&amp;lt;ChurnVsReportedReworkChart /&amp;gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Every commit becomes a data point about the process, not just the product. The diff already knew that - it just needed a reader with no memory.&lt;/p&gt;
</content:encoded></item><item><title>What the Numbers Show</title><link>https://www.markschaake.com/posts/what-the-numbers-show/</link><guid isPermaLink="true">https://www.markschaake.com/posts/what-the-numbers-show/</guid><description>I&apos;ve been tracking machine hours, session concurrency, and code rework across my agentic workflows. The most useful thing the data surfaced wasn&apos;t about the machines.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;import WallVsMachineChart from &apos;../../components/WallVsMachineChart.astro&apos;;
import MachineWallRatioChart from &apos;../../components/MachineWallRatioChart.astro&apos;;
import RatioVsChurnChart from &apos;../../components/RatioVsChurnChart.astro&apos;;&lt;/p&gt;
&lt;p&gt;I&apos;ve been tracking my machine time. Not in the abstract - literal machine hours, measured by hooks in my Claude Code setup that fire events to an API whenever a session starts, a tool executes, or a session ends. A git integration ties commits back to clusters of sessions on the same repo. The whole thing is maybe 150 lines of bash and an API endpoint, took an afternoon to build, and runs quietly in the background while I work.&lt;/p&gt;
&lt;p&gt;The ratio of machine hours to wall hours across my projects has been trending toward 1.5 - meaning for every hour of clock time on a project, the machines are running for about ninety minutes. It&apos;s a parallelism indicator, not a production measure - it says nothing about what those machine hours produce, only that multiple agents are working simultaneously.&lt;/p&gt;
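&lt;p&gt;The arithmetic is worth pinning down, because the ratio only climbs above 1 when sessions overlap. Machine hours are the plain sum of session durations; wall hours are the merged union of the same intervals. A sketch, assuming sessions land in the API as start/end timestamps:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;interface Session { start: number; end: number } // epoch millis

const HOUR = 3_600_000;

// Machine hours: every session counts in full, overlapping or not.
function machineHours(sessions: Session[]): number {
  return sessions.reduce((sum, s) =&gt; sum + (s.end - s.start), 0) / HOUR;
}

// Wall hours: merge overlapping intervals so concurrent sessions
// count the clock only once.
function wallHours(sessions: Session[]): number {
  const sorted = [...sessions].sort((a, b) =&gt; a.start - b.start);
  let total = 0;
  let cur: Session | null = null;
  for (const s of sorted) {
    if (cur &amp;&amp; s.start &lt;= cur.end) {
      cur.end = Math.max(cur.end, s.end); // extend the merged block
    } else {
      if (cur) total += cur.end - cur.start;
      cur = { ...s };
    }
  }
  if (cur) total += cur.end - cur.start;
  return total / HOUR;
}

// Two one-hour sessions overlapping by 30 minutes:
// 2.0 machine hours over 1.5 wall hours → ratio ≈ 1.33.
const demo: Session[] = [
  { start: 0, end: HOUR },
  { start: HOUR / 2, end: 1.5 * HOUR },
];
console.log(machineHours(demo) / wallHours(demo));
&lt;/code&gt;&lt;/pre&gt;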
&lt;p&gt;And the ratio understates the actual leverage, because each machine-hour is itself already some multiple of what I&apos;d accomplish alone. But estimating that multiple honestly is harder than it sounds. I&apos;m a Scala engineer - fifteen years on the JVM - building fullstack TypeScript because the models are fluent in it. I&apos;ve written virtually no TypeScript by hand. The leverage multiplier for routine scaffolding isn&apos;t just large, it&apos;s undefined - I don&apos;t have a denominator. Even for architecture work closer to my experience, the per-session leverage is real and the 1.5 machine:wall ratio compounds on top of it. The precise number isn&apos;t the point.&lt;/p&gt;
&lt;p&gt;Sometimes I&apos;m at the keyboard during that hour, working alongside the agents. Sometimes I&apos;m not there at all - an expedition plan kicked off before lunch, agents executing across worktrees while I&apos;m playing tennis or homeschooling my son. That felt like progress when I first noticed it. It still does. But the number hides more than it reveals.&lt;/p&gt;
&lt;p&gt;&amp;lt;WallVsMachineChart /&amp;gt;&lt;/p&gt;
&lt;p&gt;&amp;lt;MachineWallRatioChart /&amp;gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Right now I have five Claude Code terminal sessions open - some running, some waiting for my next instruction. That itself was an evolution. When I started, I ran one session at a time. Over months I learned to parallelize my own attention, hopping between projects and between modules within a project, maintaining progressive flow in each thread. This is a different kind of parallelism from the orchestrated handoff - it&apos;s &lt;em&gt;my&lt;/em&gt; concurrency, not just the machines&apos;.&lt;/p&gt;
&lt;p&gt;The bursts where the machine:wall ratio spikes above 2 happen when I fully hand off orchestrated execution of plans that were produced in earlier planning sessions. But the steady-state ratio around 1.5 reflects something subtler: multiple sessions running simultaneously because I&apos;ve learned to hold multiple threads of work in my head at once.&lt;/p&gt;
&lt;p&gt;On a given day I might have three projects in flight, each getting that kind of leverage. But even this dissolves under scrutiny, because one of those projects might be in a planning phase where everything is single-threaded and the ratio sits near 1:1, while another is in burst execution with agents working across git worktrees and the ratio spikes.&lt;/p&gt;
&lt;p&gt;And planning - the work that produces the most parallelism downstream - is inherently sequential. Two hours designing the architecture and decomposing the work, then four agents execute simultaneously. The planning shows up as low throughput on any dashboard. The execution shows up as the spike. Neither tells the full story alone, and if I were watching only the ratio, I&apos;d be rewarding the wrong phase of the work.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Dan Shapiro &amp;lt;a href=&quot;https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;published a framework&amp;lt;/a&amp;gt; earlier this year that maps AI adoption across five levels - from &quot;spicy autocomplete&quot; up to the &quot;dark factory&quot; where no human writes or reviews code. I recognized my own trajectory in the middle levels, and the framework gave language to something I&apos;d been circling: every level feels complete from the inside. Shapiro says this explicitly, and in conversations with engineering leaders I&apos;ve seen the same thing - teams describing themselves as further along than their workflows suggest. Self-assessment is &amp;lt;a href=&quot;/posts/from-assisted-to-agentic&quot;&amp;gt;demonstrably unreliable&amp;lt;/a&amp;gt;. The question isn&apos;t how productive someone &lt;em&gt;feels&lt;/em&gt;. It&apos;s what the data says about how they&apos;re actually working.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;There&apos;s a cycle in the data I think of as planning-to-burst. In weeks where I spend more wall time reading, exploring, and writing specifications - sessions heavy on Read and Explore tools with few edits - the following sessions tend to be dense with parallel execution and high commit output. The planning sessions look slow on any throughput metric. They&apos;re the highest-leverage hours of the week. An engineer who&apos;s learned to invest in the planning phase before touching implementation is working differently from one who starts editing immediately, even if their machine:wall ratios are similar.&lt;/p&gt;
&lt;p&gt;Session concurrency is a clear signal. When I started with Claude Code, all my work was a single terminal session - one agent, iterating together. Over months, I&apos;ve shifted toward orchestration patterns that spawn multiple Claude Code sessions across git worktrees, each tracked independently. Four sessions running in parallel on the same project, each with its own event stream, each producing commits.&lt;/p&gt;
&lt;p&gt;That shift - from working with AI to orchestrating AI - maps directly onto Shapiro&apos;s progression from the middle levels toward the upper ones, and it shows up in the data as concurrent sessions that the API clusters into session groups. The measurement granularity is the terminal session, not the individual tool call, and that turns out to be the right level - it captures the deliberate parallelism that reflects how someone has decomposed the work, not the incidental parallelism happening inside the agent runtime.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The data suggests a progression. A ratio at 1.0 means one session at a time - the agent as pair programmer, which is where &amp;lt;a href=&quot;https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Shapiro estimates&amp;lt;/a&amp;gt; about 90% of self-described AI-native developers still operate. When concurrent sessions start appearing and the ratio lifts above 1.0, the developer is directing agents rather than writing alongside one. Further along, planning phases lengthen and execution happens in bursts across parallel sessions - the developer&apos;s primary output becomes specifications and judgment, not code.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The progression above is entirely about throughput and parallelism. A team could look good on every one of those signals while producing code that keeps getting reworked. The machine:wall ratio says nothing about whether the work sticks.&lt;/p&gt;
&lt;p&gt;Code churn is the usual way to measure rework - files modified and then modified again within a short window. But calendar-based churn is a blunt instrument in agentic workflows because it can&apos;t distinguish planned multi-session work from rework. A file committed in one session and revised in a follow-up session the next day might mean the first session&apos;s output didn&apos;t hold - the specification was undercooked, the context was insufficient, or the task decomposition was wrong. Or it might mean the feature legitimately spans two sessions.&lt;/p&gt;
&lt;p&gt;I started tracking whether files committed in one session got re-edited in a later session within a week - a rough proxy for whether the first session&apos;s output actually held. Not every cross-session re-edit is rework - legitimate features span multiple sessions - but persistent revision of recently produced files is a signal worth watching. I want that number minimal and trending down - as models improve, as the harness gets tighter, as the planning process learns what level of specification produces output that survives to the next session.&lt;/p&gt;
&lt;p&gt;Even that metric needed its own iteration. My first version treated all cross-session edits equally - a file touched once in a follow-up session counted the same as one touched seven times. But those are different signals. A single re-edit might be a legitimate extension; seven re-edits means the specification didn&apos;t hold. Refining the metric to weight by edit frequency sharpened the signal, and the process of refining it was itself a case of the rework the metric was trying to measure - caused not by the agents but by my own insufficient research into what I was actually trying to capture.&lt;/p&gt;
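&lt;p&gt;The refinement itself is a small change. Assuming the system records, per file committed in a session, how many times later sessions re-edited it within the week, the two versions differ by one weighting - and the cap below is an invented choice:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// One record per file committed in a session: how many times a
// *later* session re-edited it within seven days.
interface FileOutcome {
  file: string;
  crossSessionEdits: number;
}

// v1: any re-edit counted as a hit - 1 edit and 7 edits scored the same.
function reworkRateV1(outcomes: FileOutcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter((o) =&gt; o.crossSessionEdits &gt; 0).length / outcomes.length;
}

// v2: weight by edit frequency, capped so one thrashing file
// can&apos;t dominate the score. The cap at 5 is an invented value.
function reworkRateV2(outcomes: FileOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const weighted = outcomes.reduce(
    (sum, o) =&gt; sum + Math.min(o.crossSessionEdits, 5) / 5,
    0,
  );
  return weighted / outcomes.length;
}
&lt;/code&gt;&lt;/pre&gt;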
&lt;p&gt;A rising machine:wall ratio paired with flat or declining session rework is the signal of genuine progress - more parallel execution, and the output holds up. A rising ratio paired with rising rework means the parallelism is producing work that doesn&apos;t survive contact with reality. Agents stepping on each other, specifications undercooked, review too thin to catch problems before they compound. That pattern looks productive on any throughput dashboard. The rework data shows it isn&apos;t.&lt;/p&gt;
&lt;p&gt;The type of work complicates the signal further. I categorize my work into &amp;lt;a href=&quot;/posts/expedition-excursion-errand&quot;&amp;gt;expeditions, excursions, and errands&amp;lt;/a&amp;gt; - ranging from heavily planned multi-session projects to quick iterative fixes. Errand-level work naturally produces more churn because the thrashing is the solution-crafting process, not a failure of planning. It&apos;s the expedition and excursion work - where I invest in decomposition before burst execution - where rework signals something actually went wrong upstream. The metric doesn&apos;t distinguish between these categories yet, and until it does, the interpretation requires context I carry in my head.&lt;/p&gt;
&lt;p&gt;But the most useful thing the rework data has surfaced so far isn&apos;t about the agents at all - it&apos;s about me. Watching the numbers and tracing rework instances back to the sessions that produced them gives me something I didn&apos;t have before - a feedback loop. I care more about planning now because I can see its real impact downstream, both when I invest in it and when I skip it.&lt;/p&gt;
&lt;p&gt;The planning-to-burst cycle should predict this - sessions starting with thorough decomposition producing lower-rework output downstream, sessions that skip planning and go straight to parallel execution producing high throughput and high rework. I don&apos;t have enough data yet to confirm the pattern, but it&apos;s the hypothesis the measurement is designed to test.&lt;/p&gt;
&lt;p&gt;&amp;lt;RatioVsChurnChart /&amp;gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;In a team context, these behavioral signals look less like a dashboard and more like a conversation prompt. The shifts are gradual - a machine:wall ratio lifting from 1.0 to 1.2 over three weeks as someone starts running concurrent sessions for the first time. Planning share increasing as the investment in specification starts to precede execution. An engineer whose edit-loop frequency is high - the same files getting revised repeatedly within a session - might be at a different point in the transition than one whose sessions are heavy on planning tools and light on edits. Session rework rate layered on top of the ratio sharpens the picture further - two engineers with identical machine:wall ratios look very different when one has 8% rework and the other has 25%. The data doesn&apos;t diagnose anything, but it makes patterns visible that would otherwise stay hidden behind the same commit log.&lt;/p&gt;
&lt;p&gt;An engineer whose ratio has sat at 1.0 for weeks might not have made the shift from writing alongside the agent to directing it - or might be deep in a planning phase that hasn&apos;t reached execution yet. The same number means different things depending on where someone is in their work. The measurement doesn&apos;t replace the conversation. It gives the conversation something concrete to start from.&lt;/p&gt;
&lt;p&gt;The tools landscape is moving fast enough that today&apos;s instrumentation might not apply to tomorrow&apos;s harness. The measurement itself is scaffolding - useful during the specific window where a team is learning to work differently, not permanent infrastructure. What sticks after the scaffolding comes down isn&apos;t the dashboard. It&apos;s the instincts that developed while the data was making the transition visible.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Measurement can&apos;t capture the moment someone&apos;s relationship to the code shifts - when they stop thinking of themselves as the person who writes it and start thinking of themselves as the person who specifies what should exist. That shift is internal and gradual and doesn&apos;t map cleanly to any metric. But its shadows are in the data. The numbers don&apos;t cause the shift. They&apos;re how you notice it&apos;s underway.&lt;/p&gt;
</content:encoded></item><item><title>Support Theater</title><link>https://www.markschaake.com/posts/support-theater/</link><guid isPermaLink="true">https://www.markschaake.com/posts/support-theater/</guid><description>Six agents, one thread, zero investigation - what happens when customer support optimizes for the performance of help instead of help itself.</description><pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My son got $40 in Gorilla Tag gift cards for his birthday. He has a parent-managed Meta account - the kind Meta requires for kids under 13 - and the gift cards wouldn&apos;t redeem to his account. So I bought him $50 in Meta credits directly, thinking that would let him buy what he wanted in-game. Those didn&apos;t work either. $90 total, locked somewhere in Meta&apos;s system, inaccessible.&lt;/p&gt;
&lt;p&gt;I opened a support case. Case #09205181. What followed was the single worst customer support experience I&apos;ve had with any company, and I&apos;ve been through Comcast hold queues.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The first agent asked for my full name, my son&apos;s account details, and the gift card codes. I provided everything. The second agent greeted me warmly - fresh introduction, fresh name, same questions. I provided everything again. The third agent did the same thing. Then the fourth. Then the fifth.&lt;/p&gt;
&lt;p&gt;Five different agents in a single day, each one opening with a variation of &quot;Hi Mark, welcome back to Meta Store Support, hope this email finds you well.&quot; Each one asking for information sitting right there in the thread they were replying to. None of them investigating anything.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The fifth agent, Cullen, tried something new. Instead of asking me for information I&apos;d already given, he asked me to do his job - log into auth.meta.com with my son&apos;s account and report back whatever error messages I saw. The support process had outsourced the investigation to the customer.&lt;/p&gt;
&lt;p&gt;I replied. I was direct. I pointed out the pattern - five agents, five identical greetings, five requests for the same information, zero investigation. I asked them to stop and actually look at what I&apos;d already sent. I asked for escalation.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Agent number six replied within the hour. Cherry. &quot;Hi Mark, welcome back to Meta Store Support. Hope this email finds you well.&quot;&lt;/p&gt;
&lt;p&gt;She asked if I had attempted to redeem the gift card. The thing I&apos;d explained in my very first message. The thing five agents had already asked about. The thing sitting in the thread she was replying to.&lt;/p&gt;
&lt;p&gt;Then she wrote this sentence: &quot;After we receive the information you have provided, we will be happy to investigate further.&quot;&lt;/p&gt;
&lt;p&gt;She acknowledges the information exists - I&apos;ve &lt;em&gt;provided&lt;/em&gt; it - and then says they need to receive it before they can act. It&apos;s contradictory within a single sentence.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Nobody is reading anything. These aren&apos;t humans - the pattern makes that obvious. An agent opens the case, scans for a category, generates a warm greeting and a template response, and hands it off. The next one does the same thing. Six times. Each with a different name, each with the same inability to actually read the thread or do anything with the information in it. It&apos;s support as performance - every element of helpfulness present except the help itself.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;What makes this worse is the context. Meta built the parent-managed account system. They require it for kids under 13. They sell gift cards designed to be redeemed on these accounts. And when their own system breaks at the intersection of these two things they built, there&apos;s no one on the other end who can actually look at the account and figure out what&apos;s wrong. Six agents deep and not a single one has touched the actual system.&lt;/p&gt;
&lt;p&gt;$90 isn&apos;t a lot of money. But it&apos;s birthday money and a dad&apos;s attempt to work around a broken system, all locked behind a support process that exists to look like support without functioning as support. The cruelty isn&apos;t in the dollar amount - it&apos;s in the theater.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Then the account got suspended. Age verification, according to Meta - triggered, I can only assume, by the support thread itself. I asked for help, asked for escalation, and the system responded by locking the account.&lt;/p&gt;
&lt;p&gt;To unsuspend, I need to enter my son&apos;s birthday - the exact date I used when I created the account. I don&apos;t use my kids&apos; real birthdays on platforms - I use dates close to their real ones because I don&apos;t want their personal information sitting in systems I don&apos;t control. My son is well past Meta&apos;s age threshold, but I can&apos;t remember the precise date I entered when we created his account, and Meta&apos;s verification requires an exact match. No alternative path, no human to work through it with.&lt;/p&gt;
&lt;p&gt;The account is effectively bricked. Not because my son is too young, not because we violated any policy - because I asked for support, something in the process flagged the account, and the recovery gate assumes perfect recall of a date I deliberately made inexact for privacy reasons.&lt;/p&gt;
&lt;p&gt;My son bought his Quest with his own money last summer. He&apos;s concluded that Meta stole his birthday money and his account. We&apos;re buying a &amp;lt;a href=&quot;https://store.steampowered.com/sale/steamframe&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Steam Frame&amp;lt;/a&amp;gt; as soon as they&apos;re available, and Meta won&apos;t see another dollar from us.&lt;/p&gt;
</content:encoded></item><item><title>From Assisted to Agentic</title><link>https://www.markschaake.com/posts/from-assisted-to-agentic/</link><guid isPermaLink="true">https://www.markschaake.com/posts/from-assisted-to-agentic/</guid><description>AI tool adoption is nearly universal among engineering teams, but the workflows underneath haven&apos;t moved.</description><pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I did a fractional CTO engagement last year where I spent weeks building with agentic engineering workflows - specs before prompting, validation as real-time feedback, AI as a system component rather than a typing assistant. When it came time to hand the work off, the gap between how I&apos;d been building and how the team was working was enormous. Not a gap in talent or effort - a gap in methodology. I spent time teaching, trying to transfer not just the tools but the framework around them. Whether it stuck, I don&apos;t know.&lt;/p&gt;
&lt;p&gt;In conversations with engineering leaders at other companies since, I hear the same starting point. Some say their teams have adopted AI tools, but when I ask how it&apos;s changed the way they work, the answer gets vague. Others are openly skeptical that AI has much to offer their engineering process at all - unaware, it seems to me, of how much the landscape has shifted underneath them.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The data confirms that what I saw in that handoff is the norm. Deloitte&apos;s &amp;lt;a href=&quot;https://www.deloitte.com/us/en/about/press-room/state-of-ai-report-2026.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;2026 State of AI report&amp;lt;/a&amp;gt; surveyed over 3,000 leaders and found that 37% of organizations are using AI at a surface level with little or no change to existing processes, and only 30% are actually redesigning workflows around the tools. Stack Overflow&apos;s &amp;lt;a href=&quot;https://survey.stackoverflow.co/2025/ai/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;2025 developer survey&amp;lt;/a&amp;gt; found 84% of developers using or planning to use AI tools, but 66% cite &quot;almost right&quot; code as their biggest frustration - output close enough to look useful but wrong enough to require debugging.&lt;/p&gt;
&lt;p&gt;And the abandonment numbers tell the same story. An &amp;lt;a href=&quot;https://www.ciodive.com/news/AI-project-fail-data-SPGlobal/742590/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;S&amp;amp;P Global survey&amp;lt;/a&amp;gt; found 42% of companies pulled back from most AI initiatives in 2025, up from 17% the year before - not because the tools got worse, but because the organizations never adapted around them. The tools are installed but the workflows haven&apos;t moved.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The clearest picture of what the gap looks like in practice comes from &amp;lt;a href=&quot;https://www.tweag.io/blog/2025-10-23-agentic-coding-intro/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Tweag&amp;lt;/a&amp;gt;, which ran a controlled experiment - two engineering squads building the same product, one using traditional methods, the other using structured agentic workflows. The agentic team had 30% fewer engineers and delivered in half the time, with consistent code quality verified by SonarQube and human reviewers.&lt;/p&gt;
&lt;p&gt;The important word there is &lt;em&gt;structured&lt;/em&gt; - the agentic team didn&apos;t just have better autocomplete, they worked differently. Specs before prompting, validation tools as feedback loops, a deliberate workflow architecture that treated the AI as a system component rather than a typing assistant. That&apos;s what I was trying to transfer in the handoff - not a tool recommendation but a way of working.&lt;/p&gt;
&lt;p&gt;&amp;lt;a href=&quot;https://every.to/source-code/inside-the-ai-workflows-of-every-s-six-engineers&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Every.to&amp;lt;/a&amp;gt; documents something similar at a smaller scale - six engineers running four AI products, multiple AI instances in parallel for different task streams. The gap between &quot;we have Copilot&quot; and &quot;we have structured agentic workflows&quot; is mostly a gap of deliberate design, not capability.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;What makes this harder to navigate is that the productivity gains are genuinely difficult to measure - and teams consistently misjudge them. When &amp;lt;a href=&quot;https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;METR studied&amp;lt;/a&amp;gt; experienced developers working on real tasks from their own codebases, the developers believed they were faster with AI tools. They were actually slower. That perception gap - feeling productive while measurably losing ground - ran consistently across the study and suggests that self-reported gains from AI tooling deserve more skepticism than they&apos;re getting.&lt;/p&gt;
&lt;p&gt;And the organizational dynamics compound the problem. &amp;lt;a href=&quot;https://jellyfish.co/blog/2025-ai-metrics-in-review/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Jellyfish analyzed over two million pull requests&amp;lt;/a&amp;gt; and found that full AI adoption increased PRs merged per engineer by 113%, but review time increased by 91% - more code flowing through a review pipeline that wasn&apos;t designed for the volume, with &amp;lt;a href=&quot;https://www.faros.ai/blog/ai-software-engineering&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;no correlation&amp;lt;/a&amp;gt; between individual productivity gains and what organizations actually shipped.&lt;/p&gt;
&lt;p&gt;Small teams have a structural advantage here - fewer handoffs, shorter feedback loops, less process between the developer and the deployment. At five people, the person writing with AI assistance often reviews their own output. At twenty or fifty, the review queue backs up and individual gains disappear into organizational friction.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;But the productivity framing might be the wrong lens entirely. Since I started working with Claude Code in 2025, the shift I&apos;ve felt most isn&apos;t speed - it&apos;s scope. Early on I built an AI microservice that replaced an AWS Textract dependency in my SaaS - better extraction quality at lower cost. Without agentic workflows, that project would have meant weeks of dedicated focus - reading documentation, researching APIs, the kind of concentrated effort that demands its own block of calendar. Instead I built it in parallel with higher-priority work, almost as a side channel. The leverage isn&apos;t just that each task gets faster - it&apos;s that tasks which previously required dedicated focus can now run alongside other work, which means the total surface area of what one person can take on expands in ways that a speed metric doesn&apos;t capture.&lt;/p&gt;
&lt;p&gt;&amp;lt;a href=&quot;https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Anthropic&apos;s internal engineering study&amp;lt;/a&amp;gt; found the same pattern at scale - they surveyed 132 engineers and discovered that 27% of Claude-assisted work consisted of tasks that &lt;em&gt;wouldn&apos;t have been done otherwise&lt;/em&gt;. Backend engineers building UIs. Researchers creating data visualizations. Not the same things faster, but things that wouldn&apos;t have been attempted at all. A five-person team operating with the breadth of a fifteen-person team isn&apos;t faster - it&apos;s structurally different.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The teams that haven&apos;t made this transition aren&apos;t behind on tooling - the tools are a commodity, everyone has access to roughly the same models and capabilities. What separates the Tweag experiment from the 42% that abandoned their initiatives is whether the team redesigned how they work or just added a new tool to the existing process.&lt;/p&gt;
</content:encoded></item><item><title>The Layer Above</title><link>https://www.markschaake.com/posts/the-layer-above/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-layer-above/</guid><description>The Codex-vs-Claude-Code debate is comparing the wrong layer.</description><pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Codex-vs-Claude-Code discourse has been hard to avoid lately. OpenAI launched GPT-5.3-Codex with a dedicated desktop app, and the takes landed fast - Codex is the autonomous co-worker, Claude Code is the interactive collaborator. The framing is tidy: if you can write clean specs, hand them to Codex and walk away. If your work is messier, more exploratory, more iterative, Claude Code meets you where you are. Pick the right tool for the job.&lt;/p&gt;
&lt;p&gt;I installed Codex and was about to start testing it when a familiar feeling pulled me back - the sense that evaluating was detracting from the actual work, that I wasn&apos;t yet convinced this would be a good use of time right now. And the question I kept circling wasn&apos;t really about Codex at all - it was whether the comparison framework itself is the right lens for what matters.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The models underneath are converging. &amp;lt;a href=&quot;https://www.vals.ai/benchmarks/swebench&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;SWE-bench Verified&amp;lt;/a&amp;gt; has Opus 4.5 at 80.9%, Opus 4.6 at 80.8%, and GPT-5.2 Codex at 80.0% - essentially a dead heat. The capability gap that might justify switching tools isn&apos;t in the numbers. Where the tools actually differ is in the harness - the layer above the model: the interface, the defaults, the philosophy of how much autonomy the tool assumes out of the box. Claude Code, Codex, Pi - different harnesses, different fittings, same converging capability pool underneath.&lt;/p&gt;
&lt;p&gt;Some of the analysis being published draws broad conclusions from narrow evidence. A &amp;lt;a href=&quot;https://blog.nilenso.com/blog/2026/02/12/codex-cli-vs-claude-code-on-autonomy/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Nilenso blog post&amp;lt;/a&amp;gt; characterized Codex as a &quot;scripting-proficient intern&quot; and Claude Code as a &quot;senior developer who asks clarifying questions&quot; - but this was based largely on reading system prompts, not extended use. In practice, Claude Code with permissions bypassed and a clear plan doesn&apos;t pause to ask anything. It executes. The characterization describes a default out-of-the-box posture, not how the tool actually behaves once configured for autonomous work.&lt;/p&gt;
&lt;p&gt;Meanwhile, a &amp;lt;a href=&quot;https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;METR study&amp;lt;/a&amp;gt; found that experienced developers using AI tools were 19% slower on average while believing they were 20% faster. If experienced developers can&apos;t reliably tell whether their tools are helping, the confident declarations about which tool is better for which task deserve more skepticism than they&apos;re getting.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The framing that Codex is the right tool for autonomous, spec-driven work treats Claude Code&apos;s interactive default as a limitation. It&apos;s not - it&apos;s a starting position.&lt;/p&gt;
&lt;p&gt;I&apos;ve been building on top of Claude Code for the better part of a year. The result is a &amp;lt;a href=&quot;/posts/expedition-excursion-errand&quot;&amp;gt;workflow framework&amp;lt;/a&amp;gt; that ranges from fully hands-off parallel execution across git worktrees to conversational single-agent work - the kind of thing that accrues naturally when you stay with a tool long enough to understand where its edges are and how to extend them. The &quot;hand it off and walk away&quot; experience isn&apos;t unique to Codex. It&apos;s what depth produces on any capable foundation.&lt;/p&gt;
&lt;p&gt;So when the comparison discourse says Codex is better for autonomous work and Claude Code is better for iterative work, what it&apos;s really comparing is one harness&apos;s out-of-the-box experience against another harness&apos;s &lt;em&gt;customized&lt;/em&gt; experience. That&apos;s not a comparison of harnesses - it&apos;s a comparison of how fitted each one is.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The practical economics make the depth question concrete. Claude Code Max is $200 a month. That&apos;s not a trial subscription - it&apos;s a commitment that shapes how I work, because it &lt;em&gt;needs&lt;/em&gt; to justify itself. Adding Codex at a similar price point means $400+ a month in AI tooling alone, plus the attention cost of maintaining fluency in both. For a solo developer, that&apos;s material.&lt;/p&gt;
&lt;p&gt;I tried Claude Code&apos;s fast mode recently - same Opus 4.6 model, faster output, pay-as-you-use on top of Max. The shift was immediate: a different flow state, tighter iteration, less time spent waiting and losing context. I pulled back after a short run because the cost added up quickly, but the experience clarified something - the faster responses created a different kind of attention, with less drift between intention and result. That&apos;s not just speed, it&apos;s a different way of working. OpenAI&apos;s Codex-Spark promises dramatically faster generation, but it&apos;s a smaller model with a reduced context window and lower benchmarks - speed traded for capability. I&apos;d rather wait a beat longer for the full model than get faster answers from a smaller one.&lt;/p&gt;
&lt;p&gt;And the financial picture keeps shifting underneath. Open-source models arrive competitive and often debut with free usage periods on OpenRouter. The floor of what capable tooling costs is dropping steadily. What requires a $200/month subscription today might be available for a fraction of that in six months - or open-source entirely. The financial investment creates a natural pull toward depth, not because switching is intellectually wrong, but because the budget and attention required to maintain expertise across multiple platforms is a real constraint - one the comparison articles almost never mention.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The pattern extends beyond any one developer&apos;s setup. &amp;lt;a href=&quot;https://fortune.com/2026/01/29/100-percent-of-code-at-anthropic-and-openai-is-now-ai-written-boris-cherny-roon/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Anthropic reports&amp;lt;/a&amp;gt; that roughly 90% of Claude Code&apos;s codebase is now Claude-written - the tool building itself. The community around these tools is doing something similar: skills ecosystems, custom frameworks, plugins that extend the foundation in ways the original developers didn&apos;t anticipate.&lt;/p&gt;
&lt;p&gt;The leverage I&apos;ve found doesn&apos;t come from the tool itself - it comes from what&apos;s built on top of it. The tool is the foundation. The workflow layer - the custom skills, the calibrated sense of when to trust output and when to verify, the muscle memory of how to prompt effectively for a specific codebase - is where the actual leverage lives.&lt;/p&gt;
&lt;p&gt;Switching means abandoning that layer - or so I assumed. In practice, the file-level infrastructure is becoming more portable than you&apos;d expect. CLAUDE.md is readable by Pi, by Codex, by most of the emerging tools. Skills conventions are standardizing. The literal configs and agent-facing instructions could, in theory, travel with me. Whether they actually work as well in a different tool&apos;s runtime is something I haven&apos;t tested, and the difference between theoretical portability and practical portability might be the whole game.&lt;/p&gt;
&lt;p&gt;But the deeper investment isn&apos;t in the files - it&apos;s in the intuition - the calibrated feel for how a specific model thinks, where it drifts, what prompting patterns produce reliable output for a particular codebase. That intuition has a natural expiration date, since new model versions shift the behavior underneath, but while it lasts it directly shapes the flow state that makes depth productive. It&apos;s the part of the layer above that doesn&apos;t port, and it&apos;s the part the portability arguments never mention.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I installed Codex and never spun it up. I&apos;d spent a few hours with &amp;lt;a href=&quot;https://github.com/badlogic/pi-mono&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Pi&amp;lt;/a&amp;gt; not long before that - another harness, this one built for customizability, the kind that appeals to the Emacs-tinkerer part of my brain - but felt the same pull back toward the work already in progress. The tension arrived at the moment of installation, not after hours of evaluation. What kept stopping me wasn&apos;t loyalty to Claude Code or certainty that I&apos;d chosen right. It was the recognition that constant sampling has its own cost, and that I hadn&apos;t yet been deliberate about when to look up from the work and when to stay in it.&lt;/p&gt;
</content:encoded></item><item><title>The Stack You Carry</title><link>https://www.markschaake.com/posts/the-stack-you-carry/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-stack-you-carry/</guid><description>Most political disagreements aren&apos;t about different values - they&apos;re about different rankings of the same values.</description><pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Someone I care about lobbed a political question at me recently - the kind that&apos;s really an invitation to pick a side. I gave a nothing answer and let it die. Not because I don&apos;t have opinions, but because I&apos;ve had this conversation enough times to know the shape of it. We&apos;d circle the same ground - not actually disagreeing about the facts, not even disagreeing about what matters. We both care about fairness, about safety, about institutions working the way they should. The conversation would go nowhere anyway, because we rank those things differently and neither of us can quite hear the other&apos;s ranking as legitimate.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I&apos;ve been thinking about this pattern for years, and the best metaphor I&apos;ve found is a stack - a personal, weighted ranking of what matters most. Not a list of beliefs or positions. A &lt;em&gt;priority ordering&lt;/em&gt; of values.&lt;/p&gt;
&lt;p&gt;Everyone carries one, whether they&apos;ve examined it or not. Freedom, safety, fairness, loyalty, autonomy, tradition, compassion, personal responsibility - most people care about most of these things. The differences show up in the ordering. When freedom and safety conflict, which one wins? When compassion and personal responsibility pull in opposite directions, which instinct is louder?&lt;/p&gt;
&lt;p&gt;Two people can hold identical values and reach opposite conclusions about the same policy, the same event, the same vote - because their stacks are ordered differently. One person ranks economic freedom above collective safety. Another reverses that. Neither has to be lying, confused, or morally broken. They&apos;re just carrying different stacks.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Stacks aren&apos;t static. Mine has shifted more than once. What I ranked highest at twenty-five isn&apos;t what I rank highest now - parenthood alone moved things I thought were fixed. Some values I treated as foundational turned out to be inherited assumptions I hadn&apos;t examined. Others I dismissed as sentimental turned out to be load-bearing.&lt;/p&gt;
&lt;p&gt;And they&apos;re not logically clean. Someone can rank personal freedom at the top and still want the government to regulate the things that scare them. Someone can value fiscal responsibility and support programs that cost more than they save, because compassion outweighs the math in that particular case. The contradictions aren&apos;t a failure of reasoning. They&apos;re a feature of actually holding multiple values at once.&lt;/p&gt;
&lt;p&gt;The stack is also only part of the input. Assumptions do real work underneath the ranking - beliefs about how the world operates that shape which conclusions the stack produces. If I assume government is generally competent at managing large-scale investment, my stack generates different policy conclusions than if I assume it&apos;s generally wasteful, even if the stack itself is identical. Two people with the same values &lt;em&gt;and&lt;/em&gt; the same ordering can still land in different places because they&apos;re working from different models of cause and effect. The whole system is messier than the metaphor suggests - values tangled with assumptions tangled with experience tangled with things absorbed before anyone was old enough to examine them.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Some people resolve the difficulty by not maintaining their own stack at all. They adopt one wholesale from an authority - a political leader, a party platform, a media figure, a church. The appeal is real: a prefabricated stack comes with a community, clear answers, and no ambiguity about which value takes priority. It&apos;s the mutual fund version of a value system - managed on someone else&apos;s behalf.&lt;/p&gt;
&lt;p&gt;And like a mutual fund, it scales. When thousands of people carry the same prefabricated stack, it becomes loud, coordinated, self-reinforcing. People still doing the individual work - still living with the ambiguity of values that genuinely conflict - are quieter by nature. Individual stacks are harder to organize around, harder to reduce to a slogan. The richness of genuine value reasoning gets drowned out by the sheer volume of the outsourced version.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;This is part of what makes political polarization feel so intractable. The divide isn&apos;t primarily about different facts or even different values - it&apos;s about the refusal to accept that someone else&apos;s &lt;em&gt;ordering&lt;/em&gt; could be valid. When someone ranks economic freedom above collective safety, or vice versa, the instinct isn&apos;t to hear a different priority. It&apos;s to hear &quot;I don&apos;t care about safety&quot; or &quot;I don&apos;t care about freedom.&quot; The reduction of a stack to a single position - the flattening - is where conversations break down.&lt;/p&gt;
&lt;p&gt;If there were one correct stack - a universal ordering everyone should arrive at - then disagreement would just be error, and the conversation would be about correction. That&apos;s a cleaner world. It has clear sides and a right answer. The alternative - that someone I love might rank something I care deeply about quite low, and that this isn&apos;t a moral failure on their part - requires a kind of tolerance that feels like giving ground.&lt;/p&gt;
&lt;p&gt;The outsourced stacks make the flattening worse. A managed portfolio comes with built-in rhetoric for why the &lt;em&gt;other&lt;/em&gt; ordering is not just different but wrong. It makes the holder fluent in dismissal without requiring them to engage with the actual structure of the disagreement - which is almost never about the values themselves, and almost always about the ordering.&lt;/p&gt;
</content:encoded></item><item><title>A Higher Gear</title><link>https://www.markschaake.com/posts/a-higher-gear/</link><guid isPermaLink="true">https://www.markschaake.com/posts/a-higher-gear/</guid><description>AI agents promise to clear the noise from daily life. The temptation is to fill the space with more ambition, not more presence.</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The closest thing I have to a second brain is a directory on my laptop and a Claude Code session. I open a terminal, type &lt;code&gt;/start-day&lt;/code&gt;, and my AI partner reads my journal, checks my calendar, surfaces reminders, tells me where things were left off, checks email and drafts replies. It works well - better than org-mode ever did, better than any productivity app I&apos;ve tried. But it only thinks when I&apos;m sitting in front of it. The moment I close the terminal, the second brain goes quiet.&lt;/p&gt;
&lt;p&gt;I&apos;ve been wanting something that doesn&apos;t go quiet. An open-source agent platform called OpenClaw promised exactly that - agents with heartbeats, periodic check-ins where the system considers whether something needs attention. Cron jobs that triage email overnight, surface calendar conflicts before I wake up, notice that an invoice hasn&apos;t been paid. The distinction matters: one system is a thinking partner I visit, the other is a staff that&apos;s always working.&lt;/p&gt;
&lt;p&gt;This morning I got ambitious and tried to set up an accounting agent. The bigger vision is to eventually replace the accounting firm I pay to handle things across my personal and business finances - but the starting point was deliberately minimal. Read-only access to QuickBooks, just enough to test whether an agent could surface what needed attention without touching the actual books. Even that turned into a mess. Getting API access meant registering a developer app through QuickBooks Online&apos;s developer program, and when I finally got to the permissions screen, the only available scope was read-write. No read-only option. So the minimal safe starting point I&apos;d imagined didn&apos;t actually exist. Then the OpenClaw configuration got tangled somewhere along the way, and I found myself staring at a broken installation I no longer had the energy to untangle. I uninstalled the whole thing. Maybe I&apos;ll come back when things mature, or when the next wave of ambition hits.&lt;/p&gt;
&lt;p&gt;But the shape of what it &lt;em&gt;could&lt;/em&gt; become is clear enough to think about, even if the reality isn&apos;t there yet.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The end state I keep imagining isn&apos;t a conversation with an AI. It&apos;s the absence of all the things that currently crowd out what matters.&lt;/p&gt;
&lt;p&gt;Right now, a typical morning involves checking three email accounts, processing invoices, following up on messages that fell through cracks, reviewing calendar conflicts, triaging reminders. None of this is hard. All of it is &lt;em&gt;present&lt;/em&gt; - it takes up cognitive space, it fragments attention, it sits in the background of every conversation with a low hum of &quot;I still need to deal with that.&quot;&lt;/p&gt;
&lt;p&gt;The promise of always-on agents isn&apos;t that they&apos;d do these things &lt;em&gt;better&lt;/em&gt; than I do. It&apos;s that they&apos;d do them &lt;em&gt;instead&lt;/em&gt; of me. And what opens up in that space is the interesting question.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I lift with my oldest son when we can make it work - a couple afternoons a week if we&apos;re lucky, though the workout matters less than the time together. I homeschool my younger son - or try to, between everything else. My wife and I are touring colleges with our daughter. I play tennis with a lifelong friend. I pray, sometimes well, sometimes going through the motions. These are the things that, when I&apos;m honest about it, make up the actual substance of a good life - the relationships, the physical presence, the spiritual grounding that I consistently say matters most and consistently give the scraps of my attention after everything else has been handled.&lt;/p&gt;
&lt;p&gt;The vision isn&apos;t AI as companion. It&apos;s AI as the thing that handles everything else so that when I&apos;m sitting across from my son working through a math lesson, I&apos;m &lt;em&gt;there&lt;/em&gt; - not half-thinking about whether I replied to that email.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;If the noise were genuinely handled - if the email triage and the invoice tracking and the calendar management and the reminder surfacing all just &lt;em&gt;worked&lt;/em&gt;, reliably, in the background, without my involvement - what would I do with the space?&lt;/p&gt;
&lt;p&gt;I&apos;d like to say I&apos;d pray more, be more present with my kids, read more deeply, invest in friendships with the kind of attention they deserve. And maybe I would. But I&apos;ve had periods of less work before - between projects, during transitions - and the pattern isn&apos;t always encouraging. Sometimes the space fills with different noise. New projects, new rabbit holes, new ways to stay busy that feel productive but aren&apos;t the things I claim to care about most.&lt;/p&gt;
&lt;p&gt;And there&apos;s a more aggressive version of that failure mode - the one I&apos;m most susceptible to. If AI handles the tedium and frees up two hours a day, the competitive instinct doesn&apos;t say &quot;go be present with your family.&quot; It says &quot;now you can take on &lt;em&gt;more&lt;/em&gt;.&quot; Another client, another SaaS idea, another repo. The rat race doesn&apos;t disappear when the tools get better - it just offers a higher gear. Freed capacity looks a lot like opportunity, and I have a long pattern of seeing how far I can leverage myself, testing what&apos;s possible when the constraints loosen. I suspect I have a lot of company in that.&lt;/p&gt;
&lt;p&gt;What makes the pull harder to resist is that it&apos;s not just ambition talking - it&apos;s genuine excitement. This is a wild time to be alive with a technical background. The things I can build right now, the leverage I can get from tools that barely existed a year ago, the problems I can take on that would have been absurd to attempt solo - it&apos;s &lt;em&gt;thrilling&lt;/em&gt;. And the window is probably temporary. A couple of years, maybe less, before the tools mature enough that everyone has access to the same leverage.&lt;/p&gt;
&lt;p&gt;Whether leaning into that window turns out to be the right call is something only hindsight will answer. But the pull is real, and it sits right at the center of the decision.&lt;/p&gt;
&lt;p&gt;The technology doesn&apos;t resolve this tension - it sharpens it. AI agents that clear the noise give me a choice that was always mine but is becoming more stark: use the space for presence, or use it for &lt;em&gt;more&lt;/em&gt;. That&apos;s not a software problem. It&apos;s a formation question, and the person responsible for the answer has always been the individual. What I do with my time is mine to decide. The tools just make the decision harder to avoid.&lt;/p&gt;
</content:encoded></item><item><title>Finding the Edge of AI Trust</title><link>https://www.markschaake.com/posts/finding-the-edge-of-ai-trust/</link><guid isPermaLink="true">https://www.markschaake.com/posts/finding-the-edge-of-ai-trust/</guid><description>What happens when you over-trust your AI agents and misdirect a client&apos;s leadership team.</description><pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;m not interested in 10x productivity from AI. I&apos;m chasing something bigger.&lt;/p&gt;
&lt;p&gt;The safe way to use AI agents - reviewing every output, second-guessing every decision - gets maybe 2-3x. Useful, but not transformative. The real leverage lives at the edge: trusting the AI enough to move fast, catching mistakes through tight feedback loops rather than constant oversight.&lt;/p&gt;
&lt;p&gt;The problem with edges is that sometimes I fall over them. This week I did.&lt;/p&gt;
&lt;p&gt;The fall cost me 1.5 hours of throwaway development, a credit to my client, and - harder to quantify - some trust. All because of a GET vs POST.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I was building an AI classification feature for a client that depended on a third-party API. The API had recently changed versions, and the fields we needed were coming back empty.&lt;/p&gt;
&lt;p&gt;Stakes were high - without this data, the feature couldn&apos;t work. This was exactly the kind of ambiguous debugging task where AI agents shine, or where they can lead someone confidently off a cliff.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I asked Claude to investigate. It dove in, examined the responses, and delivered a verdict with characteristic confidence: &quot;The provider no longer has access to this data.&quot;&lt;/p&gt;
&lt;p&gt;That made sense.&lt;/p&gt;
&lt;p&gt;I asked Claude to verify. It ran cURL tests, examined response structures, and produced a detailed markdown report. Responses came back 200 OK, data returned, but the target fields were empty.&lt;/p&gt;
&lt;p&gt;The report looked thorough, professional, convincing. This is where the edge gets dangerous - not when AI is obviously wrong, but when it&apos;s confidently, &lt;em&gt;plausibly&lt;/em&gt; wrong.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I shared Claude&apos;s report with my client as evidence of the problem. They forwarded it to the third-party&apos;s support team. Major stakeholders got pulled into discussions about workarounds and alternatives.&lt;/p&gt;
&lt;p&gt;This was the moment I crossed the edge. The edge isn&apos;t crossed when AI makes a mistake - it&apos;s crossed when that mistake gets propagated outward.&lt;/p&gt;
&lt;p&gt;Meanwhile, I&apos;d already moved on. Built an alternative solution in about 90 minutes. AI-first development is fast like that. By the time anyone could catch the error, I&apos;d traveled far past it.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Twenty-four hours later, support responded: it&apos;s a GET vs POST issue.&lt;/p&gt;
&lt;p&gt;The GET endpoint I&apos;d been hitting - that Claude had been testing - isn&apos;t documented or officially supported. Use POST with a JSON body and everything works. All the data is there.&lt;/p&gt;
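&lt;p&gt;The shape of the failure is easy to reproduce in a sketch. The endpoint and field names below are hypothetical - the real API sits behind the client&apos;s authentication - but the pattern was exactly this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical reproduction - endpoint and field names are illustrative.
// The undocumented GET endpoint responds 200 OK but omits the target fields:
const viaGet = await fetch(&quot;https://api.example.com/v2/records/123&quot;);
console.log((await viaGet.json()).classification); // undefined

// The supported POST endpoint returns the same record with the data intact:
const viaPost = await fetch(&quot;https://api.example.com/v2/records/query&quot;, {
  method: &quot;POST&quot;,
  headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },
  body: JSON.stringify({ id: &quot;123&quot; }),
});
console.log((await viaPost.json()).classification); // populated
&lt;/code&gt;&lt;/pre&gt;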
&lt;p&gt;The kicker: I couldn&apos;t have verified this earlier. The API documentation was behind authentication I didn&apos;t have access to. We were debugging blind - and the edge is hardest to see when there&apos;s no ground truth.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The technical problem evaporated, but the real damage didn&apos;t.&lt;/p&gt;
&lt;p&gt;Stakeholders had wasted time on a non-problem. I&apos;d built 1.5 hours of code that went straight to the trash. Worse, I&apos;d confidently led my client down the wrong path, backed by a professional-looking report that turned out to be confidently wrong.&lt;/p&gt;
&lt;p&gt;The cost of falling over the edge scales with speed - AI-first development is fast, so when the fall comes, it goes far.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I credited the client three hours - more than the total dev time wasted. It wasn&apos;t really about the money. It was about acknowledging that I&apos;d caused the disruption, even if the root cause was &quot;the AI told me so.&quot;&lt;/p&gt;
&lt;p&gt;&quot;The AI was wrong&quot; isn&apos;t an excuse. I&apos;m the one who trusted it. I&apos;m the one who escalated it. The buck stops with the human.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;This wasn&apos;t an AI catastrophe - just a mundane mistake amplified by speed and confidence. But it showed me where the edge actually is.&lt;/p&gt;
&lt;p&gt;The edge is where AI is confidently, &lt;em&gt;plausibly&lt;/em&gt; wrong. Not obviously wrong - that&apos;s easy to catch. Not randomly wrong - that wouldn&apos;t inspire trust in the first place. The danger zone is when the AI&apos;s answer is confident, coherent, and fits the existing mental model. That&apos;s when it&apos;s most likely to get propagated without verification.&lt;/p&gt;
&lt;p&gt;Speed amplifies distance past the edge - I had an alternative solution built before support even replied. That&apos;s the value proposition of AI-assisted development, and the risk. The faster the workflow, the more braking mechanisms matter.&lt;/p&gt;
&lt;p&gt;The edge moves, and this might be the hardest part. AI capabilities shift constantly - model updates, service degradation, the inherent stochasticity of responses. And the movement isn&apos;t always forward. &lt;a href=&quot;https://marginlab.ai/trackers/claude-code/&quot;&gt;Independent benchmarks&lt;/a&gt; have detected statistically significant performance regressions in Claude Code over 30-day windows (based on daily SWE-Bench-Pro evaluations). A workflow that kept me safely on the right side last month might cross the edge today. Calibrating once and forgetting isn&apos;t an option. Staying close to the edge requires ongoing navigation.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I don&apos;t have clean rules for staying close to the edge - it moves as AI capabilities shift. What I do have are questions I&apos;m still working through.&lt;/p&gt;
&lt;p&gt;When to bet versus wait: I built 90 minutes of throwaway code, which is a reasonable bet with bounded downside. But I also escalated an unverified diagnosis to stakeholders, and that&apos;s where the real cost came from. The code was cheap, the misdirection wasn&apos;t.&lt;/p&gt;
&lt;p&gt;How to communicate the tradeoff: maybe the right move is setting expectations upfront - &quot;We can wait 24 hours for certainty, or we can bet on the plausible answer and keep moving, knowing that rework is part of the deal.&quot; That&apos;s a conversation to have before being mid-crisis.&lt;/p&gt;
&lt;p&gt;What signals to watch for: confident AI plus plausible explanation plus no ground truth. Not a reason to stop, but a sign that a bet is being placed.&lt;/p&gt;
&lt;p&gt;I paid for this lesson with 1.5 hours of code, a credit to my client, and some trust. Probably cheap tuition. The edge is where the leverage lives - and where the falls happen.&lt;/p&gt;
</content:encoded></item><item><title>Outsourcing the Thinking</title><link>https://www.markschaake.com/posts/outsourcing-the-thinking/</link><guid isPermaLink="true">https://www.markschaake.com/posts/outsourcing-the-thinking/</guid><description>What happens when AI-generated artifacts substitute for human judgment - and the falls are too slow to feel until they&apos;ve already happened.</description><pubDate>Sun, 08 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A client sent me a PRD to build against. The people behind it were technical, experienced - exactly the kind of people I trust to know what they need. The document itself was thorough in the way that makes implementation feel clean: well-sectioned, specific about acceptance criteria, detailed enough that I could start building without much back-and-forth. So I started building.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;A month of implementation followed. Multiple rounds of refinement, each one tightening the alignment between what I&apos;d built and what the document described. The work felt productive in that satisfying way where there&apos;s a clear target and measurable progress toward it. I was building exactly what the PRD specified, and I could prove it.&lt;/p&gt;
&lt;p&gt;The problem wasn&apos;t visible because the artifact was doing its job perfectly - not the job of describing what the client needed, but the job of &lt;em&gt;looking like&lt;/em&gt; someone had already thought that through.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The client eventually told me the PRD hadn&apos;t really been reviewed. The requirements didn&apos;t match what they actually needed. A month of faithful implementation against a document nobody had vetted. The people who provided it were apologetic, and I wasn&apos;t angry - I&apos;d implemented exactly what I was given. But a month of work had drifted quietly in the wrong direction, and none of us had noticed because the document was good enough to make checking feel unnecessary.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I suggested we set the PRD aside entirely. Let me look at your current system, understand your actual workflows, and bridge from what I&apos;ve already built to what you need. That worked - not because I&apos;m especially good at reading minds, but because replacing the artifact with direct observation removed the thing that had been substituting for judgment. The gap between the PRD and reality wasn&apos;t a specification error. It was a &lt;em&gt;thinking&lt;/em&gt; error. The document had done the thinking, and nobody had checked whether the thinking was right.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I recognized the pattern because I&apos;d done the same thing to myself. Earlier, I&apos;d used AI to draft a contract for a fractional CTO engagement - vesting language that looked like contract language should look, right format, right sections, plausible terms. I didn&apos;t think about whether vesting was &lt;em&gt;appropriate&lt;/em&gt; for this kind of engagement. The artifact looked like thinking had happened, so I treated it like thinking had happened. The consequences played out over months and contributed to parting ways with that client. I&apos;d outsourced the thinking on my own contract terms.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I wrote about a faster version of this pattern in &lt;a href=&quot;/posts/finding-the-edge-of-ai-trust&quot;&gt;Finding the Edge of AI Trust&lt;/a&gt; - an AI debugging report that led me off a cliff in hours, with a sharp landing 24 hours later. That was a fast fall. Here, a PRD shaped a month of work and a contract shaped an entire engagement. The pattern is identical - an AI-generated artifact that looks thorough enough to skip scrutiny - but the timescale determines how deep the consequences go. The fast version gets caught in a day. The slow version plays out for weeks or months before anyone realizes the artifact was doing the thinking instead of the humans.&lt;/p&gt;
&lt;p&gt;Speed makes the fall &lt;em&gt;dramatic&lt;/em&gt;. Slowness makes it &lt;em&gt;invisible&lt;/em&gt;. The Edge post&apos;s lesson was about momentum carrying me past the point of error before I could stop. This is about never feeling the fall at all - just a gradual drift, comfortable and productive, until the distance between where I am and where I should be is too large to ignore.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I keep wondering how many other artifacts in my work are substituting for judgment without me noticing. The PRD fix was simple enough - stop building against the document and look at the actual thing. But the difficulty isn&apos;t knowing that. The difficulty is remembering to do it when the document is good enough to make scrutiny feel like a waste of time.&lt;/p&gt;
</content:encoded></item><item><title>The Assumed Room</title><link>https://www.markschaake.com/posts/the-assumed-room/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-assumed-room/</guid><description>The difference between political conviction and political assumption - and what happens to the people who think differently but stay quiet.</description><pubDate>Thu, 05 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I left a company once, partly because of how its leadership talked about politics. Not because I disagreed with every position - I didn&apos;t. Because of the &lt;em&gt;assumption&lt;/em&gt; that everyone in the room agreed.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;All-hands meetings, quarterly updates. Somewhere between the roadmap and the team announcements, political opinions delivered like weather - just stated, not argued, as if acknowledging something everyone obviously felt. There&apos;s a gap between &quot;I believe this, and here&apos;s why&quot; and &quot;obviously we all believe this.&quot; The first acknowledges that someone might disagree. The second can&apos;t imagine it. One invites a response. The other makes responding feel beside the point - socially costly, or just not worth it today.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The effect is quieter than people imagine. Nobody storms out of an all-hands. What happens is smaller - I stopped going to the optional events, started answering &quot;yeah, crazy times&quot; when politics came up at lunch, kept my head down and did good work and saved my real thoughts for people I trusted outside the building.&lt;/p&gt;
&lt;p&gt;Before I left, I had a conversation with someone on my team who&apos;d started being openly political at work - taking cues from leadership, reading the room the way it was designed to be read. I told him privately: be careful with that. You don&apos;t know how everyone here actually thinks. This isn&apos;t the place.&lt;/p&gt;
&lt;p&gt;That was as brave as I got. One private conversation with one person. The assumed room had me trained - even my pushback was quiet, behind closed doors.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The political atmosphere wasn&apos;t the only reason I left, but it was part of the texture - one of those things that makes a place feel less like yours over time.&lt;/p&gt;
&lt;p&gt;I&apos;ve thought about the pattern since, how assumed rooms stay assumed. The people who think differently don&apos;t argue - they leave. And because they leave quietly, the room never notices the gap. The people who stay all agree, which confirms the assumption, which makes the next meeting feel even more like consensus. The room gets more assumed over time, not less, because everyone who might have complicated things has already walked out.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Nobody tracks this. Exit interviews don&apos;t capture &quot;I got tired of pretending to agree with leadership&apos;s politics.&quot; The departures are always mixed, the reasons always multiple, and the pattern invisible from inside. That might be the sharpest thing about political assumption versus political conviction - conviction invites a response, even disagreement. Assumption forecloses one. And the cost is measured in absence, which nobody remaining can see.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I don&apos;t regret leaving. I do think about the fact that my bravest moment was a private conversation - telling one colleague to be careful, behind closed doors. The assumed room had me trained. Even my dissent was quiet.&lt;/p&gt;
&lt;p&gt;That&apos;s how these rooms work. Not by suppressing disagreement - by making it feel private, individual, not worth the cost. The people who think differently leave, and the room never notices, because absence doesn&apos;t register the way agreement does.&lt;/p&gt;
</content:encoded></item><item><title>Expedition, Excursion, Errand</title><link>https://www.markschaake.com/posts/expedition-excursion-errand/</link><guid isPermaLink="true">https://www.markschaake.com/posts/expedition-excursion-errand/</guid><description>A methodology for solo developers using AI agents - when to plan deeply, when to parallelize, when to iterate.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Clients want senior-level judgment and junior-level throughput. The traditional answer is to hire people, take on overhead, dilute margins. The agentic answer is different - multiply execution capacity while retaining judgment authority.&lt;/p&gt;
&lt;p&gt;This is the solo developer&apos;s paradox, and I&apos;ve been trying to solve it for months. Not through abstraction or theory, but through building actual systems and watching where they break.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The methodology I&apos;ve landed on names three modes. I think of them as positions on a spectrum - not phases that happen in order, but stances that shift based on what the work requires.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Expedition mode&lt;/em&gt; is for greenfield projects, major rewrites, new feature areas. Planning is deep, upfront, collaborative. Execution is fully autonomous, highly parallel - multiple AI agents working across multiple git worktrees simultaneously. The human is the expedition leader, planning the route while the team deploys across the terrain. Human involvement is heavy during planning, hands-off during execution.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Excursion mode&lt;/em&gt; is for significant features, refactors, integrations. Planning is moderate, focused on the specific change. Execution is semi-autonomous - maybe a few parallel worktrees, maybe sequential. Base camp is established, and the human sends agents out to explore and develop specific areas. Human involvement means approving the plan, monitoring execution, intervening if something gets stuck.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Errand mode&lt;/em&gt; is for bug fixes, small enhancements, polish work. Planning is light or inline - just part of the implementation conversation. Execution is interactive, single agent. The human and agent handle a quick trip to address something nearby - no elaborate planning needed. Human involvement is conversational, iterative, responsive.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The modes aren&apos;t discrete phases. An Expedition effort decomposes into Excursion-sized chunks. Excursion work spawns Errand-level subtasks. The shift happens fluidly based on scope.&lt;/p&gt;
&lt;p&gt;What matters is recognizing which mode fits the current work. Expedition mode with a bug fix wastes planning overhead. Errand mode with a major rewrite leads to thrashing without direction.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The methodology identifies what humans &lt;em&gt;must&lt;/em&gt; do versus what agents &lt;em&gt;can&lt;/em&gt; do. This is the leverage - not replacing human work entirely, but reducing it to what&apos;s irreducible.&lt;/p&gt;
&lt;p&gt;Judgment calls are always human. Architectural decisions, trade-off evaluation, scope definition - these require understanding context that agents don&apos;t have. So does UX and aesthetic review. An agent can verify that a button is clickable; it can&apos;t tell whether clicking it &lt;em&gt;feels&lt;/em&gt; right. Plan approval is human. Client communication is human.&lt;/p&gt;
&lt;p&gt;Code implementation is delegable. Test execution is delegable. Verification of functional requirements, documentation generation, dependency updates, formatting, linting - all delegable. These aren&apos;t less important; they&apos;re just amenable to automation.&lt;/p&gt;
&lt;p&gt;Some work is genuinely collaborative. Plan creation happens when humans guide and agents draft. Codebase exploration happens when agents search and humans direct. Test plan design happens when humans define scenarios and agents structure them. Debugging happens when agents investigate and humans decide.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The planning artifact matters more than I expected. Plans serve both human reviewers and agent executors - they need to be readable by people approving the approach &lt;em&gt;and&lt;/em&gt; parseable by agents carrying it out.&lt;/p&gt;
&lt;p&gt;Dual format works. Human-readable prose describing what the plan accomplishes, implementation steps with enough detail for autonomous execution, acceptance criteria, verification commands. Machine-parseable frontmatter with IDs, dependencies, estimated files, test suites to run.&lt;/p&gt;
&lt;p&gt;For Expedition mode, large efforts break into multiple plans. Each plan is independently executable. Dependencies are explicit - plan B waits for plan A. Plans without dependencies run in parallel. This is where the throughput multiplication comes from.&lt;/p&gt;
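&lt;p&gt;As a concrete sketch of the machine-parseable half - the field names here are my own illustration, not a fixed schema - the frontmatter carries just enough structure for an orchestrator to schedule plans:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative frontmatter shape - field names are an example, not a spec.
interface PlanFrontmatter {
  id: string;               // e.g. &quot;plan-b-reactions&quot;
  dependsOn: string[];      // e.g. [&quot;plan-a-schema&quot;] - plan B waits for plan A
  estimatedFiles: string[]; // files the plan expects to touch
  testSuites: string[];     // verification commands to run on completion
}

// Plans whose dependencies have all completed are eligible to run in parallel.
function runnable(plans: PlanFrontmatter[], done: string[]): PlanFrontmatter[] {
  return plans.filter(p =&gt; p.dependsOn.every(d =&gt; done.includes(d)));
}
&lt;/code&gt;&lt;/pre&gt;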
&lt;hr /&gt;
&lt;p&gt;Quality assurance should be automated wherever possible, agent-executable rather than requiring the human to run tests manually, and defined during planning rather than as an afterthought.&lt;/p&gt;
&lt;p&gt;Agents verify functional correctness - tests pass. They verify type safety - no type errors. They verify code quality - linting, formatting. They verify branching logic - all paths tested. They verify error states - failures handled gracefully.&lt;/p&gt;
&lt;p&gt;Humans verify UX and aesthetics - does it feel right? They verify business logic alignment - does it solve the actual problem? They verify architectural judgment - is this the right approach?&lt;/p&gt;
&lt;p&gt;Test plans get created collaboratively during planning, structured for agent execution. The human defines the scenarios; the agent structures them into something runnable.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The minimal setup is an agentic coding tool, git with worktree support, and automated test infrastructure. That&apos;s enough for Excursion and Errand modes.&lt;/p&gt;
&lt;p&gt;Expedition mode needs more - multi-plan decomposition tooling, an orchestration layer for worktree and agent management, maybe a progress dashboard. Browser testing integration if the work involves UX verification.&lt;/p&gt;
&lt;p&gt;I&apos;ve been building these tools incrementally, and they&apos;re not done. The methodology is ahead of the implementation.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The tooling landscape is still emerging. For implementation, I use &amp;lt;a href=&quot;https://claude.ai/claude-code&quot; target=&quot;_blank&quot;&amp;gt;Claude Code&amp;lt;/a&amp;gt; daily - it handles Errand and Excursion mode work well. I&apos;ve experimented with &amp;lt;a href=&quot;https://www.npmjs.com/package/@mariozechner/pi-coding-agent&quot; target=&quot;_blank&quot;&amp;gt;Pi&amp;lt;/a&amp;gt;, and there are others - Cursor, Codex, the list grows monthly.&lt;/p&gt;
&lt;p&gt;Orchestration is less mature. &amp;lt;a href=&quot;https://github.com/steveyegge/gastown&quot; target=&quot;_blank&quot;&amp;gt;GasTown&amp;lt;/a&amp;gt; is interesting - a framework for coordinating multiple agents. I&apos;ve been building a custom Claude Code plugin that handles planning facilitation and orchestration, spawning multiple Claude Code instances across git worktrees. It works, mostly. The missing piece is monitoring - a decent UI for watching what the agents are doing, which is next on the roadmap.&lt;/p&gt;
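&lt;p&gt;The heart of that fan-out is small enough to sketch. This is a simplified illustration rather than the plugin itself - it assumes Claude Code&apos;s headless print mode (&lt;code&gt;claude -p&lt;/code&gt;) and one git worktree per plan:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import { execFileSync } from &quot;node:child_process&quot;;

// Simplified illustration - the real plugin spawns these asynchronously
// and monitors them; synchronous calls keep the sketch readable.
function launchPlan(planId: string, prompt: string): void {
  const dir = `../worktrees/${planId}`;
  // One isolated worktree per plan keeps parallel agents from colliding.
  execFileSync(&quot;git&quot;, [&quot;worktree&quot;, &quot;add&quot;, dir, &quot;-b&quot;, planId]);
  // Headless run: Claude Code executes the plan prompt inside the worktree.
  execFileSync(&quot;claude&quot;, [&quot;-p&quot;, prompt], { cwd: dir, stdio: &quot;inherit&quot; });
}
&lt;/code&gt;&lt;/pre&gt;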
&lt;hr /&gt;
&lt;p&gt;Open questions remain.&lt;/p&gt;
&lt;p&gt;How to handle failures mid-execution in parallel mode? An agent hits a wall - does the orchestration stop everything, notify the human, try to recover?&lt;/p&gt;
&lt;p&gt;What&apos;s the right plan granularity for parallelization? Too coarse and the throughput gains disappear. Too fine and the coordination overhead dominates.&lt;/p&gt;
&lt;p&gt;How to communicate progress to clients during autonomous execution? The work is happening, but the developer isn&apos;t actively doing it. How does that translate to status updates and expectations?&lt;/p&gt;
&lt;p&gt;How to price this work? I&apos;ve been tracking wall hours - my time - versus machine hours - agent time - via Claude Code hooks that feed into a custom billing system. The goal is a leverage ratio above 1.5, meaning the agents work more hours than I do. Currently I bill by wall hour or fixed per project, but I&apos;m thinking about how machine time fits into the equation. The time I spend is mostly planning and verification. The work that happens is mostly agent-hours. What&apos;s the fair exchange?&lt;/p&gt;
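&lt;p&gt;The ratio itself is simple arithmetic. A minimal sketch, assuming hook events that log durations per session - the data shape is my own framing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Illustrative: leverage = machine hours worked per wall hour spent.
interface SessionLog {
  wallHours: number;    // human time: planning, review, verification
  machineHours: number; // agent time, summed across parallel agents
}

function leverageRatio(logs: SessionLog[]): number {
  const wall = logs.reduce((sum, l) =&gt; sum + l.wallHours, 0);
  const machine = logs.reduce((sum, l) =&gt; sum + l.machineHours, 0);
  return machine / wall; // target: above 1.5
}
&lt;/code&gt;&lt;/pre&gt;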
&lt;hr /&gt;
&lt;p&gt;I don&apos;t have answers to these yet. The methodology is a working hypothesis - a way of naming what I&apos;ve been doing so I can examine it, refine it, share it. Expedition, Excursion, Errand. Plan deeply when the scope demands it, parallelize when the structure allows it, iterate when the work is small.&lt;/p&gt;
&lt;p&gt;The leverage is real. The details are still evolving.&lt;/p&gt;
</content:encoded></item><item><title>Building for Communities You Belong To</title><link>https://www.markschaake.com/posts/building-for-communities-you-belong-to/</link><guid isPermaLink="true">https://www.markschaake.com/posts/building-for-communities-you-belong-to/</guid><description>What happens when you use your skills to build software for a community you&apos;re already part of.</description><pubDate>Sat, 31 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My primary social outlet - outside of family - is my tennis club. I play a few times a week, often with my lifelong best friend Dan. Tennis keeps us in regular contact, and as a result, our relationship close. There&apos;s nothing like competitive sports - the physical and mental exercise of a tennis match just &lt;em&gt;feels&lt;/em&gt; healthy. At 46, that matters more than it used to.&lt;/p&gt;
&lt;p&gt;I grew up playing tennis at this club. It&apos;s simply a special place to me. The people are good people - happy, caring, invested in something together. It&apos;s member-owned, which means we vote, we serve on committees, we show up for work parties. There&apos;s no corporate landlord. It&apos;s ours.&lt;/p&gt;
&lt;p&gt;I also happen to build software for a living. And at some point I started to wonder - what if I actually built something for &lt;em&gt;this&lt;/em&gt; place? Not for users, not for a market. For my club. For the people I play with every week.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The idea started small - our club uses software that works, more or less, but it&apos;s old and clunky and owned by a company that doesn&apos;t know us. When something breaks, we submit a ticket. When we want a feature, we wait. It&apos;s the standard vendor relationship - we&apos;re customers, not partners.&lt;/p&gt;
&lt;p&gt;I kept noticing friction. Booking courts was harder than it should be. Feedback from members disappeared into a void. The board had ideas but no easy way to act on them. And every time I noticed something, I&apos;d think: &lt;em&gt;I could fix that.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So I started building. Not a startup. Not a product. Just - software for my club.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The difference is hard to explain until you&apos;ve felt it. Building for strangers means guessing - interviewing users, running surveys, analyzing funnels, trying to understand people I&apos;ve never met. It&apos;s necessary work, but it&apos;s &lt;em&gt;work&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Building for my own community, I already knew. I knew the problems because I had them. I knew the people because I played with them. The feedback loop isn&apos;t a Slack channel or a support ticket - it&apos;s a text from a friend after Tuesday night doubles.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;There&apos;s no pitch deck. No TAM calculation. No investor asking about my go-to-market strategy. The &quot;market&quot; is 500 members, and I know half of them by name. Success isn&apos;t MRR - it&apos;s whether booking a court on Saturday morning is easier than it was last month.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;A year ago, I wouldn&apos;t have taken this on. Building a full member portal - auth, reservations, payments, admin dashboards - is a lot of work. Not impossible, but enough that it would have felt like a second job. The kind of project that starts enthusiastic and dies quietly in a half-finished GitHub repo.&lt;/p&gt;
&lt;p&gt;But the tools have changed. I build with Claude Code now, and it&apos;s genuinely different. Not &quot;AI writes my code for me&quot; different - more like having a senior engineer available at 6am when I&apos;m working before the day starts. The velocity is higher. The activation energy is lower. Projects that used to feel like a six-month commitment now feel like something I can chip away at on weekends.&lt;/p&gt;
&lt;p&gt;That shift matters. It means passion projects are actually &lt;em&gt;possible&lt;/em&gt; again. Not just for companies with engineering teams, but for one person who cares about their tennis club and knows how to ship software.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I&apos;m not suggesting everyone should build software for their tennis club. But most of us belong to &lt;em&gt;something&lt;/em&gt;. A gym, a church, a hobby group, a neighborhood association. Communities that matter to us, run by people we know, with problems we understand.&lt;/p&gt;
&lt;p&gt;What would it look like to build for them?&lt;/p&gt;
</content:encoded></item><item><title>The AI Consultant&apos;s Window</title><link>https://www.markschaake.com/posts/the-ai-consultants-window/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-ai-consultants-window/</guid><description>There&apos;s an opportunity for consultants who can wield AI coding tools. The question is how long it stays open.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;AI tools have collapsed the cost of building software. I built a 200,000-line TypeScript MVP in 7 weeks - work that would have taken a team 12-18 months before these tools existed. &amp;lt;a href=&quot;https://fortune.com/2026/01/29/100-percent-of-code-at-anthropic-and-openai-is-now-ai-written-boris-cherny-roon/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Anthropic reports 70-90% of code&amp;lt;/a&amp;gt; at the company is now AI-written. Microsoft says AI generates about 30% of theirs.&lt;/p&gt;
&lt;p&gt;For those who can wield these tools, there&apos;s an obvious opportunity - businesses need custom software, they don&apos;t have engineers who know how to use AI agents effectively, and consultants who can bridge that gap should find work.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Building a consulting practice takes years, though. Clients, reputation, systems, trust - these things compound slowly. There are slow periods to survive. By the time a practice is mature, will the tools be accessible enough that businesses don&apos;t need a consultant?&lt;/p&gt;
&lt;p&gt;This isn&apos;t a victory lap for consultants - it&apos;s an uncertain bet. The energy required to build a consulting business is substantial. Is it worth it for a window that might be 2 years? 5 years? 10?&lt;/p&gt;
&lt;p&gt;The direction seems clear - these tools will eventually be usable by people without engineering backgrounds. The question for anyone considering this path is whether they&apos;re building a career or riding a wave.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Not all custom builds are created equal, and this distinction matters for understanding where the window might stay open longest.&lt;/p&gt;
&lt;p&gt;Greenfield is straightforward - no existing system, no data to migrate, no users with entrenched workflows, just building exactly what&apos;s needed from scratch. This is where AI tools shine brightest, and where the consultant&apos;s window is most likely to close first.&lt;/p&gt;
&lt;p&gt;Migration is harder - replacing existing SaaS means data export and import (often messy, incomplete, or requiring transformation), transition planning (running parallel systems, phased rollouts, fallback procedures), change management (users who learned the old system need to learn the new one), and mapping legacy workflows (existing processes may not map cleanly to new designs).&lt;/p&gt;
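&lt;p&gt;A toy example of why the export side gets messy - the legacy shape below is invented, but the pattern of inconsistent formats and missing fields is typical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Invented legacy export shape - real exports are rarely even this tidy.
interface LegacyMember {
  name: string;    // &quot;Last, First&quot; in some rows, &quot;First Last&quot; in others
  joined?: string; // &quot;2014-06-01&quot; or &quot;06/01/2014&quot; - or absent entirely
}

function toNewMember(m: LegacyMember) {
  const fullName = m.name.includes(&quot;,&quot;)
    ? m.name.split(&quot;,&quot;).reverse().map(s =&gt; s.trim()).join(&quot; &quot;)
    : m.name;
  // Both date formats parse; absent dates get flagged for human review.
  const joinedOn = m.joined
    ? new Date(m.joined).toISOString().slice(0, 10)
    : null;
  return { fullName, joinedOn, needsReview: joinedOn === null };
}
&lt;/code&gt;&lt;/pre&gt;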
&lt;p&gt;My tennis club is considering this exact problem. They&apos;re thinking about replacing their Club Automation membership system - the software handles court booking, membership billing, lesson scheduling, pro shop transactions. A typical SaaS tool doing too much, none of it particularly well.&lt;/p&gt;
&lt;p&gt;Building a custom replacement is now feasible - the core functionality isn&apos;t that complex. But the transition involves years of member data, payment histories, recurring billing relationships, staff trained on specific workflows. The build might take weeks; the migration and transition planning could take months.&lt;/p&gt;
&lt;p&gt;This is where consulting value persists longer. It&apos;s not just &quot;can someone use AI to write code&quot; - it&apos;s &quot;can someone manage the organizational complexity of moving from one system to another.&quot; That&apos;s a different skill set, one that AI tools are further from automating.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;There&apos;s another dimension worth considering: what happens after the build?&lt;/p&gt;
&lt;p&gt;I&apos;ve spent nine years building and maintaining a SaaS product. Once it reaches maintenance mode - stable customers, predictable feature requests, infrastructure that mostly runs itself - it&apos;s a good business. Recurring revenue, compounding familiarity with the codebase, known edge cases.&lt;/p&gt;
&lt;p&gt;A consulting practice built on custom software is different. Not one product for many customers - many products for many customers. Each client has their own codebase, their own infrastructure, their own quirks. The portfolio grows, and so does the maintenance burden.&lt;/p&gt;
&lt;p&gt;This is more labor intensive by nature. Ten clients with ten custom applications means ten different contexts to hold in memory, ten deployment pipelines, ten sets of dependencies that need updating. The work compounds in a way that SaaS maintenance doesn&apos;t.&lt;/p&gt;
&lt;p&gt;Can AI help here? Almost certainly. The same tools that accelerate initial builds should accelerate maintenance - understanding unfamiliar codebases, writing migrations, updating dependencies, debugging production issues. The question is whether this can be systematized well enough to keep a growing portfolio manageable.&lt;/p&gt;
&lt;p&gt;This might be the real skill that determines whether a consulting practice scales: not just building fast, but building in ways that stay maintainable across a diverse portfolio. Standardized stacks, consistent patterns, good documentation, automated testing - the same things that always mattered, but now with AI agents as part of the maintenance workflow.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;AI will eventually handle migration work too. Data transformation, migration scripts, even change management playbooks - these are all learnable patterns. Portfolio maintenance will get easier as AI agents get better at context-switching between codebases. The timeline for automating this work is probably longer than pure greenfield builds, but it&apos;s still finite.&lt;/p&gt;
&lt;p&gt;The window exists, and betting on it is rational - betting the farm on it is risky.&lt;/p&gt;
&lt;p&gt;Migration work and organizational complexity seem more durable than pure code generation. So does the ability to systematize maintenance across a growing portfolio. The businesses that need the most help are ones replacing existing systems, not building from scratch. The skills that matter aren&apos;t just technical - they&apos;re about managing transitions and building in ways that stay maintainable.&lt;/p&gt;
</content:encoded></item><item><title>The Disposable Software Era</title><link>https://www.markschaake.com/posts/the-disposable-software-era/</link><guid isPermaLink="true">https://www.markschaake.com/posts/the-disposable-software-era/</guid><description>When building software becomes cheaper than buying it, the economics of SaaS start to look different.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&amp;lt;a href=&quot;https://www.axios.com/2026/01/13/anthropic-claude-code-cowork-vibe-coding&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Anthropic shipped Cowork&amp;lt;/a&amp;gt; in about 10 days using Claude Code itself. Four engineers, AI-assisted development. These are the companies &lt;em&gt;building&lt;/em&gt; the tools - and they&apos;re using them to build even faster.&lt;/p&gt;
&lt;p&gt;What happens to SaaS when building is this cheap?&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Software used to be expensive to build and cheap to distribute. That&apos;s the entire SaaS model - amortize high development costs across thousands of customers, deliver via browser, collect monthly fees. The math worked because building was hard.&lt;/p&gt;
&lt;p&gt;Now building is fast. &amp;lt;a href=&quot;https://fortune.com/2026/01/29/100-percent-of-code-at-anthropic-and-openai-is-now-ai-written-boris-cherny-roon/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Anthropic reports 70-90% of code&amp;lt;/a&amp;gt; at the company is written by AI. &amp;lt;a href=&quot;https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Claude Code grew from research preview to billion-dollar product&amp;lt;/a&amp;gt; in six months. &amp;lt;a href=&quot;https://sacra.com/research/cursor-at-100m-arr/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Cursor hit $100M ARR&amp;lt;/a&amp;gt; faster than any company in history - 12 months.&lt;/p&gt;
&lt;p&gt;The economics are inverting. When building is cheap, buying loses its appeal. Why pay $50/seat/month for a tool that does 80% of what the business needs when the exact thing can be built in a weekend?&lt;/p&gt;
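&lt;p&gt;To put illustrative numbers on it: 20 seats at $50/month is $12,000 a year, every year, for the 80% tool. If a focused custom build takes even a week or two, the payback period is measured in months, not years - and the result does the whole job, not 80% of it. The figures are hypothetical, but the direction of the comparison is the point.&lt;/p&gt;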
&lt;p&gt;a16z calls this &quot;&amp;lt;a href=&quot;https://a16z.com/disposable-software/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;disposable software&amp;lt;/a&amp;gt;&quot; - applications that never justified engineering investment now make sense because the investment required has collapsed. The constraint is no longer ROI, it&apos;s imagination.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;The Cowork example is instructive - four engineers, 10 days, using Claude Code itself. That&apos;s not a prototype, that&apos;s a product.&lt;/p&gt;
&lt;p&gt;This isn&apos;t hypothetical for me. Last year I built a 200,000-line TypeScript MVP for a startup client in 7 weeks - by myself. Before AI tools, that would have been 12-18 months of work for a software team. And that was before the latest model improvements and tooling refinements. The floor keeps rising.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Before declaring SaaS dead, though, the counterarguments deserve serious treatment.&lt;/p&gt;
&lt;p&gt;Security is a real problem. &amp;lt;a href=&quot;https://www.veracode.com/blog/genai-code-security-report/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Veracode tested 100+ LLMs&amp;lt;/a&amp;gt; and found 45% of code samples failed security tests, introducing OWASP Top 10 vulnerabilities. Java was worst at 72% failure. The &amp;lt;a href=&quot;https://cloudsecurityalliance.org/blog/2025/07/09/understanding-security-risks-in-ai-generated-code&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Cloud Security Alliance&amp;lt;/a&amp;gt; found 62% of AI-generated code contains design flaws or known vulnerabilities. &amp;lt;a href=&quot;https://www.theregister.com/2025/12/17/ai_code_bugs/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;CodeRabbit&apos;s analysis&amp;lt;/a&amp;gt; showed AI code has 1.75x more logic errors, 1.57x more security findings, and 2.74x more XSS vulnerabilities than human-written code. These studies vary in scope, language coverage, and model versions tested - but the pattern is consistent enough to take seriously.&lt;/p&gt;
&lt;p&gt;The &quot;vibe coding hangover&quot; is real - speed up front, chaos downstream.&lt;/p&gt;
&lt;p&gt;Enterprise systems also have moats. Compliance frameworks like GDPR and SOC 2 are hard to replicate. Systems of record - clinical trial management, financial reconciliation, regulated industries - have data moats and regulatory logic that custom solutions can&apos;t easily absorb. &amp;lt;a href=&quot;https://www.bain.com/insights/will-agentic-ai-disrupt-saas-technology-report-2025/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&amp;gt;Bain&apos;s analysis&amp;lt;/a&amp;gt; is useful here: workflows that rely on human judgment, proprietary data, and regulatory oversight are defensible. The CRM someone built in a weekend won&apos;t replace Salesforce for companies that need 15 years of institutional knowledge encoded in their data model.&lt;/p&gt;
&lt;p&gt;History suggests transitions expand rather than replace. &quot;Cloud will replace on-prem&quot; became &quot;hybrid is the norm.&quot; Global SaaS is still growing - $197B in 2023, $232B projected for 2025. These transitions are slower and messier than they look in the moment.&lt;/p&gt;
&lt;p&gt;And then there&apos;s the 80/20 problem. AI gets the work 80% of the way fast, but the last 20% - reliable, secure, maintainable - still requires engineering judgment. Getting to &quot;works on my machine&quot; is trivial. Getting to &quot;runs in production under load without leaking data&quot; is not.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;SaaS doesn&apos;t die overnight - it evolves.&lt;/p&gt;
&lt;p&gt;Seat-based pricing makes less sense when the software required to service a seat can be generated on demand. Outcome-based pricing rises. The winning position for existing SaaS players is probably unique data plus AI augmentation - hard to replicate datasets wrapped in AI-powered interfaces.&lt;/p&gt;
&lt;p&gt;For most businesses, the calculation is shifting. Custom solutions that were never worth the engineering investment are now worth considering. Not for everything - nobody&apos;s building their own Stripe - but for the long tail of internal tools, specific workflows, and niche applications.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t whether software development economics are changing. The examples are too stark to ignore. The question is how fast - and for any given company, whether the tools have crossed the threshold where custom solutions start making more sense than off-the-shelf software that does 80% of what&apos;s needed.&lt;/p&gt;
&lt;p&gt;Seat-based SaaS models are under pressure - the companies with unique data and real switching costs will survive, but the ones competing on features alone are more vulnerable than they were a year ago.&lt;/p&gt;
</content:encoded></item></channel></rss>