future_agents.md

on what is shipped, what is being built, and what is yours.

// the gap

There is a gap between what has been released and what is being used, and to my eye it is the only gap that matters right now. Models have arrived faster than the practices to wield them. Rumors have arrived faster than models. The builder who spends the day refreshing the changelog is not building.

The work is to close the distance between your hands and the tools already in them. This is an attempt to take stock — what has actually shipped, what people are doing with it, what is worth your attention, and what is noise.

// what shipped

In April of this year Anthropic shipped the Memory tool. A small thing on the surface: a directory the model can read and write between sessions, client-side, with the developer holding the keys. The shift, though, is total. The agent stops trying to hold the whole world in a prompt. It writes a note. It re-reads the note when the note is needed. It forgets the rest.
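
A minimal sketch of the idea, not Anthropic's API: a notebook confined to one directory that survives between sessions. The class name `NoteStore`, its method names, and the path check are all assumptions made for illustration.

```python
from pathlib import Path


class NoteStore:
    """A file-backed notebook confined to one directory (hypothetical helper)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name: str) -> Path:
        # Refuse any path that would escape the memory directory.
        path = (self.root / name).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError(f"path escapes memory directory: {name}")
        return path

    def write(self, name: str, text: str) -> None:
        # Write (or overwrite) one note, creating subdirectories as needed.
        path = self._resolve(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")

    def read(self, name: str) -> str:
        # Re-read a note only when it is needed.
        return self._resolve(name).read_text(encoding="utf-8")

    def names(self) -> list[str]:
        # List every note, relative to the memory directory.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
```

The developer holds the keys in the plainest sense here: the directory lives on the client, and the path check decides what the model may touch.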

This is, more or less, how humans work, and now it is how agents work too. Memory also moved into Claude Managed Agents in public beta, which means one agent can leave a trace another agent will follow — a shared notebook across what would otherwise be isolated processes.

Simon Willison, who has been chronicling this corner of the field longer than most, argued some months ago that the most promising way to implement long-term memory was “with an extra set of tools” — note-taking and plan files written by the model rather than a separate memory layer. The Memory tool is, in effect, the standardization of what he and others had already been hand-building.

If you are looking for the single largest unlock for persistent agent behavior, it is this. Not a future model. The thing already in your hands.

// the harness

Around the model, a discipline has been forming. Mitchell Hashimoto, the HashiCorp co-founder now building the Ghostty terminal full-time, gave it the name that stuck: harness engineering. The idea is that any time an agent makes a mistake, you take the time to engineer a solution so the agent never makes that mistake again. His AGENTS.md for Ghostty grows incrementally, one rule per past failure, as a living rebuke to the agent’s worst habits.

Armin Ronacher, the creator of Flask, wrote what is by now the most-cited single post on the practice. He argued that subagents can break work down far enough to run for twelve hours or more, provided you give them careful markdown plans. He has also opened a live debate about whether MCP is the right abstraction or whether plain code-as-tools wins on simplicity. Martin Fowler and Addy Osmani have both written extended treatments under the same banner.

The principle these writers share is simple. Small explicit prompts, a handful of tools, and a rigorous loop of tests and lints that the agent re-invokes forever will beat complex hidden behavior every time. Boris Cherny, who runs Claude Code at Anthropic, has said in public that his own setup is “surprisingly vanilla”: worktrees, plan mode, a team-shared CLAUDE.md, a few PostToolUse hooks, a handful of subagents. The person who built the tool uses it close to the way it ships.
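
One way to picture the loop of tests and lints, as a sketch under stated assumptions: run every check, and turn the failures into text the agent can be handed on its next turn. The names `CheckResult`, `run_checks`, and `failure_report` are hypothetical, and the check commands are whatever your project already uses.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run each named command and record whether it exited with status 0."""
    results = []
    for name, argv in checks.items():
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append(
            CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)
        )
    return results


def failure_report(results: list[CheckResult], max_chars: int = 2000) -> str:
    """Summarize failing checks; return an empty string when everything passed."""
    failed = [r for r in results if not r.passed]
    return "\n\n".join(
        f"[{r.name}] failed:\n{r.output[-max_chars:]}" for r in failed
    )


if __name__ == "__main__":
    # Placeholder check: swap in your real test and lint commands.
    checks = {"smoke": [sys.executable, "-c", "pass"]}
    print(failure_report(run_checks(checks)) or "all checks passed")
```

The report is deliberately plain text. An empty report is the stop condition; anything else goes back to the agent as the next instruction.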

The harness reframes the agent debate. The question stops being “is the model good enough?” and becomes “is your scaffolding around it good enough?” That is a much more answerable question, and a much more productive one.

// the operators

The clearest way to see where this is going is to look at what specific people are shipping.

Garry Tan publishes his entire setup as gstack: twenty-three skill files for Claude Code, plus power tools, organized by role (planning, building, review, release, safety). The design pipeline is the showcase. /design-shotgun generates four to six mockup variants in a browser comparison board and builds up a “taste memory” over rounds; /design-html then converts the chosen variant into framework-aware production HTML. The repo encodes what gstack calls Boil the Lake: do the complete thing when AI makes marginal cost near zero. It is an argument, written in code, that an experienced operator’s judgment can be compressed into runtime skills.

Geoffrey Huntley, who now works on the Amp agent at Sourcegraph, built the Ralph loop: a while-true bash wrapper that re-feeds the same prompt to Claude Code, with each iteration’s work persisting in files and in git. It is named for Ralph Wiggum’s “ignorance, persistence, and optimism.” Anthropic upstreamed the pattern as the official ralph-wiggum plugin. Huntley’s most-cited demonstration was running a Ralph loop for three consecutive months on a single prompt (“a programming language like Golang but with Gen Z slang”) and shipping Cursed, an LLVM-compiling language with a stdlib. The case Ralph makes is contested but pointed: with good verification, you can let an agent run autonomously for days.
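
The shape of the loop, transliterated from bash into Python as a sketch rather than Huntley's script. The agent command is passed in as `agent_argv` and is an assumption (no real CLI flags are implied); the iteration cap and the `is_done` check are additions that a bare while-true loop does not have.

```python
import subprocess
from typing import Callable, Sequence


def ralph_loop(
    agent_argv: Sequence[str],
    prompt: str,
    is_done: Callable[[], bool],
    max_iterations: int = 100,
) -> int:
    """Feed the same prompt to the agent until is_done() is true or the cap is hit.

    Returns the number of iterations that ran. State is expected to persist
    outside this function (files, git), not in any variable here.
    """
    for iteration in range(1, max_iterations + 1):
        # Same prompt every time, delivered on stdin; failures do not stop the loop.
        subprocess.run(list(agent_argv), input=prompt, text=True, check=False)
        if is_done():
            return iteration
    return max_iterations
```

Everything interesting lives in `is_done`. A check that reads the repository (tests pass, a marker file exists) is the verification the pattern depends on.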

Sahil Lavingia, the founder of Gumroad, stepped down as CEO in November 2025 after fourteen years. By his own account, Gumroad uses v0 for prototypes, Cursor for edits, and Devin for autonomous implementation. AI was writing forty-one percent of code commits at the time of his last public update, with a target of eighty percent by year-end. He has stated publicly — and acted on the statement — that he is no longer hiring senior or even staff-level software engineers. Whatever one makes of the methods, he is the canonical named example of the agent-replaces-org-chart claim being acted on at a real company with real revenue.

And then there is Pieter Levels, the limit case. PhotoAI reached roughly $132,000 to $138,000 in monthly recurring revenue by late last year — somewhere around $1.6 million ARR — running on fourteen thousand lines of raw PHP, jQuery, and SQLite. No framework. He is not building a harness. He is demonstrating, almost by accident, that the harness has gotten good enough that a single operator can ship serious revenue without much stack discipline at all.

The surrounding ecosystem matters too. Paul Gauthier’s Aider is the minimalist Unix reference point: terminal-native, auto-commits to git, model-agnostic. Cline is the open-source extension with roughly four million installs and a strict approve-every-change philosophy. Conductor.build, which recently raised $22M from Spark and Matrix, runs parallel agents in git-worktree-isolated workspaces and is being used by engineers at Google, Meta, Linear, Notion, Vercel, and Supabase. Devin, from Cognition Labs, is sold as the autonomous SaaS engineer rather than something you script. Devin 2.2 added desktop access and recurring self-scheduled sessions, with state persisted between runs.

The contrast is clean. The toolkit camp — Claude Code, Cursor, Cline, Aider, Conductor, gstack — appears to hold most of the developer mindshare today. The product camp, led by Devin, owns the enterprise “buy a managed agent” niche. The honest answer to which is right is probably “both, for different jobs.”

// the day

Across all of these implementations, a pattern is now visible in how a day of work is shaped.

Time has rearranged itself. Roughly eighty percent reading and deciding. Twenty percent communicating with the agent. Almost none writing code by hand. The value concentrates at the start and at the end — defining intent, judging output. The middle, the typing, has been given away.

The repository carries the discipline. A CLAUDE.md at the root tells every session who you are and how the codebase wants to be touched. A MEMORY.md grows beside it, written by the agent itself, recording what it learned the hard way. Hashimoto’s AGENTS.md is the most-cited single example. Project conventions become text. Text becomes context. Context becomes behavior.

Specialists handle bounded work. One subagent reviews diffs. One runs the test suite. One audits security. The main agent owns planning and integration and never sees their verbose output, only their reports. Up to ten can run in parallel, each in its own context window, each on its own git worktree, none colliding.
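
A sketch of the fan-out, assuming each specialist can be modeled as a plain function that returns its full output. The names `run_specialists` and `summarize` are hypothetical, threads stand in for separate agent processes, and worktree isolation is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_PARALLEL = 10  # the "up to ten" ceiling described above


def summarize(output: str, limit: int = 200) -> str:
    """Keep only the last non-empty line, truncated to `limit` characters."""
    lines = [line for line in output.splitlines() if line.strip()]
    return (lines[-1] if lines else "(no output)")[:limit]


def run_specialists(tasks: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Run every specialist concurrently and return name -> short report."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # The caller only ever sees the summaries, never the verbose output.
        return {name: summarize(f.result()) for name, f in futures.items()}
```

The design choice is the return type. The main agent gets a dictionary of one-line reports, so its own context stays small no matter how noisy the specialists were.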

Skills sit dormant until called, costing almost nothing in tokens. MCP servers connect the model to anything outside it. GitHub. The database. The file system. An internal API. Skills are how. MCP is where.
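
Why dormant skills cost so little can be sketched in a few lines: only a one-line catalog sits in context, and the full text is read when a skill is invoked. The layout assumed here (one .md file per skill, description on the first line) and the `SkillIndex` class are illustrative assumptions, not the format any particular tool uses.

```python
from pathlib import Path


class SkillIndex:
    """Hypothetical registry: one-line descriptions up front, full bodies on demand."""

    def __init__(self, skills_dir: str) -> None:
        self.dir = Path(skills_dir)

    def catalog(self) -> str:
        """Return the cheap part: each skill's name and its first line only."""
        lines = []
        for path in sorted(self.dir.glob("*.md")):
            with path.open(encoding="utf-8") as fh:
                first = fh.readline().strip()
            lines.append(f"/{path.stem}: {first}")
        return "\n".join(lines)

    def load(self, name: str) -> str:
        """Return the expensive part: the full skill text, read only when invoked."""
        return (self.dir / f"{name}.md").read_text(encoding="utf-8")
```

The catalog grows by one line per skill; the context grows by a whole file only for the skill actually called.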

The day ends by writing the next day’s prompt. The agent works while you sleep. You wake up and read.

Boris Cherny has said in public that he has not written a single line of code since November. Some of his teams have ninety percent of their code written by Claude Code, with per-engineer productivity up roughly seventy percent even as headcount tripled. These are the most extreme numbers being reported, and they come from the people who built the tool. Read them as the upper bound of what is currently possible for a team that has fully reorganized around the practice, not the floor.

The shape of the work has changed. For the people who have gone furthest with it, it looks less like building software and more like one person running a small team.

// the model you have

Opus 4.7 was released on April 16. It costs five dollars per million input tokens and twenty-five per million output. It replaced 4.6. It is the model you would use today, and the model you would use tomorrow, and almost certainly the model you would use the day after that.
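
The arithmetic is worth having at hand. A small sketch using only the two prices quoted above; it ignores any caching or batch pricing, which this piece does not cover, and the function name is made up for illustration.

```python
INPUT_USD_PER_MTOK = 5.00    # dollars per million input tokens, as quoted above
OUTPUT_USD_PER_MTOK = 25.00  # dollars per million output tokens, as quoted above


def session_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one session in US dollars."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # 200k tokens read, 20k tokens written: $1.00 of input plus $0.50 of output.
    print(f"${session_cost_usd(200_000, 20_000):.2f}")  # prints $1.50
```

Output costs five times what input does, so an agent that reads a great deal and writes short reports is cheaper than the token totals suggest.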

This is the motorcycle for the mind you are already riding. The question is not whether the engine is fast enough. The question is whether you know the road.

// the curve

The release cadence and the research previews tell you something the changelog alone does not, and it is worth taking seriously.

There is no public Opus 4.8 yet. The number surfaced in a leaked source map at the end of March, with no API identifier and no benchmarks attached. But the cadence of the last year — 4.5, 4.6, 4.7 inside twelve months — tells you the next one is close.

Above the public family sits Mythos Preview, a frontier research model from Anthropic that can autonomously find and chain zero-day vulnerabilities in major operating systems and browsers. Access is invitation-only behind Project Glasswing, with a hundred million dollars in usage credits committed to that consortium. You cannot call it from a public API today. But its existence is a signal of what is technically possible inside Anthropic right now, and what will become available to the rest of us as the safeguards catch up.

Then there is Conway, the always-on background agent platform revealed in the half-million-line Claude Code source leak in early April. Webhooks. Chrome control. A .cnw extension format. A million tokens of context. Expected before the year ends, possibly bundled into Claude Pro.

Boris Cherny has made the point in public, repeatedly, that the right move is to build for the model that is coming, not the model you have. The discipline you write today — the CLAUDE.md, the subagents, the skills, the memory layer — should be the kind of scaffolding that compounds when the model behind it improves. Patterns that depend on a specific model’s quirks tend to rot. Patterns that depend on the model getting smarter, faster, and longer-context grow more valuable with every release.

Treat the current model as the worst the model will ever be. That is the right planning assumption.

// restraint

Two things have to be true at once.

You build with what is shipped today, because that is the only thing you can ship with. Memory is here. The harness is here. The discipline is here. Opus 4.7 is here. The named operators — Tan, Hashimoto, Cherny, Huntley, Willison, Ronacher, Lavingia, Levels — have done you the favor of working out the patterns in public. There is roughly a year of accumulated, documented practice you can read and copy this weekend.

And you build for what is coming, because the model six months from now will be meaningfully better than the model on your machine tonight. The subagents, the slash commands, the project memory — all of it compounds. The work you do this week on your scaffolding becomes more valuable, not less, as the model behind it improves.

The mistake is to wait for the next release. The other mistake is to assume the current release is the ceiling. The discipline is to do the work now, in a way that the next model will inherit.

The future agent is not the one announced next quarter. It is the one running on your machine tonight, doing the work you taught it yesterday — and it is the one running here next year, on top of everything you built this year.

// the skill

Two installers, one for each tool. Copy whichever matches what you use, paste it into a terminal, hit enter. Then in any project run /agentic-harness — the skill will open a sketch of the workspace in your browser, ask two questions, and fill in the diagram as it works.

claude code · paste in terminal

codex · paste in terminal