If you used LLMs for coding before coding agents arrived (roughly before the Claude Code release in February last year), and you got annoyed with typing the same context into LLM chat boxes, you probably started saving it into per-project markdown files. Most people who used LLMs heavily ended up doing this eventually.

Then we got Claude Code and other agents, and by mid-summer this technique was formalized as the AGENTS.md file and its tool-specific equivalents like CLAUDE.md, .cursor/rules, and .github/copilot-instructions.md. A few of us had been doing this before it had a name. At a bigger scale, the same thing happened to me with “Harness Engineering.”

It’s the context that’s missing

As coding agents proved their capability and more people started using them, users kept hitting the same kinds of edge cases. The model could write code fine. What it lacked was local, team- or org-specific context that engineers carry around in their heads (and Slack threads).

Things like:

  • this repo uses uv, not pip
  • tests must be run from a package subdirectory, not the repo root
  • migrations are generated, never hand-written
  • the staging database needs a tunnel
  • frontend changes need a Playwright screenshot because “looks good” is not a test
  • our production logs are in Datadog, but the useful field is trace_id, not request_id
  • don’t touch this generated file; edit the schema and regenerate it
  • PRs need a specific label and a reviewer from the owning team

None of this requires advanced reasoning. But if the agent doesn’t know it, it will waste time, invent local conventions, or produce a diff that looks plausible and fails the first real check.

MCP, CLI, Skills and more

Anthropic announced MCP (the Model Context Protocol) in late 2024, and through 2025 it became the default answer to “how do agents connect to tools.” Some treated it as if it would solve the whole “lack of relevant context” problem. Standardizing agent-to-tool connections is useful, but it addresses only part of the problem.

Sometimes pointing an agent at a CLI tool is more flexible. MCP servers load their full tool schemas into context at connection time (though clients are starting to lazy-load schemas to address this), consuming tokens before the agent has done any useful work. CLI calls have zero schema overhead. Per several writeups (IBM, ScaleKit, CircleCI), in certain cases CLI-based tool invocation can be significantly cheaper per task.

If you expose hundreds of MCP tools directly to the model, the definitions eat the context window. Letting the agent write code that calls tools inside a sandbox is cheaper and less error-prone. MCP still wins for centralized auth, audit trails, or structured access control across shared infrastructure. The practical answer is both: CLI for local dev work, MCP for pipeline coordination and governance.
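As a sketch of what “code that calls tools” looks like in practice (the specific commands are just an illustration), the agent can run a few lines like this inside the sandbox instead of pulling a search or git tool schema into context:

```python
import subprocess

def run(cmd: list[str]) -> str:
    """Run a CLI tool directly; no tool schema is loaded into the context window."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Only the results the agent actually needs come back into its context.
recent_commits = run(["git", "log", "--oneline", "-n", "10"])
changed_files = run(["git", "diff", "--name-only", "main"])
```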

Skills are also markdown files, like AGENTS.md, but not loaded into every session by default. They get triggered by the user (through something like /debug or /review) or by the agent when the skill description matches the task. This is the right direction. A migration playbook or an incident-review checklist should not live in the model’s context all day. It should load when needed.
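In Claude Code, for example, a skill is a folder with a SKILL.md whose frontmatter description is what the agent matches against the task. The sketch below is illustrative, and the body of the file is entirely hypothetical:

```markdown
---
name: db-migrations
description: Use when creating, reviewing, or debugging database migrations in this repo.
---

1. Never hand-write migration files; run the generator.
2. Follow docs/runbooks/migrations.md for naming and rollback notes.
3. Run the migration smoke test and paste its output into the PR.
```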

Once the agent has implemented the feature, you realize you don’t have the time or desire to review the 10 new files it created. So you ask for a critical or adversarial review. Maybe you call a subagent. Maybe you use a different model. Maybe you give it a prompt that says, in effect: “assume this implementation is wrong; find the bug before production does.”

So by this point we’d have:

  • AGENTS.md
  • multiple SKILL.md files
  • MCP servers
  • configured and authenticated CLI tools
  • project-specific test commands
  • subagent instructions
  • sandbox rules
  • browser automation
  • local logs and traces
  • permission prompts
  • retry limits
  • CI gates

What is harness engineering

All of this now goes by the name of “harness engineering.” The shortest definition I have is:

everything around the model that lets a coding agent do engineering work effectively, repeatedly and safely.

That includes instructions, tools, environment, tests, permissions, and feedback loops.

Feedforward and feedback

OpenAI’s harness engineering post is a good writeup from a team that used this on a large codebase. Their main point: once agents write most of the code, engineers move up a level. They design the environment, expose the right signals, and encode taste and constraints so the agent can do useful work without constant hand-holding.

Birgitta Böckeler’s writeup on Martin Fowler’s site has a useful framing: feedforward controls and feedback sensors. Feedforward is everything the agent gets before acting: AGENTS.md, architecture docs, commands, examples, skills. Feedback is everything it gets after acting: compiler errors, test failures, linter output, screenshots, logs, reviewer comments.

Prompt engineering says: “tell the model to be careful.”

Harness engineering says: “make the careful path the default path.”

If the agent writes code that doesn’t compile, putting a stronger sentence in the prompt won’t help. Compile and test should be part of the loop. If it guesses an API shape, give it a typed client or a local command that returns the truth. If it ships a UI that looks broken, wire up screenshots and a review loop that can “see” the result.
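Here is a minimal sketch of what “part of the loop” means, with ask_agent standing in for whatever agent invocation you actually use, and the test command as an assumption:

```python
import subprocess

def ask_agent(prompt: str) -> None:
    ...  # placeholder: call your coding agent, which edits the repo

def implement_with_feedback(task: str, max_attempts: int = 3) -> bool:
    """Feed real test output back to the agent instead of asking it to 'be careful'."""
    prompt = task
    for _ in range(max_attempts):
        ask_agent(prompt)
        result = subprocess.run(["uv", "run", "pytest", "-x", "-q"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # done, with evidence: the tests pass
        # The next attempt starts from the actual failure, not a vaguer prompt.
        prompt = f"{task}\n\nThe tests failed:\n{result.stdout}\n{result.stderr}"
    return False
```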

Ordinary engineering still matters

A stable just test command, useful error messages, deterministic fixtures, and fast CI are part of the agent interface now. They used to be just human conveniences. Now they’re load-bearing infrastructure for autonomous work.

The same applies to documentation. A giant AGENTS.md is usually a mistake. OpenAI says they tried the 1,000-page-manual version and moved to a short table of contents instead. I had the same experience: once CLAUDE.md grew beyond 400-500 lines, the agent started ignoring its instructions more and more often.

The root instruction file should be a pointer: where the architecture docs live, how tests run, which commands are blessed, what not to touch. Detailed procedures belong in docs or skills. Hard rules belong in tests and linters.
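A hypothetical root file in that style (every path and command here is a placeholder):

```markdown
# AGENTS.md

- Architecture: docs/architecture.md
- Testing: docs/testing.md (run `just test` from the package directory, not the repo root)
- Runbooks: docs/runbooks/ (migrations, release prep, incident review)
- Blessed commands: `just test`, `just lint`, `just typecheck`
- Package manager: uv, never pip
- Never edit src/generated/ by hand; change the schema and regenerate
- PRs need the owning team's label and a reviewer from that team (enforced in CI)
```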

Where to start

For a small team, I would start with something relatively simple:

  • root AGENTS.md under 150 lines
  • docs/architecture.md, docs/testing.md, and docs/runbooks/
  • a couple of “blessed” commands: just, make, or whatever
  • skills for repeated workflows like debugging, code review, migrations, and release prep
  • browser checks for UI work
  • a skeptical reviewer agent
  • sandboxed credentials
  • a rule that every task ends with evidence, not a claim

The last point is really important. “Done” should mean: here is the test output, screenshot, benchmark, trace, or reproduction proving it. Not “the implementation looks correct.”

Responses like “You’re right to push back. I need to run tests to verify, let me do that now” are amusing at first, but get annoying quickly. You want the agent to just run the tests without the preamble. That’s what a good harness gives you: evidence by default.

Security

The more useful the agent is, the more dangerous it is. The lethal trifecta is a good model to keep in mind: private data, untrusted content, and external communication. If an agent has all three, prompt injection is an architecture problem. The answer is least privilege, approval boundaries, sandboxing, and separating read-only investigation from state-changing actions.
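A toy sketch of what that separation can look like in a harness (this is not any particular framework’s API, just the shape of the boundary):

```python
def run_tool(tool: str, args: dict) -> str:
    ...  # placeholder for the real tool dispatcher

READ_ONLY = {"read_file", "grep", "list_tests", "fetch_logs"}          # investigation
STATE_CHANGING = {"write_file", "run_migration", "open_pr", "deploy"}  # needs a human

def dispatch(tool: str, args: dict, approved: bool = False) -> str:
    if tool in READ_ONLY:
        return run_tool(tool, args)
    if tool in STATE_CHANGING and approved:
        return run_tool(tool, args)
    # Default deny: least privilege for anything unknown or unapproved.
    raise PermissionError(f"{tool} requires explicit approval")
```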

Don’t overbuild

Models improve quickly, and some scaffolding that matters today will be obsolete soon. Every harness component encodes an assumption about what the model cannot do on its own, and some of those assumptions will go stale.

The right harness is the smallest one that catches the failures you actually see.

If the agent keeps using the wrong package manager, put the right command in AGENTS.md and make the wrong path fail loudly. If it keeps writing shallow tests, add a review skill. If it keeps breaking migrations, create a migration playbook and a smoke test. If a rule matters every time, promote it from prose into code.
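For example, the “edit the schema and regenerate” rule from earlier can live as a test instead of a sentence; the generator script and paths here are hypothetical:

```python
# test_generated_code.py: fails loudly if a generated file was edited by hand.
import subprocess
from pathlib import Path

def test_generated_client_is_up_to_date(tmp_path: Path) -> None:
    checked_in = Path("src/generated/client.py").read_text()
    # Hypothetical generator; substitute whatever your repo actually uses.
    subprocess.run(
        ["uv", "run", "python", "scripts/generate_client.py", "--out", str(tmp_path)],
        check=True,
    )
    regenerated = (tmp_path / "client.py").read_text()
    assert checked_in == regenerated, "src/generated/client.py drifted; regenerate it from the schema"
```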

That is the practical loop: agent fails, you inspect why, then you move the fix into instructions, tools, tests, or skills.

If you do that for long enough, you’ll notice that the percentage of correct one-shot implementations (those that didn’t require your follow-up) keeps going up. The Pareto principle applies: stop expanding your harness once the agent hits a good-enough success rate. You’ll hit diminishing returns quickly, and no harness will replace good judgment from an experienced engineer.