Evaluating AGENTS.md: are they helpful for coding agents?
arxiv.org | 153 points by mustaphah a day ago
I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.
> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).
This "surprisingly", and the framing in general, seem misplaced.
For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.
> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)
This should really be "while the prompts used to generate AGENTS files in our dataset...". It's a proxy for the prompts; who knows whether files generated with a better prompt would show an improvement.
The biggest use case for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That knowledge is gained slowly over time, from watching the agents struggle due to this deficiency. It's exactly the kind of thing that is very common in closed-source codebases, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have very mixed AGENTS.md quality in the first place, then for bigger projects with high-quality files they're invaluable when working with agents.
Hey thanks for your review, a paper author here.
Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.
The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.
But ultimately I agree with your post. In fact, we do recommend writing good AGENTS.md files, manually and in a targeted way. This is emphasized, for example, at the end of our abstract and in the conclusion.
Without measuring quality of output, this seems irrelevant to me.
My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.
Performance is not a consideration.
If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents.
CLAUDE.md isn't a silver bullet either, I've had it lose context a couple of questions deep. I do like GSD[1] though, it's been a great addition to the stack. I also use multiple, different LLMs as a judge for PRs, which captures a load of issues too.
In Theory There Is No Difference Between Theory and Practice, While In Practice There Is.
In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.
This reads a lot like the bargaining stage. If agentic AI makes me a 10 times more productive developer, surely a 4% improvement is barely worth the token cost.
> If agentic AI makes me a 10 times more productive
I'm not sure what you are suggesting exactly, but wanted to highlight this humongous "if".
It's not only about the token cost! It's also my TIME cost! Much-much more expensive than tokens, it turns out ;)
If something makes you 10x as effective and then you improve that thing by 4%...
Honestly, the more research papers I read, the more suspicious I am. This "surprisingly" and other hyperbole is just there to make reviewers think the authors actually did something interesting/exciting. But the more "surprises" there are in a paper, the more suspicious of it I am. Such hyperbole ought to be ignored at best; at worst, the exact opposite of the claim needs to be examined.
It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.
4% is yuuuge. In hard projects, 1% is the difference between getting it right with an elegant design or going completely off the rails.
we've been running AGENTS.md in production on helios (https://github.com/BintzGavin/helios) for a while now.
each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.
AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.
we wrote about it here: https://agnt.one/blog/black-hole-architecture
agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership
AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
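To make "constraints that eliminate coordination" concrete, here's a minimal sketch (the role map, paths, and helper functions are made up for illustration, not taken from helios) of checking that a PR only touches files owned by the role that authored it:

```python
# Hypothetical sketch: enforce single-role file ownership before merging.
# The role map and directory layout are illustrative, not from the helios repo.
from pathlib import PurePosixPath

OWNERSHIP = {
    "frontend": ["src/ui/", "public/"],
    "backend": ["src/server/", "migrations/"],
    "docs": ["docs/", "README.md"],
}

def owned_by(path: str) -> str | None:
    """Return the role that owns a file path, or None if unowned."""
    posix = PurePosixPath(path).as_posix()
    for role, prefixes in OWNERSHIP.items():
        if any(posix.startswith(p) for p in prefixes):
            return role
    return None

def check_pr(author_role: str, changed_files: list[str]) -> list[str]:
    """List files the authoring role does not own (should be empty to merge)."""
    return [f for f in changed_files if owned_by(f) != author_role]

# Example: a backend agent touching a UI file gets flagged.
print(check_pr("backend", ["src/server/api.py", "src/ui/App.tsx"]))
```

When the ownership map itself lives in AGENTS.md, both the agents and a CI check like this can read the same contract.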
The study measures the wrong thing. Task completion ("does the PR pass tests?") is a narrow proxy for what AGENTS.md actually helps with in production.
I run a system with multiple AI agents sharing a codebase daily. The AGENTS.md file doesn't exist to help the agent figure out how to fix a bug. It exists to encode tribal knowledge that would take a human weeks to accumulate: which directory owns what, how the deploy pipeline works, what patterns the team settled on after painful debates. Without it, the agent "succeeds" at the task but produces code that looks like it was written by someone who joined the team yesterday. It passes tests but violates every convention.
The finding that context files "encourage broader exploration" is actually the point. I want the agent to read the testing conventions before writing tests. I want it to check the migration patterns before creating a new table. That costs more tokens, yes. But reverting a merged PR that used the wrong ORM pattern costs more than 20% extra inference.
What are you putting in the file? When I’ve looked at them they just looked like a second readme file without the promotional material in a typical GitHub readme.
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.
I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.
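Roughly, the loop looks like this (just a sketch; the agent command, the task, and the test command are placeholders for whatever CLI and checks you actually use):

```python
# Sketch of the "add guidance only after a failure, then re-verify" loop.
# AGENT_CMD, TASK, and the test command are placeholders, not a specific tool's CLI.
import subprocess

AGENT_CMD = ["my-agent", "--prompt"]   # hypothetical headless agent invocation
TASK = "Add pagination to the /users endpoint"

def run_task_and_tests() -> bool:
    subprocess.run(AGENT_CMD + [TASK], check=False)
    return subprocess.run(["make", "test"], check=False).returncode == 0

# 1. Baseline attempt fails -> append the missing domain knowledge to AGENTS.md.
# 2. Throw away the agent's changes so both attempts start from the same state.
subprocess.run(["git", "checkout", "--", "."], check=False)
subprocess.run(["git", "clean", "-fd"], check=False)
# 3. Re-run the identical task and see whether the new guidance actually helped.
print("passed with updated AGENTS.md:", run_task_and_tests())
```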
You can also save time/tokens if you see that every request starts looking for the same information. You can front-load it.
It also takes the randomness out of it. Otherwise the agent executes the tests one way sometimes, and another way other times.
Agree. I also found that a rule-discovery approach like this performs better. It's like teaching a student: they have probably already performed well on some task, and if we feed in an extra rule for something they're already well versed in, it can hinder their creativity.
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.
So true! I've also setup automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, and typically when I want to more precisely compare models. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we cannot set seed for most models/agents.
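The measuring side doesn't need anything fancy; it's roughly this shape, where `run_once` is a placeholder for whatever SDK or CLI actually executes the prompt and reports pass/fail (this is not the Copilot SDK API, just the loop around it):

```python
# Sketch: repeat the same prompt N times and report a pass rate per model.
# run_once() is a placeholder for the actual agent/SDK invocation.
import statistics

def run_once(model: str, prompt: str) -> bool:
    raise NotImplementedError("call your agent here and return True on success")

def pass_rate(model: str, prompt: str, n: int = 5) -> float:
    results = [run_once(model, prompt) for _ in range(n)]
    return statistics.mean(1.0 if ok else 0.0 for ok in results)

# Compare two models on the identical prompt to smooth over run-to-run noise.
# print(pass_rate("model-a", PROMPT), pass_rate("model-b", PROMPT))
```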
same with people, no matter what info you give a person you can't be sure they will follow it the same way every time
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code. Asked why that happened, the answer was that there was a confusion about MariaDB vs. sqlite because the code in question is dealing with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.
I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this. So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:
- Do not try to fill gaps in your knowledge with overzealous assumptions.
- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.
- If a task seems to require extra changes, pause and ask before proceeding.
If these are not enough to prevent stuff like that, I don't know what could.
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
It depends. If you have an LLM that uses reasoning the explanation for why decisions are made can often be found in the reasoning token output. So if the agent later has access to that context it could see why a decision was made.
Reasoning, in majority of cases, is pruned at each conversation turn.
The cursor-mirror skill and cursor_mirror.py script lets you search through and inschpekt all of your chat histories, all of the thinking bubbles and prompts, all of the context assembly, all of the tool and mcp calls and parameters, and analyze what it did, even after cursor has summarized and pruned and "forgotten" it -- it's all still there in the chat log and sqlite databases.
cursor-mirror skill and reverse engineered cursor schemas:
https://github.com/SimHacker/moollm/tree/main/skills/cursor-...
cursor_mirror.py:
https://github.com/SimHacker/moollm/blob/main/skills/cursor-...
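If you just want to poke at the raw logs yourself, a generic sqlite walk gets you started (a sketch; the database path below is an assumption about where your local Cursor data lives, and the schema is whatever you find there, not anything documented):

```python
# Generic sketch: list tables and row counts in a chat-history sqlite file.
# DB_PATH is an assumption -- point it at whatever Cursor database you find locally.
import sqlite3

DB_PATH = "path/to/cursor/state.vscdb"  # hypothetical location, varies by OS/version

con = sqlite3.connect(DB_PATH)
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for t in tables:
    count = con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    print(f"{t}: {count} rows")
con.close()
```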
The German Toilet of AI
"The structure of the toilet reflects how a culture examines itself." — Slavoj Zizek
German toilets have a shelf. You can inspect what you've produced before flushing. French toilets rush everything away immediately. American toilets sit ambivalently between.
cursor-mirror is the German toilet of AI.
Most AI systems are French toilets — thoughts disappear instantly, no inspection possible. cursor-mirror provides hermeneutic self-examination: the ability to interpret and understand your own outputs.
What context was assembled?
What reasoning happened in thinking blocks?
What tools were called and why?
What files were read, written, modified?
This matters for:
Debugging — Why did it do that?
Learning — What patterns work?
Trust — Is this skill behaving as declared?
Optimization — What's eating my tokens?
See: Skill Ecosystem for how cursor-mirror enables skill curation.
----
https://news.ycombinator.com/item?id=23452607
According to Slavoj Žižek, Germans love Hermeneutic stool diagnostics:
https://www.youtube.com/watch?v=rzXPyCY7jbs
>Žižek on toilets. Slavoj Žižek during an architecture congress in Pamplona, Spain.
>The German toilets, the old kind -- now they are disappearing, but you still find them. It's the opposite. The hole is in front, so that when you produce excrement, they are displayed in the back, they don't disappear in water. This is the German ritual, you know? Use it every morning. Sniff, inspect your shits for traces of illness. It's high Hermeneutic. I think the original meaning of Hermeneutic may be this.
https://en.wikipedia.org/wiki/Hermeneutics
>Hermeneutics (/ˌhɜːrməˈnjuːtɪks/)[1] is the theory and methodology of interpretation, especially the interpretation of biblical texts, wisdom literature, and philosophical texts. Hermeneutics is more than interpretive principles or methods we resort to when immediate comprehension fails. Rather, hermeneutics is the art of understanding and of making oneself understood.
----
Here's an example cursor-mirror analysis of an experiment with 23 runs with four agents playing several turns of Fluxx per run (1 run = 1 completion call), 1045+ events, 731 tool calls, 24 files created, 32 images generated, 24 custom Fluxx cards created:
Cursor Mirror Analysis: Amsterdam Fluxx Championship -- Deep comprehensive scan of the entire FAFO tournament development:
amsterdam-flux CURSOR-MIRROR-ANALYSIS.md:
https://github.com/SimHacker/moollm/blob/main/skills/experim...
amsterdam-flux simulation runs:
https://github.com/SimHacker/moollm/tree/main/skills/experim...
Just an update re German toilets: no toilet installed in the last 30 years (that I know of) uses a shelf anymore. This reduces water usage by about 50% per flush.
of course not, but it can often give a plausible answer, and it's possible that answer will actually happen to be correct - not because it did any introspection, or is capable of any, but because its token outputs in response to the question might semi-coincidentally be token inputs that change the future outputs in the same way.
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
if the agent can review its reasoning traces, which i think is often true in this era of 1M token context, then it may be able to provide a meaningful answer to the question.
Wait, no, that's the category error I'm talking about. Any answer other than "that was the most likely next token given the context" is untrue. It is not describing what actually happened.
I think this statement is on the same level as "a human cannot explain why they gave the answer they gave because they cannot actually introspect the chemical reactions in their brain." That is true, but a human often has an internal train of thought that preceded their ultimate answer, and it is interesting to know what that train of thought was.
In the same way, it is often quite instructive to know what the reasoning trace was that preceded an LLM's answer, without having to worry about what, mechanically, the LLM "understood" about the tokens, if this is even a meaningful question.
But it's not a reasoning trace. Models could produce one if they were designed to (an actual stack of the calls and the states of the tensors with each call, probably with a helpful lookup table for the tokens) but they specifically haven't been made to do that.
When you put an LLM in reasoning mode, it will approximately have a conversation with itself. This mimics an inner monologue.
That conversation is held in text, not in any internal representation. That text is called the reasoning trace. You can then analyse that trace.
Unless things have changed drastically in the last 4 months (the last time I looked at it) those traces are not stored but reconstructed when asked. Which is still the same problem.
They aren't necessarily "stored" but they are part of the response content. They are referred to as reasoning or thinking blocks. The big 3 model makers all have this in their APIs, typically in an encrypted form.
Reconstruction of reasoning from scratch can happen in some legacy APIs like the OpenAI chat completions API, which doesn't support passing reasoning blocks around. They specifically recommend using the newer Responses API to improve both accuracy and latency (by reusing existing reasoning).
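Concretely, the chaining looks roughly like this (a sketch based on my understanding of the OpenAI Responses API; the model id is a placeholder, and the exact shape of any returned reasoning items varies by account settings):

```python
# Sketch: chain Responses API calls so earlier reasoning is reused server-side.
# Model name is a placeholder; response item shapes may differ from this outline.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-5",  # placeholder model id
    input="Why did the last refactor switch the DB driver?",
)

# The follow-up references the previous response so its reasoning context carries
# over, instead of being reconstructed from scratch as with chat completions.
follow_up = client.responses.create(
    model="gpt-5",
    previous_response_id=first.id,
    input="Could that confusion have been avoided with an AGENTS.md note?",
)
print(follow_up.output_text)
```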
For a typical coding agent, there are intermediate tool call outputs and LLM commentary produced while it works on a task and passed to the LLM as context for follow up requests. (Hence the term agent: it is an LLM call in a loop.) You can easily see this with e.g. Claude Code, as it keeps track of how much space is left in the context and requires "context compaction" after the context gradually fills up over the course of a session.
In this regard, the reasoning trace of an agent is trivially accessible to clients, unlike the reasoning trace of an individual LLM API call; it's a higher level of abstraction. Indeed, I implemented an agent just the other day which took advantage of this. The OP that you originally replied to was discussing an agentic coding process, not an individual LLM API call.
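For anyone who hasn't built one: the loop really is that simple in outline. A sketch, with call_llm, run_tool, and compact as hypothetical stand-ins rather than any particular agent's internals:

```python
# Minimal agent-loop sketch: an LLM call in a loop, tool output fed back as context.
# call_llm(), run_tool(), and compact() are hypothetical stand-ins.
def call_llm(messages: list[dict]) -> dict: ...
def run_tool(name: str, args: dict) -> str: ...
def compact(messages: list[dict]) -> list[dict]: ...

def agent(task: str, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)            # may contain reasoning + a tool call
        messages.append(reply)
        if reply.get("tool_call"):
            output = run_tool(**reply["tool_call"])
            messages.append({"role": "tool", "content": output})
        else:
            return reply["content"]           # agent decided it is done
        if sum(len(str(m)) for m in messages) > 200_000:
            messages = compact(messages)      # "context compaction" when nearly full
    return "gave up"
```

Everything appended to `messages` along the way, including the intermediate commentary, is the higher-level trace clients can inspect.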