Evaluating AGENTS.md: are they helpful for coding agents?

arxiv.org

153 points by mustaphah a day ago


deaux - 4 hours ago

I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.

> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average).

This "surprisingly", and the framing seems misplaced.

For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.

> while LLM-generated context files have a small negative effect on agent performance (a decrease of 3% on average)

This should really read "while the prompts used to generate the AGENTS.md files in our dataset...". The generated files are a proxy for the prompts that produced them; who knows whether files generated from a better prompt would show an improvement.

The biggest use case for AGENTS.md files is domain knowledge that the model doesn't have and can't instantly infer from the project. That knowledge is gained slowly over time, from watching agents struggle because of the gap. It's exactly the kind of thing that's very common in closed-source codebases, yet incredibly rare in public GitHub projects that have an AGENTS.md file - the huge majority of which are recent, small, vibecoded projects centered around LLMs. If 4% gains show up even on the latter kind of project, where the quality of AGENTS.md files is very mixed to begin with, then for bigger projects with high-quality files they're invaluable when working with agents.
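
To make that concrete, here's roughly the kind of entry I mean - a hypothetical sketch, with all project details invented for illustration:

```markdown
<!-- hypothetical AGENTS.md excerpt, details invented for illustration -->
## Domain notes (things the code does not make obvious)

- Prices are stored as integer minor units (cents); never introduce floats for money.
- The `legacy_orders` table is mirrored nightly from an upstream ERP system and is
  read-only here; migrations must never touch it.
- "Customer" in this codebase means the billing entity, not the end user; end users
  live under `accounts/`.
```

None of this is something an agent can infer from the repository on its own, which is exactly the deficiency I mean.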

GBintz - 4 minutes ago

we've been running AGENTS.md in production on helios (https://github.com/BintzGavin/helios) for a while now.

each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.

AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.

we wrote about it here: https://agnt.one/blog/black-hole-architecture

agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors: cross-cutting changes don't map cleanly to single-role ownership.

AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.
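
as a rough illustration of what "constraints, not roadmap" means (paths and role names are made up here, not copied from helios):

```markdown
<!-- hypothetical sketch, not the actual helios AGENTS.md -->
## ownership

- role: frontend  owns: web/           never touches: api/, infra/
- role: api       owns: api/           never touches: web/, infra/
- role: infra     owns: infra/, .sys/  never touches: web/, api/

before writing code, a role writes its plan to `.sys/plans/{role}/`.
it then works only inside its owned paths.
```

two agents can never race on the same file, so there's nothing left to coordinate.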

Arifcodes - 2 hours ago

The study measures the wrong thing. Task completion ("does the PR pass tests?") is a narrow proxy for what AGENTS.md actually helps with in production.

I run a system with multiple AI agents sharing a codebase daily. The AGENTS.md file doesn't exist to help the agent figure out how to fix a bug. It exists to encode tribal knowledge that would take a human weeks to accumulate: which directory owns what, how the deploy pipeline works, what patterns the team settled on after painful debates. Without it, the agent "succeeds" at the task but produces code that looks like it was written by someone who joined the team yesterday. It passes tests but violates every convention.

The finding that context files "encourage broader exploration" is actually the point. I want the agent to read the testing conventions before writing tests. I want it to check the migration patterns before creating a new table. That costs more tokens, yes. But reverting a merged PR that used the wrong ORM pattern costs more than 20% extra inference.
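
A sketch of the kind of entries I mean (details invented here, not lifted from my actual file):

```markdown
<!-- hypothetical AGENTS.md excerpt, invented for illustration -->
## Conventions the tests won't catch

- Schema changes go through migrations in `db/migrations/`; never create tables
  directly from application code.
- Database access goes through the repository layer in `app/repositories/`; don't
  touch the ORM session from route handlers.
- Deploys build only from `release/*` branches; feature branches never ship, so
  don't add deploy configuration to them.
```

An agent that ignores these can still turn the PR green; it just creates work for whoever has to review and maintain the result.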

pamelafox - 6 hours ago

This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.

I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.

avhception - 5 hours ago

When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask it why it didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in the codebase with MariaDB-based code. When I asked why that happened, the answer was that it had confused MariaDB with sqlite because the code in question deals with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.

I then asked if there was anything I could do to prevent misinterpretations from producing wild results like this. The advice I got was to put an instruction in AGENTS.md urging agents to ask for clarification before proceeding. But I didn't add it: out of the 25 lines of my AGENTS.md, many are already variations of exactly that. The first three:

- Do not try to fill gaps in your knowledge with overzealous assumptions.

- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.

- If a task seems to require extra changes, pause and ask before proceeding.

If these are not enough to prevent stuff like that, I don't know what would be.