My Agent Skill for Test-Driven Development

131 points by laxmena a day ago

This article would benefit from a date. It looks like it's recent (Internet Archive first grabbed it on May 29th) but it's the kind of information that can quickly become stale as models and agents improve.

(I've been getting solid results recently from simply telling Claude Code and Codex "Test with uv run pytest, use red/green TDD".)

__mharrison__ - 4 hours ago

Here's a portion of my AGENTS.md from this week (playing FDE, implementing a custom workflow for a client that 20x'd their productivity).

    # Python Tooling
    
    - Use `uv` to manage Python environments and dependencies.
    - Use `uv run` to execute Python scripts and commands.
    - Use `pytest` for testing your code.
    - Use the `hypothesis` library for property-based testing when you have complex input spaces or need to test edge cases.
    - Don't edit `pyproject.toml` directly. Instead, use `uv add` and `uv add --dev` to manage dependencies.
    - Use ruff, ty, prek, wily for code quality and linting.
    - Don't use excessive casting. If you find yourself needing to cast types frequently, consider refactoring your code to use more appropriate types. Casting should only be done in boundary layers where you are interfacing with external systems.
    - Run appropriate tooling after making changes to your code to ensure it meets quality standards.
    - When you come across a bug or regression, think hard about writing a test and also how to create code that will prevent this from happening again in the future.
    - When creating a command line interface, add `--verbose` flag that provides logging output useful for debugging issues.
    - Before creating code, brainstorm 5 different approaches to solve the problem and sort them by their probable effectiveness. Then, choose the best approach and implement it.
    - Use Test Driven Development (TDD) for all code you write. Write tests before writing the implementation code. 
    - Collect pytest fixtures in a `conftest.py` file to avoid duplication 
    - Prefer testing real code where possible. Use doubles and `monkeypatch` only when absolutely necessary. Try to avoid mocking as much as possible.
    - Favor pytest's `monkeypatch` over mock.
    - When a test fails, run the last failed test first using `uv run pytest --last-failed` 
    - Use numpy-style docstrings for all functions and classes you create.
    - Include doctests in the docstrings of your functions to provide examples
    - Use type hints for all function parameters and return types.
    - Use logging to provide insight into failures. Don't use print for debugging. Don't use logging to hide stack traces.
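
As a rough illustration of how a few of these rules combine (type hints, a numpy-style docstring with a doctest, and logging rather than print), here is a minimal sketch. The function `mean_latency_ms` is hypothetical and not taken from the AGENTS.md above.

```python
import logging

logger = logging.getLogger(__name__)


def mean_latency_ms(samples: list[float]) -> float:
    """Return the arithmetic mean of latency samples.

    Parameters
    ----------
    samples : list of float
        Latency measurements in milliseconds. Must be non-empty.

    Returns
    -------
    float
        The mean latency in milliseconds.

    Raises
    ------
    ValueError
        If ``samples`` is empty.

    Examples
    --------
    >>> mean_latency_ms([10.0, 20.0, 30.0])
    20.0
    """
    if not samples:
        # Log for context, then raise so the stack trace is not hidden.
        logger.error("mean_latency_ms called with no samples")
        raise ValueError("samples must be non-empty")
    return sum(samples) / len(samples)
```

Assuming this lives in a module that pytest can import, running pytest with `--doctest-modules` should also exercise the docstring example.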

jasonswett - an hour ago

Good point! Will add a date.
porphyra - 5 hours ago

A lot of prompt engineering goes out of date quickly. Nobody nowadays goes "you are an expert software engineer. make no mistakes" lol.
As a personal anecdote, I find that a lot of big prompts and skills use up context window budget and in many cases agents will eagerly try to use a skill even if it isn't super relevant or necessary for the current task. So when I have too many skills I have to spend a bunch of time toggling the checkboxes to figure out which ones are needed for the task at hand before starting...
chrisweekly - an hour ago

Every article should include a date!
disgruntledphd2 - 5 hours ago

Me too, although I dislike the fact that it over-focuses on mocks (which I accept are over-represented in the training data).
- galsapir - 5 hours ago
  
  sometimes I also feel it tries to optimise for "per line coverage" over more "real, complex use cases" type tests
0123456789ABCDE - 2 hours ago

fwiw, response headers include: Last-Modified: Fri, 22 May 2026 19:08:09 GMT

SubiculumCode - an hour ago

One issue that I've run into with Codex has been excessive use of fallback routines. Perhaps this is good practice in professional programming in many situations, but for mine (in this case, computing geodesic distances and analysis), a silent bad fallback means the processed data is not what I thought it was, e.g. it used an inaccurate geodesic method in place of the accurate one.
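
A minimal sketch of the fail-loud alternative, assuming a hypothetical `accurate_solver` callable stands in for the accurate geodesic method (none of these names come from Codex or any real library). When the accurate path is unavailable, the code raises instead of quietly substituting a great-circle approximation; the approximation is only used when the caller opts in explicitly.

```python
import math
from typing import Callable, Optional

Point = tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point, radius_km: float = 6371.0) -> float:
    """Great-circle distance on a sphere: fast, but only approximate."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def geodesic_km(
    a: Point,
    b: Point,
    accurate_solver: Optional[Callable[[Point, Point], float]] = None,
    allow_approximation: bool = False,
) -> float:
    """Use the accurate solver, or fail loudly if it is missing."""
    if accurate_solver is not None:
        return accurate_solver(a, b)
    if allow_approximation:
        return haversine_km(a, b)  # explicit opt-in, never a silent default
    raise RuntimeError("accurate geodesic solver unavailable; refusing to fall back")
```

The point of the design is that a missing dependency or failed import surfaces as an error at the call site rather than as subtly different numbers downstream.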

jasonswett - an hour ago

I HATE this. I call it speculative coding. Claude often calls it "defensive" programming. It's easily my #1 LLM pet peeve. I have yet to figure out a reliable way to make this stop happening.
- homieg33 - 2 minutes ago
  
  I’m going to second this. Probably a side effect of its training to always produce an output, even if it's some naive handling of issues it really should have root-caused and fixed.

fowlie - 3 hours ago

Haven't tried this, but I've recently become a big fan of Matt Pocock's skills. Workflow: /grill-with-docs -> /to-prd -> /to-issue -> /tdd. That will interview relentlessly until there is a "shared understanding" using "ubiquitous language", then it will spec all requirements with user stories, create issues and implement them using TDD.

zuzululu - 5 hours ago

TDD sounds great on paper for agentic development but you quickly realize it balloons the token cost. Often I write some feature and then it's repurposed or removed, and code is refactored and moved around as time goes on. With TDD I would be taxed heavily and velocity slow to a crawl.

After trying out TDD, I find the waterfall approach is better, especially when you have a multi-agent setup. Also I found that in some cases the tests were just superficial hallucinations that never actually tested the components written, or there was some context corruption that ultimately triggered a false positive and kicked off a completely unintentional refactoring.

__mharrison__ - 4 hours ago

My experience is the opposite. TDD keeps the guardrails on and lets me refactor with confidence.
Crazy times here in the development world. I'm always curious to watch others' best practices.
- dools - 4 hours ago
  
  Yeah I specifically tell it not to pre-emptively fix tests that it knows will break as a result of changes it's making and instead limit itself only to creating new tests for new changes. I want to see the tests break, then we go through and review each set of breakages versus the mission and assess if they’re regressions or stale assertions. This is a) how I know it’s actually writing meaningful tests, b) a very functional and useful form of “code review” versus just trying to catch problems by reading diffs, and c) something that has helped me find real problems and regressions.
  Almost all the breakages after a big refactor are stale assertions but every time I catch a couple of critical problems that make the entire exercise very worth it.
  The whole dev process is so fast compared to writing software manually that I find it absurd that I wouldn’t invest heavily in automated tests.
  - __mharrison__ - 4 hours ago
    
    See my AGENTS.md in nearby comment
rsalus - 2 hours ago

I was a big proponent of encoding TDD red-green-refactor methodology into my agent workflows until recently when I made the same realization after reading this study: https://arxiv.org/pdf/2602.07900
TLDR; it found test-writing volume only weakly correlates with success and that encoding test-writing principles did not move resolution rates but _did_ materially change cost. Encouraging tests cost +19.8% output tokens for 0% gain; discouraging them saved 33–49% input tokens for ≤2.6pp accuracy loss. Separately, imposing the TDD procedure specifically seems like it can backfire: it actually _increased_ regressions from 6.08% to 9.94%.
IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.
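
A rough sketch of that "oracle applied after generation" idea, not the setup used in the study: candidate implementations (hard-coded here as hypothetical stand-ins for model outputs) are checked against a fixed set of input/output cases, and the first one that passes is kept. The slug task and all names are made up for illustration.

```python
from typing import Callable, Iterable, Optional

Case = tuple[str, str]  # (input, expected output)


def passes_oracle(candidate: Callable[[str], str], cases: Iterable[Case]) -> bool:
    """Return True only if the candidate matches every expected output."""
    for given, expected in cases:
        try:
            if candidate(given) != expected:
                return False
        except Exception:
            # A crash counts as a failed check, not as a pass.
            return False
    return True


def first_passing(
    candidates: Iterable[Callable[[str], str]], cases: list[Case]
) -> Optional[Callable[[str], str]]:
    """Return the first candidate the oracle accepts, or None."""
    for candidate in candidates:
        if passes_oracle(candidate, cases):
            return candidate
    return None


# Hypothetical "generated" attempts at a slug function.
def attempt_1(text: str) -> str:
    return text.lower()  # forgets to replace spaces


def attempt_2(text: str) -> str:
    return "-".join(text.lower().split())


CASES: list[Case] = [("Hello World", "hello-world"), ("  TDD  ", "tdd")]
chosen = first_passing([attempt_1, attempt_2], CASES)
```

Here the cases exist before any candidate is generated and are never edited by the generator, which is what lets them act as an independent signal.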
- zuzululu - 35 minutes ago
  Very interesting paper and it lines up exactly with my observations. The ROI just isn't there writing tests up front and the conclusion in that paper lays it out clearly
  > Overall, these findings suggest that agent-written tests often behave more like a habitual software-development routine than a dependable source of validation in this setting. More agent-written tests do not mean more solves; what they more reliably change is the process footprint—API calls, token usage, and interaction patterns. Improving the value of testing for code agents may therefore require better oracles and more actionable validation signals, rather than simply inducing agents to write more tests.
  > IMO, where tests clearly help is primarily as an "oracle" applied after generation
  Bingo. I'm not against writing tests; it's that the returns are better when they're used as verification feedback and as an "oracle", exactly as you put it.
reg_dunlop - 5 hours ago

But that repurposing/removal is exactly what's avoided if you follow through with the SEF framework he outlines.
I have to push back on the idea that token costs balloon when using TDD within the context of a strong framework such as Jason has laid out here.
If the feature is repurposed/removed/refactored....I'd argue the specification wasn't well thought out prior to burning into tokens.
We're so eager to do a lot of the wrong things quickly, when it may serve us better to do a more precise thing slowly.
- zuzululu - 4 hours ago
  
  You can't spec out what you don't know; scope and requirements change from real-world feedback.
manmal - 4 hours ago

> With TDD I would be taxed heavily and velocity slow to a crawl.
And the code will be good.
- rsalus - 2 hours ago
  
  not necessarily, TDD has little bearing on output quality
jzig - 4 hours ago

Pattern-based testing can theoretically reduce the token cost?

dluxem - 5 hours ago

I believe using a skill here is the wrong approach. LLMs already know what TDD is and how to do it, just like object oriented programming.

If this is encoded in a skill, that skill essentially has to be loaded for everything your LLM is doing. This is probably one of the few areas where direct instructions via AGENTS.md are best, and I don't believe it requires much direction here to force the issue.

But I think the OP is just trying to have their agent work in a very specific way -- that is fine too.

> 5. Show me the test and ask for approval before continuing

jasonswett - an hour ago

My experience has been that yes, LLMs already know about TDD, OOP, etc., but they won't necessarily BEHAVE according to what they know unless you tell them. And of course, they "know" a lot of things that conflict with each other.
zuzululu - 5 hours ago

People forget a skill is just a markdown file, and I don't think one for TDD makes sense. Skills are more for specific niches, like working on your custom codebase or taking some less-beaten path and saving the lessons going forward.
But everybody is free to choose how they work and it may be required in ways that we can't know about.

jvuygbbkuurx - 5 hours ago

All of these posts are missing actual comparisons of results. I read exactly opposite 'you should do x' advice every day. If TDD actually was better it would simply be in the system prompts already.

bisonbear - 3 hours ago

Agree - all of this is based on vibes (I also use TDD based on vibes FWIW). The only way to settle "does TDD / caveman / [insert random skill here] help" is to replay real PRs from your repo and measure quality

realty_geek - 3 hours ago

As an aside, check out Jason's podcast (codewithjason.com) - it's pretty good.

The latest one is with "Uncle Bob Martin" who has some interesting takes on coding with AI from .... can I say an oldie?

jasonswett - an hour ago

Thanks, I'm glad you like it!

servercobra - 5 hours ago

This overall is pretty close to how I've set up my implementation skill. One thing I'm curious about is how well the analogies like "We don't make dinner in a dirty kitchen." work vs something a lot more straightforward. Any input OP?

jasonswett - an hour ago

OP here. I don't know, in my experience Claude took "clean the kitchen before we make dinner" to heart in an astonishingly productive way. I haven't tried many other analogies though.

__mharrison__ - 4 hours ago

Testing is so important for development.

Even more so when coding with agents. I think it is probably the biggest lever to keep AI within guardrails.

(It's also why I wrote my latest book, Effective Testing, because I routinely find that my clients are very poor at testing.)

enraged_camel - 3 hours ago

Spawning separate agents to review the original agent's implementation results in a very noticeable increase in code quality and decrease in bugs. This is why I encode two or three rounds of sub-agent review during the planning process, where I tell the agent authoring the plan to include those review rounds at the end. If the code is particularly load-bearing, I then ask a fourth agent, usually from the other frontier lab.

All of this burns more tokens of course, but probably way less than coming back to the code later to fix bugs. It is also slower, but in the long run saves time.

nullc - 2 hours ago

If you don't follow up with a pass of injecting bugs and validating that the tests fail in the presence of bugs... then you've only confirmed that the tests can pass and they may be substantially useless.
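
A minimal hand-rolled sketch of that check, assuming a hypothetical `clamp` function and test suite: each deliberately broken variant (a "mutant") is run through the same tests, and any mutant that still passes points to a gap. Dedicated mutation-testing tools automate this; nothing here reflects a specific tool's API.

```python
from typing import Callable

Clamp = Callable[[int, int, int], int]


def clamp(value: int, low: int, high: int) -> int:
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def suite_passes(fn: Clamp) -> bool:
    """Tiny test suite: True when every assertion holds for fn."""
    try:
        assert fn(5, 0, 10) == 5    # inside the range
        assert fn(-3, 0, 10) == 0   # below the range
        assert fn(42, 0, 10) == 10  # above the range
    except AssertionError:
        return False
    return True


# Deliberately injected bugs.
MUTANTS: dict[str, Clamp] = {
    "ignores upper bound": lambda value, low, high: max(low, value),
    "ignores lower bound": lambda value, low, high: min(value, high),
    "swapped bounds": lambda value, low, high: max(high, min(value, low)),
}

# Mutants that slip through reveal tests that only confirm the happy path.
survivors = [name for name, mutant in MUTANTS.items() if suite_passes(mutant)]
```

If the suite only contained the first assertion, the first two mutants would survive, which is exactly the "tests can pass but may be useless" situation.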

behnamoh - 6 hours ago

Snake oil. Just ask the model, all these custom agents/skills haven't proven that useful in practice.

jw1224 - 6 hours ago

Skills already are "just asking the model". Unless you'd prefer to type out the same instructions every single time?
Skills are literally just Markdown documents that get loaded into context when the /skill-name is invoked.
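
Conceptually, that is not much more than the following sketch. This is a simplified illustration, not how any particular agent implements it; the `<dir>/<name>/SKILL.md` layout, the function names, and the way the text is joined to the prompt are all assumptions.

```python
from pathlib import Path


def load_skill(skills_dir: Path, name: str) -> str:
    """Read the Markdown body of a skill (assumed layout: <dir>/<name>/SKILL.md)."""
    return (skills_dir / name / "SKILL.md").read_text(encoding="utf-8")


def build_context(user_message: str, skills_dir: Path) -> str:
    """Prepend a skill's Markdown when the message starts with /skill-name."""
    first, _, rest = user_message.partition(" ")
    if first.startswith("/"):
        skill_text = load_skill(skills_dir, first[1:])
        return f"{skill_text}\n\n{rest}"
    return user_message
```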
- Zetaphor - 5 hours ago
  
  I think they're maybe confusing Skills and MCP servers
- dominotw - 5 hours ago
  
  i believe gp means llms produce what they see in training data/rl, so there isn't too much customization you can do with skills.
  they are being sold as more powerful than they are. Like llms are intelligent blank slates that can be customized with mere markdown files.
  - calebkaiser - 5 hours ago
    
    I don't understand this line of criticism exactly. By putting new information in the context window, you are materially changing the activations at your point of sampling, which is literally "customizing with mere markdown files."
    Taken to the extreme, the attitude that there is some special incantation that will unlock all capabilities is silly, and a lot of the "prompt engineering" discourse is similarly kind of dumb, but in-context learning is clearly a real thing.
    
    dominotw - 5 hours ago
    
    even if that works one time, you can never be sure whether your customization is still in place, has fallen out of the context's important zone, or is contradicted by later context. at that point you've reverted to base llm behavior.
    you are treating skills like a sure thing
coffeeaddict1 - 6 hours ago

I disagree. Not all skills are useless. For example, I sometimes use Qt for GUI projects and I have found their skills [0] very useful to improve the quality and performance of my projects. In their absence, I would each time have to direct the agents to find the docs or specific tools, wasting tokens and thus decreasing the quality of the output.
[0] https://github.com/TheQtCompanyRnD/agent-skills
pramodbiligiri - 6 hours ago

I don't think the idea of skills is quite snake oil. It seems you can change what LLM outputs next by what's called few-shot prompting or in-context learning: https://www.promptingguide.ai/techniques/fewshot
john_strinlai - 6 hours ago

not that i know much about the effectiveness of these skill files, but i find it odd to call something given for free "snake oil", which i thought referred to the sale of fraudulent products (to the benefit of the snake oil salesperson), typically around healthcare-related stuff.
- dominotw - 5 hours ago
  
  i think gp is calling skills snake oil in general
internet101010 - 5 hours ago

Lol wut. One of the first things people do at a company when they get enterprise LLM tools is share a skill with company-specific color palettes or standards for creating visualizations (I prefer Tufte's principles).
beezlewax - 6 hours ago

I've found them useful for in-house stuff where you are using a specific design system or architecture. But custom everything works best. Agree that Claude works well on its own though at this point.
theptip - 5 hours ago

Nah. Skills are great. But you should write your own.
wyre - 5 hours ago

Ya, if I'm constantly asking a model to do TDD, you know what would make it a lot easier? A skill.

steno132 - 5 hours ago

Test driven development is one of the worst ideas nowadays in the LLM age. We have models that can consistently write expert level, usually bug free code for you and rapidly fix even complex bugs in your codebase.

The token cost and tech debt introduced by tests are just not worth it. There are usually no bugs and if there are, you can fix them quickly if and when it's needed.

Ginop - 5 hours ago

I disagree.
Testing was and is still very important. Since LLMs can still miss important points in business logic or other edge cases, I would argue that tests have become as important as code, if not more.
esafak - 3 hours ago

If your code has no bugs it's either trivial or you haven't noticed the bugs.