A case study in testing with 100+ Claude agents in parallel

imbue.com

55 points by thejash 2 days ago


npodbielski - 11 hours ago

If this is the future of software, in 20 years nobody will understand what the hell software actually does. And if nobody does, things will implode quickly.

Yokohiii - 12 hours ago

> Finally, remember that mngr runs your agent in a tmux session

what the hell?

dakolli - 14 hours ago

this is a pitch to sell an agent orchestration product and services.

khazhoux - 11 hours ago

Me: has to babysit every feature for hours in Claude Code, building a good plan but then still iterating many many times over things that need to be fixed and tweaked until the feature can be called done.

Bloggers: Here's how we use 3,000 parallel agents to write, test, and ship a new feature to production every 17 minutes in an 8M-LOC codebase (all agent-generated!).

... Am I doing something wrong, or are other people doing something wrong?

maxbeech - 14 hours ago

the thing that actually burns token budget at scale isn't the agent count itself—it's understanding the cost model of orchestrating them. 100 agents running in parallel is fine if they're short-lived queries. but once you start running them on a schedule (hourly checks, overnight batch work), the math changes fast.

each agent run against a real codebase probably spends 20-50k tokens just on context: repo structure, relevant files, recent changes. multiply that by 100 agents running every hour across 10-20 repos, and you're already hitting millions of tokens a day before any actual work happens. add in re-runs for failures or retries, and the cost curve gets steep quickly.
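a quick back-of-envelope to make that concrete. the numbers below are illustrative assumptions (midpoint of the 20-50k context figure, hourly runs), not measurements from the article:

```python
# Back-of-envelope token cost for scheduled agent runs.
# All numbers are illustrative assumptions, not measurements from the article.
def daily_context_tokens(agents: int, runs_per_day: int, tokens_per_run: int) -> int:
    """Tokens spent on context alone, before any actual work happens."""
    return agents * runs_per_day * tokens_per_run

# 100 agents, hourly runs, ~35k context tokens each (midpoint of 20-50k):
total = daily_context_tokens(agents=100, runs_per_day=24, tokens_per_run=35_000)
print(f"{total:,} tokens/day just on context")  # 84,000,000 tokens/day
```

that's 84M tokens a day on context alone, before retries or any real work.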

the harder problem is observability. with one agent you can read logs and understand what went wrong. with 100 agents you need aggregation, pattern detection, alerting on the common failure modes. if 3 agents fail silently but identically, was that a real issue or just rate limiting? if 40 agents all timeout at the same step, was it a dependency problem or infrastructure saturation? at scale you're debugging distributions, not individual runs.
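"debugging distributions" in practice means grouping failures by signature before looking at any single run. a minimal sketch, with hypothetical field names standing in for whatever your run records actually contain:

```python
from collections import Counter

# Hypothetical run records; the field names are assumptions for illustration.
failed_runs = [
    {"agent": f"a{i}", "step": "fetch_deps", "error": "timeout"} for i in range(40)
] + [
    {"agent": f"b{i}", "step": "lint", "error": "rate_limited"} for i in range(3)
]

# Group failures by (step, error) signature instead of reading logs one by one.
signatures = Counter((r["step"], r["error"]) for r in failed_runs)
for (step, error), n in signatures.most_common():
    print(f"{n:3d} agents failed at {step!r} with {error!r}")
# 40 identical timeouts at one step points at a shared dependency or
# saturation; 3 identical rate-limit errors is probably just throttling.
```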

also helps to be ruthless about concurrency. the async pattern isn't "run as many as possible at once"—it's "run exactly as many as the API and your budget can support without making the failure modes harder to diagnose." for claude api work that's usually smaller than people expect.
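the "exactly as many as you can support" pattern is just a semaphore around the API call. a minimal asyncio sketch (the concurrency limit of 8 and the sleep standing in for the API call are assumptions):

```python
import asyncio

MAX_CONCURRENT = 8  # tune to your API limits and budget; an assumption, not a recommendation

async def run_agent(task_id: int, sem: asyncio.Semaphore) -> str:
    async with sem:  # at most MAX_CONCURRENT agents in flight at once
        await asyncio.sleep(0.01)  # stand-in for the actual agent/API call
        return f"task {task_id} done"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # Launch 100 tasks, but the semaphore caps how many run concurrently.
    return await asyncio.gather(*(run_agent(i, sem) for i in range(100)))

results = asyncio.run(main())
print(len(results))  # 100
```

the point is that all 100 tasks are queued up front, but failures stay diagnosable because only a bounded number are ever hitting the API at the same time.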

petcat - 15 hours ago

Curious how people and companies like this are approaching intellectual property now that the courts have ruled that basically no part of AI-generated content or code is copyrightable, and that it's therefore impossible to claim ownership of.

Are people just not going to open source anything anymore since licenses don't matter? Might as well just keep the code secret, right?