Agents that run while I sleep
claudecodecamp.com
327 points by aray07 14 hours ago
This _all_ (waves hands around) sounds like a lot of work and expense for something that is meant to make programming easier and cheaper.
Writing _all_ (waves hands around various llm wrapper git repos) these frameworks and harnesses, built on top of ever changing models sure doesn't feel sensible.
I don't know what the best way of using these things is, but from my personal experience, the defaults get me a looong way. Letting these things churn away overnight, burning money in the process, with no human oversight seems like something we'll collectively look back at in a few years and laugh about, like using PHP!
These people play around with shit and try to sell you on their secret sauce. If it actually works, it will come to Claude Code, so you can consider that practical SOTA. Honestly, just plopping CC into a mid-sized codebase is already a pretty great experience for me. Not ideal, but I get real tangible value out of it. Not 10x or any such nonsense, but enough that I don't think I want to be managing junior developers anymore; the ROI with LLMs is much faster and more significant IMO.
> sounds like a lot of work and expense for something that is meant to make programming easier and cheaper.
Not if you are an AI gold rush shovel salesman.
From the article:
> I've run Claude Code workshops for over 100 engineers in the last six months
Yeah, my colleague recently said "hey, I've burnt through $200 in Claude in 3 days". And he was prompting manually, max 8 hrs/day. Imagine what would happen if AI was prompting.
I like this analogy a lot: AI is (or should be) like an exoskeleton; it should help people do things. If you put your car in drive, step out, and go to sleep, the next day it will be farther along. But the question is: is it still on the road?
I would encourage my competitors to use AI agents on their codebase as much as possible. Make sure every new feature has it, lots of velocity! Run those suckers day and night. Don't review it, just make sure the feature is there! Then when the music stops, the AI companies hit the economic realities, go insolvent, and they are left with no one who understands a sprawling tangled web of code that is 80% AI generated, then we'll see who laughs last.
Both can be true at the same time: some teams are spending a fortune on AI, and the AI investments won't get the expected ROI (bubble collapse). What is sure is that a lot of capacity has been built, and that capacity won't disappear.
What I could see happening in your scenario is that the company suffers from diminishing returns as every task becomes more expensive (new features, debugging sessions, library updates, refactoring, security audits, rollouts, infra cost). They could also end up with an incoherent, gigantic product that doesn't make sense to their customers.
Both pitfalls are avoidable, but they require focus and attention to detail. Things we still need humans for.
Qwen3 Coder Next and Qwen3.5-35B-A3B are already very good and can be run on today's higher-end home computers at good speed. Tomorrow's machines will not be slower, and models keep getting more efficient. A good SW engineer would still be valuable in tomorrow's world, but not as a software assembler.
Even cutting-edge models are not very good. They are not even at a mediocre level. Don't get me wrong, they are improving, and they are awesome, but they are nowhere near good yet. Vibe-coded projects have more bugs than features, their architecture and design systems are terrible, and their tests are completely useless about half the time. If you want a good product, you need to rewrite almost everything that's written by LLMs. Probably this won't be the case in a few years, but for now even "very good" LLMs are not very good at all.
> Don't review it, just make sure the feature is there!
Bad idea. Use another agent to do automatic review. (And a third agent writing tests.)
Don't forget the architecting and orchestrating agent too!
I am not laughing about PHP. To this very day many of my best projects are built on PHP. And while I have spent the last 7 years in a full-stack JavaScript/TypeScript environment, it has never let me produce the same things I was actually able to do with PHP.
I actually feel that the things I built 15 years ago in PHP were better than anything I am trying to achieve with modern stacks that get outdated every 6 months.
what in God's Name could you do in PHP that you can't do in a modern framework?
Nothing; but PHP, in experienced hands, will be waaay more productive for small-to-medium things. One issue is that experienced hands are increasingly hard to come by. Truly big, complicated things, built by large teams or numbers of teams, teams with a lot of average brains or AIs trained on average brains, will be better off in something like Typescript/React. And everyone wants to work on the big complicated stuff. So the "modern frameworks" will continue to dominate while smaller, more niche shops will wonder why they waste their time.
I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful, app doesn't scale, high latency etc. What does their php code do? 95% of it is calling a DB.
You're telling me today with LLM power multiplier it's THAT much faster to write in PHP compared to something that can actually have a future?
“PHP was so easy and fast that they’ve built such a successful startup they now have scaling problems” is, as far as I can tell, an endorsement of PHP and not a criticism of it.
Yes, startup success has a direct correlation to the language chosen for your CRUD api…
> I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful
You can stop there! Sounds like PHP worked for them. Already doing better than 90% of startups.
If 95% of what app does is calling a DB, then the bottleneck is in the DB, not with the PHP.
You can use persistent DB connections, and app server such as FrankenPHP to persist state between requests, but that still wouldn't help if DB is the bottleneck.
Sometimes it’s still the app:
    rows = select all accounts
    for each row in rows:
        update row
But that's not necessarily a PHP problem. N+1 queries are everywhere. Depending on what you are doing, the above is not necessarily bad; it's often much better than a single SQL statement that locks an entire table (potentially blocking the whole DB, if this is one of the key tables).
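As an illustrative sketch (Python's built-in sqlite3 standing in for the PHP app; table and column names are made up), here is the row-by-row loop next to the set-based statement it's being compared against:

```python
# Illustrative only: an N+1-style per-row UPDATE loop versus one set-based
# UPDATE. Table/column names are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts (balance) VALUES (?)",
                 [(100,), (200,), (300,)])

# N+1 style: one UPDATE per row, i.e. one round-trip per account,
# but each statement touches (and locks) only a single row at a time
for (acct_id,) in conn.execute("SELECT id FROM accounts").fetchall():
    conn.execute("UPDATE accounts SET balance = balance + 1 WHERE id = ?",
                 (acct_id,))

# Set-based alternative: one statement, fewer round-trips,
# but it holds broader locks while it runs
conn.execute("UPDATE accounts SET balance = balance + 1")

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(102,), (202,), (302,)]
```

Which one is "bad" depends on the workload: the loop costs round-trips, the single statement costs lock breadth, exactly the trade-off described above.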
> I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful, app doesn't scale, high latency etc. What does their php code do? 95% of it is calling a DB.
So PHP worked perfectly, but the DB is slow? Your DB isn't going any faster by switching to something else, if that's what you think.
PHP is the future, where React has been heading for years.
PHP did better than Python and Perl. Python is doomed. PHP already has a good JIT, good OO lately, good frameworks, stable extensions. It has a company behind it.
Unlike Python or Ruby, which break left and right all the time on updates; you have to use bunkers of venvs, without any security updates. A nightmare.
PHP can scale and has a future.
Python is doomed? That's new.
You use python docker images pinned to a stable version (3.11 etc), and between bigger versions, you test and handle any breaking changes.
I feel like this approach applies to pretty much every language?
Who on earth raw dogs on "language:latest" and just hopes for the best?
Granted, I wouldn't be running Facebook's backend on something like this. But I feel that isn't a problem 95% of people need to deal with.
No, only to python. And partially ruby and ocaml. Not to typescript, perl or PHP.
I don't think it's true that experienced hands will be faster in PHP than in Python or JS or whatever. It's just about what you know, and experienced hands are experienced.
You can build those things in modern frameworks, it will just be more headache and will feel outdated in 6 months.
Where are my Backbone apps? In the trash? My Ember apps? Next to them. My create-react-apps? On top of those. My Next apps? Being trashed as we speak. My Rails apps? Online and making money every year with minimal upgrade time. What the hell was I thinking.
6 years ago I was writing apps in typescript and react, if I was starting a new project today I'd write it in typescript and react.
People bicker about PHP and JavaScript, sorry, TypeScript, like they aren't both mule languages people pick up to get work done. They both matured really well through years of production use.
They are in the same group, with a similar pedigree. If you were programming purely for the art of it, you would have had time to discover much nicer languages than either, but that's not what most people are doing, so it doesn't really matter. They're different, but they're about as good as each other.
No need to "build" anything. You edit code and it is already deployed on your dev instance.
Deploying to production is just scp -rv * production:/var/www/
Beautifully simple. No npm build crap.
You trade having to compile for actually having code that can scale
Not sure what you're talking about. I scaled to millions of users on a pair of boxes with PHP, and its page generation time absolutely crushed Rails/Django times. Apache with mod_php auto-scales wonderfully.
It scales just fine the same way everything else scales: put a load balancer in front of multiple instances of your app.
> sounds like a lot of work and expense for something that is meant to make programming easier and cheaper.
It's not more work; it's a convergence of roles. BA/PO/QA/SWE are merging.
AI has automated aspects of those roles that have made the traditional separation of concerns less desirable. A new hybrid role is emerging. The person writing these acceptance criteria can be the one guiding the AI to develop them.
So now we have dev-BAs or BA-devs or however you'd like to frame it. They're closer to the business than a dev might have been or closer to development than a BA might have been. The point is, smaller teams are able to play wider now.
> It's not more work
It literally is. You're spending weeks of effort babysitting harnesses and evaluating models while shipping nothing at all.
That hasn't been my experience, as a "ship or die" solopreneur. It takes work to set up these new processes and procedures, but it's like building a factory; you're able to produce more once they're in place.
And you're able to play wider, which is why the small team is king. Roles are converging both in technologies and in functions. That leads to more software that's tailored to niche use cases.
> you're able to produce more once they're in place
Cool story; unfortunately, the proof is not in the pudding, and none of this phantom 10x vibe-coded software actually works or can be downloaded and used by real people.
P.S. Compare to AI-generated music which is actually a thing now and is everywhere on every streaming platform. If vibe coding was a real thing by now we'd have 10 vibecoded repos on Github for every real repo.
It being a lot of work is why they didn't do it at all for weeks, and still, without self-reflection, wrote that they care about the quality of the code they hadn't looked at or tested.
I can't believe we're back to advocating for TDD. It was a failed paradigm the last few times we tried it. This time isn't any different, because the fundamental flaw has always been the same: tests aren't proofs; they don't have complete coverage.
Before anyone gets too confused: I love tests. They're great. They help a lot. But to believe they prove correctness is absolutely laughable. Even the most general tests are very narrow. I'm sure they help LLMs just as they help us, but they're not some cure-all. You have to think long and hard about problems and shouldn't let tests drive your development. They're guardrails for checking bounds and reducing footguns.
Oh, who could have guessed: Dijkstra wrote about program correctness. (No, this isn't the foolishness of natural language programming, but it is about formalism ;)
https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...
Testing works because tests are (essentially) a second, crappy implementation of your software. Tests only pass if both implementations of your software behave the same way. Usually that will only happen if the test and the code are both correct. Imagine if your code (without tests) has a 5% defect rate. And the tests have a 5% defect rate (with 100% test coverage). Then ideally, you will have a 5%^2 defect rate after fixing all the bugs. Which is 0.25%.
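The arithmetic in that paragraph, spelled out (the 5% figures come from the comment and are illustrative only; the model assumes code defects and test defects are independent):

```python
# Illustrative back-of-envelope model from the comment above: a defect only
# survives if both the code and its covering test are wrong in compatible ways.
code_defect = 0.05   # assumed chance a given behavior is wrong in the code
test_defect = 0.05   # assumed chance the covering test is also wrong

residual = code_defect * test_defect  # independence assumed
print(f"{residual:.2%}")  # 0.25%
```

The independence assumption is the weak point: when the same author (human or LLM) writes both the code and the test, their mistakes are often correlated, so the real residual rate will sit somewhere above 0.25%.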
The price you pay for tests is that they need to be written and maintained. Writing and maintaining code is much more expensive than people think.
Or at least it used to be. Writing code with claude code is essentially free. But the defect rate has gone up. This makes TDD a better value proposition than ever.
TDD is also great because claude can fix bugs autonomously when it has a clear failing test case. A few weeks ago I used claude code and experts to write a big 300+ conformance test suite for JMAP. (JMAP is a protocol for email). For fun, I asked claude to implement a simple JMAP-only mail server in rust. Then I ran the test suite against claude's output. Something like 100 of the tests failed. Then I asked claude to fix all the bugs found by the test suite. It took about 45 minutes, but now the conformance test suite fully passes. I didn't need to prompt claude at all during that time. This style of TDD is a very human-time efficient way to work with an LLM.
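The "fix until the suite passes" loop described above can be sketched in a few lines. This is not the commenter's actual harness; the runner and agent are injected so the same loop works with pytest plus any agent CLI (e.g. `claude -p "<prompt>"`), and all names here are invented:

```python
# Hedged sketch: re-run a test suite, feed failure output back to a coding
# agent, and stop once everything is green or a round budget is exhausted.
import subprocess
from typing import Callable, Tuple

def pytest_runner() -> Tuple[bool, str]:
    # One concrete runner: shell out to pytest and capture its output
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def fix_until_green(run_tests: Callable[[], Tuple[bool, str]],
                    run_agent: Callable[[str], None],
                    max_rounds: int = 10) -> bool:
    # The failing output is the only context the agent gets each round
    for _ in range(max_rounds):
        ok, output = run_tests()
        if ok:
            return True
        run_agent("Fix the implementation so these tests pass; "
                  "do not modify the tests:\n" + output)
    return False
```

The "do not modify the tests" instruction is doing real work here: without it, the cheapest way for the agent to turn the suite green is to edit the assertions.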
I think there is a difference whether you do TDD or write tests after the fact to avoid regression. TDD can only work decently if you already know your specs very well, but not so much when you still need to figure them out, and need to build something actual to be able to figure it out.
When you write tests with LLM-generated code you're not trying to prove correctness in a mathematically sound way.
I think of it more as "locking" the behavior to whatever it currently is.
Either you do the red-green-with-multiple-adversarial-sub-agents -thing or just do the feature, poke the feature manually and if it looks good then you have the LLM write tests that confirm it keeps doing what it's supposed to do.
The #1 reason TDD failed is because writing tests is BOORIIIING. It's a bunch of repetition with slight variations of input parameters, a ton of boilerplate or helper functions that cover 80% of the cases, but the last 20% is even harder because you need to get around said helpers. Eventually everyone starts copy-pasting crap and then you get more mistakes into the tests.
LLMs will write 20 test cases with zero complaints in two minutes. Of course they're not perfect, but human made bulk tests rarely are either.
Hmm, not so sure TDD is a failed paradigm. Maybe it isn't a panacea, but it seems like it's changed how software development is done.
Especially for backend software and also for tools, it seems like automated tests can cover quite a lot of the use cases a system encounters. Their coverage can become so good that they'll allow you to make major changes to the system, and as long as they pass the automated tests, you can feel relatively confident the system will work in prod (I have seen this many times).
But maybe you're separating automated testing and TDD as two separate concepts?
I think tests in general are good, just not TDD, as it forces you into what I think is a bad and narrow paradigm of thinking. I think, e.g., it is better that I build the thing, then get to 90%+ coverage once I am sure this is what I would actually ship.
Indeed, they are two separate concepts.
I write lots of automated tests, but almost always after the development is finished. The only exception is when reproducing a bug, where I first write the test that reproduces it, then I fix the code.
TDD is about developing tests first then writing the code to make the tests pass. I know several people who gave it an honest try but gave up a few months later. They do advocate everyone should try the approach, though, simply because it will make you write production code that's easier to test later on.
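The bug-reproduction flow mentioned above (failing test first, then the fix) looks roughly like this. Everything here is hypothetical: `slugify` and its bug are invented for illustration:

```python
# Step 1 (written first, from the bug report "punctuation leaks into slugs"):
# a test that fails against the buggy code. Step 2: the fix that makes it pass.
def slugify(title: str) -> str:
    # Fixed implementation: lowercase, drop punctuation, join words with dashes
    cleaned = "".join(c if c.isalnum() or c == " " else ""
                      for c in title.lower())
    return "-".join(cleaned.split())

def test_slugify_strips_punctuation():
    # This assertion encodes the reported bug before any code was touched
    assert slugify("Hello, World!") == "hello-world"

test_slugify_strips_punctuation()
```

The payoff is the one described above: once the reproduction exists, the fix is verified automatically, and the test stays behind as a regression guard.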
> But to believe they prove correctness is absolutely laughable.
Sounds like a lack of tests for the correct things.
> But to believe they prove correctness is absolutely laughable.
You don't need to believe this to practice TDD. In fact I challenge you to find one single mainstream TDD advocate who believes this.
You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.
The trick is just not mixing/sharing the context. Different instances of the same model don't recognize each other's work, so they aren't inclined to go easy on it to be compliant.
> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.
It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start to grow in count, and important new things aren't tested, or aren't tested well.
I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests within that just asserted the test harness had been set up correctly; the functionality those tests had covered was checked for existence, but its behavior was now virtually untested.
Reward hacking is very real and hard to guard against.
The trick is, with the setup I mentioned, you change the rewards.
The concept is:
Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.
Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
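The context-isolation rule in the scheme above can be made concrete as a set of restricted views over the shared artifacts. This is a sketch of the information barriers only, not a runnable agent system, and all names are made up:

```python
# Each "team" is a fresh agent context; these functions define the ONLY view
# of the shared artifacts that each role is ever shown.
from dataclasses import dataclass

@dataclass
class Artifacts:
    spec: str                   # shared: visible to Red and Green
    test_source: str = ""       # Red's output; hidden from Green
    implementation: str = ""    # Green's output; hidden from Red
    last_results: str = ""      # pass/fail summary only, no assertions

def red_view(a: Artifacts) -> str:
    # Red writes tests from the spec alone, never the implementation,
    # so a test that passes immediately is suspicious rather than rewarded
    return a.spec

def green_view(a: Artifacts) -> str:
    # Green sees the spec plus pass/fail results, never test source,
    # so hard-coding expected values from assertions is off the table
    return a.spec + "\n\nTest results:\n" + a.last_results

def refactor_view(a: Artifacts) -> str:
    # Refactor sees the implementation but not the spec; its constraint is
    # that all tests stay green while quality metrics improve
    return a.implementation
```

The views encode the reward structure described above: what a role cannot see, it cannot game, which is why the barriers matter more than the prompts.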
You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.
What kind of setup do you use? Can you share? How much does it cost?
Paste the comment you replied to into a LLM good at planning. That’s something the codex/claude setups can create for you with a little back and forth.
rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow
(I built it)
Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.
Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?
If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results
Sounds a lot like paying for online ads, they don't work because you're not paying enough, when in reality bots, scrapers and now agents are just running up all the clicks.
You pay more to try and get above that noise and hope you'll reach an actual human.
The new "fast mode" that burns tokens at 6 times the rate is just scary, because that's what everyone will soon say we all need to be using to get results.
It feels like everyone's gone mad.
Here I am mostly writing code by hand, with some AI assistant help. I have a Claude subscription but only use it occasionally because it can take more time to review and fix the generated code as it would to hand-write it. Claude only saves me time on a minority of tasks where it's faster to prompt than hand-write.
And then I read about people spending hundreds or thousands of dollars a month on this stuff. Doesn't that turn your codebase into an unreadable mess?
I've been thinking about this recently, and it seems like the most enthusiastic boosters always suggest the difference in results is a skill issue, but I feel like there are 4 factors which multiply out to influence how much value someone gets:

- The quality of model output for _your particular domain / tech stack_. Models will always do better with languages and libraries they see a lot of than with esoteric or proprietary ones.
- The degree to which "works" = "good" in your scenario. For a one-off script, "works" is all that matters; for a long-lived core library, there are other considerations.
- The degree to which "works" can be easily (better yet, automatically) verified.
- Techniques, existing code cleanliness, documentation, etc.

Boosters tend to lay all different experiences at the feet of this last factor, yet I'd argue the others are equally significant.
On the other hand, if you want to get the best results you can given the first 3 (which are generally out of one's control) then don't presume there's nothing you can do to improve the 4th.
Why read code when you are getting results fast? See https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
I am not kidding. People don't seem to understand what's actually happening in our industry. See https://www.linkedin.com/posts/johubbard_github-eleutherailm...
Why is everyone obsessed with Mac Minis? They're awesome, but for the work these people are attempting to do? Just seems... nonsensical. Renting a server is cheaper and still just as "local" as any of this (they want "self-hosted"; I don't think anyone cares about local. Like, are people air-gapping networks? lol)
And a senior director of Nvidia? He had several Mac Minis? I really gotta imagine a Spark is better... at least it'll be a bit smarter of a cat (I'm pretty suspicious he used a LLM to help write that post)
No time to think, gotta go fast?