I'm going back to writing code by hand
blog.k10s.dev | 1020 points by dropbox_miner 6 days ago
Yep. The only people I've heard saying that generated code is fine are those who don't read it.
The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.
Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and that puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.
Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I’d say they do no better than random chance.
Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself around them when they need to change. If you don't read what the agent does very carefully - more carefully than human-written code, because the agent doesn't complain about contorted code - you will end up with the same "code that devours itself", only you won't know it until it's too late.
If you know how to write good code, you can force AI to write good code with various techniques. It's 100% doable. You just need to figure out the problems AI has and find solutions that make things easier for it. For example: keep contexts extremely small. Split the code into modules with clear boundaries and only allow the AI to work within those boundaries. Make modules free of IO so they are easily testable. Hide modules behind interfaces, etc. You can write 100 tests that execute within a second. You can write benchmarks, etc. AI needs boundaries and small contexts to work well. If you fail to give it that, it will perform poorly. You are in charge.
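To sketch what I mean in C (names made up, just an illustration): a module kept free of IO behind a small header, so a plain test binary can run a pile of cases in well under a second.

    /* ratelimit.h -- hypothetical module: interface only, no IO anywhere */
    #ifndef RATELIMIT_H
    #define RATELIMIT_H
    #include <stdint.h>

    typedef struct {
        uint32_t capacity;  /* max tokens the bucket holds */
        uint32_t tokens;    /* current token count */
    } rl_bucket;

    /* Pure: no clocks, no sockets; the caller passes elapsed time in. */
    void rl_init(rl_bucket *b, uint32_t capacity);
    int  rl_try_take(rl_bucket *b, uint32_t elapsed_ms, uint32_t refill_per_s);
    #endif

    /* ratelimit.c -- the whole surface the AI is allowed to touch */
    #include "ratelimit.h"

    void rl_init(rl_bucket *b, uint32_t capacity) {
        b->capacity = capacity;
        b->tokens = capacity;
    }

    int rl_try_take(rl_bucket *b, uint32_t elapsed_ms, uint32_t refill_per_s) {
        uint32_t refill = (uint32_t)((uint64_t)elapsed_ms * refill_per_s / 1000);
        b->tokens = (b->tokens + refill > b->capacity) ? b->capacity
                                                       : b->tokens + refill;
        if (b->tokens == 0) return 0;
        b->tokens--;
        return 1;
    }

    /* test_ratelimit.c -- dozens of cases like these run in milliseconds */
    #include <assert.h>
    #include "ratelimit.h"

    int main(void) {
        rl_bucket b;
        rl_init(&b, 10);
        for (int i = 0; i < 10; i++)
            assert(rl_try_take(&b, 0, 1) == 1);  /* bucket drains */
        assert(rl_try_take(&b, 0, 1) == 0);      /* then refuses */
        assert(rl_try_take(&b, 1000, 1) == 1);   /* and refills with time */
        return 0;
    }

Because nothing in here can open a file or a socket, the boundary is enforced by the compiler, not by the prompt.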
That doesn't quite work, and precisely for the reason I mentioned: You can definitely tell the AI to follow some strategy, but at some point the strategy will need to change, and the AI won't tell you that (even if you tell it to). Unless you read the code every time, you won't know whether the AI is following the strategy and producing good results, or following it and producing bad results because the strategy needs to change. This can happen even in small changes: the AI will follow the strategy even when the change proves it wrong, and if you don't pay close attention, these mistakes pile up.
So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
How do you define "bad code"?
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
Sure, if you carefully review the agent's output, including tests, you can get good results. If you don't carefully review the output, you obviously have no idea if it's good enough for you. The only way you'll find out is when, 30 changes down the line, the agent can't change one thing without breaking another, but by then the codebase will be too far gone to fix.
This is essentially true. There are other ways to achieve this goal, though, that don’t require exhaustive human review; better models are able to do that part as well if properly guided. The key is that yes, some of the design constraints will morph over time, necessarily, since coding is as often about discovering the problem as solving it. But design principles don’t drift. If you have a design principle that cannot be adhered to, it is not a proper principle; it’s an opinion about the problem.
The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct: the plan, the maintainer manual, the code, or the comments in the code. Notice that there are three separate things written about the code, plus the code itself… Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to catch bad smells before they get set in stone.
It’s a token fire and you need a model with at least a 250k context… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and better tested than any code I have ever written before.
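For concreteness, the written layers look something like this (file names are illustrative, not a prescription):

    docs/PRINCIPLES.md  -- invariant design principles; the model never edits this
    docs/PLAN.md        -- the plan for the change currently in flight
    docs/MANUAL.md      -- the maintainer manual, kept in sync with the code
    src/                -- the code and its comments, the fourth account

Any two of the four disagreeing is the smell test.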
> There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided.
Not at this time. Even if you could somehow get their success rate to 90%, it's still far too low because the mistakes can be (and occasionally are) catastrophic. It's only when you review everything that you find mistakes that will bite you down the line. If you don't review everything, you just don't know, but the rate of bad mistakes introduced by the agents is too high to trust, no matter how much prompting and orchestration you do. Maybe future models will address that, but we're not there yet.
> The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code.
That's helpful but it doesn't solve the problem, which is that the agents are happy to introduce horrendous workarounds, and they don't tell you that the code they've written is a horrendous workaround. The docs are fine and reflect the code and the code reflects the strategy, but you just don't know that the strategy is wrong.
I haven’t had this problem. Maybe it’s because of the language I’m using (C++) or maybe it’s because of the strict enforcement of modularity and public vs private interfaces, etc that I use? Also, the code is tested against the hardware with every change. Idk if that’s why my experience has been different from yours or not.
My workflow also requires a discussion of the architecture and methodology of each addition or change, but honestly because we define the interfaces first, and each concern is given its own .c and .h file, it’s very hard to sneak something in without me noticing and calling it out. (Which does happen occasionally)
I suspect that file-level granularity may be one of the keys. It never actually works on more than a couple hundred lines of code at a time, plus the interfaces of related files. I end up with a hundred files where I might have had 30 coding by hand, but it is actually easier for me to reason about the code as well, and the number of files is not an issue because of the automation. Total LOC is about the same as I would produce by hand for the same work, which means it’s actually writing less, due to the interface overhead, so I’m pretty stoked about that. The only real nightmare for humans is the long includes.
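To give a flavor of the structure (names made up): each concern gets a header that exposes only an opaque handle, so one module literally cannot reach into another’s state, and anything snuck in has to show up in a header diff I will read.

    /* motor.h -- one concern, public interface only */
    #ifndef MOTOR_H
    #define MOTOR_H
    #include <stdint.h>

    typedef struct motor motor;  /* opaque: the internals live in motor.c */

    motor *motor_open(uint8_t bus_addr);
    int    motor_set_speed(motor *m, int16_t rpm);
    void   motor_close(motor *m);
    #endif

    /* motor.c -- everything not declared in motor.h stays static */
    #include <stdlib.h>
    #include "motor.h"

    struct motor {
        uint8_t bus_addr;
        int16_t rpm;
    };

    static int16_t clamp_rpm(int16_t rpm) {
        return rpm > 3000 ? 3000 : rpm;  /* made-up hardware limit */
    }

    motor *motor_open(uint8_t bus_addr) {
        motor *m = calloc(1, sizeof *m);
        if (m) m->bus_addr = bus_addr;
        return m;
    }

    int motor_set_speed(motor *m, int16_t rpm) {
        m->rpm = clamp_rpm(rpm);
        return 0;
    }

    void motor_close(motor *m) { free(m); }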
OTOH if I don’t do all of this it will definitely go off the rails and produce garbage.
I’ve been writing c (and c++) for almost 40 years, and although that doesn’t mean I’m any good, it does mean I have developed a keen sense of smell and highly sensitive olfactory PTSD.
With the right structured environment, a SOTA model with a suspicious seasoned dev holding its hand can be easier to manage and much more productive than a small team. Or, maybe I’ve just sucked so bad my whole life that I can’t tell the difference, but at any rate it works well enough to ship without nightmares, and with fewer bugs and less patching than I had before.
Edit:
I should mention that if bugs get tricky, like hardware idiosyncrasies and things like that, the model just goes nuts. If I handle it very, very carefully so that it does not try to understand the problem, and I just have it poke the firmware with a stick from a distance, enough times and from enough angles, then as long as I have successfully prevented it from trying to figure out the problem (which is not as easy as it seems like it would be), it actually will usually nail it. If it starts to guess, it’s usually best just to roll back the context and start over with the poking (I have a harness so it does direct hardware probes).
There seems to be an analog to this for non-hardware issues, but it’s harder to suss out when you should be telling it that you specifically do not want it to attempt to understand or solve the problem until you’ve rigged and tested all of the debug messaging.
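To make the poking concrete, here’s a minimal sketch of the kind of probe I mean (the addresses, names, and hw_read32 are made up; the real harness does the actual hardware access). The point is that it reports raw values and contains no interpretation for the model to latch onto:

    /* probe.c -- dump raw registers, draw no conclusions */
    #include <stdint.h>
    #include <stdio.h>

    #define UART_STATUS 0x4000u  /* hypothetical register offsets */
    #define UART_ERRCNT 0x4004u

    extern uint32_t hw_read32(uint32_t addr);  /* supplied by the rig */

    int main(void) {
        /* Poke from several angles; print facts only, no diagnosis. */
        for (int i = 0; i < 8; i++)
            printf("probe %d: STATUS=%08x ERRCNT=%08x\n", i,
                   (unsigned)hw_read32(UART_STATUS),
                   (unsigned)hw_read32(UART_ERRCNT));
        return 0;
    }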
I don't think our experience is different. Letting the agent work on pieces no bigger than a couple hundred lines at a time and checking if there's something fishy or not and that the code is legible and logical is close human supervision. This is very much not what the people who wish AI could build products for them do or can do at the rate they're moving.
I get what you mean, but that can also happen with code written by humans.
Sure, by inexperienced ones.
30 years of experience writing bad code, with no effort to improve, doesn't make you any good. You need the right attitude and humility to become good.
Some of the worst programmers I have ever worked with had 30+ years of experience. They basically spend all of their time fixing bug after bug in a never ending cycle because the software they produced was so fragile that it would crash if you just looked at it wrong or the temperature in the room wasn't perfect.
While others with the same number of years of experience had massive systems in production for years with not a single bug reported by the happy users.
Hm. Some have rather a lot of experience making such messes themselves.
I mean, for real, is the idea here that all programmers are or were some kind of demigods?
Because this is not what I remember from pre-LLM times; rather, this:
I know I got into such development hell myself. Fix a bug here, and something breaks there. Experience surely helps in avoiding it .. but even senior devs can make a mess. Otherwise there wouldn't be so many canceled projects.
So sure, agents can multiply a mess in an amazingly short time, but .. that is up to the humans guiding them.
That is correct. Using an AI to generate code and then not verifying it yourself is IMHO unprofessional and should get you, at a minimum, a verbal warning. YOU are responsible for the code, NOT the AI.
I let agents break things 30 changes down the line. If something breaks, I add a check to my project validator and start over, with the validator providing instructions on what was wrong and how to fix it. It's all automatic, and now I have a guard against the exact same error in the future.
Some of these checks have caught thousands of instances of the same error, even with the latest Opus 4.7 writing the original code.
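A sketch of what one of these checks might look like (the forbidden pattern and the fix message are made-up examples): it scans the sources, and on a hit it fails the run and prints the instructions that get fed back to the agent.

    /* check_no_sleep.c -- one hypothetical validator check */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        int failed = 0;
        char line[4096];
        for (int i = 1; i < argc; i++) {  /* source files to scan */
            FILE *f = fopen(argv[i], "r");
            if (!f) continue;
            for (int n = 1; fgets(line, sizeof line, f); n++) {
                if (strstr(line, "sleep(")) {  /* the mistake it once made */
                    printf("%s:%d: raw sleep() call.\n", argv[i], n);
                    printf("fix: use the event-loop timer (docs/timers.md)\n");
                    failed = 1;
                }
            }
            fclose(f);
        }
        return failed;  /* nonzero: the agent rolls back and starts over */
    }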
You proved that testing is a good idea, not that vibe coding is a good idea.
To be honest, I am past the point of wanting to convince people that AI is useful. If you want to refuse new tools other people find helpful, that's your loss.
(Also I stick to the original definition of "vibe coding = not looking at generated code", "LLM assisted coding = verify generated code", I do both, depending on the task)
So basically your only test is "it compiles" since you have no idea what it's actually testing.