Scaling long-running autonomous coding

126 points by srameshc 14 hours ago

Related: Scaling long-running autonomous coding - https://news.ycombinator.com/item?id=46624541 - Jan 2026 (187 comments)

andrewchambers - 3 hours ago

Test suites just increased in value by a lot and code decreased in value.

simonw - 11 hours ago

One of the big open questions for me right now concerns how library dependencies are used.

Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

Taffy a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

I don't think it detracts much if at all from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

sealeck - 11 hours ago

I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.
I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc? I would imagine truly scaling long-running autonomous coding would have an emphasis on this.
Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.
I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.
- simonw - 11 hours ago
  
  Yeah, I'm hoping they publish a lot more about this project! It deserves way more then the few sentences they've shared about it so far.
mwcampbell - 4 hours ago

I was gratified to learn that the project used my own AccessKit for accessibility (or at least attempted to; I haven't verified if it actually works at all; I doubt it)... then horrified to learn that it used a version that's over 2 years old.
embedding-shape - 4 hours ago

For me, the biggest open question is currently "How autonomous is 'autonomous'?" because the commits make it clear there were multiple actors involved in contributing to the repository, and the timing/merges make it seem like a human might have been involved with choosing what to merge (but hard to know 100%) and also making smaller commits of their own. I'm really curious to understand what exactly "It ran uninterrupted for one week" means, which was one of Cursor's claims.
I've reached out to the engineer who seemed to have run the experiment, who hopefully can shed some more light on it and (hopefully) my update to https://news.ycombinator.com/item?id=46646777 will include the replies and more investigations.
shubhamjain - 9 hours ago

Why attempt something that has abundant number of libraries to pick and choose? To me, however impressive it is, 'browser build from scratch' simply overstates it. Why not attempt something like a 3D game where it's hard to find open source code to use?
- Banditoz - 9 hours ago
  
  Is something like a 3D game engine even hard to find source code for? There's gotta lots of examples/implementations scattered around.
- XenophileJKO - 6 hours ago
  
  There are a lot of examples out there. Funny that you mention this. I literally just last night started a "play" project having Claude Code build a 3D web assembly/webgl game using no frameworka. It did it, but it isn't fun yet.
  I think the current models are at a capability level that could create a decent 3D game. The challenges are creating graphic assets and debugging/Qa. The debugging problem is you need to figure out a good harness to let the model understand when something is working, or how it is failing.
- cheevly - 7 hours ago
  
  Assets are very hard to produce and largely unsolved by AI at the moment.
janoelze - 11 hours ago

Any views on the nature of "maintainability" shifting now? If a fleet of agents demonstrated the ability to bootstrap a project like that, would that be enough indication to you that orchestration would be able to carry the code base forward? I've seen fully llm'd codebases hit a certain critical weight where agents struggled to maintain coherent feature development, keeping patterns aligned, as well as spiralling into quick fixes.
- simonw - 11 hours ago
  
  Almost no idea at all. Coding agents are messing with all 25+ years of my existing intuitions about what features cost to build and maintain.
  Features that I'd normally never have considered building because they weren't worth the added time and complexity are now just a few well-structured prompts away.
  But how much will it cost to maintain those features in the future? So far the answer appears to be a whole lot less than I would previously budget for, but I don't have any code more than a few months old that was built ~100% by coding agents, so it's way too early to judge how maintenance is going to work over a longer time period.
  - visarga - 2 hours ago
    
    > But how much will it cost to maintain those features in the future?
    Very little if they have good specs and tests.
- brianjeong - 10 hours ago
  
  I think there's a somewhat valid perspective that the Nth+1 model can simply clean up the previous models mess.
  Essentially a bet that the rate of model improvement is going to be faster than the rate of decay from bad coding.
  Now this hurts me personally to see as someone who actually enjoys having quality code but I don't see why it doesn't have a decent chance of holding
teaearlgraycold - 9 hours ago

It looks like JS execution is outsourced to QuickJS?
- simonw - an hour ago
  
  No, it has its own JS implementation: https://github.com/wilsonzlin/fastrender/tree/main/vendor/ec...
  See also: https://news.ycombinator.com/item?id=46650998

ramon156 - 5 hours ago

I would also love to see the statistics regarding token cost, electricity cost, environmental damage etc.

Not saying that this only happens with LLMs, in fact it should be compared against e.g. a dev team of 4-5

cocoto - 2 hours ago

The complex thing is that you would need to take into account the energy used to feed the programmers, the energy used for their education or simply them growing up to the age they are working. For the LLMs it would have to take into account energy used for the GPU, the machine building the GPUs, datacenters, engineers maintaining it, their education etc etc. It’s so complex to really estimate these things from bottom up if you are not only looking locally, it feels impossible…
- marisen - 2 hours ago
  
  It is well known that a programmer that stops programming stops requiring food
xnx - 4 hours ago

Generally, if something costs less it has less environmental impact.
- oefrha - 2 hours ago
  
  If you exterminate the replaced human coders, sure.
- bauerd - 3 hours ago
  
  Generally wrong. It may cost less because its externalities aren't priced in.

vedmakk - 8 hours ago

After reading that post it feels so basic to sit here, watching my single humble claude code agent go along with its work... confident, but brittle and so easily distracted.

Chipshuffle - 4 hours ago

The more I think about LLMs the stranger it feels trying to grasp what they are. To me, when I'm working with them, they don't feel intelligence but rather an attempt at mimicking it. You can never trust, that the AI actually did something smart or dump. The judge always has to be you.

It's ability to pattern match it's way through a code base is impressive until it's not and you always have to pull it back to reality when it goes astray.

It's ability to plan ahead is so limited and it's way of "remembering" is so basic. Every day it's a bit like 50 first dates.

Nonetheless seeing what can be achieved with this pseudo intelligence tool makes me feel a little in awe. It's the contrast between not being intelligence and achieving clearly useful outcomes if stirred correctly and the feeling that we just started to understand how to interact with this alien.

visarga - 2 hours ago

> The judge always has to be you.
But you can automate much of that work by having good tests. Why vibe-test AI code when you can code-test it? Spend your extra time thinking how to make testing even better.
NiloCK - an hour ago

If you find yourself 50-first-dating your LLMs, it may be worth it to invest some energy into building up some better context indexing of both the codebase itself and of your roadmap.
Gazoche - 4 hours ago

> they don't feel intelligence but rather an attempt at mimicking it
Because that's exactly what they are. An LLM is just a big optimization function with the objective "return the most probabilistically plausible sequence of words in a given context".
There is no higher thinking. They were literally built as a mimicry of intelligence.
- encyclopedism - an hour ago
  
  I don't understand why this point is NOT getting across to so many on HN.
  LLM's do not think, understand, reason, reflect, comprehend and they never shall. I have commented elsewhere but this bears repeating
  If you had enough paper and ink and the patience to go through it, you could take all the training data and manually step through and train the same model. Then once you have trained the model you could use even more pen and paper to step through the correct prompts to arrive at the answer. All of this would be a completely mechanical process. This really does bear thinking about. It's amazing the results that LLM's are able to acheive. But let's not kid ourselves and start throwing about terms like AGI or emergence just yet. It makes a mechanical process seem magical (as do computers in general).
  I should add it also makes sense as to why it would, just look at the volume of human knowledge (the training data). It's the training data with the mass quite literally of mankind's knowledge, genius, logic, inferences, language and intellect that does the heavy lifting.
  - myrmidon - 38 minutes ago
    
    > If you had enough paper and ink and the patience to go through it, you could take all the training data and manually step through and train the same model.
    But you could make the exact same argument for a human mind? (could just simulate all those neural interactions with pen and paper)
    The only way to get out of it is to basically admit magic (or some other metaphysical construct with a different name).
    
    encyclopedism - 24 minutes ago
    
    > But you could make the exact same argument for a human mind?
    It would be an argument and you are free to make it. What the human mind is, is an open scientific and philosophical problem many are working on.
    The point is that LLM's are NOT the same because we DO know that LLM's are. Please see the myriad of tutorials 'write an LLM from scratch'
    
    cess11 - 14 minutes ago
    
    I'm not so sure "a human mind" is the kind of newtonian clockwork thingiemabob you "could just simulate" within the same degree of complexity as the thing you're simulating, at least not without some sacrifices.
  - Zababa - 34 minutes ago
    
    Can you give examples of how that "LLM's do not think, understand, reason, reflect, comprehend and they never shall" or that "completely mechanical process" helps you understand better when LLM works and when they don't?
    Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well. The "they don't reason" people tend to, in my opinion/experience, underestimate them by a lot, often claiming that they will never be able to do <thing that LLMs have been able to do for a year>.
    To be fair, the "they reason/are conscious" people tend to, in my opinion/experience, overestimate how much a LLM being able to "act" a certain way in a certain situation says about the LLM/LLMs as a whole ("act" is not a perfect word here, another way of looking at it is that they visit only the coast of a country and conclude that the whole country must be sailors and have a sailing culture).
    
    encyclopedism - 21 minutes ago
    
    We know what an LLM is in fact you can build one from scratch if you like. e.g. https://www.manning.com/books/build-a-large-language-model-f...
    It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?
    > Many people are throwing around that they don't "think", that they aren't "conscious", that they don't "reason", but I don't see those people sharing interesting heuristics to use LLMs well.
    My digital thermometer doesn't think. Imbibing LLM's with thought will start leading to some absurd conclusions.
    A cursory read of basic philosophy would help elucidate why casually saying LLM's think, reason etc is not good enough.
    What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition. Some things are so hard to define (and people have tried for centuries) e.g. what is consciousness? That they are a problem set within themselves please see Hard problem of consciousness.
    https://en.wikipedia.org/wiki/Hard_problem_of_consciousness
    
    Zababa - 9 minutes ago
    
    >My digital thermometer doesn't think. Imbibing LLM's with thought will start leading to some absurd conclusions.
    What kind of absurd conclusions? And what kind of non absurd conclusions can you make when you follow your let's call it "mechanistic" view?
    >It's an algorithm and a completely mechanical process which you can quite literally copy time and time again. Unless of course you think 'physical' computers have magical powers that a pen and paper Turing machine doesn't?
    I don't, just like I don't think a human or animal brain has any magical power that imbues it with "intelligence" and "reasoning".
    >A cursory read of basic philosophy would help elucidate why casually saying LLM's think, reason etc is not good enough.
    I'm not saying they do or they don't, I'm saying that from what I've seen having a strong opinion about whether they think or they don't seem to lead people to weird places.
    >What is thinking? What is intelligence? What is consciousness? These questions are difficult to answer. There is NO clear definition.
    You see pretty certain that whatever those three things are a LLM isn't doing it, a paper and pencil aren't doing it even when manipulated by a human, the system of a human manipulating a paper and pencil isn't doing it.
- azan_ - 4 hours ago
  
  > Because that's exactly what they are. An LLM is just a big optimization function with the objective "return the most probabilistically plausible sequence of words in a given context". > There is no higher thinking. They were literally built as a mimicry of intelligence.
  Maybe real intelligence also is a big optimization function? Brain isn't magical, there are rules that govern our intelligence and I wouldn't be terribly surprised if our intelligence in fact turned out to be kind of returning the most plausible thoughs. Might as well be something else of course - my point is that "it's not intelligence, it's just predicting next token" doesn't make sense to me - it could be both!
- clbrmbr - an hour ago
  
  Life is more fun as a scruffie.
  [0] http://www.catb.org/~esr/jargon/html/N/neats-vs--scruffies.h...
cess11 - 24 minutes ago

It's a compressed database with diffuse indices. It's using probability matching rather than pattern matching. Write operations are called 'training' and 'fine-tuning'.

light_hue_1 - 6 hours ago

Browsers are pretty much the best case scenario for autonomous coding agents. A totally unique situation that mostly doesn't occur in the real world.

At a minimum:

1. You've got an incredibly clearly defined problem at the high level.

2. Extremely thorough tests for every part that build up in complexity.

3. Libraries, APIs, and tooling that are all compatible with one another because all of these technologies are built to work together already.

4. It's inherently a soft problem, you can make partial progress on it.

5. There's a reference implementation you can compare against.

6. You've got extremely detailed documentation and design docs.

7. It's a problem that inherently decomposes into separate components in a clear way.

8. The models are already trained not just on examples for every module, but on example browsers as a whole.

9. The done condition for this isn't a working browser, it's displaying something.

This isn't a realistic setup for anything that 99.99% of people work on. It's not even a realistic setup for what actual developers of browsers do who must implement new or fuzzy things that aren't in the specs.

Note 9. That's critical. Getting to the point where you can show simple pages is one thing. Getting to the point where you have a working production browser engine, that's not just 80% more work, it's probably considerably more than 100x more work.

retinaros - 6 hours ago

Agentic coding is a card castle built on another card castle (test time compute) built on another card castle (token prediction) the mere fact that using lot of iterations and compute works maybe tells us that nothing is really elegant about the things we craft.

halfcat - 10 hours ago

So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

If anything solved becomes cheap to re-instantiate, does R&D reach a point where it can’t ever pay off? Why would one pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just like having knowledge about a stock today is more valuable than the same knowledge learned tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

tornikeo - 9 hours ago

> The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?
It's nearly frictionless, not frictionless because someone has to use the output (or at least verify it works). Also, why do you think the "shape" of the knowledge is spherical? I don't assume to know the shape but whatever it is, it has to be a fractal-like, branching, repeating pattern.
ramraj07 - 9 hours ago

The fundamental idea that modern LLMs can only ever remix, even if its technically true (doubt), in my opinion only says to me that all knowledge is only ever a remix, perhaps even mathematically so. Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
- omnicognate - 11 minutes ago
  
  Why doubt? Transformers are a form of kernel smoothing [1]. It's literally interpolation [2]. That doesn't mean it can only echo the exact items in its training data - generating new data items is the entire point of interpolation - but it does mean it's "remixing" (literally forming a weighted sum of) those items and we would expect it to lose fidelity when moving outside the area covered by those points - i.e. where it attempts to extrapolate. And indeed we do see that, and for some reason we call it "hallucinating".
  The subsequent argument that "LLMs only remix" => "all knowledge is a remix" seems absurd, and I'm surprised to have seen it now more than once here. Humanity didn't get from discovering fire to launching the JWST solely by remixing existing knowledge.
  [1] http://bactra.org/notebooks/nn-attention-and-transformers.ht...
  [2] Well, smoothing/estimation but the difference doesn't matter here.
- mrbungie - 9 hours ago
  
  > Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
  You know this is a false dichotomy right? You can treat and consider LLMs statistical parrots and at the same time take advantage of them.
- pseudosavant - 9 hours ago
  
  But all of my great ideas are purely from my own original inspiration, and not learning or pattern matching. Nothing derivative or remixed. /sarcasm
- heavyset_go - 9 hours ago
  
  Yeah, Yann LeCun is just some luddite lol
  - NitpickLawyer - 8 hours ago
    
    I don't think he's a luddite at all. He's brilliant in what he does, but he can also be wrong in his predictions (as are all humans from time to time). He did have 3 main predictions in ~23-24 that turned out to be wrong in hindsight. Debatable why they were wrong, but yeah.
    In a stage interview (a bit after the "sparks of agi in gpt4" paper came out) he made 3 statemets:
    a) llms can't do math. They can trick us with poems and subjective prose, but at objective math they fail.
    b) they can't plan
    c) by the nature of their autoregressive architecture, errors compound. so a wrong token will make their output irreversibly wrong, and spiral out of control.
    I think we can safely say that all of these turned out to be wrong. It's very possible that he meant something more abstract, and technical at its core, but in the real life all of these things were overcome. So, not a luddite, but also not a seer.
    
    gjadi - 8 hours ago
    
    Have this shortcomings of llms been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better or because the agentic loops are better designed?
    
    encyclopedism - an hour ago
    
    Fundamentally these shortcomings cannot be addressed.
    They can and are improved (papered over) over time. For example by improving and tweaking the training data. Adding in new data sets is the usual fix. A prime example 'count the number of R's in Strawberry' caused quite a debacle at a time where LLM's were meant to be intelligent. Because they aren't they can trip up over simple problems like this. Continue to use an army of people to train them and these edge cases may become smaller over time. Fundamentally the LLM tech hasn't changed.
    I am not saying that LLM's aren't amazing, they absolutely are. But WHAT they are is an understood thing so lets not confuse ourselves.
    
    NitpickLawyer - 8 hours ago
    
    100% by better models. Since his talk models have gained more context windows (up to usable 1M), and RL (reinforcement learning) has been amazing at both picking out good traces, and taught the LLMs how to backtrack and overcome earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better and RLVR (RL with verifiable rewards) has made them very good at both math and coding.
    The harnesses have helped in training the models themselves (i.e. every good trace was "baked in" the model) and have improved in enabling test time compute. But at the end of the day this is all put back into the models, and they become better.
    The simplest proof of this is on benchmarks like terminalbench and swe-bench with simple agents. The current top models are much better than their previous versions, when put in a loop with just a "bash tool". There's a ~100LoC harness called mini-swe-agent [1] that does just that.
    So current models + minimal loop >> previous gen models with human written harnesses + lots of glue.
    > Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!
    [1] - https://github.com/SWE-agent/mini-swe-agent
  - CuriouslyC - an hour ago
    
    You don't understand Yann's argument. It's similar to Richard Sutton's, in that these things aren't thinking, they're emulating thinking, and the weak implicit world models that get built in the weights are insufficient for true "AGI."
    This is orthogonal to the issue of whether all ideas are essentially "remixes." For the record I agree that they are.
ukuina - 8 hours ago

Single-idea implementations ("one-trick ponies") will die off, and composites that are harder to disassemble will be worth more.

tinyhouse - 11 hours ago

Well, software is measured over time. The devil is always in the details.

aronowb14 - 9 hours ago

Yeah curious what would happen if they asked for an additional big feature on top of the original spec

Agent_Builder - 8 hours ago

[flagged]

faeyanpiraat - 4 hours ago

please stop spamming about your tool

anilgulecha - 12 hours ago

That's a wild idea-a browser from scratch! And ladybird has been moving at snails pace for a long time..

I think a good abstractions design and good test suite will make it break success of future coding projects.

vivzkestrel - 11 hours ago

I am waiting for that guy or a team that uses LLMs to write the most optimal version of Windows in existence, something that even surpasses what Microsoft has done over the years and honestly looking at the current state of Windows 11, it really feels like it shouldn't even be that hard to make something more user friendly

kimixa - 10 hours ago

Considering Microsoft's significant (and vocal) investment in LLMs, I fear the current state of Windows 11 is related to a team trying to do exactly that.
- g947o - 9 hours ago
  
  I noticed that dialog that has worked correctly for the past 10+ years is using a new and apparently broken layout. Elements don't even align properly.
  It's hard to imagine a human developer misses something so obvious.
bandrami - 4 hours ago

The problem there is the same problem with AI-generated commercial feature films: the copyrightability of the output of LLMs remains an unexplored morass of legal questions, and no big name is going to put their name on something until that's adjudicated.