Two things LLM coding agents are still bad at

kix.dev

344 points by kixpanganiban 4 days ago


tzs - 4 days ago

Just the other day I hit something that I hadn't realized could happen. It was not code related in my case, but could happen with code or code-related things (and did to a coworker).

In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that the technology at the time may not have been up to handling the general case, so they regulated what was feasible.

A couple of hours later I checked the discussion again, and a couple of people had posted that the technology was up to the general case back then, and cheap.

I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.

I then checked the sources it cited to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.

I mentioned this at work, and a coworker mentioned that he had made a GitHub comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked, and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did. But when he checked the cites, he found that the claim was based on his own GitHub comment.

I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".

[1] https://news.ycombinator.com/item?id=45500763

rossant - 4 days ago

Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.

A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.

Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.

I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all the websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.

Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs, replacing things like domain.com/this-article-is-about-foobar-123456/ with domain.com/foobar-is-so-great-162543/...

These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!

tjansen - 4 days ago

Agreed with the points in that article, but IMHO the no. 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they reinvent them.

The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.

(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agent needs to run commands like 'npm test' in a sub-directory, it almost never gets it right the first time.)

AllegedAlec - 4 days ago

On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting.

I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.

I did this three times with increasing amounts of babysitting, but it failed at the abstract task of "refactor this" no matter what, with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A, etc., at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.

cheema33 - 4 days ago

From the article: > I contest the idea that LLMs are replacing human devs...

AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.

In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and it started outperforming these guys. We had to let two go. And third one quit on his own.

We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.

Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their head firmly stuck in the sand.

In early US history, approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now, but we have a lot more food and a larger variety available. Technology made that possible.

It is totally possible that something like that could happen to the software development industry as well. How fast it happens depends entirely on how fast the tools improve.

linsomniac - 4 days ago

>Sure, you can overengineer your prompt to try get them to ask more questions

That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.

majora2007 - 4 days ago

I think LLMs provide value, used it this morning to fix a bug in my PDF Metadata parser without having to get too deep into the PDF spec.

But most of the time, I find that the output is nowhere near what I'd get by just doing it myself. I tried Codex Code the other day to write some unit tests. I had a few set up and wanted to use it (because mocking the data is a pain).

It took about 8 attempts, I had to manually fix code, and it couldn't understand that some entities were obsolete (despite their being marked as such and the original service not using them). Overall, I was extremely disappointed.

I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).

koliber - 4 days ago

Most developers are also bad at asking questions. They tend to assume too many things from the start.

In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.

the_mitsuhiko - 4 days ago

> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.

There is not that much copy/paste happening as part of refactoring, so it leans on context recall instead. It's not entirely clear that providing an actual copy/paste command is particularly useful; at least in my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it.

> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.

It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.

nxpnsv - 4 days ago

Codex has got me a few times lately, doing what I asked but certainly not what I intended:

- Get rid of these warnings "...": captures and silences the warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument

My advice is to prefer small changes and to read everything it does before accepting anything; often this means using the agent is actually slower than just coding...

pr0j3c7t0dd - 10 hours ago

Also built my own MCP server off the back of this article. I too have noticed during refactors that Coding agents take the long way around to move things to other files, and it can cause errors in the process (and burn tokens). Here's my attempt at the solution, with an encrypted data store and undo operations. Give it a try and see if you like it: https://github.com/Pr0j3c7t0dd-Ltd/cut-copy-paste-mcp

nberkman - 3 days ago

Inspired by the copy-paste point in this post, I added agent buffer tools to clippy, a macOS utility I maintain which includes an MCP server that interacts with the system clipboard. In this case it was more appropriate to use a private buffer instead. With the tools I just added, the server reads file bytes directly - your agent never generates the copied content as tokens. Three operations:

buffer_copy: Copy specific line ranges from files to agent's private buffer

buffer_paste: Insert/append/replace those exact bytes in target files

buffer_list: See what's currently buffered

So the agent can say "copying lines 50-75 from auth.py" and the MCP server handles the actual file I/O. No token generation, no hallucination, byte-for-byte accurate. Doesn't touch your system clipboard either.

The MCP server already included tools to copy AI-generated content to your system clipboard - useful for "write a Python script and copy it" workflows.

(Clippy's main / original purpose is improving on macOS pbcopy - it copies file references instead of just file contents, so you can paste actual files into Slack/email/etc from the terminal.)

If you're on macOS and use Claude or other MCP-compatible agents: https://github.com/neilberkman/clippy

brew install neilberkman/clippy/clippy

freetonik - 4 days ago

I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.

Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.

egberts1 - a day ago

Grok/ChatGPT cannot navigate semantic pathways of large syntax files (nftables, ISC Bind9) in LL(1) fashion.

I've worked with both and finished my Vim syntax highlighters down to the keywords.

And I tried getting them to find the 'stmt', 'expr_stmt', and 'primary_stmt_expr' semantic production rules (one grammar is a Bison-generated .y file, the other is hand-rolled). Both make too many assumptions, despite being explicitly instructed to do "verification & validation" of a pathway given a sample statement.

Only Google Gemini barely cut the mustard.

Another case is making assumptions (upon grilling it about its assumptions, I learned it was looking at old websites with archaic info). Asking it to stick with the latest nftables v1.1.4 (or even v1.1.5 head) does not help, because old webpages gave obsolete nftables syntax.

Don't expect an LLM any time soon to navigate S-expressions, recreate an abstract syntax tree 4 layers or deeper, transition a state machine beyond 8 states, or interpret Bison parsers reliably.

My only regret is that none of them will keep what the LLM learned from me, the expert, so that others may benefit.

- https://github.com/egberts/vim-syntax-bind-named

- https://github.com/egberts/vim-syntax-nftables

cat-whisperer - 4 days ago

The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks—they just regenerate from learned patterns. I've noticed similar vibes when agents refactor—they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.

Lerc - 4 days ago

I think the issue with them making assumptions and failing to properly diagnose issues comes more from fine-tuning than any particular limitation in LLMs themselves. When fine-tuned on a set of problem->solution data, the model kind of carries the assumption that the problem contains enough information for the solution.

What is really needed is a tree of problems which appear identical at first glance, but where the issue and the solution are one of many possibilities that can only be revealed by finding what information is lacking, acquiring that information, and testing the hypothesis; then, if the hypothesis is shown to be correct, finally implementing the solution.

That's a much more difficult training set to construct.

The editing issue, I feel, needs something more radical. Instead of the current methods of text manipulation, I think there is scope for a kind of output position encoding that lets a model emit data in a non-sequential order. Again this presents another training-data problem: there are limited natural sources showing programming in the order a programmer types it. On the other hand, I think it should be possible to build synthetic training examples by taking existing model outputs that emit patches, search/replaces, regex mods, etc. and translating those into a format that directly encodes the final position of the desired text.

At some stage I'd like to see if it's possible to construct the model's current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information, given the order of emission and the embeddings themselves, to reconstruct a piecemeal-generated program.

simonw - 4 days ago

I feel like the copy and paste thing is overdue a solution.

I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.

I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.

imcritic - 4 days ago

About the first point mentioned in the article: could that problem be solved simply by changing the task from something like "refactor this code" to something like "refactor this code as a series of smaller atomic changes (like moving blocks of code or renaming variable references in all places), each suitable for a git commit (and provide commit message texts for those commits)"?

hbn - 4 days ago

I recently found a fun CLI application and was playing with it when I found out it didn't have proper handling for when you passed it invalid files, and spat out a cryptic error from an internal library which isn't a great UX.

I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.

I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything: asking different LLMs to see if they could fix the code, spitting out the binary's metadata to confirm the creation date was being updated when I compiled, etc. Generally when I'd paste the code to an LLM and ask why it doesn't work, it would assert the old code was indeed flawed and my change needed to be done in X manner instead. Even with just a print statement added, I couldn't get it to show up, and the LLM would explain that it's because of some complex multithreading runtime gotcha that the print statements weren't being reached.

After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.

Sure, rookie mistake, but the thing that drives me crazy with an LLM is if you give it some code and ask why it doesn't work, they seem to NEVER suggest it should actually be working, and instead will always say the old code is bad and here's the perfect fixed version of the code. And it'll even make up stuff about why the old code should indeed not work when it should, like when I was putting the print statements.

hotpotat - 4 days ago

Lol this person talks about easing into LLMs again two weeks after quitting cold turkey. The addiction is real. I laugh because I’m in the same situation, and see no way out other than to switch professions and/or take up programming as a hobby in which I purposefully subject myself to hard mode. I’m too productive with it in my profession to scale back and do things by hand — the cat is out of the bag and I’ve set a race pace at work that I can’t reasonably retract from without raising eyebrows. So I agree with the author’s referenced post that finding ways to still utilize it while maintaining a mental map of the code base and limiting its blast radius is a good middle ground, but damn it requires a lot of discipline.

pengfeituan - 4 days ago

The first issue is related to the inner behavior of LLMs. A human can skim over the detailed contents of code and just copy and paste it, but an LLM converts it into hidden states. That is a process of compression, and the output is a process of decompression, so something may be lost. That is why it is hard for an LLM to copy and paste. The agent developer should customize the edit tools to handle this.

The second issue is that LLMs do not learn much of the high-level contextual relationships between pieces of knowledge. This can be improved by introducing more such patterns into the training data, and current LLM training is putting a lot of effort into this. I don't think it will still be a problem in the next few years.

mcny - 4 days ago

I sometimes give LLMs random "easy" questions. My assessment is still that they all need the fine print "bla bla can be incorrect".

You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential, like idle childlike curiosity. For example, I wonder how many moons Jupiter has... It could be 58 or it could be 85, but either answer won't alter any of what I do today.

I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.

ziotom78 - 4 days ago

I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.

ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.

celeritascelery - 4 days ago

The “LLMs are bad at asking questions” point is interesting. There are times when I will ask the LLM to do something without giving it all the needed information. And rather than telling me that something's missing or that it can't do it the way I asked, it will try to do a halfway job using fake data or mock something out to accomplish it. What I really wish it would do is just stop and say, “Hey, I can't do it like you asked. Did you mean this?”

- 4 days ago
[deleted]
mr_mitm - 4 days ago

The other day, I needed Claude Code to write some code for me. It involved messing with the TPM of a virtual machine. For that, it was supposed to create a directory called `tpm_dir`. It constantly got it wrong and wrote `tmp_dir` instead and tried to fix its mistake over and over again, leading to lots of weird loops. It completely went off the rails, it was bizarre.

aragonite - 4 days ago

Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP especially for things like project-wide rename? Last time I looked into this the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently so I'm curious what's possible now.

juped - 4 days ago

It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you.

It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)

nikanj - 4 days ago

4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)

First it gets an error because bash doesn’t understand \

Then it gets an error because /b doesn’t work

And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files

If it was an actual coworker, we’d send it off to HR

IanCal - 4 days ago

Editing tools are easy to add; it's just that you have to pick which ones to give them, because with too many they struggle, and they use up a lot of context. Still, as costs come down, taking multiple steps to look for tools becomes cheaper too.

I'd like to see what happens with better refactoring tools; I'd make a bunch more mistakes copying and retyping or using awk myself. If they want to rename something, they should be able to use the same tooling the rest of us get.

Asking questions is a good point, but that's partly a matter of prompting, and I think the move to more parallel work makes it less relevant. One of the reasons clarifying things upfront is useful is that building things takes a lot of time and money, so the economics favours getting it right the first time. As the time comes down and the cost drops to near zero, the balance changes.

There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.

senko - 4 days ago

I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post:

> LLMs don’t copy-paste (or cut and paste) code.

The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with the LLM; it's in the layer on top.

> Good human developers always pause to ask before making big changes or when they’re unsure. [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.

Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.

jamesjyu - 4 days ago

For #2, if you're working on a big feature, start with a markdown planning file that you and the LLM work on until you are satisfied with the approach. Doesn't need to be rocket science: even if it's just a couple paragraphs it's much better than doing it one shot.

crazygringo - 4 days ago

> Sure, you can overengineer your prompt to try get them to ask more questions (Roo for example, does a decent job at this) -- but it's very likely still won't.

Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.

For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.

And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.

raw_anon_1111 - 3 days ago

With a statically typed language like C# or Java, there are dozens of refactors that IDEs could do in a guaranteed [1] correct way, better than LLMs, as far back as 2012.

The canonical products were from JetBrains. I haven't used JetBrains in years. But I would be really surprised if the combination of an LLM + a complete understanding of the codebase through static analysis (like JetBrains was doing well over a decade ago) + the ability to call a "refactor tool" didn't produce better results.

[1] before I get “well actuallied” yes I know if you use reflection all bets are off.

Vipsy - 4 days ago

Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.

Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you're fighting not logic errors but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.

mehdibl - 4 days ago

You can do copy and paste if you offer the agent a tool/MCP server that does that. It's not complicated, using either function extraction (with an AST node as the target) or line numbers.

Also, if you want it to pause and ask questions, you need to offer that through tools (Manus does that, for example). I have an MCP server that does that, and surprisingly I got a lot of questions; if you prompt for it, it will do it. But the push currently is for full automation, and that's why it's not there. We are far better off in supervised, step-by-step mode. There is already elicitation in MCP, but having a tool ask questions requires a UI that lets you send the input back.
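To make the line-number variant concrete, here is a minimal sketch of such a server, assuming the official `mcp` Python SDK's FastMCP helper (the tool names and buffer scheme are just illustrative):

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("copy-paste")
    _buffer: list[str] = []  # held server-side; the agent never re-emits these bytes as tokens

    @mcp.tool()
    def copy_lines(path: str, start: int, end: int) -> str:
        """Copy lines start..end (1-based, inclusive) from a file into the buffer."""
        global _buffer
        with open(path, "r", encoding="utf-8") as f:
            _buffer = f.readlines()[start - 1:end]
        return f"copied {len(_buffer)} lines from {path}"

    @mcp.tool()
    def paste_lines(path: str, after_line: int) -> str:
        """Insert the buffered lines into a file after the given line number (0 = top)."""
        with open(path, "r", encoding="utf-8") as f:
            lines = f.readlines()
        lines[after_line:after_line] = _buffer
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        return f"pasted {len(_buffer)} lines into {path}"

    if __name__ == "__main__":
        mcp.run()

The agent only ever sees the short status strings, so the moved code can't be mangled in transit.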

LeeRLemonIII - 4 days ago

I think #1 is not that big of a deal, though it does create problems sometimes. #2 is a big issue, though. Which is weird: since the whole thing is built as a chat model, it seems it would be a lot more efficient for the bot to ask questions about what to build instead of relying on its assumptions. Generally this lack of back-and-forth reasoning leads to a lot of badly generated code. I would hope that in the future there is some level of graded response that tries to discern the real intent of the user's request through a discussion, rather than going for the fastest code answer.

enraged_camel - 4 days ago

First point is very annoying, yes, and it's why for large refactors I have the AI write step-by-step instructions and then do it myself. It's faster, cheaper and less error-prone.

The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.

TrackerFF - 4 days ago

I very much agree on point 2.

I often wish that instead of just automatically starting to work on the code (even if you hit enter / send by accident), the models would ask for clarification. The models assume a lot and will just spit out code first.

I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.

Others have mentioned that you can fix all this by providing a guide to the model on how it should interact with you and what the answers should look like. But still, it'd be nice to have it be a bit more human-like in this respect.

8s2ngy - 4 days ago

One thing LLMs are surprisingly bad at is producing correct LaTeX diagram code. Very often I've tried to describe in detail an electric circuit, a graph (the data structure), or an automaton so I can quickly visualize something I'm studying, but they fail. They mix up labels, draw without any sense of direction or ordering, and make other errors. I find this surprising because LaTeX/TikZ have been around for decades and there are plenty of examples they could have learned from.

pammf - 4 days ago

In Claude Code, it always shows the diff between current and proposed changes and I have to explicitly allow it to actually modify the code. Doesn’t that “fix” the copy-&-paste issue?

bad_username - 4 days ago

LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is unclear or ambiguous about the task".

ravila4 - 4 days ago

Regarding copy-paste, I’ve been thinking the LLM could control a headless Neovim instance instead. It might take some specialized reinforcement learning to get a model that actually uses Vim correctly, but then it could issue precise commands for moving, replacing, or deleting text, instead of rewriting everything.

Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
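A rough sketch of what that could look like with the pynvim client (this assumes nvim is on the PATH and `pip install pynvim`; the file name and commands are purely illustrative):

    import pynvim

    # Start an embedded, headless Neovim child process and attach to it.
    nvim = pynvim.attach("child", argv=["nvim", "--embed", "--headless"])

    nvim.command("edit src/example.py")            # open a (hypothetical) file
    nvim.command(r"%s/\<old_name\>/new_name/g")    # whole-word substitution
    nvim.command("10,20move 40")                   # move lines 10-20 to after line 40
    nvim.command("write")

    print("\n".join(nvim.current.buffer[:20]))     # inspect the first 20 lines

A proper rename would go through the language server rather than a regex substitute, but the point is that the model would emit short, precise editor commands instead of regenerating whole files.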

sidgtm - 4 days ago

As a UX designer I see that they lack the ability to be opinionated about a design piece; they go with the standard mental model. I got fed up with this and wrote a simple JavaScript tool that runs a canvas on localhost so I can pass on more subjective feedback using a highlights-and-notes feature. I tried using Playwright first, but (a) it's token heavy and (b) it's still about finding what's working or breaking rather than thinking deeply about the design.

squirrel - 4 days ago

A friendly reminder that "refactor" means "make and commit a tiny change in less than a few minutes" (see links below). The OP and many comments here use "refactor" when they actually mean "rewrite".

I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.

[1] https://martinfowler.com/books/refactoring.html
[2] https://martinfowler.com/bliki/OpportunisticRefactoring.html
[3] https://refactoring.com/catalog/
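For a sense of the granularity, a single Replace Magic Literal step is about this big (illustrative Python, not taken from the catalog):

    # Before: a bare magic literal.
    def cache_expiry(days: int) -> int:
        return days * 86400

    # After one tiny refactoring (and one commit): the literal gets a name.
    SECONDS_PER_DAY = 60 * 60 * 24

    def cache_expiry(days: int) -> int:
        return days * SECONDS_PER_DAY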

gengstrand - 4 days ago

The conversation here seems to be more focused on coding from scratch. What I noticed when I looked at this last year was that LLMs were bad at enhancing already existing code (e.g. unit tests) that used annotations (a.k.a. decorators) for dependency injection. Has anyone here attempted that with the more recent models? If so, what were your findings?

clayliu - 4 days ago

“They’re still more like weird, overconfident interns.” Perfect summary. LLMs can emit code fast but they don’t really handle code like developers do — there’s no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can “copy-paste” both code and context with intent, they’ll stay great at producing snippets and terrible at collaborating.

schiho - 4 days ago

I just ran into this issue with Claude Sonnet 4.5. I asked it to copy/paste some constants, a bigger chunk of code, from one file to another; instead it "extracted" pieces and named them so. As a last resort, after going back and forth, it agreed to do a file copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent.

SafeDusk - 4 days ago

@kixpanganiban Do you think it would work if, for refactoring tasks, we took away OpenAI's `apply_patch` tool and just provided `cut` and `paste` for the first few steps?

I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.

[0]: https://github.com/aperoc/toolkami

osigurdson - 3 days ago

I don't think it's such a big deal that they aren't great yet; the bigger concern is that the rate of improvement is quite low these days. I feel it has even gone backwards a little recently - maybe that is due to economic pressures.

peterbonney - 4 days ago

"weird, overconfident interns" -> exactly the mental model I try to get people to use when thinking about LLM capabilities in ALL domains, not just coding.

A good intern is really valuable. An army of good interns is even more valuable. But interns are still interns, and you have to check their work. Carefully.

causal - 4 days ago

Similar to the copy/paste issue I've noticed LLMs are pretty bad at distilling large documents into smaller documents without leaving out a ton of detail. Like maybe you have a super redundant doc. Give it to an LLM and it won't just deduplicate it, it will water the whole thing down.

joshribakoff - 4 days ago

My human fixed a bug by introducing a new one. Classic. Meanwhile, I write the lint rules, build the analyzers, and fix 500 errors before they’ve finished reading Stack Overflow. Just don’t ask me to reason about their legacy code — I’m synthetic, not insane.

Just because this new contributor is forced to effectively "SSH" into your codebase and edit not even with vim but with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact that it is able to work within such constraints goes to show how much potential there is. It is already much better than a human at erasing text and re-typing it from memory, and while it is a valid criticism that it needs to be taught how to move files, imagine what it is capable of once it starts to use tools effectively.

Recently, I observed an LLM flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal trying to kill, or check whether the port was being used by, the process in the other terminal.

However, once I prompted the LLM to create a script for running all three processes concurrently, it was able to create that script, leverage it, and autonomously debug the tests, now way faster than I am able to. It has also saved any new human contributor from similar hours of flailing around. It is something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting an existing problem in our codebase that some of us had gotten too used to.
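The script itself was nothing exotic; a sketch of the shape of it (the commands here are placeholders, not our real services):

    import signal
    import subprocess
    import sys

    COMMANDS = [
        ["npm", "run", "api"],   # placeholder: backend
        ["npm", "run", "web"],   # placeholder: frontend
        ["npm", "run", "e2e"],   # placeholder: the e2e test runner
    ]

    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]

    def shutdown(*_):
        # Tear everything down together on Ctrl-C or SIGTERM.
        for p in procs:
            if p.poll() is None:
                p.terminate()
        sys.exit(1)

    signal.signal(signal.SIGINT, shutdown)
    signal.signal(signal.SIGTERM, shutdown)

    try:
        exit_code = procs[-1].wait()   # wait for the test runner to finish
    finally:
        for p in procs:
            if p.poll() is None:
                p.terminate()

    sys.exit(exit_code)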

So yes, LLMs make stupid mistakes, but so do humans; the thing is that LLMs can identify and fix them faster (and better, with proper steering).

NumberCruncher - 4 days ago

I don’t really understand why there’s so much hate for LLMs here, especially when it comes to using them for coding. In my experience, the people who regularly complain about these tools often seem more interested in proving how clever they are than actually solving real problems. They also tend to choose obscure programming languages where it’s nearly impossible to hire developers, or they spend hours arguing over how to save $20 a month.

Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.

But maybe I’ve just worked with the wrong teams.

EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.

gen220 - 4 days ago

How I describe this phenomenon:

If the code change is something you would reasonably prefer to use a codemod to implement (i.e. dozens to hundreds of small changes fitting a semantic pattern), Claude Code is not going to be able to make that change effectively.

However (!), CC is pretty good at writing the codemod.
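A sketch of the kind of codemod I mean, in plain Python (the names and paths are made up; for anything syntax-sensitive, a CST-based tool like libcst is safer than regex):

    import re
    from pathlib import Path

    OLD, NEW = "fetch_user", "get_user"          # hypothetical rename
    pattern = re.compile(rf"\b{OLD}\b")

    changed = 0
    for path in Path("src").rglob("*.py"):       # hypothetical source root
        text = path.read_text(encoding="utf-8")
        new_text, n = pattern.subn(NEW, text)
        if n:
            path.write_text(new_text, encoding="utf-8")
            changed += 1
            print(f"{path}: {n} replacement(s)")

    print(f"modified {changed} file(s)")

Having CC write something like this and then running it yourself keeps the actual edits deterministic.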

arbirk - 4 days ago

Those 2 things are not inherent to LLMs and could easily be changed by giving them the proper tools and instructions.

amelius - 4 days ago

I recently asked an LLM to fix an Ethernet connection while I was logged into the machine through another one. Of course, I explicitly told the LLM not to break that connection. But, as you can guess, in the process it did break the connection.

If an LLM can't do sysadmin stuff reliably, why do we think it can write quality code?

justinhj - 4 days ago

Building an MCP tool that has access to refactoring operations should be straightforward, and using it appropriately is well within the capabilities of current models. I wonder if it exists? I don't do a lot of refactoring with LLMs, so I haven't really hit this pain point.

SamDc73 - 4 days ago

For 2), I feel like codex-5 kind of attempted to address this problem; with codex it usually asks a lot of questions and gives options before digging in (without me prompting it to).

For copy-paste, you make it sound like low-hanging fruit: why don't AI agents have copy/paste tools?

- 4 days ago
[deleted]
BenGosub - 4 days ago

The issue is partly that some expect a fully fledged app or a complete solution to their problem, while others want incremental changes. To some extent this can be controlled by setting the rules at the beginning of the conversation. But only to some extent, because the limitations noted in the blog still apply.

daxfohl - 3 days ago

Funny, I just encountered a similar issue asking ChatGPT to OCR something. It started off pretty well but slowly began embellishing or summarizing on its own, eventually going completely off the rails into a King Arthur story.

giancarlostoro - 4 days ago

Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains, mind you) that the model updates the file, and sometimes I somehow wind up with a few build errors, or other times 90% of the file is now build errors. Hey, what? Did you not run some sort of what-if?

mihau - 4 days ago

> you can overengineer your prompt to try get them to ask more questions

why overengineer? it's super simple

I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"

nc - 4 days ago

Add to this list: the ability to verify a correct implementation by viewing a user interface, and taking a holistic, codebase-wide / interface-wide view of how best to implement something.

odkral - 4 days ago

If I need exact copy-pasting, I indicate that a couple of times in the prompt and it (Claude) actually does what I am asking. But yeah, overall it's very bad at refactoring big chunks.

strangescript - 4 days ago

You don't want your agents to ask questions. You are thinking too short term. It's not ideal now, but agents that have to ask frequent questions are useless when it comes to the vision of totally autonomous coding.

Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try to master an internal system I rarely use; I should instead ask someone who maintains it. AI will not have this problem, provided we create paths of observability for it. It doesn't take a lot of "effort" for it to completely digest an alien system it needs to use.

davydm - 4 days ago

Coding and...?

Plough_Jogger - 4 days ago

Let's just change the title to "LLM coding agents don't use copy & paste or ask clarifying questions" and save everyone the click.

rconti - 4 days ago

Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?

baq - 4 days ago

They're getting better at asking questions; I routinely see search calls against the codebase index. They just don't ask me questions.

cadamsdotcom - 4 days ago

You need good checks and balances. E2E tests for your happy path, TDD when you & your agent write code.

Then you - and your agent - can refactor fearlessly.

hnaccountme - 3 days ago

What 2 things? LLMs are bad at everything. It's just that there are a lot of people who are worse.

hu3 - 4 days ago

I have seen LLMs in VSCode Copilot ask to execute 'mv oldfile.py newfile.py'.

So there's hope.

But often they just delete and recreate the file, indeed.

nextworddev - 4 days ago

Developers will complain if LLM agents start asking too many questions though

DiggyJohnson - 4 days ago

Really nice site design btw

janmarsal - 4 days ago

My biggest issue with LLMs right now is that they're such spineless yes-men. Even when you ask their opinion on whether something is doable or should be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this.

Maybe some of those character.ai models are sassy enough to have stronger opinions on code?

sxp - 4 days ago

Another place where LLMs have a problem is when you ask them to do something that can't be done via duct taping a bunch of Stack Overflow posts together. E.g, I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.

LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js doesn't support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.

It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.

overgard - 4 days ago

I definitely feel the "bad at asking questions" part. A lot of times I'll walk away for a second while it's working, and then I come back and it's gone down some intricate path I really didn't want; if it had just asked a question at the right point it would have saved a lot of wasted work (plus I feel like having that "bad" work in the context window potentially leads to problems down the road). The problem is that I'm pretty sure there isn't any way for an LLM to really be "uncertain" about a thing; it's basically always certain, even when it's incredibly wrong.

To me, I think I'm fine just accepting them for what they're good at. I like them for generating small functions, or asking questions about a really weird error I'm seeing. I don't ever ask them to refactor things though; that seems like a recipe for disaster, and a tool that understands the code structure is a lot better for moving things around than an LLM is.

giantg2 - 4 days ago

The third thing: writing meaningfully robust test suites.

maddynator - 4 days ago

Can’t you put this in the agent instructions?

- 4 days ago
[deleted]
MrDunham - 4 days ago

> "LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses."

Strongly disagree that they're terrible at asking questions.

They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.

All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"

X is typically 2–5, which I find DRASTICALLY improves output.

_ink_ - 4 days ago

> LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses.

I don't agree with that. When I tell Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different from one in a GitLab thread, only at a much higher iteration speed.

- 3 days ago
[deleted]
wvenable - 4 days ago

> LLMs are terrible at asking questions.

I was dealing with a particularly tricky problem in a technology I'm not super familiar with and GPT-5 eventually asked me to put in some debug code to analyze the state of the system as it ran. Once I provided it with the feedback it wanted, and a bit of back and forth, we were able to figure out what the issue was.

ra - 4 days ago

IaC, and DSLs in general.

sjapkee - 3 days ago

1. Any
2. Any

tristanb - 3 days ago

Three - CSS.

segmondy - 4 days ago

Someone has definitely fallen behind and has massive skill issues. Instead of learning, you are wasting time writing bad takes on LLMs. I hope most of you don't fall down this hole; you will be left behind.

mohsen1 - 4 days ago

> LLMs are terrible at asking questions

Not if they're instructed to. In my experience you can adjust the prompt to make them ask questions. They ask very good questions actually!

bytesandbits - 3 days ago

Two things only? Dude, I could easily make a list with two dozen!

podgorniy - 4 days ago

> LLMs are terrible at asking questions. They just make a bunch of assumptions

_Did you ask it to ask questions?_

throw-10-8 - 4 days ago

3. Saying no

LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.

capestart - 4 days ago

[dead]

notpachet - 4 days ago

> They keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.

This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.