AI coding assistants are getting worse?

spectrum.ieee.org

425 points by voxadam a day ago


llmslave2 - a day ago

One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotal, based on their own subjective experience; but when others make claims to the contrary, suddenly there's some overwhelming burden of proof that has to be met before any sort of claim about the capabilities of AI workflows can be made. So which is it?

renegade-otter - a day ago

They are not worse; the results are just not repeatable. That problem is much worse.

Like with cab hailing, shopping, social media ads, food delivery, etc.: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.

jackfranklyn - 3 hours ago

The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?

I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.

The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.

bee_rider - a day ago

This seems like a kind of odd test.

> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

    import pandas as pd

    df = pd.read_csv('data.csv')
    df['new_column'] = df['index_value'] + 1
    # there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.

So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.

Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?

It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
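
For reference, the kind of "code that would help me debug the problem" the author says he wanted would presumably look something like this (a hypothetical sketch, not taken from the article), though note that even this is arguably more than "completed code only, without commentary":

    import pandas as pd

    df = pd.read_csv('data.csv')

    # Show what actually arrived before assuming a column exists.
    print(df.columns.tolist())

    if 'index_value' not in df.columns:
        raise KeyError(
            "Column 'index_value' not found in data.csv; "
            f"available columns: {list(df.columns)}"
        )

    df['new_column'] = df['index_value'] + 1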

dudeinhawaii - 12 minutes ago

The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based upon old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1 Codex, etc), and based upon that, even the Opus data is likely an older version.

This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".

You might as well ignore all of the articles and pronouncements and stick to your own lived experience.

The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.

The newer models DO let you know when something is impossible or unlikely to solve your problem.

Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.

I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".

anttiharju - 14 hours ago

I like AI for software development.

Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.

AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.

"Reality has a surprising amount of detail" or something along those lines.

ronbenton - a day ago

I am used to seeing technical papers from IEEE, but this is an opinion piece? I mean, there is some anecdata and one test case presented to a few different models, but nothing more.

I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way

bodge5000 - 3 hours ago

A little off topic, but this seems like one of the better places to ask where I'm not gonna get a bunch of zealotry; a question for those of you who like using AI for software development, particularly using Claude Code or OpenCode.

I'll admit I'm a bit of a sceptic of AI but want to give it another shot over the weekend, what do people recommend these days?

I'm happy spending money but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is nearly $20 a prompt. Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.

lucideng - 3 hours ago

This quote feels more relevant than ever:

> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.

Or in the context of AI:

> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.

CashWasabi - a day ago

I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and the forums are gone, and when there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?

amarka - 2 hours ago

While the author's (a banker and data scientist) experience is clearly valuable, it is unclear whether it alone is sufficient to support the broader claims made. Engineering conclusions typically benefit from data beyond individual observation.

theptip - a day ago

They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.

As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)

This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.

Kuinox - a day ago

I speculate that LLM providers are dynamically serving smaller models to cope with usage spikes and with the compute needed to train new models. I have observed that agent models become worse over time, especially just before a new model is released.

dathinab - 5 hours ago

In general "failing to run (successfully)" should per-see been seen as a bad signal.

It might still be:

- the closest to a correct solution the model can produce

- be helpful to find out what it wrong

- might be intended (e.g. in a typical very short red->green unit test dev approach you want to generate some code which doesn't run correctly _just yet_). Test for newly found bugs are supposed to fail (until the bug is fixed). Etc.

- if "making run" means removing sanity checks, doing something semantically completely different or similar it's like the OP author said on of the worst outcomes

jackfranklyn - 10 hours ago

The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.

What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.

nyrikki - 21 hours ago

> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate; I will accept the suggestion and then delete much of what it produced, especially in the critical path.

For many users, this is much faster than trying to get another approximation:

     :,/^}/-d
Same for `10dd`, etc.; it is all muscle memory. Then again, I now use a local fill-in-the-middle tiny LLM, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.

It would be a mistake to think that filtering out junior devs will result in good data; the concept is flawed in general. Accepting output may not have anything to do with the correctness of the provided content, IMHO.

sosodev - a day ago

He asked the models to fix the problem without commentary and then… praised the models that returned commentary. GPT-5 did exactly what he asked. It doesn’t matter if it’s right or not. It’s the essence of garbage in and garbage out.

anttiharju - an hour ago

I've felt this. Bit scary given how essential of a tool it has become.

I started programming before modern LLMs, so I can still hack it without them; it will just take a lot longer.

winddude - 4 hours ago

Not sure I agree with his tests, but I agree with the headline. I recently had Cursor launch into seemingly endless loops of grepping and `cd`-ing and `ls`-ing files, in multiple new convos. I think they're trying to do too much, for too many "vibe coders", and the lighter-weight versions that did less were easier to steer to meet your architecture and needs.

kristopolous - 20 hours ago

I stopped using them. Occasionally I go back to see if it's better but really I just treat them as a more interactive stackoverflow/google.

I've been stung by them too many times.

The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.

Hobadee - 5 hours ago

> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right.

So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D

StarlaAtNight - a day ago

We should be able to pin to a version of training data history like we can pin to software package versions. Release new updates w/ SemVer and let the people decide if it’s worth upgrading to

I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users

crazygringo - a day ago

This is a sweeping generalization based on a single "test" of three lines that is in no way representative.

chankstein38 - 4 hours ago

The issue is NOT particular to the GPT models. Gemini does this stuff to me all of the time as well! It band-aids around actual problems, hides debugging, etc. They're just becoming less usable.

amelius - a day ago

A dataset with only data from before 2024 will soon be worth billions.

kristianp - 21 hours ago

The failure mode of returning code that only appears to work correctly is one I've encountered before. I've had Sonnet (4 I think) generate a bunch of functions that check if parameter values are out of valid range and just return without error when they should be a failing assertion. That kind of thing does smell of training data that hasn't been checked for correctness by experienced coders.
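
To illustrate the pattern I mean (a hypothetical function sketched by hand, not Sonnet's actual output), versus what I'd actually want:

    def _apply_volume(level: int) -> None:
        ...  # stand-in for the real side effect

    # The generated pattern: silently ignore out-of-range input.
    def set_volume(level: int) -> None:
        if level < 0 or level > 100:
            return  # quietly does nothing; the bug hides downstream
        _apply_volume(level)

    # What I'd rather see: fail loudly so the bad caller gets noticed and fixed.
    def set_volume_strict(level: int) -> None:
        assert 0 <= level <= 100, f"volume level {level} is out of range 0-100"
        _apply_volume(level)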

Edit: Changed 3.5 to 4.

Edit: Looking back at edits and check-ins by AI agents, it strikes me that the check-ins should contain the prompt used and the model version. More recent Aider versions do add the model.

maxbaines - a day ago

Not seeing this in my day to day, in fact the opposite.

furyofantares - a day ago

He graded GPT 4 as winning because it didn't follow his instructions. And the instructions are unrealistic to anyone using coding assistants.

Maybe it's true that for some very bad prompts, old version did a better job by not following the prompt, and that this is reduced utility for some people.

Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.

Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.

For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.

shevy-java - 16 hours ago

I find the whole idea of AI coding assistants strange.

For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.

pablonm - 6 hours ago

I noticed Claude Code (on a $100 Max subscription) has become slower for me in the last few weeks. Just yesterday it spent hours coding a simple feature which I could have coded myself faster.

minimaxir - a day ago

The article uses pandas as a demo example for LLM failures, and for some reason even the latest LLMs are bad at data science code, which is extremely counterintuitive. Opus 4.5 can write an EDA backbone, but it's often too verbose for code that's intended for a Jupyter notebook.

The issues have been less egregious than hallucinating an "index_value" column, though, so I'm a bit suspicious. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.

reassess_blind - 10 hours ago

I only have experience with using it within my small scope, being full-stack NodeJS web development (i.e. an area with many solved problems and millions of lines of existing code for the models to reference), but my experience with the new Opus model in Claude Code has been phenomenal.

cons0le - a day ago

And the Ads aren't even baked in yet . . . that's the end goal of every company

troyvit - a day ago

There's really not much to take from this post without a repo and a lot of supporting data.

I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?

erelong - 16 hours ago

Interesting if true but I would presume it to be negligible in comparison to magnitudes of gains over "manual coding" still, right? So nothing to lose sleep over at the moment...

stared - a day ago

Is it possible to re-run it? I am curious for Gemini 3 Pro.

As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.

Johnny555 - 18 hours ago

> But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.

It feels like, lately, Google's AI search summaries are getting worse - they have a kernel of truth, but combine it with an incorrect answer.

bob1029 - a day ago

> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.

I think if you keep the human in the loop this would go much better.

I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
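
A rough sketch of the tool shape I mean, in the JSON-Schema style most tool-calling APIs accept (the names are mine, not any particular SDK's):

    # Hypothetical tool definition: both fields are required, so the agent has to
    # state how an answer unblocks progress before it may interrupt a human.
    ASK_HUMAN_TOOL = {
        "name": "AskHuman",
        "description": "Ask the human a blocking question. Use sparingly.",
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "The question itself, answerable in one message.",
                },
                "why_it_unblocks": {
                    "type": "string",
                    "description": "How the answer unblocks progress on the current task.",
                },
            },
            "required": ["question", "why_it_unblocks"],
        },
    }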

falldrown - 19 hours ago

Codex is still useful for me. But I don't want to pay $200/month for it.

> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.

AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.

j45 - an hour ago

It feels like the more standardized the organization, or the more academic the background of an author, the more their insights lag behind the tip of the arrow.

It's clear AI coding assistants are able to help software developers at least in some ways.

Having a non-software-developer perspective speak about it is one thing, but one should be mindful that there are experienced folks too, for whom the technology appears to be a jetpack.

If it didn't work for you, that just means there's more to learn.

isodev - 18 hours ago

> It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

So much this... the number of times Claude sneaks in default values, or avoids `.unwrap()`ing optional values, just to avoid a crash at all costs... it's nauseating.

mat_b - 16 hours ago

I have been noticing this myself for the last couple of months. I cannot get the agent to stop masking failures (ex: swallowing exceptions) and to fail loudly.

That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.

emsign - 12 hours ago

When coding assistants take longer, it's because they use more tokens, which is because AI companies are obligated to make more money.

metobehonest - a day ago

I can imagine Claude getting worse. I consider myself bearish on AI in general and have long been a hater of "agentic" coding, but I'm really liking using aider with the deepseek API on my huge monorepo.

Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.

jvanderbot - a day ago

Likely, and I'm being blithe here, it's because of great acceptance. If we try it on more difficult code, it'll fail in more difficult ways?

Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.

radium3d - a day ago

The problem is that everyone is using a different "level" of AI model. The experiences of those who can't afford, or choose not to pay for, the advanced reasoning models are far worse than those of people who can and do pay.

PunchTornado - 9 hours ago

ChatGPT is getting worse and is a useless model. Surprised that people are still using it. The article tests only this model.

nhd98z - a day ago

This guy is using AI in the wrong way...

renarl - a day ago

Strange that the article talks about ChatGPT 4 and 5 but not the latest 5.2 model.

empath75 - a day ago

I'm not sure it is really getting worse, but I have had AI assistants add todo()s and comments saying that something still needs to be implemented, and then tell me they did what I asked them to do.

qudat - 16 hours ago

Betteridge's law of headlines is an adage that states: "Any headline that ends in a question mark can be answered by the word no."

kazinator - a day ago

> This is of course an impossible task—the problem is the missing data, not the code.

We cannot assert that with certainty. If the datum is expected to be missing, such that a frame without it is still considered valid and must be handled rather than flagged as an error, then the code has to do exactly that. Perhaps a missing value in the dictionary can be substituted with a zero.

    df['new_column'] = df.get('index_value', 0) + 1
    # there might be no column 'index_value';
    # requirements say that zero should be substituted.

fwip - 21 hours ago

The author suspects that this effect is due to users accepting these "make it work" fixes. But wouldn't training for coding challenges also explain this? Because those are designed to be solvable, anything that lets you move forward toward the solution is better than giving up.

ta9000 - 7 hours ago

Silent but deadly… oooohh scary! Jesus, talk about sensationalizing a boring topic.

toss1 - a day ago

The key point is in the middle of the article: as AI usage expands to larger numbers of lower-skilled coders, whose weaker ability to catch errors and provide feedback generates lower-quality training data, the AIs are basically eating their own garbage, and the inevitable GIGO syndrome sets in.

>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

nodesocket - a day ago

While I still prefer to code my side project in Python and Flask myself, I recently used Cursor to write unit tests. It took a few hours of tweaking, refining, and fixing tests, but afterwards I had over 400 unit tests with 99% coverage of my app and routes. I would never have spent the time to get this amount of test coverage manually.

solumunus - a day ago

I do find there are particular days where I seem to consistently get poor results, but in general this is not my experience. I’m very pleased with the output 80% of days.

oblio - a day ago

> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.

Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).

Guess which cost category "high-quality data reviewed by experts" falls into.

chiengineer - a day ago

Where are the benchmarks for all the different tools and subscriptions/APIs?

CLI vs. IDE vs. web?

Nothing for GPT Codex 5.1 Max or 5.2 Max?

Nothing about the prompts? The quality of the prompts? I literally feed the AI into the AI: I ask a smaller model for the most advanced prompts and then use them for the big stuff, and it's smooth sailing.

I got Codex 5.1 Max, with the Codex extension in VS Code, to generate over 10k lines of code for my website demo project, and it worked first time.

This is also with just the regular $20 subscription.

GitHub Copilot Pro+ plus VS Code is my main go-to, and the project, the prompts, the agent.md quality, and the project configuration can all change the outcome of each question.

FrustratedMonky - a day ago

Perhaps because nobody is on Stack Overflow providing updates?

moshegramovsky - a day ago

This definitely matches my experience.

Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug (50 classes, 20k LOC total, so well within context limits). I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.

I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.

What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.

At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.

wainstead - a day ago

Is it just me or is this a giant red flag?

> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.

guluarte - 20 hours ago

idk but opus is pretty good

ripped_britches - a day ago

I’m sorry but what a ridiculous assertion. They are objectively better on every measure we can come up with. I used 2b input and 10m output tokens on codex last week alone. Things are improving by the month!

Zababa - a day ago

>However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember which. But it's not recent at all; one of those two Sonnets was known to change tests so that they would pass, even if they no longer properly tested anything.

>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

No proof or anything is offered here.

The article feels mostly like a mix of speculation and being behind on current practice. You can avoid a lot of the "code that looks right" problem by making the models write tests, insisting that the tests are easy to review and hard to fake, and offering examples. This worked well 6 months ago and works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
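
As a concrete, hypothetical example of "easy to review and hard to fake": pin exact outputs for fixed inputs and require that garbage input raises, so the model can't make the tests pass by quietly weakening behaviour (the function and its implementation here are invented stand-ins):

    import re
    import pytest

    # Minimal stand-in for the code under test (hypothetical).
    def parse_price(text: str) -> float:
        cleaned = text.replace("$", "").replace(",", "")
        if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
            raise ValueError(f"not a price: {text!r}")
        return float(cleaned)

    # Hard to fake: exact outputs for fixed inputs.
    def test_parses_currency_string():
        assert parse_price("$1,234.50") == 1234.50

    # Garbage must raise rather than silently default to zero.
    def test_rejects_garbage_instead_of_defaulting_to_zero():
        with pytest.raises(ValueError):
            parse_price("not a price")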

Kapura - a day ago

so you're saying all those bros on linkedin telling me that "this is the worst it's ever going to be" were full of shit? i am shocked.

dcre - a day ago

Counterpoint: no, they're not. The test in the article is very silly.

mikert89 - a day ago

[flagged]

qsort - a day ago

I mean, it's 2026, you can just say things I guess.

tacoooooooo - a day ago

This is a wildly out of touch thing to say