AI is not a coworker, it's an exoskeleton
kasava.dev
343 points by benbeingbin 18 hours ago
You can't run at 10x in an exoskeleton, and you can't move your hand any faster to write using one; the analogy doesn't fit.
There's an undertone of self-soothing, "AI will leverage me, not replace me", which I don't agree with, especially in the long run, at least in software. In the end it will be the users sculpting formal systems like Play-Doh.
In the medium run, "AI is not a co-worker" is exactly right. The idea of a co-worker will go away. Human collaboration on software is fundamentally inefficient. We pay huge communication/synchronization costs to eke out mild speed-ups on projects by adding teams of people. Software is going to become an individual sport, not a team sport, quickly. The benefits we get from checking in with other humans, like error correction and delegation, can all be done better by AI. I would rather have a single human architect (for now) with good taste and an army of agents than a team of humans.
> The benefits we get from checking in with other humans, like error correction and delegation, can all be done better by AI.
Not this generation of AI though. It's a text predictor, not a logic engine - it can't find actual flaws in your code, it's just really good at saying things which sound plausible.
> it can't find actual flaws in your code
I can tell from this statement that you don't have experience with claude-code.
It might just be a "text predictor", but in the real world it can take a messy log file and, from that, navigate to and fix issues in the source.
It can appear to reason about root causes and issues with sequencing and logic.
That might not be what is actually happening at a technical level, but it is indistinguishable from actual reasoning, and produces real world fixes.
> I can tell from this statement that you don't have experience with claude-code.
I happen to use it on a daily basis. 4.6-opus-high to be specific.
The other day it surmised from (I assume) the contents of my clipboard that I wanted to do A, while I really wanted to do B; it's just that A was the more typical use case. Or actually: hardly anyone ever does B, as it's a weird thing to do, but I needed to do it anyway.
> but it is indistinguishable from actual reasoning
I can distinguish it pretty well when it makes mistakes that someone who had actually read and understood the code wouldn't make.
Mind you: it's great at presenting someone else's knowledge, and it was trained on a vast library of it, but it clearly doesn't think for itself.
What do you mean the content of your clipboard?
Either I accidentally pasted it somewhere and removed it, forgetting I'd done that, or it's reading the clipboard.
The suggestion it gave me started with the contents of the clipboard and expanded to scenario A.
Sorry to sound rude, but you polluted the context, pointing to the fact that you would like A, and then found it annoying that it tried to do A?
What you're describing is not finding flaws in code. It's summarizing, which current models are known to be relatively good at.
It is true that models can happen to produce a sound reasoning process. This is probabilistic, however (more so than humans, anyway).
There is no known sampling method that can guarantee a deterministic result without significantly quashing the output space (excluding most correct solutions); there's a rough sketch of what I mean at the end of this comment.
I believe we'll see a different landscape of benefits and drawbacks as diffusion language models begin to emerge, and as even more architectures are invented and practiced.
I have a tentative belief that diffusion language models may be easier to make deterministic without quashing nearly as much expressivity.
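To make the "quashing" point concrete, here's a minimal toy sketch (the vocabulary and logits below are invented, not taken from any real model): greedy decoding is perfectly deterministic but discards every continuation except the single most probable one, while temperature sampling keeps the space open at the cost of determinism.

    import numpy as np

    # Toy next-token distribution over a tiny vocabulary (numbers are made up).
    vocab = ["return", "raise", "yield", "pass"]
    logits = np.array([2.1, 1.9, 0.3, -1.0])

    def greedy(logits):
        # Deterministic: always the argmax, so every other continuation is discarded.
        return vocab[int(np.argmax(logits))]

    def sample(logits, temperature=1.0, rng=np.random.default_rng()):
        # Stochastic: the whole output space stays reachable, but determinism is gone.
        p = np.exp(logits / temperature)
        p /= p.sum()
        return rng.choice(vocab, p=p)

    print(greedy(logits))   # "return", every single run
    print(sample(logits))   # any of the four tokens, varying run to run

Everything in between (top-k, top-p, low temperature) just moves along that same trade-off.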
This all sounds like the stochastic parrot fallacy. Total determinism is not the goal, and it is not a prerequisite for general intelligence. As you allude to above, humans are also not fully deterministic. I don't see what hard theoretical barriers you've presented toward AGI or future ASI.
I haven't heard the stochastic parrot fallacy (though I have heard the phrase before). I also don't believe there are hard theoretical barriers. All I believe is that what we have right now is not enough yet. (I also believe autoregressive models may not be capable of AGI.)
> more so than humans
Citation needed.
Much of the space of artificial intelligence is based on a goal of a general reasoning machine comparable to the reasoning of a human. There are many subfields that are less concerned with this, but in practice, artificial intelligence is perceived to have that goal.
I am sure the output of current frontier models is convincing enough to outperform the appearance of humans to some. There is still an ongoing outcry, from users who had built a romantic relationship with their access to it, over GPT-4o being discontinued. However, I am not convinced that language models have actually reached the reliability of human reasoning.
Even a dumb person can be consistent in their beliefs, and apply them consistently. Language models strictly cannot. You can prompt them to maintain consistency according to some instructions, but you never quite have any guarantee. You have far less of a guarantee than you could have instead with a human with those beliefs, or even a human with those instructions.
I don't have citations for the objective reliability of human reasoning. There are statistics about unreliability of human reasoning, and also statistics about unreliability of language models that far exceed them. But those are both subjective in many cases, and success or failure rates are actually no indication of reliability whatsoever anyway.
On top of that, every human is different, so it's difficult to make general statements. I only know from my work circles and friend circles that most of the people I keep around outperform language models in consistency and reliability. Of course that doesn't mean every human or even most humans meet that bar, but it does mean human-level reasoning includes them, which raises the bar that models would have to meet. (I can't quantify this, though.)
There is a saying about fully autonomous self-driving vehicles that goes a little something like: they don't just have to outperform the worst drivers; they have to outperform the best drivers for it to be worth it. Many fully autonomous crashes happen because the autonomous system screwed up in a way that a human would not. An autonomous system typically lacks the creativity and ingenuity of a human driver.
Though they can already be more reliable in some situations, we're still far from a world where autonomous driving can take liability for collisions, and that's because they're not nearly reliable or intelligent enough to entirely displace the need for human attention and intervention. I believe Waymo is the closest we've gotten, and even they have remote safety operators.
That's not a citation.
It's roughly why I think this way, along with a statement that I don't have objective citations. So sure, it's not a citation. I even said as much, right in the middle there.
Nothing you've said about reasoning here is exclusive to LLMs. Human reasoning is also never guaranteed to be deterministic, excluding most correct solutions. As OP says, they may not be reasoning under the hood but if the effect is the same as a tool, does it matter?
I'm not sure if I'm up to date on the latest diffusion work, but I'm genuinely curious how you see them potentially making LLMs more deterministic? These models usually work by sampling too, and it seems like the transformer architecture is better suited to longer-context problems than diffusion.
The way I imagine greedy sampling for autoregressive language models is guaranteeing a deterministic result at each position individually. The way I'd imagine it for diffusion language models is guaranteeing a deterministic result for the entire response as a whole. I see diffusion models potentially being more promising because the unit of determinism would be larger, preserving expressivity within that unit. Additionally, diffusion language models iterate multiple times over their full response, whereas autoregressive language models get one shot at each token, and before there's even any picture of the full response. We'll have to see what impact this has in practice; I'm only cautiously optimistic.
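As a purely illustrative, runnable toy (a hash function stands in for the model here; this is not any real LLM or API, it only shows the control flow): the autoregressive loop freezes each position the moment it's chosen, while the diffusion-style loop keeps revisiting the whole draft, so "deterministic" can be defined over the entire response at once.

    import hashlib

    VOCAB = ["the", "cat", "sat", "on", "mat", "."]

    def score(context, position, token):
        # Stand-in "model": a deterministic hash, purely for illustration.
        return hashlib.sha256(f"{context}|{position}|{token}".encode()).digest()[0]

    def autoregressive_greedy(length):
        out = []
        for pos in range(length):
            # Deterministic per position; each token is frozen as soon as it's picked.
            out.append(max(VOCAB, key=lambda t: score(" ".join(out), pos, t)))
        return out

    def diffusion_style_greedy(length, steps=3):
        draft = ["?"] * length
        for _ in range(steps):
            # Deterministic over the whole draft; every position can still be revised.
            draft = [max(VOCAB, key=lambda t: score(" ".join(draft), pos, t))
                     for pos in range(length)]
        return draft

    print(autoregressive_greedy(5))
    print(diffusion_style_greedy(5))

Both outputs are fully deterministic, but only the second loop ever gets to reconsider an earlier position in light of the rest of the response.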
I guess it depends on the definition of deterministic, but I think you're right and there's strong reason to expect this will happen as they develop. I think the next 5 - 10 years will be interesting!
Absolutely nuts, I feel like I'm living in a parallel universe. I could list several anecdotes here where Claude has solved issues for me in an autonomous way that (for someone with 17 years of software development, from embedded devices to enterprise software) would have taken me hours if not days.
To the naysayers... good luck. No group of people's opinions matters at all. The market will decide.
If you only realized how ridiculous your statement is, you never would have stated it.
It's also literally factually incorrect. Pretty much the entire field of mechanistic interpretability would obviously point out that models have an internal definition of what a bug is.
Here's the most approachable paper that shows a real model (Claude 3 Sonnet) clearly having an internal representation of bugs in code: https://transformer-circuits.pub/2024/scaling-monosemanticit...
Read the entire section around this quote:
> Thus, we concluded that 1M/1013764 represents a broad variety of errors in code.
(Also the section after "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions")
This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter might say next".
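For anyone who hasn't read it, the mechanism is roughly: train a sparse autoencoder (SAE) on the model's internal activations, then check which learned features fire on a given input. Here's a minimal sketch of just the "does feature i fire" step, with random stand-in weights (the real encoder weights and feature indices like 1M/1013764 come from the trained dictionary described in the paper, not from anything below):

    import numpy as np

    # Toy sizes and random stand-in SAE parameters; in the paper these are learned
    # from the model's activations, and the feature count is around 1M.
    d_model, n_features = 64, 1024
    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(n_features, d_model))
    b_enc = rng.normal(size=n_features)

    def feature_activations(residual_vec):
        # SAE encoder: ReLU(W_enc @ x + b_enc). A feature "fires" when its entry > 0.
        return np.maximum(0.0, W_enc @ residual_vec + b_enc)

    x = rng.normal(size=d_model)         # residual-stream activation at one token
    acts = feature_activations(x)
    code_error_feature = 123             # placeholder index, not the paper's 1013764
    print(acts[code_error_feature] > 0)  # does this feature fire on this token?

The paper's claim is that one specific learned feature in that trained dictionary activates consistently on buggy code across very different inputs.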
Was this "paper" eventually peer reviewed?
PS: I know it is interesting and I don't doubt Anthropic, but for me it is so fascinating that they get such a pass in science.
Modern ML is old school mad science.
The lifeblood of the field is proof-of-concept pre-prints built on top of other proof-of-concept pre-prints.
Some people are still stuck in the "stochastic parrot" phase and see everything regarding LLMs through that lens.
Current LLMs do not think. Just because the labels around these models anthropomorphize the repetitive actions a model is looping through does not mean they are truly thinking or reasoning.
On the flip side the idea of this being true has been a very successful indirect marketing campaign.
And not this or any existing generation of people. We're bad at determining want vs. need, at being specific, at genericizing our goals into a conceptual framework of existing patterns, and at documenting and explaining things in a way that gets to a solid goal.
The idea that the entire top-down processes of a business can be typed into an AI model and out comes a result is, again, a specific type of tech-person ideology that sees the idea of humanity as an unfortunate annoyance in the process of delivering a business. The rest of the world sees it the other way round.
While I agree, if you think that AI is just a text predictor, you are missing an important point.
Intelligence can be born of simple targets, like next-token prediction. Predicting the next token with the accuracy it takes to answer some of the questions these models can answer requires complex "mental" models.
Dismissing it just because its algorithm is next-token prediction instead of "strengthen whatever circuit lights up" is missing the forest for the trees.
You should actually use these tools before putting your completely un-based opinion on display. Pretty ridiculous take lol
I use these tools and that's my experience.
I think it all depends on the use case and a luck factor.
Sometimes I instruct Copilot/Claude to do a piece of development (stretching its capabilities), and it does amazingly well. Mind you that this is front-end development, so probably one of the more ideal use cases. Bug fixing also goes well a lot of the time.
But other times it really struggles, and in the end I have to write it by hand. This is for more complex or less popular things (in my case, React Three Fiber with skeleton animations).
So I think experiences can vastly differ, and in my environment very dependent on the case.
One thing is clear: this AI revolution (deep learning) won't replace developers any time soon. And when the next revolution will take place is anyone's guess. I learned neural networks at university around 2000, and it was old technology then.
I view LLMs as "applied information", but not real reasoning.
You should actually understand how these tools work internally before putting your completely un-based opinion on display. Pretty ridiculous take lol
Ok, I'll bite. Let's assume a modern cutting-edge model, even with fairly standard GQA attention, and with something obviously richer than a single monosemantic feature per neuron.
Based on any reasonable mechanistic interpretability understanding of this model, what's preventing a circuit/feature with polysemanticity from representing a specific error in your code?
---
Do you actually understand ML? Or are you just parroting things you don't quite understand?
Ok, let's chew on that. "Reasonable mechanistic interpretability understanding" and "semantic" are carrying a lot of weight. I think nobody understands what's happening in these models - irrespective of narrative-building from the pieces. On the macro level, everyone can see simple logical flaws.
> I think nobody understands what's happening in these models
Quick question, do you know what "Mechanistic Interpretability Researcher" means? Because that would be a fairly bold statement if you were aware of that. Try skimming through this first: https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-ex...
> On the macro level, everyone can see simple logical flaws.
Your argument applies to humans as well. Or are you saying humans can't possibly understand bugs in code because they make simple logical flaws as well? Does that mean the existence of the Monty Hall Problem shows that humans cannot actually do math or logical reasoning?
Polysemantic features in modern transformer architectures (e.g., with grouped-query attention) are not discretely addressable, semantically stable units but superposed, context-dependent activation patterns distributed across layers and attention heads, so there is no principled mechanism by which a single circuit or feature can reliably and specifically encode “a particular code error” in a way that is isolable, causally attributable, and consistently retrievable across inputs.
---
Way to go in showing you want a discussion, good job.
Nice LLM generated text.
Now go read https://transformer-circuits.pub/2024/scaling-monosemanticit... or https://arxiv.org/abs/2506.19382 to see why that text is outdated. Or read any paper in the entire field of mechanistic interpretability (from the past year or two), really.
Hint: the first paper is titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" and you can ctrl-f for "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions"
Who said I want a discussion? I want ignorant people to STOP talking, instead of talking as if they knew everything.
You’re committing the classic fallacy of confusing mechanics with capabilities. Brains are just electrons and chemicals moving through neural circuits. You can’t infer constraints on high-level abilities from that.
This goes both ways. You can't assume capabilities based on impressions. Especially with LLMs, which are purpose built to give an impression of producing language.
Also, designers of these systems appear to agree: when it was shown that LLMs can't actually do calculations, tool calls were introduced.
It's true that they only give plausible sounding answers. But let's say we ask a simple question like "What's the sum of two and two?" The only plausible sounding answer to that will be "four." It doesn't need to have any fancy internal understanding or anything else beyond prediction to give what really is the same answer.
The same goes for a lot of bugs in code. The best prediction is often the correct answer, namely pointing out the error. Whether it can "actually find" the bugs (whatever that means) isn't really so important as whether or not it's correct.
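As a toy illustration (the distribution below is invented, just to show the shape of the argument): when the overwhelmingly most plausible continuation is also the correct one, picking the most probable token gives the right answer without anything resembling a calculator behind it.

    # An invented next-token distribution for "What's the sum of two and two?";
    # the numbers are purely illustrative, not from any real model.
    next_token_probs = {"four": 0.97, "five": 0.01, "twenty-two": 0.01, "fish": 0.01}

    # Plain next-token prediction still yields the correct answer here,
    # because the most plausible continuation happens to be the right one.
    answer = max(next_token_probs, key=next_token_probs.get)
    print(answer)  # four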
It becomes important the moment your particular bug looks typical on the surface but has a non-typical cause. In such cases you'll get nonsense, which you need to ignore.
Again - they're very useful, as they give great answers based on someone else's knowledge and vague questions on the part of the user, but one has to remain vigilant and keep in mind that this is just text presented to you to look as believable as possible. There's no real promise of correctness or, more importantly, of critical thinking.
100%. They're not infallible, but that's a different argument from "they can't find bugs in your code."
Your brain is a slab of wet meat, not a logic engine. It can't find actual flaws in your code - it's just half-decent at pattern recognition.
That is not exactly true. The brain does a lot of things that are not "pattern recognition".
Simpler, more mundane (not exactly, still incredibly complicated) stuff like homeostasis or motor control, for example.
Additionally, our ability to plan ahead and simulate future scenarios often relies on mechanisms such as memory consolidation, which are not part of the whole pattern recognition thing.
The brain is a complex, layered, multi-purpose structure that does a lot of things.
This assumes every individual is capable of succinctly communicating to the AI what they want. And the AI is capable of maintaining it as underlying platforms and libraries shift.
And that there is little value in reusing software initiated by others.
> This assumes every individual is capable of succinctly communicating to the AI what they want. And the AI is capable of maintaining it as underlying platforms and libraries shift.
I think there are people who want to use software to accomplish a goal, and there are people who are forced to use software. The people who only use software because the world around them has forced it on them, either through work or friends, are probably cognitively excluded from building software.
The people who seek out software to solve a problem (I think this is most people) and compare alternatives to see which one matches their mental model will be able to skip all that and just build the software they have in mind using AI.
> And that there is little value in reusing software initiated by others.
I think engineers greatly over-estimate the value of code reuse. Trying to fit a round peg in a square hole produces more problems than it solves. A sign of an elite engineer is knowing when to just copy something and change it as needed rather than call into it. Or to re-implement something because the library that does it is a bad fit.
The only time reuse really matters is in network protocols. Communication requires that both sides have a shared understanding.
> I think there are people who want to use software to accomplish a goal, and there are people who are forced to use software.
Typically people feel they're "forced" to use software for entirely valid reasons, such as said software being absolutely terrible to use. I'm sure that most people like using software that they feel like actually helps rather than hinders them.
> The only time reuse really matters is in network protocols. Communication requires that both sides have a shared understanding.
A lot of things are like network protocols. Most things require communication. External APIs, existing data, familiar user interfaces, contracts, laws, etc.
Language itself (both formal and natural) depends on a shared understanding of terms, at least to some degree.
AI doesn't magically make the coordination and synchronisation overhead go away.
Also, reusing well debugged and battle tested code will always be far more reliable than recreating everything every time anything gets changed.
Even within a single computer or program, there is need for communication protocols and shared understanding - such as types, data schema, function signatures. It's the interface between functions, programs, languages, machines.
It could also be argued that "reuse" doesn't necessarily mean reusing the actual code as material, but reusing the concepts and algorithms. In that sense, most code is reuse of some previous code, written differently every time but expressing the same ideas, building on prior art and history.
That might support GP's comment that "code reuse" is overemphasized, since the code itself is not what's valuable, what the user wants is the computation it represents. If you can speak to a computer and get the same result, then no code is even necessary as a medium. (But internally, code is being generated on the fly.)
I think we shouldn't get too hung up on specific artifacts.
The point is that specifying and verifying requirements is a lot of work. It takes time and resources. This work has to be reused somehow.
We haven't found a way to precisely specify and verify requirements using only natural language. It requires formal language. Formal language that can be used by machines is called code.
So this is what leads me to the conclusion that we need some form of code reuse. But if we do have formal specifications, implementations can change and do not necessarily have to be reused. The question is why not.
> The only time reuse really matters is in network protocols.
And long-term maintenance. If you use something, you have to maintain it. It's much better if someone else maintains it.
> I think engineers greatly over-estimate the value of code reuse[...]The only time reuse really matters is in network protocols.
The whole idea of an OS is code reuse (and resource management). No need to set up the hardware to run your application. Then we have a lot of foundational subsystems like graphics, sound, input, etc. Crafting such subsystems and the associated libraries is hard and requires a lot of design thinking.
Which is why we should always just write and train our own LLMs.
I mean it’s just software right? What value is there in reusing it if we can just write it ourselves?
Every internal piece of software you write is a potentially infinite money sink of training.
> This assumes every individual is capable of succinctly communicating to the AI what they want. And the AI is capable of maintaining it as underlying platforms and libraries shift.
It's true that at first not everyone will be equally efficient, but I'd be lying if I were to claim that someone needs a 4-year degree to communicate with LLMs.
No, but if the old "10x developer" is really 1 in 10 or 1 in 100, they might do just fine while the rest of us, average PHP enjoyers, fall by the wayside.
Everybody in the world is now a programmer. This is the miracle of artificial intelligence.
- Jensen Huang, February 2024
https://www.techradar.com/pro/nvidia-ceo-predicts-the-death-...
God help us!
Far from everyone is cut out to be a programmer; the technical barrier was a feature, if anything.
There's a kind of mental discipline and ability to think long thoughts, to deal with uncertainty; that's just not for everyone.
What I see is mostly everyone and their gramps drooling at the idea of faking their way to fame and fortune. Which is never going to work, because everyone is regurgitating the same mindless crap.
The problem I mostly see with non programmers is that they don't really grasp the concept of a consistent system.
A lot of people want X, but they also want Y, while clearly X and Y cannot coexist in the same system.
> We pay huge communication/synchronization costs to eke out mild speed-ups on projects by adding teams of people.
Something Brooks wrote about 50 years ago, and the industry has never fully acknowledged. Throw more bodies at it, be they human bodies or bot agent bodies.
The point of the mythical man month is not that more people are necessarily worse for a project, it's just that adding them at the last minute doesn't work, because they take a while to get up to speed and existing project members are distracted while trying to help them.
It's true that a larger team, formed well in advance, is also less efficient per person, but they still can achieve more overall than small teams (sometimes).
Interesting point. And from the agent's point of view, it's always joining at the last minute, and it doesn't stick around longer than its context window. There's a lesson in there, maybe…
But there is an order-of-magnitude difference between coordinating AI agents and humans - the AIs are so much faster and more consistent than humans that you can (as Steve Yegge [0] and Nicholas Carlini [1] showed) have them build a massive project from scratch in a matter of hours and days rather than months and years. The coordination cost is so much lower that it's just a different ball game.
[0] https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
[1] https://www.anthropic.com/engineering/building-c-compiler
Then why aren’t we seeing orders of magnitude more software being produced?
I think we are. There's definitely been an uptick in "show HN" type posts with quite impressively complex apps that one person developed in a few weeks.
From my own experience, the problem is that AI slows down a lot as the scale grows. It's very quick to add extra views to a frontend, but it struggles a lot more with making wide-reaching refactors. So it's very easy to start a project, but after a while your progress slows significantly.
But given I've developed 2 pretty functional full stack applications in the last 3 months, which I definitely wouldn't have done without AI assistance, I think it's a fair assumption that lots of other people are doing the same. So there is almost certainly a lot more software being produced than there was before.
I think the proportion of new software that is novel has absolutely plummeted after the advent of AI. In my experience, generative AI will easily reproduce code for which there are a multitude of examples on GitHub, like TODO CRUD React Apps. And many business problems can be solved with TODO CRUD React Apps (just look at Excel’s success), but not every business problem can be solved by TODO CRUD React Apps.
As an analogy: imagine if someone was bragging about using Gen AI to pump out romantasy smut novels that were spicy enough to get off to. Would you think they’re capable of producing the next Grapes of Wrath?
Didn't we have a post the other day saying that the number of "Show HN" posts is skyrocketing?
Claude Code released just over a year ago, agentic coding came into its own maybe in May or June of last year. Maybe give it a minute?
It's been a minute and a half, and I don't see evidence that you can task an agent swarm to produce useful software without your input or review. I've seen a few experiments that failed, and I've seen manic garbage, but not yet anything useful outside of the agent operator's imagination.
Agent swarms are what, a couple of months old? What are you even talking about. Yes, people/humans still drive this stuff, but if you think there isn't useful software out there that can be handily implemented with current gen agents that need very little or no review, then I don't know what to tell you, apart from "you're mistaken". And I say that as someone who uses three tools heavily but has otherwise no stake in them. The copium in this space is real. Everyone is special and irreplaceable, until another step change pushes them out.
The next thing after agent swarms will be swarm colonies, and people will go "it's been a month since agentic swarm colonies, give it a month or two". People have been moving the goalposts like that for a couple of years now, and it's starting to grow stale. This is like self-driving cars, which were going to be working in 2016 and replace 80% of drivers by 2017, all over again. People falling for hype instead of admitting that while it appears somewhat useful, nobody has any clue whether it's 97% useful or just 3% useful; so far it's looking like the latter.
I generally agree, but counterpoint: Waymo is successfully running robocabs in many cities today.