How does misalignment scale with model intelligence and task complexity?
alignment.anthropic.com
171 points by salkahfi 8 hours ago
The comments so far seem focused on taking a cheap shot, but as somebody working on using AI to help people with hard, long-term tasks, it's a valuable piece of writing.
- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
Other actionable insights are:
- Merge amendments up into the initial prompt.
- Evaluate prompts multiple times (ensemble).
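A minimal sketch of the "evaluate multiple times" idea, assuming a plain majority vote over repeated samples. `ask` is a hypothetical stand-in for whatever function calls your model; it is not part of any real client library.

```python
from collections import Counter
from typing import Callable


def ensemble_answer(ask: Callable[[str], str], prompt: str, n: int = 5) -> tuple[str, float]:
    """Run the same prompt n times and return (majority answer, agreement rate)."""
    answers = [ask(prompt) for _ in range(n)]
    # most_common(1) returns the single most frequent answer and its vote count
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n
```

A low agreement rate is a cheap signal that the task is in the variance-dominated regime and might be worth splitting into smaller pieces.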
> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.
Coherence requires 2 opposing forces to hold coherence in one dimension and at least 3 of them in higher dimensions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end to turn around.
So we had a better result from using Haiku (we fail over to Sonnet) over Opus and using a higher reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better as they do not double-think their approaches - they fail or they succeed, they are not as tenacious as the higher cost models.
We can escalate to higher authority and get out of that mess faster if we fail hard and early.
The knowledge of how exactly failure happened seems to be less useful to the higher reasoning model than to the action-biased models.
Splitting up the tactical and strategic sides of the problem seems to work similarly to how generals don't hold guns in a war.
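Roughly, the shape of that split can be sketched as below. This is an illustration under assumptions, not the setup from the paper: `plan` and the entries of `tiers` are hypothetical callables wrapping a planning model and progressively more expensive executor models, and an executor signals failure by returning `None`.

```python
from typing import Callable, Optional

# An executor returns its output, or None if it failed on the step.
Model = Callable[[str], Optional[str]]


def run_plan(plan: Callable[[str], list[str]], tiers: list[Model], task: str) -> list[str]:
    """Decompose once with the planner, then run each step on the cheapest tier that succeeds."""
    results = []
    for step in plan(task):
        for model in tiers:  # ordered cheapest first
            out = model(step)
            if out is not None:  # no retries on the same tier: fail hard and early
                results.append(out)
                break
        else:
            # every tier failed, so hand the problem back to a higher authority
            raise RuntimeError(f"all tiers failed on step: {step!r}")
    return results
```

The executors never re-plan; when every tier fails, the only thing that travels upward is the failed step.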
> Coherence requires 2 opposing forces
This seems very basic to any kind of information processing beyond straight shot predictable transforms.
Expansion and reduction of possibilities, branches, scope, etc.
Biological and artificial neural networks converging into multiple signals, that are reduced by competition between them.
Scientific theorizing, followed by experimental testing.
Evolutionary genetic recombination and mutation, winnowed back by resource competition.
Generation, reduction, repeat.
In a continually coordinated sense too. Many of our systems work best by encouraging simultaneous cooperation and competition.
Control systems: a command signal proportional to demand vs. continually reverse-acting error feedback.
> This seems very basic
Yes, this is not some sort of hard-fought wisdom.
It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.
In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.
If you don't really want the experiments and data from the academic paper, we have a white paper which is completely obvious to anyone who's read High Output Management, The Mythical Man-Month and A Philosophy of Software Design recently.
Nothing in there is new, except the field it is applied to has no humans left.
This is a good line: "It found that smarter entities are subjectively judged to behave less coherently"
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, it's going to be higher error (less coherent) in the valleys of the manifold than naive gradient following to the local minima.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
Incoherence is not error.
You can have a vanishingly small error and an incoherence at its max.
That would be evidence of perfect alignment (zero bias) and very low variance.
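A small numeric sketch of that distinction, assuming (as I read the article) that incoherence means the share of total squared error that comes from variance rather than bias. The function name and the sample values are made up for illustration.

```python
from statistics import fmean, pvariance


def decompose(samples: list[float], target: float) -> dict[str, float]:
    """Split mean squared error into bias^2 + variance and report the variance share."""
    bias_sq = (fmean(samples) - target) ** 2
    variance = pvariance(samples)
    total = bias_sq + variance  # equals the mean squared error against target
    return {
        "error": total,
        "incoherence": variance / total if total else 0.0,
    }


# Unbiased with tiny scatter: error is nearly zero, yet almost all of it is variance.
tiny_noise = decompose([0.999, 1.001, 1.0, 1.0], target=1.0)

# Consistently wrong: a much larger error, none of it variance.
steady_miss = decompose([1.5, 1.5, 1.5, 1.5], target=1.0)
```

Under this reading `tiny_noise` has incoherence close to 1 with an error near zero, while `steady_miss` has incoherence 0 with a far larger error.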
What do 'domain valleys' and 'tunneling' mean in this context?
So, the hidden mental model that the OP is expressing and failed to elucidate is that LLMs can be thought of as compressing related concepts into approximately orthogonal subspaces of the vector space that is upper bounded by the superposition of all of their weights. Since training has the effect of compressing knowledge into subspaces, a necessary corollary is that there are now regions within the vector space that contain nothing very much. Those are the valleys that need to be tunneled through, i.e. the model needs to activate disparate regions of its knowledge manifold simultaneously, which seems like it might be difficult to do. I’m not sure if this is a good way of looking at things though, because inference isn’t topology and I’m not sure that abstract reasoning can be reduced down to finding ways to connect concepts that have been learned in isolation.
A hallmark of intelligence is the ability to find connections between the seemingly disparate.
Not the OP, but my interpretation here is that if you model the replies as some point in a vector space, assuming points from a given domain cluster close to each other, replies that span two domains need to "tunnel" between these two spaces.
> the ability to traverse between domain valleys in the cognitive manifold.
Couldn't you have just said "know about a lot of different fields"? Was your comment sarcastic or do you actually talk like that?
> When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
Insights are “deep” not on their own merit, but because they reveal something profound about reality. Such a revelation is either testable or not. If it’s testable, distinguishing it from bullshit is relatively easy, and if it’s not testable even in principle, a good heuristic is to put it in the bullshit category by default.
This was not my experience studying philosophy. After Kant there was a period where philosophers were basically engaged in a centuries-long obfuscated writing competition. The pendulum didn't start to swing back until Nietzsche. It reminded me of legal jargon but more pretentious and less concrete.
The issue is the revelation. It's always individual at some level. And don't forget our senses are crude. The best way is to store "insights" as information until we collect enough data that we can test them again (hopefully without a lot of bias). But that can be more than a lifetime's work, so sometimes you have to take some insights at face value based on heuristics (parents, teachers, elders, authority, ...).
The "natural overthinking increases incoherence" finding matches my daily experience with Claude.
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
I don't think it's because the AI is working on "misaligned" goals. The user never specifies the goal clearly enough for the AI system to work.
However, I think producing a detailed enough specification requires the same or an even larger amount of work than writing code. We write a rough specification and clarify it during the process of coding. I think there is a minimum effort required to produce such a specification, and AI will not help you speed that up.
> I think producing a detailed enough specification requires the same or an even larger amount of work than writing code
Our team has started dedicating much more time writing documentation for our SaaS app, no one seems to want to do it naturally, but there is very large potential for opening your system to machine automation. Not just for coding but customer facing tooling. I saw a preview of that possible future using NewRelic where they have an AI chat use their existing SQL-like query language to build tables and charts from natural language queries right in the web app. Theirs kinda sucks but there's so much potential there that it is very likely going to change how we build UIs and software interfaces.
Plus it also helps sales, support, and SEO having lots of documentation on how stuff works.
That makes me wonder about the "higher and higher-level language" escalator. When you're writing in assembly, is it more work to write the code than the spec? And the reverse is true if you can code up your system in Ruby? If so, does that imply anything about the "spec driven" workflow people are using with AIs? Are we right on the cusp where writing natural language specs and writing high level code are comparably productive?
I believe that the issue right now is that we're using languages designed for human creation in an AI context. I think we probably want languages that are optimized for AI written but human read code, so the surface texture is a lot different.
My particular hypothesis on this is something that feels a little bit like python and ruby, but has an absolutely insane overkill type system to help guide the AI. I also threw in a little lispiness on my draft: https://github.com/jaggederest/locque/
I don't know, LLMs thrive on human text, so I would wager that a language designed for humans would quite closely match an ideal one for LLMs. Probably the only difference is that LLMs are not "lazy": they tolerate boilerplate better, and lower-complexity structures likely fit them better. (E.g. they can't really one-shot understand some imported custom operator that is not very common in their training data.)
Also, they rely surprisingly closely on "good" code patterns, like comments and naming conventions.
So if anything, a managed language [1] with a decent type system and not a lot of features would be the best, especially if it has a lot of code in its training data. So I would rather vote on Java, or something close.
[1] Reasoning about lifetimes, even if aided by the compiler, is a global property, and LLMs are not particularly good at that.
If you are on the same wavelength as someone you don't need to produce a full spec. You can trust that the other person has the same vision as you and will pick reasonable ways to implement things. This is one reason why personalized AI agents are important.
Programming languages can be a thinking tool for a lot of tasks, very much like other notations such as sheet music and map drawing. A condensed and somewhat formal manner of describing ideas can increase communication speed. It may lack nuance, but in some cases, nuance is harmful.
The nice thing about code compared to other notation is that it's useful on its own. You describe an algorithm and the machine can then solve the problem ad infinitum. It's one step instead of the two steps of writing a spec and having an LLM translate it, then having to verify the output and alter it.
Assembly and high level languages are equivalent in terms of semantics. The latter helps in managing complexity, by reducing harmful possibilities (managing memory, off-by-one errors) and presenting common patterns (iterators/collections, structs and other data structures, ...) so that categories of problems are easily solved. There's no higher level of computing model unlocked, just a faster level of productivity unlocked by following proven patterns.
Spec-driven workflow is a mirage, because even the best specs will leave a lot of details unspecified, and those are crucial, as most of programming is making the computer not do the various things it can do.
> most of programming is making the computer not do the various things it can do
This is a very stimulating way of putting it!
My thought too. To extend this: coding agents will make code cheap and specifications cheaper, but may also invert the relative opportunity cost of not writing a good spec.
This matches my intuition. Systematic misalignment seems like it could be prevented by somewhat simple rules like the Hippocratic Oath or Asimov's Laws of Robotics, or rather probabilistic Bayesian versions of these rules that take into account error bounds and risk.
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming better Bayesians, which implies being great at calibrating the confidence intervals of their interpretations and reasoning, and basically gaining a superhuman ability to evaluate the bounds of ambiguity and risk.
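One way to make "do not take excessive risk of harm" concrete is to act only when an upper confidence bound on expected harm fits inside a budget. The sketch below is an illustration under assumptions: it uses a one-sided Hoeffding bound on the harm probability, and `severity` and `risk_budget` are hypothetical parameters, not anything proposed in the article.

```python
import math


def harm_upper_bound(harm_events: int, trials: int, delta: float = 0.05) -> float:
    """One-sided Hoeffding bound: the true harm rate is below this with probability >= 1 - delta."""
    p_hat = harm_events / trials
    return min(1.0, p_hat + math.sqrt(math.log(1 / delta) / (2 * trials)))


def may_act(harm_events: int, trials: int, severity: float, risk_budget: float) -> bool:
    """Allow the action only if the worst plausible expected harm fits the budget."""
    return harm_upper_bound(harm_events, trials) * severity <= risk_budget
```

With zero observed harms, 1000 trials pass a budget of 0.05 at severity 1.0 but 10 trials do not: the rule punishes thin evidence, not just bad outcomes.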
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
"model failures become increasingly dominated by incoherence rather than systematic misalignment."
This should not be surprising.
Systematic misalignment, i.e., bias, is still coherent and rational, if it is to be systematic. This would require that AI reason, but AI does not reason (let alone think), it does not do inference.
It's nice seeing this with Sohl-Dickstein as the last author after reading this blog post from him some time ago: https://sohl-dickstein.github.io/2023/03/09/coherence.html
Longer thinking sections have more space for noise to accumulate?
> This suggests that scaling alone won't eliminate incoherence. As more capable models tackle harder problems, variance-dominated failures persist or worsen.
This is a big deal, but are they only looking at auto-regressive models?
The findings are based on older models. Assuming recent models behave similarly, what kind of prompt style should one use to avoid the increase in variance, especially when asking a model to solve really complex problems?
When humans dream, we are disconnected from the world around us. Without the grounding that comes from being connected to our bodies, anything can happen in a dream.
It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.
It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.
Also, since dreams are built from combinations of experiences that the brain already knows, we cannot die in a dream, as our brain does not know how to replicate what it would feel like after being dead. Basically, LLMs also cannot produce truly novel ideas.
The models they tested are already way behind the current state-of-the-art. Would be interesting to see if their results hold up when repeated with the latest frontier models.
I feel vindicated when I say that the superintelligence control problem is a total farce, we won't get to superintelligence, it's tantamount to a religious belief. The real problem is the billionaire control problem. The human-race-on-earth control problem.
I don’t believe the article makes any claims on the infeasibility of a future ASI. It just explores likely failure modes.
It is fine to be worried about both alignment risks and economic inequality. The world is complex, there are many problems all at once, we don’t have to promote one at the cost of the other.
Yeah, article aside, looking back on all the AGI stuff from the last year or so really puts our current moment in perspective.
This whole paradigm of AI research is cool and all but it's ultimately a simple machine that probabilistically forms text. It's really good at making stuff that sounds smart but like looking at an AI picture, it falls apart the harder you look at it. It's good at producing stuff that looks like code and often kinda works but based on the other comments in this thread I don't think people really grasp how these models work.
My ignorant question: They did bias and variance noise, how about quantisation noise? I feel like sometimes agents are "flip-flopping" between metastable divergent interpretations of the problem or solution.
[flagged]
This is very interesting research and a great write up.
I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.
It feels a bit like a cult.
Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.
For some reason the article reads to me like “AI is not evil, it just has accidents when it loses coherence.” Sounds a lot like liability shifting.
They compared it to industrial accidents. I don't think a software company would try to shift liability by comparing themselves to factory explosions and chemical spills.
I don’t know why it seems so hard for these guys to understand: you scorecard every step of a new strategy on how much it closes the distance to the goal, and if you have multiple generated forward options with no good weight, you spawn a new agent and multiple paths. Then you score all the terminal branches and prune.
LLMs aren’t constrained to linear logic like your average human.
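What is described above is close to a beam search. A minimal sketch, assuming hypothetical `expand` (generate forward options) and `score` (closeness to the goal) callables:

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def search(
    start: S,
    expand: Callable[[S], Iterable[S]],
    score: Callable[[S], float],
    depth: int,
    width: int = 3,
) -> S:
    """Branch at every step, score the candidates, keep the best `width`, repeat."""
    frontier = [start]
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in expand(state)]
        if not candidates:
            break
        # prune: only the highest-scoring branches survive to the next step
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return max(frontier, key=score)
```

The catch is `score`: the whole scheme is only as good as the scorecard, which is arguably where the variance comes back in.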
[flagged]
Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.