Agents need control flow, not more prompts

bsuh.bearblog.dev

230 points by bsuh 5 hours ago

1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. Its a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: Letting the model manage the high level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5; not as extensively; but seems to have the same problems.

We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But, its also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.

DrewADesign - an hour ago

I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think that they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect developer job market/pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team— what most developers really want, and most executives really don’t— a bunch of investors would make them walk the plank.
I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
- cogman10 - 18 minutes ago
  
  > However, I think that they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it.
  I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But, if you break it up into smaller and easier to consume chunks all the sudden you need a much less capable LLM to get results comparable or better than the SOTA.
  Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
- pishpash - an hour ago
  
  Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.
rdedev - 9 minutes ago

I had to create a hypothesis testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and have the agent follow a flowchart to run a statistical test and return the answer
At the time I had access to only 4o and there was no way to guarantee that the agent would follow the flowchart if I just mention it in its prompt. What I ended up wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent
sharperguy - 9 minutes ago

So I wonder, if a more powerful agent harness could have the agent basically write and exectute its own deteministic code, which when executed, spawns sub agents for each of the subtasks?
So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
- tonylucas - 5 minutes ago
  
  I've been working on an integrated deterministic/agent integrated system for a few months now. It basically runs an AI step to build a plan, which biases towards deterministic steps as much as possible but escalates back to AI when it needs to (for AI only capabilities or deterministic failures) so effectively (when I perfect it, I'm about 90% there) it can bounce back and forward as needed with deterministic steps launching AI steps and AI steps launching deterministic steps as needed.
  Probably not explaining it very well but I think it's pretty effective at reducing token usage.
mmis1000 - an hour ago

> This started breaking down after ~30 files.
Codex's short context and todolist system combined somehow helps here though. Because of the frequent compact. The model was forced to recheck what todo list item has not done yet and what workflow skill it has to use. I used to left it for multi hour to do a big clean up and it finished without obvious issues.
- swores - an hour ago
  
  Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?
  - dnh44 - 7 minutes ago
    
    I regularly get codex to do multi hour tasks with a single prompts I don't think thats a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to sub agents. For example, a sub agent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and it just manages from a bird's eye view. I do have the $200 plan though.
woeirua - 39 minutes ago

I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.
Joeri - an hour ago

You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”, it just needs to know how to launch the right script.
sroussey - an hour ago

I’m working on a hybrid system of old school task graph and ai agents and let them instantiate each other. I think others will do that eventually.
- tonylucas - 4 minutes ago
  
  I'm working on something similar (won't link to it as don't want people to think I'm spamming) but if you want to compare notes happy to talk.
- cluckindan - an hour ago
  
  Jira for agents?
pishpash - an hour ago

Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.
- svachalek - 39 minutes ago
  
  True. The prompt reads: Run the following Python: ```

rnxrx - 4 hours ago

I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often correctly) handled programmatically.

chrismarlow9 - 3 hours ago

100% agreed. use the non-deterministic thing that is right 90% of the time to generate a deterministic thing that is right 100% of the time. one of the key things I add to my prompts is:
- Please consult me when you encounter any ambiguous edge cases
Attaching the AI to production to directly do things with API calls is bad. For me the only use case where the app should do any AI stuff is with reading/categorizing/etc. Basically replacing the "R" in old CRUD apps. If you want to use that same new AI based "R" endpoint to auto fill forms for the "C", "U", and "D" based on a prompt that's cool, but it should never mutate anything for a customer before a human reviews it. Basically CRUD apps are still CRUD apps (and this will always be true), they just have the benefit of having a very intelligent "R" endpoint that can auto complete forms for customers (or your internal tooling/Jenkins pipelines/etc), or suggest (but never invoke) an action.
vishvananda - 3 hours ago

I think there is a flow in most organizations from:
llm -> prompt -> result
llm -> prompt + prompt encoded as skill -> result
llm -> prompt + deterministic code encoded as skill -> result
I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism in many cases that actually make long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched in between layers of determinism:
deterministic agentic flows -> non-deterministic decision making -> deterministic tools
This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.
evilelectron - 2 hours ago

This is exactly how I did my last project of automating the generation of an interface library between a server that controls hardware and the mobile app.
The hardware control team delivers a spec as a document and spreadsheet. The mobile team was using that to code the interface library and validating their code against the server. I converted the document to TSV, sent some parts to Claude and have it write a parser for the TSV keeping all the nuances of human written spec. It took more than 150 iterations to get the parser to handle all edge cases and generate an intermediate output as JSON. Then Claude helped me write a code generator using some custom glue on top of Apollo to generate the code that is consumed by the mobile app.
This whole pipeline runs as part of Github actions and calls Claude only when our library validator fails. There is an md file which is sent to Claude on failure as part of the request to figure out what went wrong, propose a solution and create a PR. This is followed by a human review, rework and merge. Total credits consumed to get here < $350.
groovetandon - an hour ago

This is so true have been working on a project for exactly this principle -
https://www.decisional.com/blog/workflow-automation-should-b...
I think there is a fundamental incentive problem - code + llm + harness is bound to be more efficient but the labs want you to burn tokens so they are not going to tell you to use the code, just burn more tokens. They are asking us to forget about the token cost and reliability for now - model will become better.
This means that most people just believe that their agent should just be able to do anything with the help of some Model fairy dust with prompts + skills.
People need to watch their agents fail in production to be able to come to the right conclusion unfortunately.
VMG - 3 hours ago

The problem is that often the program runs into some edge case that requires interpretation, at which point one is tempted to let the LLM deal with the edge case, at which point one is tempted to let the LLM deal with the whole loop and let it do the tool calls
- Fishkins - 3 hours ago
  
  Agreed. I think the approach described here is promising. Most of the workflow is deterministic and includes safeguards, but an LLM is invoked in the one case where it's really useful.
  https://lethain.com/agents-as-scaffolding/
khasan222 - an hour ago

Completely agree! People tend to forget we are non deterministic too! Yet we are able to write code fine, and fairly reliably by using tools that can help keep us fairly honest.
I think most problems with ai tend to be around can you deterministically test the thing you are asking it to do?
How many of us would never ever show work, without going to check the thing we just built first?
- cluckindan - an hour ago
  
  > can you deterministically test the thing you are asking it to do?
  Of course: have it write tests first; and run them to check its work.
  Works well for refactoring, but greenfield implementations still rely on a spec that is guaranteed to be incomplete, overcomplete and wrong in many ways.
  - pishpash - an hour ago
    
    You can't ask something to check its own work without external reward/penalty. It'll cheat.
nixpulvis - 2 hours ago

My agents often write themselves scripts. Isn't that effectively what you're asking for? Prompting for scripts can also be a useful time and accuracy tactic when you know it'll be a good fit for it.
- falkensmaize - 25 minutes ago
  
  The problem is that code it spits out on the fly is untested and untrustworthy. Identify the parts of your workflow that could be accomplished with regular code - write and unit test that code, with LLM help if you want, and use the llm as the orchestrator only.
- sisve - 2 hours ago
  
  Yeah, the problem is that I do not think the agents is good at reusing scripts and stitching it together.At least for me it's recreating to much similar. I hope we will see platforms like windmill.dev find the optimal solution for this. I have not been able to test it enough. But have a platform that gives you some observability out of the box and protect secrets from llm is nice
  - reddit_clone - 14 minutes ago
    
    I noticed that too. Unless you _ask_ for a script, they throw away the scripts they write.
    They are particularly bad at complex multiline parsing. Writing all sorts of weird/crude python/awk scripts and getting confused in the process.
    I wish they would use Perl6/Grammer or Haskell/Parsec or similar and write better parsing scripts.
foolserrandboy - 4 hours ago

yup, the standard way of thinking about agents seems backwards and probably costly. Use LLMs to write scripts, then stick all your scripts in your own looping harness and call out for LLMs for those parts that are too hard to automate with some deterministic validation at the end.
user3939382 - 2 hours ago

> write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible
Correct. The concept of having probabilistic output with deterministic acceptance “guardrails” is illogical. If the domain resists deterministic modeling such that you’re using an LLM, the guardrails don’t magically gain that capability.

bwestergard - 5 hours ago

I agree with the sentiment, but I think the conclusion should be altered. When you hit the limit of prompting, you need to move from using LLMs at run time to accomplish a task to using LLMs to write software to accomplish the task. The role of LLMs at run time will generally shrink to helping users choose compliant inputs to a software system that embodies hard business rules.

scrappyjoe - 4 hours ago

I’ve had a couple of weeks of downtime at work, so I decided to incorporate agents into my work processes - things like note taking, task tracking, document management.
Your comment EXACTLY mirrors my experience. Week 1 was ever expanding prompts, and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people etc) and defining methods for performing well defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language to commands and args that pass the input validator.
sowbug - 3 hours ago

A full-circle system prompt would be to "find every opportunity to put yourself out of your job by automating it away. When you are given a question that code can answer, answer the question by writing code and running it to obtain the result."
Such an LLM might have fared better with the strawberry test.
edgarvaldes - 4 hours ago

Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI. I don't know how far we are from that.
- aleksiy123 - 3 hours ago
  
  It’s already here the question is just to what extent?
  Are google search results modifying your software at runtime?
  Take or agent chat for example, the output text is a ui, agents can generate charts and even constrained ui elements.
  Isn’t that created and adapted at run time?
  If you mean like agents live modifying your code. I think that’s pretty much here as well. Can read the logs and send prs.
  The only thing is how fast that loop will execute from days or hours to mins or seconds, and what validation gates it needs to pass.
  My git repo is pretty much self modifying personal software at this point, that I interface through the ide chat window.
  But I don’t think we will ever lose the intermediary deterministic language (code) between the llm and the execution engine.
  It would be prohibitively expensive to run everything through models all the time.
  But I am starting to think we need a more precise language than English when talking with LLMs. That can do both precision and ambiguity when you need either.
  - jmaw - 3 hours ago
    
    Some kind of "code", you could say
    
    aleksiy123 - 44 minutes ago
    
    Yes but more declarative vs imperative.
    I say what the llm says how.
    
    pishpash - 38 minutes ago
    
    Not that long ago the workflow was to turn code comments into code. Maybe leave some comments as is now.
  - pishpash - 40 minutes ago
    
    Sounds like assemblers bemoaning loss of control to C. The solution was inline assembly...
- cassianoleal - 2 hours ago
  
  So we're back to vim over ssh in production, only without a human with _some semblance_ of judgement in the loop?
- mjr00 - 4 hours ago
  
  > Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI.
  Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.
  - hilariously - 4 hours ago
    
    I think many software adjacent folks are super excited because they can now have the personalized toothbrush they keep asking people to make for them.
    The problem is that outside of that most people want boring and regular interfaces so they can get in and solve the problem and get out - they don't want to "love" it or care if its "sexy" they want it to work and get out of the way.
    LLMs transmogrifying your software at ever request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.
    Most people using computers use the to consume things or utilize access to things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x is make them a lot of money.
  - munk-a - an hour ago
    
    A program that is generated at runtime is fine (we have interpreted languages and often compile on demand) - the issue is with the non-deterministic nature of the output.
    I think the core issue is that non-deterministic output is great for a chatbot experience where you want unpredictable randomness so it feels less like talking to the mirror - but when it comes to coding I think we're pretty fundamentally misaligned in sticking to that non-deterministic approach so firmly.
QuercusMax - 3 hours ago

I've seen cases where models will get stuck in a particular mode of problem solving and need a nudge to tell them to move to a new mode. For example, instead of trying to massage a bunch of system service configs to handle hot-plug/unplug of an audio stream, what I really needed was to just write a couple dozen lines of Python to handle stuff.
I just had Claude write itself a couple shell scripts to handle a bunch of common cases (like running tests) in my workflow where it just couldn't figure it out efficiently. Now it just runs those tools and sets things up instead of spinning in circles for half an hour.
Every time it tries to ask me if it can run some one-off crazy shell or python one-liner to do something, I've started asking myself if I should have it write a tool I can auto-approve instead.
3uba - 3 hours ago

[dead]

moconnor - 15 minutes ago

“Flow” moves agents through a yaml flowchart of prompts and decisions. It’s working quite well for a couple of us in Tenstorrent, more to discover here though:

https://github.com/yieldthought/flow

Happily, 5.5 is good at writing and using it.

jerf - 4 hours ago

This is why I frequently refer to "next generation AIs" that aren't just LLMs. LLMs are pretty cool and I expect that even if we see no further foundational advancement in AIs that we're going to continue to see them exploited in more interesting ways and optimized better. Even if the models froze as they are today, there's a lot more value to be squeezed out of them as we figure out how to do that.

However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can even after a lot of work end up seeing that as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.

There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.

I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.

cultofmetatron - an hour ago

heres a fun one for you https://www.youtube.com/watch?v=kYkIdXwW2AE&t=315s

JohnMakin - 3 hours ago

> Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.

This is essentially declarative programming. Most traditional programming is imperative, what most developers are used to - I give the exact set of instructions and expect them to be obeyed as I write them. Agents are way more declarative than imperative - you give them a result, they work on getting that result. Now the problem of course, is in something declarative like say, SQL, this result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.

Thinking about agents declaratively has helped me a lot rather than to try to design these rube-goldberg "control" systems around them. Didn't get it right? Ok, I validated it's not correct, let's try again or approach it differently.

If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.

repelsteeltje - 3 hours ago

I was thinking of declarative, but PROLOG rather than SQL. So with actual control flow and reasoning capabilities.
And then you run into similar issues as the llm does, like silent failures, loops, contradictions unless you're very careful.
The essence might be the same closed world assumption problem. In llm case this manifests as hallucination rather that admitting it does not know.
miltonlost - 2 hours ago

SQL's declarativeness is also based on the mathematics of relational algebra, so it will return the same result every time. Will it return it in the same amount of time every single query? No, that depends on indexing and database size. But the query itself won't be altered in the same way an LLM would be.
- JohnMakin - 2 hours ago
  
  Engines that use SQL can vary drastically in how they handle strings, floating points, etc., where identical SQL queries on identical data absolutely can return different results, which is why I mentioned the engine underneath - LLM's being nondeterministic in addition to declarative is kind of tangential to the point I was trying to make.
  It is the same in terraform - yes, the HCL spec defines things very precisely, but you're kind of at the mercy of how the provider and provider API decide how to handle what you wrote, which can be very messy and inconsistent even when nothing changed on your side at all. LLM/agent usage feels a lot like that to me, in the sense it's declarative and can be a bit lossy. As a result there are things I could technically do in terraform but would never, because I need imperativeness.
  My main point being, I think people are trying to ram agents into a ton of cases where they might not necessarily need or even want to be used, and stuff like this gets written. Maybe not, but I see it day to day - for instance, I have a really hard time convincing coworkers that are complaining about the reliability of MCP responses with their agents, that they could simply take an API key, have the agent write a script that uses it, and strictly bound/define the type of response format they want, rather than let the agent or server just guess - for some reason there is some inclination to "let the agent decide how to do everything."
  I think that's probably what this article is getting at, but, I am saying that trying to create these elaborate control flows with validation checks everywhere to reign in an unruly application making dumb decisions, why not just use it to write deterministic automation instead of using agent as the automation?

isityettime - 3 hours ago

Afaict all harnesses are wrong in this respect, some of them deeply so.

Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.

Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".

There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)

dnautics - 3 hours ago

> Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
I get what you're trying to say but in practice architecting what you propose is considerably more difficult. Why not build it and try to get hired by one of the bigcos?
- isityettime - 2 hours ago
  
  I don't think the basic architecture principles are novel. The big AI labs and other large tech companies already have engineers who can see this, without a doubt. But the AI labs clearly don't care if their LLM agents are just big balls of mud, and the big tech companies priorities mostly lie elsewhere, too.
  They just want features. They don't really care about duplicated work, so half of them reinvent the TUI rendering wheel. Pluggability is something that might be actually hostile to their interests in lock-in. And the AI labs probably think "after a couple more scaling cycles, our models will be so good that our agents can just rewrite themselves from scratch"; until they hit a compute or power wall, it always looks rational to them to defer rearchitecting.
  Another real possibility is that if you work on an agent with a really clean architecture and publish it in hopes of getting hired by some AI company, all of them think "that looks great, but we don't want to rearchitect right now". Your code winds up in the training set, and a year and a half from now, existing agents can "one-shot" rewrites along the lines of your design because they're "smarter".
  As for me, I'm not that interested, personally. There are other things I want to build and I'm working on those.
the_duke - 2 hours ago

In codex CLI /status works just fine during a turn.
Other things don't though.
user34283 - 3 hours ago

I use the Codex desktop app.
In the GUI I can see the context indicator and usage stats.
It also makes it easier to jump between conversations and see the updates.
Sometimes I use Claude Code or opencode in the terminal, and my experience is much poorer compared to the Codex desktop app.

Nizoss - an hour ago

If you’re interested in such deterministic scaffolding/control flow, check out Probity.

I created it to address this exact issue. It is a vendor-neutral ESLint-style policy engine and currently supports Claude Code, Codex, and Copilot.

It uses the agents hooks payloads and session history to enforce the policies. Allowing it to be setup to block commits if a file has been modified since the checks were last run, disallow content or commands using string or regex matching, and enforce TDD without the need of any extra reporter setup and it works with any language.

Feedback welcome: https://github.com/nizos/probity

rglover - 2 hours ago

> Babysitter: Keep a human in the loop to catch errors before they propagate.

This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.

A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?

kelseyfrog - 2 hours ago

I'm not sure I agree. Like all stochastic processes, LLM errors can be quantified. That makes each use case a risk-reward tradeoff where users can decide if the tradeoff makes sense for them or not. There are scenarios where errors are acceptable because the risks are low or errors are acceptable or the rewards make up for them. This is a process engineer problem where business and technology specifics matter.
- rglover - 2 hours ago
  
  I see where you're coming from, but this assumes good behavior and discipline which most people/teams struggle with.
  If a business can get away with some margin of error being acceptable, more power to them. But if not (or doing so would cause additional problems; what I'd imagine to be true for a non-trivial number of orgs), it's wise to consider the nature of the tool a lot of people are suggesting is mandatory if you're dependent on consistent, predictable results.
  - kelseyfrog - an hour ago
    
    That's fair. A heuristic that leaves some opportunity on the table due to org capability is a reasonable one to have.

Neywiny - 5 hours ago

If you're trying to get reliability and determinism out of the LLM, you've already lost

tekne - 4 hours ago

Wait... why?
Making an unreliable, nondeterministic system give reliable results for a bounded task with well-understood parameters is... like half of engineering, no?
There's a huge difference between "generate this code here's a vague feature description" and "here's a list of criteria, assign this input to one of these buckets" -- the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
- JCTheDenthog - 3 hours ago
  
  >the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
  ...which is why we write deterministic code to take the human out of the pipeline. One of the early uses of computers was calculating firing tables for artillery, to replace teams of humans that were doing the calculations by hand (and usually with multiple humans performing each calculation to catch errors). If early computers had a 99% chance of hallucinating the wrong answer to an artillery firing table, the response from the governments and militaries that used them would not be to keep using computers to calculate them. It would be to go back to having humans do it with lots of manual verification steps and duplicated work to be sure of the results.
  If you're trying to make LLMs (a vague simulacrum of humans) with their inherent and unsolvable[1] hallucination problems replace deterministic systems, people are going to eventually decide to return to the tried and true deterministic systems.
  1: https://arxiv.org/abs/2401.11817
- Neywiny - 3 hours ago
  
  Because it's not possible. There is nothing you can say to the LLM that will guarantee that something happens. It's not how it works. It will maybe be taken into consideration if you're lucky.
  But if you're trying to tell me that every time you list criteria you get them all perfectly matched, you're clearly gifted.
evantbyrne - 2 hours ago

I would hope that when engineers speak of LLM determinism they just mean it as shorthand for close to 1 under expected conditions
aleksiy123 - 4 hours ago

There’s a whole range between completely random and completely rule based deterministic.
Somewhere in between that I guess is the varying levels of intelligence more likely able to make the “right” decision for anything you throw at it.
sudosteph - 2 hours ago

I mean, with reliability there's a spectrum. If the risks that an unreliable outcome brings aren't all that bad, then sometimes it's worth it to chase "my agents made an acceptable PR 70% of the time, can I get it to 90?"
Determinism is a different matter. Scripts and hooks are really the main levers you can pull there, but yeah - a a decent script and a cron job will handle certain things much better (and for a fraction of the cost)
pydry - 4 hours ago

This is something I think some people are fundamentally not capable of understanding.

59nadir - 4 hours ago

This was one of the key insights in Stripe's explanations about Minions[0], their autonomous agent system; in-between non-deterministic LLM work they had deterministic nodes that handled quality assurance and so on in order to not leave those types of things to the LLMs.

0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...

- 2 hours ago

[deleted]

sudosteph - 2 hours ago

This is a good discussion topic. A lot of people really seem to believe that if you word a prompt just so, that you just need to throw a high-powered model at it, it will work consistently how you want. And maybe as models progress that might be the case. But right now, that's not how I've seen real life work out.

Even skills are not a catch-all, because besides the supply chain risk from using skills you pull from someone else, a lot of tasks require an assortment of skills.

I've accommodated this with my agent team (mostly sonnets fwiw) by developing what we call "operational reflexes". Basically common tasks that require multiple domains of expertise are given a lockfile defining which of the skills are most relevant (even which fragment of a skill) and how in-depth / verbose each element needs to be to accomplish the same task the same way, with minimal hallucinations or external sources.

A coordinator agent assigns the tasks and selects the relevant lockfile and sends it along or passes it along to another agent with a different specified lockfile geared towards reviewing.

It's a bit, but this workflow dramatically increased the quality of output for technical work I get from my agents and I don't really need to write many prompts myself like this.

mnalley95 - an hour ago

Own your control flow! A key point from 12 factor agents.

"One thing that I have seen in the wild quite a bit is taking the agent pattern and sprinkling it into a broader more deterministic DAG." - https://github.com/humanlayer/12-factor-agents/blob/main/REA...

glasner - 37 minutes ago

This exactly why I’m building aiki to be a control layer for harness execution. I don’t think the model companies will ever give us the neutral layer we need.

apalmer - 5 hours ago

Generally agree with this stance case in point: the breakthrough in ai coding was not that AI intelligence increased as much as that a lot of the core process execution moved out of the LLM prompt and into the harness.

2001zhaozhao - 29 minutes ago

If we need control flows, then designing these control flows ought to be the future of agent engineering

mhotchen - 32 minutes ago

HUMANS need control flow. It's a very effective strategy that has worked wonders in healthcare

stonewizard - 27 minutes ago

[dead]

zby - 2 hours ago

I concur - it does not make sense to do in llm prompts what can be done in code. Code is cheaper, faster, deterministic and we have lots of experience with working with code.

Especially all bookkeeping logic should move into the symbolic layer: https://zby.github.io/commonplace/notes/scheduler-llm-separa...

illwrks - 3 hours ago

I’ve been building a small ‘agent’ using copilot at work, partly a learning exercise as well as testing it in a small use case.

My personal opinion is that AI and agents are being misrepresented… The amount of setup, guidance and testing that’s required to create smarter version of a form is insane.

At the moment my small test is: Compressed instructions (to fit within the 8k limit) 9 different types of policies to guide the agent (json) 3 actual documents outlining domain knowledge (json) 8 Topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user) 3 Tools (to allow for connectors)

The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.

tim-projects - 4 hours ago

This is exactly the problem I've been working on and I see others are too. When you implement quality control gates, everything works better. It solves so many of the basic problems llms create - saying code is finished when it isn't. Skipping tests, introducing code regressions, basic code validation etc

I am finding that the better the quality gates are the lower quality llm you can use for the same result (at a cost of time).

Nizoss - an hour ago

Exactly! I don’t babysit TDD anymore. I have another agent that does that for me and honestly sometimes catches things I would have missed if I was the babysitting.
Hooks do wonders here. The payload contains a lot of information about the pending action the agent wants to make. Combine that with the most recent n events from the agent’s session history and you have a rich enough context to pass to another agent to validate the action through the SDK.
This way the validation uses the same subscription you’re logged in to, whether you’re using Claude Code, Codex, or Copilot. The validation agent responds with a json format that you can easily parse and return, allowing you to let the action through or block it with direction and guidance. I’m genuinely impressed by how well this works considering how simple it is.
You can find my approach here: https://github.com/nizos/probity

colek42 - an hour ago

We built https://aflock.ai/ (open source) to help with this. Constraining activity tends to work well

kenjackson - 3 hours ago

I feel like people forget that they're still allowed to program. You're still allowed to create workflows tying together LLMs and agents if you want. Almost all the tools and technology that existed before LLMs are still available to be used.

xuhu - 3 hours ago

It sounds like the "app written in C++ calling Lua scripts, versus app written in Lua calling C++ libraries" debate.

Both designs (Lightroom, game engines) have worked successfully.

There's probably nothing that prevents mixing both approaches in the same "app".

QuercusMax - 34 minutes ago

This pattern has been described for decades: https://wiki.c2.com/?AlternateHardAndSoftLayers. It's not just a matter of who's in control - you can layer these things.

jarboot - an hour ago

I think this is a good usecase for temporal + pydantic-ai

arbirk - 2 hours ago

I always wonder with these posts: - are they talking about coding (where I am the control flow) - or RPA agents (in which it is obvious) ? - also don't use llm for deterministic tasks

arian_ - 4 hours ago

Control flow tells the agent what it's allowed to do. It doesn't tell you what the agent actually did. Both matter. Everyone is building the permission layer. Almost nobody is building the verification layer.

kmad - 3 hours ago

This is, at least in part, the promise of frameworks like DSPy and PydanticAI. They allow you to structure LLM calls within the broader control flow of the program, with typed inputs and outputs. That doesn’t fix non-determinism, hallucinations, etc., but it does allow you to decompose what it is you’re trying to accomplish and be very precise about when an LLM is called and why.

astrobiased - 5 hours ago

It's the right direction, but control flow introduces limitations within a system that is quite adaptable to dynamic situations. The more control flow you try to do, the more buggy edge cases that pop up if done poorly.

Still have yet to see a universal treatment that tackles this well.

TuringTest - 3 hours ago

I would just reverse the architecture of the whole system. Build a classic deterministic program, and use LLMs as heuristics adapting the system to the environment - the functions that you call on the 'if's and 'switch' statements to decide where the system should go.
I see this as the most robust way to build a predictable system that runs in a controlled way while taking advantage of probabilistic AIs while reducing the impact of their alucinations.
LLMs simply can't be trusted to follow instructions in the general case, no matter how much you constraint them. The power of very large probabilistic models is that they basically solved the _frame problem_ of classic AI: logical reasoning didn't work for general tasks because you can't encode all common sense knowledge as axioms, and inference engines lost their way trying to solve large problems.
LLMs fix those handicaps, as they contain huge amounts of real world knowledge and they're capable of finding facts relevant to the problem at hand in an efficient way. Any autonomous system using them should exploit this benefit.

hmaxdml - 2 hours ago

We've found that durable workflows is a much needed primitive for agents control flow. They give a structure for deterministic replays, observability, and, of course, fault tolerance, that operators need to make the agent loop reliable.

briga - 4 hours ago

Sometimes it feels like Agents are just reinventing microservices. Except they are are doing it in the most inefficient way possible. It is certainly a good way for the LLM companies to sell more tokens

onion2k - 4 hours ago

Agents are probabilistic systems. A common mechanism to get a reliable answer from systems that can have variable output is to run them several times (ideally in separate, isolated instances) and then have something vote on the best result or use the most common result. This happens in things like rockets and aviation where you have multiple systems giving an answer and an orchestrator picking the result.

I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.

Yokohiii - 2 hours ago

An LLMs "wrong" decision is either systemic or biased. They learn "common sense" from human input (i.e. shared datasets, reinforcement learning). If a decision is flat out wrong for you, asking 10 LLMs is unlikely to help.
suprfnk - 4 hours ago

But then, if an agent picks the best response, how would you know that that is reliable?
- onion2k - 2 hours ago
  
  You could get the agents to output something structured and then use a deterministic test if you're worried about that.
- xienze - 3 hours ago
  
  Obviously you have multiple agents justify why they picked a certain response and then create another agent that picks the solution with the best justification.
  - kkyr - 3 hours ago
    
    touché

gardnr - 4 hours ago

This is straight outta 2023:

Agents aren't reliable; use workflows instead.

chandureddyvari - 3 hours ago

I had good success with hooks in claude code. Personally I feel this problem was common with humans as well. We added tools like husky for git commits, for our peers to push code which was linted, type checked etc.

I feel hooks are integral part of your code harness, that’s only deterministic way to control coding agents.

Nizoss - 28 minutes ago

I fully agree. Also started using husky before expanding further and created my own hooks. I can’t imagine myself using agents today without them, it would require a lot of babysitting.

cesarvarela - 2 hours ago

This will remain a persistent problem without a definitive answer until models move from generative tools to actual AI.

try-working - an hour ago

that's why you need a recursive workflow that creates its own artifacts per step that can later be used for verification.

Nizoss - 30 minutes ago

Sounds interesting, can you elaborate on your thinking? Got me curious.

terminalbraid - an hour ago

My friend, you have invented management.

Nizoss - 24 minutes ago

Not throwing shade at anyone here but the thought has definitely crossed my mind that we are recreating SAFe but for agents when looking at some of the orchestration setups out there. I think that it is better to not force the same hierarchical processes that worked for humans in large organizations onto agents and instead look at what they need to give better results and what their failure modes look like.

solomonb - 4 hours ago

I agree and I think a really wonderful way to encode agentic control flow would be with Polynomial Functors.

https://arxiv.org/abs/2312.00990

dnautics - 3 hours ago

Yes. Humans are also unreliable and nondeterministic (though certainly more reliable). Accordingly we have built software dev practices around this. I imagine it would be super useful for example to have a "TDD enforcer":

Phase 1: only test files may be altered, exactly one new test failure must appear.

Phase 2: only code files may be altered. The phase is cleared when the test now succeeds and no other tests fail.

If you get stuck, bail and ask for guidance

ManWith2Plans - 3 hours ago

I've been busy building and dogfooding open-artisan for my own development purposes. I've diverged quite a bit from main and am hoping to merge some of those changes back soon. It's basically an OpenCode plugin that forces open-code token-hungry state machine that tries to map the engineering process I follow, exposing only valid tools and states at every step of development. If you're interested, in following along or trying it out, it's available here:
https://github.com/yehudacohen/open-artisan/
Hopefully, I'll merge in my large structural changes in the next couple of weeks. These structural changes will enhance the state machine meaningfully, as well as adding support for hermes agenet.

- 2 hours ago

[deleted]

geon - 3 hours ago

How is this not obvious to everyone? It's like people forgot how to engineer.

aykutseker - 3 hours ago

all caps in a prompt is a code smell. when you're typing MANDATORY, you should be writing a wrapper, not refining the prose.

Nizoss - 21 minutes ago

Exactly! I have said this a couple of times but it was taken literally as in no capital letters or strong language. Glad to see someone else who shares this perspective.

_pdp_ - 3 hours ago

Or maybe, just maybe, LLMs do not run deterministicly and that is ok?

In the real world almost nothing runs like that - only software and even that has a lot of failures.

So perhaps rather than trying to make agents run deterministicly the goal is to assume some failure rate and find compensation control around it.

afxuh - 2 hours ago

thats why agents completes a project with the first 3 prompts, , then maintaining and fine-tuning it take ages till hits "-Session token expired"

ModernMech - 4 hours ago

Slowly and surely we are replacing AI with programming languages.

flowgrammer - 38 minutes ago

Fortunately I already created the best option for AI flow control. Hug away, Hackernoooz!

https://github.com/Far-World-Labs/Verblets

If you like it please add your own chain too! We need as many chains as there are words in the dictionary. (Many controversial views to be found here!)

PRs require prompts that generated the code please. Fork and use my specs if you can.

oinoom - 4 hours ago

this is just advocating for a harness, which has been the focus (along with evals) for at least the last three months by pretty much anyone working with agents professionally or seriously

eth415 - 5 hours ago

agreed - this is what we’ve been trying to build at scale.

https://github.com/salesforce/agentscript

ltbarcly3 - 2 hours ago

Don't listen to anyone who knows what should be done without proof. If someone 'knows' what agents 'need' then that knowledge is worth millions of dollars right now. If they haven't built it they are probably just talking shit.

droolingretard - 5 hours ago

Are you the guy who used to write MapleStory hacks?

yogthos - 4 hours ago

This was basically my realization as well. We are trying to get LLMs to write software the way humans do it, but they have a different set of strength and weaknesses. Structuring tooling around what LLMs actually do well seems like an obvious thing to do. I wrote about this in some detail here:

https://yogthos.net/posts/2026-02-25-ai-at-scale.html

flowgrammer - 15 minutes ago

My experimentation with Verblets also concluded plain functions are the most logical harness for LLMs.

encoderer - 4 hours ago

You can get a lot done with agentic programming without going "all in" on a gastown-like system, but I think there is a minimum viable setup:

1. an adversarial agent harness that uses one agent to create a plan and implement it, and another to review the plan and code-review each step.

2. an agentic validation suite -- a more flexible take on e2e testing.

3. some custom skills that explain how to use both of those flows.

With this in place you can formulate ideas in a chat session, produce planning artifacts, then use the adversarial system to implement the plans and the validation layer to get everything working e2e for human review.

There are a lot of tools you can use for these things but I chose to just build the tooling in the repo as I go.

Schiendelman - 4 hours ago

Claude already creates multiple agents for some projects just to keep context windows smaller. I don't think it'll be long before they offer a testing agent along with their planning agent.
- encoderer - 3 hours ago
  
  I prefer having codex author plans and implement, and claude play reviwer. I do swap them from time to time and i have a lot of respect for claude 4.6 and 4.7 but for my domain I think codex does a better job with the authoring.
  - Schiendelman - 3 hours ago
    
    That's a cool idea! Plus I bet you can stay in lower tiers with both?
    
    encoderer - 3 hours ago
    
    You're definitely burning more tokens with the back/forth and multi-step approach but assuming you swap who does the authoring from time to time you can definitely get the max out of each plan. Review doesn't use as many tokens.

- 2 hours ago

[deleted]

empath75 - 2 hours ago

I have heard this sort of thing a lot from people working with agents, and I just think it's fundamentally misguided as a way to think of them, and if you work with them this way, you are probably setting money on fire for no reason because the tasks you are able to perform this way _don't need agents to begin with_.

You might use an LLM api call here as a translation or summary step in a deterministic workflow, but they are not acting as agents, because they lack _agency_.

The value of using an agent harness is precisely that they are _not deterministic_. You provide agents a goal, tools and constraints and they do the task they were asked to perform as best as they can figure out how to do it. You may provide them deterministic workflows as tools they can call, but those workflows, outside of the agent harness itself, should not constrain what the agent does. You are paying a lot of money for agent reasoning, not to act as an expensive data transformation pipeline.

It may be the case that a lot of agentic workflows are more properly done with fully deterministic workflows, but the goal there should be to _remove the agents entirely_ and spend those tokens on non deterministic tasks that require agentic decision making.

I do think there are fundamental limits to what agents are capable of doing unsupervised and there does need to be a lot more human guidance, observability and control over what they are doing, but that's sort of the opposite of embedding them in deterministic workflows, that is more of team integration/communication problem to solve.

AIorNot - 5 hours ago

I mean we have Langgraph, BAML etc

fredcallagan - 21 minutes ago

[flagged]

pschw - 33 minutes ago

[dead]

- an hour ago

[deleted]

aditgupta - 3 hours ago

[dead]

lacymorrow - 3 hours ago

[flagged]