Show HN: Statewright – Visual state machines that make AI agents reliable

github.com

124 points by azurewraith 4 days ago


Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I've spent 20+ years in the trenches across full-stack engineering, DevOps, high-performance computing, and ML, with stints at NVIDIA, AMD, and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach and used smaller models: models in the 13-20B parameter range, set to work on real SWE-bench problems. I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets, and which transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega-edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.
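The enforcement idea can be sketched in a few lines. This is a hypothetical illustration only, not Statewright's actual engine (which is written in Rust); the state names and tool names are made up:

```python
# Hypothetical sketch of per-state tool gating. States whitelist tools and
# legal transitions; anything else is rejected by the harness, not the prompt.

STATES = {
    "plan":      {"tools": {"read_file", "grep"},              "next": {"implement"}},
    "implement": {"tools": {"read_file", "edit_file", "bash"}, "next": {"test"}},
    "test":      {"tools": {"bash_test"},                      "next": {"implement", "done"}},
    "done":      {"tools": set(),                              "next": set()},
}

class Workflow:
    def __init__(self, start="plan"):
        self.state = start

    def call_tool(self, tool):
        # The guardrail: a tool outside the current state's whitelist is
        # blocked structurally, regardless of what the model asked for.
        if tool not in STATES[self.state]["tools"]:
            return f"blocked: '{tool}' not allowed in state '{self.state}'"
        return f"ok: ran '{tool}'"

    def transition(self, to):
        if to not in STATES[self.state]["next"]:
            raise ValueError(f"invalid transition {self.state} -> {to}")
        self.state = to

wf = Workflow()
print(wf.call_tool("edit_file"))   # blocked while planning
wf.transition("implement")
print(wf.call_tool("edit_file"))   # allowed now
```

The point of the sketch is that the deny/allow decision lives in deterministic code the model cannot talk its way around.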

The results were more promising than I would have expected. The improvements were consistent across multiple model families, irrespective of age (qwen-coder, gpt-oss, gemma4), above a 13B-parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly, this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight, and Opus solves more reliably with fewer tokens and fewer death spirals. Fine-tuning did not yield these kinds of functional improvements for me. The takeaway, it seems, is that context window utilization matters more than raw context size: a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining non-deterministic LLMs with deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards, and tool restrictions. The orchestration doesn't use an LLM; it just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor, and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase, and transitions when conditions are met. Importantly, it tells the model when it's attempting something that's out of scope or incorrect, or when it needs to try something else after getting stuck.

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.

Statewright is currently live with a free tier, try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor, the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.

embedding-shape - 4 days ago

I wanted to try to reproduce the research results (https://github.com/statewright/statewright#research-results) locally but I wasn't able to find the code for it. Have you published the code for running those somewhere?

The research page (https://statewright.ai/research) mentions a patent, and a "core engine";

> Provisional patent application filed: #64/054,240 (April 30, 2026). 35 claims covering state machine guardrail enforcement for LLM agent tool access. The core engine remains Apache 2.0 open source.

I'm not sure I understand what the "core engine" is if it's not the "state machine guardrail runtime", which is what the patent covers. What parts are the open source parts, exactly?

I find the idea really interesting and was nodding along as I read what you wrote. It makes sense both for the human and the agent, and it seems like a really nice idea that would help, but the patent kind of makes me want to run away and not look into it too deeply.

giancarlostoro - 4 days ago

Interesting. I built a ticketing system similar to Beads, which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower, and I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow as a slug. I don't even care if I'm waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!

In any case, I'll have to check out Statewright after work ;)

redhale - 4 days ago

I feel like caching should be mentioned in tradeoffs, right? If you change the tool list frequently, that's a cache bust. In long sessions that seems like it could significantly affect costs.
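The concern above is that provider-side prompt caches key on an exact prefix match, and the tool definitions sit near the top of the prompt, so swapping the tool list on a state transition invalidates everything cached after that point. A toy model of that cache-key behavior (the hashing scheme here is illustrative, not any specific provider's):

```python
# Toy model of prefix-keyed prompt caching: changing the tool list changes
# the cache key, so per-state tool swaps are cache busts.

import hashlib

def cache_key(system_prompt, tools):
    # Stand-in for a provider hashing the serialized prefix (system prompt
    # plus tool definitions) to decide whether cached KV state is reusable.
    blob = system_prompt + "|".join(sorted(tools))
    return hashlib.sha256(blob.encode()).hexdigest()

k_plan = cache_key("You are a coding agent.", ["read_file", "grep"])
k_impl = cache_key("You are a coding agent.", ["read_file", "edit_file", "bash"])

print(k_plan == k_impl)  # False: the implementation state misses the plan-state cache
```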

tim-projects - 4 days ago

I'm fully convinced that state machines are the key to getting low-powered LLMs to produce good-quality code.

addaon - 3 days ago

I’ve been using a pattern similar to this with near-frontier models to solve problems harder than coding. Structurally things are even more extreme — no tool calling allowed. Each state gives structured output that the harness then uses to derive the next state and context. So a context in one state may say “you have these lemmas with definition visible, and these by name in other files”; the agent from a certain state can consume the visible lemmas, but can also modify includes to get visibility into and ability to use other lemmas after iteration. So far, seems sane, but haven’t benchmarked on this problem against more free-form solutions.
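A minimal sketch of the pattern this comment describes, with no tool calls at all: the model emits structured output, and the harness alone derives the next state and what becomes visible. The `model_step` function is a stand-in stub; the lemma names and JSON shape are invented:

```python
# Sketch of structured-output-driven state derivation: the harness, not the
# model, grants visibility and picks the next state. model_step is a stub
# standing in for an LLM call that must return JSON of this shape.

import json

def model_step(state, context):
    # A real harness would validate the JSON and retry on malformed output.
    if state == "survey":
        return json.dumps({"action": "open", "lemmas": ["lemma_a"]})
    return json.dumps({"action": "finish"})

def harness(context):
    state = "survey"
    while state != "done":
        out = json.loads(model_step(state, context))
        if out["action"] == "open":
            # Only the harness can expand what the model can see.
            context["visible"] += out["lemmas"]
            state = "prove"
        elif out["action"] == "finish":
            state = "done"
    return context

print(harness({"visible": []}))  # -> {'visible': ['lemma_a']}
```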

DeathArrow - 4 days ago

First thought: But why do we need statewright.ai external api? Why can't we do everything locally?

Second thought: enforcing tools is useful and I built myself a Pi extension to deny access to particular tools in some workflows.

But we somehow need to force agents to obey the rules.

For example, I have rules when using Pi that ask the main agent to dispatch implementer agents in parallel using git worktrees. Sometimes it uses git worktrees, sometimes not.

The thoughts are like this: "the user asked me to use git worktrees so let me start using git worktrees. But wait, the task is simple so maybe I don't need git worktrees..."

If I ask why it didn't follow the rules, it says something like: "The user is right, I should have followed the rules..."

tecoholic - 3 days ago

Very cool idea. I had something vaguely similar in mind. It's nice to see someone go ahead and implement it. All the Claude Code animations, and not knowing what's happening, how long it will take, or what will come out, is really frustrating me. On top of that, there is no way to actually limit the scope of things. Opencode's plan mode and build mode help a bit.

If a state machine can improve a local LLM to produce better results, it's a welcome addition for tinkerers and solo devs.

2001zhaozhao - 4 days ago

Interesting.

In your GitHub repo, the JSON format shown for defining custom workflows is very simple. I wonder if that limits the detail of the state-related instructions and error messages you can send to a model.

For example, in state transitions, does your tool just tell the model something like "you are in 'act' mode and no longer in 'plan' mode, here are your new available tools"? Seems difficult to give it any more informative messages given how simple the workflow definitions are. Likewise when the model attempts to do something that's not supported for tools in the given phase.

fizza_pizza - 3 days ago

This actually makes a lot of sense. Feels like most people are trying to brute force reliability with bigger models while you’re reducing the problem space instead. “Agents are suggestions, states are laws” is such a good line too.

nextaccountic - 4 days ago

In https://github.com/statewright/statewright/blob/main/docs/im...

what's the difference between a "transition" (purple line, not shown in the workflow) as opposed to happy path / failure?

esafak - 4 days ago

I just have a smart model write a testable phased plan, have a cheaper model implement them, and yet another model to review each phase. I don't see the value of adding a Rust state engine. Algorithmically verifiable things can be tests, and more nebulous things (like pattern compliance) need an LLM to do the heavy lifting and can make mistakes, so what does the state engine buy you?

password4321 - 4 days ago

Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.

miki_tyler - 4 days ago

Very nice project!

Is the editor/composer separate from the runtime?

If I build a workflow in the visual editor, can I use that same flow inside my own app just by using the runtime/engine? Or is it mainly tied to the Statewright platform and Claude Code plugin?

I’m wondering if the runtime can be used as a standalone piece to power apps I build.

aitchnyu - 2 days ago

My Kilocode has error messages like "you have called edit for a file you have not read". Did you make an evolved version of this?

prunrCloud - 3 days ago

Really interesting approach. My only concern would be how much flexibility gets lost when workflows become too rigid. Curious how it performs on tasks that require more creative exploration.

dataworth - 3 days ago

Visualizing agentic problem solving is a really cool concept. Feels like something I’ve seen on TV or something before. I like it.

davidkpiano - 4 days ago

Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!

chris_st - 4 days ago

Please add support for the Windsurf editor as well. Thanks!

brainless - 4 days ago

I have to check how you are using state machines but I have also been focused on small models for a while now.

nocodo is one of my product experiments, currently using 120B model but I have tested a few agents inside it with 20B models.

I create a bunch of agents, each with very specific goals. Like Project Manager, Backend Engineer, etc.

Each agent gets a very compact list of tools and access to only certain parts of the filesystem or commands.

https://github.com/brainless/nocodo/tree/main/agents/src

azurewraith - 3 days ago

Hey it's me again. Some things that didn't fit in the README or the original post -- less about features, more about where this goes.

The plan/implement/test workflow is very basic and represents the most common agentic use case. But the state machine pattern applies to any multi-step work where agents are useful but susceptible to death spirals, hallucinations, or other non-deterministic quirkiness. This also enables Claude Desktop and other non-coding agents to perform useful constrained work.

I've been building a content pipeline for tabletop publishing and tested it a bit yesterday. A research phase gathers lore and game details from a compendium; a drafting phase generates structured content, including schema-specific JSON validation (so my Lua+LaTeX templates work without iterating). A review gate has me editing content directly (a tmux+neovim dialog is great for this). The agent shapes the content and makes sure it conforms to the JSON schema and content requirements, then I write it. Before I adapted the state machine to it, the agent tried to do everything all at once; calling multiple agents is sometimes effective, but details get lost and you definitely lose visibility in the summarization. The state machine runs everything serially (for now), but chaining and parallelization are on the roadmap.
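The drafting-phase gate described above can be sketched as a validation check that decides the next state. The schema fields here are invented for illustration; they are not the actual pipeline's schema:

```python
# Hedged sketch of a schema-validation gate: the workflow only advances to
# review when the draft parses and matches the (invented) schema; otherwise
# it loops back to redraft.

import json

SCHEMA_FIELDS = {"title": str, "lore": str, "stat_block": dict}

def validate(draft_json):
    try:
        draft = json.loads(draft_json)
    except json.JSONDecodeError:
        return False
    return all(isinstance(draft.get(k), t) for k, t in SCHEMA_FIELDS.items())

def drafting_gate(draft_json):
    return "review" if validate(draft_json) else "draft"

good = json.dumps({"title": "Gloom Warren", "lore": "...", "stat_block": {"hp": 12}})
print(drafting_gate(good))        # -> review
print(drafting_gate("{broken"))   # -> draft
```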

While working with Statewright on a different workflow over the weekend, Claude (as Claude does) attempted to write an intricate bash script to work around a guardrail... and Statewright blocked it! I think that was when I knew there was some real power behind what's been built here. Enforcement has to be structural, not advisory.

Also, since it's generally useful for things besides coding, you can start to think about things like SOC 2 change management. Every change needs a plan, a human review gate, an audited implementation, a pull request, a review, human approval, and then finally a human to approve a production deployment. Today, teams enforce this with checklists and hope. An agent constrained by a workflow that won't let it deploy without all the prerequisite pieces gives you enterprise delivery with an auditable paper trail and humans injected for approvals where they need to be - not managing each change's lifecycle.

The piece I'm most excited about is agent-generated workflows. You solve a problem once while maintaining your context, then point the agent at the JSON schema, and it creates and uploads a new workflow to Statewright automatically that you can use immediately. No fine-tuning, no exhaustive prompt engineering, no dozens of agents... just best-fit lightweight guardrails that agents help build themselves, compiling your intent into structure the models can't weasel their way out of. This is fundamentally different from what the current state of the art is practicing. I think that's a big deal.
