What I learned building an opinionated and minimal coding agent
mariozechner.at
403 points by SatvikBeri 2 days ago
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned on your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're about to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
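As a rough sketch of the tree idea described above (the names and shape here are mine, not how pi or the commenter's tool actually model it): each node owns a slice of conversation, side quests fork child nodes that inherit the parent's context, and only a distilled summary comes back up.

```python
from dataclasses import dataclass, field

@dataclass
class ContextNode:
    """One node in a tree-shaped conversation: a slice of messages plus child branches."""
    messages: list = field(default_factory=list)
    children: list = field(default_factory=list)
    parent: "ContextNode | None" = None

    def fork(self) -> "ContextNode":
        """Start a side quest that inherits the current context but can't pollute it."""
        child = ContextNode(messages=list(self.messages), parent=self)
        self.children.append(child)
        return child

    def merge_back(self, summary: str) -> None:
        """Bring only the distilled result of a side quest back to the parent."""
        if self.parent is not None:
            self.parent.messages.append({"role": "user", "content": summary})

# Build context in the root, fork for an exploration, return only the good stuff.
root = ContextNode()
root.messages.append({"role": "user", "content": "Carefully crafted context..."})
side_quest = root.fork()
side_quest.merge_back("Summary of what the side quest learned.")
```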
> want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff
My attempt - a minimalist graph format that is a simple markdown file with inline citations. I load MIND_MAP.md at the start of work, and update it at the end. It reduces context waste when resuming or spawning subagents. Memory across sessions.
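I don't have the commenter's actual file, but as a sketch of the idea, a mindmap-as-markdown with nodes, edges, and inline citations might look something like this (all node names and paths below are made up):

```markdown
# MIND_MAP

## auth-flow
- issues JWTs via middleware [src/auth/middleware.ts]
- -> session-store: tokens are validated against stored sessions

## session-store
- Redis-backed, 24h TTL [src/session/store.ts:12]
- -> auth-flow: invalidation is triggered from here on logout
```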
This is incredible. It never occurred to me to even think of marrying the memory gather and update slash commands with a mindmap that follows the appropriate nodes and edges. It makes so much sense.
I was using a table structure with column 1 as the key and column 2 as the data, and told the agents to match the key before looking at column 2. It worked, but sometimes it failed spectacularly.
I’m going to try this out. Thanks for sharing your .md!
Very very cool. Going to try this out on some of my codebases. Do you have the gist that helps the agent populate the mindmap for an existing codebase? Your pastebin mentions it, but I don't see it linked anywhere.
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work, I am looking forward to seeing how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, paying the per-token premium for an optimized experience will probably be a better deal than suffering Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework that is customizable and can be recursively improved by agents is going to beat a rigid proprietary client app.
> but when the market corrects and prices get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
> I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous.
I think this is absolutely true. There will likely be caps to stop the people running Ralph loops/GasTown with 20 clients 24/7, but for general use they will probably start to drop the API prices rather than vice-versa.
> We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think
Inference is generally accepted to be a very profitable business (outside the HN bubble!).
Claude Code subscriptions are more complicated of course but I think they probably follow the general pattern of most subscription software - lots of people who hardly use it, and a few who push it so hard that they lose money on them. Capping the usage solves the "losing money" problem.
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.
Good to know, that makes it even better. I still find Opus 4.5 to be the best model currently. But if the next generation of GPT/Gemini closes the gap, that will cross the inflection point for me and make 3rd-party harnesses viable. Or if they jump ahead, that should put more pressure on the Flicker Company to fix the flicker or relax the subscriptions.
Is this something that OpenAI explicitly approves per project? I have had a hard time understanding what their exact position is.
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
ChatGPT is quite different from GPT. Using GPT directly to have a nice dialogue simply doesn't work for most intents and purposes. Making it usable for a broad audience took quite some effort, including RLHF, which was not a trivial extension.
This is the first I'm hearing of this pi-agent thing and HOW DO TECH PEOPLE DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? sigh.
The creator is very aware. Its original name was "shitty coding agent".
then do SCA and backronym it into something acceptable! That's even better lore :)
There's a fair chunk of irony here in that Mario is being both anti-memetic with his naming choices and contrarian in his design decisions, and yet he still finds himself dunked in the muck of popularity as the backbone of OpenClaw.
You mean Software Component Architecture? Do you want to bring down the wrath of IBM!
From the article: "So what's an old guy yelling at Claudes going to do? He's going to write his own coding agent harness and give it a name that's entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be?"
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI studio which makes an API call to count_tokens every time you type in the prompt box.
AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
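For contrast, this is roughly what counting tokens against the Gemini API looks like with the google-generativeai Python client; the point is that it's a network round trip rather than a local tokenizer (the model name and key handling below are placeholders):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; normally read from the environment
model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name

# There is no local tokenizer, so this is an HTTP call to the count_tokens endpoint
# every single time -- exactly what AI Studio does as you type.
response = model.count_tokens("How many tokens is this prompt?")
print(response.total_tokens)
```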
> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.
At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".
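To make that concrete: on macOS this means each command the model runs is wrapped in a kernel-enforced Seatbelt profile. A heavily simplified sketch of the mechanism, not Codex's actual profile (which allows far more, or real tools wouldn't even start), and with a made-up workspace path:

```python
import subprocess

# Simplified Seatbelt profile: deny everything by default, then allow reads plus
# writes inside a single workspace directory. Real profiles are much more permissive
# about things like mach lookups and sysctl reads.
PROFILE = """
(version 1)
(deny default)
(allow process-fork)
(allow process-exec)
(allow file-read*)
(allow file-write* (subpath "/Users/me/workspace"))
"""

def run_sandboxed(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command under a kernel-enforced sandbox via macOS's sandbox-exec."""
    return subprocess.run(["sandbox-exec", "-p", PROFILE, *cmd],
                          capture_output=True, text=True)

print(run_sandboxed(["ls", "/Users/me/workspace"]).stdout)
```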
Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.
"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.
You’ll just end up approving things blindly, because 95% of what you’ll read will seem obviously right and only 5% will look wrong. I would prefer to let the agent do whatever it wants for 15 minutes and then look at the result rather than having to approve every single command it runs.
Works until it has access to write to external systems and your agent is slopping up Linear or GitHub without you knowing, identified as you.
Sure; I mean this is what I _would like_; I’m not saying this would work 100% of the time.
That kind of blanket demand doesn't persuade anyone and doesn't solve any problem.
Even if you get people to sit and press a button every time the agent wants to do anything, you're not getting the actual alertness and rigor that would prevent disasters. You're getting a bored, inattentive person who could be doing something more valuable than micromanaging Claude.
Managing capabilities for agents is an interesting problem. Working on that seems more fun and valuable than sitting around pressing "OK" whenever the clanker wants to take actions that are harmless in a vast majority of cases.
This is like having a firewall on your desktop where you manually approve each and every connection.
Secure, yes? Annoying, also yes. Very error-prone too.
It’s not just annoying; at scale it makes using the agent CLIs impossible. You can tell someone spends a lot of time in Claude Code: they can type --dangerously-skip-permissions with their eyes closed.
Yep. The agent CLIs have the wrong level of abstraction. Needs more human in the loop.
It's not reliable. The AI can just not prompt you to approve, or hide things, etc. AI models are crafty little fuckers and they like to lie to you and find secret ways to do things with ulterior motives. This isn't even a prompt injection thing, it's an emergent property of the model. So you must use an environment where everything can blow up and it's fine.
The harness runs the tool call for the LLM. It is trivial to not run the tool call without approval, and many existing tools do this.
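For example, a harness-side gate is only a few lines; the tool names and the `execute` helper below are hypothetical, but the shape is the same in any agent loop:

```python
READ_ONLY_TOOLS = {"read_file", "list_dir", "grep"}  # hypothetical tool names

def run_tool(name: str, args: dict, execute) -> dict:
    """Only execute non-read tool calls after explicit user approval.
    `execute` is whatever callable the harness uses to actually perform the call."""
    if name not in READ_ONLY_TOOLS:
        print(f"Agent wants to run: {name}({args})")
        if input("Approve? [y/N] ").strip().lower() != "y":
            # The refusal goes back to the model as the tool result.
            return {"error": "tool call denied by user"}
    return execute(name, args)
```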
My Codex just uses Python to write files around the sandbox when I ask it to patch an SDK outside its path.
It's definitely not a sandbox if you can just "use python to write files" outside of it o_O
Hence the article’s security theatre remark.
I’m not sure why everyone seems to have forgotten about Unix permissions, proper sandboxing, jails, VMs etc when building agents.
Even just running the agent as a different user with minimal permissions and jailed into its home directory would be simple and easy enough.
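A minimal sketch of that (Unix-only, must be started as root; the "agent" account and jail path are assumptions):

```python
import os
import pwd

def jail_agent(username: str = "agent", jail: str = "/home/agent") -> None:
    """Confine the current process: chroot into the jail, then drop to an unprivileged user."""
    info = pwd.getpwnam(username)   # dedicated low-privilege account for the agent
    os.chroot(jail)                 # the jail directory becomes the filesystem root
    os.chdir("/")                   # don't keep a working directory outside the jail
    os.setgid(info.pw_gid)          # drop group privileges first...
    os.setuid(info.pw_uid)          # ...then user privileges (irreversible)

jail_agent()  # after this, spawn the agent process with no way back out
```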
I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.
`chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.
> I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.
I don't think there is such a good heuristic. The user wants the agent to do the right thing and not to do the wrong thing, but the capabilities needed are identical.
> `chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.
That's a good, safe, and sane default for project-focused agent use, but it seems like those playing it risky are using agents for general-purpose assistance and automation. The access required to do so chafes against strict sandboxing.
Here's OpenAI's docs page on how they sandbox Codex: https://developers.openai.com/codex/security/
Here's the macOS kernel-enforced sandbox profile that gets applied to processes spawned by the LLM: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
I think skepticism is healthy here, but there's no need to just guess.
That still doesn't seem ideal. Run the LLM itself in a kernel-enforced sandbox, lest it find ways to exploit vulnerabilities in its own code.
The LLM inference itself doesn't "run code" per se (it's just doing tensor math), and besides, it runs on OpenAI's servers, not your machine.
There still needs to be a harness running on your local machine to spawn the processes in their sandboxes. I consider that "part of the LLM" even if it isn't doing any inference.