Letting Claude play text adventures

38 points by varjag 5 days ago

This is a great idea and great work.

Context is intuitively important, but people rarely put themselves in the LLM's shoes.

What would be eye-opening would be to create an LLM test system that periodically sends a turn to a human instead of the model. Would you do better than the LLM? What tools would you call at that moment, given only that context and no other knowledge? The way many of these systems are constructed, I'd wager it would be difficult for a human.

The agent can't decide what is safe to delete from memory because it's a sort of bystander at that moment. Someone else made the list it received, and someone else will get the list it writes. The logic that went into why the notes exist is lost. LLMs are living the Christopher Nolan film Memento.

lukev - 10 minutes ago

This is a great framework to experiment with memory architectures.

Everything the author says about memory management tracks with my intuition of how CC works, including my perception that it isn't very good at explicitly managing its own memory.

My next step in trying to get it to work well on a bigger game would be to try to build a more "intuitive" memory tool, where the textual description of a room or an item would automatically RAG previous interactions with that entity into context.

That also is closer to how human memory works -- we're instantly reminded of things via a glimpse, a sound, a smell... we don't need to (analogously) write in or search our notebook for basic info we already know about the world.

brimtown - 22 minutes ago

I’m currently letting Claude build and play its own Dwarf Fortress clone, as an installable plugin in Claude Code

https://github.com/brimtown/claude-fortress

pflenker - an hour ago

For a game like anchorhead, which is famous in its niche, shouldn’t Claude already know it sufficiently to just solve it right away? I would expect that its data source contained multiple discussions and walkthroughs of the game.

ratg13 - an hour ago

It's very likely the model didn't stop to question if the game they were playing was something they knew already, and just assumed it was a puzzle created for it.
- sfjailbird - 26 minutes ago
  
  You can see Claude's responses in the repo. The first one is:
  Ah, Anchorhead! One of the most celebrated pieces of interactive fiction ever written

sfjailbird - 33 minutes ago

Cool! I would like to see the game sessions.

Edit: they are there in the repo: https://github.com/eudoxia0/claude-plays-anchorhead/tree/mas...

skybrian - an hour ago

It seems like asking Claude to keep notes somehow would work better. An AGENTS file and a TODO file? An issue tracker like beads? Lots of things to try.

tiahura - 31 minutes ago

Claude code, nethack, and tmux are fun to experiment with.

imiric - an hour ago

> By the time you get to day two, each turn costs tens of thousands of input tokens

This behavior surprised me when I started using LLMs, since it's so counterintuitive.

Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction. Could this data be compressed in a way to keep the most important bits, and garbage collect everything else? Could there be different compression techniques depending on the type of conversation? Similar to the domain-specific memories and episodic memory mentioned in the article. Could "snapshots" be supported, so that the user can explore branching paths in the session history? Some of this is possible by manually managing context, but it's too cumbersome.

Why are all these relatively simple engineering problems still unsolved?

iamjackg - an hour ago

It's not unsolved, at least not the first part of your question. In fact it is a feature offered by all main LLM providers!
- https://platform.openai.com/docs/guides/prompt-caching
- https://platform.claude.com/docs/en/build-with-claude/prompt...
- https://ai.google.dev/gemini-api/docs/caching
- imiric - 37 minutes ago
  
  Ah, that's good to know, thanks.
  But then why is there compounding token usage in the article's trivial solution? Is it just a matter of using the cache correctly?
  - StevenWaterman - 29 minutes ago
    
    Cached tokens are cheaper (90% discount ish) but not free
    
    moyix - 8 minutes ago
    
    Also, unlike OpenAI, Anthropic's prompt caching is explicit (you set up to 4 cache "breakpoints"), meaning if you don't implement caching then you don't benefit from it.