Prompt caching: 10x cheaper LLM tokens, but how?

ngrok.com

86 points by samwho 3 days ago


coderintherye - 19 minutes ago

Really well done article.

I'd note that when I gave the input/output screenshot to ChatGPT 5.2, it failed (with lots of colorful chain of thought), though Gemini got it right away.

est - 3 hours ago

This is a surprisingly good read on how LLMs work in general.

simedw - 3 days ago

Thanks for sharing; you clearly spent a lot of time making this easy to digest. I especially like the tokens-to-embedding visualisation.

I recently had some trouble converting an HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
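
For anyone hitting the same slowdown: the KV cache just means keeping each layer's attention keys and values from earlier steps, so each new step only has to run a forward pass over the newest token. Below is a minimal sketch of that pattern using the Hugging Face transformers `use_cache` / `past_key_values` API; the "gpt2" model, prompt, and greedy decoding are only illustrative and not simedw's actual Core ML setup.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model and prompt; any Hugging Face causal LM works the same way.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    input_ids = tokenizer("Prompt caching works by", return_tensors="pt").input_ids
    past_key_values = None  # the KV cache, populated by the first forward pass
    generated = []

    with torch.no_grad():
        for _ in range(50):
            out = model(input_ids=input_ids,
                        past_key_values=past_key_values,
                        use_cache=True)
            past_key_values = out.past_key_values        # reuse cached K/V next step
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_token.item())
            input_ids = next_token                       # feed only the new token

    print(tokenizer.decode(generated))
    # Dropping past_key_values and feeding the full sequence each step still works,
    # but every step re-runs attention over the entire prefix, so per-token cost
    # grows with sequence length -- the slowdown described above.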

Youden - 2 days ago

Link seems to be broken: the content briefly loads, then is replaced with "Something Went Wrong" and then "D is not a function". It stays broken with adblock disabled.