Without benchmarking LLMs, you're likely overpaying

karllorey.com

120 points by lorey a day ago


hamiltont - a day ago

Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale, use boolean criteria instead, then weight manually, e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of the judge's responses while still keeping the creativity (temperature) needed for good intuition.
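A minimal sketch of that scheme in Python (the criterion questions and weights are the ones from the comment above; judge_one stands in for whatever yes/no judge call you already make):

    # Weighted boolean-criteria scoring, as described above.
    # judge_one() is a placeholder for a single LLM-as-judge call that
    # answers one yes/no question about a response.

    CRITERIA = {
        "accuracy":   ("Did it cite the 30-day return policy?", 0.5),
        "tone":       ("Is the tone professional and empathetic?", 0.3),
        "next_steps": ("Did it offer clear next steps?", 0.2),
    }

    def judge_one(response: str, question: str) -> bool:
        """Ask the judge model `question` about `response` and parse its Y/N answer."""
        raise NotImplementedError  # plug in your existing judge call here

    def score(response: str) -> float:
        # Each criterion contributes its full weight if the judge says yes,
        # e.g. 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
        return sum(
            weight * float(judge_one(response, question))
            for question, weight in CRITERIA.values()
        )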

andy99 - a day ago

Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.

Another big problem is that it’s hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.

I’d be careful, is all.

wolttam - an hour ago

I'm consistently amazed at how much some individuals spend on LLMs.

I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.

I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.

verdverm - a day ago

I'd second this wholeheartedly

Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic a/b testing shows little to no difference in output on the majority of tasks I tried.

Cut all my subs, spend less money, don't get rate limited

Havoc - 4 hours ago

I’m also collecting the data on my side with the hope of later using it to fine-tune a tiny model. Unsure whether it’ll work, but if I’m using APIs anyway I may as well gather it and try to bottle some of the magic of the bigger models.
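For what it's worth, a minimal sketch of that kind of logging, assuming you just append each request/response pair to a JSONL file in a chat-style format (the file name and helper are my own placeholders):

    import json
    from pathlib import Path

    LOG_FILE = Path("finetune_data.jsonl")  # placeholder file name

    def log_example(system: str, user: str, assistant: str) -> None:
        """Append one prompt/response pair in chat-style JSONL so it can
        later be filtered and reused as fine-tuning data."""
        record = {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
                {"role": "assistant", "content": assistant},
            ]
        }
        with LOG_FILE.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")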

gridspy - a day ago

Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!

dizhn - 4 hours ago

I paid a total of 13 US dollars for all my LLM usage in about 3 years. Should I analyze my providers and see if there's room for improvement?

iFire - 4 hours ago

I love the user experience for your product. You're giving a free demo with results within 5 minutes and then encouraging the customer to "sign in" for more than 10 prompts.

Presumably that'll be some sort of funnel for a paid upload of prompts.

tantalor - 4 hours ago

> it's the default: You have the API already

Sorry, this just makes no sense to start off with. What do you mean?

empiko - a day ago

I do not disagree with the post, but I am surprised that a post that is basically explaining very basic dataset construction is so high up here. But I guess most people just read the headline?

ebla - 6 hours ago

Aren't you supposed to customize the prompts to the specific models?

deepsquirrelnet - a day ago

This is just evaluation, not “benchmarking”. If you haven’t set up evaluation on something you’re putting into production, then what are you even doing?

Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.

petcat - a day ago

> He's a non-technical founder building an AI-powered business.

It sounds like he's building some kind of AI support chatbot.

I despise these things.

OutOfHere - a day ago

You don't need a fancy UI to try the mini model first.

- a day ago
[deleted]
nickphx - a day ago

ah yes... nothing like using another nondeterministic black box of nonsense to judge / rate the output of another.. then charge others for it.. lol

epolanski - a day ago

The author of this post should benchmark his own blog for accessibility metrics; the text contrast is dreadful.

On the other hand, this would be interesting for measuring agents on coding tasks, but there's quite a lot of context to provide there; both input and output would be massive.