How much of my observability data is waste?

usetero.com

109 points by binarylogic 20 hours ago


physicles - 26 minutes ago

I can't get over how expensive these observability platforms are.

Last I looked (and looked again just now), if we were to take all our structured logs from all services and send them to Datadog with our current retention policy, it would just about double our current IT spend.

Instead, we use Grafana + Loki + ClickHouse and it's been mostly maintenance-free for years. Costs under $100/month.

What am I missing? What's the real value that folks are getting out of these platforms?

karianna - 17 hours ago

Hard agree on the data waste. The noise-to-signal ratio is typically very high, and processing, shipping, and storing all of that data costs a ton.

A previous start-up I worked at (jClarity, exited to Microsoft) mitigated much of this by collecting only the tiny amount of data that really mattered for a performance-bottleneck investigation into a ring buffer, and only processing, shipping, and storing that data if a bottleneck trigger occurred (plus occasional baselines).

It allowed our product at the time (Illuminate) to run at massive scale without costing our customers an arm and a leg or impacting their existing infrastructure. We charged on the value of the product reducing MTTR, not on how much data was being chucked around.

There was the constant argument against this approach in favor of always-on observability, or "collect all data just in case", but with a good model (in our case, something called the Java Performance Diagnostic Method) we never missed having the noise.
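
Roughly, the shape of it looks like this (a toy Python sketch, not what we actually shipped; the buffer size, trigger condition, and ship() destination are all invented for illustration, and the real thing was JVM-side):

    import collections
    import time

    class TriggeredTelemetry:
        # Keep recent samples in a fixed-size ring buffer; only ship
        # them when a bottleneck trigger fires (plus periodic baselines).

        def __init__(self, capacity=10_000, baseline_every_s=3600):
            self.buffer = collections.deque(maxlen=capacity)  # the ring buffer
            self.baseline_every_s = baseline_every_s
            self.last_baseline = time.monotonic()

        def record(self, sample):
            self.buffer.append(sample)  # cheap; old samples just fall off
            if self.is_bottleneck(sample):
                self.ship(list(self.buffer), reason="trigger")
            elif time.monotonic() - self.last_baseline > self.baseline_every_s:
                self.ship(list(self.buffer), reason="baseline")
                self.last_baseline = time.monotonic()

        def is_bottleneck(self, sample):
            # Invented trigger: latency over a threshold.
            return sample.get("latency_ms", 0) > 500

        def ship(self, samples, reason):
            # Stand-in for the real exporter. Only called rarely, so
            # almost nothing leaves the host in normal operation.
            print(f"shipping {len(samples)} samples ({reason})")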

jldugger - 18 hours ago

>Turns out you can compile tens of thousands of patterns and still match at line rate.

Well, yeah, that's sort of the magic of the regular expression <-> NFA equivalence theorem. Any regex can be converted to a state machine. And since you can combine regexes (and NFAs!) procedurally, this is not a surprising result.
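
The combining part is a one-liner, something like this (patterns invented; note that Python's re is a backtracking engine, so for guaranteed linear-time matching at real line rate you'd reach for something like RE2 or Hyperscan):

    import re

    # A few invented patterns; in practice you'd have tens of thousands.
    patterns = [
        r"connection reset by peer",
        r"OOMKilled",
        r"deadline exceeded after \d+ms",
    ]

    # Combine them procedurally into a single alternation, so each log
    # line is scanned once instead of once per pattern.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))

    with open("app.log") as f:
        for line in f:
            if combined.search(line):
                print(line, end="")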

> I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.

I'm surprised it's only 40%. Observability seems to be treated like a fire suppression system: all-important in a crisis, but it looks like waste during normal operations.

> The AI can't find the signal because there's too much garbage in the way.

There are surprisingly simple techniques to filter out much of the garbage: compare logs from known-good periods to known-bad ones, and look for the stuff that's strongly associated with bad. The precise techniques seem Bayesian in nature, in that the more evidence (log lines) you gather, the stronger the association appears.
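
A toy version of that comparison (the template normalization and the smoothing are my own simplifications; real systems use proper template miners like Drain):

    import math
    import re
    from collections import Counter

    def template(line):
        # Crude normalization: collapse numbers and hex so similar
        # lines count as the same template.
        return re.sub(r"0x[0-9a-f]+|\d+", "<*>", line.strip())

    def association_scores(good_lines, bad_lines, smoothing=1.0):
        good = Counter(template(l) for l in good_lines)
        bad = Counter(template(l) for l in bad_lines)
        vocab = set(good) | set(bad)
        n_good, n_bad = sum(good.values()), sum(bad.values())
        scores = {}
        for t in vocab:
            # Smoothed log-odds of a template showing up in the bad
            # window vs the good one; more lines -> sharper estimates.
            p_bad = (bad[t] + smoothing) / (n_bad + smoothing * len(vocab))
            p_good = (good[t] + smoothing) / (n_good + smoothing * len(vocab))
            scores[t] = math.log(p_bad / p_good)
        return sorted(scores.items(), key=lambda kv: -kv[1])

Templates at the top of the list are your suspects.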

More sophisticated techniques will do dimensional analysis -- are these failed requests associated with a specific pod, availability zone, locale, software version, query string, or customer? etc. But you'd have to do so much pre-analysis, prompting, and tool calling that the LLMs that comprise today's AI won't provide any actual value.
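
The mechanical half of that is simple enough (field names invented; in practice these would come from your trace or log attributes):

    from collections import defaultdict

    def failure_skew(requests, dimensions):
        # For each dimension (pod, AZ, version, ...), find the values
        # whose failure rate deviates most from the overall rate.
        # Each request is a dict with a 0/1 "failed" flag.
        overall = sum(r["failed"] for r in requests) / len(requests)
        findings = []
        for dim in dimensions:
            counts = defaultdict(lambda: [0, 0])  # value -> [failed, seen]
            for r in requests:
                c = counts[r.get(dim, "unknown")]
                c[0] += r["failed"]
                c[1] += 1
            for value, (failed, seen) in counts.items():
                if seen >= 20:  # arbitrary floor against tiny-sample noise
                    findings.append((dim, value, failed / seen - overall))
        return sorted(findings, key=lambda f: -abs(f[2]))

    # e.g. failure_skew(reqs, ["pod", "az", "version", "customer"])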

gmuslera - 16 hours ago

Reminds me of a point I heard about backups. You don't actually want backups; they are a waste of time, bandwidth, and disk space, and by far most if not all of them will end up discarded without ever being used. What you really want is something to restore from if anything breaks. That is the cost that should matter to you. What if you don't have anything meaningful to restore from?

With observability, the point is not the volume of data, time, and bandwidth spent on it; it is being able to understand your system and properly diagnose and solve problems when they happen. Can you do that with less? For the next problem, the one you don't know about yet? If you can't, because of information you lacked or didn't collect, then spending so much maybe still wasn't enough.

Of course, some ways of doing it are more efficient (toward the end result) than others. But having the needed information available, even if it is never used, is the real goal here.

smithclay - 18 hours ago

Kudos to Ben for speaking to one of the elephants in the room in observability: data waste and the impact it has on your bill.

All major vendors have a nice dashboard and sometimes alerts to understand usage (broken down by signal type or tags) ... but there's clearly a need for more advanced analysis, which Tero seems to be going after.

Speaking of the other elephant in the room in observability: why does storing data with a vendor cost so much in the first place? With most new observability startups choosing to store data in columnar formats on cheap object storage, I think this is also getting challenged in 2026. The combination of cheap storage with meaningful data could breathe some new life into the space.
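
The economics are easy to see in miniature (a hypothetical sketch with pyarrow; the fields and bucket layout are invented):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # An invented batch of log records. Columnar layout means the highly
    # repetitive fields (service, level) compress extremely well.
    batch = pa.table({
        "ts": [1718000000000, 1718000000123],
        "service": ["checkout", "checkout"],
        "level": ["ERROR", "INFO"],
        "message": ["connection reset by peer", "request served"],
    })

    # Local path here; pointing this at s3://bucket/... works too once
    # an fsspec-compatible filesystem is configured.
    pq.write_table(batch, "logs/dt=2026-01-01/part-0.parquet",
                   compression="zstd")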

Excited to see what Tero builds.

stackskipton - 19 hours ago

As an Ops (DevOps/sysadmin/SRE-ish) person here: excellent article.

However, as always, the problem is more political than technical, and those are the hardest problems to solve; another service with another cost won't solve it, IMO. That said, there is plenty of money to be made in attempting to, so go get that bag. :)

At the end of the day, it comes back to the DevOps mentality, and that has never caught on at most companies. Devs don't care, project managers want us to stop blocking feature velocity, and we are not properly staffed since we are a "massive wasteful cost center".