How much of my observability data is waste?
usetero.com | 109 points by binarylogic 20 hours ago
I can't get over how expensive these observability platforms are.
Last I looked (and looked again just now), if we were to take all our structured logs from all services and send them to Datadog with our current retention policy, it would just about double our current IT spend.
Instead, we use Grafana + Loki + ClickHouse and it's been mostly maintenance-free for years. Costs under $100/month.
What am I missing? What's the real value that folks are getting out of these platforms?
Hard agree on the data waste; the noise-to-signal ratio is typically very high, and processing, shipping, and storing all of that data costs a ton.
A previous start-up I worked at (jClarity, exited to Microsoft) mitigated much of this by only collecting, in a ring buffer, the tiny amount of data that really mattered for a performance bottleneck investigation, and only processing, shipping, and storing that data if a bottleneck trigger fired (plus occasional baselines).
It allowed our product at the time (Illuminate) to run at massive scale without costing our customers an arm and a leg or impacting their existing infrastructure. We charged on the value of the product reducing MTTR, not on how much data was being chucked around.
There was the constant argument against this approach from the always-on observability or "collect all data just in case" crowd, but with a good model (in our case something called the Java Performance Diagnostic Method) we never missed having the noise.
In broad strokes, I think this is similar to Bitdrift (https://bitdrift.io) - though they’re focused on mobile observability.
>Turns out you can compile tens of thousands of patterns and still match at line rate.
Well, yeah, that's sort of the magic of the regular expression <-> NFA equivalence theorem. Any regex can be converted to a state machine, and since you can combine regexes (and NFAs!) procedurally, this is not a surprising result.
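For concreteness, here's a tiny Python sketch of the idea: many patterns OR'd into one expression so every line is scanned in a single pass. The patterns are made up, and Python's re still backtracks; Hyperscan/Vectorscan compiles the combined patterns into automata and does this at line rate.

```python
import re

# Hypothetical log patterns; in practice these would be your classifiers.
patterns = [
    r"ERROR .* connection refused",
    r"GET /healthz",
    r"OutOfMemoryError",
]

# Combine into one alternation so each line is scanned once.
# Named groups tell us which original pattern matched.
combined = re.compile("|".join(f"(?P<p{i}>{p})" for i, p in enumerate(patterns)))

def classify(line: str):
    m = combined.search(line)
    return m.lastgroup if m else None

print(classify("GET /healthz 200"))    # -> "p1"
print(classify("user 42 logged in"))   # -> None
```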
> I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.
I'm surprised it's only 40%. Observability seems to be treated like fire suppression systems: all important in a crisis, but looks like waste during normal operations.
> The AI can't find the signal because there's too much garbage in the way.
There are surprisingly simple techniques to filter out much of the garbage: compare logs from known-good to known-bad, and look for the stuff that's strongly associated with bad. The precise techniques seem Bayesian in nature, in that the more evidence (logs) you get, the more strongly associated it will appear.
More sophisticated techniques will do dimensional analysis: are these failed requests associated with a specific pod, availability zone, locale, software version, query string, or customer? But you'd have to do so much pre-analysis, prompting, and tool calling that the LLMs that comprise today's AI won't provide any actual value.
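As a rough illustration of both ideas (not any particular product's method; the templates, counts, and field names below are invented), a Python sketch that scores log templates by how much more often they show up in the bad window, then breaks failures down by one dimension:

```python
from collections import Counter

# Invented data: (template, attributes) pairs from a known-good and a known-bad window.
good = [("request ok", {"pod": "a"}), ("request ok", {"pod": "b"}), ("cache miss", {"pod": "a"})] * 50
bad  = [("request ok", {"pod": "a"}), ("db timeout", {"pod": "b"}), ("db timeout", {"pod": "b"})] * 50

def association_scores(good, bad, smoothing=1.0):
    """Crude Bayesian-flavored score: how much more frequent is a template
    in the bad window than in the good one? More evidence -> stronger signal."""
    g, b = Counter(t for t, _ in good), Counter(t for t, _ in bad)
    ng, nb = sum(g.values()), sum(b.values())
    score = lambda t: ((b[t] + smoothing) / nb) / ((g[t] + smoothing) / ng)
    return sorted(((t, score(t)) for t in set(g) | set(b)), key=lambda x: -x[1])

def failure_breakdown(bad, dimension="pod"):
    """Dimensional analysis: which value of a dimension dominates the bad window?"""
    return Counter(attrs.get(dimension) for _, attrs in bad).most_common()

print(association_scores(good, bad)[0])   # ("db timeout", ...) floats to the top
print(failure_breakdown(bad))             # pod "b" carries most of the failures
```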
Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Traditional wisdom tells you regex is slow.
> I'm surprised it's only 40%.
Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.
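For anyone wondering what "sample entire transactions" might look like in practice, here's a minimal sketch, assuming logs carry a trace_id field (the field name and rate are assumptions, not Tero's implementation): hash the trace ID and keep or drop every log line in the transaction together.

```python
import hashlib

def keep_transaction(trace_id: str, sample_rate: float = 0.1) -> bool:
    """Deterministically keep ~sample_rate of transactions: every log line
    that shares a trace_id gets the same decision, so a kept transaction
    stays complete instead of losing random lines."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

logs = [
    {"trace_id": "abc123", "msg": "request started"},
    {"trace_id": "abc123", "msg": "db query took 42ms"},
    {"trace_id": "def456", "msg": "request started"},
]
kept = [line for line in logs if keep_transaction(line["trace_id"])]
```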
> compare logs from known good to known bad
I think you're describing anomaly detection. Diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.
You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.
> Traditional wisdom tells you regex is slow.
Because it's uncomfortably easy to create catastrophic backtracking.
But just logical-ORing many patterns together isn't one of the ways to do that, at least as far as I'm aware.
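A small Python illustration of the difference (patterns invented for the example): nested quantifiers can backtrack catastrophically, while a flat alternation of independent patterns doesn't interact that way.

```python
import re

# Nested quantifiers: when the overall match fails, the engine retries
# exponentially many ways of splitting the run of 'a's between the groups.
evil = re.compile(r"(a+)+$")
# evil.search("a" * 30 + "b")   # don't run this: exponential time in a backtracking engine

# A flat OR of independent patterns has no such interaction; alternatives
# are simply tried in order at each position.
ok = re.compile(r"ERROR|WARN|health check|OutOfMemoryError")
print(bool(ok.search("GET /healthz health check ok")))   # True
```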
> I think you're describing anomaly detection.
Well, it's in the same neighborhood. Anomaly detection tends to favor finding unique things that only happened once. I'm interested in the highest-volume stuff that only happens on the abnormal-state side. But I'm not sure this has a good name.
> Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever?
I get your point but: if sorting by the most strongly associated yields root causes (or at least, maximally interesting logs), then sorting in the opposite direction should yield the toxic waste we want to eliminate?
But if you don't do anomaly detection, how can you possibly know which data is useful for anomaly detection? And thus, which data is valuable to keep?
Vectorscan is impressive. It makes a huge difference if you're looping through an eval of dozens (or more) regexps. I have a pending PR to fix it so it'll run as a wasm engine -- this is a good reminder to take that to completion.
This reminded me of a note I heard about backups: you don't want backups, they're a waste of time, bandwidth, and disk space; most if not all of them will end up being discarded without ever being used. What you really want is something to restore from when anything breaks, and that is the cost that should matter to you. What if you don't have anything meaningful to restore from?
With observability, it's not about the volume of data or the time and bandwidth spent on it; it's about being able to understand your system and properly diagnose and solve problems when they happen. Can you do that with less? For the next problem, the one you don't know about yet? If you can't, because of information you didn't collect, then spending so much maybe still wasn't enough.
Of course, some ways of getting to that end result are more efficient than others. But having the needed information available, even if it is never used, is the real goal here.
I agree with the framing. The goal isn't less data for its own sake. The goal is understanding your systems and being able to debug when things break.
But here's the thing: most teams aren't drowning in data because they're being thorough. They're drowning because no one knows what's valuable and what's not. Health checks firing every second aren't helping anyone debug anything. Debug logs left in production aren't insurance, they're noise.
The question isn't "can you do with less?" It's "do you even know what you have?" Most teams don't. They keep everything just in case, not because they made a deliberate choice, but because they can't answer the question.
Once you can answer it, you can make real tradeoffs. Keep the stuff that matters for debugging. Cut the stuff that doesn't.
The problem is that until I hit a specific bug, I don't know which logs might be useful. For every bug I've had to fix, 99% of the logs were useless, but I've fixed many bugs over the years and each one needed a different set of logs. Sometimes I know in the code "this can't happen, but I'll log an error just in case"; when I see those in a bug report they're often a clue, but I usually need a lot of info logs from things that happen normally all the time to figure out how my system got into that state.
"Disk getting full" isn't useful unless you understand how/why it got full, and that requires logging things that might or might not matter to the problem.
There is a lot of crap that is, and always will be, useless when debugging a problem. But there is also a lot that you don't know whether you will need, at least not yet, not when you're deciding what information to collect, and it may become essential when something in particular (usually unexpected) breaks. And then you won't have the past data you didn't collect.
You can go down a discovery path: can the data you collect explain how and why the system is running the way it is now? Are there things that are just not relevant when things are normal but matter when they are not? Understanding the system and all its moving parts is a good guide for tuning what you collect, what you should drop, and what the missing pieces are. And keep cycling on that, because your understanding and your system will both keep changing.
Kudos to Ben for speaking to one of the elephants in the room in observability: data waste and the impact it has on your bill.
All major vendors have a nice dashboard and sometimes alerts to understand usage (broken down by signal type or tags) ... but there's clearly a need for more advanced analysis which Tero seems to be going after.
Speaking of the elephant in the room in observability: why does storing data with a vendor cost so much in the first place? With most new observability startups choosing to store data in columnar formats on cheap object storage, I think this is also getting challenged in 2026. The combination of cheap storage with meaningful data could breathe some new life into the space.
Excited to see what Tero builds.
Thank you! And you're right, it shouldn't cost that much. Financials are public for many of these vendors: 80%+ margins. The cost to value ratio has gotten way out of whack.
But even if storage were free, there's still a signal problem. Junk has a cost beyond the bill: infrastructure works harder, pipelines work harder, network egress adds up. And then there's noise. Engineers are inundated with it, which makes it harder to debug, understand their systems, and iterate on production. And if engineers struggle with noise and data quality, so does AI.
It's all related. Cheap storage is part of the solution, but understanding has to come first.
The problem has never been the storage. It's running those queries so they return in milliseconds, whether it's for a dashboard, an alert, or your new AI agent trying to make sense of it.
As an Ops (DevOps/Sysadmin/SRE-ish) person: excellent article.
However, as always, the problem is more political than technical, and those are the hardest problems to solve; another service with more cost IMO won't solve it. That said, there is plenty of money to be made in attempting to, so go get that bag. :)
At the end of the day, it comes back to the DevOps mentality, and that's never caught on at most companies. Devs don't care, the project manager wants us to stop blocking feature velocity, and we're not properly staffed since we're a "massive wasteful cost center".
100% accurate. It is very much political. I'd also add that the problem is perpetuated by a disconnection between engineers who produce the data and those who are responsible for paying for it. This is somewhat intentional and exploited by vendors.
Tero doesn't just tell you how much is waste. It breaks down exactly what's wrong, attributes it to each service, and makes it possible for teams to finally own their data quality (and cost).
One thing I'm hoping catches on: now that we can put a number on waste, it can become an SLO, just like any other metric teams are responsible for. Data quality becomes something that heals itself.
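A hypothetical sketch of what a waste SLO check could look like (the threshold, names, and numbers are made up; this isn't Tero's API):

```python
# Hypothetical waste SLO per service; the threshold and metrics are invented.
WASTE_SLO = 0.20   # a team agrees to keep waste under 20% of its log volume

def check_waste_slo(service: str, waste_bytes: int, total_bytes: int) -> bool:
    ratio = waste_bytes / total_bytes if total_bytes else 0.0
    if ratio > WASTE_SLO:
        print(f"{service}: waste at {ratio:.0%}, over the {WASTE_SLO:.0%} SLO")
        return False
    return True

check_waste_slo("checkout-api", waste_bytes=40_000_000, total_bytes=100_000_000)
```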
I'd be shocked if you can accurately identify waste since you are not ultimately familiar with the product.
Sure, I've kicked over what I thought was waste, only to be told it's not, or "It is, but deal with it, Ops."
You're right, it's not always binary. That's why we broke it down into categories:
https://docs.usetero.com/data-quality/logs/malformed-data
You'd be shocked how much obviously-safe waste (redundant attributes, health checks, debug logs left in production) accounts for before you even get to the nuanced stuff.
But think about this: if you had a service that was too expensive and you wanted to optimize the data, who would you ask? Probably the engineer who wrote the code, added the instrumentation, or whoever understands the service best. There's reasoning going on in their mind: failure scenarios, critical observability points, where the service sits in the dependency graph, what actually helps debug a 3am incident.
That reasoning can be captured. That's what I'm most excited about with Tero. Waste is just the most fundamental way to prove it. Each time someone tells us what's waste or not, the understanding gets stronger. Over time, Tero uses that same understanding to help engineers root cause, understand their systems, and more.
I would like to just have a storage engine that can be very aggressive at deduplicating stuff. If some data is redundant, why am I storing it twice?
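As a toy sketch of what aggressive dedup could look like (not how any particular storage engine implements it), a content-addressed store keeps each distinct payload once and records only references for repeats:

```python
import hashlib
from collections import defaultdict

store = {}                        # content hash -> payload, stored once
occurrences = defaultdict(list)   # content hash -> timestamps of each repeat

def ingest(timestamp: float, payload: str) -> str:
    key = hashlib.sha256(payload.encode()).hexdigest()
    store.setdefault(key, payload)      # identical payloads are never stored twice
    occurrences[key].append(timestamp)  # a repeat only costs a reference
    return key

ingest(1.0, "GET /healthz 200")
ingest(2.0, "GET /healthz 200")
print(len(store), sum(len(ts) for ts in occurrences.values()))   # 1 payload, 2 occurrences
```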
That's already pretty common, but the goal isn't storing less data for its own sake.
> the goal isn't storing less data for its own sake.
Isn't it? I was under the impression that the problem is the cost of storing all this stuff.
Nope, you can't just look at cost of storage and try to minimize it. There are a lot of other things that matter.
What I am asking is: what are the other concerns, besides literally the cost? I have an interest in this area, and I keep seeing everyone say that observability companies are overcharging their customers.