I got OpenTelemetry to work. But why was it so complicated?

iconsolutions.com

288 points by paltaie 4 days ago


hinkley - 4 days ago

The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.

And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.
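A minimal sketch of the kind of in-process aggregation described above, written in Python for brevity rather than the Node.js the comment refers to. The `CounterAggregator` class, its `allowed_tags` parameter, and the flush format are all hypothetical. The idea is to sum increments locally per (name, tags) key and drop tags outside an allow-list, so the exporter sees one data point per series per interval instead of one per call. It does not address the multi-process clobbering issue.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class CounterAggregator:
    """Hypothetical helper: sums counter increments in memory between flushes."""

    def __init__(self, allowed_tags: Iterable[str]) -> None:
        # Tags outside this allow-list are dropped to limit series cardinality.
        self._allowed = frozenset(allowed_tags)
        self._totals: Dict[SeriesKey, float] = defaultdict(float)

    def increment(self, name: str, tags: Dict[str, str], value: float = 1.0) -> None:
        kept = frozenset((k, v) for k, v in tags.items() if k in self._allowed)
        self._totals[(name, kept)] += value

    def flush(self) -> List[tuple]:
        """Return one (name, tags, total) tuple per series and reset the totals."""
        points = [
            (name, dict(sorted(tags)), total)
            for (name, tags), total in self._totals.items()
        ]
        self._totals.clear()
        return points


# Made-up sample data: three calls collapse into two series.
agg = CounterAggregator(allowed_tags=["route"])
agg.increment("http.requests", {"route": "/users", "request_id": "a1"})
agg.increment("http.requests", {"route": "/users", "request_id": "b2"})
agg.increment("http.requests", {"route": "/orders", "request_id": "c3"})
points = agg.flush()
```

In a real service, `flush()` would presumably be called on a timer and its output handed to whatever metrics client is in use.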

OTEL is actively hostile to any language that uses one process per core. What a joke.

Just go with Prometheus. It’s not like there are other contenders out there.

rtuin - 4 days ago

Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDKs, agents and APIs. This is what Otel wants to solve and I think the people behind it are doing a great job. Also kudos to Grafana for adopting OpenTelemetry as a first class citizen of their ecosystem.

I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid-size companies and large enterprises. So as years passed and OpenTelemetry APIs and SDKs stabilized it became our standard for application observability.

To be honest the documentation could be better overall and the onboarding docs differ per programming language, which is not ideal.

My current team is on a NodeJS/Typescript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js

dimitar - 4 days ago

It is as complicated as you want or need it to be. You can avoid any magic and stick to a subset that is easy to reason about and brings the most value in your context.

For our team, it is very simple:

* we use a library to send traces and traces only[0]. They bring the most value for observing applications and can contain all the data the other types can contain. Basically hash-maps vs strings and floats.

* we use manual instrumentation as opposed to automatic - we are deliberate in what we observe and have a great understanding of what emits the spans. We have naming conventions that match our code organization.

* we use two different backends - an affordable 3rd party service and an all-in-one Jaeger install (just run 1 executable or docker container) that doesn't save the spans on disk for local development. The second is mostly for peace of mind of team members that they are not going to flood the third party service.

[0] We have a previous setup to monitor infrastructure and in our case we don't see a lot of value of ingesting all the infrastructure logs and metrics. I think it is early days for OTEL metrics and logs, but the vendors don't tell you this.
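To make the "hash-maps vs strings and floats" point concrete, here is a stdlib-only sketch of manual instrumentation. It is not the OpenTelemetry API: the `span` helper, the `FINISHED` list, and the `module.function` span names are all hypothetical, chosen only to show a span as a named dictionary of attributes with a parent link.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

# Finished spans are collected here; a real setup would export them instead.
FINISHED: List[Dict[str, Any]] = []

# Holds the span that is currently open in this execution context, if any.
_current = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name: str, **attributes: Any) -> Iterator[Dict[str, Any]]:
    """Hypothetical manual span: a dict with a name, ids, and attributes."""
    parent = _current.get()
    record: Dict[str, Any] = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
    }
    token = _current.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current.reset(token)
        FINISHED.append(record)


def charge_card(order_id: str, amount_cents: int) -> None:
    # Span name follows an assumed "<module>.<function>" convention.
    with span("billing.charge_card", order_id=order_id) as s:
        s["attributes"]["amount_cents"] = amount_cents


def place_order(order_id: str) -> None:
    with span("orders.place_order", order_id=order_id):
        charge_card(order_id, amount_cents=1999)


place_order("o-42")
```

With the actual OpenTelemetry Python API the same shape would be expressed with `tracer.start_as_current_span(...)` and `span.set_attribute(...)`; the sketch above only mirrors the idea.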

junto - 4 days ago

One of my biggest problems was the local development story. I wanted logs, traces and metrics support locally but didn’t want to spin up a multitude of Docker images just to get that to work. I wanted logs to be able to check what my metrics, traces, baggage and activity spans look like before I deploy.

Recently, the .NET team launched .NET Aspire and it’s awesome. Super easy to visualize everything in one place in my local development stack and it acts as an orchestrator as code.

Then when we deploy to k8s we just point the OTEL endpoint at the DataDog Agent and everything just works.

We just avoid the DataDog custom trace libraries and SDK and stick with OTEL.

Now it’s a really nice development experience.

https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...

https://docs.datadoghq.com/opentelemetry/#overview

BiteCode_dev - 4 days ago

If you are doing otel with python, use Logfire's client... even if you don't use their offering.

It's foss, and you can point it to any otel compat endpoint. Plus the client that the pydantic team made is 10 times better and simpler than the official otel lib.

Samuel Colvin has a cool interview where he explains how he got there: https://www.bitecode.dev/p/samuel-colvin-on-logfire-mixing-p...

edenfed - 4 days ago

Definitely can relate, this is why I started an open-source project that focuses on making OpenTelemetry adoption as easy as running a single command line: https://github.com/odigos-io/odigos

pat2man - 4 days ago

A lot of web frameworks etc do most of the instrumentation for you these days. For instance using opentelemetry-js and self hosting something like https://signoz.io should take less than an hour to get spun up and you get a ton of data without writing any custom code.

deepsun - 4 days ago

Same thing. OpenTelemetry grew up from Traces, but Metrics and Logs are much better left to specialized solutions.

Feels like a "leaky abstraction" (or "leaky framework") issue. If we wanted to put everything under one umbrella, then well, an SQL database can also do all these things at the same time! Doesn't mean it should.

dboreham - 4 days ago

Author is trying to do something difficult with a non-batteries-included open source (free to them) product. Seems quite uncomplicated given the circumstances. The whole point of OTel is to not get bent over backwards by one of the SaaS "logging/tracing/telemetry" companies, and as such it's going to incur some cost/pain of its own, but typically the bargain is worth taking.

BugsJustFindMe - 4 days ago

If you get to the end you find that the pain was all self-inflicted. I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

nimish - 4 days ago

It's complicated because it's designed for the companies selling Otel compatible software, not the engineers implementing it

6r17 - 4 days ago

I have implemented OTEL over numerous projects to retrieve traces. It's just a total pain and I'd 500% skip it for anything else.

PeterZaitsev - 4 days ago

For those looking for tracing but less complexity check out eBPF based solutions such as Coroot or Odigos

mmanciop - 2 days ago

Adopting OpenTelemetry does not have to be hard for common use-cases. On Kubernetes, the Dash0 operator (https://artifacthub.io/packages/search?repo=dash0-operator) automatically instruments Node.js and Java workloads (and soon other runtimes) with just a custom resource created in a namespace. It works with all OpenTelemetry backends I know of.

Disclaimer: I am one of the authors of the Dash0 operator and work on Dash0 (https://www.dash0.com/), an OpenTelemetry-native observability platform.

Automatic instrumentation on Kubernetes is also provided by the community OpenTelemetry Operator (https://github.com/open-telemetry/opentelemetry-operator).

I am certainly biased here because OpenTelemetry and Prometheus have been at the core of my professional life for the past half decade, but I think that the biggest challenge is that there are many different ways to get you to a good setup, and people get lost in the discovery of the available options.

cglan - 4 days ago

I agree. I tried to get it to work recently with Datadog, but there were so many hiccups. I ended up having to use Datadog's solution mostly. The documentation across everything is also kind of confusing.

cedws - 4 days ago

I still don’t understand what OTEL is. What problem is it solving? If it’s a standard what is the change for the end user? Is it not just a matter of continuing to use whatever (Prometheus, Grafana, etc) with the option to swap components out?

antithesis-nl - 4 days ago

This was exactly my reaction to OpenTelemetry.

Creating an HTTP endpoint that publishes metrics in a Prometheus-scrape-able format? Easy! Some boolean/float key-value-pairs with appropriate annotations (basically: is this a counter or a gauge?), and done! And that led (and leads!) to some very usable Grafana dashboards-created-by-actual-users and therefore much joy.

Then, I read up on how to do things The Proper Way, and was initially very much discouraged, but decided to ignore All that Noise due to the existing solutions working so well. No complaints so far!
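For reference, a rough stdlib-only sketch of such an endpoint. The metric names, values, and the `serve` helper are made up for illustration; the output follows the Prometheus text exposition format (`# HELP`, `# TYPE`, then `name value`), with booleans expressed as 0/1 gauges.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# name -> (type, help text, value); these metrics are invented examples.
METRICS: Dict[str, Tuple[str, str, float]] = {
    "app_requests_total": ("counter", "Requests handled since start.", 1027),
    "app_queue_depth": ("gauge", "Jobs currently waiting.", 3),
    "app_healthy": ("gauge", "1 if the last self-check passed, else 0.", 1),
}


def render_metrics(metrics: Dict[str, Tuple[str, str, float]]) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; the port is arbitrary. Scrape http://host:port/metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `/metrics` should be enough to get data into Grafana under these assumptions.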

vzbl9293 - 2 days ago

Interesting. We're trying to cut costs on APM so we've been moving toward open-source alternatives. Setting up OTEL is definitely tedious, especially for traces, and DT wasn't making it easier. I've been checking out a few alts, Signoz, Odigos, Chronosphere... there are a few others too but these guys stood out. As much as we want to build out OTel ourselves, looking for a solution to make the transition easy seems like the way to go.

ejs - 4 days ago

Glad I'm not the only one that feels this way. For a small application when you just want some metrics and observability, it's a big burden to get it all working.

On my own projects, I send the metrics I care about out through the logs and have another project I run collect and aggregate them from the logs. Probably “wrong” but it works and it's easy to set up.
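A sketch of what that logs-as-metrics approach could look like, under assumptions: the `METRIC <name> <value>` line format, the `format_metric` and `aggregate` helpers, and the sample log lines are all invented here, not taken from the commenter's setup.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical line format: free-form prefix, then "METRIC <name> <value>".
METRIC_RE = re.compile(r"METRIC (?P<name>[\w.]+) (?P<value>-?\d+(?:\.\d+)?)\s*$")


def format_metric(name: str, value: float) -> str:
    """Build the fragment the application appends to a normal log line."""
    return f"METRIC {name} {value}"


def aggregate(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Collect count and sum per metric name, ignoring non-metric lines."""
    stats: Dict[str, Dict[str, float]] = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for line in lines:
        match = METRIC_RE.search(line)
        if match is None:
            continue
        entry = stats[match["name"]]
        entry["count"] += 1
        entry["sum"] += float(match["value"])
    return dict(stats)


# Made-up log lines standing in for the application's real output.
log_lines = [
    "2024-01-01T00:00:00Z INFO " + format_metric("checkout.latency_ms", 120),
    "2024-01-01T00:00:01Z INFO user logged in",
    "2024-01-01T00:00:02Z INFO " + format_metric("checkout.latency_ms", 80),
    "2024-01-01T00:00:03Z INFO " + format_metric("signup.count", 1),
]
summary = aggregate(log_lines)
```

The collector side would run `aggregate` over a tailed log file or stream; averages fall out as `sum / count`.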

pnathan - 4 days ago

I spent altogether too much time trying to get the Rust otel libs working in a useful and concise way. After a few hours I junked it and went back to a direct use of a jaeger client sending off to the otel collector.

There's some gold here, but most of it is over in the consultant/vendor space today, I fear.

lexh - 4 days ago

Gee whiz, is this person in for a treat when they discover the joys of OpAMP https://github.com/open-telemetry/opamp-spec/blob/main/speci...

Turtles all the way down.

shireboy - 4 days ago

I'm literally porting some code to Otel now and here is what I landed on, even before this article: It is confusing because it's a topic that uses vague terminology that means different things in different domains. For example, I'm looking at one OTel ui and "Traces" are the individual http requests to a service. In another UI, against the same data, "Traces" are the log messages from code in the service, and "Requests" are the individual http requests. To wire up in code, there's yet other terminology.

I haven't decided exactly what to blame for this. In some ways, it's necessary to have vague, inconsistent terminology to cover various use cases. And, to be fair some of the UIs predate OTel.

almaight - a day ago

In addition to OTEL, there are many other products, including Odigos, Beyla, Kubeshark, Malcolm, Falco, DDosify, Deepflow, Tetragon, and Retina. Deepflow is a free and open source product.

Cwizard - 3 days ago

OTEL always seems way too complicated to use to me. Especially if you want to understand what it is doing. The code has a lot of abstractions and indirection (at least in Go).

And reading this it seems a lot of people agree. Hope that can be fixed at some point. Tracing should be simple.

See for example this project: https://github.com/jmorrell/minimal-nodejs-otel-tracer

I think it is more a POC but it shows that all this complexity is not needed IMO.

andrewflnr - 3 days ago

So much pain related to context tracking. I'm growing more and more convinced that solving that problem will be the next big thing in PLs, probably in the form of effect systems.

etimberg - 4 days ago

What otel really needs to succeed, at least in the python space, is something as easy and straightforward as DataDog's ddtrace command.

Groxx - 4 days ago

Yeah... this is about how well every OTel migration goes, from what I've seen.

Docs are an absolute monstrosity that rival Bazel's for utility, but are far less complete. Implementations are extremely widely varied in support for basics. Getting X to work with OTel often requires exactly what they did here: reverse-engineering X to figure out where it does something slightly abnormal... which is normal, almost every library does something similar, because it's so hard to push custom data through these systems in a type-safe way, and many decent systems want type safety and will spend a lot of effort to get it.

It feels kinda like OAuth 2 tbh. Lots of promise, obvious desirable goals, but completely failing at everything involving consistent and standardized implementation.

gpi - 4 days ago

OpenTelemessy

pranay01 - 4 days ago

I literally gave a lightning talk on this at KubeCon NA last year. Here's the YouTube video [1], might help you get some perspective

tl;dr

While there are certainly many areas to improve for the project, here are some reasons why it could seem complicated:

Extensibility by Design: Flexibility in defining meters and signals ensures diverse use cases are supported.

It's still a relatively new technology (~3 years old), growing pains are expected. OpenTelemetry is still the most advanced open standard handling all three signals together.

[1] https://www.youtube.com/watch?v=xEu8_Aeo_-o

jensensbutton - 4 days ago

It's getting close to k8s in terms of activity so at least there are a lot of people working on it.

anacrolix - 3 days ago

https://github.com/anacrolix/notel?tab=readme-ov-file#what-a...

icelancer - 3 days ago

I wish I could move off NewRelic. Every time I post about it (seriously, check my post history) over the years, HN commenters try to convince me that OTel does automated metrics almost as good, or just as good, or even better.

Once in a while I try to spin up OTel like they say. Every single time it sucks. I'll keep trying, though. NewRelic's pricing is so brutal that I hold out hope. Unfortunately, NR's product really is that good...

hocuspocus - 4 days ago

Have you considered Kamon instead? From personal experience it's really the best tracing solution for Akka and other libraries using Scala Futures. I haven't tried it, but it does have built-in Spring support as well.

https://kamon.io

Edit: I wonder why suggesting JVM instrumentation that is much more polished than the OTel and Lightbend agents gets me downvoted?