NHacker Next
OpenTelemetry: Escape Hatch from the Observability Cartel (oneuptime.com)
ramon156 1 days ago [-]
OTel has been slow as hell since release. We tried using it for tracing last week and the app was significantly slower.

We just use prometheus+grafana now. Yes, this technically also slows the app down, but OTel was unbearably slow.

I'm sure I'm doing a million and one things wrong, but I can't be arsed to set something up just to see some performance metrics. Deadlocks can be found using transaction metrics; that's all you need.

Edit: I now read in the comments that the JS version is a bad implementation; I guess this might be part of the reason.
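For the "just some performance metrics" use case, a hand-rolled Prometheus-style endpoint really is tiny. A stdlib-only sketch for illustration (metric names are made up; in practice the prometheus_client library handles labels, histograms, and concurrency for you):

```python
# Minimal Prometheus-style /metrics endpoint using only the stdlib.
# Illustrative sketch, not production code.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"http_requests_total": 0, "deadlocks_total": 0}
LOCK = threading.Lock()

def inc(name, amount=1):
    with LOCK:
        METRICS[name] += amount

def render_metrics():
    # Prometheus text exposition format: one "name value" line per metric.
    with LOCK:
        return "".join(f"{name} {value}\n" for name, value in METRICS.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9090):
    # Point a Prometheus scrape job at this port; chart the rest in Grafana.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Call `inc(...)` from your request handlers; everything Prometheus scrapes becomes chartable.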

nijave 1 days ago [-]
You might be able to use the big telemetry vendors' libraries for instrumentation. IIRC you can use the ddtrace Python library and have it send to an OTel collector, which will convert from the DD span format to OTel.

I haven't tried but it's probably possible to do the same with JS

I think Sentry was also expanding into tracing--might be worth a look to see if they're doing something that works better in their library
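The ddtrace-to-collector hookup described above might look roughly like this collector config. This is a hedged sketch: the receiver and exporter names assume the opentelemetry-collector-contrib distribution, and the endpoints are placeholders.

```yaml
# Sketch: accept spans from ddtrace (which speaks the Datadog agent
# protocol) and forward them as OTLP. Check component names against
# your contrib collector version.
receivers:
  datadog:
    endpoint: 0.0.0.0:8126   # port ddtrace expects the Datadog agent on
exporters:
  otlp:
    endpoint: my-otel-backend:4317   # placeholder backend address
service:
  pipelines:
    traces:
      receivers: [datadog]
      exporters: [otlp]
```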

phillipcarter 1 days ago [-]
Would be curious about this in more detail, since I've not normally seen a JS app be significantly slower due to adding autoinstrumentation over the years. There's obviously an overhead, but aside from the occasional bug I've never seen it be significant enough to impact user experience or cost to serve.

That said if your goal is basic performance metrics and nothing more, then tracing is overkill. Don’t even need an SDK, just monitor the compute nodes async with some basic system metrics. But if your goal is to narrow down behaviors within your app on a per-request basis there really is no way around tracing if you value your sanity.
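A sketch of the "basic system metrics, no SDK" approach, using only the Python stdlib (the metric names and the flat-dict shape are made up for illustration; ship the dict to whatever sink you already have):

```python
# Stdlib-only system metrics sampler: no agent, no SDK, no tracing.
import os
import shutil

def sample_system_metrics(path="/"):
    load1, load5, load15 = os.getloadavg()  # POSIX only
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }
```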

schainks 1 days ago [-]
Say more about what you did to instrument? I am willing to bet architectural choices are causing that slowness, not OTel itself.
kchoudhu 1 days ago [-]
OpenTelemetry won observability mindshare, but it is entirely the wrong architectural choice: by buying into its ethos your code is held hostage by the least stable otel monitoring library for your dependencies.

Sadly, there was always an alternative that no one took: dtrace. Add USDTs to your code and then monitor progress by instrumenting it externally, sending the resulting traces to wherever you want. My sincere hope is that the renewed interest in ebpf makes this a reality soon: I never want to have to do another from opentelemetry.<whatever> import <whatever> ever again.
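Real USDTs are compiled into the binary and enabled externally by dtrace or eBPF, with near-zero cost while nothing is attached. As a purely illustrative in-process stand-in for that idea (probe sites that do almost nothing until a consumer attaches), a Python sketch:

```python
# Illustrative stand-in for the USDT model: probe sites are near-free
# until a consumer attaches. Real USDTs are enabled by dtrace/eBPF
# from outside the process, not by in-process registration like this.
_consumers = []

def attach(consumer):
    _consumers.append(consumer)

def probe(name, **payload):
    if not _consumers:       # the common case: nothing attached
        return
    for consumer in _consumers:
        consumer(name, payload)

def handle_request(request_id):
    probe("request:start", request_id=request_id)
    result = request_id * 2  # stand-in for real work
    probe("request:done", request_id=request_id, result=result)
    return result
```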

jitl 1 days ago [-]
If some tracing plug-in is shitting up your code with its monkeypatching, rip it out and instrument it yourself. We do this a lot. I'd say OTel packages are no better or worse quality-wise than any other stuff in node_modules. Not OTel's fault that Code in general has Bugs and is Bad.
rixed 1 days ago [-]
I'm not familiar with USDTs, but my understanding is that they cover only the lower level of OpenTelemetry. Is there an equivalent of OTLP for dtrace?
jitl 1 days ago [-]
Does any OpenTelemetry vendor have a dashboard / graph product in the same level of usability as Datadog?

Honeycomb is decent at what it has, but very limited in its dashboard offerings.

Coming from Datadog, Grafana is such a bad experience I want to cry every time I try to build out a service dashboard. So much more friction to get anything done, like adding transform functions/operators or doing smoothing or extrapolation; even time shifting is like pulling teeth. Plus they just totally broke all our graphs with formulas for like 2 days.

Grafana is to Datadog what Bugzilla is to Linear.

everfrustrated 1 days ago [-]
As someone used to Datadog I can't find anything that comes close. Datadog makes it so easy to throw adhoc data up on a screen and explore a hypothesis from there. Using Grafana or New Relic feels like going from a 30" screen to a 14" laptop - everything is just harder and more frustrating in ways that are hard to articulate.
_boffin_ 1 days ago [-]
Elastic is my choice. Don’t like grafana and dd felt like it was holding your hand while telling you to give them your money as they punch you repeatedly.
nijave 1 days ago [-]
I thought New Relic was pretty good, but their per-seat license cost is super high.

They have a SQL-like query language that I think can do most of what you're describing.

jitl 1 days ago [-]
I want improvements like “dropdown menu of tags not limited to 50 items” and “one button forecast on any graph”. Unsure if any SQL query language will solve my annoyance that stuff in Datadog was one click, and in Grafana is 14 clicks or impossible without referencing a Book of the Arcane.

Heck I’m dying for “I can copy a graph between dashboards”. Grafana allows this but if any variable is in the graph but doesn’t exist in the destination, pasting just creates an empty graph.

yread 1 days ago [-]
I use New Relic and I'm not sure I would recommend it. Sure, it was easy to install on all the servers, but now I keep getting alerts every month (on like the 2nd day) that I ran out of some "quota" of 100GB. I thought that was impossible as I almost don't use it, but apparently they (used to) send a list of running processes every second by default, and that gunks things up.

Also, I set up alerts for >50% CPU or >90% disk full. I do get an alert, but it doesn't say which volume it was or how full it was, i.e. the actual value that triggered sending the alert. WTF.

blyry 1 days ago [-]
We migrated from newrelic to datadog (for cost reasons LMAO) a while back and I miss NRQL every single day I'm building a dashboard.

I enjoy having everything instrumented and in one spot, it's super powerful, but I am currently advocating for self hosting loki so that we can have debug+ level logs across all environments for a much much lower cost. Datadog is really good at identifying anomalies, but the cost for logs is so high there's a non-trivial amount of savings in sampling and minimizing logging. I HATE that we have told devs "don't log so much" -- that misses the entire point of building out a haystack. And sampling logs at 1%, and only logging warnings+ in prod makes it even harder to identify anomalies in lower environments before a prod release.

last hot take: The UX in kibana in 2016 was better than anything else we have now for rapidly searching through a big haystack, and identifying and correlating issues in logs.

voxic11 1 days ago [-]
Have you looked at https://signoz.io/ ? It's the closest I have tried, but I still wouldn't put it quite at the same level of usability.
jitl 1 days ago [-]
Doesn't seem like this can visualize over data stored in Honeycomb like Grafana can :(
makeavish 20 hours ago [-]
It's a full-fledged tool like Datadog that is open source and can be self-hosted. You can replace Honeycomb with it. What features do you think it's missing compared to Honeycomb?
tokol17982 1 days ago [-]
https://github.com/uptrace/uptrace is OpenTelemetry-based and feels much closer to Datadog than Grafana does.
boilerupnc 1 days ago [-]
Give IBM Instana a look. It has great OTel support and nice usability. [0][1] There's a free sandbox for an email address. [2]

[0] https://www.ibm.com/products/instana/opentelemetry

[1] https://github.com/instana/instana-otel-collector

[2] https://play-with.instana.io/#/home

[disclaimer: I'm an IBMer]

morningspace 16 hours ago [-]
> Can this visualize over data stored in Honeycomb like Grafana can?

IIUC, Grafana connects directly to Honeycomb via its API to visualize data without storing it. Instana, on the other hand, is a bit different: it needs telemetry data to be ingested into its backend before it can be visualized in the UI. With Honeycomb, this could be possible if the data can be exported from Honeycomb to Instana.

jitl 1 days ago [-]
Can this visualize over data stored in Honeycomb like Grafana can?
arccy 1 days ago [-]
really? for a lot of us the datadog query language for dashboards is absolute trash. getting it to do what you want sits somewhere between difficult to impossible.

i guess it depends on what you're used to

siegecraft 1 days ago [-]
what do you use instead and what does it give you that you can't get from datadog? as someone who's been locked into datadog for the past few years i'm wondering what i've been missing.
MrBuddyCasino 1 days ago [-]
dash0 is a brand-new player and wants to be "simple"; perhaps check it out. Former colleagues of mine, so I'm biased, but they do know what they're doing: they built the APM tool Instana, which sold to IBM for $400M.
jitl 1 days ago [-]
You can make a tool that’s both powerful and usable. Datadog did it. Adobe did it. There is no downside to usability.
mohammadv184 1 days ago [-]
There is an un-marketed reality: OTel is not simple. The learning curve is steep, the documentation can be a maze of specs, and the SDKs (especially for metrics and logs) can feel over-engineered. You will get frustrated.
pas 1 days ago [-]
the problem is not simple either, and the same (or very similar) SDKs are used by the very fancy platforms too, no?

compared to dumping logs to a file (or a single instance Prometheus scraping /metrics) everything is frustrating, because there are so many moving parts anyway, you want to query stuff and correlate, but for that you need to propagate the trace id, and emit and store spans, and take care to properly handle async workers, and you want historical comparisons, and downsampling for retention, and of course auto-discovery/registration/labeling from k8s pods (or containers or whatever) and source code upload and release tagging from CI, and ...
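The trace-id propagation piece, at least, is less magic than it sounds. A stdlib-only sketch (illustrative, not an OTel API): contextvars flow into asyncio tasks automatically, so async workers spawned by a request see that request's trace id.

```python
# Stdlib-only trace-id propagation: each task created by asyncio copies
# the current context, so workers inherit the request's trace id.
import asyncio
import contextvars
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)

async def worker(results):
    # Inherits the value captured when the task was created.
    results.append(trace_id_var.get())

async def handle_request(results):
    trace_id_var.set(uuid.uuid4().hex)  # new trace id per "request"
    await asyncio.gather(worker(results), worker(results))
    return trace_id_var.get()

def demo():
    results = []
    trace_id = asyncio.run(handle_request(results))
    return trace_id, results
```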

arccy 1 days ago [-]
nearly all the sdks are wild pieces of overengineering designed to conform to a design-by-committee api that's native to no language.

half the things you list aren't even part of the sdks, they're part of the collector.

pas 1 days ago [-]
yes, and that's why it's usually better to put the spans where you actually need them instead of depending on monkey patching. still, it's the same whether you use DataDog or Sentry (or OTel).
secondcoming 1 days ago [-]
The C++ SDK is a masterpiece of over-engineering.
pan69 1 days ago [-]
So is the JavaScript/TypeScript one. Very steep learning curve, very fragmented but clearly also very powerful once you know how to use it.
nijave 1 days ago [-]
I guess I'm the contrarian, but I've had good success with OTel. It's especially powerful being able to plug and play components, like taking the ddtrace Python instrumentation library, hooking it up to the OTel collector, and dumping into Jaeger.

I can run a full tracing stack locally for dev use with minimal config.

greatgib 1 days ago [-]
I was always put off from using OTel more by how verbose it is and how heavy the telemetry payloads are compared to simple ad-hoc alternatives.

Am I wrong?

jitl 1 days ago [-]
The JavaScript Otel packages are implemented with 34 layers of extra abstraction. We wrote our own implementation of tracing for Cloudflare Workers and it performs much better with 0 layers of abstraction. I’ve seen a few other services switching over to our lightweight tracer. The emitted JSON is still chunky but removing all the incidental complexity helped a lot.
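The implementations discussed here are private (and in JavaScript), but a hand-rolled zero-abstraction tracer in that spirit might look something like this hypothetical sketch: spans as plain dicts, a context-manager API, JSON out.

```python
# Hypothetical minimal tracer illustrating the "0 layers" approach.
# Not the private implementation described above.
import json
import time
import uuid

class Tracer:
    def __init__(self):
        self.spans = []

    def span(self, name, parent_id=None):
        return _Span(self, name, parent_id)

    def export(self):
        # A real exporter would POST this batch somewhere.
        return json.dumps(self.spans)

class _Span:
    def __init__(self, tracer, name, parent_id):
        self.tracer = tracer
        self.record = {"id": uuid.uuid4().hex[:16], "parent_id": parent_id, "name": name}

    def __enter__(self):
        self.record["start_ns"] = time.monotonic_ns()
        return self

    def __exit__(self, *exc):
        self.record["duration_ns"] = time.monotonic_ns() - self.record["start_ns"]
        self.tracer.spans.append(self.record)  # finished spans only
        return False
```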
mvf4z7 1 days ago [-]
Can you share a link to your implementation?
jitl 1 days ago [-]
It's private code in our monorepo

EDIT: we actually have two. The one we use for Node, the author plans to open source eventually. That one is a drop-in replacement for the Span and Trace classes and Just Works with upstream OTel. The main blocker is that we have some patch-package fixes for other performance issues with upstream, and we need to make our stuff work with the non-patched upstream.

The one we use for Workers is more janky and doesn’t make sense to open source. It’s like 100 total LoC but doesn’t have compatibility with existing Otel libraries.

PunchyHamster 1 days ago [-]
It is designed to get insights into dozen(s) connected systems.

It will always be overkill for just an app or two talking with each other... till you grow, and then it won't be overkill any more.

But it still might be worth getting into on smaller apps, just thanks to the wealth of tools available.

greatgib 1 days ago [-]
But my point is that, especially for a big system, the performance impact of just the telemetry must be huge.
MrBuddyCasino 1 days ago [-]
No, the spec isn't great and makes it hard to implement a performant solution.
r3tr0 1 days ago [-]
OTel's real bottleneck isn't the spec. It's the fact that it requires you to instrument the app itself. That couples your performance to the maturity of the least stable SDK you depend on.

eBPF solves this by reversing the model: instrument the system, not the application. Turn it on / off dynamically, zero redeploys, minimal overhead.

The missing piece is accessibility. Kernel-level observability exists; "normal engineers can use it" and good DX do not.

pandemic_region 1 days ago [-]
Sadly, few people remember Zipkin, the spiritual father of tracing, which OTel stabbed in the back for its own profit.
csomar 1 days ago [-]
What the author doesn't realize is that OpenTelemetry has fundamental problems. I experienced this firsthand two years ago working with OTel in Rust, and just today I spent an entire afternoon debugging what turned out to be an OTel package update breaking react-router links. Since the bug showed up alongside several other package updates at once, OTel was at the bottom of my suspicion list.

The core issue is that, with otel, observability platforms become just a UI layer over a database. No one wants to invest in proper instrumentation, which is a difficult problem, so we end up with a tragedy of the commons where the instrumentation layer itself gets neglected as there is no money to be made there.

sweetgiorni 1 days ago [-]
> The core issue is that, with otel, observability platforms become just a UI layer over a database. No one wants to invest in proper instrumentation, which is a difficult problem, so we end up with a tragedy of the commons where the instrumentation layer itself gets neglected as there is no money to be made there.

I don't think it's fair to say "no one wants to invest in proper instrumentation" - the OpenTelemetry community has built a massive amount of instrumentation in a relatively short period of time. Yes, OpenTelemetry is still young and unstable, but it's getting better every day.

As the article notes, the OpenTelemetry Collector has plugins that can convert nearly any telemetry format to OTLP and back. Many of the plugins are "official" and maintained by employees of Splunk, Datadog, Snowflake, etc. Not only does this break the lock-in, it also allows you to reuse all the great instrumentation that's been built up over the years.

> The core issue is that, with otel, observability platforms become just a UI layer over a database.

I think this is a good thing - when everyone is on the same playing field (I can use Datadog instrumentation, convert it to OTel, then export it to Grafana Cloud/Prometheus), vendors will have to compete on performance and UX instead of their ability to lock us in with "golden handcuffs" instrumentation libraries.

hu3 1 days ago [-]
I would add that in most cases, it's just a web UI displaying a lot of noise disguised as data.

Making sense out of so much data is why datadog and sentry make so much money.

csomar 1 days ago [-]
You still have to do that work yourself. I am using Honeycomb (the free tier), but their pricing makes little sense. Their margins must be something like 100x.
carefulfungi 1 days ago [-]
Having shipped free-tier observability products, your comment that you aren't paying them but think their margins are 100x is a perfect irony.
ehnto 1 days ago [-]
The trade-off is worth it in this case. Those are technical hurdles; when the issue we are trying to solve is data sovereignty, those hurdles become incidental complexity.

Of course you could also roll your own telemetry, which is generally not that difficult in a lot of frameworks. You don't always need something like OTel.

PunchyHamster 1 days ago [-]
Most languages have a pretty mature ecosystem. I used it in Go and it was mostly problem-free, with the biggest annoyance being a bit of boilerplate that had to be added.
tbrownaw 1 days ago [-]
> instrumentation layer itself gets neglected

It needs to be treated as an integral part of whatever framework is being instrumented. And maintained by those same people.

andrewstuart2 1 days ago [-]
I'd just like to point out that you've said OTel has fundamental problems, and then you pointed out a couple examples of one-time-fixable transient problems.

These are issues you'd experience with anything that spans your stack as a custom telemetry library would.

kchoudhu 1 days ago [-]
There is very much an alternative. Looking at the execution of your code should never alter its fundamental performance the way otel is built to do. This was a solved problem at least a decade and a half ago, but the cool kids decided to reinvent the wheel, poorly.

https://news.ycombinator.com/item?id=45845889

PunchyHamster 1 days ago [-]
dtrace was meant for an entirely different use, and it's not a replacement for OTel.

OTel was made to track request execution (and anything a request triggers) across multiple apps at once, not to instrument a single app to find slow points.

nothrabannosir 1 days ago [-]
To OP’s credit though the latter is exactly what every single piece of otel documentation pushes you to do. Using only the manual spans api is an exercise in api docs spelunking and ignoring “suggested best practices” and “only do this if everything else has failed for you”.
kchoudhu 1 days ago [-]
We should be using USDTs to emit trace ids that can be consumed by dtrace and shoved into whatever backend we want for tracing.
masterj 1 days ago [-]
Why don’t you try that, convert the output to OTLP and then write about it?
ok_dad 1 days ago [-]
What’s a USDT? All I can find on Google is crypto garbage.
csomar 1 days ago [-]
That's just one dimension to telemetry. For my use case, for example, I need distributed tracing; which is a fancy word for correlated logs.
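The "correlated logs" half of this can be sketched with just the stdlib: a logging.Filter that stamps every record with the current trace id, so all log lines from one request share an id you can group on. Illustrative only, not a distributed-tracing implementation.

```python
# Stdlib sketch of correlated logs: every record carries the trace id
# held in a ContextVar at the time it was logged.
import contextvars
import io
import logging

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

def make_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    logger = logging.getLogger("correlated")
    logger.handlers = [handler]
    logger.addFilter(TraceIdFilter())
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```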
csomar 1 days ago [-]
It's more than a couple. The fundamental issue is not the bugs themselves (those are expected) but that, from my perspective, OTel is at odds with the observability business: these actors have little interest in contributing back to telemetry agents, since anyone can reap the rewards of that work. So instead they focus more on their platforms, and the agents/libraries get neglected.

It's a great idea, in principle, but unless it gets strong backing from big tech, I think it'll fail. I'd love to be proven wrong.

zja 1 days ago [-]
> otel is at odds with the observability business because these actors have little interest to contribute back to telemetry agents since anyone can reap the rewards of that.

But all major vendors _do_ contribute to OTEL.

andrewstuart2 1 days ago [-]
That's kind of how open source works, though. Of course the backend vendors won't care about anything that doesn't affect the backend somehow. But the people, i.e. users, who do want to be able to easily switch away from bad vendors, have incentives to keep things properly decoupled and working.

The license is the key enabler for all of this. The vendors can't be all that sneaky in the code they contribute without much higher risk of being caught. Sure, they will focus on the funnel that brings more data to them, but that leaves others more time to work on the other parts.

ViewTrick1002 1 days ago [-]
What is the preferred setup for deploying OpenTelemetry on a Kubernetes cluster? Is OpenTelemetry the choice today?

I am running a few projects on a minimal Hetzner K3S cluster and just want some cheap easy observability to store logs, reduce log noise and instead rely on counters/metrics without paying an arm and a leg.

Languages used are Rust, Javascript and Python mostly.

jitl 1 days ago [-]
The OTel collector, or a complete OTel stack with data storage and UI etc.?

The collector is a helm chart, someone on my project added it to our K8s clusters last week. It was like 30 lines of YAML/Terraform in total. Logs, trace forwarding, Prometheus scraping. That bit is easy.

Idk about deploying the ui/storage. I’ve used Grafana Loki stuff in Docker Compose locally without much head scratching for local development.

https://github.com/grafana/docker-otel-lgtm
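The "30 lines of YAML" collector config might look roughly like this. Hedged sketch: component names come from the standard collector distributions, and endpoints/scrape targets are placeholders.

```yaml
# Sketch of a minimal collector config of the kind the helm chart renders:
# receive OTLP, scrape Prometheus targets, forward everything on.
receivers:
  otlp:
    protocols:
      grpc:
      http:
  prometheus:
    config:
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: ["app:9090"]   # placeholder scrape target
exporters:
  otlphttp:
    endpoint: https://telemetry.example.com   # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [otlphttp]
```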

ViewTrick1002 1 days ago [-]
Was thinking storage and UI. Making something useful of the work spent collecting the information!
nijave 1 days ago [-]
Not sure about preferred, but Jaeger is pretty simple for tracing only. HyperDX looks interesting but brings in both Mongo and Clickhouse, which imo is pretty heavy.

Native k8s logs + Prometheus is probably on the lighter-weight side, but you don't get traces. You could find a middle ground using the otel collector to extract trace metrics, so you get RED metrics but not full traces.
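That middle ground, deriving RED metrics from spans without storing the spans, is what the contrib collector's spanmetrics connector does. A hedged config sketch (placeholder endpoints; check component names against your collector version):

```yaml
# Sketch: ingest spans, emit request/error/duration metrics, drop the spans.
receivers:
  otlp:
    protocols:
      grpc:
connectors:
  spanmetrics: {}
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # scrape target for your Prometheus
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]   # spans feed the connector, not a backend
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```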
