client_golang (original) (raw)

@kakkoyun nailed explaining it. I think we need to essentially take a more controllable introduction of new metrics. There are literally books written about what metrics you should use for monitoring Go. We can't change it overnight, even if it's "only" naming that changes. Naming is what tools depend on, not only humans, even if the name is clearer now. By implicitly adding new ones on top of old ones we are not helping with adoption much, because people in most cases won't notice new ones (no good way of discovering them) - other than having to pay more for those metrics. We need to work around this with documentation, tutorials and tooling changes.

NOTE: It's also not only histograms that are problematic. An overlap of extra metrics just to rename them and allow ppl to migrate to them is significant. Again, without the extra work mentioned above, it will not be useful for others.

Coming down the pipeline in the Go world is golang/go#48409 which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs, linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use).

Yes - and it would be awesome to have those! For now as opt-in. We might at some point consider more granular options or even a little bit bigger default for truly popular and useful metrics for everyone in future.

It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.

It does not scale for Go Team, but:

This situation proved that it's so easy to make the unexpected impactful situation for users as we should probably treat metric change much more seriously than programmatic API change in compatibility terms. Go community that works on those metrics might treat it lightly and rename or add an enormous histogram because it's important for their use cases. Not many will remember that this has serious consequences to those exporters like client_golang between 1.12-now that immediately transfer all changes of runtime/metrics to thousand production clusters without any review of metric experts from exporter domain.
You are not alone. If a metric will be useful it will be added as soon as PR to Go is merged and the version with the new runtime/metric change is released. It's nice to have those things consumed on more data driven "need" base, and not because a developer thinks it's useful.

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting new metrics by default completely in addition to switching to the old metrics feels like throwing the baby out with the bathwater.

Don't worry, it's our fault (my and @kakkoyun) that we completely missed those shortcomings. And I truly believe we could fix this (and we will, before releasing of new version) within days, it's not a biggie. Still, even with fixes, we are adding too many new metrics without value (the majority of people/tooling won't know about them and won't know how to use them).

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues.

Again, no need to be sorry @mknyszek your help was life-saving here. It's ultra-important to work with Go Team on what's coming to Go runtime in terms of instrumentation 💪🏽 It's was amazing to have your contributions and let's move those things forward, just a little bit more safely and opt-in - until we graduate some things to default! (:

I do hope you'll reconsider the decision to not export new metrics by default (or at least the non-histogram metrics) in the future.

Defnitely!