rcmgr: Use prometheus for metrics (OpenCensus generates garbage) #1955
Comments
Interesting. This is a huge amount of garbage we're creating. Thanks for reporting, @guseggert! We should aim to make metric recording allocation-free, ideally. This seems to be easy with … This is highly relevant for our broader metrics effort (#1356, #1910).
Seems related: census-instrumentation/opencensus-go#1265
There doesn’t seem to be a lot of appetite for these optimizations in OpenCensus. This would be an argument against using OpenCensus and for using Prometheus directly.
Agreed, it doesn't matter what kind of flexibility OpenCensus provides if it's unusable in hot code paths.
Putting this on the list for the v0.25 release. Resource manager metrics are an integral part of our resource observability story; they need to “just work” under all workloads.
I did some profiling and benchmarking. The benchmark repeatedly opens a stream, then closes it and its conn. IIUC, the OpenCensus API has some features and flexibility that make this hard to optimize further without changing the OpenCensus API itself, or limiting what users can do with OpenCensus. I can try something similar with Prometheus to compare. Benchmark and optimizations: … Bench on the original code: 3f55b57
What's your opinion on OpenTelemetry? Seems like kubo started using it; why not use it in libp2p?
@Wondertan there’s a discussion about that in #1356. As far as I can see, it’s not quite ready yet.
Ok. If there is no good solution for OpenCensus, it might be worth considering the risk of switching to OpenTelemetry. We started using it in early summer, and since then the API of the actual meters has stayed the same, although the setup API (providers/exporters/etc.) did change and we had to do an update like here. Regarding performance, I have yet to profile OTel myself, but there have been no complaints about garbage allocations and subsequent issues.
They're using it for tracing, not for metrics, as far as I can tell.
Interesting. I'm still hesitant to use an alpha version though, to be honest.
Updated the title. I'm on board to change this to use the prometheus sdk.
I've just started doing performance profiling (to figure out why Kubo is always so CPU hungry; 35% continuous usage at a minimum is just too high). One of my findings so far points to the resource manager: its metrics collection pops up quite high on the CPU usage graphs, exactly like @guseggert shows here too. In my case I disabled the resource manager completely with …
Thanks for the data point. What do you see if you enable the resource manager but disable the metrics collecting? (I believe the metrics should be off by default.)
I don't think Kubo has a knob for turning off the metrics (see here), we should add one to it. Or wait for the prom metrics if that's imminent? |
libp2p _used_ to support metrics using OpenCensus, but this was recently changed to use Prometheus instead - libp2p/go-libp2p#1955. Unfortunately, it is extremely difficult to get Prometheus metrics into OpenTelemetry without running the external OTEL agent. This re-implements the same metrics using OpenTelemetry using the _new_ Prometheus names rather than the old OpenCensus naming. Fixes #2059
I've been profiling a large hydra-booster deployment, and recently added the built-in Resource Manager metrics, which caused CPU usage to significantly increase, caused by increased GC frequency, which seems to be caused by a large amount of garbage generated by adding tags to OpenCensus metrics inside of Resource Manager:
(Note that this is using go-libp2p v0.21, which was before go-libp2p-resource-manager was consolidated into the go-libp2p repo, but the RM metrics code has not changed significantly in recent versions, so it is likely still a problem. I am still working to upgrade the hydras to v0.24, after which I'll try again.)