
rcmgr: Use prometheus for metrics (OpenCensus generates garbage) #1955

Closed
Tracked by #1916
guseggert opened this issue Dec 14, 2022 · 14 comments · Fixed by #2044
Labels: kind/maintenance (Work required to avoid breaking changes or harm to project's status quo), P0 (Critical: Tackled by core team ASAP)

Comments

@guseggert (Contributor) commented Dec 14, 2022

I've been profiling a large hydra-booster deployment and recently enabled the built-in Resource Manager metrics. This caused CPU usage to increase significantly due to more frequent GC, which in turn appears to be caused by the large amount of garbage generated when adding tags to OpenCensus metrics inside the Resource Manager:

[allocation profile screenshot: allocs]

(Note that this is using go-libp2p v0.21, which predates go-libp2p-resource-manager being consolidated into the go-libp2p repo, but the RM metrics code hasn't changed significantly in recent versions, so this is likely still a problem. I'm still working to upgrade the hydras to v0.24, after which I'll try again.)
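For context, a minimal, hypothetical sketch of the per-event OpenCensus tagging pattern that produces this kind of garbage (illustrative names only, not the actual rcmgr code): every call builds a fresh mutator slice and derives a new tag map and context.

```go
// Illustrative sketch only (assumed names, not the actual rcmgr code):
// recording an OpenCensus measurement with a per-event tag. Each call
// allocates a mutator slice, a derived tag map, and a context, which is
// what shows up as garbage on hot paths like stream open/close.
package rcmgrmetrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

var (
	keyDir      = tag.MustNewKey("dir") // assumed tag key
	mStreamsNew = stats.Int64("streams_new", "new streams", stats.UnitDimensionless)
)

func recordStreamOpened(ctx context.Context, inbound bool) {
	dir := "outbound"
	if inbound {
		dir = "inbound"
	}
	// tag.Upsert allocates a new Mutator, and RecordWithTags derives a
	// fresh tag map and context on every call.
	_ = stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(keyDir, dir)},
		mStreamsNew.M(1),
	)
}
```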

@marten-seemann (Contributor):

Interesting. This is a huge amount of garbage we're creating. Thanks for reporting, @guseggert!

Ideally, we should aim to make metrics recording allocation-free. This seems easy with tag.Upsert, but there might be other code paths where it's not as straightforward.

This is highly relevant for our broader metrics effort (#1356, #1910).

@guseggert (Contributor, Author):

Seems related: census-instrumentation/opencensus-go#1265

@marten-seemann (Contributor):

There doesn’t seem to be a lot of appetite for these optimizations in OpenCensus.

This would be an argument against using OpenCensus and for using Prometheus directly.

@guseggert (Contributor, Author):

Agreed. It doesn't matter how much flexibility OpenCensus provides if it's unusable in hot code paths.

@marten-seemann marten-seemann changed the title Resource manager metrics generate a lot of garbage rcmgr: OpenCensus metrics generate a lot of garbage Dec 15, 2022
@marten-seemann (Contributor):

Putting this on the list for the v0.25 release. Resource manager metrics are an integral part of our resource observability story; they need to “just work” under all workloads.

@guseggert (Contributor, Author) commented Dec 16, 2022

I did some profiling and benchmarking. The benchmark repeatedly opens a stream and then closes it and its conn. Adding the StatsTraceReporter causes a ~70% increase in allocations, with StatsTraceReporter accounting for 26% of allocated bytes. With some small changes to the StatsTraceReporter (mostly around caching mutators, sketched below), it accounts for 9% of allocated memory, which is definitely more tolerable and more in line with the overhead mentioned in census-instrumentation/opencensus-go#1265.

IIUC, the OpenCensus API has some features and flexibility that make this hard to optimize further without changing the OpenCensus API itself or limiting what users can do with OpenCensus. I can try something similar with Prometheus to compare.

Benchmark and optimizations:

Bench on the original code: 3f55b57
Bench with optimizations: 410f10e
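Roughly, the "caching mutators" change amounts to precomputing the []tag.Mutator slices for the small fixed set of tag values instead of rebuilding them per event. A hedged sketch with assumed names (not the actual diff in 410f10e):

```go
// Sketch of the caching idea (assumed names): the tag values come from a
// small fixed set, so the mutator slices can be built once at init and
// reused, leaving only RecordWithTags' internal tag-map derivation as
// per-event cost.
package rcmgrmetrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

var (
	keyDir      = tag.MustNewKey("dir")
	mStreamsNew = stats.Int64("streams_new", "new streams", stats.UnitDimensionless)

	// Precomputed once, reused on every call.
	inboundTags  = []tag.Mutator{tag.Upsert(keyDir, "inbound")}
	outboundTags = []tag.Mutator{tag.Upsert(keyDir, "outbound")}
)

func recordStreamOpened(ctx context.Context, inbound bool) {
	tags := outboundTags
	if inbound {
		tags = inboundTags
	}
	_ = stats.RecordWithTags(ctx, tags, mStreamsNew.M(1))
}
```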

@Wondertan (Contributor):

What's your opinion on OpenTelemetry? It seems like kubo has started using it; why not use it in libp2p?

@marten-seemann (Contributor):

@Wondertan there’s a discussion about that in #1356. As far as I can see, it’s not quite ready yet.

@Wondertan (Contributor):

Ok. If there is no good solution for OpenCensus, it might be worth considering the risk of switching to OpenTelemetry. We started using it in early summer, and since then the API of the actual meters has stayed the same, although the setup API (providers/exporters/etc.) did change and we had to do an update like celestiaorg/celestia-node#1537.

Regarding performance, I have yet to try profiling Otel personally, but there have been no complaints about garbage allocations and subsequent issues.

@marten-seemann (Contributor):

> It seems like kubo has started using it; why not use it in libp2p?

They're using it for tracing, not for metrics, as far as I can tell.

> We started using it in early summer, and since then the API of the actual meters has stayed the same, although the setup API (providers/exporters/etc.) did change and we had to do an update like celestiaorg/celestia-node#1537.

Interesting. I'm still hesitant to use an alpha version though, to be honest.

@MarcoPolo MarcoPolo changed the title rcmgr: OpenCensus metrics generate a lot of garbage rcmgr: Use prometheus for metrics (OpenCensus generates garbage) Jan 3, 2023
@MarcoPolo (Collaborator):

Updated the title. I'm on board with changing this to use the Prometheus SDK.
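For comparison, here is what the Prometheus version of the same pattern can look like (hedged sketch; metric and label names are made up, not the ones ultimately shipped in #2044): label children are resolved once up front, so the hot path is a single atomic add with no allocations.

```go
// Hedged sketch with made-up names: a CounterVec whose label children are
// resolved once with WithLabelValues, so recording on the hot path is just
// Counter.Inc(), which does not allocate.
package rcmgrmetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	streamsNew = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "rcmgr_streams_new_total",
			Help: "Number of streams opened.",
		},
		[]string{"dir"},
	)

	// Resolve the label combinations once; Inc() on these is allocation-free.
	streamsNewInbound  = streamsNew.WithLabelValues("inbound")
	streamsNewOutbound = streamsNew.WithLabelValues("outbound")
)

func init() {
	prometheus.MustRegister(streamsNew)
}

func recordStreamOpened(inbound bool) {
	if inbound {
		streamsNewInbound.Inc()
		return
	}
	streamsNewOutbound.Inc()
}
```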

@marten-seemann added the P0 (Critical: Tackled by core team ASAP) and kind/maintenance (Work required to avoid breaking changes or harm to project's status quo) labels Jan 27, 2023
@markg85 commented Jan 30, 2023

I've just started doing performance profiling (to figure out why Kubo is always so CPU hungry; 35% continuous usage at a minimum is just too high). One of my findings so far points to the resource manager: its metrics collection shows up quite high in the CPU usage graphs, exactly like @guseggert shows here too.

In my case I disabled the resource manager completely with LIBP2P_RCMGR=0. This is obviously wrong too, but it serves my goal of nailing down which part of Kubo is so heavy on resources. Disabling it doesn't magically make Kubo resource friendly at all; it merely reduces measurable CPU usage by a couple of percentage points.

@MarcoPolo (Collaborator):

Thanks for the data point. What do you see if you enable the resource manager but disable the metrics collection? (I believe the metrics should be off by default.)
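For anyone trying this, a rough sketch from memory of the v0.24-era rcmgr API (option and package names may differ between versions) of constructing the resource manager with limits enforced but without the stats trace reporter, so no per-event OpenCensus recording runs:

```go
// Rough sketch, assuming the v0.24-era API: the OpenCensus stats only run if
// a trace reporter is passed as an rcmgr option, so constructing the manager
// with just a limiter keeps limits enforced while skipping metrics recording.
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	limits := rcmgr.DefaultLimits.AutoScale()

	// No trace-reporter option here, so no per-event metrics recording.
	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		log.Fatal(err)
	}

	h, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
}
```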

@guseggert (Contributor, Author) commented Feb 2, 2023

I don't think Kubo has a knob for turning off the metrics (see here); we should add one to it. Or wait for the Prometheus metrics if that's imminent?

wjam added a commit to bacalhau-project/bacalhau that referenced this issue Feb 27, 2023
libp2p _used_ to support metrics using OpenCensus, but this was recently
changed to use Prometheus instead - libp2p/go-libp2p#1955.
Unfortunately, it is extremely difficult to get Prometheus metrics
into OpenTelemetry without running the external OTEL agent.

This re-implements the same metrics using OpenTelemetry using the
_new_ Prometheus names rather than the old OpenCensus naming.

Fixes #2059