Memory leaks for Sum metric exemplars #31683

Closed
tiithansen opened this issue Mar 11, 2024 · 8 comments · Fixed by #32210
Labels
bug (Something isn't working) · connector/spanmetrics

Comments

@tiithansen

Component(s)

connector/spanmetrics

What happened?

Description

There is a memory leak if exemplars are enabled. In connector.go, histogram exemplars are reset with every export, but sum metric exemplars are not.
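
For illustration only, a minimal sketch (not the connector's actual code) of how an exemplar slice on a sum data point grows without bound when nothing ever clears it between flushes, using the collector's pdata API; the metric name and values are made up:

package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// Hypothetical stand-in for a counter span metric kept across flushes.
	m := pmetric.NewMetric()
	m.SetName("calls")
	dp := m.SetEmptySum().DataPoints().AppendEmpty()

	// Simulate several export cycles: exemplars are appended each batch,
	// but the slice is never reset, so it only ever grows.
	for batch := 0; batch < 3; batch++ {
		ex := dp.Exemplars().AppendEmpty()
		ex.SetDoubleValue(float64(batch))
	}

	fmt.Println("exemplars retained:", dp.Exemplars().Len()) // 3, and counting
}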

Expected Result

Memory usage should remain steady, depending on how many metrics are being generated.

Actual Result

Memory usage keeps growing until the process gets OOM-killed in the Kubernetes cluster.

Collector version

v0.95.0

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@tiithansen added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Mar 11, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

I'm not super familiar with this connector, but following your comment, it looks like the following lines should potentially be added:

			m.events.Reset()
			m.sums.Reset()

To this section:

p.resourceMetrics.ForEach(func(k resourceKey, m *resourceMetrics) {
	// Exemplars are only relevant to this batch of traces, so must be cleared within the lock
	if !p.config.Histogram.Disable {
		m.histograms.Reset(true)
	}
	// If metrics expiration is configured, remove metrics that haven't been seen for longer than the expiration period.
	if p.config.MetricsExpiration > 0 {
		if now.Sub(m.lastSeen) >= p.config.MetricsExpiration {
			p.resourceMetrics.Remove(k)
		}
	}
})

This would need to be confirmed through testing, though, preferably a long-running test with memory analysis on the collector to ensure this actually resolves it.
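
Putting the two snippets together, a sketch of how the flush path might look with the suggested resets added (field and method names are taken from the snippets above and may differ between collector versions):

p.resourceMetrics.ForEach(func(k resourceKey, m *resourceMetrics) {
	// Exemplars are only relevant to this batch of traces, so must be cleared within the lock
	if !p.config.Histogram.Disable {
		m.histograms.Reset(true)
	}
	// Suggested addition: also drop exemplars accumulated on event and sum metrics,
	// mirroring the histogram reset above.
	m.events.Reset()
	m.sums.Reset()
	// If metrics expiration is configured, remove metrics that haven't been seen for longer than the expiration period.
	if p.config.MetricsExpiration > 0 {
		if now.Sub(m.lastSeen) >= p.config.MetricsExpiration {
			p.resourceMetrics.Remove(k)
		}
	}
})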

@crobert-1 removed the needs triage (New item requiring triage) label on Mar 14, 2024
@tiithansen
Author

We are running a slightly modified version (adding exemplars only if a specific attribute is set on the span) in production, with 150k spans per second going through this connector. Before applying the fix it got OOM-killed (8 GB limit on each of 5 instances, 40 GB total) within half an hour or so. After applying the change, it has been steady at 200-300 MB for days now.

@crobert-1
Member

Would you be able to share your fix, or even post a PR?

@tiithansen
Author

I will create a PR soon.

@tcaty

tcaty commented Apr 2, 2024

Hi @crobert-1 and @tiithansen! I can confirm that this issue really exists. We use the Helm chart opentelemetry-collector-0.82.0 with otel/opentelemetry-collector-contrib:0.95.0 to deploy it in a Kubernetes cluster. I'll share the parts of our config that probably relate to this issue.

values.yml
mode: "deployment"

config:
  receivers:
    otlp:
      protocols:
        http: {}
    prometheus:
      config:
        scrape_configs:
          - job_name: otel-collector-metrics
            scrape_interval: 15s
            static_configs:
            - targets: ['localhost:8888']

  processors:
    memory_limiter:
      check_interval: 1s
      limit_mib: 4000
      spike_limit_percentage: 20
    batch: {}

  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
      enable_open_metrics: true

  connectors:
    spanmetrics:
      exemplars:
        enabled: true

  extensions:
    pprof:
      endpoint: "0.0.0.0:1777"

  service:
    ...

resources:
  requests:
    cpu: 100m
    memory: 2Gi
  limits:
    memory: 4Gi

useGOMEMLIMIT: true

Our metrics between 6:00 a.m. and 8:00 a.m. are in the screenshots below; maybe they will be useful.

screenshots

Crucial points:

  • 6:00 a.m. - otel-collector started with minimal load
  • 7:00 a.m. - otel-collector reached its limits

Pod resources (screenshot)

Pprof at 6:00 a.m. (screenshot)

Pprof at 7:00 a.m. (screenshot)

Otel-collector metrics (screenshots)

We'll try to run otel-collector without exemplar generation, and I'll be back with feedback on resource usage.

@swar8080
Contributor

swar8080 commented Apr 2, 2024

@tcaty configuring metrics_expiration, added in 0.97, helps a bit by discarding exemplars for infrequently received spans.
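
A minimal sketch of that workaround in the spanmetrics connector config (the 5m value is just an illustration):

connectors:
  spanmetrics:
    exemplars:
      enabled: true
    # Series that have not been seen for this long are dropped, together with their exemplars.
    metrics_expiration: 5m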

Also, generating delta temporality span metrics and then converting them to cumulative would solve the problem if either of these becomes available:

The delta span metrics only keep exemplars received since the last metrics_flush_interval.
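
And a sketch of the delta-temporality side of that idea (the cumulative conversion is what the items above would provide; values are illustrative):

connectors:
  spanmetrics:
    exemplars:
      enabled: true
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    # With delta temporality, accumulated state (including exemplars) only covers one flush interval.
    metrics_flush_interval: 15s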

@swar8080
Contributor

swar8080 commented Apr 2, 2024

@tiithansen are you still planning to submit a fix? If not, I can give this a go.

evan-bradley pushed a commit that referenced this issue Apr 16, 2024
…lushing (#32210)

**Description:** 
Discard counter span metric exemplars after flushing to avoid unbounded
memory growth when exemplars are enabled.

This is needed because #28671 added exemplars to counter span metrics,
but they are not removed after each flush interval like they are for
histogram span metrics.

Note: this may change behaviour if using the undocumented
`exemplars.max_per_data_point` configuration option, since exemplars
would no longer be accumulated up until that count. However, I'm unclear
on the value of that feature since there's no mechanism to replace old
exemplars with newer ones once the maximum is reached. Maybe a follow-up
enhancement is only discarding exemplars once the maximum is reached, or
using a circular buffer to replace them. That could be useful for
pull-based exporters like `prometheusexporter`, as retaining exemplars
for longer would decrease the chance of them getting discarded before
being scraped.
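
As a sketch of the circular-buffer idea mentioned above (not part of the merged change; the generic type and the latency values are illustrative assumptions):

package main

import "fmt"

// ringBuffer keeps at most its capacity worth of items, overwriting the
// oldest entry once full. In the connector, T could be whatever type holds
// an exemplar; float64 latencies are used here purely for illustration.
type ringBuffer[T any] struct {
	items []T
	next  int // index of the oldest item, i.e. the next one to overwrite
}

func newRingBuffer[T any](max int) *ringBuffer[T] {
	return &ringBuffer[T]{items: make([]T, 0, max)}
}

func (r *ringBuffer[T]) Add(item T) {
	if cap(r.items) == 0 {
		return // a maximum of zero means nothing is retained
	}
	if len(r.items) < cap(r.items) {
		r.items = append(r.items, item)
		return
	}
	// Full: replace the oldest exemplar instead of refusing the new one.
	r.items[r.next] = item
	r.next = (r.next + 1) % cap(r.items)
}

// Snapshot returns the currently retained items.
func (r *ringBuffer[T]) Snapshot() []T {
	out := make([]T, len(r.items))
	copy(out, r.items)
	return out
}

func main() {
	buf := newRingBuffer[float64](3)
	for _, latency := range []float64{0.1, 0.2, 0.3, 0.4} {
		buf.Add(latency)
	}
	fmt.Println(buf.Snapshot()) // [0.4 0.2 0.3]: 0.1 was the oldest and got replaced
}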

**Link to tracking Issue:** 

Closes #31683 

**Testing:** 
- Unit tests
- Running the collector and setting a breakpoint to verify the exemplars are being cleared in between flushes. Before the change I could see the exemplar count continually growing.

**Documentation:**
Updated the documentation to mention that exemplars are added to all span metrics, and noted when they are discarded.