Memory leak issue with Open Telemetry and jaeger exporter #1490

Closed · varun06 opened this issue Jan 25, 2021 · 9 comments · Fixed by #1687

Labels: area:trace (Part of OpenTelemetry tracing), bug (Something isn't working), pkg:API (Related to an API package)

Comments

varun06 commented Jan 25, 2021

We are using OpenTelemetry in a reverse proxy at scale and are seeing a memory leak.

func InitTracing(serviceName string, collectorEndpoint string, sample float64) (func(), error) {
	tc := propagation.TraceContext{}
	// Register the TraceContext propagator globally.
	otel.SetTextMapPropagator(tc)

	// Create and install Jaeger export pipeline.
	return jaeger.InstallNewPipeline(
		jaeger.WithCollectorEndpoint(collectorEndpoint),
		jaeger.WithProcess(jaeger.Process{
			ServiceName: serviceName,
			Tags: []label.KeyValue{
				label.String("exporter", "jaeger"),
			},
		}),
		jaeger.WithSDK(&sdktrace.Config{DefaultSampler: sdktrace.TraceIDRatioBased(sample)}),
	)
}

This is how the tracer is initialized at startup:

	if cfg.TracingEnabled {
		// Initialize tracing and set global tracer
		flush, err := tracing.InitTracing("odnd", cfg.TracingCollectorEndpoint, cfg.TracingSampleValue)
		if err != nil {
			proxyLogger.Error("Failed to initialize tracing collector", logger.Error(err))
		}
		if flush != nil {
			defer flush()
		}
	}

And this is how the global tracer is used when handling a request:

	tc := otel.GetTextMapPropagator()
	ctx := tc.Extract(originalReq.Context(), originalReq.Header)

	tracer := otel.Tracer(tracing.TracerName)

	tracingCtx, span := tracer.Start(ctx, "serveHTTP", trace.WithSpanKind(trace.SpanKindServer))
	originalReq = originalReq.WithContext(tracingCtx)
	defer span.End()

That is how we create a span; a few more child spans are then created under it. But we are seeing this:

[Screenshot, 2021-01-25 9:48 AM]

It goes up as we add more spans, and of course that impacts GC and CPU usage on the node. Can someone please have a look and let me know whether I am doing something wrong in my code or this is related to the library?

MrAlias added the bug (Something isn't working) label Jan 28, 2021
MrAlias (Contributor) commented Jan 28, 2021

Thanks for the bug report. Have you been able to see the error while running with a profiler?

Also, when you defer the end of the span, does that function ever return?
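
For anyone trying to capture such a profile: a minimal sketch of exposing the standard library's net/http/pprof endpoints from a long-running Go service (the port and layout here are illustrative, not from this project):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port while the service runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// The proxy itself would run here; block forever for the sketch.
	select {}
}

A heap profile can then be pulled with go tool pprof http://localhost:6060/debug/pprof/heap, and two snapshots taken some time apart can be compared with pprof's -diff_base flag.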

varun06 (Author) commented Feb 6, 2021

Hey @MrAlias

It seems the problem was that we called GetTracer every time we created a new span (a bug on our side). But I was wondering: shouldn't GetTracer return a singleton?

MrAlias (Contributor) commented Feb 9, 2021

> Hey @MrAlias
>
> It seems the problem was that we called GetTracer every time we created a new span (a bug on our side). But I was wondering: shouldn't GetTracer return a singleton?

Thanks for looking into this. The Tracer returned from GetTracer should indeed be a reference to the same Tracer value if it is named the same. Do you have evidence this is not happening? That definitely sounds like a bug if so.
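
To make that contract concrete, a minimal sketch against the SDK tracer provider (the tracer name is arbitrary; whether the pre-install global delegate behaved this way at the time is exactly what this issue is about):

package main

import (
	"fmt"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Requesting a tracer with the same name twice from the SDK provider is
	// expected to return the same underlying Tracer rather than a new value.
	tp := sdktrace.NewTracerProvider()

	t1 := tp.Tracer("proxy")
	t2 := tp.Tracer("proxy")

	fmt.Println(t1 == t2) // expected: true
}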

varun06 (Author) commented Feb 10, 2021

I have lost the pprof output that showed it, but I can share the problematic code that was causing this.

func AddSpanToRequest(req *http.Request, spanName string) (*http.Request, trace.Span) {
	var (
		span               trace.Span
		reqCtx, tracingCtx context.Context
		ok                 bool
	)

	if req == nil {
		return nil, nil
	}

	reqCtx = req.Context()
	val := reqCtx.Value(ContextKey)
	if tracingCtx, ok = val.(context.Context); !ok {
		tracingCtx = context.Background()
	}

	// Bug: this looks up the global tracer on every call.
	tracer := global.Tracer(TracerName)
	tracingCtx, span = tracer.Start(tracingCtx, spanName)
	req = req.WithContext(context.WithValue(reqCtx, ContextKey, tracingCtx))
	return req, span
}

and it was getting called every time someone created a new span. When I profiled it, I remember seeing multiple tracers.
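
For reference, a hedged sketch of the corrected helper: the tracer is resolved once at package level and reused, using the otel.Tracer entry point shown earlier in the thread; TracerName and ContextKey here are stand-ins for the project's own definitions:

package tracing

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// TracerName and ContextKey are illustrative stand-ins.
const TracerName = "proxy"

type ctxKeyType struct{}

var ContextKey ctxKeyType

// tracer is looked up once; the global provider delegates to the SDK once it is installed.
var tracer trace.Tracer = otel.Tracer(TracerName)

// AddSpanToRequest starts a span on the request's stored tracing context and
// writes the updated context back onto the request.
func AddSpanToRequest(req *http.Request, spanName string) (*http.Request, trace.Span) {
	if req == nil {
		return nil, nil
	}

	reqCtx := req.Context()
	tracingCtx, ok := reqCtx.Value(ContextKey).(context.Context)
	if !ok {
		tracingCtx = context.Background()
	}

	tracingCtx, span := tracer.Start(tracingCtx, spanName)
	req = req.WithContext(context.WithValue(reqCtx, ContextKey, tracingCtx))
	return req, span
}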

brancz commented Feb 16, 2021

We are experiencing the exact same problem. We continuously profile our entire infrastructure, so we have profiling data available for everything. Let us know what would be useful to you! :)

Process start: https://share.polarsignals.com/7aef522/
Right before OOM: https://share.polarsignals.com/5bf8ce4/
Diff of the two profiles: https://share.polarsignals.com/8945fcd/

This data also strongly suggests that something within the global tracer is leaking, but it's very possible that we're doing something wrong. Let me know what else I can try or provide that would be helpful!

varun06 (Author) commented Feb 16, 2021

@brancz Thanks for chiming in. I couldn't see anything with those links, but yes, we saw the same behavior. If I get some time this week I will try to read through the code and see if I can find it; I am also planning to replicate this situation in a test environment to see what is causing it.
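
One way to replicate this in isolation (a hedged sketch, not code from the thread): call the global tracer lookup in a tight loop and compare heap statistics before and after; whether the heap grows depends on the library version in use.

package main

import (
	"fmt"
	"runtime"

	"go.opentelemetry.io/otel"
)

func main() {
	var before, after runtime.MemStats

	runtime.GC()
	runtime.ReadMemStats(&before)

	// Simulate per-request tracer lookups, as the proxy helper above and the
	// otelgrpc interceptor were doing.
	for i := 0; i < 1000000; i++ {
		_ = otel.Tracer("proxy")
	}

	runtime.GC()
	runtime.ReadMemStats(&after)

	fmt.Printf("heap in use: before=%d bytes, after=%d bytes\n", before.HeapInuse, after.HeapInuse)
}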

danp commented Feb 27, 2021

It's entirely possible I'm missing context, and apologies in advance if so! But looking at the profiles @brancz shared, I'm left wondering when things added to p.tracers here would ever be removed:

t := &tracer{name: name, opts: opts}
p.tracers = append(p.tracers, t)

If, for example, the grpc UnaryClientInterceptor is calling Tracer like so for every request:

https://github.com/open-telemetry/opentelemetry-go-contrib/blob/1c687598a61840d67e56c9c7707af4997e1a057b/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L81-L84

When would its associated tracer in p.tracers be removed?
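
For illustration, a simplified sketch of one way a delegating provider can avoid unbounded growth: key placeholder tracers by name so repeated calls reuse a single entry. This is only the general idea, not a claim about how #1687 actually implements it.

package main

import (
	"fmt"
	"sync"
)

// tracer mirrors the placeholder type from the snippet above (fields illustrative).
type tracer struct {
	name string
}

// provider caches one placeholder tracer per name, so per-request calls to
// Tracer(name), such as those made by the otelgrpc interceptor, reuse a single
// entry instead of appending to an ever-growing slice.
type provider struct {
	mtx     sync.Mutex
	tracers map[string]*tracer
}

func (p *provider) Tracer(name string) *tracer {
	p.mtx.Lock()
	defer p.mtx.Unlock()

	if p.tracers == nil {
		p.tracers = make(map[string]*tracer)
	}
	if t, ok := p.tracers[name]; ok {
		return t
	}
	t := &tracer{name: name}
	p.tracers[name] = t
	return t
}

func main() {
	p := &provider{}
	fmt.Println(p.Tracer("otelgrpc") == p.Tracer("otelgrpc")) // true: one entry per name
}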

Aneurysm9 self-assigned this Mar 9, 2021
Aneurysm9 added the area:trace (Part of OpenTelemetry tracing) and pkg:API (Related to an API package) labels Mar 9, 2021
bogdandrutu added commits to bogdandrutu/opentelemetry-go that referenced this issue Mar 10, 2021
MrAlias pushed a commit that referenced this issue Mar 10, 2021
michelaquino commented

Hi @bogdandrutu, do you have an estimate for when this fix will be released?

MrAlias (Contributor) commented Mar 18, 2021

@michelaquino we are hoping to make a release by the end of the week that will contain this fix. Please reach out to us in the CNCF otel-go Slack channel if you have any more questions about the release.
