[exporterhelper] record metric should log the number of log records before the data are sent to the consumers downstream #10402

Merged: 4 commits merged into open-telemetry:main, Jul 26, 2024

Conversation

@zpzhuSplunk (Contributor) commented Jun 13, 2024

Description

The sender metric within the exporterhelper should measure the number of items coming into the sender, not whatever is left after the items have been handled downstream, since downstream components may mutate the data. An example of this is provided as a unit test within this PR.

This PR also addresses nil panics that some users are experiencing.
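
As a rough sketch of the intended ordering (the lewo, obsrep, and nextSender names follow the exporterhelper snippet quoted later in this thread, and c is assumed to be the operation context set up earlier in the same method; this is illustrative, not the exact merged diff), the item count is captured before the request is handed to the next sender:

	// Count first: once nextSender.send returns, a mutating downstream
	// component may have emptied or released req, so calling ItemsCount()
	// at that point can under-count or panic.
	numLogRecords := req.ItemsCount()
	err := lewo.nextSender.send(c, req)
	lewo.obsrep.EndLogsOp(c, numLogRecords, err)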

Link to tracking issue

Fixes #10033

Testing

Existing test cases should cover this code change.

Unit test added

Documentation

@zpzhuSplunk requested review from a team and @bogdandrutu, June 13, 2024 17:23
@splunkericl (Contributor)

Hey @dmitryax and @atoulme, just to give a bit more context: our team was able to reproduce the problem only when sending a large amount of data (500 GB per day per instance) through. The theory in the description fits, since this can only happen under a very rare race condition. Can you take a look and see if this change is OK?

@atoulme (Contributor) commented Jun 14, 2024

The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

codecov bot commented Jun 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.24%. Comparing base (49ea32b) to head (862d293).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #10402   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         403      403           
  Lines       18720    18723    +3     
=======================================
+ Hits        17268    17271    +3     
  Misses       1097     1097           
  Partials      355      355           


github-actions bot commented Jul 3, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions bot added and then removed the Stale label, Jul 3, 2024
@grandwizard28

We are facing the same issue with a custom receiver.

...
logs, err := receiver.parser.Parse(body)
if err != nil {
	writeError(w, err, http.StatusBadRequest)
	return
}

// At this point, the receiver has accepted the payload
ctx := receiver.obsreport.StartLogsOp(req.Context())
err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

if err != nil {
	writeError(w, err, http.StatusInternalServerError)
	return
}
...

Panic happens here:

receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

Can we investigate this? Or could it at least be mentioned somewhere in the documentation that this can happen?
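
One way to avoid the panic described above, sketched against the same handler shape (logRecordCount is a local variable introduced here purely for illustration): capture the count before handing the logs to the next consumer, since a mutating downstream component may leave the payload in a state where LogRecordCount() panics or under-counts.

	ctx := receiver.obsreport.StartLogsOp(req.Context())
	// Count before ConsumeLogs: a downstream consumer is allowed to mutate
	// the data, so logs may no longer be safe to inspect afterwards.
	logRecordCount := logs.LogRecordCount()
	err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
	receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logRecordCount, err)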

@atoulme (Contributor) commented Jul 16, 2024

@grandwizard28 that seems like a different issue, because it happens outside of exporterhelper. Please open a new issue and include your version and pipeline configuration. You can redact your custom receiver and just post the code snippet you reference here.

@shivanshuraj1333 (Member)

> The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

@atoulme I think we can merge this PR as it is, since the race condition happens when the request has already been completed (and possibly mutated downstream) by the time the next line runs. Catching this in a unit test would be tricky...

	err := lewo.nextSender.send(c, req)
	lewo.obsrep.EndLogsOp(c, req.ItemsCount(), err)

WDYT?

@atoulme (Contributor) commented Jul 16, 2024

If we do this:

  • we need a unit test, and the expectation that it is not safe to call ItemsCount on any logs/metrics/traces after consumption needs to be spelled out clearly;
  • we should do this across all signals, not just logs.

Catching this in a unit test is not particularly tricky if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.
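
A minimal sketch of the kind of test discussed here (not the test added in this PR; the package name is hypothetical, and it demonstrates the miscount rather than the nil panic): a consumer that mutates the payload, the way a component declaring MutatesData: true is allowed to, makes a count taken after ConsumeLogs disagree with the count taken before.

package exporterhelpertest // hypothetical package name, for illustration only

import (
	"context"
	"testing"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

func TestCountAfterMutatingConsumer(t *testing.T) {
	// A consumer that drains the payload it receives, as a component that
	// declares MutatesData: true is allowed to do.
	mutating, err := consumer.NewLogs(func(_ context.Context, ld plog.Logs) error {
		sink := plog.NewLogs()
		ld.ResourceLogs().MoveAndAppendTo(sink.ResourceLogs())
		return nil
	}, consumer.WithCapabilities(consumer.Capabilities{MutatesData: true}))
	if err != nil {
		t.Fatal(err)
	}

	ld := plog.NewLogs()
	ld.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty()

	before := ld.LogRecordCount() // 1: what actually entered the pipeline
	if err := mutating.ConsumeLogs(context.Background(), ld); err != nil {
		t.Fatal(err)
	}
	after := ld.LogRecordCount() // 0: counting after the send under-reports

	if before != 1 || after != 0 {
		t.Fatalf("expected before=1, after=0; got before=%d, after=%d", before, after)
	}
}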

@splunkericl (Contributor)

@atoulme yeah, it is understandable to want a unit test that re-creates the failure. Our team looked into it for a while but couldn't really reproduce the problem in a unit test. Even in production it is only reproducible under a specific scenario (a particular pipeline setup and throughput).

We changed our internal pipeline so that it doesn't happen in production for now. We tried and weren't able to re-create the failure in a unit test. If anyone else has suggestions, please feel free to post them.

@shivanshuraj1333 (Member)

> Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.

I'm talking about the changes in this PR. In a unit test, mimicking an actual send method and making that send complete before req.ItemsCount() is called would be nondeterministic, and I think that's what @splunkericl is also suggesting.

Any other way to mimic it would involve emptying the req explicitly before calling req.ItemsCount(), which would trigger a nil pointer panic, but I don't think we need that unit test.

That's why I said that, IMO, we can merge it without a unit test; also, internally, we have patched this in our custom receiver.

@zpzhuSplunk changed the title from "[exporterhelper] fix panic introduced by nil pointer in send" to "[exporterhelper] record metric should log the number of log records before the data are sent to the consumers downstream", Jul 19, 2024
@atoulme (Contributor) commented Jul 23, 2024

That looks good - can you please add the same test and fix for metrics and traces? Thanks!

@atoulme (Contributor) left a review comment

LGTM

@zpzhuSplunk (Contributor, Author)

Hi @dmitryax, mind taking a look?

@dmitryax (Member) left a review comment

LGTM

@dmitryax merged commit 0462e5c into open-telemetry:main, Jul 26, 2024
50 checks passed
github-actions bot added this to the next release milestone, Jul 26, 2024