[exporterhelper] record metric should log the number of log records before the data are sent to the consumers downstream #10402

Merged: 4 commits merged into open-telemetry:main, Jul 26, 2024

Conversation

@zpzhuSplunk (Contributor) commented Jun 13, 2024

Description

The sender metric within the exporterhelper should measure the number of items coming into the sender, not whatever is left after the items have been handled downstream, since downstream components may mutate the data. An example of this is provided as a unit test within this PR.

This PR also addresses nil panics that some users are experiencing.
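
As a rough sketch of the intended ordering (the lewo, obsrep, and nextSender names follow the exporterhelper snippet quoted later in this thread, and c is assumed to be the operation context set up earlier in the same method; this is illustrative, not the exact merged diff), the item count is captured before the request is handed to the next sender:

	// Count first: once nextSender.send returns, a mutating downstream
	// component may have emptied or released req, so calling ItemsCount()
	// at that point can under-count or panic.
	numLogRecords := req.ItemsCount()
	err := lewo.nextSender.send(c, req)
	lewo.obsrep.EndLogsOp(c, numLogRecords, err)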

Link to tracking issue

Fixes #10033

Testing

Existing test cases should cover this code change.

Unit test added

Documentation

@zpzhuSplunk requested review from a team and @bogdandrutu, June 13, 2024 17:23
@splunkericl (Contributor)

Hey @dmitryax and @atoulme, just to give a bit more context: our team was able to reproduce the problem only when sending a large amount of data (500 GB per day per instance) through. The theory in the description fits, since this can only happen under a very rare race condition. Can you take a look and see if this change is OK?

@atoulme (Contributor) commented Jun 14, 2024

The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

codecov bot commented Jun 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.24%. Comparing base (49ea32b) to head (862d293).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #10402   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         403      403           
  Lines       18720    18723    +3     
=======================================
+ Hits        17268    17271    +3     
  Misses       1097     1097           
  Partials      355      355           


github-actions bot commented Jul 3, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions bot added and then removed the Stale label, Jul 3, 2024
@grandwizard28

We are facing the same issue with a custom receiver.

...
logs, err := receiver.parser.Parse(body)
if err != nil {
	writeError(w, err, http.StatusBadRequest)
	return
}

// At this point, the receiver has accepted the payload
ctx := receiver.obsreport.StartLogsOp(req.Context())
err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

if err != nil {
	writeError(w, err, http.StatusInternalServerError)
	return
}
...

Panic happens here:

receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

Can we investigate this? Or could it at least be mentioned somewhere in the documentation that this can happen?
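
One way to avoid the panic described above, sketched against the same handler shape (logRecordCount is a local variable introduced here purely for illustration): capture the count before handing the logs to the next consumer, since a mutating downstream component may leave the payload in a state where LogRecordCount() panics or under-counts.

	ctx := receiver.obsreport.StartLogsOp(req.Context())
	// Count before ConsumeLogs: a downstream consumer is allowed to mutate
	// the data, so logs may no longer be safe to inspect afterwards.
	logRecordCount := logs.LogRecordCount()
	err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
	receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logRecordCount, err)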

@atoulme (Contributor) commented Jul 16, 2024

@grandwizard28 that seems like a different issue, because it happens outside of exporterhelper. Please open a new issue and include your version and pipeline configuration. You can redact your custom receiver and just post the code snippet you reference here.

@shivanshuraj1333 (Member)

> The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

@atoulme I think we can merge this PR as it is, since the race condition happens when the request has already been completed (and possibly mutated downstream) by the time the next line runs. Catching this in a unit test would be tricky...

	err := lewo.nextSender.send(c, req)
	lewo.obsrep.EndLogsOp(c, req.ItemsCount(), err)

WDYT?

@atoulme (Contributor) commented Jul 16, 2024

If we do this:

  • we need a unit test, and the expectation that it is not safe to call ItemsCount on any logs/metrics/traces after consumption needs to be spelled out clearly;
  • we should do this across all signals, not just logs.

Catching this in a unit test is not particularly tricky if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.
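
A minimal sketch of the kind of test discussed here (not the test added in this PR; the package name is hypothetical, and it demonstrates the miscount rather than the nil panic): a consumer that mutates the payload, the way a component declaring MutatesData: true is allowed to, makes a count taken after ConsumeLogs disagree with the count taken before.

package exporterhelpertest // hypothetical package name, for illustration only

import (
	"context"
	"testing"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

func TestCountAfterMutatingConsumer(t *testing.T) {
	// A consumer that drains the payload it receives, as a component that
	// declares MutatesData: true is allowed to do.
	mutating, err := consumer.NewLogs(func(_ context.Context, ld plog.Logs) error {
		sink := plog.NewLogs()
		ld.ResourceLogs().MoveAndAppendTo(sink.ResourceLogs())
		return nil
	}, consumer.WithCapabilities(consumer.Capabilities{MutatesData: true}))
	if err != nil {
		t.Fatal(err)
	}

	ld := plog.NewLogs()
	ld.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty()

	before := ld.LogRecordCount() // 1: what actually entered the pipeline
	if err := mutating.ConsumeLogs(context.Background(), ld); err != nil {
		t.Fatal(err)
	}
	after := ld.LogRecordCount() // 0: counting after the send under-reports

	if before != 1 || after != 0 {
		t.Fatalf("expected before=1, after=0; got before=%d, after=%d", before, after)
	}
}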

@splunkericl (Contributor)

@atoulme yeah, it is understandable to want a unit test that re-creates the failure. Our team looked into it for a while but couldn't really reproduce the problem in a unit test. Even in production it is only reproducible under a specific scenario (a particular pipeline setup and throughput).

We changed our internal pipeline so that it doesn't happen in production for now. We tried and weren't able to re-create the failure in a unit test. If anyone else has suggestions, please feel free to post them.

@shivanshuraj1333 (Member)

> Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.

I'm talking about the changes in this PR. In a unit test, mimicking an actual send method and making that send complete before req.ItemsCount() is called would be nondeterministic, and I think that's what @splunkericl is also suggesting.

Any other way to mimic it would involve emptying the req explicitly before calling req.ItemsCount(), which would trigger a nil pointer panic, but I don't think we need that unit test.

That's why I said that, IMO, we can merge it without a unit test; also, internally, we have patched this in our custom receiver.

@zpzhuSplunk changed the title from "[exporterhelper] fix panic introduced by nil pointer in send" to "[exporterhelper] record metric should log the number of log records before the data are sent to the consumers downstream", Jul 19, 2024
@atoulme (Contributor) commented Jul 23, 2024

That looks good - can you please add the same test and fix for metrics and traces? Thanks!

@atoulme (Contributor) left a review comment

LGTM

@zpzhuSplunk (Contributor, Author)

Hi @dmitryax, mind taking a look?

@dmitryax (Member) left a review comment

LGTM

@dmitryax merged commit 0462e5c into open-telemetry:main, Jul 26, 2024
50 checks passed
github-actions bot added this to the next release milestone, Jul 26, 2024