
fix: OTEL context lost in subthreads of parallel bulk calls #2616

Merged · 22 commits · Sep 3, 2024

Conversation

claudinoac (Contributor)

When using OpenTelemetry to trace ES requests, all of the parallel bulk chunks fall outside the scope of the main span. This happens because they lose the parent context when they are processed in newly spawned threads (which is how parallel bulk chunks run), resulting in what we see below:

[Screenshot 2024-07-15 at 18:53:27: trace with the parallel bulk chunk spans detached from the main span]

To keep the same context flowing through all subthreads, I'm using context propagation (https://opentelemetry.io/docs/languages/python/propagation/) to re-establish the previous context in each newly spawned thread of parallel bulk. That keeps these subthreads within the flow of the parent context, alongside all the other calls (search/delete/index/etc.).
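For reference, here is a minimal, self-contained sketch of that propagation pattern (illustration only, not the client's code; the `work` function and span names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

from opentelemetry import context, propagate, trace

tracer = trace.get_tracer(__name__)


def work(carrier: dict) -> None:
    # Re-establish the parent context that was injected into the carrier,
    # then create spans as usual in this worker thread.
    token = context.attach(propagate.extract(carrier))
    try:
        with tracer.start_as_current_span("bulk_chunk"):
            pass  # send one chunk of bulk actions here
    finally:
        context.detach(token)


with tracer.start_as_current_span("parallel_bulk"):
    carrier: dict = {}
    propagate.inject(carrier)  # capture the current context in the main thread
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(4):
            pool.submit(work, carrier)
```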

Here's an example for the same batch job after applying this fix:

[Screenshot 2024-07-15 at 18:40:11: the same batch job after the fix, with all bulk chunk spans nested under the parent span]

I tried to keep the changes scoped to the OTEL class itself and to limit the test to the call that recovers the context, though I'm aware that might not be the ideal approach.

I look forward to reviews and suggestions :)


cla-checker-service bot commented Jul 23, 2024

💚 CLA has been signed


A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

@pquentin (Member)

buildkite test this please

@pquentin (Member)

Thank you! Excellent find. I can take a proper look next week, but in the meantime:

  1. lint is still failing. You can reproduce locally with nox -rs lint.
  2. The integration tests are also failing with AttributeError: 'FailingBulkClient' object has no attribute '_otel'. You can't see those CI logs right now, sorry. It's possible but annoying to reproduce locally with nox -rs test-3.12 and nox -rs test_otel-3.12. But you need an Elasticsearch cluster listening at localhost:9200. (I can fix this next week myself, no worries.)

@claudinoac (Contributor, Author) commented Jul 26, 2024

@pquentin I've fixed the lint issues. It seems the results were off because I was running the linter on Python 3.8; the lint CI should pass now.

I'm not able to run the integration tests properly, though: they're throwing a bunch of SSL errors such as ValueError: TLS options require scheme to be 'https'.

Another error that is happening in quite a lot of the tests is BadRequestError(400, 'illegal_argument_exception', 'Cannot delete policy [kibana-event-log-policy]. It is in use by one or more indices: [.kibana-event-log-8.3.2-000012, .kibana-event-log-8.3.2-000011, .kibana-event-log-8.3.2-000010, .kibana-event-log-8.3.2-000013]').

The ES cluster is empty, configured with the basic recommended dev settings, and running in a Docker container, as is the library environment (I've put them on the same network and changed all the localhost references to {{container_name}} just to run the tests).

259 tests succeed, but the other 177 fail with the errors above.

@pquentin (Member)

buildkite test this please

1 similar comment
@pquentin (Member)

buildkite test this please

@pquentin (Member) left a comment

Sorry for the delay, I was on vacation last week. OK, so now that the CI is fixed, I took the time to look at the code in more detail. I'm not familiar with OpenTelemetry contexts and propagation: I just read the documentation today! So please tell me if what I'm saying does not make sense.

My main concern is that the propagation will only behave correctly if there's a context to recover from, but that is only true if one span was created already. If the first function you're calling is parallel_bulk, there won't be a span or context. Thankfully, this can be fixed by having the parallel_bulk call itself and all callers of _process_bulk_chunk create a span, which is something we want anyway. (Bonus points for covering elasticsearch/_async/helpers.py too.)

In other words, all bulk helpers should start with something like this:

with client._otel.span("parallel_bulk", inject_context=True):
    ...

(5 review comments on elasticsearch/_otel.py, marked outdated and resolved)
@pquentin (Member) commented Aug 1, 2024

As explained inline, I believe my comments will make more sense if you take a look at my top-level review: #2616 (review)

@pquentin (Member) commented Aug 6, 2024

Hey @claudinoac, did you get the chance to look at my latest comments?

@claudinoac (Contributor, Author)

For now, I'll submit the changes we agreed to make so far, and in the meantime we can discuss the optional saving of the context on the carrier.

@claudinoac (Contributor, Author) commented Aug 6, 2024

Ok @pquentin, I've made the latest changes based on your suggestions. I wasn't sure what to put into the endpoint_id and path_parts attributes, though, so I just put default values there: there wasn't a simple way to infer the endpoint being called (this is a very generic method), and I'm not exactly sure what path_parts means.

@pquentin (Member) commented Aug 7, 2024

Thanks. I will be able to take a proper look at this next week

@pquentin (Member) commented Aug 20, 2024

Sorry for the delay. This looks pretty good, but now we need to add a span to all helper functions that end up calling _process_bulk_chunk, not only parallel_bulk. That way, all bulk functions will have helpful spans even if the top-level application doesn't have any. We also need a better way to add that span: using client._otel.span is a bit of a hack that doesn't really fit. A quick POC of what this would look like for streaming_bulk:

[Screenshot: proof-of-concept diff adding a span to streaming_bulk]
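A rough standalone sketch of the same idea (illustrative names only, not the actual diff): the public helper opens its own span so a parent span always exists, and hands it down to the chunk processing.

```python
from opentelemetry import trace

tracer = trace.get_tracer("elasticsearch-py")


def _process_bulk_chunk(chunk, otel_span):
    # Send one chunk to Elasticsearch, reporting into the span it was given.
    for action in chunk:
        yield True, action


def streaming_bulk(actions, chunk_size=500):
    # The helper creates the span itself, so there is always a parent span
    # even when the calling application has not started one.
    otel_span = tracer.start_span("helpers.streaming_bulk")
    try:
        for i in range(0, len(actions), chunk_size):
            yield from _process_bulk_chunk(actions[i : i + chunk_size], otel_span)
    finally:
        otel_span.end()
```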

I'll work on getting this finished if that's OK for you.

@claudinoac (Contributor, Author)

@pquentin I'm not sure that's really necessary, as OTEL itself keeps them in the span stack, below their callers.
If those other methods don't split themselves into threads or processes the way parallel_bulk does, then you'd be wasting your time, as they probably already look the same as your screenshot shows.
Nonetheless, the issue I'm trying to address here is the subcalls of parallel bulk getting out of scope and polluting the overall trace, which IMHO is a much higher priority.
As an example, I'm dealing with a fairly complex application that continuously runs hundreds of data pipelines and millions of records per day, and these out-of-scope spans are not only polluting my tracing views but preventing me from properly tracing those pipelines.
I agree that the approach using the _otel object is kind of hacky, but that's due to the way OTEL was implemented here; I just tried to follow the pattern.
Feel free to make any changes.

@pquentin (Member)

buildkite test this please

@pquentin (Member)

buildkite test this please

@pquentin (Member) commented Aug 21, 2024

The current version is in line with my vision. I now need to add more robust tests and possibly do the same thing for async helpers.

@pquentin (Member)

buildkite test this please

@pquentin (Member)

buildkite test this please

@pquentin (Member) commented Sep 2, 2024

buildkite test this please

Instead of maintaining a global context, we simply pass the otel span
down and use it manually.
@pquentin (Member) commented Sep 3, 2024

buildkite test this please

@@ -322,6 +324,7 @@ def _process_bulk_chunk(
Tuple[_TYPE_BULK_ACTION_HEADER, _TYPE_BULK_ACTION_BODY],
]
],
otel_span: OpenTelemetrySpan,
@miguelgrinberg (Contributor) commented Sep 3, 2024

On a first look it seemed to me the otel_span attribute would be optional, to handle the case of apps not using otel. But I guess when otel is disabled this is just a fake span object that doesn't do anything?
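(For context: one common way to handle this, not necessarily the exact implementation in this client, is a thin wrapper that simply no-ops when instrumentation is disabled, along these lines:)

```python
from typing import Any, Optional

from opentelemetry.trace import Span


class OpenTelemetrySpan:
    """Span handle passed to the bulk helpers; no-ops when OTel is disabled."""

    def __init__(self, otel_span: Optional[Span]) -> None:
        self.otel_span = otel_span

    def set_attribute(self, key: str, value: Any) -> None:
        # When instrumentation is off, otel_span is None and this does nothing.
        if self.otel_span is not None:
            self.otel_span.set_attribute(key, value)
```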


@miguelgrinberg (Contributor) left a comment

LGTM

@pquentin (Member) commented Sep 3, 2024

@claudinoac Sorry for the delay! Yesterday, Miguel made me realize that we don't need context propagation at all here, since spans are serializable across threads. Using context propagation was also harmful: if two threads called parallel_bulk at the same time, the resulting span hierarchy would have been incorrect. I've implemented the simpler and more correct version today. Thanks for your patience.
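A minimal standalone sketch of that simpler approach (illustrative names, not the merged diff): the span created by parallel_bulk is handed directly to each worker thread and re-entered there, so no shared or propagated context is involved and concurrent parallel_bulk calls cannot pick up each other's context.

```python
from concurrent.futures import ThreadPoolExecutor

from opentelemetry import trace

tracer = trace.get_tracer("elasticsearch-py")


def _process_bulk_chunk(chunk, otel_span):
    # Re-enter the span owned by parallel_bulk; end_on_exit=False because the
    # helper that created the span is responsible for ending it.
    with trace.use_span(otel_span, end_on_exit=False):
        with tracer.start_as_current_span("bulk_chunk"):
            pass  # send the chunk to Elasticsearch


def parallel_bulk(actions, thread_count=4, chunk_size=500):
    with tracer.start_as_current_span("helpers.parallel_bulk") as otel_span:
        with ThreadPoolExecutor(max_workers=thread_count) as pool:
            for i in range(0, len(actions), chunk_size):
                pool.submit(_process_bulk_chunk, actions[i : i + chunk_size], otel_span)
```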

@pquentin pquentin merged commit d4df09f into elastic:main Sep 3, 2024
11 checks passed
github-actions bot pushed a commit that referenced this pull request Sep 3, 2024
Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit d4df09f)
@claudinoac (Contributor, Author)

@pquentin Yeah, I'm aware that they're serializable; in fact, that's how I handle them across threads, processes, and even across different applications. I'm not sure your approach here is going to fix the bug, though, so I'm going to run a few live tests just to make sure it addresses the right problem. And for the record, if context propagation is indeed a problem in race-condition scenarios, then the OTEL class shouldn't be a singleton at all.

@pquentin (Member) commented Sep 4, 2024

I can wait for your tests before releasing 8.15.1 if you prefer.

The OTel class isn't a singleton, but there's only one per client. It's simply not designed to carry state about spans.

pquentin pushed a commit that referenced this pull request Sep 5, 2024
Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit d4df09f)

Co-authored-by: Alisson Claudino <[email protected]>
@claudinoac (Contributor, Author)

@pquentin I've run the live tests and everything looks good. Thanks for getting this merged; you can proceed with the release :)
