Improve timeout and retry mechanic of exporters #4183

Open
wants to merge 17 commits into base: main
Conversation


@LarsMichelsen commented Sep 8, 2024

Description

I took a deep look into #3764 and tried to separate the individual changes. While I think the approach of that PR is good in general, I changed some aspects of it. For example, I don't agree with picking the lowest of the given timeouts (see here).

Even with the change split into multiple steps, it's still a big undertaking, so a reviewer will have to take some time to get into it. Nevertheless, I hope someone feels up to it.

Fixes #4043

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Several tests were added (retry logic, shutdown logic, new timeout export argument) and, of course, also executed :)

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@LarsMichelsen
Author

I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess there is some guidance needed on which names are acceptable as new public names and which are not. Any advice?

Regarding the changelog: Is it intended that I add the changelog entry?

@emdneto
Member

emdneto commented Sep 12, 2024

> I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess there is some guidance needed on which names are acceptable as new public names and which are not. Any advice?
>
> Regarding the changelog: Is it intended that I add the changelog entry?

In that case we need a changelog for this one

This is the first change in a chain of commits to rework the retry
mechanic. It is based on the work of open-telemetry#3764 and basically tries to land
the changes proposed by that monolithic commit step by step.

The plan is roughly to proceed in these steps:

* Extract retry mechanic from GRPC exporters
* Consolidate HTTP with GRPC exporter retry implementation
* Pipe timeout through RetryingExporter
* Make exporter lock protect the whole export instead of just a single iteration
* Make timeout float instead of int
* Add back-off with jitter

It's pretty likely that the plan will change along the way.
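
To make the first step more concrete, here is a minimal sketch, not the PR's actual code, of what a retry helper extracted from the gRPC exporters could look like. The class and exception names mirror the new public symbols mentioned later in this PR; the signatures, constants, and return type are assumptions.

```python
# Hypothetical sketch of the extracted retry helper; signatures and constants
# are assumptions, only the class and exception names come from the PR.
import time
from typing import Any, Callable


class RetryableExportError(Exception):
    """Raised by an export function to signal that the attempt may be retried."""


class RetryingExporter:
    def __init__(self, export_function: Callable[[Any], bool], max_attempts: int = 6):
        self._export_function = export_function
        self._max_attempts = max_attempts
        self._shutdown = False  # replaced by a threading.Event in a later step

    def shutdown(self) -> None:
        self._shutdown = True

    def export_with_retry(self, payload: Any) -> bool:
        for attempt in range(self._max_attempts):
            if self._shutdown:
                return False
            try:
                return self._export_function(payload)
            except RetryableExportError:
                # Plain exponential back-off; jitter and a real deadline are
                # added in later steps of the plan above.
                time.sleep(2**attempt)
        return False
```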
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.

This unifies the implementation of the OTLP exporters and the HTTP
metric exporter.

This unifies the implementation of the OTLP exporters and the HTTP
trace exporter.

Propagate the given timeout value from all exporters to the
RetryingExporter, which will hand it over to the individual
`_export` function call.

Currently the same unaltered value is handed over. In the following
commits the RetryingExporter will update the timeout handed over
to `_export` depending on the remaining time.
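
As an illustration of this intermediate state (a sketch with made-up names, not the actual diff): the exporter-level timeout is simply forwarded unchanged to every export call.

```python
# Hypothetical illustration of the intermediate state: the configured timeout
# reaches every _export call unaltered; shrinking it per attempt comes later.
from typing import Any, Callable


class _TimeoutForwardingExporter:
    def __init__(self, export_function: Callable[[Any, float], bool], timeout_s: float):
        self._export_function = export_function
        self._timeout_s = timeout_s

    def export_with_retry(self, payload: Any) -> bool:
        # Same unaltered value on every attempt for now.
        return self._export_function(payload, self._timeout_s)


def _export(payload: Any, timeout_s: float) -> bool:
    print(f"exporting {payload!r} with a {timeout_s}s timeout")
    return True


_TimeoutForwardingExporter(_export, timeout_s=10.0).export_with_retry({"spans": []})
```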
TL;DR: The time assertion in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was fixed at 4 seconds with 6 iterations, so the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown was not hit before the exporter's retrying was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explanation would be that it was intended to be
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.
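
A hedged sketch of that second interpretation, with a stand-in exporter rather than the real OTLP exporter and units kept in plain seconds for simplicity: the test would assert an upper bound on how long shutdown is allowed to block.

```python
# Sketch of the "shutdown returns within 30 seconds" interpretation; the
# exporter here is a stand-in, units are seconds to keep the example simple.
import time
import unittest


class _FakeExporter:
    def shutdown(self, timeout_millis: float = 30_000) -> None:
        # Pretend shutdown returns quickly, well before the lock timeout.
        time.sleep(0.05)


class TestShutdownReturnsInTime(unittest.TestCase):
    def test_shutdown_returns_within_lock_timeout(self):
        exporter = _FakeExporter()
        start_time = time.monotonic()
        exporter.shutdown()
        now = time.monotonic()
        # Upper bound: shutdown must return before the 30 s lock timeout expires.
        self.assertLessEqual(now, start_time + 30)


if __name__ == "__main__":
    unittest.main()
```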

My biggest issue here is that it tests the happy path where the retry
is over before the timeout, while there is no mechanism to ensure that the
exporter thread is stopped within a sensible time in case its current
exponential backoff is already in a longer sleep when the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a
32-second sleep, which could mean that a process needs 30 seconds to
signal the shutdown and an additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
recently been changed to be much shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assertion is still wrong, with an additional problem from open-telemetry#4014.

Previously the export lock was released by the export between retries,
which gave a shutdown the opportunity to interrupt the export depending
on which thread was first to get the lock again. This behavior is not
desirable, as it heavily relies on timing and makes things less
deterministic.

The retry loop still checks whether the shutdown flag is set and
breaks in case it was told to do so.

A few tests are still commented out and will be enabled later, once
the implementation is updated to implement the desired behavior.

The tests are, like other existing tests, timing-sensitive, which makes
them prone to failing on slow systems. It would certainly be better to get
rid of this issue. I haven't found another approach in the existing test
cases, so I stuck to the current pattern.

Replace the RetryingExporter shutdown flag and sleep with a
`threading.Event` to make the waiting time for the next retry
interruptible.

This enables us to change the semantics of the shutdown timeout
to work in a less surprising way. Previously the timeout worked
roughly like this:

1. Wait for the current export iteration to finish
2. Try to get the lock faster than the export thread
3. If the lock was acquired, the shutdown flag was set
4. If the lock was not acquired within the timeout, the shutdown flag
   was set anyway
5. The export thread would then pick this flag up after finishing the
   current retry cycle.

This led to hanging applications on shutdown, giving the applications
no way to influence the shutdown behavior. Users could not decide how
much time should be spent trying to prevent the loss of data.

This change allows shutdown to interrupt the current export's waiting call
so that the exporter can be shut down within a user-provided deadline.
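
A minimal sketch of the Event-based mechanism described here, not the PR's actual code: the class name, `shutdown(timeout_millis=...)` signature, and constants are assumptions.

```python
# Hypothetical sketch: an Event makes the inter-retry wait interruptible and
# lets shutdown() bound how long it blocks; names and signatures are assumed.
import threading
import time


class _InterruptibleRetrier:
    def __init__(self):
        self._shutdown = threading.Event()
        self._export_lock = threading.Lock()

    def export_with_retry(self, attempts: int = 6, backoff_s: float = 1.0) -> bool:
        with self._export_lock:
            for _ in range(attempts):
                if self._shutdown.is_set():
                    return False
                # ... attempt the export here ...
                # Event.wait() returns True as soon as shutdown() sets the event,
                # so a long back-off sleep no longer delays process shutdown.
                if self._shutdown.wait(backoff_s):
                    return False
            return False

    def shutdown(self, timeout_millis: float = 30_000) -> None:
        self._shutdown.set()  # interrupt any wait between retries immediately
        # Give an in-flight export at most the user-provided time to finish.
        deadline = time.monotonic() + timeout_millis / 1000
        acquired = self._export_lock.acquire(timeout=max(0.0, deadline - time.monotonic()))
        if acquired:
            self._export_lock.release()
```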

Derive a deadline from the initial timeout for each export_with_retry
call and try to stick to this time frame as closely as possible.

Each call to export_function will get the remaining time as its timeout
parameter and is expected to return quickly in case of an issue, so
that export_with_retry can take care of waiting and retrying.

This also removes the surprising deadline triggered by the max_value
of the exponential backoff.
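
A sketch of the deadline handling described here, under an assumed export_function signature: the remaining budget is recomputed before every attempt and passed down, so no back-off value can silently extend the overall export time.

```python
# Hypothetical deadline handling; the export_function signature is assumed.
import random
import time
from typing import Any, Callable


def export_with_retry(
    export_function: Callable[[Any, float], bool],
    payload: Any,
    timeout_s: float,
    max_attempts: int = 6,
) -> bool:
    deadline = time.monotonic() + timeout_s
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # overall budget exhausted
        # Each attempt only ever sees the time that is actually left.
        if export_function(payload, remaining):
            return True
        # Back-off with jitter, clamped so it cannot overrun the deadline.
        wait = min(2**attempt + random.uniform(0, 1), deadline - time.monotonic())
        if wait > 0:
            time.sleep(wait)
    return False
```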

This change allows the user to call retrying exports with individual
timeouts. This argument will be needed in the next step, where the gRPC
metric exporter's batch processing is changed to respect the timeout for
the whole export.

The timeout passed as an argument to the export method now overrides
the default value set during initialization, if given.

Additionally, also ensure that the batching in OTLPMetricExporter.export
respects the export timeout.
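
A rough, purely illustrative sketch of deadline-aware batching: the split helper and batch size are made up, the point is only that every batch call gets the shrinking remainder of one shared budget.

```python
# Illustrative only: _split_into_batches and the batch size are made up; the
# point is that every batch export gets the remaining part of one shared budget.
import time
from typing import Any, Iterator, List


def _split_into_batches(items: List[Any], size: int) -> Iterator[List[Any]]:
    for start in range(0, len(items), size):
        yield items[start : start + size]


def export_batched(export_one_batch, items: List[Any], timeout_s: float) -> bool:
    deadline = time.monotonic() + timeout_s
    for batch in _split_into_batches(items, size=512):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # the overall export timeout also bounds the batching loop
        if not export_one_batch(batch, timeout_s=remaining):
            return False
    return True
```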

The timeout passed as an argument to the export method now overrides
the default value set during initialization, if given.

This change is made to align the API of all exporters. In the metric
exporter this parameter already existed but was not used; this was
changed with the previous commit (2102e3b25b52).

It's desirable to have a common API across all exporters.

Since the previously existing timeout_millis argument wasn't used
anyway, it should not be an issue to change the default value here.

The None case will fall back to the object-level timeout, which is
the same as the previous default argument value.
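
A sketch of the described fallback, with an assumed export() signature: an explicitly passed timeout wins, None falls back to the value configured on the exporter.

```python
# Assumed signature for illustration; only the "None falls back to the
# instance default" behaviour described above is the point of this sketch.
from typing import Optional, Sequence


class _ExampleExporter:
    def __init__(self, timeout_millis: float = 10_000):
        self._timeout_millis = timeout_millis

    def export(self, items: Sequence[object], timeout_millis: Optional[float] = None) -> bool:
        # An explicit argument overrides the default set at initialization;
        # None keeps the previous behaviour of using the object-level timeout.
        effective_timeout = timeout_millis if timeout_millis is not None else self._timeout_millis
        print(f"exporting {len(items)} items with a {effective_timeout} ms budget")
        return True


exporter = _ExampleExporter(timeout_millis=5_000)
exporter.export(["span"])                      # uses the 5 000 ms default
exporter.export(["span"], timeout_millis=250)  # per-call override
```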
@LarsMichelsen
Author

To my understanding the following new public symbols are necessary:

- exporter/opentelemetry-exporter-otlp-proto-common/src/opentelemetry/exporter/otlp/proto/common/exporter.py
	ExportResultT
	ExportPayloadT
	RetryableExportError
	RetryingExporter
	acquire_timeout

If I understood CONTRIBUTING.md correctly, I cannot do anything further about it and now a maintainer / code owner needs to judge it.
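
For illustration, a context manager like the listed acquire_timeout could look roughly like this; the PR's actual signature and error handling may differ.

```python
# Hypothetical shape of an acquire_timeout helper; the PR's real signature and
# error handling may differ.
import threading
from contextlib import contextmanager


@contextmanager
def acquire_timeout(lock: threading.Lock, timeout_s: float):
    acquired = lock.acquire(timeout=timeout_s)
    try:
        # Callers can check the flag to decide whether they actually hold the lock.
        yield acquired
    finally:
        if acquired:
            lock.release()


export_lock = threading.Lock()
with acquire_timeout(export_lock, timeout_s=0.5) as got_lock:
    if got_lock:
        print("lock acquired within the timeout")
    else:
        print("timed out waiting for the export lock")
```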

@emdneto added the Approve Public API check label Sep 14, 2024
Labels: Approve Public API check
Successfully merging this pull request may close these issues: Timeout and retry of exporters is broken