Improve timeout and retry mechanic of exporters #4183

Open
wants to merge 17 commits into base: main
Conversation


@LarsMichelsen commented Sep 8, 2024

Description

I took a deep look into #3764 and tried to separate the individual changes. While I think the approach of that PR is good in general, I changed some aspects of it. For example, I don't agree with picking the lowest of the given timeouts (see here).

Even with the change split into multiple steps, it's still a big undertaking, so a reviewer will have to take some time to get into it. Nevertheless, I hope someone feels up to it.

Fixes #4043

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Several tests were added (retry logic, shutdown logic, new timeout export argument) and, of course, also executed :)

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@LarsMichelsen
Author

I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess there is some guidance needed on which names are acceptable as new public names and which are not. Any advice?

Regarding the changelog: Is it intended that I add the changelog entry?

@emdneto
Member

emdneto commented Sep 12, 2024

> I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess there is some guidance needed on which names are acceptable as new public names and which are not. Any advice?
>
> Regarding the changelog: Is it intended that I add the changelog entry?

In that case we need a changelog for this one

This is the first change in a chain of commits to rework the retry
mechanic. It is based on the work of open-telemetry#3764 and basically tries to land
the changes proposed by that monolithic commit step by step.

The plan is roughly to proceed in these steps:

* Extract retry mechanic from GRPC exporters
* Consolidate HTTP with GRPC exporter retry implementation
* Pipe timeout through RetryingExporter
* Make exporter lock protect the whole export instead of just a single iteration
* Make timeout float instead of int
* Add back-off with jitter

It's pretty likely that the plan will change along the way.
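
To make the first step more concrete, here is a minimal sketch, not the PR's actual code, of what a retry helper extracted from the gRPC exporters could look like. The class and exception names mirror the new public symbols mentioned later in this PR; the signatures, constants, and return type are assumptions.

```python
# Hypothetical sketch of the extracted retry helper; signatures and constants
# are assumptions, only the class and exception names come from the PR.
import time
from typing import Any, Callable


class RetryableExportError(Exception):
    """Raised by an export function to signal that the attempt may be retried."""


class RetryingExporter:
    def __init__(self, export_function: Callable[[Any], bool], max_attempts: int = 6):
        self._export_function = export_function
        self._max_attempts = max_attempts
        self._shutdown = False  # replaced by a threading.Event in a later step

    def shutdown(self) -> None:
        self._shutdown = True

    def export_with_retry(self, payload: Any) -> bool:
        for attempt in range(self._max_attempts):
            if self._shutdown:
                return False
            try:
                return self._export_function(payload)
            except RetryableExportError:
                # Plain exponential back-off; jitter and a real deadline are
                # added in later steps of the plan above.
                time.sleep(2**attempt)
        return False
```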
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.

This unifies the implementation of the OTLP exporters and the HTTP
metric exporter.

This unifies the implementation of the OTLP exporters and the HTTP
trace exporter.

Propagate the given timeout value from all exporters to the
RetryingExporter, which will hand it over to the individual
`_export` function call.

Currently the same unaltered value is handed over. In the following
commits the RetryingExporter will update the timeout handed over
to `_export` depending on the remaining time.
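
As an illustration of this intermediate state (a sketch with made-up names, not the actual diff): the exporter-level timeout is simply forwarded unchanged to every export call.

```python
# Hypothetical illustration of the intermediate state: the configured timeout
# reaches every _export call unaltered; shrinking it per attempt comes later.
from typing import Any, Callable


class _TimeoutForwardingExporter:
    def __init__(self, export_function: Callable[[Any, float], bool], timeout_s: float):
        self._export_function = export_function
        self._timeout_s = timeout_s

    def export_with_retry(self, payload: Any) -> bool:
        # Same unaltered value on every attempt for now.
        return self._export_function(payload, self._timeout_s)


def _export(payload: Any, timeout_s: float) -> bool:
    print(f"exporting {payload!r} with a {timeout_s}s timeout")
    return True


_TimeoutForwardingExporter(_export, timeout_s=10.0).export_with_retry({"spans": []})
```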
TL;DR: The time assertion in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the
mocked backoff was fixed at 4 seconds with 6 iterations, so the time until
shutdown returned was around 24 seconds. The default timeout of the lock
wait in shutdown was not hit before the exporter's retrying was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed
to be wrong from the beginning. It just ensured that the shutdown call
took at least 30ms. I assume it was supposed to do something else. But
to me it's not clear what exactly it should do.

One explanation would be that it was intended to be
`self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But
that would mean: shutdown should need at least 30 seconds to pass, which
does not make sense imho.

Another option would be that it was intended to do
`self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That
would mean: Shutdown should need at maximum 30 seconds to pass, which
actually could make more sense. It would ensure that the call of
`self.exporter.shutdown` returns after 30 seconds.
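
A hedged sketch of that second interpretation, with a stand-in exporter rather than the real OTLP exporter and units kept in plain seconds for simplicity: the test would assert an upper bound on how long shutdown is allowed to block.

```python
# Sketch of the "shutdown returns within 30 seconds" interpretation; the
# exporter here is a stand-in, units are seconds to keep the example simple.
import time
import unittest


class _FakeExporter:
    def shutdown(self, timeout_millis: float = 30_000) -> None:
        # Pretend shutdown returns quickly, well before the lock timeout.
        time.sleep(0.05)


class TestShutdownReturnsInTime(unittest.TestCase):
    def test_shutdown_returns_within_lock_timeout(self):
        exporter = _FakeExporter()
        start_time = time.monotonic()
        exporter.shutdown()
        now = time.monotonic()
        # Upper bound: shutdown must return before the 30 s lock timeout expires.
        self.assertLessEqual(now, start_time + 30)


if __name__ == "__main__":
    unittest.main()
```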

My biggest issue here is that it tests the happy path where the retry
is over before the timeout, while there is no mechanism to ensure that the
exporter thread is stopped within a sensible time in case its current
exponential backoff is already in a longer sleep when the shutdown flag
is set. If I understand it correctly, it can hit the beginning of a
32-second sleep, which could mean that a process needs 30 seconds to
signal the shutdown and an additional 32 seconds to finish the exporter
thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the
test case is basically the same as originally. The retry interval has
recently been changed to be much shorter. The default lock timeout stayed
the same at 30 seconds, but that does not matter too much since the
retry mechanism is finished long before that. And the aforementioned
assertion is still wrong, with an additional problem from open-telemetry#4014.

Previously the export lock was released by the export between retries,
which gave a shutdown the opportunity to interrupt the export depending
on which thread was first to get the lock again. This behavior is not
desirable, as it heavily relies on timing and makes things less
deterministic.

The retry loop still checks whether the shutdown flag is set and
breaks in case it was told to do so.

A few tests are still commented out and will be enabled later, once
the implementation is updated to implement the desired behavior.

The tests are, like other existing tests, timing-sensitive, which makes
them prone to failing on slow systems. It would certainly be better to get
rid of this issue. I haven't found another approach in the existing test
cases, so I stuck to the current pattern.

Replace the RetryingExporter shutdown flag and sleep with a
`threading.Event` to make the waiting time for the next retry
interruptible.

This enables us to change the semantics of the shutdown timeout
to work in a less surprising way. Previously the timeout worked
roughly like this:

1. Wait for the current export iteration to finish
2. Try to get the lock faster than the export thread
3. If the lock was acquired, the shutdown flag was set
4. If the lock was not acquired within the timeout, the shutdown flag
   was set anyway
5. The export thread would then pick this flag up after finishing the
   current retry cycle.

This led to hanging applications on shutdown, giving the applications
no way to influence the shutdown behavior. Users could not decide how
much time should be spent trying to prevent the loss of data.

This change allows shutdown to interrupt the current export's waiting call
so that the exporter can be shut down within a user-provided deadline.
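
A minimal sketch of the Event-based mechanism described here, not the PR's actual code: the class name, `shutdown(timeout_millis=...)` signature, and constants are assumptions.

```python
# Hypothetical sketch: an Event makes the inter-retry wait interruptible and
# lets shutdown() bound how long it blocks; names and signatures are assumed.
import threading
import time


class _InterruptibleRetrier:
    def __init__(self):
        self._shutdown = threading.Event()
        self._export_lock = threading.Lock()

    def export_with_retry(self, attempts: int = 6, backoff_s: float = 1.0) -> bool:
        with self._export_lock:
            for _ in range(attempts):
                if self._shutdown.is_set():
                    return False
                # ... attempt the export here ...
                # Event.wait() returns True as soon as shutdown() sets the event,
                # so a long back-off sleep no longer delays process shutdown.
                if self._shutdown.wait(backoff_s):
                    return False
            return False

    def shutdown(self, timeout_millis: float = 30_000) -> None:
        self._shutdown.set()  # interrupt any wait between retries immediately
        # Give an in-flight export at most the user-provided time to finish.
        deadline = time.monotonic() + timeout_millis / 1000
        acquired = self._export_lock.acquire(timeout=max(0.0, deadline - time.monotonic()))
        if acquired:
            self._export_lock.release()
```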

Derive a deadline from the initial timeout for each export_with_retry
call and try to stick to this time frame as closely as possible.

Each call to export_function will get the remaining time as its timeout
parameter and is expected to return quickly in case of an issue, so
that export_with_retry can take care of waiting and retrying.

This also removes the surprising deadline triggered by the max_value
of the exponential backoff.
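
A sketch of the deadline handling described here, under an assumed export_function signature: the remaining budget is recomputed before every attempt and passed down, so no back-off value can silently extend the overall export time.

```python
# Hypothetical deadline handling; the export_function signature is assumed.
import random
import time
from typing import Any, Callable


def export_with_retry(
    export_function: Callable[[Any, float], bool],
    payload: Any,
    timeout_s: float,
    max_attempts: int = 6,
) -> bool:
    deadline = time.monotonic() + timeout_s
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # overall budget exhausted
        # Each attempt only ever sees the time that is actually left.
        if export_function(payload, remaining):
            return True
        # Back-off with jitter, clamped so it cannot overrun the deadline.
        wait = min(2**attempt + random.uniform(0, 1), deadline - time.monotonic())
        if wait > 0:
            time.sleep(wait)
    return False
```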

This change allows the user to call retrying exports with individual
timeouts. This argument will be needed in the next step, where the gRPC
metric exporter's batch processing is changed to respect the timeout for
the whole export.

The timeout passed as an argument to the export method now overrides
the default value set during initialization, if given.

Additionally, also ensure that the batching in OTLPMetricExporter.export
respects the export timeout.
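
A rough, purely illustrative sketch of deadline-aware batching: the split helper and batch size are made up, the point is only that every batch call gets the shrinking remainder of one shared budget.

```python
# Illustrative only: _split_into_batches and the batch size are made up; the
# point is that every batch export gets the remaining part of one shared budget.
import time
from typing import Any, Iterator, List


def _split_into_batches(items: List[Any], size: int) -> Iterator[List[Any]]:
    for start in range(0, len(items), size):
        yield items[start : start + size]


def export_batched(export_one_batch, items: List[Any], timeout_s: float) -> bool:
    deadline = time.monotonic() + timeout_s
    for batch in _split_into_batches(items, size=512):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # the overall export timeout also bounds the batching loop
        if not export_one_batch(batch, timeout_s=remaining):
            return False
    return True
```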

The timeout passed as an argument to the export method now overrides
the default value set during initialization, if given.

This change is made to align the API of all exporters. In the metric
exporter this parameter already existed but was not used; this was
changed with the previous commit (2102e3b25b52).

It's desirable to have a common API across all exporters.

Since the previously existing timeout_millis argument wasn't used
anyway, it should not be an issue to change the default value here.

The None case will fall back to the object-level timeout, which is
the same as the previous default argument value.
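
A sketch of the described fallback, with an assumed export() signature: an explicitly passed timeout wins, None falls back to the value configured on the exporter.

```python
# Assumed signature for illustration; only the "None falls back to the
# instance default" behaviour described above is the point of this sketch.
from typing import Optional, Sequence


class _ExampleExporter:
    def __init__(self, timeout_millis: float = 10_000):
        self._timeout_millis = timeout_millis

    def export(self, items: Sequence[object], timeout_millis: Optional[float] = None) -> bool:
        # An explicit argument overrides the default set at initialization;
        # None keeps the previous behaviour of using the object-level timeout.
        effective_timeout = timeout_millis if timeout_millis is not None else self._timeout_millis
        print(f"exporting {len(items)} items with a {effective_timeout} ms budget")
        return True


exporter = _ExampleExporter(timeout_millis=5_000)
exporter.export(["span"])                      # uses the 5 000 ms default
exporter.export(["span"], timeout_millis=250)  # per-call override
```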
@LarsMichelsen
Author

To my understanding the following new public symbols are necessary:

- exporter/opentelemetry-exporter-otlp-proto-common/src/opentelemetry/exporter/otlp/proto/common/exporter.py
	ExportResultT
	ExportPayloadT
	RetryableExportError
	RetryingExporter
	acquire_timeout

If I understood CONTRIBUTING.md correctly, I cannot do anything further about it and now a maintainer / code owner needs to judge it.
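
For illustration, a context manager like the listed acquire_timeout could look roughly like this; the PR's actual signature and error handling may differ.

```python
# Hypothetical shape of an acquire_timeout helper; the PR's real signature and
# error handling may differ.
import threading
from contextlib import contextmanager


@contextmanager
def acquire_timeout(lock: threading.Lock, timeout_s: float):
    acquired = lock.acquire(timeout=timeout_s)
    try:
        # Callers can check the flag to decide whether they actually hold the lock.
        yield acquired
    finally:
        if acquired:
            lock.release()


export_lock = threading.Lock()
with acquire_timeout(export_lock, timeout_s=0.5) as got_lock:
    if got_lock:
        print("lock acquired within the timeout")
    else:
        print("timed out waiting for the export lock")
```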

@emdneto added the Approve Public API check label Sep 14, 2024
Labels: Approve Public API check
Successfully merging this pull request may close these issues: Timeout and retry of exporters is broken