-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243
Comments
Described and tracked in open-telemetry/opentelemetry-collector#9243. The otelarrow export component copied this code from the core OTLP exporter, so it has the same bug.
Linking this previous issue here: #6686 @evan-bradley do you recall why the choice was made to return a permanent error on partial success in the PRs to address that issue? I remember some discussions around it, but can't recall the decision |
The decision to return a permanent error was made because it will stop retries for the data and because the consumer APIs don't currently have any way to support reporting partial success counts. I think it would also be okay with what currently exists to return success and simply log the details of the partial success message. The approach here may help communicate this information to the caller: #9041. We could debate whether or not it should be an error type, but it would at least make the caller aware of the existence of a partial success. |
Vector has an elaborate approach of communicating status of payloads in the pipeline back to the callers - end-to-end acknowledgements. |
Changes the treatment of [PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success), making them successful and logging a warning instead of returning an error to the caller. These responses are meant to convey successful receipt of valid data which could not be accepted for other reasons, specifically to cover situations where the OpenTelemetry SDK and Collector have done nothing wrong, specifically to avoid retries. While the existing OTLP exporter returns a permanent error (also avoids retries), it makes the situation look like a total failure when in fact it is more nuanced. As discussed in the tracking issue, it is a lot of work to propagate these "partial" successes backwards in a pipeline, so the appropriate simple way to handle these items is to return success. In this PR, we log a warning. In a future PR, (IMO) as discussed in open-telemetry/oteps#238, we should count the spans/metrics/logs that are rejected in this way using a dedicated outcome label. **Link to tracking Issue:** Part of #9243 **Testing:** Tests for the "partial success" warning have been added. **Documentation:** PartialSuccess behavior was not documented. Given the level of detail in the README, it feels appropriate to continue not documenting, otherwise lots of new details should be added. --------- Co-authored-by: Alex Boten <[email protected]>
Describe the bug
In my opinion, the Collector's OTLP exporter handles PartialSuccess incorrectly. The present handling of this field is to return an error to the caller, when in fact the opposite behavior was originally intended in the specification. The meaning of PartialSuccess starts with a successful request. The consumer of the data has acknowledged successful receipt of valid data; the delivery pipeline is operational, but something is maybe ill-formed about the data. There is not an error code. This is a qualified success.
As described in open-telemetry/oteps#238, there is an opportunity for Collectors to count the points rejected, as described in PartialSuccess messages, via a new
otel.outcome
label, which we might call "rejected". Rejected points are those for which errors must be suppressed because some of the data was correct and all of the data was valid. It would be appropriate to return success and count rejected points, which are those in a category different than dropped which cannot be retried and were part of a successful export request.It would also be appropriate -- and that the second half of this issue -- for Collector pipelines to report this kind of qualified success in such a way that OTel SDKs receive true information about PartialSuccess when they pass OTLP through a Collector pipeline. In other words, I would be happy if we built support to propagate PartialSuccess all the way (backwards) through the pipeline. Also, components should be able to introduce partial success responses themselves, as discussed in #7669.
As an example, consider a batch processor component forked from the core
batchprocessor
, here: https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessorThis component has been extended to block the caller until their own timeout takes place, or until the final outcome of the batch is known. After the batch export response is known, there are two cases:
This is a model case for the use of PartialSuccess -- there was some successful data which we have to acknowledge with a success, and there was some failure which we cannot allow to look like a transient error. It is incorrect to view it as a permanent error. This batch processor could use PartialSuccess to express back to the client exactly how many of its points were accepted.
We must recognize that this information can't always be perfect. If an item is included in a batch and then part of that batch has PartialSuccess w/
rejected_points
!= 0, then it will not be possible to assign an integer count correctly to the corresponding inputs. See open-telemetry/oteps#238 for more details.Steps to reproduce
Return data with PartialSuccess to a collector, for example, where one metric point out of 1000 has a problem with validation. This is the case PartialSuccess was meant to address. The collector returns an error when 999 points were correct.
What did you expect to see?
The response should be a success. The pipeline metrics should count 1
rejected
items.What did you see instead?
I see this code trigger:
with
(1 rejected)
.What version did you use?
0.91.1
What config did you use?
An
otlp/metrics
pipeline withwith a vendor that performs metric point validation.
The text was updated successfully, but these errors were encountered: