OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

jmacd · 2024-01-08T23:35:26Z

Describe the bug
In my opinion, the Collector's OTLP exporter handles PartialSuccess incorrectly. The present handling of this field is to return an error to the caller, when in fact the opposite behavior was originally intended in the specification. The meaning of PartialSuccess starts with a successful request. The consumer of the data has acknowledged successful receipt of valid data; the delivery pipeline is operational, but something is maybe ill-formed about the data. There is not an error code. This is a qualified success.

As described in open-telemetry/oteps#238, there is an opportunity for Collectors to count the points rejected, as described in PartialSuccess messages, via a new otel.outcome label, which we might call "rejected". Rejected points are those for which errors must be suppressed because some of the data was correct and all of the data was valid. It would be appropriate to return success and count rejected points, which are those in a category different than dropped which cannot be retried and were part of a successful export request.

It would also be appropriate -- and that the second half of this issue -- for Collector pipelines to report this kind of qualified success in such a way that OTel SDKs receive true information about PartialSuccess when they pass OTLP through a Collector pipeline. In other words, I would be happy if we built support to propagate PartialSuccess all the way (backwards) through the pipeline. Also, components should be able to introduce partial success responses themselves, as discussed in #7669.

As an example, consider a batch processor component forked from the core batchprocessor, here: https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor

This component has been extended to block the caller until their own timeout takes place, or until the final outcome of the batch is known. After the batch export response is known, there are two cases:

All of the items were included in the same export, in which case we have all or none failure.
Some of the items were successful (batches were successful), some were not (batches failed)

This is a model case for the use of PartialSuccess -- there was some successful data which we have to acknowledge with a success, and there was some failure which we cannot allow to look like a transient error. It is incorrect to view it as a permanent error. This batch processor could use PartialSuccess to express back to the client exactly how many of its points were accepted.

We must recognize that this information can't always be perfect. If an item is included in a batch and then part of that batch has PartialSuccess w/ rejected_points != 0, then it will not be possible to assign an integer count correctly to the corresponding inputs. See open-telemetry/oteps#238 for more details.

Steps to reproduce
Return data with PartialSuccess to a collector, for example, where one metric point out of 1000 has a problem with validation. This is the case PartialSuccess was meant to address. The collector returns an error when 999 points were correct.

What did you expect to see?
The response should be a success. The pipeline metrics should count 1 rejected items.

What did you see instead?
I see this code trigger:

	if !(partialSuccess.ErrorMessage() == "" && partialSuccess.RejectedSpans() == 0) {
		return consumererror.NewPermanent(fmt.Errorf("OTLP partial success: \"%s\" (%d rejected)", resp.PartialSuccess().ErrorMessage(), resp.PartialSuccess().RejectedSpans()))
	}

with (1 rejected).

What version did you use?
0.91.1

What config did you use?
An otlp/metrics pipeline with

      otelarrow/metrics:
        endpoint: vendorname.com:443

with a vendor that performs metric point validation.

The text was updated successfully, but these errors were encountered:

Described and tracked in open-telemetry/opentelemetry-collector#9243. The otelarrow export component copied this code from the core OTLP exporter, so it has the same bug.

codeboten · 2024-01-09T19:59:58Z

Linking this previous issue here: #6686

@evan-bradley do you recall why the choice was made to return a permanent error on partial success in the PRs to address that issue? I remember some discussions around it, but can't recall the decision

evan-bradley · 2024-01-10T17:56:05Z

The decision to return a permanent error was made because it will stop retries for the data and because the consumer APIs don't currently have any way to support reporting partial success counts. I think it would also be okay with what currently exists to return success and simply log the details of the partial success message.

The approach here may help communicate this information to the caller: #9041. We could debate whether or not it should be an error type, but it would at least make the caller aware of the existence of a partial success.

yurishkuro · 2024-01-10T21:12:50Z

Vector has an elaborate approach of communicating status of payloads in the pipeline back to the callers - end-to-end acknowledgements.

Changes the treatment of [PartialSuccess](https://opentelemetry.io/docs/specs/otlp/#partial-success), making them successful and logging a warning instead of returning an error to the caller. These responses are meant to convey successful receipt of valid data which could not be accepted for other reasons, specifically to cover situations where the OpenTelemetry SDK and Collector have done nothing wrong, specifically to avoid retries. While the existing OTLP exporter returns a permanent error (also avoids retries), it makes the situation look like a total failure when in fact it is more nuanced. As discussed in the tracking issue, it is a lot of work to propagate these "partial" successes backwards in a pipeline, so the appropriate simple way to handle these items is to return success. In this PR, we log a warning. In a future PR, (IMO) as discussed in open-telemetry/oteps#238, we should count the spans/metrics/logs that are rejected in this way using a dedicated outcome label. **Link to tracking Issue:** Part of #9243 **Testing:** Tests for the "partial success" warning have been added. **Documentation:** PartialSuccess behavior was not documented. Given the level of detail in the README, it feels appropriate to continue not documenting, otherwise lots of new details should be added. --------- Co-authored-by: Alex Boten <[email protected]>

jmacd added the bug Something isn't working label Jan 8, 2024

jmacd mentioned this issue Jan 9, 2024

Return success on PartialSuccess open-telemetry/otel-arrow#130

Merged

jmacd mentioned this issue Jan 10, 2024

Treat PartialSuccess as Success #9260

Merged

jmacd changed the title ~~OTLP PartialSuccess responses should not be interpreted as errors~~ OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics Jan 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

jmacd commented Jan 8, 2024

codeboten commented Jan 9, 2024

evan-bradley commented Jan 10, 2024

yurishkuro commented Jan 10, 2024

OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

OTLP PartialSuccess responses should not be interpreted as errors, items should count as "rejected" in pipeline metrics #9243

Comments

jmacd commented Jan 8, 2024

codeboten commented Jan 9, 2024

evan-bradley commented Jan 10, 2024

yurishkuro commented Jan 10, 2024