[receiver/otlpreceiver] Support Rate Limiting #6725
Comments
Can you clarify that this is not only for OTLP but would be applicable to any pipeline with an exporter able to return this error?

Yeah, that's a good point. The collector could have some general […]

#9357 does not fully implement the OTel specification for OTLP HTTP throttling: there does not exist a way to set the `Retry-After` header.

@TylerHelmuth can you take a look?
Reading through the issue again, I agree that we could introduce more to the collector to allow components to explicitly define how they want clients to retry in known, controlled scenarios. For that use case, this issue is not complete yet.

@TylerHelmuth Thank you for your response. To support your analysis with personal experience: I have a custom processor that limits the number of unique trace IDs per service per minute. In this case, it is possible to determine the appropriate duration after which the service should be permitted to resubmit its trace data. Allowing an OTLP client to use exponential backoff is sufficient but not optimal. Optimal here means a solution that, within system limits, minimizes both the delay between when an operation of a service creates a span or trace and when that span or trace is available to be queried from a backend storage system, and the resources (CPU, memory, network, I/O) required to report the span or trace from the originating service to that backend. However, in most cases the benefit from this optimization will be small, if not negligible.
@blakeroberts-wk Can you open-source this custom processor that limits the number of unique trace IDs per service per minute?
Is your feature request related to a problem? Please describe.
The OpenTelemetry Specification outlines throttling for both gRPC and HTTP; however, the OTLP receiver does not currently support this (optional) specification.
Right now, if a processor is under pressure, the only option it has is to return an error informing the receiver to tell the client that the request failed and is not retryable.
Describe the solution you'd like
It would be neat if the receiver offered an implementation of `error` that could be returned to it to signal that it should send an appropriately formatted response to the client indicating that the request was rate limited. The format of the response should follow the semantic conventions (e.g., the HTTP receiver should return a status code of `429` and set the `Retry-After` header).

For example, the OTLP receiver could export the following `error`
implementation:

Any processor or exporter in the pipeline could return (optionally wrapping) this error:
Then when handling errors from the pipeline, the receiver could check for this error:
Describe alternatives you've considered
To accomplish rate limiting, a fork of the OTLP receiver will be used. Here are the changes: main...blakeroberts-wk:opentelemetry-collector:otlpreceiver-rate-limiting.
Additional context
The above example changes include the addition of an internal histogram metric that records server latency (`http.server.duration` or `rpc.server.duration`) to allow monitoring of the collector's latency, throughput, and error rate. This portion of the changes is not necessary to support rate limiting.

There exists an open issue regarding rate limiting (#3509); however, the suggested approach seems to involve Redis, which goes beyond what I believe is necessary for the OTLP receiver to support rate limiting.