
[receiver] Create helper for server-like receivers to replace memory_limiter processor #8632

Open
mx-psi opened this issue Oct 6, 2023 · 5 comments


@mx-psi
Member

mx-psi commented Oct 6, 2023

The memory_limiter processor allows end-users to drop telemetry data when a memory limit is reached. It is up to receivers to apply backpressure and handle retrying of errors.

A large class of receivers spins up one or more servers that listen for telemetry in a particular format, map it to pdata, and send it along to the next consumer. For protocols like HTTP and gRPC there are standard ways (429, RESOURCE_EXHAUSTED) to report that memory has been exhausted and to apply backpressure to clients.

We should build a server receiver helper that handles this. In particular it should:

  • Provide an easy way to build receivers based on one or more servers whose handlers are wrapped to enforce memory limits
  • Support at least HTTP and gRPC
  • Be extensible to add other capabilities to the helper in the future (e.g. we could tackle #4951, “add the option to enable server reflection on otlp grpc receiver”, generally for all gRPC-based receivers?)
  • Be extensible to support fairness across multiple clients reporting to the same receiver endpoint when it comes to throttling them
  • Be configurable in the same manner as the memory limiter currently is

Having this helper available, and tackling this for scraper helpers, may remove the need for a memory_limiter processor.
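
As a rough illustration of the gRPC half of such a helper, here is a minimal sketch. The `MemoryChecker` interface and its `MustRefuse` method are hypothetical names for whatever the helper would expose; they are not an existing collector API.

```go
package serverhelper

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// MemoryChecker is a hypothetical interface the helper could expose. An
// implementation would be driven by the same soft/hard limits the
// memory_limiter processor is configured with today.
type MemoryChecker interface {
	// MustRefuse reports whether new data should be rejected right now.
	MustRefuse() bool
}

// UnaryMemoryLimitInterceptor rejects requests with RESOURCE_EXHAUSTED before
// the handler runs, i.e. before the payload is converted to pdata.
func UnaryMemoryLimitInterceptor(mc MemoryChecker) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if mc.MustRefuse() {
			return nil, status.Error(codes.ResourceExhausted, "memory limit exceeded, please retry later")
		}
		return handler(ctx, req)
	}
}
```

A receiver built on the helper would register this via `grpc.NewServer(grpc.UnaryInterceptor(...))`; a matching stream interceptor and an HTTP counterpart would cover the rest of the requirement list above.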

@splunkericl
Contributor

Hey @dmitryax and @mx-psi, we are helping out with the HTTP use cases and want to confirm a few things:

The suggestion was to build a receiver helper method that wraps log consumption with the memory limiter. But this means memory limiting is handled after the HTTP connection is created and the data is received on the server, which still poses the problem of too many resources being used.

Instead, there should be some kind of HTTP interceptor (similar to authInterceptor) with the memory limiter baked in, so users can just plug it in when they spawn the HTTP server. Is my understanding correct?
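
A rough sketch of that idea, reusing the hypothetical `MemoryChecker` interface from the sketch above (none of this is an existing collector API): the middleware refuses the request before its body is read or decompressed.

```go
package serverhelper

import "net/http"

// MemoryLimitMiddleware wraps an http.Handler and rejects requests with 429
// while memory is constrained, before the request body is read or decompressed.
func MemoryLimitMiddleware(mc MemoryChecker, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if mc.MustRefuse() {
			// Hint to well-behaved clients when to retry; the value is illustrative.
			w.Header().Set("Retry-After", "30")
			http.Error(w, "memory limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```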

@mx-psi
Member Author

mx-psi commented Nov 20, 2023

An interceptor sounds like a good approach. We should keep the requirements in mind (I wrote a list in the original post; we can discuss them if they are too stringent, but they seem like a reasonable list for starting to think about this).

@bogdandrutu
Member

@mx-psi I am not sure this is the right approach. Here are some use cases that would be good for the new design (if any) to cover:

  • If the collector is a "sidecar" that only converts from the OTel format to Datadog (for example) and does not have any "buffering", and the DD exporter reports RESOURCE_EXHAUSTED, we should probably send that back to the client.
  • For other signals like "Retry-After": if an exporter gets one, it would be good to back-propagate it.
  • Other components (exporter, processor, connector) may return errors that the receiver may want to handle specially (RESOURCE_EXHAUSTED, Retry-After for back-pressure, PERMISSION_DENIED, etc.).

I personally think we should adopt the gRPC status codes (which inside Google are actually called Task::Status, because they are generic enough for almost all use cases, including disk operations, RPCs, etc.) and return a Status-like error from our Consumer funcs. Then the receiver can read these "canonical" codes and do what is appropriate for them.

I know that in some scenarios we may be able to reduce the pressure earlier (e.g. for memory, an http.Handler may reject the request before parsing/uncompressing, etc.), but I see this as an optimization on top of the more generic problem we have right now, which is that we don't have a set of canonical codes/errors returned from pipeline components that lets us understand and react upstream.
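
If Consumer funcs returned such Status-like errors, a gRPC-based receiver could translate them mechanically. A sketch of what that could look like, under the assumption that consumer errors carry (or wrap) canonical codes, which they do not today:

```go
package examplereceiver

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// consumeAndTranslate pushes logs to the next consumer and converts any error
// into a canonical gRPC status the client can act on. The mapping is a sketch:
// it assumes the error carries a status code, which is not the case today.
func consumeAndTranslate(ctx context.Context, next consumer.Logs, ld plog.Logs) error {
	err := next.ConsumeLogs(ctx, ld)
	if err == nil {
		return nil
	}
	if s, ok := status.FromError(err); ok {
		// The error already carries a canonical code (RESOURCE_EXHAUSTED,
		// PERMISSION_DENIED, ...); return it to the client as-is.
		return s.Err()
	}
	// No canonical code available: fall back to a generic retriable error.
	return status.Error(codes.Unavailable, err.Error())
}
```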

@dmitryax
Member

dmitryax commented Jan 4, 2024

To clarify the reported issue, right now, the memory_limiter processor doesn't drop any data; it just rejects it, and receivers decide what to do with that. Push-based gRPC/HTTP receivers should propagate the error to the caller as a retriable error.

I like the idea of standardizing on the gRPC codes in the collector, as @bogdandrutu suggested. However, I think this can probably be scoped as a separate effort. If we define that using memory_limiter on the receiver side should return 429 and RESOURCE_EXHAUSTED when the memory limit is reached, it won't affect the adoption of gRPC status codes from the caller's perspective, right?

I think the problem that this issue should solve is that the memory_limiter as a processor is not effective at preventing the collector from running out of memory for non-OTLP/gRPC receivers. Even if the memory_limiter processor reaches the threshold, receivers still accept requests, which can take a significant amount of memory before the memory_limiter can reject the pdata. Moving it to the receiver side introduces better protection from OOMs.
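
For a push-based HTTP receiver, that error propagation could look roughly like this; the helper name and status-code choices are illustrative, but `consumererror.IsPermanent` is the existing way to distinguish permanent from retriable errors:

```go
package examplereceiver

import (
	"net/http"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// writeConsumeError maps an error returned by the pipeline to an HTTP response.
// Non-permanent errors (such as a memory_limiter rejection) are reported as
// retriable; the exact codes chosen here are illustrative.
func writeConsumeError(w http.ResponseWriter, err error) {
	if consumererror.IsPermanent(err) {
		// Retrying will not help (e.g. the payload was rejected downstream).
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Transient condition such as memory pressure: ask the client to retry.
	w.Header().Set("Retry-After", "30")
	http.Error(w, err.Error(), http.StatusTooManyRequests)
}
```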

dmitryax added a commit that referenced this issue Jan 23, 2024
…orylimiter processor (#8964)

**Description:**

Following
#8632,
this change introduces the memory limiter as an extension. This allows
us to place the component where it can reject incoming connections due to limited
memory, providing better protection from running out of memory.

Missing feature: receiver fairness. An issue where one receiver hogs all the
resources can still happen.

**Link to tracking Issue:** #8632

---------

Co-authored-by: Dmitry Anoshin <[email protected]>
@mx-psi
Member Author

mx-psi commented Sep 4, 2024

I am going to remove this from the Collector 1.0 project. This is a useful enhancement, but I feel it is not in scope: because it is an enhancement, it can be done without breaking changes, and we already have a (more limited) way of handling this.

@mx-psi mx-psi removed the release:required-for-ga and needs-review-from-stability-wg labels Sep 4, 2024
@mx-psi mx-psi removed this from the go.opentelemetry.io/receiver 1.0 milestone Sep 4, 2024