Vector-agent hits OOM every hour #21655
Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help:
Thanks @jszwedko for the response. We don't have any limits applied, but the pod typically hits OOM at around 29GB. Yes
Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via the max_concurrency_limit setting.
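A minimal sketch of what that cap could look like on a sink; the sink and source names and the value 100 are placeholders, not from this thread:

```yaml
sinks:
  my_sink:                           # placeholder sink name
    type: datadog_metrics
    inputs: [my_source]              # placeholder input
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 100   # illustrative cap on the adaptive request concurrency
```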
That did not help. Did I get it right?
That looks right. Did you observe the concurrency limit go above the configured max?
It went past it and the memory growth showed the same behavior as before. @jszwedko Wait - looking at the breakdown per sink type, it looks like most of the concurrency_limit is coming from another sink which does not have the setting. Let me update the other sink and report back.
@jszwedko I set the Datadog sink at a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.
Hmm, can you share the config you are trying for the Datadog Logs sink?
Here
That looks right. Are you confident that that is the sink that is exceeding the limit? All of the others are respecting it?
Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢 I'm running this config:

```yaml
sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
      - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
      - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
```

For
Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple
That graph is simply
@jszwedko How long did you run your test? Was the limit stable at 6 for a long time?
I'm not sure you want to plot that. I only ran my test for maybe 10 minutes. I can try another run.
@jszwedko You mean
Hmm. What system are you sending these metrics into?
Nice, yeah, I see. My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of
Is there a more explicit metric on the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events
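For reference, a minimal sketch of exposing Vector's internal metrics (including buffer_events) so they can be graphed; the component names and the listen address are placeholders:

```yaml
sources:
  vector_internal:
    type: internal_metrics
sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs: [vector_internal]
    address: 0.0.0.0:9598   # placeholder address to be scraped by Prometheus
```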
Yeah, it could be, though your config is fairly simple and it'd be surprising if a leak there weren't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind, that could help narrow down where the memory is being used.
Vector doesn't even run for me under valgrind; it fails with a segfault.
@jszwedko So I trimmed down the number of pipelines and found that this issue surfaces when I enable pipelines with high-cardinality metrics and histograms. Is there a setting in Vector that would force-flush the metrics or rate-limit them?
We have it set for one sink but not the other.
What would be the default behaviour of the Datadog sink above in terms of flushing the buffers? Also, is there a way to block/rate-limit specific metrics?
Regarding throttling, we have https://vector.dev/docs/reference/configuration/transforms/throttle/ but that doesn't work with metrics yet. Also, I am not aware of anything that allows you to force-flush but you can set batch.max_events to something smaller like in #21655 (comment).
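A hedged sketch of setting batch.max_events to a smaller value so batches flush more frequently; the component names and the value are placeholders:

```yaml
sinks:
  my_metrics_sink:          # placeholder sink name
    type: datadog_metrics
    inputs: [my_source]     # placeholder input
    batch:
      max_events: 100       # smaller batches flush sooner; value is illustrative
```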
There's no hard limit for this. It all depends on how fast you produce/consume and on your buffer strategy. If you are dropping events when the buffer is full, it shouldn't cause OOM.
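A sketch of a bounded memory buffer that drops events instead of blocking when full, along the lines of the comment above; names and sizes are placeholders:

```yaml
sinks:
  my_sink:                     # placeholder sink name
    type: datadog_metrics
    inputs: [my_source]        # placeholder input
    buffer:
      type: memory
      max_events: 10000        # illustrative bound on buffered events
      when_full: drop_newest   # shed load instead of letting it pile up upstream
```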
I got a heap profile and this is how a diff looks
Can you share the slimmed-down configuration that still reproduces the issue? That would help narrow the search space.
@st-omarkhalid do you know if the metrics are "churning"? That is: say we have 4M metrics at time t1; at time t2 could we have a different set of metrics even if the cardinality is still a total of 4M?
@jszwedko Looks like there's a correlation with receive errors. I noticed some timeouts in the logs, so I adjusted the scrape timeout, which resolved not only the errors but also the memory growth.
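The exact source type isn't shown here; if it is a prometheus_scrape source, the adjustment would look roughly like this (endpoint and values are placeholders):

```yaml
sources:
  app_metrics:                         # placeholder source name
    type: prometheus_scrape
    endpoints:
      - http://localhost:9090/metrics  # placeholder endpoint
    scrape_interval_secs: 30           # illustrative interval
    scrape_timeout_secs: 20            # raised timeout so slow scrapes don't error; illustrative value
```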
Another related issue: with the following vector sink, the metrics looked spotty in our dashboards. I noticed that the disk buffer would fill up within 30 minutes, so maybe due to
I tried the following config instead
thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again led to the memory growth.
Hmm, that is interesting. It sounds like some requests may have been stacking up in the
I think what is happening here is that the sink cannot send the data fast enough, so it piles up first in the buffers and then in the scraping source, which I believe will continue scraping on the interval even if it can't flush the data downstream (this is probably suboptimal; I think the source should stop scraping if the previous request hasn't been flushed downstream yet). To fix this, I think you'll need to adjust the sink to meet the throughput requirements of your source. Some ideas to do this:
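One possible adjustment along those lines, sketched with placeholder names and illustrative values (larger batches, more request concurrency, and a buffer that sheds load when full):

```yaml
sinks:
  my_sink:                         # placeholder sink name
    type: datadog_metrics
    inputs: [my_source]            # placeholder input
    batch:
      max_events: 1000             # larger batches mean fewer, bigger requests
    request:
      adaptive_concurrency:
        initial_concurrency: 4
        max_concurrency_limit: 50  # allow more in-flight requests; illustrative
    buffer:
      type: memory
      max_events: 10000            # bounded buffer
      when_full: drop_newest       # drop rather than backpressure the scraping source
```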
@jszwedko You're right that the egress volume not keeping up with the ingress causes the agent to hit OOM. However for I think there are two bugs here:
How could we track these two issues? Otherwise this thread could be closed.
Hi @st-omarkhalid, OOM issues have historically been quite tricky to debug since we don't have identical environments to reproduce them and there are so many variables involved. I really appreciate the details you included in this issue. It would be beneficial to isolate concerns: if you don't mind, could you create a new bug issue for each bug and summarize the situation there? For example, I am not sure which config we are talking about for bug (2).
A note for the community
Problem
The Vector agent in our deployment shows constant memory growth until the pod hits OOM. This is happening continuously. I looked at a number of other issues already open about the same problem, but it's not clear how to resolve it. In prod we have many more pipelines than shown below.
The metric vector_component_allocated_bytes shows that remap-* components have the most memory allocated and that it is constantly growing.
Configuration
Version
0.35.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response