
Vector-agent hits OOM every hour #21655

Open
st-omarkhalid opened this issue Oct 30, 2024 · 38 comments
Labels
complexity: high Sufficiently complex issue that requires proper planning and time investment to solve. type: bug A code related bug.

Comments

@st-omarkhalid

st-omarkhalid commented Oct 30, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector-agent in our deployment shows constant memory growth until the pod hits OOM. This happens continuously. I looked at a number of open issues on the same problem, but it's not clear how to resolve it. In production we have many more pipelines than shown below.

The vector_component_allocated_bytes metric shows that the remap-* components have the most memory allocated, and that allocation is constantly growing.

Configuration

data_dir: /var/lib/vector
expire_metrics_secs: 600
api:
  enabled: true
  address: 0.0.0.0:8686
  playground: true
sources:
  host_metric_source:
    type: host_metrics
    scrape_interval_secs: 60
  internal_metric_source:
    type: internal_metrics
    scrape_interval_secs: 60
  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
sinks:
  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
    - internal_metric_source
    - host_metric_source
    address: 0.0.0.0:9598
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true
    headers:
      auth-version: "1"
transforms:
  remap_node_ip-node1:
    type: remap
    inputs:
    - ip-node1
    source: |-
      .tags.nodename = "ip-node1"
      .tags.service_name = "nodeExporter"
  tiering:
    type: route
    inputs:
    - remap_*
    route:
      tier1: .name == .name
    reroute_unmatched: false

Version

0.35.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@st-omarkhalid st-omarkhalid added the type: bug A code related bug. label Oct 30, 2024
@st-omarkhalid st-omarkhalid changed the title Vector-agent hit OOM every hour Vector-agent hits OOM every hour Oct 30, 2024
@jszwedko
Member

Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help:

  • What memory limits do you have applied?
  • If you increase them, does the usage level out, or do you see it grow consistently regardless of the limit (indicating some sort of leak)?
  • Is it correlated with the adaptive_concurrency_limit metric? I'm wondering if an increasing number of concurrent requests is causing memory pressure.

@st-omarkhalid
Author

Thanks @jszwedko for the response. We don't have any limits applied, but the pod typically hits OOM around 29 GB. Yes, vector_adaptive_concurrency_limit_count does correlate with the memory growth.
Screenshot 2024-10-30 at 9 12 11 AM

@jszwedko
Member

Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via request.adaptive_concurrency.max_concurrency_limit (e.g. https://vector.dev/docs/reference/configuration/sinks/vector/#request.adaptive_concurrency.max_concurrency_limit).

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

@jszwedko
Member

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

That looks right. Did you observe adaptive_concurrency_limit not exceeding this limit?

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

It went past it and the memory growth had the same behavior as before.
Screenshot 2024-10-30 at 12 58 13 PM

@jszwedko Wait, looking at the breakdown per sink type, it looks like most of the concurrency_limit is coming from another sink, which does not have the setting. Let me update the other sink and report back.

@st-omarkhalid
Author

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

@jszwedko
Member

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

Hmm, can you share the config you are trying for the Datadog metrics sink?

@st-omarkhalid
Author

Here

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    default_api_key: <key>
    site: datadoghq.com
    request:
      adaptive_concurrency:
        max_concurrency_limit: 100

@jszwedko
Member

That looks right. Are you confident that that is the sink that is exceeding the limit? All of the others are respecting it?

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

Here: the vector sink gets to 55-65 and the dd-metrics sink gets to 1k before the pod hits OOM.
Screenshot 2024-10-30 at 3 10 35 PM

@jszwedko
Member

Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢

I'm running this config:

sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
    - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
    - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5

For sink0 (fed by source0) I'm limiting batch.max_events to 1 to cause it to make more requests than it otherwise would. I see the max for sink0 bubbling around 6. If I remove the limit it goes up to 50. I tried with both Vector v0.35.0 and v0.42.0.

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

For a moment I thought the limits had improved, but that's still not the case. There were a couple of evicted pods which had lower limits, but otherwise the limits are way over 100 for the dd sink.
Screenshot 2024-10-30 at 5 44 08 PM
For context we have over 300 sources (each with its own transform) in the config.

@jszwedko
Member

Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple datadog_metrics sinks in the same config?

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup. The sawtooth pattern above is because the pod is getting OOM'd and replaced by a new one. The config has 3 sinks as pointed out in the issue description.

@jszwedko How long did you run your test? Was the limit stable at 6 for a long time?

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

I looked at another instance of the vector-agent in a different cluster (where, again, the limit is not applied). The limit reached 15K without hitting OOM; the number of sources in that cluster is a lot lower, but it still has 3 sinks.

On the problematic vector-agent instance I see large memory allocations across the remap components (vector_component_allocated_bytes).
Screenshot 2024-10-31 at 6 00 38 PM

@jszwedko
Member

jszwedko commented Oct 31, 2024

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup.

I'm not sure you want to plot _count; that would be the count of data points in the histogram. Could you plot the max instead?

I only ran my test for maybe 10 minutes. I can try another run.

@pront pront added the complexity: high Sufficiently complex issue that requires proper planning and time investment to solve. label Oct 31, 2024
@st-omarkhalid
Author

@jszwedko You mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it.
Screenshot 2024-10-31 at 9 50 24 AM

@jszwedko
Member

@jszwedko You mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it. Screenshot 2024-10-31 at 9 50 24 AM

Hmm. What system are you sending these metrics into? vector_adaptive_concurrency_limit is a histogram so you should be able to do things like taking the max, min, and average.

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?
Screenshot 2024-10-31 at 10 40 14 AM

@jszwedko
Member

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?

Nice, yeah, I see histogram_quantile(1, ...) gives you the max value. There we can see it does seem to be staying below or around the limit of 100.

My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of vector_component_utilization and buffer_send_duration_seconds per component?

@st-omarkhalid
Author

Utilization has an interesting pattern: it spikes in the remap component and then settles down.
Screenshot 2024-10-31 at 11 55 45 AM

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

Screenshot 2024-10-31 at 12 28 02 PM
sum(histogram_quantile(0.99, increase(vector_buffer_send_duration_seconds_bucket[10m]))) by (component_type)

Is there a more explicit metric on the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events
Screenshot 2024-10-31 at 12 36 24 PM

@st-omarkhalid
Author

The buffer size is pretty stable as well.
Screenshot 2024-11-01 at 9 53 19 AM

I also checked the ingress/egress difference and that is consistent as well.
Screenshot 2024-11-01 at 9 50 25 AM

This does appear to be a memory leak issue.

@jszwedko
Member

jszwedko commented Nov 1, 2024

Yeah, it could be, though your config is fairly simple and it'd be surprising if a leak there wasn't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind, that could be helpful to narrow down where the memory is being used.

@st-omarkhalid
Author

st-omarkhalid commented Nov 3, 2024

Vector doesn't even run for me under valgrind; it fails with a segfault.

==20819== Memcheck, a memory error detector
==20819== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==20819== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==20819== Command: vector/target/release/vector
==20819==
==20819== Invalid read of size 4
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20819==
==20819==
==20819== Process terminating with default action of signal 11 (SIGSEGV)
==20819==  Access not within mapped region at address 0x64
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  If you believe this happened as a result of a stack
==20819==  overflow in your program's main thread (unlikely but
==20819==  possible), you can try to increase the size of the
==20819==  main thread stack using the --main-stacksize= flag.
==20819==  The main thread stack size used in this run was 108392448.
==20819==
==20819== HEAP SUMMARY:
==20819==     in use at exit: 172,057 bytes in 471 blocks
==20819==   total heap usage: 478 allocs, 7 frees, 174,129 bytes allocated
==20819==
==20819== LEAK SUMMARY:
==20819==    definitely lost: 0 bytes in 0 blocks
==20819==    indirectly lost: 0 bytes in 0 blocks
==20819==      possibly lost: 0 bytes in 0 blocks
==20819==    still reachable: 172,057 bytes in 471 blocks
==20819==         suppressed: 0 bytes in 0 blocks
==20819== Reachable blocks (those to which a pointer was found) are not shown.
==20819== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==20819==
==20819== For lists of detected and suppressed errors, rerun with: -s
==20819== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

@st-omarkhalid
Author

st-omarkhalid commented Nov 7, 2024

@jszwedko So I trimmed down the number of pipelines and found that this issue surfaces when I enable pipelines with high-cardinality metrics and histograms. Is there a setting in Vector that would force-flush the metrics or rate-limit them?

Batches are flushed when 1 of 2 conditions are met:

The batch age meets or exceeds the configured timeout_secs.
The batch size meets or exceeds the configured max_bytes or max_events.

We have it set for one sink but not the other.

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

What would be the default behaviour for dd sink above in terms of flushing the buffers? Also, is there a way to block/rate-limit specific metrics?

@st-omarkhalid
Author

st-omarkhalid commented Nov 9, 2024

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

@pront
Contributor

pront commented Nov 11, 2024

Regarding throttling, we have https://vector.dev/docs/reference/configuration/transforms/throttle/ but that doesn't work with metrics yet.

Also, I am not aware of anything that allows you to force-flush, but you can set batch.max_events to something smaller, as in #21655 (comment).

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

There's no hard limit for this. It all depends on how fast you produce/consume and on your buffer strategy. If you are dropping events when the buffer is full, it shouldn't cause OOM.
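
Putting those two points together, here is a minimal sketch of what a smaller batch.max_events plus a bounded, dropping buffer could look like on the Datadog sink. The values are assumptions chosen purely for illustration, not tested recommendations:

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
    batch:
      max_events: 1000        # flush smaller batches more often (illustrative value)
      timeout_secs: 5
    buffer:
      type: memory
      max_events: 10000       # bound how much can queue in memory (illustrative value)
      when_full: drop_newest  # shed load instead of accumulating it when the sink falls behind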

@st-omarkhalid
Author

st-omarkhalid commented Nov 13, 2024

I got a heap profile, and this is how a diff looks:

Showing nodes accounting for 13.73GB, 94.84% of 14.47GB total
Dropped 190 nodes (cum <= 0.07GB)
      flat  flat%   sum%        cum   cum%
   13.73GB 94.84% 94.84%    13.73GB 94.84%  prof_backtrace_impl
         0     0% 94.84%     0.27GB  1.90%  <(FnA,FnB,FnC) as nom::sequence::Tuple<Input,(A,B,C),Error>>::parse
         0     0% 94.84%     0.22GB  1.52%  <F as nom::internal::Parser<I,O,E>>::parse
         0     0% 94.84%    13.11GB 90.58%  <T as futures_util::fns::FnMut1<A>>::call_mut
         0     0% 94.84%     0.33GB  2.30%  <alloc::collections::btree::map::BTreeMap<K,V,A> as core::clone::Clone>::clone::clone_subtree
         0     0% 94.84%     0.37GB  2.55%  <alloc::string::String as core::clone::Clone>::clone
         0     0% 94.84%     0.35GB  2.44%  <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
         0     0% 94.84%    13.17GB 91.00%  <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::Flatten<St> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.10GB 90.52%  <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::FlattenUnorderedWithFlowController<St,Fc> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::PollStreamFut<St> as core::future::future::Future>::poll
         0     0% 94.84%    13.73GB 94.85%  <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <tower::limit::rate::service::RateLimit<S> as tower_service::Service<Request>>::call
         0     0% 94.84%     0.18GB  1.27%  <tower::retry::Retry<P,S> as tower_service::Service<Request>>::call
         0     0% 94.84%    13.73GB 94.83%  <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <vector::sinks::util::retries::FibonacciRetryPolicy<L> as tower::retry::policy::Policy<Req,Res,alloc::boxed::Box<dyn core::error::Error+core::marker::Send+core::marker::Sync>>>::clone_request
         0     0% 94.84%     0.19GB  1.30%  <vector::sinks::vector::sink::VectorSink<S> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}
         0     0% 94.84%    13.11GB 90.57%  <vector::sources::prometheus::scrape::PrometheusScrapeContext as vector::sources::util::http_client::HttpClientContext>::on_response
         0     0% 94.84%     0.19GB  1.30%  <vector::transforms::remap::Remap<Runner> as vector_core::transform::SyncTransform>::transform
         0     0% 94.84%     2.87GB 19.81%  <vector_core::event::metadata::EventMetadata as core::default::Default>::default
         0     0% 94.84%     1.60GB 11.03%  <vector_core::event::metadata::Inner as core::default::Default>::default
         0     0% 94.84%     0.16GB  1.11%  <vector_core::event::metric::Metric as core::clone::Clone>::clone
         0     0% 94.84%     7.04GB 48.67%  <vector_core::event::metric::tags::MetricTags as core::convert::From<alloc::collections::btree::map::BTreeMap<alloc::string::String,alloc::string::String>>>::from
         0     0% 94.84%     0.18GB  1.24%  <vector_core::event::vrl_target::VrlTarget as vrl::compiler::target::Target>::target_insert
         0     0% 94.84%     0.17GB  1.19%  <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::assignment::Assignment as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%    13.73GB 94.84%  [libc.so.6]
         0     0% 94.84%    11.05GB 76.35%  _rjem_je_malloc_default
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_backtrace
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_tctx_create
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::map::BTreeMap<K,V,A>::insert
         0     0% 94.84%     7.17GB 49.55%  alloc::collections::btree::map::entry::VacantEntry<K,V,A>::insert
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert_recursing
         0     0% 94.84%     2.67GB 18.41%  alloc::raw_vec::RawVec<T,A>::grow_one
         0     0% 94.84%     0.24GB  1.65%  alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
         0     0% 94.84%     2.90GB 20.06%  alloc::raw_vec::finish_grow
         0     0% 94.84%    13.73GB 94.85%  core::ops::function::FnOnce::call_once{{vtable.shim}}
         0     0% 94.84%     2.68GB 18.49%  do_rallocx
         0     0% 94.84%    13.13GB 90.73%  futures_util::stream::stream::StreamExt::poll_next_unpin
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::line::Line::parse
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::parse_text
         0     0% 94.84%    13.73GB 94.84%  std::sys::backtrace::_rust_begin_short_backtrace
         0     0% 94.84%    13.73GB 94.85%  std::sys::pal::unix::thread::Thread::new::thread_start
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::blocking::pool::Inner::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::context::runtime::enter_runtime
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run_task
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::core::Core<T,S>::poll
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::harness::Harness<T,S>::poll
         0     0% 94.84%    13.18GB 91.06%  vector::source_sender::SourceSender::send_event_stream::{{closure}}
         0     0% 94.84%    12.83GB 88.67%  vector::sources::prometheus::parser::reparse_groups
         0     0% 94.84%    13.18GB 91.06%  vector::sources::util::http_client::call::{{closure}}
         0     0% 94.84%     0.17GB  1.19%  vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}
         0     0% 94.84%    13.18GB 91.04%  vector::topology::builder::Builder::build_sources::{{closure}}::{{closure}}

@jszwedko
Member

@jszwedko So I trimmed down the number of pipelines and found that this issue surfaces when I enable pipelines with high-cardinality metrics and histograms. Is there a setting in Vector that would force-flush the metrics or rate-limit them?

Can you share the slimmed down configuration that still reproduces the issue? That would help narrow the search space.

@jszwedko
Member

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

If the total cardinality was an issue, I'd expect Vector to hit that wall right away rather than the steady increase you seem to be seeing 🤔

@jszwedko
Member

@st-omarkhalid do you know if the metrics are "churning"? That is: say we have 4M metrics at time t1, at time t2 could we have a different set of metrics even if the cardinality is still a total of 4M?

@st-omarkhalid
Author

@jszwedko Looks like there is a correlation with receive errors. I noticed some timeouts in the logs, so I adjusted the scrape timeout, which not only resolved the errors but also the memory growth.
Screenshot 2024-11-15 at 7 44 12 AM
Screenshot 2024-11-15 at 7 44 50 AM
This definitely looks like a bug in the agent. The requests could fail for many reasons but that should not impact the agent.
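
For reference, a sketch of the kind of scrape-timeout adjustment described above, assuming the scrape_timeout_secs option on the prometheus_scrape source is what was tuned; the value is illustrative only:

  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
    scrape_timeout_secs: 45   # illustrative value; keep it below scrape_interval_secs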

@st-omarkhalid
Author

st-omarkhalid commented Nov 15, 2024

Another related issue: with the following vector sink, the metrics looked spotty in our dashboards. I noticed that the disk buffer would fill up within 30 minutes, so presumably, due to drop_newest, it would simply drop the new events.

vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

I tried the following config instead

buffer:
      type: disk
      when_full: block
      max_size: 5368709120

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again led to the memory growth.

Screenshot 2024-11-15 at 1 52 00 PM
Not sure how to debug this further.

@jszwedko
Member

jszwedko commented Nov 15, 2024

@jszwedko Looks like there is a correlation with receive errors. I noticed some timeouts in the logs, so I adjusted the scrape timeout, which not only resolved the errors but also the memory growth.

Hmm, that is interesting. It sounds like some requests may have been stacking up in the prometheus_scrape source and not being reclaimed when they timed out, but that would be surprising.

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again led to the memory growth.

I think what is happening here is that the sink cannot send the data fast enough, so it piles up first in the buffers and then in the scraping source, which I believe will continue scraping on the interval even if it can't flush the data downstream (this is probably suboptimal; I think the source should stop scraping if the previous request hasn't been flushed downstream yet). To fix this, I think you'll need to adjust the sink to meet the throughput requirements of your source. Some ideas to do this:

  • Set request.concurrency to a high static number (see the sketch after this list)
  • If that isn't enough and you are bottlenecked on another resource, increase that resource (e.g. CPU, memory, or network bandwidth)
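
As a sketch of the first idea, a static concurrency setting on the vector sink could look like the following; the value 50 is an assumption to show the shape of the config, not a recommendation:

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      concurrency: 50   # fixed concurrency instead of adaptive (illustrative value)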

@st-omarkhalid
Author

st-omarkhalid commented Nov 18, 2024

@jszwedko You're right: the egress volume not keeping up with the ingress causes the agent to hit OOM. However, for block it seems to be a bug, as it would cause data loss for events that have not been flushed to disk. Right?

I think there are two bugs here:

  1. prometheus_scrape causes OOM when there are request timeouts.
  2. Again, prometheus_scrape would hit OOM with a block buffer when the ingress volume is greater than the egress volume.

How could we track these two issues? Otherwise this thread could be closed.

@pront
Contributor

pront commented Nov 18, 2024

Hi @st-omarkhalid, OOM issues have historically been quite tricky to debug, since we don't have identical environments to reproduce them and there are so many variables involved. I really appreciate the details you included in this issue. It would be beneficial to isolate concerns: if you don't mind, please create a new bug issue for each bug and summarize the situation there. For example, I am not sure which config we are talking about for bug (2).
