
Vector-agent hits OOM every hour #21655

Open
st-omarkhalid opened this issue Oct 30, 2024 · 38 comments
Labels
complexity: high Sufficiently complex issue that requires proper planning and time investment to solve. type: bug A code related bug.

Comments

@st-omarkhalid

st-omarkhalid commented Oct 30, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector-agent in our deployment shows constant memory growth until the pod hits OOM. This happens continuously. I looked at a number of open issues on the same problem, but it's not clear how to resolve it. In production we have many more pipelines than shown below.

The vector_component_allocated_bytes metric shows that the remap-* components have the most memory allocated, and that allocation is constantly growing.

Configuration

data_dir: /var/lib/vector
expire_metrics_secs: 600
api:
  enabled: true
  address: 0.0.0.0:8686
  playground: true
sources:
  host_metric_source:
    type: host_metrics
    scrape_interval_secs: 60
  internal_metric_source:
    type: internal_metrics
    scrape_interval_secs: 60
  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
sinks:
  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
    - internal_metric_source
    - host_metric_source
    address: 0.0.0.0:9598
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true
    headers:
      auth-version: "1"
transforms:
  remap_node_ip-node1:
    type: remap
    inputs:
    - ip-node1
    source: |-
      .tags.nodename = "ip-node1"
      .tags.service_name = "nodeExporter"
  tiering:
    type: route
    inputs:
    - remap_*
    route:
      tier1: .name == .name
    reroute_unmatched: false

Version

0.35.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@st-omarkhalid st-omarkhalid added the type: bug A code related bug. label Oct 30, 2024
@st-omarkhalid st-omarkhalid changed the title Vector-agent hit OOM every hour Vector-agent hits OOM every hour Oct 30, 2024
@jszwedko
Member

Hey! Diagnosing memory issues in Vector can be tricky. A few questions that may help:

  • What memory limits do you have applied?
  • If you increase them, does the usage level out, or do you see it grow consistently regardless of the limit (indicating some sort of leak)?
  • Is it correlated with the adaptive_concurrency_limit metric? I'm wondering if an increasing number of concurrent requests is causing memory pressure.

@st-omarkhalid
Author

Thanks @jszwedko for the response. We don't have any limits applied, but the pod typically hits OOM around 29 GB. Yes, vector_adaptive_concurrency_limit_count does correlate with the memory growth.
Screenshot 2024-10-30 at 9 12 11 AM

@jszwedko
Member

Interesting, thanks for sharing that graph. It seems likely to me that the issue is that the concurrency limit is never finding a max. You could try to configure a max via request.adaptive_concurrency.max_concurrency_limit (e.g. https://vector.dev/docs/reference/configuration/sinks/vector/#request.adaptive_concurrency.max_concurrency_limit).

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

@jszwedko
Member

That did not help. Did I get it right?

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      adaptive_concurrency:
        max_concurrency_limit: 500

That looks right. Did you observe adaptive_concurrency_limit not exceeding this limit?

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

It went past it and the memory growth had the same behavior as before.
Screenshot 2024-10-30 at 12 58 13 PM

@jszwedko Wait, looking at the breakdown per sink type, it looks like most of the concurrency_limit is coming from another sink, which does not have the setting. Let me update the other sink and report back.

@st-omarkhalid
Author

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

@jszwedko
Member

@jszwedko I set the Datadog sink to a limit of 100, but it's at 1K after 50 minutes and memory has grown linearly as well.

Hmm, can you share the config you are trying for the Datadog metrics sink?

@st-omarkhalid
Author

Here

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    default_api_key: <key>
    site: datadoghq.com
    request:
      adaptive_concurrency:
        max_concurrency_limit: 100

@jszwedko
Member

That looks right. Are you confident that that is the sink that is exceeding the limit? All of the others are respecting it?

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

Here: the vector sink gets to 55-65 and the dd-metrics sink gets to 1k before the pod hits OOM.
Screenshot 2024-10-30 at 3 10 35 PM

@jszwedko
Member

Gotcha, thanks! That does look like it is exceeding the max. I'm having trouble reproducing this behavior locally though 😢

I'm running this config:

sources:
  source0:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
  source1:
    namespace: vector
    scrape_interval_secs: 0.1
    type: internal_metrics
sinks:
  sink0:
    inputs:
    - source0
    type: datadog_metrics
    batch:
      max_events: 1
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5
  sink1:
    inputs:
    - source1
    type: datadog_metrics
    request:
      adaptive_concurrency:
        initial_concurrency: 1
        max_concurrency_limit: 5

For sink0 (fed by source0) I'm limiting batch.max_events to 1 to cause it to make more requests than it otherwise would. I see the max for sink0 bubbling around 6. If I remove the limit it goes up to 50. I tried with both Vector v0.35.0 and v0.42.0.

@st-omarkhalid
Author

st-omarkhalid commented Oct 30, 2024

For a moment I thought the limits had improved, but that's still not the case. There were a couple of evicted pods which had lower limits, but otherwise the limits are way over 100 for the dd sink.
Screenshot 2024-10-30 at 5 44 08 PM
For context we have over 300 sources (each with its own transform) in the config.

@jszwedko
Member

Is that graph cumulative across multiple nodes / sinks? E.g. do you have multiple datadog_metrics sinks in the same config?

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup. The sawtooth pattern above is because the pod is getting OOM'd and replaced by a new one. The config has 3 sinks as pointed out in the issue description.

@jszwedko How long did you run your test? Was the limit stable at 6 for a long time?

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

I looked at another instance of the vector-agent in a different cluster (where, again, the limit is not applied). The limit reached 15K without hitting OOM; the number of sources in that cluster is a lot lower, but it still has 3 sinks.

On the problematic vector-agent instance I see large memory allocations across the remap components (vector_component_allocated_bytes).
Screenshot 2024-10-31 at 6 00 38 PM

@jszwedko
Member

jszwedko commented Oct 31, 2024

That graph is simply vector_adaptive_concurrency_limit_count plotted with no filters. We have a single vector-agent pod in our setup.

I'm not sure you want to plot _count; that would be the count of data points in the histogram. Could you plot the max instead?

I only ran my test for maybe 10 minutes. I can try another run.

@pront pront added the complexity: high Sufficiently complex issue that requires proper planning and time investment to solve. label Oct 31, 2024
@st-omarkhalid
Author

@jszwedko You mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it.
Screenshot 2024-10-31 at 9 50 24 AM

@jszwedko
Member

@jszwedko You mean max(vector_adaptive_concurrency_limit_count)? Otherwise I don't see a max metric for it. Screenshot 2024-10-31 at 9 50 24 AM

Hmm. What system are you sending these metrics into? vector_adaptive_concurrency_limit is a histogram so you should be able to do things like taking the max, min, and average.

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?
Screenshot 2024-10-31 at 10 40 14 AM

@jszwedko
Member

The buckets are present. Is histogram_quantile(1, increase(vector_adaptive_concurrency_limit_bucket[1h])) something you want to see?

Nice, yeah, I see histogram_quantile(1, ...) gives you the max value. There we can see it does seem to be staying below or around the limit of 100.

My next hypothesis is that there might be backpressure causing requests to queue up in the source waiting to flush data downstream. Could you share a graph of vector_component_utilization and buffer_send_duration_seconds per component?

@st-omarkhalid
Author

Utilization has an interesting pattern: it spikes in the remap component and then settles down.
Screenshot 2024-10-31 at 11 55 45 AM

@st-omarkhalid
Author

st-omarkhalid commented Oct 31, 2024

Screenshot 2024-10-31 at 12 28 02 PM
sum(histogram_quantile(0.99, increase(vector_buffer_send_duration_seconds_bucket[10m]))) by (component_type)

Is there a more explicit metric on the queue size? Maybe https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events
Screenshot 2024-10-31 at 12 36 24 PM

@st-omarkhalid
Author

The buffer size is pretty stable as well.
Screenshot 2024-11-01 at 9 53 19 AM

I also checked the ingress/egress difference and that is consistent as well.
Screenshot 2024-11-01 at 9 50 25 AM

This does appear to be a memory leak issue.

@jszwedko
Member

jszwedko commented Nov 1, 2024

Yeah, it could be, though your config is fairly simple and it'd be surprising if a leak there wasn't affecting a large number of users. I'll try to think about this some more, but if you are able to grab a memory profile using valgrind, that could be helpful to narrow down where the memory is being used.

@st-omarkhalid
Author

st-omarkhalid commented Nov 3, 2024

Vector doesn't even run for me under valgrind; it fails with a segfault.

==20819== Memcheck, a memory error detector
==20819== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==20819== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==20819== Command: vector/target/release/vector
==20819==
==20819== Invalid read of size 4
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20819==
==20819==
==20819== Process terminating with default action of signal 11 (SIGSEGV)
==20819==  Access not within mapped region at address 0x64
==20819==    at 0x3EBDE6A: _rjem_je_tcache_bin_flush_small (in /home/admin/vector/target/release/vector)
==20819==    by 0x3E895B8: _rjem_je_sdallocx_default (in /home/admin/vector/target/release/vector)
==20819==    by 0x2CE09D8: core::ptr::drop_in_place<clap_builder::parser::matches::matched_arg::MatchedArg> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B6AB51: core::ptr::drop_in_place<clap_builder::util::flat_map::FlatMap<clap_builder::util::id::Id,clap_builder::parser::matches::matched_arg::MatchedArg>> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3B99EFB: core::ptr::drop_in_place<clap_builder::parser::matches::arg_matches::ArgMatches> (in /home/admin/vector/target/release/vector)
==20819==    by 0x3C68D4C: vector::cli::Opts::get_matches (in /home/admin/vector/target/release/vector)
==20819==    by 0x14092CE: vector::main (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EC2: std::sys::backtrace::__rust_begin_short_backtrace (in /home/admin/vector/target/release/vector)
==20819==    by 0x1408EB8: std::rt::lang_start::{{closure}} (in /home/admin/vector/target/release/vector)
==20819==    by 0x8BC199F: std::rt::lang_start_internal (in /home/admin/vector/target/release/vector)
==20819==    by 0x140A434: main (in /home/admin/vector/target/release/vector)
==20819==  If you believe this happened as a result of a stack
==20819==  overflow in your program's main thread (unlikely but
==20819==  possible), you can try to increase the size of the
==20819==  main thread stack using the --main-stacksize= flag.
==20819==  The main thread stack size used in this run was 108392448.
==20819==
==20819== HEAP SUMMARY:
==20819==     in use at exit: 172,057 bytes in 471 blocks
==20819==   total heap usage: 478 allocs, 7 frees, 174,129 bytes allocated
==20819==
==20819== LEAK SUMMARY:
==20819==    definitely lost: 0 bytes in 0 blocks
==20819==    indirectly lost: 0 bytes in 0 blocks
==20819==      possibly lost: 0 bytes in 0 blocks
==20819==    still reachable: 172,057 bytes in 471 blocks
==20819==         suppressed: 0 bytes in 0 blocks
==20819== Reachable blocks (those to which a pointer was found) are not shown.
==20819== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==20819==
==20819== For lists of detected and suppressed errors, rerun with: -s
==20819== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

@st-omarkhalid
Author

st-omarkhalid commented Nov 7, 2024

@jszwedko So I trimmed down the number of pipelines and found that this issue surfaces when I enable pipelines with high-cardinality metrics and histograms. Is there a setting in Vector that would force-flush the metrics or rate-limit them?

Batches are flushed when 1 of 2 conditions are met:

The batch age meets or exceeds the configured timeout_secs.
The batch size meets or exceeds the configured max_bytes or max_events.

We have it set for one sink but not the other.

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

What would be the default behaviour for dd sink above in terms of flushing the buffers? Also, is there a way to block/rate-limit specific metrics?

@st-omarkhalid
Author

st-omarkhalid commented Nov 9, 2024

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

@pront
Contributor

pront commented Nov 11, 2024

Regarding throttling, we have https://vector.dev/docs/reference/configuration/transforms/throttle/ but that doesn't work with metrics yet.

Also, I am not aware of anything that allows you to force-flush, but you can set batch.max_events to something smaller, as in #21655 (comment).

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

There's no hard limit for this. It all depends on how fast you produce/consume and on your buffer strategy. If you are dropping events when the buffer is full, it shouldn't cause OOM.
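
Putting those two points together, here is a minimal sketch of what a smaller batch.max_events plus a bounded, dropping buffer could look like on the Datadog sink. The values are assumptions chosen purely for illustration, not tested recommendations:

  datadog_sink:
    type: datadog_metrics
    inputs:
    - remap_pod_*
    site: datadoghq.com
    batch:
      max_events: 1000        # flush smaller batches more often (illustrative value)
      timeout_secs: 5
    buffer:
      type: memory
      max_events: 10000       # bound how much can queue in memory (illustrative value)
      when_full: drop_newest  # shed load instead of accumulating it when the sink falls behind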

@st-omarkhalid
Author

st-omarkhalid commented Nov 13, 2024

I got a heap profile, and this is how a diff looks:

Showing nodes accounting for 13.73GB, 94.84% of 14.47GB total
Dropped 190 nodes (cum <= 0.07GB)
      flat  flat%   sum%        cum   cum%
   13.73GB 94.84% 94.84%    13.73GB 94.84%  prof_backtrace_impl
         0     0% 94.84%     0.27GB  1.90%  <(FnA,FnB,FnC) as nom::sequence::Tuple<Input,(A,B,C),Error>>::parse
         0     0% 94.84%     0.22GB  1.52%  <F as nom::internal::Parser<I,O,E>>::parse
         0     0% 94.84%    13.11GB 90.58%  <T as futures_util::fns::FnMut1<A>>::call_mut
         0     0% 94.84%     0.33GB  2.30%  <alloc::collections::btree::map::BTreeMap<K,V,A> as core::clone::Clone>::clone::clone_subtree
         0     0% 94.84%     0.37GB  2.55%  <alloc::string::String as core::clone::Clone>::clone
         0     0% 94.84%     0.35GB  2.44%  <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
         0     0% 94.84%    13.17GB 91.00%  <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::Flatten<St> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.10GB 90.52%  <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::FlattenUnorderedWithFlowController<St,Fc> as futures_core::stream::Stream>::poll_next
         0     0% 94.84%    13.12GB 90.68%  <futures_util::stream::stream::flatten_unordered::PollStreamFut<St> as core::future::future::Future>::poll
         0     0% 94.84%    13.73GB 94.85%  <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <tower::limit::rate::service::RateLimit<S> as tower_service::Service<Request>>::call
         0     0% 94.84%     0.18GB  1.27%  <tower::retry::Retry<P,S> as tower_service::Service<Request>>::call
         0     0% 94.84%    13.73GB 94.83%  <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
         0     0% 94.84%     0.18GB  1.27%  <vector::sinks::util::retries::FibonacciRetryPolicy<L> as tower::retry::policy::Policy<Req,Res,alloc::boxed::Box<dyn core::error::Error+core::marker::Send+core::marker::Sync>>>::clone_request
         0     0% 94.84%     0.19GB  1.30%  <vector::sinks::vector::sink::VectorSink<S> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}
         0     0% 94.84%    13.11GB 90.57%  <vector::sources::prometheus::scrape::PrometheusScrapeContext as vector::sources::util::http_client::HttpClientContext>::on_response
         0     0% 94.84%     0.19GB  1.30%  <vector::transforms::remap::Remap<Runner> as vector_core::transform::SyncTransform>::transform
         0     0% 94.84%     2.87GB 19.81%  <vector_core::event::metadata::EventMetadata as core::default::Default>::default
         0     0% 94.84%     1.60GB 11.03%  <vector_core::event::metadata::Inner as core::default::Default>::default
         0     0% 94.84%     0.16GB  1.11%  <vector_core::event::metric::Metric as core::clone::Clone>::clone
         0     0% 94.84%     7.04GB 48.67%  <vector_core::event::metric::tags::MetricTags as core::convert::From<alloc::collections::btree::map::BTreeMap<alloc::string::String,alloc::string::String>>>::from
         0     0% 94.84%     0.18GB  1.24%  <vector_core::event::vrl_target::VrlTarget as vrl::compiler::target::Target>::target_insert
         0     0% 94.84%     0.17GB  1.19%  <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::Expr as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%     0.18GB  1.24%  <vrl::compiler::expression::assignment::Assignment as vrl::compiler::expression::Expression>::resolve
         0     0% 94.84%    13.73GB 94.84%  [libc.so.6]
         0     0% 94.84%    11.05GB 76.35%  _rjem_je_malloc_default
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_backtrace
         0     0% 94.84%    13.73GB 94.84%  _rjem_je_prof_tctx_create
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::map::BTreeMap<K,V,A>::insert
         0     0% 94.84%     7.17GB 49.55%  alloc::collections::btree::map::entry::VacantEntry<K,V,A>::insert
         0     0% 94.84%     0.13GB  0.89%  alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut,K,V,alloc::collections::btree::node::marker::Leaf>,alloc::collections::btree::node::marker::Edge>::insert_recursing
         0     0% 94.84%     2.67GB 18.41%  alloc::raw_vec::RawVec<T,A>::grow_one
         0     0% 94.84%     0.24GB  1.65%  alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
         0     0% 94.84%     2.90GB 20.06%  alloc::raw_vec::finish_grow
         0     0% 94.84%    13.73GB 94.85%  core::ops::function::FnOnce::call_once{{vtable.shim}}
         0     0% 94.84%     2.68GB 18.49%  do_rallocx
         0     0% 94.84%    13.13GB 90.73%  futures_util::stream::stream::StreamExt::poll_next_unpin
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::line::Line::parse
         0     0% 94.84%     0.27GB  1.90%  prometheus_parser::parse_text
         0     0% 94.84%    13.73GB 94.84%  std::sys::backtrace::_rust_begin_short_backtrace
         0     0% 94.84%    13.73GB 94.85%  std::sys::pal::unix::thread::Thread::new::thread_start
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::blocking::pool::Inner::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::context::runtime::enter_runtime
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::Context::run_task
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::scheduler::multi_thread::worker::run
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::core::Core<T,S>::poll
         0     0% 94.84%    13.73GB 94.85%  tokio::runtime::task::harness::Harness<T,S>::poll
         0     0% 94.84%    13.18GB 91.06%  vector::source_sender::SourceSender::send_event_stream::{{closure}}
         0     0% 94.84%    12.83GB 88.67%  vector::sources::prometheus::parser::reparse_groups
         0     0% 94.84%    13.18GB 91.06%  vector::sources::util::http_client::call::{{closure}}
         0     0% 94.84%     0.17GB  1.19%  vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}
         0     0% 94.84%    13.18GB 91.04%  vector::topology::builder::Builder::build_sources::{{closure}}::{{closure}}

@jszwedko
Member

@jszwedko So I trimmed down the number of pipelines and found that this issue surfaces when I enable pipelines with high-cardinality metrics and histograms. Is there a setting in Vector that would force-flush the metrics or rate-limit them?

Can you share the slimmed down configuration that still reproduces the issue? That would help narrow the search space.

@jszwedko
Member

@jszwedko @pront I looked at the active time-series on the cluster and we have around 4M of them. Is that too much for the vector-agent to handle?

If the total cardinality was an issue, I'd expect Vector to hit that wall right away rather than the steady increase you seem to be seeing 🤔

@jszwedko
Member

@st-omarkhalid do you know if the metrics are "churning"? That is: say we have 4M metrics at time t1, at time t2 could we have a different set of metrics even if the cardinality is still a total of 4M?

@st-omarkhalid
Author

@jszwedko Looks like there is a correlation with receive errors. I noticed some timeouts in the logs, so I adjusted the scrape timeout, which not only resolved the errors but also the memory growth.
Screenshot 2024-11-15 at 7 44 12 AM
Screenshot 2024-11-15 at 7 44 50 AM
This definitely looks like a bug in the agent. The requests could fail for many reasons but that should not impact the agent.
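
For reference, a sketch of the kind of scrape-timeout adjustment described above, assuming the scrape_timeout_secs option on the prometheus_scrape source is what was tuned; the value is illustrative only:

  ip-node1:
    type: prometheus_scrape
    endpoints:
    - http://<ip>:9100/metrics
    scrape_interval_secs: 60
    scrape_timeout_secs: 45   # illustrative value; keep it below scrape_interval_secs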

@st-omarkhalid
Author

st-omarkhalid commented Nov 15, 2024

Another related issue: with the following vector sink, the metrics looked spotty in our dashboards. I noticed that the disk buffer would fill up within 30 minutes, so presumably, due to drop_newest, it would simply drop the new events.

vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: <url>
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

I tried the following config instead

buffer:
      type: disk
      when_full: block
      max_size: 5368709120

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again led to the memory growth.

Screenshot 2024-11-15 at 1 52 00 PM
Not sure how to debug this further.

@jszwedko
Member

jszwedko commented Nov 15, 2024

@jszwedko Looks like there is a correlation with receive errors. I noticed some timeouts in the logs, so I adjusted the scrape timeout, which not only resolved the errors but also the memory growth.

Hmm, that is interesting. It sounds like some requests may have been stacking up in the prometheus_scrape source and not being reclaimed when they timed out, but that would be surprising.

thinking it would slow down the ingestion once the buffer was filled up. Unfortunately this again led to the memory growth.

I think what is happening here is that the sink cannot send the data fast enough, so it piles up first in the buffers and then in the scraping source, which I believe will continue scraping on the interval even if it can't flush the data downstream (this is probably suboptimal; I think the source should stop scraping if the previous request hasn't been flushed downstream yet). To fix this, I think you'll need to adjust the sink to meet the throughput requirements of your source. Some ideas to do this:

  • Set request.concurrency to a high static number (see the sketch after this list)
  • If that isn't enough and you are bottlenecked on another resource, increase that resource (e.g. CPU, memory, or network bandwidth)
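
As a sketch of the first idea, a static concurrency setting on the vector sink could look like the following; the value 50 is an assumption to show the shape of the config, not a recommendation:

  vector_sink:
    type: vector
    inputs:
    - tiering.tier1
    address: https://<host>:8903
    request:
      concurrency: 50   # fixed concurrency instead of adaptive (illustrative value)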

@st-omarkhalid
Author

st-omarkhalid commented Nov 18, 2024

@jszwedko You're right: the egress volume not keeping up with the ingress causes the agent to hit OOM. However, for block it seems to be a bug, as it would cause data loss for events that have not been flushed to disk. Right?

I think there are two bugs here:

  1. prometheus_scrape causes OOM when there are request timeouts.
  2. Again, prometheus_scrape would hit OOM with a block buffer when the ingress volume is greater than the egress volume.

How could we track these two issues? Otherwise this thread could be closed.

@pront
Contributor

pront commented Nov 18, 2024

Hi @st-omarkhalid, OOM issues have historically been quite tricky to debug, since we don't have identical environments to reproduce them and there are so many variables involved. I really appreciate the details you included in this issue. It would be beneficial to isolate concerns: if you don't mind, please create a new bug issue for each bug and summarize the situation there. For example, I am not sure which config we are talking about for bug (2).
