Question: What does FluentdRecordsCountHigh alert imply? #89

Open
Ghazgkull opened this issue Aug 13, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@Ghazgkull
Contributor

Is your feature request related to a problem? Please describe.

After enabling the out-of-the-box PrometheusRules provided by this chart, the FluentdRecordsCountHigh alert has briefly fired and resolved itself twice within the first couple of hours.

There is documentation on this alert which reads:

      summary: fluentd records count are critical
      description: In the last 5m, records counts increased 3 times, comparing to the latest 15 min.

Unfortunately, I'm not able to understand what this means. Why is this a "critical" problem?

Describe the solution you'd like

If I'm able to understand what this alert means, I'd like to suggest a documentation change to make it more easily understood.

@Ghazgkull added the enhancement label on Aug 13, 2021
@Ghazgkull
Contributor Author

A little follow-up: this alert continued to fire and resolve in our deployments, which were actually operating just fine. We ended up writing a script that patches the PrometheusRules in our cluster after the helm chart deploys, in order to remove the FluentdRecordsCountHigh alert. Because those alerts live nested inside two arrays in the PrometheusRule object, it's a non-trivial bit of scripting (see the sketch below).
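
For anyone hitting the same thing, a rough sketch of that kind of post-deploy patch, assuming jq is available and that the PrometheusRule object lives in the monitoring namespace under the hypothetical name fluentd-elasticsearch-prometheus-rules (look the real name up with kubectl get prometheusrules -A first):

# Remove the alert from the deployed PrometheusRule object.
# Namespace and resource name below are assumptions; adjust to your release.
kubectl -n monitoring get prometheusrule fluentd-elasticsearch-prometheus-rules -o json \
  | jq 'del(.spec.groups[].rules[] | select(.alert == "FluentdRecordsCountHigh"))' \
  | kubectl replace -f -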

I'm left wondering whether anyone understands this alert and is getting any value out of it, or whether we should just remove it from the chart. @monotek Any thoughts?

@monotek
Member

monotek commented Sep 20, 2021

Why don't you just disable the alert if you don't need it?
https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/values.yaml#L314-L325

@Ghazgkull
Contributor Author

Is there something in the helm chart configuration that would allow me to disable this one alert? Maybe I'm being dense, but I don't see anything in the code you linked that would let me do that.

As I mentioned above, I've written some scripting that modifies the PrometheusRule resource after it's deployed to remove the alert. But it's complicated, and it raised the question for me of what the alert is even for.

@monotek
Member

monotek commented Sep 20, 2021

Just override the rules in your values.yaml. The list replaces the chart's defaults, so leaving FluentdRecordsCountHigh out removes that alert:

prometheusRule:
  enabled: true
  prometheusNamespace: monitoring
  rules:
  - alert: FluentdNodeDown
    expr: up{job="{{ include "fluentd-elasticsearch.metricsServiceName" . }}"} == 0
    for: 10m
    labels:
      service: fluentd
      severity: warning
    annotations:
      summary: fluentd cannot be scraped
      description: Prometheus could not scrape {{ "{{ $labels.job }}" }} for more than 10 minutes
  - alert: FluentdNodeDown
    expr: up{job="{{ include "fluentd-elasticsearch.metricsServiceName" . }}"} == 0
    for: 30m
    labels:
      service: fluentd
      severity: critical
    annotations:
      summary: fluentd cannot be scraped
      description: Prometheus could not scrape {{ "{{ $labels.job }}" }} for more than 30 minutes
  - alert: FluentdQueueLength
    expr: rate(fluentd_status_buffer_queue_length[5m]) > 0.3
    for: 1m
    labels:
      service: fluentd
      severity: warning
    annotations:
      summary: fluentd node are failing
      description: In the last 5 minutes, fluentd queues increased 30%. Current value is {{ "{{ $value }}" }}
  - alert: FluentdQueueLength
    expr: rate(fluentd_status_buffer_queue_length[5m]) > 0.5
    for: 1m
    labels:
      service: fluentd
      severity: critical
    annotations:
      summary: fluentd node are critical
      description: In the last 5 minutes, fluentd queues increased 50%. Current value is {{ "{{ $value }}" }}
  - alert: FluentdRetry
    expr: increase(fluentd_status_retry_count[10m]) > 0
    for: 20m
    labels:
      service: fluentd
      severity: warning
    annotations:
      description: Fluentd retry count has been  {{ "{{ $value }}" }} for the last 10 minutes
      summary: Fluentd retry count has been  {{ "{{ $value }}" }} for the last 10 minutes
  - alert: FluentdOutputError
    expr: increase(fluentd_output_status_num_errors[10m]) > 0
    for: 1s
    labels:
      service: fluentd
      severity: warning
    annotations:
      description: Fluentd output error count is {{ "{{ $value }}" }} for the last 10 minutes
      summary: There have been Fluentd output error(s) for the last 10 minutes
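
With the override above saved to a values file, something like the following would roll it out; the release name, namespace, repo alias and file name here are placeholders for your own setup.

# Apply the overridden rules; all names below are example placeholders.
helm upgrade --install fluentd kokuwaio/fluentd-elasticsearch \
  --namespace logging -f my-values.yaml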

@Ghazgkull
Contributor Author

Ghazgkull commented Sep 21, 2021

@monotek Sure, thanks. I appreciate the help, but I do have a solution in hand which works for me with a different set of tradeoffs.

My request here is for a clarification of the docs, though. My question is: what does "FluentdRecordsCountHigh" actually mean? I just don't understand the description on the alert. Does anyone know what value this alert provides? If so, I'd be happy to update the doc with a PR.

@monotek
Member

monotek commented Sep 21, 2021

The alarm is copied from: https://github.com/fluent/fluent-plugin-prometheus/blob/master/misc/prometheus_alerts.yaml#L49-L59
Imho it just means "hey, you're getting unusually more logs than normal".
But I guess it's best to ask in the repo mentioned above to get clarification.
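
For context, what that upstream rule checks (from memory) is whether the short-term record throughput spikes relative to the recent baseline: it fires when the rate of fluentd_output_status_emit_records over the last 5 minutes exceeds three times the rate over the last 15 minutes, i.e. the emitted log volume has suddenly roughly tripled. A rough sketch of the rule, reconstructed from memory (the exact expression and labels may differ, check the linked file):

- alert: FluentdRecordsCountHigh
  # Approximate reconstruction: the 5m emit rate exceeding 3x the 15m emit
  # rate per instance means an abrupt spike in emitted log records.
  expr: >
    sum(rate(fluentd_output_status_emit_records[5m])) by (instance)
    > (3 * sum(rate(fluentd_output_status_emit_records[15m])) by (instance))
  for: 1m
  labels:
    service: fluentd
    severity: critical
  annotations:
    summary: fluentd records count are critical
    description: In the last 5m, records counts increased 3 times, comparing to the latest 15 min.

In other words it is a log-volume anomaly heuristic rather than an error condition, which would explain why it can flap on bursty but otherwise healthy workloads.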
