
Miss error logs after changes in retry #30271

Closed · monoxono opened this issue Jan 3, 2024 · 10 comments
monoxono (Contributor) commented Jan 3, 2024

Component(s)

exporter/prometheusremotewrite

What happened?

Description

Before commit 8715a5c (#23842), when prometheusremotewrite failed to send metrics, we could see error logs like this:

2024-01-03T09:49:45.963+0100	error	exporterhelper/retry_sender.go:145	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: Post \"https://xxxx.com/prom/push\": Service Unavailable", "dropped_items": 16}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector/[email protected]/exporterhelper/retry_sender.go:145
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/collector/[email protected]/exporterhelper/metrics.go:176
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).start.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:124
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).Start.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/bounded_memory_queue.go:52

That commit uses "cenkalti/backoff" to retry instead of the retry function provided by exporterhelper. Now, when prometheusremotewrite fails to send metrics, there are no error logs at all, which makes troubleshooting very hard.
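
For context, here is a minimal Go sketch of that pattern, assuming a cenkalti/backoff/v4 retry loop around the HTTP push; pushWithRetry and the request details are illustrative, not the exporter's actual code. If the final error is only returned and never logged, the failure is invisible in the logs:

package sketch

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/cenkalti/backoff/v4"
)

// pushWithRetry is a hypothetical helper, not the exporter's actual code.
func pushWithRetry(client *http.Client, url string, body []byte) error {
	operation := func() error {
		resp, err := client.Post(url, "application/x-protobuf", bytes.NewReader(body))
		if err != nil {
			// Network-level failures are retried by backoff.Retry below.
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode/100 == 5 {
			// 5xx responses are treated as retryable, so backoff keeps retrying silently.
			return fmt.Errorf("remote write returned %s", resp.Status)
		}
		if resp.StatusCode/100 != 2 {
			// Other non-2xx responses stop the retry loop immediately.
			return backoff.Permanent(fmt.Errorf("remote write returned %s", resp.Status))
		}
		return nil
	}

	b := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5)
	// If this final error is only returned to the caller and never logged,
	// nothing about the failure shows up in the collector logs.
	return backoff.Retry(operation, b)
}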

Steps to Reproduce

Run the example configuration to reproduce

Expected Result

Errors are visible in the logs so we can understand why sending metrics fails.

Actual Result

No error log.

Collector version

8715a5c

Environment information

Environment

OpenTelemetry Collector configuration

receivers:
  prometheus/self:
    config:
      global:
        scrape_interval: 10s
        scrape_timeout: 1s
      scrape_configs:
      - job_name: opentelemetry
        static_configs:
        - targets: [localhost:8888]

exporters:
  prometheusremotewrite:
    endpoint: https://xxx.com/prom/push # some fake url

service:
  pipelines:
    metrics:
      receivers:
      - prometheus/self
      exporters:
      - prometheusremotewrite

Log output

No response

Additional context

No response

monoxono added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 3, 2024
github-actions bot commented Jan 3, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 (Member) commented
Hello @monoxono, thanks for filing this issue! I understand the confusion here. The change you've referenced allowed retries on 5xx errors instead of resulting in permanent errors. This means the exporter is now retrying the export instead of explicitly failing, so the log message is no longer accurate.

I think it would be fair to add a log message when a request gets a 5xx error, but it should be at the debug or possibly info level, probably not higher. It would be good to know that data wasn't sent successfully, but it's no longer a permanent error or dropped data, just delayed (hopefully).
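
A minimal sketch of that suggestion, assuming the exporter's *zap.Logger is available where the HTTP response is handled; checkResponse and the field names are illustrative, not the exporter's real code:

package sketch

import (
	"fmt"
	"net/http"

	"go.uber.org/zap"
)

// checkResponse is a hypothetical helper: it logs a retryable 5xx response at
// debug level before returning the error that triggers the next retry attempt.
func checkResponse(logger *zap.Logger, resp *http.Response) error {
	if resp.StatusCode/100 == 5 {
		logger.Debug("remote write request failed with a retryable status, will retry",
			zap.Int("status_code", resp.StatusCode),
			zap.String("status", resp.Status),
		)
		return fmt.Errorf("remote write returned %s", resp.Status)
	}
	return nil
}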

crobert-1 removed the needs triage (New item requiring triage) label on Jan 3, 2024
rapphil (Contributor) commented Jan 4, 2024

Hi @monoxono

This commit was included in v0.84.0. I tried to reproduce the issue in v0.84.0 and I was not able to. However, I'm able to reproduce in v0.85.0. I share your concern and I think permanent errors should be logged.

I will investigate the issue more.

rapphil (Contributor) commented Jan 4, 2024

I'm fairly confident that this issue is caused by this change: open-telemetry/opentelemetry-collector#8369

I think the exporterhelper is not logging terminal errors at all in case queued_retry is used. The prometheusremotewrite exporter does not use queued_retry from the exporterhelper, because we implement retries ourselves to avoid out-of-order samples.

crobert-1 (Member) commented

Quoting rapphil: "I share your concern and I think permanent errors should be logged."

I may be misunderstanding something, but isn't the error Service Unavailable that this bug was filed for a 503 error, which means it isn't permanent? If this is the case, isn't a retry the right thing to do instead of making it a permanent error? I understand wanting to log errors, but just want to make sure we're clear on whether it's a permanent error or not.

rapphil (Contributor) commented Jan 4, 2024

@crobert-1 it can become permanent after the component has exhausted a certain number of retries. In this case the prometheus remote write exporter will return a permanent error to the exporterhelper.
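
A minimal sketch of that flow, under the assumptions above; pushMetrics and exportBatch are placeholders, with consumererror.NewPermanent marking the exhausted-retries error as permanent so exporterhelper will not retry it:

package sketch

import (
	"github.com/cenkalti/backoff/v4"
	"go.opentelemetry.io/collector/consumer/consumererror"
)

// pushMetrics is a hypothetical helper; exportBatch stands in for a single
// remote write attempt. Only consumererror.NewPermanent is a real collector API here.
func pushMetrics(exportBatch func() error) error {
	b := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 5)
	if err := backoff.Retry(exportBatch, b); err != nil {
		// Once retries are exhausted, the error is wrapped as permanent so the
		// exporterhelper will not retry it again; whether anything logs it at
		// that point is exactly what this issue is about.
		return consumererror.NewPermanent(err)
	}
	return nil
}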

monoxono (Contributor, Author) commented Jan 5, 2024

Thanks to you both for looking into the issue.

@rapphil, you're right. My previous finding was wrong; this issue can be reproduced in 0.85.0, but not in 0.84.0.

jsirianni (Member) commented
I believe I am seeing the same issue. I had a misconfiguration (port). The exporter silently fails to send metrics.

jmichalek132 (Contributor) commented
This should be fixed by open-telemetry/opentelemetry-collector#9282, right? Can we close it?

crobert-1 (Member) commented
Thanks @jmichalek132, I believe you're correct. Closing.
