Improve transient errors in googlecloud trace exporter batch write spans. #34957
Comments
What version of the collector are you on?

Version 0.102

Can you share the googlecloud exporter configuration as well?
Just adding another datapoint here: we have a user who is doing some outage testing with logs and is running into the same issue, where the exporter seems to drop logs on a transient error (it can't connect to GCP). The logs look like this:

{"level":"warn","ts":"2024-09-06T12:02:16.765+0200","caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: \"logging.googleapis.com:443\", ServerName: \"logging.googleapis.com:443\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 142.250.74.138:443: i/o timeout\"","grpc_log":true}

{"level":"error","ts":"2024-09-06T12:03:07.799+0200","caller":"exporterhelper/queue_sender.go:101","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"logs","name":"googlecloud/applogs_google","error":"context deadline exceeded","dropped_items":15,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101\ngo.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/queue/persistent_queue.go:215\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43"}

The first log line is repeated many times (the Google API servers are intentionally blocked for this test). They are using v0.102.1 of the collector. This case is a little more artificial and we'll try adjusting the timeout settings, but I figured I'd add what we're seeing here.
We're also seeing this sporadically (we admittedly do have some very chatty trace exporters) on version 0.105. Config is:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
exporters:
  googlecloud:
    user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
  googlemanagedprometheus:
    user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  filter/self-metrics:
    metrics:
      include:
        match_type: strict
        metric_names:
        - otelcol_process_uptime
        - otelcol_process_memory_rss
        - otelcol_grpc_io_client_completed_rpcs
        - otelcol_googlecloudmonitoring_point_count
  batch:
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  metricstransform/self-metrics:
    transforms:
    - action: update
      include: otelcol_process_uptime
      operations:
      - action: add_label
        new_label: version
        new_value: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
  # We need to add the pod IP as a resource label so the k8s attributes processor can find it.
  resource/self-metrics:
    attributes:
    - action: insert
      key: k8s.pod.ip
      value: ${env:MY_POD_IP}
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  transform/collision:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["exported_location"], attributes["location"])
      - delete_key(attributes, "location")
      - set(attributes["exported_cluster"], attributes["cluster"])
      - delete_key(attributes, "cluster")
      - set(attributes["exported_namespace"], attributes["namespace"])
      - delete_key(attributes, "namespace")
      - set(attributes["exported_job"], attributes["job"])
      - delete_key(attributes, "job")
      - set(attributes["exported_instance"], attributes["instance"])
      - delete_key(attributes, "instance")
      - set(attributes["exported_project_id"], attributes["project_id"])
      - delete_key(attributes, "project_id")
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        cors:
          allowed_origins:
          - http://*
          - https://*
        endpoint: ${env:MY_POD_IP}:4318
  prometheus/self-metrics:
    config:
      scrape_configs:
      - job_name: otel-self-metrics
        scrape_interval: 1m
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
service:
  extensions:
  - health_check
  pipelines:
    metrics/self-metrics:
      exporters:
      - googlemanagedprometheus
      processors:
      - filter/self-metrics
      - metricstransform/self-metrics
      - resource/self-metrics
      - k8sattributes
      - memory_limiter
      - resourcedetection
      - batch
      receivers:
      - prometheus/self-metrics
    traces:
      exporters:
      - googlecloud
      processors:
      - k8sattributes
      - memory_limiter
      - resourcedetection
      - batch
      receivers:
      - otlp
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
Thanks. We disable retries in the collector because the Google Cloud client library has retries built in.
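For anyone who does want the collector itself to retry (for example, so data waits in the queue instead of being dropped during a longer outage), the exporterhelper retry_on_failure block can be enabled explicitly. A minimal sketch, assuming the exporter exposes the standard exporterhelper retry settings; the intervals are illustrative only:

exporters:
  googlecloud:
    retry_on_failure:
      enabled: true          # collector-side retries are off by default for this exporter
      initial_interval: 5s   # illustrative values, tune for your environment
      max_interval: 30s
      max_elapsed_time: 300s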
Component(s)
No response
Describe the issue you're reporting
I am experiencing transient otel-collector failures when exporting trace batches, e.g.:
I have:
I have tried increasing the timeout to 45 seconds, as described here, and decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced a statistically significant improvement.
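For reference, the two changes amount to roughly the sketch below; only the fields I touched are shown, and lowering send_batch_max_size together with send_batch_size is my assumption about keeping the batch processor consistent.

exporters:
  googlecloud:
    timeout: 45s             # raised from the default
processors:
  batch:
    send_batch_size: 100     # reduced from 200
    send_batch_max_size: 100 # kept equal to send_batch_size (assumption)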
Stacktrace:
Any ideas on what to do?
I don't seem to have any problems with the retry queue.