Improve transient errors in googlecloud trace exporter batch write spans. #34957
Comments
What version of the collector are you on?

Version 0.102

Can you share the googlecloud exporter configuration as well?
Just adding another datapoint here: we have a user who is doing some outage testing with logs and is running into the same issue, where the exporter seems to drop logs on a transient error (it can't connect to GCP). The logs look like this:

{"level":"warn","ts":"2024-09-06T12:02:16.765+0200","caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: \"logging.googleapis.com:443\", ServerName: \"logging.googleapis.com:443\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 142.250.74.138:443: i/o timeout\"","grpc_log":true}

{"level":"error","ts":"2024-09-06T12:03:07.799+0200","caller":"exporterhelper/queue_sender.go:101","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"logs","name":"googlecloud/applogs_google","error":"context deadline exceeded","dropped_items":15,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101\ngo.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Consume\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/queue/persistent_queue.go:215\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43"}

The first log line is repeated many times (the Google API servers are intentionally blocked for this test). They are using v0.102.1 of the collector. This case is a little more artificial and we'll try adjusting the timeout settings, but I figured I'd add what we're seeing here.
We're also seeing this sporadically (we admittedly do have some very chatty trace exporters) on version 0.105. Config is:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
exporters:
  googlecloud:
    user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
  googlemanagedprometheus:
    user_agent: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
extensions:
  health_check:
    endpoint: ${env:MY_POD_IP}:13133
processors:
  filter/self-metrics:
    metrics:
      include:
        match_type: strict
        metric_names:
        - otelcol_process_uptime
        - otelcol_process_memory_rss
        - otelcol_grpc_io_client_completed_rpcs
        - otelcol_googlecloudmonitoring_point_count
  batch:
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s
  k8sattributes:
    extract:
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  metricstransform/self-metrics:
    transforms:
    - action: update
      include: otelcol_process_uptime
      operations:
      - action: add_label
        new_label: version
        new_value: Google-Cloud-OTLP manifests:0.1.0 otel/opentelemetry-collector-contrib:{{ .Values.opentelemetryCollector.version }}
  # We need to add the pod IP as a resource label so the k8s attributes processor can find it.
  resource/self-metrics:
    attributes:
    - action: insert
      key: k8s.pod.ip
      value: ${env:MY_POD_IP}
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  transform/collision:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["exported_location"], attributes["location"])
      - delete_key(attributes, "location")
      - set(attributes["exported_cluster"], attributes["cluster"])
      - delete_key(attributes, "cluster")
      - set(attributes["exported_namespace"], attributes["namespace"])
      - delete_key(attributes, "namespace")
      - set(attributes["exported_job"], attributes["job"])
      - delete_key(attributes, "job")
      - set(attributes["exported_instance"], attributes["instance"])
      - delete_key(attributes, "instance")
      - set(attributes["exported_project_id"], attributes["project_id"])
      - delete_key(attributes, "project_id")
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        cors:
          allowed_origins:
          - http://*
          - https://*
        endpoint: ${env:MY_POD_IP}:4318
  prometheus/self-metrics:
    config:
      scrape_configs:
      - job_name: otel-self-metrics
        scrape_interval: 1m
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
service:
  extensions:
  - health_check
  pipelines:
    metrics/self-metrics:
      exporters:
      - googlemanagedprometheus
      processors:
      - filter/self-metrics
      - metricstransform/self-metrics
      - resource/self-metrics
      - k8sattributes
      - memory_limiter
      - resourcedetection
      - batch
      receivers:
      - prometheus/self-metrics
    traces:
      exporters:
      - googlecloud
      processors:
      - k8sattributes
      - memory_limiter
      - resourcedetection
      - batch
      receivers:
      - otlp
  telemetry:
    metrics:
      address: ${env:MY_POD_IP}:8888
Thanks. We disable retries in the collector because the Google Cloud client library has retries built in.
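For anyone who does want the collector itself to retry (for example, so data waits in the queue instead of being dropped during a longer outage), the exporterhelper retry_on_failure block can be enabled explicitly. A minimal sketch, assuming the exporter exposes the standard exporterhelper retry settings; the intervals are illustrative only:

exporters:
  googlecloud:
    retry_on_failure:
      enabled: true          # collector-side retries are off by default for this exporter
      initial_interval: 5s   # illustrative values, tune for your environment
      max_interval: 30s
      max_elapsed_time: 300s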
Component(s)
No response
Describe the issue you're reporting
I am experiencing transient otel-collector failures when exporting trace batches, e.g.:
I have:
I have tried increasing the timeout to 45 seconds, as described here, and decreasing the batch size from 200 to 100, as suggested here. Neither approach has produced a statistically significant improvement.
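For reference, the two changes amount to roughly the sketch below; only the fields I touched are shown, and lowering send_batch_max_size together with send_batch_size is my assumption about keeping the batch processor consistent.

exporters:
  googlecloud:
    timeout: 45s             # raised from the default
processors:
  batch:
    send_batch_size: 100     # reduced from 200
    send_batch_max_size: 100 # kept equal to send_batch_size (assumption)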
Stacktrace:
Any ideas on what to do?
I don't seem to have any problems with the retry queue.