Kafka exporter is not able to keep up with the load being streamed to the OpenTelemetry Collector. #35208

Open
Om7771 opened this issue Sep 16, 2024 · 3 comments
Labels
bug Something isn't working exporter/kafka

Comments

@Om7771

Om7771 commented Sep 16, 2024

Component(s)

exporter/kafka

What happened?

Description

We applied a load of 1700 TPS to the OpenTelemetry Collector and observed that the Kafka exporter could not consume the whole load and stream it to the Kafka topic at the same rate at which we were generating it.

To debug, we set the telemetry metrics level to detailed, expecting a detailed set of metrics, but very few metrics appeared in the log.

The metrics shown in the logs were fewer than what we would expect even for metrics level basic.

The Kafka exporter we use is from this repository:
https://gitlabce.tools.aws.vodafone.com/IOT/dsip-opentelemetry.git

The documentation we referred to for metrics:
https://opentelemetry.io/docs/collector/internal-telemetry/

Steps to Reproduce

  1. Start a load of 1700 TPS towards the OpenTelemetry Collector.
  2. Check the lag of the topic to which the Kafka exporter writes the data.

Expected Result

  1. The Kafka exporter consumes the load streamed to the OpenTelemetry Collector without any errors.
  2. The topic offsets show that the Kafka exporter writes data to the topic at the same rate at which data is pushed to the OpenTelemetry Collector.

Actual Result

  1. The Kafka exporter consumed the load streamed to the OpenTelemetry Collector without any errors.
  2. The topic offsets showed that the Kafka exporter was writing to the topic at a lower rate than the rate at which data was being pushed to the OpenTelemetry Collector.

Collector version

0.96.0

Environment information

Environment

OS: alpine (Running as a containerized image on EKS)

OpenTelemetry Collector configuration

#Following is the config for Kafka exporter in our environment
receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: basicauth/server
        tls:
          cert_file: ***
          key_file: ***
          ca_file: ***
      http:
        auth:
          authenticator: basicauth/server
        tls:
          cert_file: ***
          key_file: ***
          ca_file: ***
        # TODO - CORS is not configured yet
exporters:
  debug:
    verbosity: detailed
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200
  kafka:
    brokers: [ '*****' ]
    topic: ****
    auth:
      sasl:
        username: ****
        password: ***
        mechanism: SCRAM-SHA-512
      # TODO - appropriate certs must be set
      tls:
        insecure: true
    encoding: otlp_json
    protocol_version: 2.6.2
    metadata:
      retry:
        max: 3
        backoff: 250ms
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 120s
    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 2000
    producer:
      # Set to 5MB and compression should be enough to keep it below 1M (Kafka's limit)
      max_message_bytes: 5000000
      required_acks: 1
      compression: 'lz4'
      flush_max_messages: 0

processors:
  batch:
    send_batch_size: 5000
    send_batch_max_size: 8000
    timeout: 0s

extensions:
  basicauth/server:
    htpasswd:
      file: ***
  health_check:
    path: "/health"
    tls:
      cert_file: ***
      key_file: ***
      ca_file: ***

service:
  telemetry:
    logs:
      level: "debug"
    metrics:
      level: detailed
      address: ":9404"
  extensions: [basicauth/server, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka, debug]

Log output

No response

Additional context

No response

@Om7771 Om7771 added bug Something isn't working needs triage New item requiring triage labels Sep 16, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@VihasMakwana
Contributor

Thanks for filing this @Om7771.

  • Is this a regression? Did it function properly before and then slow down in newer versions?
  • Are there any error logs in the OpenTelemetry (otel) logs?

Additionally, we use SyncProducer in the Kafka exporter, which might be causing delays due to the time taken to receive acknowledgments. We could consider adding an option for users to switch to async mode, but this decision would be up to the code owners. Let me know what you think @pavolloffay @MovieStoreGuy.

@Om7771
Author

Om7771 commented Sep 17, 2024

@VihasMakwana Thanks a lot for the quick response.

  1. I cannot confirm whether this is a regression, as we have been using version 0.96.0 from the beginning.
  2. No error logs are printed. The only logs generated were the traces printed to stdout.

Regarding exporting to Kafka using SyncProducer, we have set exporter.kafka.producer.required_acks=1. As per https://pkg.go.dev/github.com/Shopify/sarama#RequiredAcks, this means that the exporter waits only for the responding broker to ack. If we set this to 0 (unreliable delivery), it effectively becomes async. Is this the mode you mean?
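
For reference while reading this exchange, here is a small sketch of the three acknowledgment levels discussed. The `RequiredAcks` type and `Describe` function below are local stand-ins written for illustration, not the sarama package itself; the numeric values mirror the constants documented on the sarama page linked above. Note that the acknowledgment level and the choice between a synchronous and an asynchronous producer are separate settings in sarama, so whether `required_acks: 0` matches the async mode suggested earlier in the thread is a question for the code owners.

```go
package main

import "fmt"

// RequiredAcks is a local stand-in type for illustration only. The values
// mirror the constants documented for sarama's RequiredAcks.
type RequiredAcks int16

const (
	NoResponse   RequiredAcks = 0  // broker sends no response
	WaitForLocal RequiredAcks = 1  // wait for the local (leader) commit only
	WaitForAll   RequiredAcks = -1 // wait for all in-sync replicas to commit
)

// Describe returns a short, human-readable summary of an acks level.
func Describe(a RequiredAcks) string {
	switch a {
	case NoResponse:
		return "no broker response; delivery is not confirmed"
	case WaitForLocal:
		return "wait for the local (leader) commit only"
	case WaitForAll:
		return "wait for all in-sync replicas to commit"
	default:
		return "unknown acks level"
	}
}

func main() {
	for _, a := range []RequiredAcks{NoResponse, WaitForLocal, WaitForAll} {
		fmt.Printf("required_acks=%d: %s\n", a, Describe(a))
	}
}
```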
