Gap in offset monitoring and kafka_consumer stuck in refreshing metadata when one Kafka broker in a cluster goes down #14341
Reproducing in Docker

kafka_consumer.yaml:

```yaml
instances:
  - kafka_connect_str: kafka1,kafka2
    kafka_client_api_version: 2.5.0
    monitor_unlisted_consumer_groups: true
    monitor_all_broker_highwatermarks: true
```
docker-compose.yml:

```yaml
version: "2.4"
services:
  datadog:
    image: public.ecr.aws/datadog/agent:7.43.1
    environment:
      DD_HOSTNAME: dev
      DD_LOG_LEVEL: debug
      DD_API_KEY: 0123456789abcdef0123456789abcdef
    extra_hosts:
      - kafka1:192.168.0.11
      - kafka2:192.168.0.12
      - kafka3:192.168.0.13
    volumes:
      - ./kafka_consumer.yaml:/etc/datadog-agent/conf.d/kafka_consumer.yaml
    networks:
      default:
        ipv4_address: 192.168.0.100
  kafka1:
    image: wurstmeister/kafka:2.13-2.8.1
    environment:
      KAFKA_BROKER_ID: "1"
      KAFKA_LISTENERS: DEFAULT://kafka1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: DEFAULT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: DEFAULT
      KAFKA_ZOOKEEPER_CONNECT: zookeeper
    depends_on:
      - zookeeper
    networks:
      default:
        ipv4_address: 192.168.0.11
  kafka2:
    image: wurstmeister/kafka:2.13-2.8.1
    environment:
      KAFKA_BROKER_ID: "2"
      KAFKA_LISTENERS: DEFAULT://kafka2:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: DEFAULT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: DEFAULT
      KAFKA_ZOOKEEPER_CONNECT: zookeeper
    depends_on:
      - zookeeper
    networks:
      default:
        ipv4_address: 192.168.0.12
  zookeeper:
    image: zookeeper:3.5.9
    environment:
      ZOO_STANDALONE_ENABLED: "true"
    networks:
      default:
        ipv4_address: 192.168.0.10
networks:
  default:
    ipam:
      config:
        - subnet: 192.168.0.0/24
```
Agent emits errors and warnings:
Reproducing with a simpler docker-compose without static IP addresses

Similar behavior can be reproduced without statically assigning IP addresses to the containers, by relying on Docker to manage container IP addresses and internal DNS records. In this setup the error messages from kafka-python are different: when a broker is stopped, its DNS record is removed by Docker, and the agent then repeatedly tries to resolve the IP of the stopped broker. No traffic is sent to any broker while the agent is stuck in this infinite DNS resolution loop (see the sketch below).
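A minimal sketch of what such a DNS-based setup might look like (an assumed, illustrative compose file, not the reporter's exact one): it reuses the services above but drops the static `ipv4_address` assignments, the custom subnet, and the agent's `extra_hosts`, so `kafka1`/`kafka2` are resolved purely through Docker's embedded DNS.

```yaml
# Illustrative sketch (assumption): same topology as above, minus static IPs
# and extra_hosts; containers resolve each other via Docker's embedded DNS.
version: "2.4"
services:
  datadog:
    image: public.ecr.aws/datadog/agent:7.43.1
    environment:
      DD_HOSTNAME: dev
      DD_LOG_LEVEL: debug
      DD_API_KEY: 0123456789abcdef0123456789abcdef
    volumes:
      - ./kafka_consumer.yaml:/etc/datadog-agent/conf.d/kafka_consumer.yaml
  kafka1:
    image: wurstmeister/kafka:2.13-2.8.1
    environment:
      KAFKA_BROKER_ID: "1"
      KAFKA_LISTENERS: DEFAULT://kafka1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: DEFAULT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: DEFAULT
      KAFKA_ZOOKEEPER_CONNECT: zookeeper
    depends_on:
      - zookeeper
  kafka2:
    image: wurstmeister/kafka:2.13-2.8.1
    environment:
      KAFKA_BROKER_ID: "2"
      KAFKA_LISTENERS: DEFAULT://kafka2:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: DEFAULT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: DEFAULT
      KAFKA_ZOOKEEPER_CONNECT: zookeeper
    depends_on:
      - zookeeper
  zookeeper:
    image: zookeeper:3.5.9
    environment:
      ZOO_STANDALONE_ENABLED: "true"
```

Stopping one of the brokers (e.g. `docker-compose stop kafka2`) then removes its DNS record, which is what triggers the resolution loop described above.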
Hi @ls-sergey-katsubo 👋, we recently revamped the Kafka consumer check to use the confluent-kafka library.
Thanks a lot @yzhan289 for all the effort put into the migration from kafka-python. Upon testing on different Kafka clusters, I would say that the initial issue is fixed in v7.45: if a broker goes down, the offset collection is no longer blocked and the collection continues, yay!

But (is this for another issue?) the collection time in v7.45 increased noticeably: to seconds (with 100 partitions) or even minutes (with thousands of partitions). For a cluster in a degraded state (which is relevant to the issue we are discussing), collection gets slower because of connection attempts/timeouts against the failed node. This causes an extra delay in the middle of collection, with logs:

For a healthy cluster, the collection in v7.45 is slower too. As far as I can see from the traffic and logs:
Details: I have verified this on 3 test beds.
Config:

```yaml
instances:
  - kafka_connect_str: kafka1,kafka2
    monitor_unlisted_consumer_groups: true
    monitor_all_broker_highwatermarks: true
```

Test on a stable cluster: all brokers up and running
Test on a degraded cluster: 1 broker down
Hey @ls-sergey-katsubo, thanks for bringing this up and also sending us the testing results. I'm going to make a card in our backlog to investigate the performance decrease with the new version of the check. I'll keep this card open in case we have any updates as we investigate!
Hey @yzhan289, thanks a lot!
Steps to reproduce the issue:
Run the agent with the `kafka_consumer` integration enabled and pointing to all brokers in `kafka_connect_str`. All good for now: the agent sends Metadata requests and ListGroups requests to the brokers every 15 seconds (visible in the debug logs and in the traffic dump).

Describe the results you received:
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
kafka-python's `_send_request_to_node` races with cluster metadata and can enter an infinite loop: dpkp/kafka-python#2193.

Additional environment details (Operating System, Cloud provider, etc):
Happens in various environments:
Output of the info page