
Invalid bootstrap servers list blocks during startup #1473

Closed

mmodenesi opened this issue Apr 12, 2018 · 7 comments

@mmodenesi

```python
>>> kafka.KafkaProducer(bootstrap_servers='can-not-resolve:9092')
# never returns
```
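A caller-side workaround (my own sketch, not part of kafka-python; the helper name is made up) is to run the blocking constructor in a worker thread and abandon it after a deadline:

```python
# Sketch of a caller-side workaround: run a blocking factory (such as the
# KafkaProducer constructor) in a worker thread and give up after a deadline.
# create_with_timeout is an illustrative helper, not a kafka-python API.
import concurrent.futures

def create_with_timeout(factory, timeout_s=10.0):
    """Call factory() in a worker thread; raise TimeoutError past the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(factory)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not wait for the (possibly stuck) worker thread.
        pool.shutdown(wait=False)
```

For example, `create_with_timeout(lambda: kafka.KafkaProducer(bootstrap_servers='can-not-resolve:9092'), timeout_s=5)` would raise instead of hanging forever; the stuck worker thread is leaked, which is the price of this workaround.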
@dpkp (Owner) commented Apr 12, 2018 via email

@mmodenesi (Author)

I see your point. So this is a good thing when there is a temporary problem and the name will eventually resolve properly again. But what if the provided name is plain wrong? (Say, for example, you edited some configuration files and misspelled the server FQDN.)

I thought, "there should be an optional timeout argument." I could not find it at first glance in the documentation (I should say, I am using 1.4.2). Then, grepping the code for the warning (DNS lookup failed for not-a-valid-name:9092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?), I found it is produced around

```python
def connect_blocking(self, timeout=float('inf')):
```

(Note that the presence of the named argument made me think I was on the right track)

which is called from

```python
if not bootstrap.connect_blocking():
```

definitely without any arguments.

In my opinion, this should not unavoidably enter an infinite loop.

I am only starting with both Kafka and your code, so I don't know the implications of this (maybe this function is called on bootstrapping and then every 10 seconds for metadata refreshing; I really don't know whether what I am asking is naive).

This is what I would expect:

```
$ docker run --rm kafka:2.11-1.1.0 kafka-console-producer.sh --topic test --broker-list con-cristina-estabamos-mejor:9092
[2018-04-12 22:50:30,316] WARN Removing server con-cristina-estabamos-mejor:9092 from bootstrap.servers as DNS resolution failed for con-cristina-estabamos-mejor (org.apache.kafka.clients.ClientUtils)
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:456)
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:303)
        at kafka.producer.NewShinyProducer.<init>(BaseProducer.scala:40)
        at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:49)
        at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
Caused by: org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
        at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:66)
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:405)
        ... 4 more
```
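The Java behavior above could be approximated on the Python side with a pre-validation step run before constructing the producer. This is my own sketch mirroring `ClientUtils.parseAndValidateAddresses`; `validate_bootstrap_servers` is a hypothetical helper, not a kafka-python API:

```python
# Sketch: resolve each bootstrap entry up front and fail fast when none
# resolve, mimicking Java's ClientUtils.parseAndValidateAddresses.
# validate_bootstrap_servers is a hypothetical helper, not kafka-python API.
import socket

def validate_bootstrap_servers(servers):
    """Return the resolvable host:port entries, or raise if there are none."""
    resolvable = []
    for server in servers:
        host, _, port = server.rpartition(':')
        try:
            socket.getaddrinfo(host, int(port))
        except socket.gaierror:
            continue  # drop entries whose DNS lookup fails
        resolvable.append(server)
    if not resolvable:
        raise ValueError('No resolvable bootstrap urls given in bootstrap_servers')
    return resolvable
```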

@dpkp dpkp changed the title Impossible to recover from DNS error Invalid bootstrap servers list blocks during startup Apr 22, 2018
@dpkp (Owner) commented Apr 22, 2018

The two options here would be (1) validate that bootstrap_servers resolve via DNS to at least one IP address; (2) validate that bootstrap succeeds and we are able to get initial cluster metadata.

@svvitale

I agree with @mmodenesi that this should be configurable, either with a timeout or a max number of retries, and should raise an exception on exhaustion of retries. Ultimately the caller should be able to choose whether to reconnect ad nauseam or handle the exception in a different way. Also, I think the reconnect_backoff_ms and reconnect_backoff_max_ms options should be honored, even when DNS resolution fails.
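The proposal (a bounded number of retries honoring `reconnect_backoff_ms` / `reconnect_backoff_max_ms`, with an exception on exhaustion) could look roughly like this; the helper and its defaults are illustrative only, not the library's implementation:

```python
# Sketch of bounded retries with exponential backoff. The delay doubles each
# round and is capped, mirroring the reconnect_backoff_ms /
# reconnect_backoff_max_ms semantics; the last error is re-raised on
# exhaustion so the caller can decide what to do.
import time

def retry_with_backoff(attempt, retries=5, backoff_ms=50, backoff_max_ms=1000):
    """Call attempt() up to `retries` times, sleeping between failures."""
    for i in range(retries):
        try:
            return attempt()
        except OSError:
            if i == retries - 1:
                raise  # retries exhausted: surface the failure
            time.sleep(min(backoff_ms * 2 ** i, backoff_max_ms) / 1000.0)
```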

@svvitale

I've started on these changes here: svvitale@e60129c

@dpkp, let me know if you agree with this approach and I'd be happy to add documentation for these new parameters and file a pull request.

@jeffwidman (Collaborator) commented May 25, 2018

@dpkp, I saw your PR disabled the DNS retries.

As @svvitale noted, what do you think about making this configurable?

I understand the perspective of folks who want this to fail loudly and immediately in case they typo'd.

However, my day job has a flaky DNS server that's owned by another team, and every few hours it will drop some queries. I'd prefer not to make my kafka-python wrapper more complex just to handle these retries. Furthermore, while ours is worse than most, I would still assert that in all production environments DNS cannot be assumed to be reliable 100% of the time, so adding the option of retries is useful.

So would it be possible to add a timeout config that defaults to failing immediately, but could be passed a larger value to keep retrying for a period of time?

Or, as @svvitale suggested, it could just inherit this retry timeout value from reconnect_backoff_max_ms.
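A sketch of the knob under discussion: a deadline-based retry whose default fails immediately. The name `dns_retry_timeout_ms` is hypothetical (it comes from this discussion, not an existing kafka-python option), and the loop structure is my own illustration:

```python
# Sketch: retry resolve() until it succeeds or the deadline passes. A default
# of 0 raises the first failure immediately (fail-fast for typos); a larger
# value tolerates a temporarily flaky DNS server. dns_retry_timeout_ms is a
# hypothetical config name, not an existing kafka-python option.
import time

def resolve_with_deadline(resolve, dns_retry_timeout_ms=0, backoff_ms=50):
    """Retry resolve() with a fixed backoff until success or deadline."""
    deadline = time.monotonic() + dns_retry_timeout_ms / 1000.0
    while True:
        try:
            return resolve()
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(backoff_ms / 1000.0)
```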

If you want a PR for this, I'm more than happy to do so, but I wanted to discuss the API first.

@dpkp (Owner) commented May 26, 2018

That seems like a good thing to make configurable. To be clear, it only currently affects bootstrapping (once an initial metadata response is found, all future "reconnects" should continue retries after backoff even if DNS fails). But I do think that folks would prefer the default behavior be to assume DNS is functioning properly and not retry at all.
