
Invalid bootstrap servers list blocks during startup #1473

Closed

mmodenesi opened this issue Apr 12, 2018 · 7 comments

@mmodenesi

```python
>>> kafka.KafkaProducer(bootstrap_servers='can-not-resolve:9092')
# never returns
```
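A caller-side workaround (my own sketch, not part of kafka-python; the helper name is made up) is to run the blocking constructor in a worker thread and abandon it after a deadline:

```python
# Sketch of a caller-side workaround: run a blocking factory (such as the
# KafkaProducer constructor) in a worker thread and give up after a deadline.
# create_with_timeout is an illustrative helper, not a kafka-python API.
import concurrent.futures

def create_with_timeout(factory, timeout_s=10.0):
    """Call factory() in a worker thread; raise TimeoutError past the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(factory)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not wait for the (possibly stuck) worker thread.
        pool.shutdown(wait=False)
```

For example, `create_with_timeout(lambda: kafka.KafkaProducer(bootstrap_servers='can-not-resolve:9092'), timeout_s=5)` would raise instead of hanging forever; the stuck worker thread is leaked, which is the price of this workaround.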
@dpkp (Owner) commented Apr 12, 2018 via email

@mmodenesi (Author)

I see your point. So this is a good thing when there is a temporary problem and the name will eventually resolve properly again. But what if the provided name is plain wrong? (Say, for example, you edited some configuration files and misspelled the server FQDN.)

I thought, "there should be an optional timeout argument." I could not find it at first glance in the documentation (I should say, I am using 1.4.2). Then, grepping the code for the warning (DNS lookup failed for not-a-valid-name:9092, exception was [Errno -2] Name or service not known. Is your advertised.listeners (called advertised.host.name before Kafka 9) correct and resolvable?), I found it is produced around

```python
def connect_blocking(self, timeout=float('inf')):
```

(Note that the presence of the named argument made me think I was on the right track)

which is called from

```python
if not bootstrap.connect_blocking():
```

definitely without any arguments.

In my opinion, this should not unavoidably enter an infinite loop.

I am only starting with both Kafka and your code, so I don't know the implications of this (maybe this function is called on bootstrapping and then every 10 seconds for metadata refreshing; I really don't know whether what I am asking is naive).

This is what I would expect:

```
$ docker run --rm kafka:2.11-1.1.0 kafka-console-producer.sh --topic test --broker-list con-cristina-estabamos-mejor:9092
[2018-04-12 22:50:30,316] WARN Removing server con-cristina-estabamos-mejor:9092 from bootstrap.servers as DNS resolution failed for con-cristina-estabamos-mejor (org.apache.kafka.clients.ClientUtils)
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:456)
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:303)
        at kafka.producer.NewShinyProducer.<init>(BaseProducer.scala:40)
        at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:49)
        at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
Caused by: org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
        at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:66)
        at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:405)
        ... 4 more
```
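The Java behavior above could be approximated on the Python side with a pre-validation step run before constructing the producer. This is my own sketch mirroring `ClientUtils.parseAndValidateAddresses`; `validate_bootstrap_servers` is a hypothetical helper, not a kafka-python API:

```python
# Sketch: resolve each bootstrap entry up front and fail fast when none
# resolve, mimicking Java's ClientUtils.parseAndValidateAddresses.
# validate_bootstrap_servers is a hypothetical helper, not kafka-python API.
import socket

def validate_bootstrap_servers(servers):
    """Return the resolvable host:port entries, or raise if there are none."""
    resolvable = []
    for server in servers:
        host, _, port = server.rpartition(':')
        try:
            socket.getaddrinfo(host, int(port))
        except socket.gaierror:
            continue  # drop entries whose DNS lookup fails
        resolvable.append(server)
    if not resolvable:
        raise ValueError('No resolvable bootstrap urls given in bootstrap_servers')
    return resolvable
```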

@dpkp dpkp changed the title Impossible to recover from DNS error Invalid bootstrap servers list blocks during startup Apr 22, 2018
@dpkp (Owner) commented Apr 22, 2018

The two options here would be (1) validate that bootstrap_servers resolve via DNS to at least one IP address; (2) validate that bootstrap succeeds and we are able to get initial cluster metadata.

@svvitale

I agree with @mmodenesi that this should be configurable, either with a timeout or a max number of retries, and should raise an exception on exhaustion of retries. Ultimately the caller should be able to choose whether to reconnect ad nauseam or handle the exception in a different way. Also, I think the reconnect_backoff_ms and reconnect_backoff_max_ms options should be honored, even when DNS resolution fails.
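The proposal (a bounded number of retries honoring `reconnect_backoff_ms` / `reconnect_backoff_max_ms`, with an exception on exhaustion) could look roughly like this; the helper and its defaults are illustrative only, not the library's implementation:

```python
# Sketch of bounded retries with exponential backoff. The delay doubles each
# round and is capped, mirroring the reconnect_backoff_ms /
# reconnect_backoff_max_ms semantics; the last error is re-raised on
# exhaustion so the caller can decide what to do.
import time

def retry_with_backoff(attempt, retries=5, backoff_ms=50, backoff_max_ms=1000):
    """Call attempt() up to `retries` times, sleeping between failures."""
    for i in range(retries):
        try:
            return attempt()
        except OSError:
            if i == retries - 1:
                raise  # retries exhausted: surface the failure
            time.sleep(min(backoff_ms * 2 ** i, backoff_max_ms) / 1000.0)
```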

@svvitale

I've started on these changes here: svvitale@e60129c

@dpkp, let me know if you agree with this approach and I'd be happy to add documentation for these new parameters and file a pull request.

@jeffwidman (Collaborator) commented May 25, 2018

@dpkp, I saw your PR disabled the DNS retries.

As @svvitale noted, what do you think about making this configurable?

I understand the perspective of folks who want this to fail loudly and immediately in case they typo'd.

However, my day job has a flaky DNS server that's owned by another team, and every few hours it will drop some queries. I'd prefer not to make my kafka-python wrapper more complex just to handle these retries. Furthermore, while ours is worse than most, I would still assert that in all production environments DNS cannot be assumed to be reliable 100% of the time, so adding the option of retries is useful.

So would it be possible to add a timeout config that defaults to failing immediately, but could be passed a larger value to keep retrying for a period of time?

Or, as @svvitale suggested, it could just inherit this retry timeout value from reconnect_backoff_max_ms.
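A sketch of the knob under discussion: a deadline-based retry whose default fails immediately. The name `dns_retry_timeout_ms` is hypothetical (it comes from this discussion, not an existing kafka-python option), and the loop structure is my own illustration:

```python
# Sketch: retry resolve() until it succeeds or the deadline passes. A default
# of 0 raises the first failure immediately (fail-fast for typos); a larger
# value tolerates a temporarily flaky DNS server. dns_retry_timeout_ms is a
# hypothetical config name, not an existing kafka-python option.
import time

def resolve_with_deadline(resolve, dns_retry_timeout_ms=0, backoff_ms=50):
    """Retry resolve() with a fixed backoff until success or deadline."""
    deadline = time.monotonic() + dns_retry_timeout_ms / 1000.0
    while True:
        try:
            return resolve()
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(backoff_ms / 1000.0)
```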

If you want a PR for this, I'm more than happy to do so, but I wanted to discuss the API first.

@dpkp (Owner) commented May 26, 2018

That seems like a good thing to make configurable. To be clear, it only currently affects bootstrapping (once an initial metadata response is found, all future "reconnects" should continue retries after backoff even if DNS fails). But I do think that folks would prefer the default behavior be to assume DNS is functioning properly and not retry at all.
