assignor_state is NULL initially and it's only set after assignment, but its value isn't checked before calling `rd_kafka_destroy`, crashing the process #4312
It seems to be a duplicate of #4252. Once 2.2.0 is released I will close this one and the rdkafka-ruby related one.
@emasab I can still reproduce it with 2.2.0:
Hello @mensfeld. This seems like a different thing, the
@emasab thanks. So theoretically I should be able to mitigate it for now by always waiting for the first rebalance to finish on our side, right?
That should mitigate it. Thanks for the report!
Awesome. I will implement this strategy on our side then as a temporary measure.
Is it also set after an empty assignment? Can I assume that the moment rb_cb kicks in, I'm good to go with the shutdown?
@emasab FYI I was able to repro + mitigate on my side. Thanks |
`rd_kafka_destroy` crashes the process, e.g. a destroy happens before the first assignment. Only affects the cooperative-sticky assignor. fixes #4312
Hey,
This is an expansion of the report made here: #4308 - I created a separate card because I don't have edit rights to the original one, and I find that this info relates not only to the shutdown of the Ruby process but also to any close attempt on the consumer during a sticky-cooperative rebalance.
If you consider it part of #4308 please merge them 🙏
How to reproduce
Here is the simplified code I used to reproduce this. It reproduces in 100% of cases:
The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it.
Every single execution ends up with this:
I confirmed this behavior exists in the following librdkafka versions:
- 2.1.1
- 2.0.2
- 1.9.2
- librdkafka version (release number or git tag): 2.1.1 and all mentioned above
- Apache Kafka version: 2.8.1 and 3.4.0 from bitnami
- librdkafka client configuration: presented in the above code snippet
- Operating system: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Provide logs (with `debug=..` as necessary) from librdkafka
- Provide broker log excerpts
Critical issue: I would consider it critical, because the race condition is not mentioned in the termination docs (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#termination) and, on the contrary, they state that "There is no need to unsubscribe".
Now let's dive deeper:

- One mitigation is to run `rd_kafka_unsubscribe` and wait for the time of the rebalance. This works well, however it is not reliable and drastically increases the time needed to shut down the consumer, as there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not under a rebalance exactly at the moment of running `rd_kafka_destroy`. The wait is needed despite `rd_kafka_assignment` returning no assignments, as it seems that post revocation but prior to re-assignment the TPL is empty. This gives us the "fake" info that there is not (and will not be) any TPL assigned.
- `rd_kafka_consumer_close` also partially mitigates this, due to the fact that `rd_kafka_consumer_close` will unsubscribe automatically. This may mitigate it on long-living consumers (ref: Segfaults during consumer shutdown karafka/rdkafka-ruby#254), however it does not solve the problem for short-lived consumers (don't know why) that are in the middle of getting the first assignment.
- The crash also happens once the `rdkafka_sticky_assignor` is already created: if we attempt to close and destroy the consumer then, the crash happens as well. This is less likely because there is only a short time between the initialization of the `rdkafka_sticky_assignor` and its handing over to the rebalance callback, however the issue persists.
- Long-running consumers are rarely in the middle of a rebalance at the moment of `close`, hence the probability of being in the rebalance state is lower (though it can happen).
- I could not find this problem in `rdkafka_assignor.c` nor in `rdkafka_roundrobin_assignor`.
- Delaying `rd_kafka_destroy` on a process that is anyhow going to be closed (long-running processes under shutdown) can also partially mitigate this (9/10 times).

Suggested fix
The consumer should probably wait for the rebalance to finish before fully closing itself; however, this may introduce a potential closing lag on a long-running rebalance. The second option would be to drop out of the CG and just let the rebalance go, but I have no idea what the effect of this would be on the consumer group.
Logs

Here is the `debug=all` info tail (if you need more, just ping me, I can generate it on the spot):

and the Kafka log matching this time: