Close SystemConsumer properly #1674
I don't think there is any problem with the existing code. For standby containers, the expected behavior is to keep the On the other hand, for active containers, there is a need to read the checkpoint only once, and hence it closes right after. I'd trace it back to see if there is any violation in how this flag is set and in the assumptions that this flag is built on. |
This PR doesn't change when |
Somehow the |
I meant For standalone, the containers are recreated on every rebalance. It is possible that the previous attempts to shutdown the container failed and perhaps, that is being ignored in the application |
Then in this case, new Observing the logs in another way, new This opens a 3rd solution: prevent starts on consumers that are already started.
The idempotent protection line for KafkaSystemConsumer is never triggered in my logs, while there are 168 successful starts and 3 stops.
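The "prevent starts on consumers already started" idea can be sketched with an atomic flag. This is a minimal illustration only: `GuardedConsumer` and its counters are hypothetical names invented here, not Samza's `KafkaSystemConsumer`, and the real guard may look different.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a consumer; NOT Samza's KafkaSystemConsumer.
class GuardedConsumer {
    private final AtomicBoolean started = new AtomicBoolean(false);
    int resourceOpens = 0;   // times the underlying resource was really opened
    int resourceCloses = 0;  // times the underlying resource was really closed
    int ignoredStarts = 0;   // times the idempotency guard rejected a start()

    void start() {
        // compareAndSet succeeds only when the consumer is currently stopped.
        if (!started.compareAndSet(false, true)) {
            ignoredStarts++;
            return;
        }
        resourceOpens++;
    }

    void stop() {
        // Only a started consumer has something to release.
        if (started.compareAndSet(true, false)) {
            resourceCloses++;
        }
    }

    boolean isStarted() {
        return started.get();
    }
}
```

A guard like this only helps when the same consumer instance is started repeatedly. If the guard never fires, as reported above, that may point to a different cause than repeated starts on one instance.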
Changes
KafkaCheckpointManager is a reused class. On reuse, the SystemConsumer can be left unclosed, causing a memory leak.
This is because SystemConsumer is started on each start(), but is closed only when taskNamesToCheckpoints == null and stopConsumerAfterFirstRead == true, or when stopConsumerAfterFirstRead == false.
On the second run, taskNamesToCheckpoints != null, and SystemConsumer is never closed.
There are 2 fixes.
null on stop()
We are choosing solution 1., since the stopConsumerAfterFirstRead option expects only 1 read per task(?).
Issue
SAMZA-2785
Calling start() and stop() multiple times on the same KafkaCheckpointManager while stopConsumerAfterFirstRead == true leaves the SystemConsumer unclosed. An unclosed SystemConsumer can cause memory leaks in some implementations.
Evidence:
In production logs, SystemConsumer was started 1741 times, but only closed 14 times.
We also have a heap dump showing KafkaSystemConsumer taking up 8 GB of memory.
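The start/stop sequence described in this issue can be reproduced with a small model. This is a sketch under stated assumptions, not the real KafkaCheckpointManager: `CountingConsumer`, `CheckpointManagerModel`, `LeakDemo`, and the `resetOnStop` flag are hypothetical names, and the placement of the checks in `readLastCheckpoint()` and `stop()` is assumed from the conditions listed under Changes. `resetOnStop` models the "null on stop()" idea as one possible remedy; it is not necessarily the change this PR makes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical consumer that only counts lifecycle calls.
class CountingConsumer {
    int starts = 0;
    int stops = 0;

    void start() { starts++; }
    void stop() { stops++; }

    // Number of start() calls that were never matched by a stop().
    int open() { return starts - stops; }
}

// Simplified model of the lifecycle described above; NOT the real class.
class CheckpointManagerModel {
    private final CountingConsumer consumer;
    private final boolean stopConsumerAfterFirstRead;
    private final boolean resetOnStop; // assumption: models "null on stop()"
    private Map<String, String> taskNamesToCheckpoints = null;

    CheckpointManagerModel(CountingConsumer consumer,
                           boolean stopConsumerAfterFirstRead,
                           boolean resetOnStop) {
        this.consumer = consumer;
        this.stopConsumerAfterFirstRead = stopConsumerAfterFirstRead;
        this.resetOnStop = resetOnStop;
    }

    void start() {
        consumer.start(); // the consumer is started on every start()
    }

    String readLastCheckpoint(String taskName) {
        if (taskNamesToCheckpoints == null) {
            // First read: pretend to load checkpoints, then maybe close.
            taskNamesToCheckpoints = new HashMap<>();
            taskNamesToCheckpoints.put(taskName, "checkpoint-for-" + taskName);
            if (stopConsumerAfterFirstRead) {
                consumer.stop();
            }
        }
        return taskNamesToCheckpoints.get(taskName);
    }

    void stop() {
        if (!stopConsumerAfterFirstRead) {
            consumer.stop();
        }
        if (resetOnStop) {
            taskNamesToCheckpoints = null; // the next run reads and closes again
        }
    }
}

class LeakDemo {
    // Runs start/read/stop `cycles` times on one reused manager and
    // returns how many consumer starts were left unclosed.
    static int openAfterCycles(boolean resetOnStop, int cycles) {
        CountingConsumer consumer = new CountingConsumer();
        CheckpointManagerModel manager =
            new CheckpointManagerModel(consumer, true, resetOnStop);
        for (int i = 0; i < cycles; i++) {
            manager.start();
            manager.readLastCheckpoint("task-0");
            manager.stop();
        }
        return consumer.open();
    }

    public static void main(String[] args) {
        System.out.println("unclosed without reset: " + openAfterCycles(false, 3));
        System.out.println("unclosed with reset:    " + openAfterCycles(true, 3));
    }
}
```

In this model, every cycle after the first leaves one consumer start unclosed unless the cached map is cleared on stop(), which matches the gap between start and close counts reported in the evidence.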