Chronos Intermittent Issue: Jobs get stuck #897
Comments
Based on recent investigation, I have updated the problem description above. Chronos team, can you please help us resolve the issue?
I have kept Chronos thread dump here. Relevant threads look like this:
@brndnmtthws can you please look into this issue?
@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.
Can you send the Mesos state JSON?
State JSON for the Mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c
I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might be helpful to set
Thanks @janisz for your reply! Yes, restarting Chronos and ZK brings the cluster back to working condition. Restarting Chronos/ZK is a workaround for the time being, but we are looking for a permanent solution and need your help :) Also, I am not sure that Chronos was stuck with a single offer. The thread dump shows that the Chronos thread was trying to load jobs and was waiting for ZK:
@janisz any pointers for this?
Intermittent Chronos Issue:
On our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop getting executed on Mesos. The sequence of observed events is as follows:
-- The master has not been sending resource offers to the framework, i.e. Chronos.
-- The master keeps getting status updates from slaves for old tasks.
-- The master keeps trying to forward these updates to Chronos.
-- Zookeeper and the slaves are not down. They are working fine.
Whys:
Chronos, as a Mesos application (framework), waits for resource offers from Mesos master.
The Mesos master generally sends resource offers at a very high frequency, i.e. every 100 ms to a few seconds. However, in this case, the master stopped sending resource offers. Without these resource offers, Chronos is stuck.
The Mesos slaves were occupied with FINISHED tasks. The slaves were telling the master that a task is FINISHED, and the master was trying to tell the Chronos leader the same and waiting for an ACK. Chronos was not sending the ACK.
The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock which was held by the "JobScheduler::mainLoop" thread.
The mainLoop thread was trying to reload jobs from ZK and was blocked on ZK.
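For illustration only, here is a minimal Java sketch of the lock contention described above. It is not Chronos code: the class name, method name, and timings are hypothetical, and a latch with a timeout stands in for the blocked ZK call. The "mainLoop" thread takes a ReentrantLock and holds it while the simulated ZK call is pending; the caller plays the role of handleFinishedTask and uses a bounded tryLock, so the stall is visible as a timeout instead of an indefinite wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockContentionSketch {

    /**
     * Simulates a "mainLoop" thread that holds the shared lock for up to
     * holdMillis (standing in for a blocked ZK call) while the calling
     * thread, playing "handleFinishedTask", waits at most waitMillis.
     * Returns true if the caller acquired the lock in time.
     */
    public static boolean handlerGotLock(long holdMillis, long waitMillis) {
        ReentrantLock lock = new ReentrantLock();          // hypothetical scheduler lock
        CountDownLatch lockHeld = new CountDownLatch(1);   // signals that mainLoop owns the lock
        CountDownLatch release = new CountDownLatch(1);    // lets the sketch end the "ZK call" early

        Thread mainLoop = new Thread(() -> {
            lock.lock();
            try {
                lockHeld.countDown();
                // Stand-in for the blocked ZK read: wait while holding the lock.
                release.await(holdMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        }, "mainLoop");

        try {
            mainLoop.start();
            lockHeld.await();
            // Bounded wait instead of lock(): a stall shows up as 'false'.
            boolean acquired = lock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            release.countDown();
            mainLoop.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while simulating contention", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slow ZK call, handler got lock: " + handlerGotLock(2000, 200));
        System.out.println("fast ZK call, handler got lock: " + handlerGotLock(50, 2000));
    }
}
```

With the slow setting the handler should time out, which mirrors the missing ACK: as long as the lock holder is blocked, the task-finished handler cannot make progress. Whether a bounded wait, or a timeout on the ZK call itself, is appropriate inside Chronos is a question for the maintainers.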
Software Versions: