
Chronos Intermittent Issue: Jobs get stuck #897

Open
harjinder-flipkart opened this issue Feb 4, 2020 · 9 comments

Comments

harjinder-flipkart commented Feb 4, 2020

Intermittent Chronos Issue:
In our Chronos cluster, we have been encountering an intermittent issue where Chronos jobs stop getting executed on Mesos. The sequence of observed events is as follows:

  • Chronos jobs are not executed by Mesos.
  • The status of jobs on the Chronos dashboard is ‘Queued’.
  • Mesos master logs show that:
    -- the master has not been sending resource offers to the framework, i.e. Chronos;
    -- the master keeps receiving status updates from slaves for old tasks;
    -- the master keeps trying to forward these updates to Chronos.
  • Zookeeper and the slaves are not down; they are working fine.
  • After restarting Chronos and Zookeeper, the system recovers and jobs start getting executed again.

Whys:

  • Why did Chronos jobs stop getting executed?
    Chronos, as a Mesos application (framework), waits for resource offers from the Mesos master.
    The master normally sends resource offers at a very high frequency, i.e. every 100 ms to a few seconds. In this case, however, the master stopped sending resource offers, and without them Chronos cannot launch anything.
  • Why did the Mesos master stop sending resource offers?
    The Mesos slaves were occupied with FINISHED tasks: the slaves kept telling the master that a task was FINISHED, and the master kept trying to forward that update to the Chronos leader and waiting for an ACK that Chronos never sent.
  • Why did Chronos not send the ACK?
    The "JobScheduler::handleFinishedTask" thread in the Chronos leader was waiting on a ReentrantLock held by the "JobScheduler::mainLoop" thread.
  • Why did the "JobScheduler::mainLoop" thread not release the lock?
    The mainLoop thread was trying to reload jobs from ZK and was blocked on that ZK call (a sketch of the contention pattern follows this list).
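
A minimal, self-contained sketch of that contention pattern (illustration only, not the actual Chronos source; the object and method names just mirror the thread dump shared below):

import java.util.concurrent.CountDownLatch
import java.util.concurrent.locks.ReentrantLock

object LockContentionSketch {
  // The dump shows threads blocking on the monitor of a ReentrantLock instance
  // (i.e. it is used with synchronized); any object would do as the monitor here.
  private val lock = new ReentrantLock()
  private val fetchStarted = new CountDownLatch(1)

  // Stand-in for JobScheduler.mainLoop: holds the lock across a blocking call
  // (in the dump: JobUtils.loadJobs -> AbstractState.FetchFuture.get(), with no timeout).
  def mainLoop(): Unit = lock.synchronized {
    fetchStarted.countDown()
    Thread.sleep(Long.MaxValue) // simulates the ZK fetch that never returns
  }

  // Stand-in for JobScheduler.handleFinishedTask (reached from statusUpdate): it needs
  // the same lock, so the FINISHED update is never processed and never acknowledged,
  // and the Mesos master keeps re-sending it.
  def handleFinishedTask(taskId: String): Unit = lock.synchronized {
    println(s"handled FINISHED task $taskId")
  }

  def main(args: Array[String]): Unit = {
    new Thread(new Runnable { def run(): Unit = mainLoop() }, "mainLoop").start()
    fetchStarted.await()
    val updater = new Thread(new Runnable { def run(): Unit = handleFinishedTask("some-finished-task") }, "statusUpdate")
    updater.start()
    Thread.sleep(1000)
    println(s"statusUpdate thread state: ${updater.getState}") // prints BLOCKED, as in the dump
    sys.exit(0)
  }
}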

Software Versions:

  • Chronos 3.0.3
  • Mesos 1.4.0
  • Zookeeper 3.4.5

@harjinder-flipkart (Author)

Based upon recent investigation, I have updated the problem description above.

Chronos team, can you please help us resolve this issue?

@harjinder-flipkart (Author)

I have uploaded the Chronos thread dump here.

Relevant threads look like this:
...
"Thread-264485" #264523 prio=5 os_prio=0 tid=0x00007fd9d4006800 nid=0x5fb9 waiting for monitor entry [0x00007fda1c9da000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.replaceJob(JobScheduler.scala:152) - waiting to lock <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:244) at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:210) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37) at com.sun.proxy.$Proxy30.statusUpdate(Unknown Source)
...

"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000] java.lang.Thread.State: RUNNABLE at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method) at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226) at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106) at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.liftedTree1$1(JobScheduler.scala:542) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.mainLoop(JobScheduler.scala:540) - locked <0x00000007042d73d0> (a java.util.concurrent.locks.ReentrantLock) at org.apache.mesos.chronos.scheduler.jobs.JobScheduler$$anon$1.run(JobScheduler.scala:516) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

@harjinder-flipkart (Author)

@brndnmtthws can you please look into this issue?

@brndnmtthws (Member)

@harjinder-flipkart I haven't been involved with this project in years, so I'm not really in a position to help. Good luck with your debugging.

janisz commented Feb 20, 2020

Can you send the Mesos state JSON?

@harjinder-flipkart (Author)

State JSON for mesos master is here: https://gist.github.com/harjinder-flipkart/58f1dfc8e077ee9a80f1b544cf87ff4c

janisz commented Feb 20, 2020

I suspect Chronos is stuck with a single offer. Have you tried restarting it? It might be helpful to set offer_timeout on the Mesos master.
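
(For reference, offer_timeout is a Mesos master flag that rescinds an offer a framework has held but not used for the given duration, e.g. starting the master with --offer_timeout=5mins; the 5-minute value is only an illustration and should be tuned to the cluster.)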

@harjinder-flipkart (Author)

Thanks @janisz for your reply!

Yes, restarting Chronos and ZK brings the cluster back to a working state. Restarting Chronos/ZK is a workaround for the time being, but we are looking for a permanent solution and need your help :)

Also, I am not sure that Chronos was stuck with a single offer. The thread dump shows that the Chronos mainLoop thread was trying to load jobs and was waiting on ZK (a sketch of a bounded fetch follows the dump):

...
"pool-4-thread-1" #48 prio=5 os_prio=0 tid=0x00007fd9ac006000 nid=0x6140 runnable [0x00007fd97fffe000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.mesos.state.AbstractState$FetchFuture.get(Native Method)
	at org.apache.mesos.state.AbstractState$FetchFuture.get(AbstractState.java:226)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at org.apache.mesos.chronos.scheduler.state.MesosStatePersistenceStore$$anonfun$getJobs$2.apply(MesosStatePersistenceStore.scala:106)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.mesos.chronos.scheduler.jobs.JobUtils$.loadJobs(JobUtils.scala:68)
...
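
If it helps the discussion, here is a minimal sketch (not a patch against Chronos; the object and method names are illustrative) of the kind of guard that might keep a stack like the one above from blocking forever: bound the fetch with a timeout instead of a bare get(), assuming the state store hands back a java.util.concurrent.Future as AbstractState's FetchFuture appears to do in the trace.

import java.util.concurrent.{Future, TimeUnit, TimeoutException}

object BoundedFetch {
  // Wait at most `timeoutSeconds` for the ZK-backed fetch instead of blocking indefinitely,
  // so the caller does not sit on the scheduler lock forever if ZK misbehaves.
  def fetchWithTimeout[T](future: Future[T], timeoutSeconds: Long): Option[T] =
    try Some(future.get(timeoutSeconds, TimeUnit.SECONDS))
    catch {
      case _: TimeoutException =>
        future.cancel(true) // give up on this attempt; the caller can retry later, outside the lock
        None
    }
}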

@harjinder-flipkart (Author)

@janisz any pointers for this?
