Jobs don't run when a network partition occurs #511
It looks like this test can also cause Chronos nodes to go down permanently--some of them will refuse connections for the remainder of the test. I haven't figured out how to get logs out of Chronos yet, so I'll try to snarf those and put them here.
Seems like the Mesos slave is unable to detect a new leading master, and all the masters failed to continue after recovery with ZooKeeper, since the recovery of the registry log among the masters failed after a 30 second timeout.
I upped the timeout to 2 minutes and exported GLOG_v=1, and that managed to get the Mesos master process to crash altogether, haha.
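For reference, a minimal sketch of how a Mesos master might be launched with verbose glog output and a registry fetch timeout raised to 2 minutes; the ZooKeeper URL, quorum size, and work directory below are placeholder assumptions, not values taken from this cluster:

```python
# Sketch: start a Mesos master with verbose glog logging (GLOG_v=1) and a
# registry fetch timeout raised to 2 minutes. The ZK ensemble, quorum size,
# and work directory are placeholder assumptions.
import os
import subprocess

env = dict(os.environ, GLOG_v="1")  # verbose glog output

subprocess.Popen(
    [
        "mesos-master",
        "--zk=zk://n1:2181,n2:2181,n3:2181/mesos",  # assumed ZK ensemble
        "--quorum=2",                                # assumed quorum size
        "--work_dir=/var/lib/mesos",                 # assumed work directory
        "--registry_fetch_timeout=2mins",            # raised registry fetch timeout
    ],
    env=env,
)
```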
OK, so here's another failure case where a Chronos node crashes. I've set Chronos to use GLOG_v=1 and upped the fetch period to 2 minutes. In this case, n1 crashes after a partition isolating it from the ZK leader, and jobs fail to run on schedule even after recovery. https://aphyr.com/media/chronos5.tar.bz2
This visualization shows a Chronos test where a network partition causes Chronos to fail to schedule any jobs during the partition--and even after the network recovers, it fails to start running jobs again--both for newly created and already known jobs. Grey indicates the duration of a network partition (feathered to show the initiation and known completion of the partitioning process across multiple nodes). Thick bars indicate targets: green for those satisfied by a run, and red for those where no run occurred. Thin, dark green bars inside the targets show when successful runs occurred, and red ones (not present in this graph) show incomplete runs. Oh, and there's a new and exciting failure mode I hadn't noticed before--it fails to run anything for the first few minutes. I'm not really sure why! Maybe it has something to do with upping the registry fetch timeout to 120 seconds? You can see Chronos running jobs after their target windows, all starting at ~120 seconds, which suggests it may have been blocked on something. You can download the full logs and analysis for this run here: http://aphyr.com/media/chronos6.tar.bz2.
Thanks for the detailed report and all the useful graphs and logs. There are multiple things here that explain the behaviour you're observing in
You can work around the Chronos bug by setting an offer timeout (a sketch follows below). Action items for us:
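For illustration, a sketch of that workaround, assuming the offer timeout is set via the Mesos master's `--offer_timeout` flag; the 30-second value and the other arguments are placeholders, not taken from this thread:

```python
# Sketch of the suggested workaround: have the master rescind offers that a
# framework has not used within a bounded time, so they are not held forever.
# The timeout value and the other arguments are placeholder assumptions.
import subprocess

subprocess.Popen(
    [
        "mesos-master",
        "--zk=zk://n1:2181,n2:2181,n3:2181/mesos",  # assumed ZK ensemble
        "--quorum=2",                                # assumed quorum size
        "--work_dir=/var/lib/mesos",                 # assumed work directory
        "--offer_timeout=30secs",                    # rescind unused offers (assumed value)
    ]
)
```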
Here's a run with
I just analysed the logs from your latest test run. The master processes in N2 and N3 were unable to recover from the initial partition, because they were affected by the first issue I described (they committed suicide after not having been able to fetch the replicated log for two minutes): N3 (initial leader, loses leadership due to the network partition):
N2 (second leader, crashes after being unable to recover registrar for 2 mins):
N1 (third leader, also crashes after being unable to recover registrar for 2 mins):
Someone with more in-depth knowledge of the Mesos Master code should be able to help us find out why the masters cannot recover the registrar.
The following timeline, extracted from the different logs in chronos7.tar.bz2, might be useful for understanding what's going on with the Mesos masters (a sketch of how the relative offsets can be computed follows the timeline). Mesos Master flags:
t0 (15:36:07.407): Initial state
t1 = t0 + ~215 seconds (15:39:42.807): Network partition
t2 = t1 + ~2 seconds (15:39:44.421): Mesos Master N3 commits suicide
t3 = t2 + ~10 seconds (15:39:54.022): Mesos Master N2 becomes leader
t4 = t3 + ~120 seconds (15:41:54.023): Mesos Master N2 commits suicide after 2 min recovery timeout
t5 = t4 + ~65 seconds (15:42:59.682): Mesos Master N1 becomes leader
t6 = t5 + ~17 seconds (15:43:16.464): Network restored (ZK in N1 re-joins the quorum)
t7 = t6 + ~103 seconds (15:44:59.686): Mesos Master N1 commits suicide
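Purely as an illustration, the relative offsets above can be recomputed from the glog timestamps with a few lines of Python; the event descriptions are transcribed from the timeline, and the parsing itself is generic:

```python
# Recompute the relative timeline (t0, t1 = t0 + ..., ...) from absolute
# HH:MM:SS.mmm timestamps. Events are transcribed from the comment above.
from datetime import datetime

events = [
    ("15:36:07.407", "Initial state"),
    ("15:39:42.807", "Network partition"),
    ("15:39:44.421", "Mesos Master N3 commits suicide"),
    ("15:39:54.022", "Mesos Master N2 becomes leader"),
    ("15:41:54.023", "Mesos Master N2 commits suicide after 2 min recovery timeout"),
    ("15:42:59.682", "Mesos Master N1 becomes leader"),
    ("15:43:16.464", "Network restored (ZK in N1 re-joins the quorum)"),
    ("15:44:59.686", "Mesos Master N1 commits suicide"),
]

def parse(ts):
    return datetime.strptime(ts, "%H:%M:%S.%f")

for i, (ts, what) in enumerate(events):
    if i == 0:
        print(f"t0 ({ts}): {what}")
    else:
        delta = (parse(ts) - parse(events[i - 1][0])).total_seconds()
        print(f"t{i} = t{i - 1} + ~{delta:.0f} seconds ({ts}): {what}")
```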
@aphyr: How can we reproduce the bug scenario to inspect the master recovery failure?
Repro instructions and a thorough description of the test are in #508.
@aphyr: thanks!
@aphyr: What are the Mesos, Chronos, and ZK versions used?
According to @gkleiman, it's Mesos 0.23.0 and ZK 3.4.5-1; the Chronos version should not matter.
There is now a Mesos ticket for this: MESOS-3280.
See #508:
Ah, I've been verifying that the Chronos processes were still running before the final read, but I saw Mesos nodes recover from network failure just fine and presumed they weren't crashing either. I've added code to restart Mesos masters, Mesos slaves, and Chronos processes automatically after partitions resolve, and verified that every process is indeed running prior to the final read. Same results: Chronos won't run any jobs after a partition occurs.
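For context, an illustrative sketch of that restart-and-verify step (the actual test harness is Jepsen, written in Clojure; the service-to-node mapping, service names, and the use of systemctl/pgrep here are assumptions, not taken from the test):

```python
# Illustrative sketch: after a partition resolves, restart the cluster
# processes on their nodes and confirm each one is running before the final
# read. The topology (masters on n1-n3, slaves on n4-n5, Chronos placement)
# and the systemctl/pgrep commands are assumptions about the node setup.
import subprocess

SERVICES = {
    "mesos-master": ["n1", "n2", "n3"],
    "mesos-slave":  ["n4", "n5"],
    "chronos":      ["n1", "n2", "n3"],  # assumed Chronos placement
}

def ssh(node, *cmd):
    """Run a command on a node over SSH and return its exit status."""
    return subprocess.call(["ssh", node, *cmd])

def restart_all():
    for svc, nodes in SERVICES.items():
        for node in nodes:
            ssh(node, "systemctl", "restart", svc)

def verify_all_running():
    for svc, nodes in SERVICES.items():
        for node in nodes:
            if ssh(node, "pgrep", "-f", svc) != 0:
                raise RuntimeError(f"{svc} is not running on {node}")
```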
I can concur with this issue. I have been seeing similar behaviour from Chronos: when ZK crashes, or after any other connectivity loss, the Mesos cluster comes back OK, but Chronos dies and gets restarted by Marathon and does not continue the jobs. We often have to go clean the ZK dir for Chronos to get restarted.
@aphyr: The initial Mesos Masters in chronos8.tar.bz2 die before the partitions resolve and there are no logs of the subsequent Mesos Master processes:
Are the logs of the other processes missing, or did your restart logic fail?
Sometimes they crash correctly, sometimes they don't! Here's another case where restarting nodes leads to a partial recovery: jobs are scheduled too close together, jobs are scheduled outside their target windows, and some jobs never run again even after the network heals.
The initial failures in the latest run (chronos10.tar.bz2) are once again a consequence of the leading Mesos Master not being able to read the replicated log (MESOS-3280). The wave of failures minutes after recovery is a symptom of #520 (Chronos sometimes registers using an empty frameworkId after a leader failover). The cluster somewhat recovers from this because the Mesos Masters are started with
MESOS-3280 is resolved. What more needs to be done?
Building on #508, in the presence of complete network partitions splitting the cluster in half, Chronos will give up executing, well, basically any tasks. For instance, this run splits the network after 200 seconds, waits 200 seconds, heals, and so on, for 1000 seconds, followed by a 60 second stabilizing period and a final read. Essentially every job fails to meet its scheduling demands, whether enqueued before, during, or after failure. For example:
This job simply gives up running after 23:47:55, when a network partition separates [n2 n3] from [n1 n4 n5]. Other tasks never run at all. Note that the cluster is happy to accept new jobs after this time, which suggests that some components of Chronos are still alive and recover from network failure after a few seconds of unavailability.
In the first partition, the Mesos slaves (n4, n5) are isolated from some (but not all) Mesos masters (n1, n2, n3)--and I'd understand if Mesos wasn't smart enough to elect a master with reachable slaves. However, when the network heals at ~23:51, I would expect Chronos to resume scheduling this job--and that doesn't seem to happen. Are there timeouts we should adjust?
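For reference, a sketch of the failure schedule described above, assuming the run alternates 200-second windows of partition and recovery for 1000 seconds before a 60-second stabilising period and a final read; partition(), heal(), and final_read() are placeholders for the real cluster operations, which the actual test drives through Jepsen:

```python
# Sketch of the failure schedule: run healthy for 200 s, partition for 200 s,
# heal for 200 s, and so on until 1000 s have elapsed, then wait 60 s to
# stabilise and perform the final read. partition(), heal(), and final_read()
# are placeholders for the real cluster operations.
import time

def run_schedule(partition, heal, final_read, total=1000, period=200, settle=60):
    time.sleep(period)            # initial healthy window
    elapsed = period
    while elapsed < total:
        partition()               # e.g. isolate [n2 n3] from [n1 n4 n5]
        time.sleep(period)
        heal()                    # restore full connectivity
        time.sleep(period)
        elapsed += 2 * period
    time.sleep(settle)            # stabilising period before the last check
    return final_read()           # read back which jobs actually ran
```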