
Task failure during staging leaves orphaned executor, preventing recovery #290

Closed
dylanwilder opened this issue Oct 17, 2016 · 6 comments

@dylanwilder
Contributor

The executor log shows the process has exited, but Mesos state still shows it as STAGING. This prevents the framework from recovering, since it is unable to reclaim the resources.

@mohitsoni
Contributor

This is interesting. Just to clarify: the executor never launched, because the fetcher failed to fetch the JRE required for executor launch?

@dylanwilder
Contributor Author

dylanwilder commented Oct 17, 2016

I've seen this happen several times while testing a deployment. The most recent occurrence shows node-0 and node-0-template in STAGING. Stdout logs show that the child process exited, but the container still hangs around. I've also seen the task in a FAILED state yet still listed under active tasks in Mesos. The framework appears to be trying to recover, but it cannot get the port reservation it needs because an orphaned reservation from the failed task is still occupying the port range. The latter may be a Mesos issue that is just a side effect of the initial error. In any case, the only way I've been able to recover is to kill the slave machine.

I just wiped and restarted so I could move forward, but if you'd like I can try to get you some logs. I also opened a support ticket with Mesosphere before I had narrowed it down to the failure to fetch the JRE. That ticket contains more details.

@mohitsoni
Contributor

@dylanwilder Logs would be helpful. I'll find the ticket. Thanks.

@dylanwilder
Contributor Author

OK, I've figured out the root cause here (I should note I'm on version 1.0.17-3.0.8). There are two issues: the first on deploy, the second with the inability to recover.

  1. I am updating the executor configuration (i.e. to change the JRE or Cassandra version). This triggers a rolling restart of the cluster here. It looks like the issue was fixed here, but this code used to read getExecutor().withNewId().getExecutorInfo(), which caused the updates to get out of sync (see the sketch at the end of this comment).
  2. That error causes the task to FAIL, but the executor is orphaned: the scheduler is unaware of its existence, yet it is still consuming the executor port resource, so the scheduler is unable to relaunch. Killing the process loops back to (1).

Seeing as how (1) has been addressed already, is it possible that (2) has been fixed as well?
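
As a rough illustration of item (1), here is a minimal sketch of the difference between minting a new executor ID on every config update and applying the update while keeping the existing ID. This is not the project's actual code; `ExecutorIdSketch`, `withNewId`, and `withUpdatedCommand` are made-up names, and only the quoted `withNewId()` behaviour is taken from the comment above.

```java
import java.util.UUID;

import org.apache.mesos.Protos.CommandInfo;
import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;

/**
 * Hypothetical sketch, not the project's classes: contrasts regenerating an
 * executor ID on every config update with reusing the one already registered.
 */
public final class ExecutorIdSketch {

    /** Old behaviour: every call mints a fresh ID, so the scheduler's view
     *  drifts from the executor actually running on the agent. */
    static ExecutorInfo withNewId(ExecutorInfo current) {
        return ExecutorInfo.newBuilder(current)
                .setExecutorId(ExecutorID.newBuilder()
                        .setValue(current.getExecutorId().getValue()
                                + "_" + UUID.randomUUID()))
                .build();
    }

    /** Fixed behaviour: apply the new command (e.g. new JRE/Cassandra fetch
     *  URIs) but keep the existing executor ID, so a relaunch targets the
     *  executor Mesos already knows about. */
    static ExecutorInfo withUpdatedCommand(ExecutorInfo current, CommandInfo newCommand) {
        return ExecutorInfo.newBuilder(current)
                .setCommand(newCommand)
                .build(); // executor_id is left untouched
    }

    private ExecutorIdSketch() {}
}
```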

@dylanwilder dylanwilder changed the title Failure to fetch jre during executor launch leaves tasks in bad state Executor failure during staging leaves orphaned executor, preventing recovery Nov 10, 2016
@dylanwilder
Contributor Author

Looks like this code will always attempt to create a new executor. Really it should be reusing the existing executor; however, since we are not actually tracking the state of the executors themselves, this could lead to the reverse problem. It seems to me that executors should really be tracked as first-class assets rather than as references from tasks. Thoughts?
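
A hypothetical sketch of what tracking executors as first-class assets could look like, assuming the scheduler records executor state from status updates or reconciliation. `ExecutorRegistry` and its methods are illustrative names, not an existing API in this repo.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;

/**
 * Illustrative sketch: the scheduler tracks executors independently of task
 * records, so a relaunch can reuse a live executor instead of always building
 * a new one.
 */
public final class ExecutorRegistry {

    /** Executor state tracked independently of tasks. */
    public enum ExecutorState { STAGING, RUNNING, TERMINATED }

    private final Map<String, ExecutorID> executorsByNode = new ConcurrentHashMap<>();
    private final Map<String, ExecutorState> stateByNode = new ConcurrentHashMap<>();

    /** Record what Mesos reported (e.g. via status updates or reconciliation). */
    public void record(String nodeName, ExecutorID id, ExecutorState state) {
        executorsByNode.put(nodeName, id);
        stateByNode.put(nodeName, state);
    }

    /**
     * Reuse the live executor for this node if one is known; only fall back to
     * a brand-new ExecutorInfo when none is tracked or the last one terminated.
     * This avoids launching a second executor that fights over the same
     * reserved port.
     */
    public ExecutorInfo executorForLaunch(String nodeName, ExecutorInfo freshTemplate) {
        ExecutorID known = executorsByNode.get(nodeName);
        ExecutorState state = stateByNode.getOrDefault(nodeName, ExecutorState.TERMINATED);
        if (known != null && state != ExecutorState.TERMINATED) {
            // Attach the task to the executor Mesos already has registered.
            return ExecutorInfo.newBuilder(freshTemplate).setExecutorId(known).build();
        }
        return freshTemplate;
    }
}
```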

@dylanwilder dylanwilder changed the title Executor failure during staging leaves orphaned executor, preventing recovery Task failure during staging leaves orphaned executor, preventing recovery Nov 11, 2016
dylanwilder pushed a commit to dylanwilder/dcos-cassandra-service that referenced this issue Nov 15, 2016
nickbp pushed a commit that referenced this issue Dec 7, 2016
* Fix for #290. Abort executor if Cassandra daemon fails to init

* Removing unused import
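
For context, a minimal sketch of what "abort executor if Cassandra daemon fails to init" can look like against the Mesos executor driver API. This is an assumption about the shape of the fix, not the merged patch; `DaemonInitFailureHandler` and `onDaemonInitFailure` are hypothetical names.

```java
import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

/**
 * Illustrative failure path: if the Cassandra daemon cannot initialize, report
 * TASK_FAILED and abort the executor driver so the executor process exits
 * instead of lingering as an orphan that holds the port reservation.
 */
public final class DaemonInitFailureHandler {

    static void onDaemonInitFailure(ExecutorDriver driver, TaskID taskId, Exception cause) {
        // Tell the scheduler the task failed so it can schedule a replacement.
        driver.sendStatusUpdate(TaskStatus.newBuilder()
                .setTaskId(taskId)
                .setState(TaskState.TASK_FAILED)
                .setMessage("Cassandra daemon failed to init: " + cause.getMessage())
                .build());
        // Abort the driver: the executor process then terminates and releases
        // its resources instead of sitting in STAGING indefinitely.
        driver.abort();
    }

    private DaemonInitFailureHandler() {}
}
```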
@nickbp
Contributor

nickbp commented Dec 7, 2016

Marking closed as #316 was merged.

@nickbp nickbp closed this as completed Dec 7, 2016