
Task failure during staging leaves orphaned executor, preventing recovery #290

Closed
dylanwilder opened this issue Oct 17, 2016 · 6 comments

@dylanwilder
Contributor

The executor log shows the process has exited, but Mesos state still shows it as STAGING. This prevents the framework from recovering, since it is unable to reclaim the resources.

@mohitsoni
Contributor

This is interesting. Just to clarify: the executor never launched, because the fetcher failed to fetch the JRE required for executor launch?

@dylanwilder
Contributor Author

dylanwilder commented Oct 17, 2016

I've seen this happen several times while testing a deployment. The most recent occurrence shows node-0 and node-0-template in STAGING. Stdout logs show that the child process exited, but the container still hangs around. I've also seen the task in a FAILED state yet still listed under active tasks in Mesos. The framework appears to be trying to recover, but it cannot get the port reservation it needs because an orphaned reservation from the failed task is still occupying the port range. The latter may be a Mesos issue that is just a side effect of the initial error. In any case, the only way I've been able to recover is to kill the slave machine.

I just wiped and restarted so I could move forward, but if you'd like I can try to get you some logs. I also opened a support ticket with Mesosphere before I had narrowed it down to the failure to fetch the JRE. That ticket contains more details.

@mohitsoni
Contributor

@dylanwilder Logs would be helpful. I'll find the ticket. Thanks.

@dylanwilder
Contributor Author

OK, I've figured out the root cause here (I should note I'm on version 1.0.17-3.0.8). There are two issues: the first on deploy, the second with the inability to recover.

  1. I am updating the executor configuration (i.e. to change the JRE or Cassandra version). This triggers a rolling restart of the cluster here. It looks like the issue was fixed here, but this code used to read getExecutor().withNewId().getExecutorInfo(), which caused the updates to get out of sync (see the sketch at the end of this comment).
  2. That error causes the task to FAIL, but the executor is orphaned: the scheduler is unaware of its existence, yet it is still consuming the executor port resource, so the scheduler is unable to relaunch. Killing the process loops back to (1).

Seeing as how (1) has been addressed already, is it possible that (2) has been fixed as well?
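
As a rough illustration of item (1), here is a minimal sketch of the difference between minting a new executor ID on every config update and applying the update while keeping the existing ID. This is not the project's actual code; `ExecutorIdSketch`, `withNewId`, and `withUpdatedCommand` are made-up names, and only the quoted `withNewId()` behaviour is taken from the comment above.

```java
import java.util.UUID;

import org.apache.mesos.Protos.CommandInfo;
import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;

/**
 * Hypothetical sketch, not the project's classes: contrasts regenerating an
 * executor ID on every config update with reusing the one already registered.
 */
public final class ExecutorIdSketch {

    /** Old behaviour: every call mints a fresh ID, so the scheduler's view
     *  drifts from the executor actually running on the agent. */
    static ExecutorInfo withNewId(ExecutorInfo current) {
        return ExecutorInfo.newBuilder(current)
                .setExecutorId(ExecutorID.newBuilder()
                        .setValue(current.getExecutorId().getValue()
                                + "_" + UUID.randomUUID()))
                .build();
    }

    /** Fixed behaviour: apply the new command (e.g. new JRE/Cassandra fetch
     *  URIs) but keep the existing executor ID, so a relaunch targets the
     *  executor Mesos already knows about. */
    static ExecutorInfo withUpdatedCommand(ExecutorInfo current, CommandInfo newCommand) {
        return ExecutorInfo.newBuilder(current)
                .setCommand(newCommand)
                .build(); // executor_id is left untouched
    }

    private ExecutorIdSketch() {}
}
```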

@dylanwilder dylanwilder changed the title Failure to fetch jre during executor launch leaves tasks in bad state Executor failure during staging leaves orphaned executor, preventing recovery Nov 10, 2016
@dylanwilder
Contributor Author

Looks like this code will always attempt to create a new executor. Really it should be reusing the existing executor; however, since we are not actually tracking the state of the executors themselves, this could lead to the reverse problem. It seems to me that executors should really be tracked as first-class assets rather than as references from tasks. Thoughts?
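
A hypothetical sketch of what tracking executors as first-class assets could look like, assuming the scheduler records executor state from status updates or reconciliation. `ExecutorRegistry` and its methods are illustrative names, not an existing API in this repo.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;

/**
 * Illustrative sketch: the scheduler tracks executors independently of task
 * records, so a relaunch can reuse a live executor instead of always building
 * a new one.
 */
public final class ExecutorRegistry {

    /** Executor state tracked independently of tasks. */
    public enum ExecutorState { STAGING, RUNNING, TERMINATED }

    private final Map<String, ExecutorID> executorsByNode = new ConcurrentHashMap<>();
    private final Map<String, ExecutorState> stateByNode = new ConcurrentHashMap<>();

    /** Record what Mesos reported (e.g. via status updates or reconciliation). */
    public void record(String nodeName, ExecutorID id, ExecutorState state) {
        executorsByNode.put(nodeName, id);
        stateByNode.put(nodeName, state);
    }

    /**
     * Reuse the live executor for this node if one is known; only fall back to
     * a brand-new ExecutorInfo when none is tracked or the last one terminated.
     * This avoids launching a second executor that fights over the same
     * reserved port.
     */
    public ExecutorInfo executorForLaunch(String nodeName, ExecutorInfo freshTemplate) {
        ExecutorID known = executorsByNode.get(nodeName);
        ExecutorState state = stateByNode.getOrDefault(nodeName, ExecutorState.TERMINATED);
        if (known != null && state != ExecutorState.TERMINATED) {
            // Attach the task to the executor Mesos already has registered.
            return ExecutorInfo.newBuilder(freshTemplate).setExecutorId(known).build();
        }
        return freshTemplate;
    }
}
```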

@dylanwilder dylanwilder changed the title Executor failure during staging leaves orphaned executor, preventing recovery Task failure during staging leaves orphaned executor, preventing recovery Nov 11, 2016
dylanwilder pushed a commit to dylanwilder/dcos-cassandra-service that referenced this issue Nov 15, 2016
nickbp pushed a commit that referenced this issue Dec 7, 2016
* Fix for #290. Abort executor if Cassandra daemon fails to init

* Removing unused import
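
For context, a minimal sketch of what "abort executor if Cassandra daemon fails to init" can look like against the Mesos executor driver API. This is an assumption about the shape of the fix, not the merged patch; `DaemonInitFailureHandler` and `onDaemonInitFailure` are hypothetical names.

```java
import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

/**
 * Illustrative failure path: if the Cassandra daemon cannot initialize, report
 * TASK_FAILED and abort the executor driver so the executor process exits
 * instead of lingering as an orphan that holds the port reservation.
 */
public final class DaemonInitFailureHandler {

    static void onDaemonInitFailure(ExecutorDriver driver, TaskID taskId, Exception cause) {
        // Tell the scheduler the task failed so it can schedule a replacement.
        driver.sendStatusUpdate(TaskStatus.newBuilder()
                .setTaskId(taskId)
                .setState(TaskState.TASK_FAILED)
                .setMessage("Cassandra daemon failed to init: " + cause.getMessage())
                .build());
        // Abort the driver: the executor process then terminates and releases
        // its resources instead of sitting in STAGING indefinitely.
        driver.abort();
    }

    private DaemonInitFailureHandler() {}
}
```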
@nickbp
Contributor

nickbp commented Dec 7, 2016

Marking closed as #316 was merged.

@nickbp nickbp closed this as completed Dec 7, 2016