Task failure during staging leaves orphaned executor, preventing recovery #290
Comments
This is interesting. So, just to clarify: the executor never launched, because the fetcher failed to fetch the JRE that is required for executor launch?
I've seen this happen several times while testing a deployment. The most recent case shows node-0 and node-0-template in STAGING. The stdout logs show that the child process exited, but the container still hangs around. I've also seen the task in FAILED state yet still listed under active tasks in Mesos. The framework appears to be trying to recover, but it cannot get the port reservation it needs because an orphaned reservation from the failed task appears to be occupying the port range. The latter may be a Mesos issue that is just a side effect of the initial error. In any case, the only way I'm able to recover is to assassinate the slave machine. I just wiped and restarted so I could move forward, but if you'd like I can try to get you some logs. I also opened a support ticket with Mesosphere before I had narrowed it down to the failure to fetch the JRE. That ticket contains more details.
@dylanwilder Logs would be helpful. I'll find the ticket. Thanks.
OK, I've figured out the root cause here (I should note I'm on version 1.0.17-3.0.8). There are two issues: the first with the deploy, the second with the inability to recover.
Seeing as (1) has been addressed already, is it possible that (2) has been fixed as well?
Looks like this code will always attempt to create a new executor. Really it should be reusing the existing executor; however, since we are not actually tracking the state of executors themselves, this could lead to the reverse problem. It seems to me that executors should be tracked as first-class assets rather than as a reference from tasks. Thoughts?
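To make the "first-class executor" idea concrete, here is a minimal sketch of what tracking executors separately from tasks could look like. It is not the framework's code: `ExecutorRegistry`, `ExecutorRecord`, `ExecutorState`, and the ID format are hypothetical names made up for illustration, and the state is kept in memory rather than in the framework's persistent store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class ExecutorState(Enum):
    # Hypothetical lifecycle states, not Mesos status values.
    LAUNCHING = "LAUNCHING"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"


@dataclass
class ExecutorRecord:
    executor_id: str
    agent_id: str
    state: ExecutorState = ExecutorState.LAUNCHING
    task_ids: Set[str] = field(default_factory=set)


class ExecutorRegistry:
    """Tracks one executor per node, independently of the tasks using it."""

    def __init__(self) -> None:
        self._by_node: Dict[str, ExecutorRecord] = {}
        self._counter = 0

    def executor_for(self, node: str, agent_id: str, task_id: str) -> ExecutorRecord:
        """Reuse the node's live executor, or create a new one if none is usable."""
        record = self._by_node.get(node)
        reusable = (
            record is not None
            and record.state is not ExecutorState.TERMINATED
            and record.agent_id == agent_id
        )
        if not reusable:
            # No live executor on this agent: allocate a fresh ID (assumed format).
            self._counter += 1
            record = ExecutorRecord(f"{node}-executor-{self._counter}", agent_id)
            self._by_node[node] = record
        record.task_ids.add(task_id)
        return record

    def mark_running(self, node: str) -> None:
        record = self._by_node.get(node)
        if record is not None and record.state is ExecutorState.LAUNCHING:
            record.state = ExecutorState.RUNNING

    def mark_terminated(self, node: str) -> Set[str]:
        """Record that the executor is gone; return the tasks left without one."""
        record = self._by_node.get(node)
        if record is None:
            return set()
        record.state = ExecutorState.TERMINATED
        return set(record.task_ids)
```

With something like this, node-0 and node-0-template would share one executor while it is alive, and a terminated executor is never reused, which is the "reverse problem" mentioned above.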
Marking closed as #316 was merged.
The executor log shows that the process has exited, but the Mesos state still shows the task as staging. This prevents the framework from recovering, as it is unable to retrieve the resources.