Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct entrypoint.sh causing bootloop in some cases #1628

Merged
merged 2 commits into from
Jun 4, 2018

Conversation

adferrand
Copy link
Contributor

@adferrand adferrand commented Jun 4, 2018

Following issue #1626

Entrypoint executes following commands:

  • kill -15 $SUPERVISOR_PID
  • wait $SUPERVISOR_PID

This may fail if process is killed sufficiently fast: indeed process may already be killed when wait is invoked, and then results into an error. Command ps -p ensures process is still here before waiting to its state to change.

Lines "kill -15 $SUPERVISOR_PID" then "wait $SUPERVISOR_PID" may fail if process is killed sufficiently fast. Command "ps -p" ensure process is still here before waiting to its state to change.
@adferrand adferrand changed the title Correct entrypoint.sh causing bootloop in some cases (#1626) Correct entrypoint.sh causing bootloop in some cases. Jun 4, 2018
@adferrand adferrand changed the title Correct entrypoint.sh causing bootloop in some cases. Correct entrypoint.sh causing bootloop in some cases Jun 4, 2018
@solidnerd
Copy link
Collaborator

@adferrand Seems working for me. I tested it with a restart after a migration.

@solidnerd
Copy link
Collaborator

Are there any things left or can we start with a new release ?

@adferrand
Copy link
Contributor Author

@solidnerd And I deployed it on my production environment without any issue. I think we are good to go.

@solidnerd
Copy link
Collaborator

Okay I will start with this than now. Thanks for your work 👍

@solidnerd solidnerd merged commit 9b4f62f into sameersbn:master Jun 4, 2018
@paddy-hack
Copy link
Contributor

IIUC, you've tried to address a race between the kill and wait invocations, but didn't you just replace that with a race between the ps and wait invocations?
If so, would putting the wait in the background before the kill and then foregrounding it solve that? Like so

wait $SUPERVISOR_PID &
kill -15 $SUPERVISOR_PID
fg

Not sure if fg can be used in the container though.

If using bash's wait the -n option might be useful to detect that $SUPERVISOR_PID no longer exists.

@adferrand
Copy link
Contributor Author

I thought about the possibility of a the race condition between ps and wait, but in fact I think there is no race.

First empirically because the issue is solved in the same conditions.

Second theoretically. Before, kill and wait were executed as two different commands. I think that there is no guaranty that two consecutive bash commands are executed by two consecutive cpu cycles. So there could have sufficiently enough cpu cycles to allow the process to be effectively killed before the wait is invoked, leading to the original error.

But it is quite different for a boolean evaluation, which is the case of my correction for ps and wait. Consistency of the evaluation of two operands in a binary combination is required, or your boolean test could be different depending on the runtime. So I think that two operands in a binary association are evaluated as one unique command, and operands are all executed in consecutive cpu cycles, which guarantee that nothing happens between them, and so the consistency, and so the insurance that what ps gave will still be relevant for the wait.

For your example however, the wait and kill will never finish normally, because at the time of the wait we said nothing to the process which could change its state (kill it for our case), so wait will stale until timeout.

@paddy-hack
Copy link
Contributor

About the empirical evidence, I just had both 10.8.3 and 10.8.3-1 loop on a clean startup 😒. I use docker-compose and all the containers use a bind mounted volume (with a shared parent directory). That is to say, they use something like ./test/$containername. The clean startup is one where the shared parent directory does not exist yet.

As for your theoretical analysis, I think it depends a lot on the implementation details of the shell that's in use, i.c. /bin/bash. As for my example code, well, that's completely untested but I do put the wait in the background for a reason 😄

@adferrand
Copy link
Contributor Author

And so died my hypothesis. As a scientist, I have to admit my mistake ^^ I will remember that shells are definitely programming languages, and that everything must be designed with the multi-threading nature of the underlying os in mind.

So your solution is better as it ensures that no race can occur. I will make a new PR.

Thanks!

@adferrand
Copy link
Contributor Author

adferrand commented Jun 11, 2018

Hum. In fact executing wait $SUPERVISOR_PID & leads to a "wait pid is not a child of this shell". Indeed executing the wait in background makes it be executed in a sub-child of the current shell. Then it is not allowed to watch the process of pid $SUPERVISOR_PID, which is also a sub-child of the current shell.

Started from the foreground, wait, as a built-in command, stays in the shell and then can act normally.

So of course we could make a while loop on the /proc/$SUPERVISOR_PID until it disapear, but here goes again the race between while and its execution statement !

@paddy-hack, do you have an idea ?

@adferrand
Copy link
Contributor Author

Or will just do a wait $SUPERVISOR_PID || true to trap the potential error on wait, and we are good to go. Anyway if wait fails, it means that the process is already dead, and that is what we want before continuing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants