Correct entrypoint.sh causing bootloop in some cases #1628

adferrand · 2018-06-04T19:37:37Z

Following issue #1626

Entrypoint executes following commands:

kill -15 $SUPERVISOR_PID
wait $SUPERVISOR_PID

This may fail if process is killed sufficiently fast: indeed process may already be killed when wait is invoked, and then results into an error. Command ps -p ensures process is still here before waiting to its state to change.

Lines "kill -15 $SUPERVISOR_PID" then "wait $SUPERVISOR_PID" may fail if process is killed sufficiently fast. Command "ps -p" ensure process is still here before waiting to its state to change.

solidnerd · 2018-06-04T20:01:14Z

@adferrand Seems working for me. I tested it with a restart after a migration.

solidnerd · 2018-06-04T20:04:13Z

Are there any things left or can we start with a new release ?

adferrand · 2018-06-04T20:08:30Z

@solidnerd And I deployed it on my production environment without any issue. I think we are good to go.

solidnerd · 2018-06-04T20:09:21Z

Okay I will start with this than now. Thanks for your work 👍

paddy-hack · 2018-06-09T03:14:37Z

IIUC, you've tried to address a race between the kill and wait invocations, but didn't you just replace that with a race between the ps and wait invocations?
If so, would putting the wait in the background before the kill and then foregrounding it solve that? Like so

wait $SUPERVISOR_PID &
kill -15 $SUPERVISOR_PID
fg

Not sure if fg can be used in the container though.

If using bash's wait the -n option might be useful to detect that $SUPERVISOR_PID no longer exists.

adferrand · 2018-06-09T11:22:57Z

I thought about the possibility of a the race condition between ps and wait, but in fact I think there is no race.

First empirically because the issue is solved in the same conditions.

Second theoretically. Before, kill and wait were executed as two different commands. I think that there is no guaranty that two consecutive bash commands are executed by two consecutive cpu cycles. So there could have sufficiently enough cpu cycles to allow the process to be effectively killed before the wait is invoked, leading to the original error.

But it is quite different for a boolean evaluation, which is the case of my correction for ps and wait. Consistency of the evaluation of two operands in a binary combination is required, or your boolean test could be different depending on the runtime. So I think that two operands in a binary association are evaluated as one unique command, and operands are all executed in consecutive cpu cycles, which guarantee that nothing happens between them, and so the consistency, and so the insurance that what ps gave will still be relevant for the wait.

For your example however, the wait and kill will never finish normally, because at the time of the wait we said nothing to the process which could change its state (kill it for our case), so wait will stale until timeout.

paddy-hack · 2018-06-11T12:02:09Z

About the empirical evidence, I just had both 10.8.3 and 10.8.3-1 loop on a clean startup 😒. I use docker-compose and all the containers use a bind mounted volume (with a shared parent directory). That is to say, they use something like ./test/$containername. The clean startup is one where the shared parent directory does not exist yet.

As for your theoretical analysis, I think it depends a lot on the implementation details of the shell that's in use, i.c. /bin/bash. As for my example code, well, that's completely untested but I do put the wait in the background for a reason 😄

adferrand · 2018-06-11T18:53:56Z

And so died my hypothesis. As a scientist, I have to admit my mistake ^^ I will remember that shells are definitely programming languages, and that everything must be designed with the multi-threading nature of the underlying os in mind.

So your solution is better as it ensures that no race can occur. I will make a new PR.

Thanks!

adferrand · 2018-06-11T21:10:52Z

Hum. In fact executing wait $SUPERVISOR_PID & leads to a "wait pid is not a child of this shell". Indeed executing the wait in background makes it be executed in a sub-child of the current shell. Then it is not allowed to watch the process of pid $SUPERVISOR_PID, which is also a sub-child of the current shell.

Started from the foreground, wait, as a built-in command, stays in the shell and then can act normally.

So of course we could make a while loop on the /proc/$SUPERVISOR_PID until it disapear, but here goes again the race between while and its execution statement !

@paddy-hack, do you have an idea ?

adferrand · 2018-06-11T21:18:59Z

Or will just do a wait $SUPERVISOR_PID || true to trap the potential error on wait, and we are good to go. Anyway if wait fails, it means that the process is already dead, and that is what we want before continuing.

Correct bootloop in some cases.

0f7d683

Lines "kill -15 $SUPERVISOR_PID" then "wait $SUPERVISOR_PID" may fail if process is killed sufficiently fast. Command "ps -p" ensure process is still here before waiting to its state to change.

adferrand changed the title ~~Correct entrypoint.sh causing bootloop in some cases (#1626)~~ Correct entrypoint.sh causing bootloop in some cases. Jun 4, 2018

adferrand changed the title ~~Correct entrypoint.sh causing bootloop in some cases.~~ Correct entrypoint.sh causing bootloop in some cases Jun 4, 2018

Less output

12bdcda

solidnerd mentioned this pull request Jun 4, 2018

Unable to restart gitlab container #1629

Closed

solidnerd merged commit 9b4f62f into sameersbn:master Jun 4, 2018

Klowner mentioned this pull request Jun 22, 2018

Startup error: /sbin/entrypoint.sh: line 22: 2866 Terminated #1635

Closed

paddy-hack mentioned this pull request Aug 6, 2018

GITLAB_RELATIVE_URL_ROOT requires access to https://registry.yarnpkg.com #1696

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correct entrypoint.sh causing bootloop in some cases #1628

Correct entrypoint.sh causing bootloop in some cases #1628

adferrand commented Jun 4, 2018 •

edited

Loading

solidnerd commented Jun 4, 2018

solidnerd commented Jun 4, 2018

adferrand commented Jun 4, 2018

solidnerd commented Jun 4, 2018

paddy-hack commented Jun 9, 2018

adferrand commented Jun 9, 2018

paddy-hack commented Jun 11, 2018

adferrand commented Jun 11, 2018

adferrand commented Jun 11, 2018 •

edited

Loading

adferrand commented Jun 11, 2018

Correct entrypoint.sh causing bootloop in some cases #1628

Correct entrypoint.sh causing bootloop in some cases #1628

Conversation

adferrand commented Jun 4, 2018 • edited Loading

solidnerd commented Jun 4, 2018

solidnerd commented Jun 4, 2018

adferrand commented Jun 4, 2018

solidnerd commented Jun 4, 2018

paddy-hack commented Jun 9, 2018

adferrand commented Jun 9, 2018

paddy-hack commented Jun 11, 2018

adferrand commented Jun 11, 2018

adferrand commented Jun 11, 2018 • edited Loading

adferrand commented Jun 11, 2018

adferrand commented Jun 4, 2018 •

edited

Loading

adferrand commented Jun 11, 2018 •

edited

Loading