
Alpine 3.8 cluster failures #22308

Closed
rvagg opened this issue Aug 14, 2018 · 53 comments
Labels
build: Issues and PRs related to build files or the CI.
cluster: Issues and PRs related to the cluster subsystem.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.

Comments

@rvagg
Member

rvagg commented Aug 14, 2018

I just added Alpine 3.8 to CI and removed 3.6 in the process, shifting alpine-last-latest-x64 to Alpine 3.7 and giving alpine-latest-x64 to Alpine 3.8. It tested well on my local machine across our major branches, but now that it's enabled in CI we're getting consistent errors on all test runs:

15:57:44 not ok 647 parallel/test-cluster-master-error
15:57:44   ---
15:57:44   duration_ms: 120.63
15:57:44   severity: fail
15:57:44   exitcode: -15
15:57:44   stack: |-
15:57:44     timeout
15:57:44   ...
15:57:44 not ok 648 parallel/test-cluster-master-kill
15:57:44   ---
15:57:44   duration_ms: 120.92
15:57:44   severity: fail
15:57:44   exitcode: -15
15:57:44   stack: |-
15:57:44     timeout

What should we do with this? Do I remove it from CI for now or can someone spend some time investigating?

The Dockerfile is here, minus the template strings, which you can easily stub out to set this up locally if you want to give it a go. I'm not sure what the difference is with my local machine; perhaps I wasn't testing on the latest master and there's something new, or perhaps it's a difference in the Docker setup.

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Aug 14, 2018
@Trott
Member

Trott commented Aug 14, 2018

Quick thing to try: Does moving the tests to sequential cause them to pass in CI on this platform? (I would create a branch and try myself but I've got non-computer things to focus on for the next few hours.)

@Trott
Member

Trott commented Aug 14, 2018

(Also: Our code now prints stdout and stderr on timeouts so someone can try adding a bunch of console.log() statements to see where things are hanging up.)

@refack
Contributor

refack commented Aug 14, 2018

Testing move to sequential
https://ci.nodejs.org/job/node-test-commit-linux/20716/

@refack
Contributor

refack commented Aug 14, 2018

BTW: before the switch to 3.8 these tests took < 2s


@refack
Contributor

refack commented Aug 14, 2018

@joyeecheung joyeecheung added the cluster Issues and PRs related to the cluster subsystem. label Aug 14, 2018
@Trott
Member

Trott commented Aug 14, 2018

@Trott
Member

Trott commented Aug 14, 2018

The logging shows that the test succeeds except that pollWorkers() in both tests never detects that the PID has exited. I wonder if the bug is in common.isAlive(pid). That code uses process.kill(pid, 'SIGCONT'), so maybe there's something about SIGCONT and permissions in the Docker container?
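For context, both tests follow roughly this pattern (a simplified sketch of common.isAlive() plus the poll loop, not the exact test code). If the kill() probe never throws, the interval never clears and the test runs into the 120-second timeout:

function isAlive(pid) {
  try {
    process.kill(pid, 'SIGCONT'); // throws ESRCH once the pid is truly gone
    return true;
  } catch {
    return false;
  }
}

function pollUntilDead(pid, done) {
  const poll = setInterval(() => {
    console.log('polling');
    if (!isAlive(pid)) {
      clearInterval(poll); // worker confirmed dead; the test can finish
      done();
    }
  }, 50);
}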

@Trott
Member

Trott commented Aug 14, 2018

@nodejs/docker

@Trott
Member

Trott commented Aug 14, 2018

common.isAlive(pid) expects process.kill(pid, 'SIGCONT') to throw if pid does not exist. @rvagg @refack or someone who knows how to log into the Docker container running on CI, any chance you can see if ./node -e 'process.kill(12345)' throws or not (after checking that there is no pid 12345 running)?

@maclover7
Contributor

bash-4.4$ ./node -e 'process.kill(12345)'
internal/process/per_thread.js:194
      throw errnoException(err, 'kill');
      ^

Error: kill ESRCH
    at process.kill (internal/process/per_thread.js:194:13)
    at [eval]:1:9
    at Script.runInThisContext (vm.js:88:20)
    at Object.runInThisContext (vm.js:285:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at evalScript (internal/bootstrap/node.js:563:27)
    at startup (internal/bootstrap/node.js:248:9)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)

Also, @Trott (or anybody else with the nodejs_build_test SSH key), I had written up some docs about how to restart/SSH into the CI docker containers (in case more testing is needed, and you want it to be self-service :))

@refack
Contributor

refack commented Aug 14, 2018

Since Alpine runs with musl, whose support level is experimental, I've removed the label from the main CI job and created a dedicated job for stabilization efforts: https://ci.nodejs.org/job/node-test-commit-alpine38/

@refack refack added the build Issues and PRs related to build files or the CI. label Aug 14, 2018
@refack
Contributor

refack commented Aug 14, 2018

Reporting from within the danger zone:

bash-4.4$ ./node test/parallel/test-cluster-master-error.js
message received: { cmd: 'worker', workerPID: 26420 }
message received: { cmd: 'worker', workerPID: 26421 }
exited with code 1
polling
polling
bash-4.4$ ./node test/parallel/test-cluster-master-kill.js
message received: { pid: 26447 }
exited with code 0
polling
polling
bash-4.4$
bash-4.4$ /usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap --mode=release --flaky-tests=run parallel/test-cluster-master*
TAP version 13
1..2
ok 1 parallel/test-cluster-master-error
  ---
  duration_ms: 0.713
  ...
ok 2 parallel/test-cluster-master-kill
  ---
  duration_ms: 1.713
  ...

more like twilight zone.

@LaurentGoderre
Member

Is the failure specific to running the test in a docker container?

@refack
Contributor

refack commented Aug 15, 2018

Is the failure specific to running the test in a docker container?

We only test Alpine running as a Docker container, so I guess we don't know...

@LaurentGoderre
Member

I would love to help but it looks like a pretty steep learning curve figuring out all the build stuff.

@LaurentGoderre
Member

Btw, where is the test suite to test?

@refack
Contributor

refack commented Aug 15, 2018

make test

@refack
Contributor

refack commented Aug 15, 2018

which in turn runs:

/usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap \
	--mode=release --flaky-tests=run \
	 default addons addons-napi doctool

(it assumes the node binary to test is out/Release/node)

@LaurentGoderre
Member

Is there a way to specify the binary path to test?

@refack
Contributor

refack commented Aug 15, 2018

AFAIR, no. When needed I copy/symlink a binary to that location.

@refack
Contributor

refack commented Aug 15, 2018

New observation: after running a CI job there seem to be multiple node zombies around:

iojs         1  0.7  0.7 1692000 123156 ?      Ssl  01:27   5:37 java -Xmx128m -jar /home/iojs/slave.jar 
iojs      1544  0.0  0.0      0     0 ?        Z    01:38   0:00 [node] <defunct>
iojs     10025  0.0  0.0      0     0 ?        Z    01:42   0:00 [node] <defunct>
iojs     10064  0.0  0.0      0     0 ?        Z    01:42   0:00 [node] <defunct>
iojs     10690  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11805  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11830  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11836  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     14805  0.0  0.0   6380  1932 pts/0    Ss   13:04   0:00 /bin/bash
iojs     16049  0.0  0.0   5700   616 pts/0    R+   13:31   0:00 ps axu
iojs     20074  0.0  0.0      0     0 ?        Zs   01:33   0:00 [node] <defunct>
iojs     22336  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22456  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22457  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22480  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22752  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     26383  0.0  0.0      0     0 ?        Z    01:36   0:00 [ls] <defunct>
iojs     31062  0.0  0.0      0     0 ?        Z    01:37   0:00 [node] <defunct>
iojs     31080  0.0  0.0      0     0 ?        Zs   01:37   0:00 [node] <defunct>
iojs     31094  0.0  0.0      0     0 ?        Zs   01:37   0:00 [node] <defunct>

That might make the polling fail...

@LaurentGoderre
Member

OK, I created the following Dockerfile to run the test suite:

FROM node:10-alpine

RUN apk add --no-cache --update \
    curl \
    python \
  && curl -L --compressed https://api.github.com/repos/nodejs/node/tarball -o node.tar.gz

RUN mkdir -p /node/out/Release \
  && tar -xf node.tar.gz --strip 1 -C /node \
  && ln -s /usr/local/bin/node /node/out/Release/node

RUN cd /node \
  && python tools/test.py -j 4 -p tap --logfile test.tap \
    --mode=release --flaky-tests=run \
    default addons addons-napi doctool

@rvagg
Member Author

rvagg commented Nov 30, 2018

FYI this runs on top of Ubuntu 16.04 in our infra. These containers are fresh as I've just reprovisioned all of our Docker infra over the last couple of days, so this isn't about the process table filling up.

I can't reproduce locally on 18.04 using the same container config. One other difference is that we run from within Jenkins, so there's an additional layer to the process tree, although I'm not sure why that would matter.

@rvagg
Member Author

rvagg commented Nov 30, 2018

OK, I can repro locally; it's because of the process hierarchy inside the container. You need to remove bash from the hierarchy, which is present whenever you try to reproduce these things manually. Grab the Dockerfile above (edited just now to add a RUN mkdir /home/iojs/tmp), put it in /tmp/alpine38/ and run docker build -t node-alpine:3.8 /tmp/alpine38/. Then, inside a clone of this repo, do:

./configure:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./configure

make:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 make -j8

run the two tests:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 /usr/bin/python tools/test.py parallel/test-cluster-master*

which, after waiting out the two-minute timeouts, yields the same failures as in CI.

This reinforces the need to fix this, as launching your application with minimal layers inside a minimal container is a thing that folks do with Docker / Alpine. (Aside from the fact that you shouldn't be using cluster.)

@rvagg
Member Author

rvagg commented Nov 30, 2018

Running the tests directly reproduces it too, btw; you just need to kill them manually:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./node test/parallel/test-cluster-master-error.js

@Trott
Member

Trott commented Nov 30, 2018

(Unassigning refack. Don't want to discourage others from jumping in on this.)

@Trott
Member

Trott commented Nov 30, 2018

Since this appears to be a bug in cluster or in Alpine, I'm going to remove the CI/flaky test label too.

@Trott Trott removed the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Nov 30, 2018
@Trott
Member

Trott commented Nov 30, 2018

Both tests use common.isAlive(), and if that always returns true on this OS, then we'll see these tests time out.

Here's the source for common.isAlive():

function isAlive(pid) {
  try {
    process.kill(pid, 'SIGCONT');
    return true;
  } catch {
    return false;
  }
}

Any chance 'SIGCONT' does not behave here as it does on other operating systems?

@Trott
Member

Trott commented Nov 30, 2018

The addition of 'SIGCONT' via common.isAlive() was introduced by @jasnell in baa0ffdab37. Prior to that, it had been using 0 rather than SIGCONT. Might be worth seeing if changing 'SIGCONT' to 0 fixes these two tests in this environment. (It might break other tests, or make tests that should fail pass or something, though.)

EDIT: Changing to 0 seems to fix nothing...oh well...

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Nov 30, 2018
Trott added a commit to Trott/io.js that referenced this issue Nov 30, 2018
Use signal value `0` rather than `'SIGCONT'` as the latter appears to
behave unexpectedly on Alpine 3.8.

Fixes: nodejs#22308
@Trott Trott mentioned this issue Nov 30, 2018
@Trott
Member

Trott commented Nov 30, 2018

Probably too optimistic to think that #24756 will fix it without introducing other issues, but let's see...

EDIT: Indeed, too optimistic...didn't work...

@Trott
Member

Trott commented Nov 30, 2018

Whee! I have Docker installed and set up and I can replicate this. It's like I'm living in the FUTURE or something. Or at least the RECENT PAST. Can't do it right now, but will investigate more in a little bit if no one beats me to it (which: please do!).

@Trott
Member

Trott commented Nov 30, 2018

As far as I can tell, the worker really does exit, but in the master process, process._kill(pid, 'SIGCONT') does not return an error and ps -aux shows the worker process still in the table but "defunct":

iojs        19  3.7  0.0      0     0 pts/0    Z+   23:00   0:00 [node] <defunct>

This is, I guess, basically what refack noted on August 15: the worker is left as a zombie, and a zombie keeps its slot in the process table until something reaps it, so signalling it still succeeds.
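(One way to see the distinction on Linux, illustrative only and not what common.isAlive() does: a zombie still has an entry in /proc, but its state field is 'Z'.)

const fs = require('fs');

// Hypothetical helper: true only if `pid` exists and is not a zombie.
function isReallyAlive(pid) {
  try {
    const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8');
    // Format is "pid (comm) state ..."; comm may contain spaces, so take the
    // character two positions after the last ')'.
    const state = stat[stat.lastIndexOf(')') + 2];
    return state !== 'Z';
  } catch {
    return false; // no such pid
  }
}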

@Trott
Member

Trott commented Nov 30, 2018

The PPID is process 1, which is the node process:

F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S  1000     1     0  4  80   0 - 64478 -      pts/0    00:00:00 node
0 Z  1000    19     1  1  80   0 -     0 -      pts/0    00:00:00 node <defunct>

@Trott
Member

Trott commented Nov 30, 2018

I think test-cluster-master-kill.js may be slightly misleadingly named. kill() doesn't really come into it. But I can see why it was named that. Here's what it does:

  • root process: uses child_process.fork() to create a node subprocess
  • subprocess: uses cluster to launch a worker
  • worker: runs an http server so that it will run forever if left on its own
  • subprocess: sends the pid of the worker to the root process
  • subprocess: exits once it knows the worker is running
  • root process: once it detects that the subprocess has exited, it polls the pid of the worker process and makes sure it exits. This is the part that fails because the polling detects the defunct/zombie process.

I'm not sure if this is a bug in Node.js or an artifact of the way the test is run as rvagg described above. Is there some system call Node.js should be making to let the OS know to remove the pid from the process table? (If so, why does it only matter in this one edge case?) Or is this just what happens when you kinda sorta bypass some normal operating system stuff?
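A condensed, hypothetical sketch of that structure (not the actual test source):

'use strict';
const { fork } = require('child_process');
const cluster = require('cluster');
const http = require('http');

if (process.argv[2] === 'child') {
  if (cluster.isMaster) {
    // subprocess: launch a worker, report its pid to the root process, exit.
    const worker = cluster.fork();
    worker.once('listening', () => {
      process.send({ workerPID: worker.process.pid });
      process.exit(0);
    });
  } else {
    // worker: an http server that would run forever if left on its own.
    http.createServer(() => {}).listen(0);
  }
} else {
  // root process: fork the subprocess, then poll the reported worker pid
  // until the probe throws, i.e. until the worker has really gone away.
  const child = fork(__filename, ['child']);
  let workerPID;
  child.on('message', (msg) => { workerPID = msg.workerPID; });
  child.on('exit', () => {
    const poll = setInterval(() => {
      try {
        process.kill(workerPID, 'SIGCONT'); // still there (or a zombie)...
      } catch {
        clearInterval(poll);                // ...gone for real
        console.log('worker exited as expected');
      }
    }, 50);
  });
}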

@nodejs/cluster @nodejs/child_process

@Trott
Member

Trott commented Nov 30, 2018

I guess probably not a terrible time to re-ping @nodejs/docker too...

@rvagg
Member Author

rvagg commented Dec 1, 2018

I bet some clever libuv folks might have a clue here too, since we're getting pretty low-level: @cjihrig @bnoordhuis

@Trott
Member

Trott commented Dec 1, 2018

@nodejs/libuv

@rvagg
Member Author

rvagg commented Dec 1, 2018

So I think this is all about the fact that the killed child processes aren't reaped properly because we don't have init or a similar reaping parent process in the chain. I believe I can fix this by putting bash as the parent to Jenkins in the container, something like:

CMD [ "bash", "-c", "cd /home/iojs \
  && curl https://ci.nodejs.org/jnlpJars/slave.jar -O \
  && java -Xmx{{ server_ram|default('128m') }} \
          -jar /home/{{ server_user }}/slave.jar \
          -jnlpUrl {{ jenkins_url }}/computer/{{ item.name }}/slave-agent.jnlp \
          -secret {{ item.secret }}" ]

Here's the bit I don't quite know the answer to: is this bypassing something that should be Node's responsibility? Why are we not experiencing this on any of our other Docker containers, where we do the same thing and execute Jenkins in the same way? I can't find anything special about Alpine 3.8 that would lead to different behaviour. I don't want to be putting a bandaid on something that's a genuine problem on our end.

@rvagg
Member Author

rvagg commented Dec 1, 2018

Actually, I solved a similar problem of non-reaping on the ARM machines running Docker by using --init, which starts a proper init at the root to deal with reaping. But my questions above still stand.

@rvagg
Member Author

rvagg commented Dec 1, 2018

--init seems to have done the trick in this instance https://ci.nodejs.org/job/node-test-commit-linux/nodes=alpine-latest-x64/

I wouldn't mind thoughts from folks more expert in system-level concerns on why this might be a unique problem on a specific distro+version, and whether it suggests problems on our end. Otherwise, we can probably close this and wait to see if we get issues reported about it.

rvagg added a commit to nodejs/build that referenced this issue Dec 1, 2018
ARM Docker use already has this; this expands it to the rest of our Docker usage. It helps with reaping defunct processes when running bare commands (i.e. Jenkins).

Ref: nodejs/node#22308
@rvagg rvagg closed this as completed Dec 4, 2018
@sam-github
Contributor

Docker containers not having a functional init is a common source of problems, at least when the container runs code that creates sub-sub-processes that don't get waited on by the sub-process. I recall reviewing C code for a mini-reaper by @rmg that was used as a runner. I wonder if the --init option is newish? Providing it seems like the right thing to do for containers with complex sub-process management, no matter what the distro.

Hypothetically, the order of process termination can vary: if the sub-process is allowed to run just marginally past the sub-sub-process's death, it gets the chance to wait on its child processes and they won't become orphaned. If the sub-process terminates before the sub-sub-process exit statuses are available, they become orphaned and are reparented to init, so perhaps process run/scheduling differences are what we're seeing.
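A hypothetical sketch of that race (not code from the thread), assuming it runs as the top-level command of a container without a reaping init:

const { spawn } = require('child_process');

if (process.argv[2] === 'middle') {
  // middle process: start a grandchild that outlives us, then exit at once,
  // so we never get the chance to wait() on it.
  spawn(process.execPath, ['-e', 'setTimeout(() => {}, 1000)'],
        { detached: true, stdio: 'ignore' }).unref();
  process.exit(0);
} else {
  // top process (PID 1 in the container): when the middle process dies, the
  // grandchild is reparented to us; since we never wait() on processes we did
  // not spawn, it lingers as <defunct> in `ps` once it exits.
  const middle = spawn(process.execPath, [__filename, 'middle'], { stdio: 'inherit' });
  middle.on('exit', () => {
    setTimeout(() => {}, 3000); // stay alive long enough to observe the zombie
  });
}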

@rvagg
Member Author

rvagg commented Dec 5, 2018

I can buy that as an explanation, I suppose. It's just strange that we're only seeing this on one of the dockerised platforms. Granted, Ubuntu 16.04 is used for the majority, but we've also had Alpine in there for a while now without seeing this. Maybe it's a minor musl change that's impacted timing in some subtle but reliable way.
