
Alpine 3.8 cluster failures #22308

Closed
rvagg opened this issue Aug 14, 2018 · 53 comments
Labels
build: Issues and PRs related to build files or the CI.
cluster: Issues and PRs related to the cluster subsystem.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.

Comments

@rvagg
Member

rvagg commented Aug 14, 2018

I just added Alpine 3.8 to CI and removed 3.6 in the process, shifting alpine-last-latest-x64 to Alpine 3.7 and giving alpine-latest-x64 to Alpine 3.8. It tested well on my local machine across our major branches, but now that it's enabled in CI we're getting consistent errors on all test runs:

15:57:44 not ok 647 parallel/test-cluster-master-error
15:57:44   ---
15:57:44   duration_ms: 120.63
15:57:44   severity: fail
15:57:44   exitcode: -15
15:57:44   stack: |-
15:57:44     timeout
15:57:44   ...
15:57:44 not ok 648 parallel/test-cluster-master-kill
15:57:44   ---
15:57:44   duration_ms: 120.92
15:57:44   severity: fail
15:57:44   exitcode: -15
15:57:44   stack: |-
15:57:44     timeout

What should we do with this? Do I remove it from CI for now or can someone spend some time investigating?

The Dockerfile is here, minus the template strings, which you can easily stub out to set this up locally if you want to give it a go. I'm not sure what the difference is with my local machine; perhaps I wasn't testing on the latest master and there's something new, or perhaps it's a difference in the Docker setup.

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Aug 14, 2018
@Trott
Member

Trott commented Aug 14, 2018

Quick thing to try: Does moving the tests to sequential cause them to pass in CI on this platform? (I would create a branch and try myself but I've got non-computer things to focus on for the next few hours.)

@Trott
Member

Trott commented Aug 14, 2018

(Also: Our code now prints stdout and stderr on timeouts so someone can try adding a bunch of console.log() statements to see where things are hanging up.)

@refack
Contributor

refack commented Aug 14, 2018

Testing move to sequential
https://ci.nodejs.org/job/node-test-commit-linux/20716/

@refack
Contributor

refack commented Aug 14, 2018

BTW: before the switch to 3.8 these tests took < 2s


@refack
Contributor

refack commented Aug 14, 2018

@joyeecheung joyeecheung added the cluster Issues and PRs related to the cluster subsystem. label Aug 14, 2018
@Trott
Member

Trott commented Aug 14, 2018

@Trott
Member

Trott commented Aug 14, 2018

The logging shows that the test succeeds except that pollWorkers() in both tests never detects that the PID has exited. I wonder if the bug is in common.isAlive(pid). That code uses process.kill(pid, 'SIGCONT'), so maybe there's something about SIGCONT and permissions in the Docker container?
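For context, both tests follow roughly this pattern (a simplified sketch of common.isAlive() plus the poll loop, not the exact test code). If the kill() probe never throws, the interval never clears and the test runs into the 120-second timeout:

function isAlive(pid) {
  try {
    process.kill(pid, 'SIGCONT'); // throws ESRCH once the pid is truly gone
    return true;
  } catch {
    return false;
  }
}

function pollUntilDead(pid, done) {
  const poll = setInterval(() => {
    console.log('polling');
    if (!isAlive(pid)) {
      clearInterval(poll); // worker confirmed dead; the test can finish
      done();
    }
  }, 50);
}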

@Trott
Member

Trott commented Aug 14, 2018

@nodejs/docker

@Trott
Member

Trott commented Aug 14, 2018

common.isAlive(pid) expects process.kill(pid, 'SIGCONT') to throw if pid does not exist. @rvagg @refack or someone who knows how to log into the Docker container running on CI, any chance you can see if ./node -e 'process.kill(12345)' throws or not (after checking that there is no pid 12345 running)?

@maclover7
Contributor

bash-4.4$ ./node -e 'process.kill(12345)'
internal/process/per_thread.js:194
      throw errnoException(err, 'kill');
      ^

Error: kill ESRCH
    at process.kill (internal/process/per_thread.js:194:13)
    at [eval]:1:9
    at Script.runInThisContext (vm.js:88:20)
    at Object.runInThisContext (vm.js:285:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at evalScript (internal/bootstrap/node.js:563:27)
    at startup (internal/bootstrap/node.js:248:9)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)

Also, @Trott (or anybody else with the nodejs_build_test SSH key), I had written up some docs about how to restart/SSH into the CI docker containers (in case more testing is needed, and you want it to be self-service :))

@refack
Contributor

refack commented Aug 14, 2018

Since Alpine runs with musl, whose support level is experimental, I've removed the label from the main CI job and created a dedicated job for stabilization efforts: https://ci.nodejs.org/job/node-test-commit-alpine38/

@refack refack added the build Issues and PRs related to build files or the CI. label Aug 14, 2018
@refack
Contributor

refack commented Aug 14, 2018

Reporting from within the danger zone:

bash-4.4$ ./node test/parallel/test-cluster-master-error.js
message received: { cmd: 'worker', workerPID: 26420 }
message received: { cmd: 'worker', workerPID: 26421 }
exited with code 1
polling
polling
bash-4.4$ ./node test/parallel/test-cluster-master-kill.js
message received: { pid: 26447 }
exited with code 0
polling
polling
bash-4.4$
bash-4.4$ /usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap --mode=release --flaky-tests=run parallel/test-cluster-master*
TAP version 13
1..2
ok 1 parallel/test-cluster-master-error
  ---
  duration_ms: 0.713
  ...
ok 2 parallel/test-cluster-master-kill
  ---
  duration_ms: 1.713
  ...

more like twilight zone.

@LaurentGoderre
Member

Is the failure specific to running the test in a docker container?

@refack
Contributor

refack commented Aug 15, 2018

Is the failure specific to running the test in a docker container?

We only test Alpine running as a Docker container, so I guess we don't know...

@LaurentGoderre
Member

I would love to help but it looks like a pretty steep learning curve figuring out all the build stuff.

@LaurentGoderre
Member

Btw, where is the test suite to test?

@refack
Contributor

refack commented Aug 15, 2018

make test

@refack
Contributor

refack commented Aug 15, 2018

which in turn runs:

/usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap \
	--mode=release --flaky-tests=run \
	 default addons addons-napi doctool

(it assumes the node binary to test is out/Release/node)

@LaurentGoderre
Member

Is there a way to specify the binary path to test?

@refack
Contributor

refack commented Aug 15, 2018

AFAIR, no. When needed I copy/symlink a binary to that location.

@refack
Contributor

refack commented Aug 15, 2018

New observation: after running a CI job there seem to be multiple node zombies around:

iojs         1  0.7  0.7 1692000 123156 ?      Ssl  01:27   5:37 java -Xmx128m -jar /home/iojs/slave.jar 
iojs      1544  0.0  0.0      0     0 ?        Z    01:38   0:00 [node] <defunct>
iojs     10025  0.0  0.0      0     0 ?        Z    01:42   0:00 [node] <defunct>
iojs     10064  0.0  0.0      0     0 ?        Z    01:42   0:00 [node] <defunct>
iojs     10690  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11805  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11830  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     11836  0.0  0.0      0     0 ?        Z    01:43   0:00 [node] <defunct>
iojs     14805  0.0  0.0   6380  1932 pts/0    Ss   13:04   0:00 /bin/bash
iojs     16049  0.0  0.0   5700   616 pts/0    R+   13:31   0:00 ps axu
iojs     20074  0.0  0.0      0     0 ?        Zs   01:33   0:00 [node] <defunct>
iojs     22336  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22456  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22457  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22480  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     22752  0.0  0.0      0     0 ?        Z    01:34   0:00 [node] <defunct>
iojs     26383  0.0  0.0      0     0 ?        Z    01:36   0:00 [ls] <defunct>
iojs     31062  0.0  0.0      0     0 ?        Z    01:37   0:00 [node] <defunct>
iojs     31080  0.0  0.0      0     0 ?        Zs   01:37   0:00 [node] <defunct>
iojs     31094  0.0  0.0      0     0 ?        Zs   01:37   0:00 [node] <defunct>

That might make the polling fail...

@LaurentGoderre
Member

OK, I created the following Dockerfile to run the test suite:

FROM node:10-alpine

RUN apk add --no-cache --update \
    curl \
    python \
  && curl -L --compressed https://api.github.com/repos/nodejs/node/tarball -o node.tar.gz

RUN mkdir -p /node/out/Release \
  && tar -xf node.tar.gz --strip 1 -C /node \
  && ln -s /usr/local/bin/node /node/out/Release/node

RUN cd /node \
  && python tools/test.py -j 4 -p tap --logfile test.tap \
    --mode=release --flaky-tests=run \
    default addons addons-napi doctool

@rvagg
Member Author

rvagg commented Nov 30, 2018

FYI this runs on top of Ubuntu 16.04 in our infra. These containers are fresh as I've just reprovisioned all of our Docker infra over the last couple of days, so this isn't about the process table filling up.

I can't reproduce locally on 18.04 using the same container config. One other difference is that we run from within Jenkins, so there's an additional layer to the process tree, although I'm not sure why that would matter.

@rvagg
Member Author

rvagg commented Nov 30, 2018

OK, I can repro locally; it's because of the process hierarchy inside the container. You need to remove bash from the hierarchy, which is present whenever you try to reproduce these things manually. Grab the Dockerfile above (edited just now to add a RUN mkdir /home/iojs/tmp), put it in /tmp/alpine38/ and run docker build -t node-alpine:3.8 /tmp/alpine38/. Then, inside a clone of this repo, do:

./configure:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./configure

make:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 make -j8

run the two tests:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 /usr/bin/python tools/test.py parallel/test-cluster-master*

which, after waiting out the two-minute timeouts, yields the same failures as in CI.

This reinforces the need to fix this, as launching your application with minimal layers inside a minimal container is a thing that folks do with Docker / Alpine. (Aside from the fact that you shouldn't be using cluster.)

@rvagg
Member Author

rvagg commented Nov 30, 2018

Running the tests directly reproduces it too, btw; you just need to kill them manually:

docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./node test/parallel/test-cluster-master-error.js

@Trott
Member

Trott commented Nov 30, 2018

(Unassigning refack. Don't want to discourage others from jumping in on this.)

@Trott
Member

Trott commented Nov 30, 2018

Since this appears to be a bug in cluster or in Alpine, I'm going to remove the CI/flaky test label too.

@Trott Trott removed the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Nov 30, 2018
@Trott
Member

Trott commented Nov 30, 2018

Both tests use common.isAlive(), and if that always returns true on this OS, then we'll see these tests time out.

Here's the source for common.isAlive():

function isAlive(pid) {
  try {
    process.kill(pid, 'SIGCONT');
    return true;
  } catch {
    return false;
  }
}

Any chance 'SIGCONT' does not behave here as it does on other operating systems?

@Trott
Member

Trott commented Nov 30, 2018

The addition of 'SIGCONT' via common.isAlive() was introduced by @jasnell in baa0ffdab37. Prior to that, it had been using 0 rather than SIGCONT. Might be worth seeing if changing 'SIGCONT' to 0 fixes these two tests in this environment. (It might break other tests, or make tests that should fail pass or something, though.)

EDIT: Changing to 0 seems to fix nothing...oh well...

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Nov 30, 2018
Trott added a commit to Trott/io.js that referenced this issue Nov 30, 2018
Use signal value `0` rather than `'SIGCONT'` as the latter appears to
behave unexpectedly on Alpine 3.8.

Fixes: nodejs#22308
@Trott Trott mentioned this issue Nov 30, 2018
@Trott
Member

Trott commented Nov 30, 2018

Probably too optimistic to think that #24756 will fix it without introducing other issues, but let's see...

EDIT: Indeed, too optimistic...didn't work...

@Trott
Member

Trott commented Nov 30, 2018

Whee! I have Docker installed and set up and I can replicate this. It's like I'm living in the FUTURE or something. Or at least the RECENT PAST. Can't do it right now, but will investigate more in a little bit if no one beats me to it (which: please do!).

@Trott
Member

Trott commented Nov 30, 2018

As far as I can tell, the worker really does exit, but in the master process, process._kill(pid, 'SIGCONT') does not return an error and ps -aux shows the worker process still in the table but "defunct":

iojs        19  3.7  0.0      0     0 pts/0    Z+   23:00   0:00 [node] <defunct>

This is, I guess, basically what refack noted on August 15: the worker is left as a zombie, and a zombie keeps its slot in the process table until something reaps it, so signalling it still succeeds.
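(One way to see the distinction on Linux, illustrative only and not what common.isAlive() does: a zombie still has an entry in /proc, but its state field is 'Z'.)

const fs = require('fs');

// Hypothetical helper: true only if `pid` exists and is not a zombie.
function isReallyAlive(pid) {
  try {
    const stat = fs.readFileSync(`/proc/${pid}/stat`, 'utf8');
    // Format is "pid (comm) state ..."; comm may contain spaces, so take the
    // character two positions after the last ')'.
    const state = stat[stat.lastIndexOf(')') + 2];
    return state !== 'Z';
  } catch {
    return false; // no such pid
  }
}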

@Trott
Member

Trott commented Nov 30, 2018

The PPID is process 1, which is the node process:

F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S  1000     1     0  4  80   0 - 64478 -      pts/0    00:00:00 node
0 Z  1000    19     1  1  80   0 -     0 -      pts/0    00:00:00 node <defunct>

@Trott
Member

Trott commented Nov 30, 2018

I think test-cluster-master-kill.js may be slightly misleadingly named. kill() doesn't really come into it. But I can see why it was named that. Here's what it does:

  • root process: uses child_process.fork() to create a node subprocess
  • subprocess: uses cluster to launch a worker
  • worker: runs an http server so that it will run forever if left on its own
  • subprocess: sends the pid of the worker to the root process
  • subprocess: exits once it knows the worker is running
  • root process: once it detects that the subprocess has exited, it polls the pid of the worker process and makes sure it exits. This is the part that fails because the polling detects the defunct/zombie process.

I'm not sure if this is a bug in Node.js or an artifact of the way the test is run as rvagg described above. Is there some system call Node.js should be making to let the OS know to remove the pid from the process table? (If so, why does it only matter in this one edge case?) Or is this just what happens when you kinda sorta bypass some normal operating system stuff?
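A condensed, hypothetical sketch of that structure (not the actual test source):

'use strict';
const { fork } = require('child_process');
const cluster = require('cluster');
const http = require('http');

if (process.argv[2] === 'child') {
  if (cluster.isMaster) {
    // subprocess: launch a worker, report its pid to the root process, exit.
    const worker = cluster.fork();
    worker.once('listening', () => {
      process.send({ workerPID: worker.process.pid });
      process.exit(0);
    });
  } else {
    // worker: an http server that would run forever if left on its own.
    http.createServer(() => {}).listen(0);
  }
} else {
  // root process: fork the subprocess, then poll the reported worker pid
  // until the probe throws, i.e. until the worker has really gone away.
  const child = fork(__filename, ['child']);
  let workerPID;
  child.on('message', (msg) => { workerPID = msg.workerPID; });
  child.on('exit', () => {
    const poll = setInterval(() => {
      try {
        process.kill(workerPID, 'SIGCONT'); // still there (or a zombie)...
      } catch {
        clearInterval(poll);                // ...gone for real
        console.log('worker exited as expected');
      }
    }, 50);
  });
}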

@nodejs/cluster @nodejs/child_process

@Trott
Member

Trott commented Nov 30, 2018

I guess probably not a terrible time to re-ping @nodejs/docker too...

@rvagg
Member Author

rvagg commented Dec 1, 2018

I bet some clever libuv folks might have a clue here too, since we're getting pretty low-level: @cjihrig @bnoordhuis

@Trott
Member

Trott commented Dec 1, 2018

@nodejs/libuv

@rvagg
Member Author

rvagg commented Dec 1, 2018

So I think this is all about the fact that the killed child processes aren't reaped properly because we don't have init or a similar reaping parent process in the chain. I believe I can fix this by putting bash as the parent to Jenkins in the container, something like:

CMD [ "bash", "-c", "cd /home/iojs \
  && curl https://ci.nodejs.org/jnlpJars/slave.jar -O \
  && java -Xmx{{ server_ram|default('128m') }} \
          -jar /home/{{ server_user }}/slave.jar \
          -jnlpUrl {{ jenkins_url }}/computer/{{ item.name }}/slave-agent.jnlp \
          -secret {{ item.secret }}" ]

Here's the bit I don't quite know the answer to: is this bypassing something that should be Node's responsibility? Why are we not experiencing this on any of our other Docker containers, where we do the same thing and execute Jenkins in the same way? I can't find anything special about Alpine 3.8 that would lead to different behaviour. I don't want to be putting a bandaid on something that's a genuine problem on our end.

@rvagg
Member Author

rvagg commented Dec 1, 2018

Actually, I solved a similar problem of non-reaping on the ARM machines running Docker by using --init, which starts a proper init at the root to deal with reaping. But my questions above still stand.

@rvagg
Member Author

rvagg commented Dec 1, 2018

--init seems to have done the trick in this instance https://ci.nodejs.org/job/node-test-commit-linux/nodes=alpine-latest-x64/

I wouldn't mind thoughts from folks more expert in system-level concerns on why this might be a unique problem on a specific distro+version, and whether it suggests problems on our end. Otherwise, we can probably close this and wait to see if we get issues reported about it.

rvagg added a commit to nodejs/build that referenced this issue Dec 1, 2018
ARM Docker use already has this; this expands it to the rest of our Docker usage. It helps with reaping defunct processes when running bare commands (i.e. Jenkins).

Ref: nodejs/node#22308
@rvagg rvagg closed this as completed Dec 4, 2018
@sam-github
Contributor

Docker containers not having a functional init is a common source of problems, at least when the container runs code that creates sub-sub-processes that don't get waited on by the sub-process. I recall reviewing C code for a mini-reaper by @rmg that was used as a runner. I wonder if the --init option is newish? Providing it seems like the right thing to do for containers with complex sub-process management, no matter what the distro.

Hypothetically, the order of process termination can vary: if the sub-process is allowed to run just marginally past the sub-sub-process's death, it gets the chance to wait on its child processes and they won't become orphaned. If the sub-process terminates before the sub-sub-process exit statuses are available, they become orphaned and are reparented to init, so perhaps process run/scheduling differences are what we're seeing.
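A hypothetical sketch of that race (not code from the thread), assuming it runs as the top-level command of a container without a reaping init:

const { spawn } = require('child_process');

if (process.argv[2] === 'middle') {
  // middle process: start a grandchild that outlives us, then exit at once,
  // so we never get the chance to wait() on it.
  spawn(process.execPath, ['-e', 'setTimeout(() => {}, 1000)'],
        { detached: true, stdio: 'ignore' }).unref();
  process.exit(0);
} else {
  // top process (PID 1 in the container): when the middle process dies, the
  // grandchild is reparented to us; since we never wait() on processes we did
  // not spawn, it lingers as <defunct> in `ps` once it exits.
  const middle = spawn(process.execPath, [__filename, 'middle'], { stdio: 'inherit' });
  middle.on('exit', () => {
    setTimeout(() => {}, 3000); // stay alive long enough to observe the zombie
  });
}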

@rvagg
Member Author

rvagg commented Dec 5, 2018

I can buy that as an explanation, I suppose. It's just strange that we're only seeing this on one of the dockerised platforms. Granted, Ubuntu 16.04 is used for the majority, but we've also had Alpine in there for a while now without seeing this. Maybe it's a minor musl change that's impacted timing in some subtle but reliable way.
