sanity.functional cmdLineTester_pltest_0 j9vmem_test_numa (docker?) test failure #2143

lumpfish opened this issue Jan 6, 2021 · 13 comments

lumpfish commented Jan 6, 2021

Subtest j9vmem_test_numa within the test target cmdLineTester_pltest_0 fails with:

18:24:59   [ERR] Starting test j9vmem_test_numa
18:24:59   [ERR] j9vmemTest.c line 1810: j9vmem_test_numa Found zero nodes with memory even after NUMA was reported as supported (FAIL)
18:24:59   [ERR] 
18:24:59   [ERR] 		LastErrorNumber: -252
18:24:59   [ERR] 		LastErrorMessage: Unknown error -1
18:24:59   [ERR] 
18:24:59   [ERR] Ending test j9vmem_test_numa

jdk version under test:

17:45:47  openjdk version "1.8.0_282-internal"
17:45:47  OpenJDK Runtime Environment (build 1.8.0_282-internal-202101051708-b07)
17:45:47  Eclipse OpenJ9 VM (build master-1910cfa3a, JRE 1.8.0 Linux amd64-64-Bit 20210105_860 (JIT enabled, AOT enabled)
17:45:47  OpenJ9   - 1910cfa3a
17:45:47  OMR      - a9b64bdc8
17:45:47  JCL      - 722ab284 based on jdk8u282-b07)
17:45:47  =JAVA VERSION OUTPUT END=

Failing test machine: docker-packet-ubuntu2004-x64-1f1.
The test passed on test-ibmcloud-rhel6-x64-1 - maybe the failure is related to running in docker? (https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.functional_x86-64_linux/63/)

Link to failing job: https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.functional_x86-64_linux_xl/68/

@pshipton

The API code is https://github.ibm.com/runtimes/openj9-omr/blob/ibm_sdk/port/linux/omrvmem.c#L2081
It also determines the global memory policy https://github.ibm.com/runtimes/openj9-omr/blob/ibm_sdk/port/linux/omrvmem.c#L218
and node_bits https://github.ibm.com/runtimes/openj9-omr/blob/ibm_sdk/port/linux/omrvmem.c#L251-L279

The API code reads from /sys/devices/system/node, looking for node<num> (i.e. node0, node1, etc.) sub-directories. Inside each of these directories there should be a meminfo file containing a line such as Node 0 MemTotal: 16431076 kB.
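
For illustration, here is a minimal standalone sketch of that scan (this is not the actual omrvmem.c code, just the mechanism it relies on - scan /sys/devices/system/node for node<num> sub-directories and read each node's meminfo):

    #include <dirent.h>
    #include <stdio.h>

    /* Count node<N> directories under /sys/devices/system/node whose
     * meminfo reports a non-zero "Node <N> MemTotal:" value. */
    static int countNodesWithMemory(void)
    {
        int nodesWithMemory = 0;
        struct dirent *entry = NULL;
        DIR *dir = opendir("/sys/devices/system/node");

        if (NULL == dir) {
            return 0;
        }
        while (NULL != (entry = readdir(dir))) {
            unsigned long nodeNum = 0;
            unsigned long memTotalKB = 0;
            char path[256];
            char line[256];
            FILE *meminfo = NULL;

            if (1 != sscanf(entry->d_name, "node%lu", &nodeNum)) {
                continue; /* not a node<num> sub-directory */
            }
            snprintf(path, sizeof(path), "/sys/devices/system/node/%s/meminfo", entry->d_name);
            meminfo = fopen(path, "r");
            if (NULL == meminfo) {
                continue;
            }
            while (NULL != fgets(line, sizeof(line), meminfo)) {
                if (1 == sscanf(line, "Node %*u MemTotal: %lu kB", &memTotalKB)) {
                    if (memTotalKB > 0) {
                        nodesWithMemory += 1;
                    }
                    break;
                }
            }
            fclose(meminfo);
        }
        closedir(dir);
        return nodesWithMemory;
    }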

The API sets the state of each node based on the global memory policy and whether the node has memory.

The test code is https://github.com/eclipse/openj9/blob/master/runtime/tests/port/j9vmemTest.c#L1781-L1810

It looks for nodes which have the PREFERRED or ALLOWED policy. If there are none, the test fails.
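
The failing assertion then boils down to something like the following (a paraphrase of the test loop - the struct and enum here are illustrative stand-ins, not the real j9vmemTest.c types):

    #include <stddef.h>

    /* Illustrative stand-ins for the node-detail data the port library returns. */
    typedef enum { NODE_DENIED, NODE_ALLOWED, NODE_PREFERRED } NodePolicy;
    typedef struct { NodePolicy memoryPolicy; } NodeDetail;

    static int numaNodeCheckPasses(const NodeDetail *nodes, size_t nodeCount)
    {
        size_t nodesWithMemory = 0;
        size_t i = 0;

        for (i = 0; i < nodeCount; i++) {
            if ((NODE_PREFERRED == nodes[i].memoryPolicy) || (NODE_ALLOWED == nodes[i].memoryPolicy)) {
                nodesWithMemory += 1;
            }
        }
        /* zero such nodes => "Found zero nodes with memory even after NUMA was reported as supported" */
        return nodesWithMemory > 0;
    }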

@sophia-guo

Is this intermittent? I see that recent builds running on testc-packet-fedora33-amd-2 or on non-docker machines passed.

https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.functional_x86-64_linux_xl/73/
https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.functional_x86-64_linux_xl/

@smlambert

Failing test machine: docker-packet-ubuntu2004-x64-1f1.

but I do not know if it's intermittent on that static docker instance.

@sophia-guo

I didn't see a machine named docker-packet-ubuntu2004-x64-1f1? I thought all the testc* machines were docker instances?

@smlambert

@sxa - you are the most likely to know the answer to #2143 (comment)

sxa commented Jan 14, 2021

I moved it to another host - replace x64 with amd in the name and you have effectively the same system

@pshipton

I've created a pltest binary containing debug code and sent it to Stewart to try on the docker instance.

sxa commented Jan 14, 2021

On naming - yes, it's a docker container. I've been experimenting with formats, so there are several that can indicate it's docker (adoptium/infrastructure#1809 (comment)) ... Open to suggestions/comments on the preferred format - I'll cover them all later too 😊

pshipton commented Jan 18, 2021

The test is failing because there are 2 or more nodes detected (in /sys/devices/system/node, like node0, node1, etc.), but the get_mempolicy() API call is failing with EPERM.

Note the behavior is explicit in the VM code. It sets a variable PPG_numaSyscallNotAllowed, and when this variable is true, vmem_numa_get_node_details() sets all the nodes to J9NUMA_DENIED (which results in the "Found zero nodes" test failure).
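
To see the same failure outside the VM, the call can be reproduced with a small standalone program (a sketch using the libnuma get_mempolicy() wrapper, not the actual omrvmem.c code; build with -lnuma):

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int mode = 0;
        unsigned long nodeMask[16];

        memset(nodeMask, 0, sizeof(nodeMask));
        /* Query the calling thread's default memory policy. In the failing
         * container this returns -1 with errno == EPERM. */
        if (-1 == get_mempolicy(&mode, nodeMask, sizeof(nodeMask) * 8, NULL, 0)) {
            if (EPERM == errno) {
                printf("get_mempolicy: EPERM - the VM marks every node J9NUMA_DENIED in this case\n");
            } else {
                perror("get_mempolicy");
            }
            return 1;
        }
        printf("default memory policy mode = %d\n", mode);
        return 0;
    }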

sxa commented Jan 18, 2021

OK, the two AMD EPYC systems I have had slightly different CPUs. NUMA isn't available on the EPYC 7402P, but it is on the EPYC 7401P (and so the test fails there). (Memo to self: BOS2 cell.)

Here is the strace of Peter's standalone pltest from the affected machine:
EPYC7401Pstrace.log.gz

It includes:

get_mempolicy(0x146fdd0, 0x146fd50, 1024, NULL, 0) = -1 EPERM (Operation not permitted)

numactl -s also gives get_mempolicy: Operation not permitted errors.

It can be resolved by starting the container with --privileged, and possibly by other adjustments to the cgroup controls that restrict the container.

sxa commented Jan 18, 2021

This can also be resolved with --cap-add=sys_nice on the docker run command line, which allows the container access to some control over the CPUs and may be a suitable option for our test environment if this recurs.
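
For example (the image name and command here are placeholders, not our actual test container setup):

    docker run --cap-add=sys_nice <image> <command>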

Reference: https://docs.docker.com/config/containers/resource_constraints/

sxa commented Apr 14, 2021

I think I only have one NUMA-capable host system (a 2.2GHz Intel Xeon Gold 5120). It currently has six docker containers on it. I've confirmed (by starting another instance of the Fedora33 docker image with --cap-add=sys_nice and re-running the test) that the test appears to pass there, so to resolve this I'll need to quiesce the containers, shut them all down, and restart them with that option.

I'd suggest there's still a question over whether this is the right thing to do - in the EPERM case, should we trap it and default to assuming that we're not on a NUMA-capable system? Presumably it's something an end user could hit when running under a default docker setup on a NUMA-based system.
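
In other words, the suggestion is roughly the following (a sketch of the proposed fallback, not existing VM code - the function name and structure are illustrative, and it would sit wherever the port library currently records PPG_numaSyscallNotAllowed):

    #include <errno.h>
    #include <numaif.h>   /* get_mempolicy(); link with -lnuma */

    /* Illustrative: if NUMA policy queries are blocked (EPERM, e.g. a default
     * docker setup), report the platform as non-NUMA instead of leaving every
     * node marked J9NUMA_DENIED and failing later. */
    static int numaPolicyQueriesPermitted(void)
    {
        int mode = 0;
        unsigned long nodeMask[16] = {0};

        if (-1 == get_mempolicy(&mode, nodeMask, sizeof(nodeMask) * 8, NULL, 0)) {
            if (EPERM == errno) {
                return 0; /* trap it: behave as if not on a NUMA-capable system */
            }
            return 0; /* any other failure: also treat as non-NUMA */
        }
        return 1;
    }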

sxa commented Apr 14, 2021

Although there were more containers deployed on the machine, only three have been active in Jenkins, so I have restarted and tested those: #87 #88 #89. The outstanding discussion is whether we want to leave the machines with the sys_nice option, or modify the VM code to trap and handle the situation.
