AArch64 CI tests: qemu hits memory limit and fails with SIMD tests enabled #1893

Closed
akirilov-arm opened this issue Jun 17, 2020 · 6 comments · Fixed by #1895

Comments

@akirilov-arm
Contributor

The AArch64 CI test that runs using QEMU fails consistently for PR #1871 and the reasons are not clear - here's the relevant excerpt from the log:

2020-06-13T16:29:49.3730503Z test wast::Cranelift::spec::simd::simd_i32x4_cmp ... ok
2020-06-13T16:29:57.9345959Z test wast::Cranelift::spec::simd::simd_i8x16_sat_arith ... ignored
2020-06-13T16:30:08.5287111Z test wast::Cranelift::spec::simd::simd_lane ... ignored
2020-06-13T16:30:15.8261749Z test wast::Cranelift::spec::simd::simd_load ... ignored
2020-06-13T16:49:23.7624987Z error: test failed, to rerun pass '-p wasmtime-cli --test all'
2020-06-13T16:49:23.7648421Z 
2020-06-13T16:49:23.7651248Z Caused by:
2020-06-13T16:49:23.7664954Z   process didn't exit successfully: `/home/runner/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu /home/runner/work/wasmtime/wasmtime/target/aarch64-unknown-linux-gnu/release/deps/all-0af4aa3748ec4770` (signal: 9, SIGKILL: kill)
2020-06-13T16:49:24.0613948Z ##[error]Process completed with exit code 101.
2020-06-13T16:49:25.4620071Z Post job cleanup.

I have reproduced the test environment locally using the following commands:

rm -rf qemu-5.0.0 ${HOME}/qemu
curl https://download.qemu.org/qemu-5.0.0.tar.xz | tar xJf -
cd qemu-5.0.0
./configure --target-list=aarch64-linux-user --prefix=${HOME}/qemu --disable-tools --disable-slirp --disable-fdt --disable-capstone --disable-docs
make -j$(nproc) install
cd ..
RUSTFLAGS="-D warnings" \
  CARGO_INCREMENTAL=0 \
  CARGO_PROFILE_DEV_DEBUG=1 \
  CARGO_PROFILE_TEST_DEBUG=1 \
  CARGO_BUILD_TARGET=aarch64-unknown-linux-gnu \
  CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_RUNNER="${HOME}/qemu/bin/qemu-aarch64 -L /usr/aarch64-linux-gnu" \
  CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
  RUST_BACKTRACE=1 \
  cargo test \
  --features test-programs/test_programs \
  --release \
  --all \
  --exclude lightbeam \
  --exclude peepmatic \
  --exclude peepmatic-automata \
  --exclude peepmatic-fuzzing \
  --exclude peepmatic-macro \
  --exclude peepmatic-runtime \
  --exclude peepmatic-test \
  --exclude wasmtime-fuzz

However, I don't experience any test failures. In addition to that, I don't see any issues either when I run the test natively in an AArch64 environment. In that case the list of commands can be simplified to:

cargo test --release --all --exclude lightbeam

Note that the --features test-programs/test_programs parameter is omitted because it requires rust-lld, which appears not to be a part of the native AArch64 toolchain.

This issue has also been discussed in PR #1802.

cc @cfallin

@cfallin
Member

cfallin commented Jun 17, 2020

I suspect a qemu issue, as @alexcrichton had said earlier; it's too bad that upgrading to 5.0.0 didn't fix it.

I wonder if we could transition to running CI jobs on our native aarch64 machine, now that we have one -- @alexcrichton, thoughts (I think GitHub has a native-CI-runner feature)?

@alexcrichton
Member

Locally I ran the test suite in qemu 5.0.0 and I saw the peak memory usage jump by ~1GB after applying #1871. This is the peak memory usage of QEMU itself when running the test suite. 10GB is already pretty huge; for comparison, running the all-* test suite natively takes 200MB.

I ran a small test on GitHub Actions CI and found that a program could allocate a 10687086592-byte (9.95 GiB) vector but would fail to allocate 10791944192 bytes (10.05 GiB). Similarly, in local testing (according to /usr/bin/time), the all-* test suite in qemu peaked at 10129944k bytes (9.6 GiB) before enabling this test and at 11286384k (10.7 GiB) after. My test program was killed by SIGKILL on GitHub Actions as well.
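
An allocation probe of the kind described above could look roughly like the following Rust sketch. It is not the program that was actually run on GitHub Actions: the names `probe_allocation` and `to_gib` and the 4096-byte page size are assumptions made for illustration. The idea is to reserve the requested number of bytes and then write to every page, so that the kernel has to back the allocation with real memory.

```rust
/// Convert a byte count to GiB (2^30 bytes) for reporting.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

/// Try to allocate `bytes` bytes and touch every page.
/// Returns `false` if the allocator refuses the request.
/// Note: near a hard memory limit the process may be killed with
/// SIGKILL while touching pages instead of returning `false`.
fn probe_allocation(bytes: usize) -> bool {
    const PAGE: usize = 4096; // assumed page size, for illustration only

    let mut buf: Vec<u8> = Vec::new();
    if buf.try_reserve_exact(bytes).is_err() {
        return false;
    }
    // Zero-fill, then write one byte per page to force the pages in.
    buf.resize(bytes, 0);
    for i in (0..bytes).step_by(PAGE) {
        buf[i] = 1;
    }
    // Read the markers back so the writes cannot be optimized away.
    let touched: usize = buf.iter().step_by(PAGE).map(|&b| b as usize).sum();
    touched == (bytes + PAGE - 1) / PAGE
}
```

Calling `probe_allocation` with increasing sizes and printing `to_gib` of the last size that succeeded gives a rough estimate of the limit; the two sizes quoted above correspond to 10192 MiB and 10292 MiB.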

Given that, this doesn't feel like a bug in QEMU other than "maybe too much memory is used?", and it seems like we're just hitting OOM on CI. It appears that if we cross the 10GiB threshold for allocated memory we get OOM-killed. That would explain why it's not an issue locally either, because we presumably have lots more RAM and/or less aggressive OOM killers.

In terms of fixing this, that may be a bit harder. Some options include:

  • Move to native AArch64 CI. This is unfortunately pretty tricky to do, and it boils down to GitHub recommending that we don't do this. There are possible workarounds we could apply (rust-lang/rust is pioneering this, and we'll likely just copy them). This will take some time though, and rust-lang/rust is still in the process of working out all the various issues.

  • Split apart our test suite. I suspect the issue is that QEMU isn't freeing something it should, so we could run fewer tests inside of a single QEMU process. Unfortunately I don't know of a great way to do this automatically. Ironically we actually unified our test suite for other CI-related issues. Our binaries are quite large, so we can't have dozens of test binaries since that'll blow our disk limit.

  • There's experimental support on nightly where each test is run in a forked process, which we may be able to try out. I'm not holding my breath for this though.

  • Split just the execution of the test suite by having a "driver program" which executes the test suite with --list and then manually splits that list into shards and runs the test executable multiple times with --exact options and a list of test names.

None of these AFAIK are easy-ish things to do, unfortunately... I suppose there's the option of writing fewer tests :)
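
The last option in the list (a "driver program" that shards the test list) can be sketched in Rust as below. This is only an illustration under stated assumptions, not an existing tool: the function names are hypothetical, the sketch assumes that `--list` prints one line per test ending in `: test`, and it assumes that the test executable accepts several test names after `--exact`, as the option describes. Actually spawning the executable under QEMU is left out.

```rust
/// Extract test names from the output of running the test binary with
/// `--list`. Assumes each test is printed as `<name>: test`; any other
/// line (blank lines, the summary line, benchmarks) is ignored.
fn parse_test_list(list_output: &str) -> Vec<String> {
    list_output
        .lines()
        .filter_map(|line| line.strip_suffix(": test"))
        .map(|name| name.trim().to_string())
        .collect()
}

/// Distribute test names round-robin over `shard_count` shards.
fn split_into_shards(names: &[String], shard_count: usize) -> Vec<Vec<String>> {
    assert!(shard_count > 0, "need at least one shard");
    let mut shards = vec![Vec::new(); shard_count];
    for (i, name) in names.iter().enumerate() {
        shards[i % shard_count].push(name.clone());
    }
    shards
}

/// Build the arguments for one run of the test executable:
/// `--exact` followed by the names in the shard.
fn shard_args(shard: &[String]) -> Vec<String> {
    let mut args = vec!["--exact".to_string()];
    args.extend(shard.iter().cloned());
    args
}
```

Each shard would then be run as a separate process, so that whatever memory QEMU accumulates is released between shards. Round-robin splitting is just the simplest choice here; it does not try to balance memory usage between shards.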

@cfallin
Member

cfallin commented Jun 17, 2020

Hmm. Just now I went down a small rabbit-hole trying to work out if there's a way to reduce the translation cache size for qemu's JIT, in case that's the issue. Unfortunately it seems there's only -accel tb-size=... for system-mode qemu, but not user-mode qemu. (Anyone else know another option?)

Another option to add to the above list would be "fix qemu's memory blowup". Unfortunately that doesn't seem a whole lot easier than the other options, but who knows, maybe it's a quick fix once found.

@akirilov-arm: for now, while we develop aarch64 SIMD support, I think it's reasonable to keep the SIMD tests specifically disabled in-tree, in the absence of better options. (We should be careful to run tests locally on a native aarch64 machine, of course.) We'll have to find a better solution before declaring SIMD "done", though.

I'll go ahead and rename this issue to track the qemu memory blowup (which is the root problem), if you don't mind. Sorry again about our CI wonkiness!

cfallin changed the title from "AArch64 CI test failure" to "AArch64 CI tests: qemu hits memory limit and fails with SIMD tests enabled" Jun 17, 2020

@akirilov-arm
Contributor Author

@cfallin What is your preference with respect to opening PRs implementing AArch64 functionality: don't enable any relevant tests, but document their names in the description, so that people may run them manually, or enable all relevant tests, but disable them afterwards in case of CI failures (whose cause seems to be running out of memory)? I like the second option more. We have already merged a couple of changes after I tried to push the first iteration of #1871, so evidently it works.

Honestly, it's a little bit bizarre that the spec::simd::simd_align test triggers the issue, because from a quick look there is nothing special about it, with one exception: it has the highest number of linear memory definitions of all SIMD tests (just run grep -R '(memory' tests/spec_testsuite/proposals/simd | cut -d: -f1 | sort | uniq -c | sort -rn). In fact, it has more than the next 5 tests combined:

     92 tests/spec_testsuite/proposals/simd/simd_align.wast
     22 tests/spec_testsuite/proposals/simd/simd_load.wast
     20 tests/spec_testsuite/proposals/simd/simd_load_extend.wast
     16 tests/spec_testsuite/proposals/simd/simd_bit_shift.wast
     14 tests/spec_testsuite/proposals/simd/simd_load_splat.wast
     12 tests/spec_testsuite/proposals/simd/simd_i32x4_arith2.wast

On the other hand I have the feeling that we may run out of luck soon and start seeing consistent failures with any test.
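
For anyone who would rather do the count above from Rust than from the shell, here is a minimal sketch of the same idea. The function names are hypothetical, and reading the `.wast` files from disk is left out. Like the `grep` pipeline, it counts lines containing `(memory`, so it is a rough proxy for the number of memory definitions rather than a real parser (a line containing `(memory.grow`, for example, would also match).

```rust
/// Count the lines of a .wast source that contain `(memory`,
/// mirroring what `grep '(memory'` reports for one file.
fn count_memory_definitions(wast_source: &str) -> usize {
    wast_source
        .lines()
        .filter(|line| line.contains("(memory"))
        .count()
}

/// Rank (file name, source) pairs by that count, highest first,
/// breaking ties by file name.
fn rank_by_memory_definitions(files: &[(&str, &str)]) -> Vec<(usize, String)> {
    let mut ranked: Vec<(usize, String)> = files
        .iter()
        .map(|&(name, source)| (count_memory_definitions(source), name.to_string()))
        .collect();
    ranked.sort_by(|a, b| b.0.cmp(&a.0).then_with(|| a.1.cmp(&b.1)));
    ranked
}
```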

cc @jgouly

@cfallin
Member

cfallin commented Jun 18, 2020

enable all relevant tests, but disable them afterwards in case of CI failures (whose cause seems to be running out of memory)?

Yes, I think this is the best option -- let's do this for now, and reference this issue when we have to disable a test to get a green CI to merge.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Jun 18, 2020
This commit disables the usage of "static" memory on CI and instead
forces all memories to be "dynamic" meaning that they reserve much
smaller chunks of memory. This causes the QEMU process's memory to
drastically drop (10GiB -> 600MiB) and should allow us to keep enabling
tests without hitting the OOM killer on CI.

Closes bytecodealliance#1871 (includes that)
Closes bytecodealliance#1893
@alexcrichton
Member

it has the highest number of linear memory definitions of all SIMD tests

Whoa nice find, that gives me an idea and testing locally it drastically reduces the memory usage of qemu (10GB -> 600MB). I think that means we can fix our CI quite easily actually!

alexcrichton added a commit that referenced this issue Jun 18, 2020
* Enable the spec::simd::simd_align test for AArch64

Copyright (c) 2020, Arm Limited.

* Disable static memory under QEMU on CI

This commit disables the usage of "static" memory on CI and instead
forces all memories to be "dynamic" meaning that they reserve much
smaller chunks of memory. This causes the QEMU process's memory to
drastically drop (10GiB -> 600MiB) and should allow us to keep enabling
tests without hitting the OOM killer on CI.

Closes #1871 (includes that)
Closes #1893

* Fix typo

Co-authored-by: Anton Kirilov <[email protected]>