[CI] Intermittent failure in abstractarray on aarch64-linux-gnu #52434

Closed
giordano opened this issue Dec 6, 2023 · 8 comments
Labels: ci (Continuous integration), system:arm (ARMv7 and AArch64), system:linux (Affects only Linux)

Comments

@giordano
Contributor

giordano commented Dec 6, 2023

Example:

Error in testset abstractarray:
Error During Test at none:1
  Got exception outside of a @test
  ProcessExitedException(8)
  Stacktrace:
    [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
      @ Base ./task.jl:935
    [2] wait()
      @ Base ./task.jl:999
    [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
      @ Base ./condition.jl:130
    [4] wait
      @ Base ./condition.jl:125 [inlined]
    [5] take_buffered(c::Channel{Any})
      @ Base ./channels.jl:477
    [6] take!(c::Channel{Any})
      @ Base ./channels.jl:471
    [7] take!(::Distributed.RemoteValue)
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:726
    [8] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:461
    [9] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String}; kwargs::@Kwargs{seed::UInt128})
      @ Distributed /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/stdlib/v1.11/Distributed/src/remotecall.jl:492
   [10] (::var"#37#47"{Vector{Task}, var"#print_testworker_errored#43"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#41"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
      @ Main /cache/build/default-armageddon-0/julialang/julia-master/julia-bdbee27ae7/share/julia/test/runtests.jl:258

This is the only error I've seen lately on this platform. It's intermittent, but quite frequent.

giordano added the system:linux, system:arm, and ci labels on Dec 6, 2023
@giordano
Contributor Author

During the ci-dev call we debugged this a little bit. A related error

SharedArrays                                      (1) |        started at 2024-01-29T18:35:21.628
      From worker 17:
      From worker 17:	[13757] signal 7 (2): Bus error
      From worker 17:	in expression starting at none:0
      From worker 16:
      From worker 16:	[13756] signal 7 (2): Bus error
      From worker 16:	in expression starting at none:0
      From worker 17:	setindex! at ./array.jl:972 [inlined]
      From worker 17:	setindex! at ./subarray.jl:403 [inlined]
      From worker 17:	map! at ./abstractarray.jl:3289 [inlined]
      From worker 17:	#67 at /cache/build/default-armageddon-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/SharedArrays/src/SharedArrays.jl:548

can be reproduced on the build machine with:

using SharedArrays
TR = Float64
dims = (10,)
SharedArray{TR,length(dims)}(dims; init = S -> (@show S.loc_subarr_1d[1]))

There appears to be a crash in segv_handler.

@staticfloat
Sponsor Member

#ci-dev looked into this, and the following is a (more minimal) reproducer:

using SharedArrays

# Arbitrary
dims = (10,)
T = Float64

# Create shared memory, truncate to the right size
fd_mem = SharedArrays.shm_open("/foo", SharedArrays.JL_O_CREAT | SharedArrays.JL_O_RDWR, SharedArrays.S_IRUSR | SharedArrays.S_IWUSR)
s = SharedArrays.fdio(fd_mem, true)
rc = ccall(:jl_ftruncate, Cint, (Cint, Int64), fd_mem, prod(dims)*sizeof(T))

# Ensure that shared memory file exists
run(`ls -la /dev/shm/foo`)

# mmap it and attempt to dereference
A = SharedArrays.mmap(s, Array{T, length(dims)}, dims, zero(Int64); grow=false);
A[1] # <-- dies with SIGBUS

@staticfloat
Sponsor Member

Confirmed that the equivalent C program fails in the same way:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>           /* For O_* constants */
#include <sys/stat.h>        /* For mode constants */
#include <sys/mman.h>        /* For shared memory */
#include <unistd.h>          /* For ftruncate */
#include <string.h>          /* For strlen */

int main() {
    const char *name = "/my_shared_memory"; // Name of the shared memory object
    const char *message = "Hello, Shared Memory!"; // Message to be written
    int shm_fd;     // File descriptor of the shared memory
    void *ptr;      // Pointer to the shared memory

    // Create the shared memory object
    shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (shm_fd == -1) {
        perror("Error creating shared memory");
        return EXIT_FAILURE;
    }

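    // Configure the size of the shared memory object; a failed ftruncate
    // would leave the object empty, so check the result
    if (ftruncate(shm_fd, 4096) == -1) {
        perror("Error sizing shared memory");
        return EXIT_FAILURE;
    }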

    // Memory map the shared memory object
    ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    if (ptr == MAP_FAILED) {
        perror("Error mapping shared memory");
        return EXIT_FAILURE;
    }

    // Write to the shared memory object; ptr is left unchanged so that
    // munmap below receives the original mapping address
    sprintf(ptr, "%s", message);

    // Now, the memory contains "Hello, Shared Memory!"
    printf("Data written to shared memory: %s\n", (char *)ptr);

    // Unmap the shared memory
    munmap(ptr, 4096);

    // Close the shared memory object
    close(shm_fd);

    // Optionally, remove the shared memory object
    // shm_unlink(name);

    return EXIT_SUCCESS;
}

(Many thanks to ChatGPT)
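A variant of the reproducer that reports the failure instead of dying can help separate two cases: POSIX specifies that touching a mapped page that lies wholly past the end of the backing object raises SIGBUS, so a silently failed `ftruncate` would look the same as the crash described above. The sketch below checks the object size with `fstat` and catches SIGBUS around a single write. The function name `probe_shm_write` and the object name used in the usage note are hypothetical, and this is not what was run on the build machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

/* SIGBUS handler: jump back to the probe instead of terminating. */
static void on_sigbus(int sig) {
    (void)sig;
    siglongjmp(bus_jmp, 1);
}

/* Hypothetical helper. Returns 0 if the first byte of a fresh shared-memory
 * mapping can be written, 1 if the write raised SIGBUS, and -1 if a setup
 * call failed or the object is smaller than requested. */
int probe_shm_write(const char *name, size_t len) {
    int result = -1;
    void *ptr = MAP_FAILED;
    struct stat st;
    struct sigaction sa, old_sa;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return -1; }

    if (ftruncate(fd, (off_t)len) == -1) { perror("ftruncate"); goto out; }
    if (fstat(fd, &st) == -1) { perror("fstat"); goto out; }
    if ((size_t)st.st_size < len) {
        /* The object is too short: a SIGBUS here would be expected behaviour. */
        fprintf(stderr, "object is only %lld bytes\n", (long long)st.st_size);
        goto out;
    }

    ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) { perror("mmap"); goto out; }

    /* Install the handler only for the duration of the probe. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old_sa);

    if (sigsetjmp(bus_jmp, 1) == 0) {
        *(volatile char *)ptr = 'x';   /* first touch of the mapping */
        result = 0;
    } else {
        result = 1;                    /* arrived here via on_sigbus */
    }
    sigaction(SIGBUS, &old_sa, NULL);

out:
    if (ptr != MAP_FAILED) munmap(ptr, len);
    close(fd);
    shm_unlink(name);
    return result;
}
```

Calling `probe_shm_write("/probe_demo", 4096)` from a small `main` should return 0 on a healthy system. A return value of 1 with a correctly sized object would point away from the reproducer itself and towards the platform. On older glibc versions, `shm_open` lives in librt, so link with `-lrt`.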

We assume that this is now a kernel or glibc bug (currently running v5.4 and v2.17, respectively), and we may need to upgrade our buildbot to a newer version. It is easier to try upgrading the kernel (ironically) as we supposedly support glibc v2.17+, so we should first try a newer kernel with the current rootfs images and see what happens.
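Since the suspicion falls on the kernel and glibc versions, it can be handy to record both from the same process that runs the reproducer. A minimal sketch, assuming a glibc system (`gnu_get_libc_version` is a glibc extension); the helper name `describe_platform` is hypothetical:

```c
#include <gnu/libc-version.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical helper. Writes "kernel <release>, glibc <version>" into buf.
 * Returns 0 on success, -1 if uname fails or buf is too small. */
int describe_platform(char *buf, size_t buflen) {
    struct utsname u;
    if (uname(&u) == -1)
        return -1;
    int n = snprintf(buf, buflen, "kernel %s, glibc %s",
                     u.release, gnu_get_libc_version());
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Printing the resulting string next to the reproducer's output makes it easier to compare a failing machine with one where the crash does not occur.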

@giordano
Contributor Author

giordano commented Jan 29, 2024

For the record, I couldn't reproduce the crash with either the Julia or the C reproducer on any aarch64-linux-gnu system I have access to (the C example runs fine and prints Data written to shared memory: Hello, Shared Memory!), but they typically have glibc newer than 2.19, so this smells a bit like a glibc bug.

@staticfloat
Sponsor Member

Confirmed that updating the kernel from v5.4 -> v5.15 has solved this, so presumably this was a kernel bug! Huzzah!

Our aarch64-linux worker has been updated, so this should be fixed now. Will leave open until tests confirm this.

@staticfloat
Sponsor Member

Update: looks like we solved the SharedArrays problem, but the abstractarray test is still failing 😂

@vchuravy
Member

To quote myself from the meeting:

I don't care about Shared Arrays

;)

ViralBShah changed the title from "[CI] Intermitted failure in abstractarray on aarch64-linux-gnu" to "[CI] Intermittent failure in abstractarray on aarch64-linux-gnu" on Feb 16, 2024
@staticfloat
Sponsor Member

Looks like this was fixed by #54718!
