Compiled executable fails to launch when built with AVX and LTO enabled #44056

yvt · 2017-08-23T05:10:58Z

A generated executable occasionally fails to launch when built with the rustc options -Ctarget-feature=+avx -Copt-level=2 -Clto.

I tried this code:

fn main(){}

Compiled with the following shell script:

#!/bin/sh
rustc main.rs -Ctarget-feature=+avx -Copt-level=2 -Clto -g

When I ran the generated executable main repeatedly, the execution of the program stalled (did not terminate nor output anything; did not even enter the main function) 5 out of 100 times.

When I ran the executable from lldb, I could see that EXC_BAD_ACCESS had occurred because it attempted to load a 32-byte block from an unaligned address using vmovdqa (which requires the operand address to be 32-byte aligned).

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100000bf6 main`main + 518
main`main:
->  0x100000bf6 <+518>: vmovdqa (%rax), %ymm0
    0x100000bfa <+522>: movl   $0x1, %ecx
    0x100000bff <+527>: vmovq  %rcx, %xmm1
    0x100000c04 <+532>: vmovdqa %ymm1, (%rax)
(lldb) register read
General Purpose Registers:
       rax = 0x0000000100300470
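
The rax value above can be checked against the alignment that vmovdqa requires. The following is a minimal sketch; the helper name `misalignment` is made up for illustration and is not part of the original report.

```rust
/// Returns how many bytes `addr` lies past the previous `align`-byte boundary.
/// `align` must be a power of two. (Hypothetical helper, for illustration only.)
fn misalignment(addr: u64, align: u64) -> u64 {
    assert!(align.is_power_of_two());
    addr & (align - 1)
}

fn main() {
    // Value of rax taken from the lldb register dump above.
    let rax: u64 = 0x0000_0001_0030_0470;
    println!("offset from 16-byte boundary: {}", misalignment(rax, 16)); // prints 0
    println!("offset from 32-byte boundary: {}", misalignment(rax, 32)); // prints 16
}
```

The address is 16-byte aligned but sits 16 bytes past a 32-byte boundary, which is consistent with the fault on the 32-byte aligned load.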

Analysis

The offending instruction is supposedly a part of libcore::ptr::swap_nonoverlapping_bytes, which is called during the execution of libstd::thread::local::LocalKey::init, which is called when the runtime is being initialized.

#[inline]
unsafe fn swap_nonoverlapping_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // <snip>
    #[cfg_attr(not(any(target_os = "emscripten", target_os = "redox",
                       target_endian = "big")),
               repr(simd))]
    struct Block(u64, u64, u64, u64);
    // <snip>
        // Swap a block of bytes of x & y, using t as a temporary buffer
        // This should be optimized into efficient SIMD operations where available
        copy_nonoverlapping(x, t, block_size); // <--- HERE
    // <snip>
}

After the optimization, this call to the intrinsic function copy_nonoverlapping is translated into the following LLVM instruction:

%t.0.copyload.i.i.i.i.i.i.i.i.i = load <4 x i64>, <4 x i64>* bitcast ({ { { i64, [32 x i8] } }, { { i1 } }, { { i1 } }, [6 x i8] }* @_ZN3std10sys_common11thread_info11THREAD_INFO7__getit5__KEY17h80e4cdc49b84860aE to <4 x i64>*), align 32, !dbg !3742, !noalias !3762

This is translated into the following x86_64 instruction:

vmovdqa (%rax), %ymm0


whitequark · 2017-08-23T10:45:09Z

Related: rust-embedded/cortex-m#44

parched · 2017-08-23T13:02:29Z

Good analysis @yvt. I guess that bitcast is dodgy, creating an unaligned pointer. I wonder if that comes from an LLVM pass or rust codegen.

@whitequark I'm not sure that is related, is it?

kennytm · 2017-08-23T13:17:06Z

The bitcast is fine, at least if LTO is not applied. As shown in #40454, those memcpy calls are translated to movups with SSE. The question is why vmovdqa is chosen instead of vmovdqu...

parched · 2017-08-23T14:03:06Z

I guess because it expects a <4 x i64>* to be correctly 32 byte aligned and maybe the aligned instruction is more performant?

parched · 2017-08-23T16:47:16Z

Ah yes, the bitcast is fine, it's the align 32 on that load that is the problem. Without LTO you get align 1 as expected.

yvt · 2017-08-23T18:04:48Z

Narrowed down the code to reproduce the issue. The issue can be reproduced with -Ctarget-feature=+avx -Copt-level=2 --extern libc=<environment dependent value>:

#![feature(lang_items)]
#![feature(start)]
#![feature(libc)]
#![feature(repr_simd)]
#![feature(const_fn)]
#![feature(thread_local)]
#![no_std]
#![no_main]
use core::mem;
extern crate libc;

struct Hoge(u64, u64, u64, u64);

#[thread_local]
static mut STATIC_VAR: Hoge = Hoge(0, 0, 0, 0);

#[no_mangle]
pub extern fn main(_argc: i32, _argv: *const *const u8) -> i32 {
    let mut local_var = Hoge(0, 0, 0, 0);
    unsafe {
        mem::swap(&mut local_var, &mut STATIC_VAR); // CRASH! (sometimes)
        local_var.0 as i32
    }
}

#[lang = "eh_personality"] extern fn eh_personality() {}
#[lang = "panic_fmt"] fn panic_fmt() -> ! { loop {} }

It was caused by neither rustc nor LLVM's optimization passes; it all had to do with how thread-local variables (TLVs) are handled by macOS's dyld.

LLVM expects global variables (including thread-local ones) to be aligned as specified:

@_ZN5prog210STATIC_VAR17ha13b998d4541fb2fE = 
internal thread_local global { i64, i64, i64, i64 } zeroinitializer, align 32

The variable does not have to be align 32 by itself, but I suppose LLVM decided to make it align 32 anyway because that lets it optimize the program better by using aligned loads and stores. Either way, this makes it legitimate to use align 32 for all loads and stores on this variable, like this:

%local_var.sroa.0.0.copyload = load <4 x i64>, <4 x i64>* 
bitcast ({ i64, i64, i64, i64 }* @_ZN5prog210STATIC_VAR17ha13b998d4541fb2fE to <4 x i64>*), 
align 32
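
To see that align 32 is not something the type itself asks for, the natural layout of a struct shaped like Hoge can be printed. This is a minimal sketch; it assumes a typical x86_64 target, where u64 is 8-byte aligned, and the exact alignment may differ elsewhere.

```rust
use std::mem::{align_of, size_of};

// Same shape as the `Hoge` struct in the reproduction above.
#[allow(dead_code)]
struct Hoge(u64, u64, u64, u64);

fn main() {
    // Four u64 fields: 32 bytes of payload.
    println!("size_of::<Hoge>()  = {}", size_of::<Hoge>());
    // The alignment Rust requires for the type, which is smaller than the
    // 32 bytes that appear on the LLVM global after optimization.
    println!("align_of::<Hoge>() = {}", align_of::<Hoge>());
    assert!(align_of::<Hoge>() < 32);
}
```

The gap between the type's own alignment and the 32 bytes on the global is what the loader would have to honor for the aligned load to be safe.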

And it outputs a Mach-O section for TLVs with a proper alignment requirement:

$ otool -l prog2
<snip>
Section
  sectname __thread_bss
   segname __DATA
      addr 0x0000000100001020
      size 0x0000000000000020
    offset 0
     align 2^5 (32)
    reloff 0
    nreloc 0
     flags 0x00000012
 reserved1 0
 reserved2 0

The problem is that dyld's threadLocalVariables.c does not actually take the alignment info into account when allocating a memory region for TLVs:

// allocate buffer and fill with template
void* buffer = malloc(size);

The aforementioned program fails if the returned buffer happened not to be 32-byte aligned.
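
The effect can be illustrated without dyld. The sketch below uses Rust's global allocator rather than dyld's malloc call, so it is only an analogy: it requests blocks with a 16-byte alignment guarantee and counts how many also happen to land on a 32-byte boundary. The function name `count_32_aligned` is made up for illustration, and the count varies by platform and allocator.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Allocates `n` 32-byte blocks with a guaranteed alignment of 16 bytes and
/// returns how many of them also happen to sit on a 32-byte boundary.
fn count_32_aligned(n: usize) -> usize {
    let layout = Layout::from_size_align(32, 16).expect("valid layout");
    let mut blocks = Vec::with_capacity(n);
    for _ in 0..n {
        // SAFETY: `layout` has a non-zero size.
        let p = unsafe { alloc(layout) };
        assert!(!p.is_null(), "allocation failed");
        // Only 16-byte alignment is promised by the request.
        assert_eq!(p as usize % 16, 0);
        blocks.push(p);
    }
    let hits = blocks.iter().filter(|&&p| p as usize % 32 == 0).count();
    for p in blocks {
        // SAFETY: each pointer came from `alloc` with the same `layout`.
        unsafe { dealloc(p, layout) };
    }
    hits
}

fn main() {
    let n = 1000;
    println!("{} of {} blocks were 32-byte aligned", count_32_aligned(n), n);
}
```

Any block that is not counted would trigger the same fault if it were accessed with an instruction that assumes 32-byte alignment.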

By the way, LDC devs seem to have experienced a similar issue.

kennytm · 2017-08-23T18:29:19Z

@yvt According to the LDC report, it has already been fixed in Xcode 8, but we are still seeing this bug? 😕 Could it be because the distributed rustc is built with Xcode 7?

BTW neither of the following fixes the issue: putting #[repr(align(32))] on Hoge, or making STATIC_VAR nonzero (which puts it in __thread_data instead of __thread_bss).

alexcrichton · 2017-08-23T21:53:52Z

Is this something we could perhaps work around by allocating larger thread locals on our end and then doing the alignment ourselves?

yvt · 2017-08-25T15:09:28Z

@alexcrichton Yes, methods like this would work. The overhead of 32 * 2 + sizeof(size_t) bytes for every (TLV, thread) seems a little bit too much to me, but maybe it's reasonable on modern computer systems where plenty of memory is available.
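
A minimal sketch of the over-allocate-and-align idea follows. It is not how rustc or std implements anything; the names `align_up` and `backing` and the two constants are assumptions chosen for illustration, and a heap-allocated Vec stands in for the TLV storage.

```rust
/// Rounds `addr` up to the next multiple of `align` (a power of two).
/// (Hypothetical helper, for illustration only.)
fn align_up(addr: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

fn main() {
    const ALIGN: usize = 32; // alignment assumed for the variable
    const SIZE: usize = 32; // size of the variable's payload

    // Stand-in backing storage: SIZE + ALIGN - 1 bytes always contain an
    // ALIGN-aligned window of SIZE bytes, wherever the buffer starts.
    let backing = vec![0u8; SIZE + ALIGN - 1];
    let base = backing.as_ptr() as usize;
    let aligned = align_up(base, ALIGN);

    assert_eq!(aligned % ALIGN, 0);
    assert!(aligned + SIZE <= base + backing.len());
    println!(
        "base = {:#x}, aligned = {:#x}, padding = {}",
        base,
        aligned,
        aligned - base
    );
}
```

The padding is the per-variable, per-thread cost discussed above; a real workaround would also need to record where the aligned window starts.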

Compiler support would be required for types with even larger alignment requirements. (related to #33626)

@kennytm
I guess the fix introduced with Xcode 8 only changes the way TLV segments are handled by the linker. The actual memory regions for TLVs are still allocated by dyld (that comes with macOS, not Xcode) using malloc, which effectively limits TLV's alignment size to 16 bytes. I think there's nothing we can do about this.

This can be verified by running the following program, which sporadically crashes even if compiled with Xcode 8:

#include <stdio.h>
#include <immintrin.h>  // __m256, _mm256_set_ps

__thread char tb = 42;
__thread char zb32[32];  // LLVM opt adds `align 32`

int main()
{
    printf("%p %p\n", &tb, zb32);
    *(__m256 *)zb32 = _mm256_set_ps(0, 0, 0, 0, 0, 0, 0, 0);
}

$ gcc prog3.c -march=native -O3 -o poisson-rng
$ while ./poisson-rng; do :; done
0x7fcb22c02760 0x7fcb22c02780
0x7f93d7402760 0x7f93d7402780
0x7ffefc5026c0 0x7ffefc5026e0
0x7fde87c02760 0x7fde87c02780
0x7fd1af402760 0x7fd1af402780
0x7fcc506006b0 0x7fcc506006d0
Segmentation fault: 11

[DO NOT MERGE] Do not allow LLVM to increase a TLS's alignment on macOS. This addresses the various TLS segfault on macOS 10.10. Fix #51794. Fix #51758. Fix #50867. Fix #48866. Fix #46355. Fix #44056.

shepmaster added the A-codegen, C-bug, I-crash, and T-compiler labels on Aug 25, 2017

alexcrichton mentioned this issue Nov 29, 2017

SIGSEGV when using non-trivial thread_local! with -C target-cpu=haswell on macOS #46355

Closed

kennytm mentioned this issue Jun 28, 2018

Do not allow LLVM to increase a TLS's alignment on macOS. #51828

Merged

bors closed this as completed in #51828 Jun 30, 2018

This was referenced Oct 31, 2018

make run-pass tests with empty main just compile-pass tests #54680

Merged

we don't detect LLVM needing a rebuild correctly in llvm-finished-building #55537

Closed

nikic mentioned this issue Nov 4, 2018

Don't test issue-44056.rs on non-AVX hardware #55667

Closed

gnzlbg mentioned this issue Nov 16, 2018

issue-44056.rs was incorrectly moved to compile pass and should not if hardware does not support AVX #55996

Open

luqmana mentioned this issue Jun 26, 2020

Incorrect compilation / STATUS_ACCESS_VIOLATION when linking with lld with target-cpu set #72145

Closed
