
Infinite loop #566

Open
douglas-raillard-arm opened this issue Oct 3, 2024 · 38 comments

@douglas-raillard-arm

douglas-raillard-arm commented Oct 3, 2024

A very simple snippet ends up in an infinite loop in some environments, it seems:

    use hashbrown::HashMap;

    let left: u64 = 1;
    let right: u64 = 1;
    let mut myhash = HashMap::new();
    myhash.insert(left, right);

This works everywhere except in one very specific place: a kernel module for an Android 6.1 kernel (Pixel 6 device), compiled in release mode:

  • This does not use the Rust for Linux support.
  • The code is compiled as a no_std staticlib, prelinked and then provided to the kernel build system as a standalone object file.
  • The global allocator is a straightforward wiring of the GlobalAlloc trait to the kmalloc/kfree kernel functions (a minimal sketch follows at the end of this comment).
  • In debug mode everything works fine; it only breaks in release.
  • The symptom is a hard lockup of the CPU running the code.

Unfortunately, I was not able to get a useful backtrace out of the kernel for whatever reason, which is a real shame.

EDIT: added no_std
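For reference, a minimal sketch of what such a wiring can look like; the kmalloc/kfree bindings and the GFP flag value below are assumptions for illustration, not the actual module code:

    use core::alloc::{GlobalAlloc, Layout};

    // Hypothetical FFI bindings to the kernel allocator.
    extern "C" {
        fn kmalloc(size: usize, flags: u32) -> *mut u8;
        fn kfree(ptr: *mut u8);
    }

    // Placeholder: substitute the kernel's real GFP_KERNEL value.
    const GFP_KERNEL: u32 = 0;

    struct KernelAllocator;

    unsafe impl GlobalAlloc for KernelAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            kmalloc(layout.size(), GFP_KERNEL)
        }

        unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
            kfree(ptr);
        }
    }

    #[global_allocator]
    static ALLOCATOR: KernelAllocator = KernelAllocator;

Note that this sketch ignores the alignment caveat discussed later in the thread.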

@clarfonthey
Contributor

That is very peculiar. Are you sure that it's not just the allocation itself that's failing? I.e., does HashMap::with_capacity(1) not give the same error?

By default, HashMap::new doesn't allocate any memory, so it wouldn't be calling those allocator functions until the insert call.
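A minimal way to test that distinction (a hypothetical check, assuming hashbrown's default hasher is enabled):

    fn probe_allocation() {
        // Reserving capacity forces the allocation here rather than in insert().
        let mut m: hashbrown::HashMap<u64, u64> = hashbrown::HashMap::with_capacity(1);
        m.insert(1, 1); // capacity already reserved, so insert() should not allocate
    }

If this also hangs, the problem is in the allocation path; if only the insert hangs, the allocator is off the hook.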

@douglas-raillard-arm
Author

The allocation itself succeeds (I can easily log the allocated pointer from kmalloc), so it's not that; the part that actually fails is indeed the insert(). HashMap::new() on its own is fine.

@douglas-raillard-arm
Author

Is there any trick based on using unused pointer bits in hashbrown? The address range used in kernel space is different from the one in userspace, AFAIR. Other than that, it should be a run-of-the-mill no_std environment.

@clarfonthey
Contributor

The stored pointer isn't at the beginning of the allocation, but other than that, it shouldn't affect things. The empty map is stored as static data, but since you got to the allocation, that shouldn't have been the source of the issue, since the code would have read the data before that point.

Since you have the ability to log the pointer allocation, I'd honestly recommend building the crate yourself and injecting logging statements to debug things if you can, since that'd be the best way to figure out what the actual issue is. Besides that, there really isn't much to go on, since this feels pretty specific to your setup.

@Amanieu
Member

Amanieu commented Oct 4, 2024

This is almost certainly due to the use of NEON instructions in hashbrown, which aren't normally allowed in the kernel. Are you using the aarch64-unknown-none-softfloat target which specifically disables the use of FP/SIMD instructions?

@douglas-raillard-arm
Author

douglas-raillard-arm commented Oct 4, 2024

I'll try to add some debug prints once I've figured out how to build a local version of the crate.

Are you using the aarch64-unknown-none-softfloat target which specifically disables the use of FP/SIMD instructions?

It's using a custom target.json, derived from aarch64-unknown-none-softfloat:

{
  "abi": "softfloat",
  "arch": "aarch64",
  "data-layout": "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32",
  "features": "+v8a,+strict-align,-neon,-fp-armv8",
  "llvm-target": "aarch64-unknown-none",
  "max-atomic-width": 128,
  "stack-probes": {
    "kind": "none"
  },
  "target-pointer-width": "64",
  "supported-sanitizers": [
    "kcfi"
  ]
}

So there should not be any NEON instructions in the binary. I haven't exhaustively checked the disassembly though, so maybe they could still sneak in via intrinsics (?) or asm!().

EDIT: figured out how to build my own version of hashbrown with debug prints; let's see what happens when I get the Pixel 6 back to experiment.

@douglas-raillard-arm
Author

douglas-raillard-arm commented Oct 4, 2024

I managed to find the problematic spot: when a print is added at the beginning of this loop, the issue disappears. So there is probably something unsound somewhere in there that is exposed by optimizations (I'm using Rust 1.81.0).

EDIT: I also confirm that this is the loop that becomes infinite in the broken case.

@clarfonthey
Contributor

Hmm, that is very strange, since we've thoroughly tested this and run it through Miri without triggering any undefined behaviour. Since you're running on a target without NEON, you're almost certainly using the generic, non-SIMD implementation (which is also what we run under Miri), which makes this very weird.

@clarfonthey
Contributor

To be honest, I'm not entirely a fan of the way we currently do the probe sequence, since, as you're experiencing, it has no guarantee of terminating. This means that a potentially malicious hash or a bad implementation could lead to an infinite loop.

The one thing that immediately jumps to mind is that we have an explicit test for the target pointer width when factoring in the cache: if the pointer width is 32, then we take bits 31-25 of the hash for the 7-bit control code, whereas if the pointer width is 64, we take bits 63-57. It could be that the hash is still ending up as a 32-bit number and thus the control code is always zero, but since zero is a valid hash, that still feels unlikely.
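For reference, the control-code extraction is roughly the following (a paraphrase of hashbrown's h2, not the exact source):

    // Grab the top 7 bits of the hash. `hash_len` is the smaller of the
    // pointer width and 64 bits, so a 64-bit target uses bits 63-57 and a
    // 32-bit target uses bits 31-25.
    fn h2(hash: u64) -> u8 {
        let hash_len = usize::min(core::mem::size_of::<usize>(), 8);
        let top7 = hash >> (hash_len * 8 - 7);
        (top7 & 0x7f) as u8
    }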

@Amanieu
Member

Amanieu commented Oct 4, 2024

The global allocator is a straightforward wiring of the GlobalAllocator trait to kmalloc/kfree kernel functions.

Are you absolutely sure that the pointer returned by the allocator is correctly aligned? This is a problem that several people have encountered when using custom allocators. It might be worth dropping an assert there to be sure.

To be honest, I'm not entirely a fan of the way we currently do the probe sequence, since, as you're experiencing, it has no guarantee of terminating. This means that a potentially malicious hash or a bad implementation could lead to an infinite loop.

Because the table size is a power of 2, a quadratic probe sequence is guaranteed to visit each group exactly once before looping. The loop is then guaranteed to terminate as long as there is at least one empty bucket, which is guaranteed by the load factor. The hash only selects the starting position; all groups are still visited.
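Concretely, the probe sequence is triangular; a paraphrased sketch (names approximate hashbrown's internals):

    const GROUP_WIDTH: usize = 8; // generic (non-SIMD) group size

    struct ProbeSeq {
        pos: usize,
        stride: usize,
    }

    impl ProbeSeq {
        fn move_next(&mut self, bucket_mask: usize) {
            // Strides grow by one group each step (triangular numbers).
            // Modulo a power-of-two table size, this visits every group
            // exactly once before the sequence repeats.
            self.stride += GROUP_WIDTH;
            self.pos += self.stride;
            self.pos &= bucket_mask;
        }
    }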

@douglas-raillard-arm
Author

Are you absolutely sure that the pointer returned by the allocator is correctly aligned? This is a problem that several people have encountered when using custom allocators. It might be worth dropping an assert there to be sure.

I'll add some asserts. Currently the code assumes the address will be correctly aligned under this condition:

align <= 8 || (size.is_power_of_two() && align <= size)

This is based on the kernel doc:

The address of a chunk allocated with kmalloc is aligned to at least ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the alignment is also guaranteed to be at least the respective size. For other sizes, the alignment is guaranteed to be at least the largest power-of-two divisor of the size.

(and on arm64, ARCH_KMALLOC_MINALIGN is 8).
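A hypothetical helper expressing that assumption together with the suggested assert (names made up for illustration, not the actual module code):

    /// Check a pointer returned by kmalloc against the requested layout.
    fn check_kmalloc_alignment(ptr: *mut u8, size: usize, align: usize) {
        // Condition under which kmalloc guarantees the alignment at all
        // (ARCH_KMALLOC_MINALIGN == 8 on arm64).
        assert!(align <= 8 || (size.is_power_of_two() && align <= size));
        // Verify the returned pointer actually meets the requested
        // alignment (`align` is a power of two per Layout's contract).
        assert_eq!(ptr as usize & (align - 1), 0);
    }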

when factoring in the cache [...] whereas if the pointer width is 64, we take bits 63-57

I'm not sure what you mean here by "factoring in the cache", but those bits will be in use for kernel-space pointers on arm64, as can be seen here.

@douglas-raillard-arm
Author

Assert added, and it's still landing in the infinite loop without panicking first.

@Amanieu
Member

Amanieu commented Oct 7, 2024

I'm honestly out of ideas at this point. I would recommend setting up kernel debugging to figure out exactly which loop it's getting stuck in.

@douglas-raillard-arm
Author

@Amanieu that's already figured out, see #566 (comment)

@Amanieu
Member

Amanieu commented Oct 7, 2024

I mean single-stepping that loop with a debugger to figure out why it's not exiting on the first iteration since there's only one element in the table. match_empty should return at least 1 set bit and cause the loop to exit with None.

@douglas-raillard-arm
Author

Single-stepping is going to be really challenging. Even installing a debugger is going to be a problem (it's Android, not a regular distro with a gdb package). On top of that, the binary is mixed-language (it's a C kernel module with one object file coming from Rust). I can however apply any patch you want to hashbrown, so if there are things you want me to print or asserts to add, that's straightforward.

The only thing to keep in mind is that extra printing in the loop makes the issue disappear, similarly to compiling in debug mode, so it's quite likely there is something unsound quite "local" to that spot (or to anything inlined there).

The SAFETY comment of that snippet lists 4 points; maybe some of them can be turned into assert!()?

        let group = unsafe { Group::load(self.ctrl(probe_seq.pos)) };

@douglas-raillard-arm
Author

I added the following obvious asserts according to the SAFETY comment, and they passed fine, with the infinite loop still happening:

            assert_eq!(self.bucket_mask, self.buckets() - 1);
            assert!(probe_seq.pos <= self.bucket_mask);

@Amanieu
Member

Amanieu commented Oct 7, 2024

Maybe you could try sprinkling some #[inline(never)] around these functions to figure out which one is causing the issue? That should at least narrow down the issue to a set of functions.

You could also add an assert to check that all bits are set in the bitmask returned by match_empty since the table should be completely empty at this point.

In fact, can you also add an assert that the group load returns 0x8080808080808080 (top bit set in each byte)?
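A hypothetical standalone form of those checks (assumed helper name; a real patch would inline this in the probe loop):

    /// Assert that every control byte in the group has its top bit set
    /// (true of both EMPTY and DELETED), as expected while the table is
    /// still completely empty.
    unsafe fn assert_group_not_full(ctrl: *const u8) {
        let raw = core::ptr::read_unaligned(ctrl as *const u64);
        assert_eq!(raw & 0x8080_8080_8080_8080, 0x8080_8080_8080_8080);
    }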

@douglas-raillard-arm
Author

I was thinking the same, so I tried #[inline(never)] on Group::load (generic.rs) and it "fixes" the issue. However, it also "fixed" it for Group::match_byte, so maybe it's just an unlucky side effect, similar to adding debug prints.

You could also add an assert to check that all bits are set in the bitmask returned by match_empty since the table should be completely empty at this point.

In fact can you also add an assert that the group load returns 0x8080808080808080 (top bit set in each byte).

I realized the issue currently manifests itself with this snippet:

    let mut mymap = hashbrown::HashMap::new();
    mymap.insert(left, right);
    let val = mymap.get(&left).unwrap();

and it's looping in mymap.get(). When I opened the issue, this get() was not necessary, but maybe it "became necessary" after adding some debug prints. Would your invariants still hold then?

Considering the extra assert added in the allocator did not fire, it seems pretty likely the issue is somewhere in hashbrown. I'm happy to try alternative branches with extra asserts etc. to help figure out the issue, but for now I don't have any code actually depending on hashbrown (these were preliminary tests), so I'm going to go with the std BTreeMap since it's also fine for the use case.

@Amanieu
Member

Amanieu commented Oct 7, 2024

You should still be able to dump the contents of Group::load as hex, which should give us a clue as to what the underlying problem is. I would normally expect all bytes to be 0x80 except for the one byte containing the inserted element. And the loop should still exit after 1 iteration.

@douglas-raillard-arm
Author

So I tried adding #[derive(Debug)] on struct Group and then printing it just after Group::load. This led to a NULL pointer deref (the address was actually 9, not 0, but it also falls in the guard page). The second time I tried (rebuild), the print did not fire and it got stuck in the infinite loop, despite the print being obviously inside the loop. So codegen is playing tricks on us to evade scrutiny; we must be onto something :)

Note that I validated the print itself by successfully printing probe_seq.pos, which was equal to 3. It's using format!(), so that brings a String allocation into the loop.

@douglas-raillard-arm
Author

So there really is something going on with the pointer obtained via self.ctrl(probe_seq.pos), which is self.ctrl(3). As soon as this *const u8 is dereferenced into a u64, the printing code is "skipped" and the lockup happens.

@Amanieu
Member

Amanieu commented Oct 7, 2024

Could this be because the kernel doesn't support unaligned memory access for some reason?

@douglas-raillard-arm
Author

douglas-raillard-arm commented Oct 7, 2024

Maybe, but I'd expect a fault leading to a panic.

The only print that has worked is turning the ptr into a slice of length 8 and printing that:

ptr=0xffffff88283080c2 data=[6, 255, 255, 255, 255, 255, 255, 255]
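A hypothetical reconstruction of that print (the log hook is made up; the module does not use Rust for Linux's macros):

    extern crate alloc;

    /// Print the 8 control bytes at `ctrl` without dereferencing them as
    /// a wider integer; `log` stands in for the module's kernel print hook.
    unsafe fn dump_group_bytes(ctrl: *const u8, log: impl Fn(&str)) {
        let bytes = core::slice::from_raw_parts(ctrl, 8);
        // format!() allocates a String, so this also exercises the allocator.
        log(&alloc::format!("ptr={:p} data={:?}", ctrl, bytes));
    }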

@cuviper
Member

cuviper commented Oct 7, 2024

Given that this only happens in release mode, not debug, and that added prints and such perturb the issue, it sounds like there could be a codegen bug here.

@Amanieu
Member

Amanieu commented Oct 7, 2024

Right, that's the expected result. The first control byte represents the item you are searching for and the rest represent EMPTY. I would expect match_empty to return 0b0111111... where everything except the first bit is set.
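For reference, in the generic (non-SIMD) implementation, match_empty boils down to a bit trick along these lines (a paraphrase, not the exact source):

    /// EMPTY is 0xFF and DELETED is 0x80, so a control byte is EMPTY iff
    /// its top two bits are both set. The returned word has the top bit
    /// of each matching byte set.
    fn match_empty(group: u64) -> u64 {
        group & (group << 1) & 0x8080_8080_8080_8080
    }

This appears to be the pattern visible in the disassembly further down as and x8, x20, x20, lsl #1 followed by tst x8, #0x8080808080808080.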

If this really is a codegen bug then there isn't much we can do without directly looking at the generated assembly code.

@douglas-raillard-arm
Author

Maybe, but I'd expect a fault leading to a panic.

Actually I don't think so: printing the slice as-is worked, but calling that locks up:

        let x: u64 = u64::from_le_bytes(slice.try_into().unwrap());

So I doubt it's kernel-related. It looks like from_le_bytes from core is basically a transmute(), so it's not surprising it behaves as badly as casting the *const u8 into a *const u64: it probably compiles to exactly the same thing after optimizations.
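A sketch of that equivalence (hypothetical helpers; after optimization both can lower to the same single 8-byte load):

    use core::convert::TryInto;

    /// Cast-and-dereference form: UB if `p` is not 8-byte aligned, which
    /// the control pointer above (ending in 0xc2) is not.
    unsafe fn read_via_cast(p: *const u8) -> u64 {
        *(p as *const u64)
    }

    /// from_le_bytes form: copies into an array first, but can compile
    /// down to the same load as the cast above.
    unsafe fn read_via_from_le(p: *const u8) -> u64 {
        let bytes: [u8; 8] = core::slice::from_raw_parts(p, 8).try_into().unwrap();
        u64::from_le_bytes(bytes)
    }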

@douglas-raillard-arm
Author

If this really is a codegen bug then there isn't much we can do without directly looking at the generated assembly code.

I'll see if I can de-inline the whole find_inner() function.

Interestingly, I ran it again and got a different output; is that expected?

ptr=0xffffff802d686cc2 data=[96, 255, 255, 255, 255, 255, 255, 255]

@douglas-raillard-arm
Author

douglas-raillard-arm commented Oct 7, 2024

Here is the disassembly of RawTableInner::find_inner at the v0.15.0 tag, with all #[inline(never)] removed except on find_inner(), and compiled with a very aggressive inline threshold:

0000000000000000 <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE>:
   0:   a9ba7bfd        stp     x29, x30, [sp, #-96]!
   4:   d379fc28        lsr     x8, x1, #57
   8:   b200c3e9        mov     x9, #0x101010101010101          // #72340172838076673
   c:   a9035ff8        stp     x24, x23, [sp, #48]
  10:   a90267fa        stp     x26, x25, [sp, #32]
  14:   a9406019        ldp     x25, x24, [x0]
  18:   9b097d17        mul     x23, x8, x9
  1c:   f940107a        ldr     x26, [x3, #32]
  20:   a9016ffc        stp     x28, x27, [sp, #16]
  24:   a90457f6        stp     x22, x21, [sp, #64]
  28:   aa1f03f6        mov     x22, xzr
  2c:   910003fd        mov     x29, sp
  30:   a9054ff4        stp     x20, x19, [sp, #80]
  34:   aa0203f3        mov     x19, x2
  38:   8a01031b        and     x27, x24, x1
  3c:   8b1b0328        add     x8, x25, x27
  40:   b207dbf0        mov     x16, #0xfefefefefefefefe        // #-72340172838076674
  44:   39400509        ldrb    w9, [x8, #1]
  48:   3940010a        ldrb    w10, [x8]
  4c:   39400d0b        ldrb    w11, [x8, #3]
  50:   3940090c        ldrb    w12, [x8, #2]
  54:   3940150d        ldrb    w13, [x8, #5]
  58:   f29fdff0        movk    x16, #0xfeff
  5c:   38404d0e        ldrb    w14, [x8, #4]!
  60:   3940090f        ldrb    w15, [x8, #2]
  64:   d370bd8c        lsl     x12, x12, #16
  68:   39400d08        ldrb    w8, [x8, #3]
  6c:   aa092149        orr     x9, x10, x9, lsl #8
  70:   53103def        lsl     w15, w15, #16
  74:   aa0b618a        orr     x10, x12, x11, lsl #24
  78:   2a0d21cb        orr     w11, w14, w13, lsl #8
  7c:   2a0861e8        orr     w8, w15, w8, lsl #24
  80:   aa090149        orr     x9, x10, x9
  84:   2a0b0108        orr     w8, w8, w11
  88:   aa088134        orr     x20, x9, x8, lsl #32
  8c:   ca170288        eor     x8, x20, x23
  90:   8b100109        add     x9, x8, x16
  94:   8a280128        bic     x8, x9, x8
  98:   f201c11c        ands    x28, x8, #0x8080808080808080
  9c:   54000180        b.eq    cc <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0xcc>  // b.none
  a0:   dac00388        rbit    x8, x28
  a4:   aa1303e0        mov     x0, x19
  a8:   dac01108        clz     x8, x8
  ac:   8b480f68        add     x8, x27, x8, lsr #3
  b0:   8a180115        and     x21, x8, x24
  b4:   aa1503e1        mov     x1, x21
  b8:   d63f0340        blr     x26
  bc:   37000160        tbnz    w0, #0, e8 <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0xe8>
  c0:   d1000788        sub     x8, x28, #0x1
  c4:   ea1c011c        ands    x28, x8, x28
  c8:   54fffec1        b.ne    a0 <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0xa0>  // b.any
  cc:   8a140688        and     x8, x20, x20, lsl #1
  d0:   f201c11f        tst     x8, #0x8080808080808080
  d4:   540001c1        b.ne    10c <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0x10c>  // b.any
  d8:   910022d6        add     x22, x22, #0x8
  dc:   8b1b02c8        add     x8, x22, x27
  e0:   8a18011b        and     x27, x8, x24
  e4:   17ffffd6        b       3c <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0x3c>
  e8:   52800020        mov     w0, #0x1                        // #1
  ec:   aa1503e1        mov     x1, x21
  f0:   a9454ff4        ldp     x20, x19, [sp, #80]
  f4:   a94457f6        ldp     x22, x21, [sp, #64]
  f8:   a9435ff8        ldp     x24, x23, [sp, #48]
  fc:   a94267fa        ldp     x26, x25, [sp, #32]
 100:   a9416ffc        ldp     x28, x27, [sp, #16]
 104:   a8c67bfd        ldp     x29, x30, [sp], #96
 108:   d65f03c0        ret
 10c:   aa1f03e0        mov     x0, xzr
 110:   17fffff7        b       ec <_ZN9hashbrown3raw13RawTableInner10find_inner17hcb6003a74dbed1fcE+0xec>

@Amanieu
Member

Amanieu commented Oct 7, 2024

Can you remove all the #[inline(never)] except the one on find_inner? The asm is actually easier to read with everything inlined.

@douglas-raillard-arm
Author

@Amanieu I updated the asm in the previous comment and it's indeed much, much shorter.

@Amanieu
Member

Amanieu commented Oct 8, 2024

Can you try removing +strict-align from the target json? I think this may be an LLVM codegen issue but I can't find anything wrong with the assembly code at first glance.

@douglas-raillard-arm
Author

Still crashing, but the disassembly is ~30% shorter with -strict-align:

0000000000000000 <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE>:
   0:   a9ba7bfd        stp     x29, x30, [sp, #-96]!
   4:   d379fc28        lsr     x8, x1, #57
   8:   b200c3e9        mov     x9, #0x101010101010101          // #72340172838076673
   c:   a9035ff8        stp     x24, x23, [sp, #48]
  10:   a90267fa        stp     x26, x25, [sp, #32]
  14:   a9406019        ldp     x25, x24, [x0]
  18:   9b097d17        mul     x23, x8, x9
  1c:   f940107a        ldr     x26, [x3, #32]
  20:   a9016ffc        stp     x28, x27, [sp, #16]
  24:   a90457f6        stp     x22, x21, [sp, #64]
  28:   aa1f03f6        mov     x22, xzr
  2c:   910003fd        mov     x29, sp
  30:   a9054ff4        stp     x20, x19, [sp, #80]
  34:   aa0203f3        mov     x19, x2
  38:   8a01031b        and     x27, x24, x1
  3c:   f87b6b34        ldr     x20, [x25, x27]
  40:   b207dbe9        mov     x9, #0xfefefefefefefefe         // #-72340172838076674
  44:   f29fdfe9        movk    x9, #0xfeff
  48:   ca170288        eor     x8, x20, x23
  4c:   8b090109        add     x9, x8, x9
  50:   8a280128        bic     x8, x9, x8
  54:   f201c11c        ands    x28, x8, #0x8080808080808080
  58:   54000180        b.eq    88 <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0x88>  // b.none
  5c:   dac00388        rbit    x8, x28
  60:   aa1303e0        mov     x0, x19
  64:   dac01108        clz     x8, x8
  68:   8b480f68        add     x8, x27, x8, lsr #3
  6c:   8a180115        and     x21, x8, x24
  70:   aa1503e1        mov     x1, x21
  74:   d63f0340        blr     x26
  78:   37000160        tbnz    w0, #0, a4 <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0xa4>
  7c:   d1000788        sub     x8, x28, #0x1
  80:   ea1c011c        ands    x28, x8, x28
  84:   54fffec1        b.ne    5c <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0x5c>  // b.any
  88:   8a140688        and     x8, x20, x20, lsl #1
  8c:   f201c11f        tst     x8, #0x8080808080808080
  90:   540001c1        b.ne    c8 <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0xc8>  // b.any
  94:   910022d6        add     x22, x22, #0x8
  98:   8b1b02c8        add     x8, x22, x27
  9c:   8a18011b        and     x27, x8, x24
  a0:   17ffffe7        b       3c <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0x3c>
  a4:   52800020        mov     w0, #0x1                        // #1
  a8:   aa1503e1        mov     x1, x21
  ac:   a9454ff4        ldp     x20, x19, [sp, #80]
  b0:   a94457f6        ldp     x22, x21, [sp, #64]
  b4:   a9435ff8        ldp     x24, x23, [sp, #48]
  b8:   a94267fa        ldp     x26, x25, [sp, #32]
  bc:   a9416ffc        ldp     x28, x27, [sp, #16]
  c0:   a8c67bfd        ldp     x29, x30, [sp], #96
  c4:   d65f03c0        ret
  c8:   aa1f03e0        mov     x0, xzr
  cc:   17fffff7        b       a8 <_ZN9hashbrown3raw13RawTableInner10find_inner17h58079395b7fac51aE+0xa8>

@douglas-raillard-arm
Author

Also, I tried on the emulator platform, where everything seems to work (it never locked up), and the disassembly is identical (as expected, since it's the exact same toolchain).

@douglas-raillard-arm
Author

So I think inlining played some tricks again: the problem now seems to appear in insert() rather than get(). It may have been in insert() all along, and adding/removing the get() that seemingly triggered the bug may just have influenced the inlined blob, with the issue actually being in insert()... This thing is a nightmare.

That being said, it's not totally surprising, since the loop we have been looking at seems to be almost copy/pasted into a few functions.

@Amanieu
Member

Amanieu commented Oct 10, 2024

The fact that this doesn't reproduce in QEMU despite being the exact same assembly code strongly suggests that the issue is not from hashbrown but rather from something specific to the kernel and/or hardware.

Even if you can't get kgdb working, surely you should be able to get a register state, or at least a PC, from a kernel thread stuck in a loop? A register state should give enough information to figure out what is going on.

@douglas-raillard-arm
Author

I don't think we can state it is the exact same assembly. The issue has been "moving" around due to inlining, and I can't just disable inlining fully, otherwise the problem is hidden. So while I was looking at one function, the problem materialized somewhere else.

When the lockup happens, the CPU gets stuck for 10 s, then a watchdog detects it, and the only info printed is a backtrace with a single entry. That entry is either a kernel function involved in syscalls or the scheduler. However, removing the snippet from the module fixes the issue, so it definitely has something to do with it (and it's even clearer there is a problem in the code or the compiler, given that it depends on the Rust optimization level).

@clarfonthey
Contributor

I think it should mostly be possible to verify that it's the same assembly if you use the same custom target for both and ensure that insert itself isn't inlined. It would be helpful to confirm that it is in fact wrong code being generated and not just a hardware issue.
