Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libspl/backtrace: dump registers in libunwind backtraces #16653

Closed
wants to merge 4 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Oct 15, 2024

Motivation and Context

In postmortem debugging, often all we have is the binary. Having the registers makes it a least possible to follow the disassembly and try to guess how we got here.

Description

libunwind knows how to fish all the registers out of the stack frame, not just IP. Loop over 'em and spit 'em out.

I've made no effort to do the same in the non-libunwind builds, because libc doesn't typically give you the tools to do it well/at all.

I'm not yet fully satisfied with this. I think I can do better by capturing the register state at the point the assert is tripped, but it will be a lot more invasive. I'll keep working on this, but I think this is still very useful right now.

How Has This Been Tested?

Induced a crash, enjoyed the output:

ASSERT at module/zfs/spa.c:5791:spa_open_common()
"oh no" == 0 (0x7fe60100db1b == 0)
  PID: 870661    COMM: zdb
  TID: 870661    NAME: zdb
Registers:
  RAX: 0x0000000000000037  RDX: 0x0000000000000000  RCX: 0x0000000000000000
  RBX: 0x0000000000000002  RSI: 0x00007ffe7eac2e60  RDI: 0x00007ffe7eac4770
  RBP: 0x00007ffe7eac4f70  RSP: 0x00007ffe7eac46c0   R8: 0x0000000000000000
   R9: 0x0000000000000073  R10: 0x00007ffe7eac4d30  R11: 0x0000000000000021
  R12: 0x00007ffe7eac46f0  R13: 0x0000000000000000  R14: 0x00007ffe7eac4770
  R15: 0x00007ffe7eac4b40  RIP: 0x00007fe600fd8d26
Call trace:
  [0x00007fe600fd8c45] libspl_assertf+0x135 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x2ccc45)
  [0x00007fe600e66bc7] spa_open_common+0x57 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x15abc7)
  [0x00007fe600e2648f] dsl_pool_hold+0x1f (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0x11a48f)
  [0x00007fe600dd93c7] dmu_objset_hold_flags+0x27 (in /home/robn/code/zfs/.libs/libzpool.so.6.0.0 +0xcd3c7)
  [0x0000562e1ced46cc] open_objset+0x10c (in /home/robn/code/zfs/.libs/zdb +0x126cc)
  [0x0000562e1ceccbfb] main+0x149b (in /home/robn/code/zfs/.libs/zdb +0xabfb)
  [0x00007fe60068724a] __libc_start_call_main+0x7a (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x2724a)
  [0x00007fe600687305] __libc_start_main_alias_2+0x85 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x27305)
  [0x0000562e1cecccf1] _start+0x21 (in /home/robn/code/zfs/.libs/zdb +0xacf1)

I haven't tested on non-amd64 architectures, but the libunwind documentation doesn't say anything about particular facilities not being available. At worst, I'd expect it to show nothing, or maybe only the IP.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. We can continue to incrementally improve this as needed.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Oct 16, 2024
@robn
Copy link
Member Author

robn commented Oct 16, 2024

FYI, I managed to try this on a RPi4b this morning:

ASSERT at cmd/zdb/zdb.c:693:usage()
"ohno" == 0 (0x5555d71550 == 0)
  PID: 62086     COMM: zdb
  TID: 62086     NAME: zdb
Registers:
   x0: 0x0000007fc44ce5c8   x1: 0x0000007f9c0bfb90   x2: 0x0000007fc44ce510
   x3: 0x0000000000000000   x4: 0x0000007f9b5a89f0   x5: 0x0000000000000073
   x6: 0x00000000fffffffd   x7: 0x00000000ffffffff   x8: 0x0000000000000040
   x9: 0x0000007fc44cf150  x10: 0x0000000000000000  x11: 0x0000000000000000
  x12: 0x000000000000000a  x13: 0x00000000010ec6b3  x14: 0x0000000000001c30
  x15: 0x00000000010ec6b3  ip0: 0x0000007f9bdff590  ip1: 0x0000007f9bf20040
  x18: 0x0000000000001d90  x19: 0x000000000000f286  x20: 0x0000007f9c0753d0
  x21: 0x0000000000000002  x22: 0x0000007f9b6d14e0  x23: 0x0000007fc44d16a1
  x24: 0x00000000000002b5  x25: 0x0000005555d77490  x26: 0x0000007f9c075000
  x27: 0xffffffffffffffff  x28: 0x0000000000000000   fp: 0x0000007fc44ce410
   lr: 0x0000007f9bdff610   sp: 0x0000007fc44ce410   pc: 0x0000007f9bdff610
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000  ???: 0x0000000000000000  ???: 0x0000000000000000
  ???: 0x0000000000000000   v0: 0x0000000000000000   v1: 0x0000000000000000
   v2: 0x0000000000000000   v3: 0x0000000000000000   v4: 0x0000000000000000
   v5: 0x0000000000000000   v6: 0x0000000000000000   v7: 0x0000000000000000
   v8: 0x0000000000000000   v9: 0x0000000000000000  v10: 0x0000000000000000
  v11: 0x0000000000000000  v12: 0x0000000000000000  v13: 0x0000000000000000
  v14: 0x0000000000000000  v15: 0x0000000000000000  v16: 0x0000000000000000
  v17: 0x0000000000000000  v18: 0x0000000000000000  v19: 0x0000000000000000
  v20: 0x0000000000000000  v21: 0x0000000000000000  v22: 0x0000000000000000
  v23: 0x0000000000000000  v24: 0x0000000000000000  v25: 0x0000000000000000
  v26: 0x0000000000000000  v27: 0x0000000000000000  v28: 0x0000000000000000
  v29: 0x0000000000000000  v30: 0x0000000000000000  v31: 0x0000000000000000
  fpsr: 0x0000000000000000  fpcr: 0x0000000000000000
Call trace:
  [0x0000007f9bdff4bc] libspl_assertf+0x11c
  [0x0000005555d60f4c] usage+0x4c
  [0x0000005555d59cbc] main+0x5bc
  [0x0000007f9b557740] __libc_init_first+0x80
  [0x0000007f9b557818] __libc_start_main+0x98
  [0x0000005555d5a9b0] _start+0x30

So I suspect I'm holding something slightly wrong, but also I may not have a chance to get back to this for a few days. It's not the worst thing, so probably still nbd to merge but if I get a chance before you do I'll look into it.

@behlendorf
Copy link
Contributor

Thanks for checking this elsewhere. We'll just hold of merging this until you have a chance to sort that out.

More useful stuff, especially when trying to follow a disassembly.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
My eyes are going blurry looking at all those write calls. This is much
nicer.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
This is the sort of code that we get right once and never look at again.
Anyone reading this code is already likely in the middle of a debugging
nightmare, and then they have a wall of manual string construction and
an unfamiliar and idiosyncratic library to deal with. So, comment the
whole thing to try to make it clear what's going on.

In pursuit of the above, I've added return checks to some of the
libunwind calls, fixed the frame loop to not skip the "top" frame
(however unseful it may be), and fix a couple of calls to
spl_bt_u64_to_hex_str() which requested 18 digits instead of 16.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
@robn
Copy link
Member Author

robn commented Oct 18, 2024

Alright, sorted: was holding it wrong (libunwind is kinda weird, but also quite powerful).

Turns out register names are just a static lookup table by enum. Register existence is more about whether or not you can actually get the value for it - not every implementation of an architecture has every register. So now we're checking error codes and skipping registers if we couldn't get a value.

But, libunwind doesn't have a hardcoded name for every register it knows about, so for those, we have to show something. They're pretty niche though; the only one I have nearby is the ARM64 "PSTATE" register, which is some synthetic thing based on certain status flags (idk, I didn't look). So while we're not naming it perfectly, we can at least show it.

Put it together:

Registers:
   x0: 0x0000007fc35348e8   x1: 0x0000000000025b4d   x2: 0x0000000000000006
   x3: 0x0000007f8a07f040   x4: 0x0000007f8a0c2b90   x5: 0x0000000000000001
   x6: 0x0000000000000020   x7: 0x0000007f89520a78   x8: 0x0000000000000083
   x9: 0x0000007f89504418  x10: 0x0000000000000000  x11: 0x0000000000000000
  x12: 0x000000000000000a  x13: 0x00000000010ec6b3  x14: 0x0000000000000008
  x15: 0x00000000010ec6b3  ip0: 0x0000007f89dff590  ip1: 0x0000005555ae0af8
  x18: 0x0000000000001d90  x19: 0x0000007fc3534830  x20: 0x0000000000000002
  x21: 0x0000000000000006  x22: 0x0000007f896d14e0  x23: 0x0000007fc35396a1
  x24: 0x00000000000002b5  x25: 0x0000005555ac7490  x26: 0x0000007f8a075000
  x27: 0xffffffffffffffff  x28: 0x0000000000000000   fp: 0x0000007fc3534720
   lr: 0x0000007f89dff610   sp: 0x0000007fc3534720   pc: 0x0000007f89dff610
  ?21: 0x000000000000b302
Call trace:
  [0x0000007f89dff610] libspl_backtrace+0x80
  [0x0000005555ab1468] sig_handler+0x18
  [0x0000007f8a0c07b0] ???
  [0x0000007f895b09f0] pthread_key_delete+0x1a0
  [0x0000007f8956a72c] gsignal+0x1c
  [0x0000007f8955747c] abort+0xf0
  [0x0000007f89dff4e8] libspl_assertf+0x148
  [0x0000005555ab0f4c] usage+0x4c
  [0x0000005555aa9cbc] main+0x5bc
  [0x0000007f89557740] __libc_init_first+0x80
  [0x0000007f89557818] __libc_start_main+0x98
  [0x0000005555aaa9b0] _start+0x30

But there's more! The soup of write calls were getting out of hand, so I took the opportunity to make some macros to help with that. And then I figured screw it, I'll document the whole thing because its finicky and weird and rare that you should have to go in there, so more help rather than less is nice.

And that was worth it, because I found half a dozen small bugs and quirks. Nothing dangerous, but there because it was so hard to see!

I now have designs on a more comprehensive crash output, especially for assert/verify which then raises a signal, but I'll do that in a later PR, because it touches more stuff than just this file.

@behlendorf
Copy link
Contributor

Very nice!

behlendorf pushed a commit that referenced this pull request Oct 20, 2024
My eyes are going blurry looking at all those write calls. This is much
nicer.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Close #16653
behlendorf pushed a commit that referenced this pull request Oct 20, 2024
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16653
behlendorf pushed a commit that referenced this pull request Oct 20, 2024
This is the sort of code that we get right once and never look at again.
Anyone reading this code is already likely in the middle of a debugging
nightmare, and then they have a wall of manual string construction and
an unfamiliar and idiosyncratic library to deal with. So, comment the
whole thing to try to make it clear what's going on.

In pursuit of the above, I've added return checks to some of the
libunwind calls, fixed the frame loop to not skip the "top" frame
(however unseful it may be), and fix a couple of calls to
spl_bt_u64_to_hex_str() which requested 18 digits instead of 16.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
More useful stuff, especially when trying to follow a disassembly.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
My eyes are going blurry looking at all those write calls. This is much
nicer.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Close openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
This is the sort of code that we get right once and never look at again.
Anyone reading this code is already likely in the middle of a debugging
nightmare, and then they have a wall of manual string construction and
an unfamiliar and idiosyncratic library to deal with. So, comment the
whole thing to try to make it clear what's going on.

In pursuit of the above, I've added return checks to some of the
libunwind calls, fixed the frame loop to not skip the "top" frame
(however unseful it may be), and fix a couple of calls to
spl_bt_u64_to_hex_str() which requested 18 digits instead of 16.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
More useful stuff, especially when trying to follow a disassembly.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
My eyes are going blurry looking at all those write calls. This is much
nicer.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Close openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 21, 2024
This is the sort of code that we get right once and never look at again.
Anyone reading this code is already likely in the middle of a debugging
nightmare, and then they have a wall of manual string construction and
an unfamiliar and idiosyncratic library to deal with. So, comment the
whole thing to try to make it clear what's going on.

In pursuit of the above, I've added return checks to some of the
libunwind calls, fixed the frame loop to not skip the "top" frame
(however unseful it may be), and fix a couple of calls to
spl_bt_u64_to_hex_str() which requested 18 digits instead of 16.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tino Reichardt <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16653
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants