dwm terminated with signal SIGSEGV, segmentation fault #436

Open
adolfgatonegro opened this issue Oct 6, 2024 · 6 comments

@adolfgatonegro

adolfgatonegro commented Oct 6, 2024

Hey, I'm running into an issue with dwm, similar to #324.

SYSTEM: Arch Linux
KERNEL: 6.11.2-arch1-1
NVIDIA DRIVER: 560.35.03-11
XORG X SERVER: 21.1.13-1
DWM VERSION: dwm-6.5 (last commit: 36cbcf5)

My regular build uses flexipatch, though I'm also seeing the issue with the latest unmodified dwm from upstream.

Issue

Installing after compilation, with sudo make install, causes dwm to crash, dropping me to the TTY. Sometimes it happens right after the install finishes, sometimes it takes a couple of seconds; regardless it crashes every time without further input on my part (not even triggering a restart of dwm myself).

Additional info

At first this happened only on my desktop, which has an NVIDIA GPU, while the same build worked correctly on my laptop with AMD graphics. It is now happening on both of my systems.

This was not an issue before the kernel 6.11 update. I had been using dwm-flexipatch based on dwm 6.4 since early last year, and everything worked fine. The issue started happening with my 6.4 build, and it remains after a fresh build of 6.5.

I can reliably reproduce the issue with an unmodified build of dwm-flexipatch, without any customisation or patching, so it does not seem like an issue with any particular patch I'm using.

I've managed to dig up the following information. I'm not a developer and have no experience debugging software, so I might be missing something obvious.

  1. dmesg log
[  506.820021] dwm[853]: segfault at 54b6 ip 00000000000054b6 sp 00007fff284aa998 error 14 likely on CPU 10 (core 4, socket 0)
[  506.820031] Code: Unable to access opcode bytes at 0x548c.
  2. Debugging coredump with gdb
Core was generated by `dwm'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000000054b6 in XNextEvent@plt ()
(gdb) bt
#0  0x00000000000054b6 in XNextEvent@plt ()
#1  0x0000577c31be88a8 in ?? ()
#2  0x0000000100000001 in ?? ()
#3  0x0001000100000003 in ?? ()
#4  0x00007b070000000e in ?? ()
#5  0x00000000000007a2 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) 

This is as far as I've got. I assume XNextEvent is related to Xorg in some way, but I have not been able to find any references to issues like this.
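
For readers trying to interpret the dmesg line above, here is a minimal sketch that decodes the "error 14" field. It assumes an x86 kernel, where this field is the page-fault error code printed in hexadecimal; the function name decode_pf_error is hypothetical and only the four most common bits are decoded.

```shell
#!/bin/sh
# Decode an x86 page-fault error code as printed (in hex) in kernel
# "segfault at ... error N" lines. Only bits 0, 1, 2 and 4 are handled.
decode_pf_error() {
    code=$((0x$1))   # the kernel prints the value in hex without a 0x prefix

    if [ $((code & 0x01)) -ne 0 ]; then present="protection violation"; else present="page not present"; fi
    if [ $((code & 0x02)) -ne 0 ]; then access="write access"; else access="read access"; fi
    if [ $((code & 0x04)) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    if [ $((code & 0x10)) -ne 0 ]; then kind="instruction fetch"; else kind="data access"; fi

    printf '%s, %s, %s, %s\n' "$present" "$access" "$mode" "$kind"
}

decoded=$(decode_pf_error 14)
echo "$decoded"
```

Read this way, the fault is a user-mode instruction fetch from a page that is not present, which is consistent with the "Unable to access opcode bytes" line that follows it in the log.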

Do let me know if there's anything else I can look at, and apologies if this is not the right place to submit this issue. Seems to affect upstream as well, but maybe something can come out of posting here.

Cheers

adolfgatonegro changed the title from "dwm terminated with signal SIGSEV, segmentation fault" to "dwm terminated with signal SIGSEGV, segmentation fault" on Oct 6, 2024
@rayvermey

This has been happening since kernel 6.11.
Why, I do not know.
When you downgrade to kernel 6.10, everything is back to normal.

@bakkeby
Owner

bakkeby commented Oct 6, 2024

The only report I have so far is that this happens with Kernel 6.11 and does not happen with Kernel 6.10. This also happens with a bare dwm.

The crash / segmentation violation seems to be in relation to the binary file being overwritten, as manually moving the old file away (from /usr/local/bin) before compiling seemingly mitigates the issue.

I'll let you know once I know more.

@adolfgatonegro
Author

> The only report I have so far is that this happens with Kernel 6.11 and does not happen with Kernel 6.10. This also happens with a bare dwm.
>
> The crash / segmentation violation seems to be in relation to the binary file being overwritten, as manually moving the old file away (from /usr/local/bin) before compiling seemingly mitigates the issue.
>
> I'll let you know once I know more.

This is indeed an issue with 6.11. Rolling back to 6.10 prevents this from happening, as does using the 6.6 LTS kernel, which is what I'm currently doing.

Thanks for looking into it, mate. Let me know if I can provide any additional info or test anything to help.

@gozenka

gozenka commented Oct 7, 2024

6.11.1-arch1-1
Plain dwm and with some light patching.
Intel iGPU: Intel Corporation HD Graphics 630

Oct 07 03:47:27 zn systemd[1]: Starting /usr/bin/make install...
Oct 07 03:47:27 zn systemd[1]: Started /usr/bin/make install.
Oct 07 03:47:27 zn systemd[1]: run-u44.service: Deactivated successfully.
Oct 07 03:47:27 zn kernel: dwm[728]: segfault at 819e ip 000000000000819e sp 00007ffe5cf6f988 error 14 likely on CPU 2 (core 2, socket 0)
Oct 07 03:47:27 zn kernel: Code: Unable to access opcode bytes at 0x8174.

I get similar output when running make install for aslstatus, which I tried in order to check this with another application.

Oct 07 04:08:15 zn kernel: temperature[4248]: segfault at 55e7 ip 00000000000055e7 sp 0000765ed57ffd58 error 14 likely on CPU 3 (core 3, socket 0)
Oct 07 04:08:15 zn kernel: Code: Unable to access opcode bytes at 0x55bd.

@bakkeby
Owner

bakkeby commented Oct 7, 2024

Running lsof showed that the process holds an entry of type "mem" (a memory-mapped file) in the FD column, pointing to the binary file.

$ sudo lsof | grep -E "COMMAND|/usr/local/bin/dwm"
COMMAND     PID   TID TASKCMD               USER  FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
dwm        2697                         sbakkeby txt       REG               0,27    448376   30882447 /usr/local/bin/dwm
dwm        2697                         sbakkeby mem       REG               0,26             30882447 /usr/local/bin/dwm (path dev=0,27)

I am assuming that this is a new thing in Kernel 6.11.

My interpretation of what is happening here is that when we recompile and install dwm, the contents of the installed binary (/usr/local/bin/dwm) are overwritten in place, ultimately causing a segmentation fault in the running process that still has that file mapped into memory.

A quick workaround for this issue is to delete the original file before we copy the new file.

diff --git a/Makefile b/Makefile
index ffa69b4..c5e7554 100644
--- a/Makefile
+++ b/Makefile
@@ -32,6 +32,7 @@ dist: clean

 install: all
        mkdir -p ${DESTDIR}${PREFIX}/bin
+       rm -f ${DESTDIR}${PREFIX}/bin/dwm
        cp -f dwm ${DESTDIR}${PREFIX}/bin
        chmod 755 ${DESTDIR}${PREFIX}/bin/dwm
        mkdir -p ${DESTDIR}${MANPREFIX}/man1
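
A minimal sketch of why removing the file first makes a difference, assuming a POSIX shell and a filesystem that supports hard links. The hard link named held is a hypothetical stand-in for the running process's reference to the old file; it is not how the kernel maps the binary, only a way to observe whether the old file's contents change. The function name demo and the file names are likewise illustrative.

```shell
#!/bin/sh
set -eu

# demo MODE: MODE is "overwrite" (plain cp -f) or "unlink" (rm -f, then cp -f).
# Prints what a second reference to the ORIGINAL file sees after the install step.
demo() {
    dir=$(mktemp -d)
    printf 'old\n' > "$dir/dwm"        # the "installed" binary
    printf 'new\n' > "$dir/dwm.new"    # the freshly built binary
    ln "$dir/dwm" "$dir/held"          # second reference to the original inode

    if [ "$1" = unlink ]; then
        rm -f "$dir/dwm"               # drop the name; the old inode survives via "held"
    fi
    cp -f "$dir/dwm.new" "$dir/dwm"    # without rm, this rewrites the existing inode

    cat "$dir/held"
    rm -rf "$dir"
}

overwrite_result=$(demo overwrite)
unlink_result=$(demo unlink)

echo "cp -f only:       old reference now reads '$overwrite_result'"
echo "rm -f then cp -f: old reference still reads '$unlink_result'"
```

With cp -f alone, the existing file is opened and rewritten, so anything still referring to it sees the new bytes. With rm -f first, cp creates a separate file and the old contents stay untouched for whoever still holds them.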

Here is what the lsof output looks like after the file has been deleted (or is moved).

$ sudo lsof | grep -E "COMMAND|/usr/local/bin/dwm"
COMMAND     PID   TID TASKCMD               USER  FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
dwm        2697                         sbakkeby txt       REG               0,27    448376   30882447 /usr/local/bin/dwm (deleted)
dwm        2697                         sbakkeby DEL       REG               0,26             30882447 /usr/local/bin/dwm
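
The same "(deleted)" state can also be observed without lsof. The following is a sketch under several assumptions: a Linux system with /proc mounted, an external sleep binary that works when copied, and a temporary directory that allows executing files. The copied sleep binary is a hypothetical stand-in for dwm.

```shell
#!/bin/sh
set -eu

tmp=$(mktemp -d)
cp "$(command -v sleep)" "$tmp/sleep"   # throwaway copy to run and then delete
"$tmp/sleep" 30 &
pid=$!
sleep 1                                  # give the child time to exec the copy

before=$(readlink "/proc/$pid/exe")      # path of the running executable
rm -f "$tmp/sleep"                       # unlink it while the process is running
after=$(readlink "/proc/$pid/exe")       # same path, now marked as deleted

kill "$pid"
wait "$pid" 2>/dev/null || true
rm -rf "$tmp"

echo "before: $before"
echo "after:  $after"
```

The process keeps running after the unlink, since the file's data stays around until the last reference to it is dropped.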

@gozenka

gozenka commented Oct 7, 2024

I'm just following this out of curiosity, I do not know much about what I am doing. I thought of checking lsof too but didn't know what to do with the output.

Are any of the reports from distros other than Arch Linux?

In case it might offer more clues, here is some information from my system:

  • dwm does not have the mem FD for me.
  • Manually deleting the binary gets the same (deleted), but no extra DEL FD like yours.
  • Same for other applications. I only have txt FDs for all applications.

So, those might be unrelated? Maybe related to the filesystem? I use ext4. In case it might be related to swap, zram, etc., I have none of those on my system. I also tried suspend / wakeup, no difference.

% sudo lsof | grep -iE "COMMAND|bin/dwm"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
dwm       1834                              km txt       REG              254,0      67920    3543170 /usr/local/bin/dwm

% sudo rm -f /usr/local/bin/dwm

% sudo lsof | grep -iE "COMMAND|bin/dwm"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
dwm       1834                              km txt       REG              254,0      67920    3543170 /usr/local/bin/dwm (deleted)

Trying with other applications:

  • nsxiv, when installed with make install from the git repo, gets the segfault and crashes. (/usr/local/bin/nsxiv)
  • nsxiv, when reinstalled via pacman -S nsxiv, does not segfault or crash, but gets the (deleted) marker. (/usr/bin/nsxiv)
  • aslstatus: it seems that a random subset of its modules segfaults each time, and then the whole thing restarts with a new PID, so I notice no crash.
  • Other things reinstalled with pacman -S had no issue either, but got the (deleted) marker. Nothing peculiar in the journal.
  • In case the location matters somehow (/usr/local/bin vs. /usr/bin): aslstatus gets installed to /usr/bin by make install and still segfaults, so the location seems to have no effect.

pacman nsxiv:

% sudo lsof | grep -iE "COMMAND|bin/nsxiv"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
nsxiv     3837                              km txt       REG              254,0      88712    3546241 /usr/bin/nsxiv

% sudo pacman -S nsxiv

% sudo lsof | grep -iE "COMMAND|bin/nsxiv"
COMMAND    PID  TID TASKCMD               USER  FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
nsxiv     3837                              km txt       REG              254,0      88712    3546241 /usr/bin/nsxiv (deleted)

git nsxiv:

Oct 07 19:37:42 zn kernel: nsxiv[3677]: segfault at 5aa6 ip 0000000000005aa6 sp 00007fff7170e3e8 error 14 likely on CPU 3 (core 3, socket 0)
Oct 07 19:37:42 zn kernel: Code: Unable to access opcode bytes at 0x5a7c.

aslstatus:

Oct 07 19:43:56 zn kernel: cpu_percentage[3771]: segfault at 3dac ip 0000000000003dac sp 00007b49c3dffd58 error 14 likely on CPU 2 (core 2, socket 0)
Oct 07 19:43:56 zn kernel: Code: Unable to access opcode bytes at 0x3d82.

[...]

Oct 07 19:45:24 zn kernel: temperature[4276]: segfault at 55e7 ip 00000000000055e7 sp 0000763b359ffd58 error 14
Oct 07 19:45:24 zn kernel: ram_used[4277]: segfault at 5481 ip 0000000000005481 sp 0000763b34fffd58 error 14
Oct 07 19:45:24 zn kernel:  likely on CPU 0 (core 0, socket 0)
Oct 07 19:45:24 zn kernel:  likely on CPU 2 (core 2, socket 0)
Oct 07 19:45:24 zn kernel:
Oct 07 19:45:24 zn kernel: Code: Unable to access opcode bytes at 0x55bd.
Oct 07 19:45:24 zn kernel: Code: Unable to access opcode bytes at 0x5457.

gokberkgunes added a commit to gokberkgunes/st that referenced this issue Nov 9, 2024