clang-18 built kernel v6.11 leads to desktop graphics corruption + "AMDGPU(0): amdgpu_setup_kernel_mem failed" vs. same kernel built with gcc-14 runs ok (v6.11, x86_64) #2053

Open
ernsteiswuerfel opened this issue Sep 22, 2024 · 7 comments

Comments

@ernsteiswuerfel

ernsteiswuerfel commented Sep 22, 2024

I reported the amdgpu graphics corruption issue at https://gitlab.freedesktop.org/drm/amd/-/issues/3638. I got the issue starting with kernel v6.10.10, so I bisected it to the following commit:

[...]
# first bad commit: [675d6d34fc1c36a7cee0d10e06985fb1e6bc7746] drm/amdgpu: always allocate cleared VRAM for GEM allocations

drm/amdgpu: always allocate cleared VRAM for GEM allocations
This adds allocation latency, but aligns better with user
expectations.  The latency should improve with the drm buddy
clearing patches that Arun has been working on.

In addition this fixes the high CPU spikes seen when doing
wipe on release.

v2: always set AMDGPU_GEM_CREATE_VRAM_CLEARED (Christian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3528
Fixes: a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality")
Acked-by: Arunpravin Paneer Selvam <[email protected]>
Reviewed-by: Michel Dänzer <[email protected]> (v1)
Signed-off-by: Alex Deucher <[email protected]>
Cc: Arunpravin Paneer Selvam <[email protected]>
Cc: Christian König <[email protected]>
(cherry picked from commit 6c0a7c3c693ac84f8b50269a9088af8f37446863)
Cc: [email protected] # 6.10.x

It turns out this only happens when I build the kernel with clang-18, not when I build it with gcc-14. Reverting the above commit also 'fixes' the clang-18 built kernel, and I get no graphics corruption. The current mainline kernel v6.11 shows the same issue on my system.

clang and gcc kernel .config attached.
config_6110_zen3_gcc14.txt
config_6110_zen3_clang18.txt

@nathanchance
Member

nathanchance commented Sep 23, 2024

None of the differences I see between the configurations would appear to cause this. There is not much information to go on here. It is entirely possible this is a code problem that just happens to show up with clang due to optimization or code generation differences. Testing with CONFIG_KMSAN, CONFIG_KCSAN, or CONFIG_UBSAN may help unearth some clues. Some input from the AMD developers would be helpful; hopefully they will continue to investigate it.
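For readers who want to repeat the configuration comparison, here is a minimal sketch of how two kernel `.config` files could be diffed option by option. The helper names `parse_config` and `diff_configs` are hypothetical and not part of any kernel tooling; the kernel tree ships its own `scripts/diffconfig` for the same purpose. The sketch assumes the usual `.config` syntax, where enabled options appear as `CONFIG_FOO=value` and disabled ones as `# CONFIG_FOO is not set`.

```python
import re

# Assumed .config syntax: "CONFIG_FOO=value" or "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    An option that is absent from one file is reported as None on that side.
    """
    a = parse_config(a_text)
    b = parse_config(b_text)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }
```

Reading the two attached files (`config_6110_zen3_gcc14.txt` and `config_6110_zen3_clang18.txt`) and passing their contents to `diff_configs` would list every option whose value differs. Compiler-detection options such as `CONFIG_CC_IS_GCC` and `CONFIG_CC_IS_CLANG` are expected to show up in such a diff and can be ignored when looking for a cause.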

@ernsteiswuerfel
Author

No suspicious output with KASAN and/or UBSAN.

With KMSAN (no other sanitizers selected), however, the kernel doesn't boot at all. It gets stuck at the UEFI boot screen, which states that it will load the kernel image, but after that nothing happens for several minutes...

@nathanchance
Member

Thanks for double checking on the sanitizers. Another thing to check is if this happens with an older or newer version of LLVM, which may point to a regression.

> With KMSAN (no other sanitizers selected), however, the kernel doesn't boot at all. It gets stuck at the UEFI boot screen, which states that it will load the kernel image, but after that nothing happens for several minutes...

cc @ramosian-glider, that sounds unexpected?

@ernsteiswuerfel
Author

I'll check with llvm 19.1.1 once it's out and report back (also on the KMSAN issue).

@ramosian-glider

> With KMSAN (no other sanitizers selected), however, the kernel doesn't boot at all. It gets stuck at the UEFI boot screen, which states that it will load the kernel image, but after that nothing happens for several minutes...

Can you please share the kernel build config?

@ernsteiswuerfel
Author

@ramosian-glider Here's my v6.10.12 one. Same config without KMSAN boots just fine with my Ryzen 5950X.
config_61012_zen3.txt

@ernsteiswuerfel
Author

ernsteiswuerfel commented Oct 2, 2024

I have now built kernel v6.11.1 with LLVM 19.1.1 and can confirm the amdgpu graphics corruption is still there.

The kernel also still fails to boot with KMSAN enabled. I just opened #2054 for that one to avoid confusion.
