Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot represent a difference across sections due to debuginfo #61932

Closed
nikic opened this issue Apr 4, 2023 · 16 comments
Closed

Cannot represent a difference across sections due to debuginfo #61932

nikic opened this issue Apr 4, 2023 · 16 comments

Comments

@nikic
Copy link
Contributor

nikic commented Apr 4, 2023

Running https://gist.github.com/nikic/6f60958ed4295f0bcd48797b5e541aba through llc -filetype=obj we get errors like this:

<unknown>:0: error: Cannot represent a difference across sections
<unknown>:0: error: Cannot represent a difference across sections
<unknown>:0: error: Cannot represent a difference across sections
<unknown>:0: error: Cannot represent a difference across sections

This is related to the debug info in some way.

@llvmbot
Copy link
Collaborator

llvmbot commented Apr 4, 2023

@llvm/issue-subscribers-debuginfo

@nikic
Copy link
Contributor Author

nikic commented Apr 4, 2023

Bisected to d20e4a1. cc @DianQK @dwblaikie

@jmorse
Copy link
Member

jmorse commented Apr 4, 2023

It looks like test2 being "cold" controls this -- looking at the textual assembly, .debug_ranges is being generated thus:

.Ldebug_ranges0:
    .quad   .Ltmp0-.Lfunc_begin1
    .quad   .Ltmp1-.Lfunc_begin1
    .quad   .Ltmp2-.Lfunc_begin1
    .quad   .Ltmp3-.Lfunc_begin1

Where the temporary symbols are in the test1 function, and .Lfunc_begin1 is the start of test2. test2 ends up in a different section to test1, so this is an illegal cross-section reference.

Exactly why this happens is pretty unclear though: if I move test2 to the start of the input file, .debug_ranges still tries to describe test1 in terms of the end of test2, so there's something in the debug-info metadata that causes the two to be connected. test1's scope appears to be deeply buried in some types that are used for the global variable in !8, which is in CU !4's retained_nodes list. However, test1 says that it belongs to CU !1. Breaking this connection at various points in the type-scope chain, or test1's CU, makes the error go away. This gives me the impression that this is a debug-info deduplication / linking error? Exactly how that translates to test1's .debug_ranges connecting to another function still isn't clear.

(The type hierarchy metadata isn't my strong point).

@DianQK DianQK self-assigned this Apr 4, 2023
@DianQK
Copy link
Member

DianQK commented Apr 4, 2023

It may be time to apply https://reviews.llvm.org/P8297 to solve various such problems. @dwblaikie
I will address why this happened as soon as possible to decide what to do next.

@DianQK
Copy link
Member

DianQK commented Apr 5, 2023

@nikic @jmorse I probably figured out why the debug info is often inconsistent under LTO here.
We solved some issues by determining the CU based on the parent DIE. So I want to apply the P8297 patch. I have added more construct method adjustments based on P8297. But it didn't solve the issue.

But we should still consider applying the idea of P8297 while putting the method of constructing the DIE into DwarfDebug. I had a headache switching CU debugging in DwarfCompileUnit.cpp.

The core issue I found is that DISubprogram operations usually use getUnit, while its DIE uses getScope. Once these two are associated with different CUs, associated DIEs may use another CU.

If we can make the associated CU for each DIE unambiguous, these kinds of problems should all go away. But what are the rules for determining who the parent DIE of a DIE is? Ignoring the unit and following the scope chain might be the way to go.

I have created a patch that is just working.

@nikic
Copy link
Contributor Author

nikic commented May 2, 2023

It looks like there is no consensus on the proposed patch (https://reviews.llvm.org/D147620). @dwblaikie is arguing that the debuginfo should not be considered valid in the first place. It sounds like for LLVM 17, a new debuginfo verifier check needs to be implemented and Rust's debuginfo generation adjusted to comply with the new check.

However, this regression still needs to be fixed for LLVM 16, in a way that works with the existing debuginfo format. This is a pretty critical issue, because it breaks building some Rust projects (including Firefox) with debuginfo and LTO. As such, I'd like to propose that we just revert the original change for LLVM 16 and then work on better debuginfo verification (and fixing the fallout it entails) for LLVM 17.

/branch nikic/llvm-project/revert-pr61932

@llvmbot
Copy link
Collaborator

llvmbot commented May 2, 2023

/pull-request llvm/llvm-project-release-prs#438

@dwblaikie
Copy link
Collaborator

What's the behavior before the regression (which was a bug fix for incorrect ranges base addresses, I think?), though? I think it was probably also broken - though perhaps silently so (& maybe that's acceptable, if unfortunate)?

@nikic
Copy link
Contributor Author

nikic commented May 2, 2023

@dwblaikie I'm not sure about the rest of it, but at least debug_ranges used to be well-formed. Here's the diff between LLVM 15 and 16: https://llvm.godbolt.org/z/1EsKP73Y8

Though for now my primary concern is indeed just that debuginfo generation doesn't break compilation outright -- if the generated DWARF is inaccurate in some way that would be a longer term concern.

@dwblaikie
Copy link
Collaborator

@DianQK - do you have a sense of/diff of the effect on the example in this bug, and in the original patch, of reverting your patch? (the patch suggests it fixes some base address issues)

Is it the case that reverting this patch doesn't get back to LLVM 15 behavior?

Rejecting the LLVM metadata in the verifier would cause it to drop all the debug info that contains a verification failure - this'd be about the same as building the Rust code without debug info.

@DianQK
Copy link
Member

DianQK commented May 2, 2023

Reverting the patch is fine for me. This patch can cause Rust to fail to generate object files in some cases. Also, I realized that switching CU in such a depth of logic is not a good way to go. This patch cannot handle base address adjustment for all shared DIE cases.

After reverting, you may also get Cannot represent a difference across sections errors. Because the problem of DICompositeType across CU still exists. But none of this happened in practice, probably because I missed some ELF generation implementations.

We can continue to discuss a better solution in D147620. In fact I have reverted D136039 in D147620 (D58538 also).

I added a new comment to D147620, and the discovery at 73d697c makes me think that the new patch is a more suitable solution.

@nikic
Copy link
Contributor Author

nikic commented May 5, 2023

rust-lang/rust#111167 fixes the debuginfo generation on the rustc side, in line with @dwblaikie's recommendation. Assuming that gets approved for backport there will be no need to do anything for LLVM 16.

We should still implement a verifier check on the LLVM side. There is a suspicion that this has also caused a number of other issues that were never fully understood, so we'll want to save other people the trouble :)

@dwblaikie
Copy link
Collaborator

rust-lang/rust#111167 fixes the debuginfo generation on the rustc side, in line with @dwblaikie's recommendation. Assuming that gets approved for backport there will be no need to do anything for LLVM 16.

thumbs up

We should still implement a verifier check on the LLVM side. There is a suspicion that this has also caused a number of other issues that were never fully understood, so we'll want to save other people the trouble :)

Yeah... would be good. @DianQK any chance you'd be up for trying that? (a verifier check that checks inlined subroutines and definition subprograms and checks that they refer to a declaration, never directly to their enclosing class?)

@DianQK
Copy link
Member

DianQK commented May 9, 2023

I can try it out in my free time. This verifier check should be helpful. This issue has existed for a long time in Rust and Swift.

@dwblaikie
Copy link
Collaborator

Hmm do you have examples of the problem in Swift? Are they going to change to add function declarations like Rust has changed now? (@adrian-prantl @JDevlieghere )

@nikic nikic removed this from the LLVM 16.0.X Release milestone May 13, 2023
@DianQK
Copy link
Member

DianQK commented Jun 4, 2023

I have submitted a validation patch: https://reviews.llvm.org/D152095.

@DianQK DianQK closed this as completed in 2ee4d03 Jul 25, 2023
eymay pushed a commit to eymay/llvm-project that referenced this issue Jul 28, 2023
…Type when enabling ODR.

Resolve llvm#61932. We should add the validation.

LLVM can't handle IR where subprogram definitions are nested within DICompositeType when doing LTO builds, because there's no good way to cross the CU boundary to insert a nested DISubprogram definition in one CU into a type defined in another CU.

The test cases `cross-cu-inlining-2.ll` and `cross-cu-inlining-ranges.ll` can be deleted. In the `cross-cu-inlining-2.ll`, the low pc and high pc of the CU are also incorrect.

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D152095
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants