Error from wrong header parsing when compiling CentOS kernel #47

Open
DanielKriz opened this issue May 6, 2021 · 10 comments

Comments

@DanielKriz

When building the CentOS kernel (version 4.18.0-193.el8) with gllvm, the build sometimes fails on a non-existent header (usually named with a single letter plus the .h file extension, for example r.h). I suspect this could be caused by a parsing bug.

The kernel and its config were acquired using rhel-kernel-get.

Environment

  • Linux 64-bit, Fedora 34
  • go version go1.16.3 linux/amd64
  • the most recent version of gllvm

Example of error

fixdep: error opening file: r.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/crypto/aesni-intel_glue.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/crypto] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

from

  gclang -Wp,-MD,arch/x86/crypto/.aesni-intel_glue.o.d -nostdinc -isystem /usr/lib64/clang/12.0.0/include -I./arch/x86/include -I./arch/x86/include/generated   -I./include/drm-backport -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -no-integrated-as -fno-PIE -DCC_HAVE_ASM_GOTO -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -fno-delete-null-pointer-checks -Wno-frame-address -Wno-int-in-bool-context -O2 -Werror -Wframe-larger-than=2048 -fstack-protector-strong -Wno-format-invalid-specifier -Wno-gnu -Wno-address-of-packed-member -Wno-tautological-compare -mno-global-merge -Wno-unused-const-variable -g -gdwarf-4 -pg -mfentry -DCC_USING_FENTRY -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fno-merge-all-constants -fno-stack-check -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Werror=incompatible-pointer-types -fmacro-prefix-map=./= -Wno-initializer-overrides -Wno-unused-value -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-uninitialized -Wno-pointer-to-enum-cast    -DKBUILD_BASENAME='"aesni_intel_glue"' -DKBUILD_MODNAME='"aesni_intel"' -c -o arch/x86/crypto/.tmp_aesni-intel_glue.o arch/x86/crypto/aesni-intel_glue.c

Interestingly, there is no header ending with r.h. Just to be sure, I checked all preceding gclang calls, and none of them has such a header either.

How to reproduce

This error usually happens when compiling the Linux kernel with multiple threads/cores (the -j option); it is almost guaranteed to happen at some point during compilation.
When no number of cores is specified, it happens rarely, usually only once per compilation.
When called with make -j1 CC=gclang (because, for example, the ninja build system needs to be called with -j1 to use just one core), it seems to be almost guaranteed to occur.

Is there a way to fix this?

@ianamason
Member

I have searched high and low for this "race".

Usually restarting the build works. I have no idea what is going on.

I used the Go race detector. No luck (i.e., no races).

Searched through the code. Gave up in the end. It was very very annoying.

@DanielKriz
Author

DanielKriz commented May 7, 2021

Restarting the build usually did work, but I have now run into something new: I get this error every time I start a new build (even after calling make clean and starting over). I even tried downloading a fresh copy of the kernel.

fixdep: error opening file: o.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/kernel/crash_dump_64.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/kernel] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

Full log of the same error, run with KBUILD_VERBOSE=1: error_crash_dump64.log
One rather unusual instance of this error is this one:

fixdep: error opening file: elper.h: No such file or directory
make[2]: *** [scripts/Makefile.build:313: arch/x86/crypto/cast6_avx_glue.o] Error 2
make[1]: *** [scripts/Makefile.build:553: arch/x86/crypto] Error 2
make: *** [Makefile:1069: arch/x86] Error 2

It is unusual because the name is more than a single letter plus the .h file extension. Full log: error_elper.log

As you said, it usually only required restarting the build, but now I get it every few compiler calls (perhaps it could be a clang issue?).

Could you give me some pointers to the gllvm source code and how the parsing works? I really want to help with this issue.

Edit: Another interesting thing: clang doesn't know the option --mfentry and I don't need it for my purposes, so I removed it from the makefile, and after that this error hasn't occurred. This suggests that it could really be on clang's side.

@ianamason
Member

Excellent. Thank you @DanielKriz! I will give you a tour later today.

@ianamason
Member

ianamason commented May 7, 2021

Let's concentrate on gclang; gclang++ is almost identical.
The entry point is gllvm/cmd/gclang/main.go which passes
all the work on to shared.Compile(args, "clang"), args here being
the cmd line args not including gclang.

The parsing is done on line 63 of shared/compiler.go
All the parsing is located in shared/parser.go

The parser's job is to:

  1. figure out if we need to actually produce bitcode
  2. divide the options into link time, compile time etc ...

Note that there is no concurrency yet. Once we have parsed the cmds into a
ParserResult object we then decide what to do.

This is where the concurrency occurs, assuming we have to produce bitcode.
Lines 85 and 86 of shared/compiler.go are the two concurrent jobs that
produce the object file(s), and produce the bitcode file(s), respectively.

The parser is long but pretty straightforward: it tries to do exact matches first, then
does some pattern matching. The parser grows as the command lines to clang grow.
You will see comments on the more obscure switches. The kernel of course is the mother lode
of obscure switches.

I am pretty sure, but you should check, that once created the ParserResult object pr is not mutated.

So really not a lot of room for parallel weirdness.

@ianamason
Member

ianamason commented May 7, 2021

Note that you could pretty easily instrument the code to dump each pr object out to the log.
Something like:

LogWarning("pr: %v", pr)

say by adding this to line 64 of compiler.go.
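As a hedged illustration of what such a dump could look like, here is a standalone sketch. It uses a hypothetical toyResult struct rather than the real ParserResult, and the standard log package in place of gllvm's own logging. Using %+v instead of %v makes fmt print the field names, which tends to be easier to read when comparing dumps from many compiler calls.

```go
package main

import (
	"fmt"
	"log"
)

// toyResult is a hypothetical stand-in for ParserResult;
// the field names are illustrative only.
type toyResult struct {
	InputFiles  []string
	OutputFile  string
	CompileOnly bool
}

// dump formats the value with field names included (%+v).
func dump(pr toyResult) string {
	return fmt.Sprintf("pr: %+v", pr)
}

func main() {
	pr := toyResult{
		InputFiles:  []string{"aesni-intel_glue.c"},
		OutputFile:  ".tmp_aesni-intel_glue.o",
		CompileOnly: true,
	}
	// The logged message is:
	// pr: {InputFiles:[aesni-intel_glue.c] OutputFile:.tmp_aesni-intel_glue.o CompileOnly:true}
	log.Println(dump(pr))
}
```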

@ianamason
Member

By the way, do you mean that

Another interesting thing: clang doesn't know the option --mfentry and I don't need it for my purposes, so I removed it from the makefile, and after that this error hasn't occurred. This suggests that it could really be on clang's side.

All errors disappear, or just one particular type of error?

@DanielKriz
Author

DanielKriz commented May 7, 2021

Just this particular error, but after some time (kernel compilation takes very long) it threw that error again, only much later. Unfortunately, no matter how many times I restarted the build, the error persisted.
Edit: I also want to thank you for the hints.

@ianamason
Member

Any progress on this mystery @DanielKriz?

@DanielKriz
Author

My main suspect is clang, because I have two systems: Fedora 34 with clang 12 and Ubuntu 18.04 with clang 11. On Fedora the bug happens very often and even restarting the build doesn't help. On Ubuntu, on the other hand, the build progresses with 0-2 occurrences.

I am preparing some containers to try the build with different versions of clang, and I am also trying to understand the gllvm codebase and Go (I started learning it just because of this bug; it's a pretty neat language, I must say).

I will update you as I get all the logs from the containers.

@ianamason
Member

Interesting. One possibility is that the two almost identical concurrent calls to clang somehow manage to interfere
with one another. Since they are separate processes this must be via the filesystem or some such other external state.
I wonder if they are writing/reading/removing the same auxiliary files.
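One way to test that idea is to compare the argument lists of the two jobs and look for side files that both would write. The sketch below is only a starting point: it knows about the -Wp,-MD,<file>, -Wp,-MMD,<file>, and -MF <file> spellings of the dependency-file option, and the helper names depFiles and sharedFiles are hypothetical. Whether gllvm actually forwards the kernel's -Wp,-MD,... flag to both jobs is an assumption to check, not something established in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// depFiles returns the dependency files a compiler call is asked to write.
// Only a few spellings are recognized here.
func depFiles(args []string) []string {
	var out []string
	for i, a := range args {
		switch {
		case strings.HasPrefix(a, "-Wp,-MD,"):
			out = append(out, strings.TrimPrefix(a, "-Wp,-MD,"))
		case strings.HasPrefix(a, "-Wp,-MMD,"):
			out = append(out, strings.TrimPrefix(a, "-Wp,-MMD,"))
		case a == "-MF" && i+1 < len(args):
			out = append(out, args[i+1])
		}
	}
	return out
}

// sharedFiles lists dependency files named by both argument lists.
func sharedFiles(a, b []string) []string {
	seen := map[string]bool{}
	for _, p := range depFiles(a) {
		seen[p] = true
	}
	var out []string
	for _, p := range depFiles(b) {
		if seen[p] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Assumed shape of the two jobs: same flags, different outputs.
	common := []string{"-Wp,-MD,arch/x86/crypto/.aesni-intel_glue.o.d", "-c"}
	objectJob := append(append([]string{}, common...), "-o", "glue.o", "glue.c")
	bitcodeJob := append(append([]string{}, common...), "-emit-llvm", "-o", "glue.o.bc", "glue.c")
	// Prints: [arch/x86/crypto/.aesni-intel_glue.o.d]
	fmt.Println(sharedFiles(objectJob, bitcodeJob))
}
```

A non-empty result would mean both processes are asked to write the same .d file, which is the file fixdep later reads.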
