Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tst-mmap test fails when compiled with GCC 7 #967

Closed
wkozaczuk opened this issue May 8, 2018 · 14 comments
Closed

tst-mmap test fails when compiled with GCC 7 #967

wkozaczuk opened this issue May 8, 2018 · 14 comments

Comments

@wkozaczuk
Copy link
Collaborator

The error is repeatable. Also it is running under Docker container.

TEST tst-mmap.so OSv
eth0: 192.168.122.15
page fault outside application, addr: 0x0000100001603f10
[registers]
RIP: 0x0000000000397822 <elf::object::arch_relocate_jump_slot(unsigned int, void*, long)+34>
RFL: 0x0000000000010202 CS: 0x0000000000000008 SS: 0x0000000000000010
RAX: 0x00000000004b4110 RBX: 0x0000100001603f10 RCX: 0x000000000098d9d8 RDX: 0x0000000000000002
RSI: 0xffff80000250e680 RDI: 0x00002000001ff9b0 RBP: 0x00002000001ff9d0 R8: 0xfffffffffffff8e0
R9: 0x0000000000c70598 R10: 0x0000000000000007 R11: 0x0000000000000009 R12: 0xffff900001444010
R13: 0x0000000000c70598 R14: 0x0000000000000007 R15: 0x0000000000000009 RSP: 0x00002000001ff9b0
Aborted

[backtrace]
0x0000000000330ff5 <???+3346421>
0x000000000033374a <mmu::vm_fault(unsigned long, exception_frame*)+234>
0x000000000039365e <page_fault+142>
0x00000000003924f6 <???+3745014>
0x00000000003475cc <elf::object::resolve_pltgot(unsigned int)+332>
0x00000000003476e6 <elf_resolve_pltgot+70>
0x000000000039212a <???+3744042>
0x00002000001ffecf <???+2096847>
0x0000000000344212 <elf::program::init_library(int, char**)+386>
0x000000000020dc7a osv::application::main()+58
0x0000000000423798 <???+4339608>
0x00000000004562e5 <???+4547301>
0x00000000003f0ba6 <thread_main_c+38>
0x0000000000393472 <???+3748978>
Test tst-mmap.so FAILED
Traceback (most recent call last):
File "./scripts/test.py", line 188, in
main()
File "./scripts/test.py", line 175, in main
run_tests()
File "./scripts/test.py", line 166, in run_tests
run(tests_to_run)
File "./scripts/test.py", line 86, in run
run_test(test)
File "./scripts/test.py", line 61, in run_test
test.run()
File "/git-repos/osv/scripts/tests/testing.py", line 29, in run
run_command_in_guest(self.command).join()
File "/git-repos/osv/scripts/tests/testing.py", line 163, in join
raise Exception('Guest failed')
Exception: Guest failed

Here is the exact gcc version:
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.

I wonder if this has anything to do with the compilation warnings:
CXX tests/tst-mmap-file.cc
CXX tests/tst-mmap.cc
LD tests/tst-mmap-file.so
In file included from /git-repos/osv/tests/tst-mmap.cc:11:0:
/git-repos/osv/tests/tst-mmap.hh: In function 'bool try_write(int (*)())':
/git-repos/osv/tests/tst-mmap.hh:76:37: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
char byte = (volatile char)&func;
^~~~
/git-repos/osv/tests/tst-mmap.hh:77:25: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
(volatile char)&func = byte;
^~~~
LD tests/tst-mmap.so

@nyh
Copy link
Contributor

nyh commented May 8, 2018

I cannot reproduce this on my machine, with gcc 7.3.1 and without KVM (scripts/run.py -p qemu -e tests/tst-mmap.so).
If the backtrace it to believed (one from GDB could have been clearer), it seems the crash happened while loading the test - in the program::init_library() function, so I doubt a bug inside tst-mmap is responsible for this. If you can reproduce it, can you please run the kernel with --verbose and see if you can see get more insites on which library was being loaded while the bug happening.

Since init_library is a new function (with moved code), it is also possible that this is a regression. If you can easily reproduce this bug, can you please try it with a slightly version of OSv, before init_library() was introduced (for the go support)?

By code inspection, I think I found a potential bug:

In the past, the init_library() code was run by program::get_library() inside a SCOPE_LOCK(_mutex).
Today, it is run later in app.cc, but it is not run with this mutex locked. Maybe we have a problem with init_library() code racing with other threads also adding libraries or doing other ELF stuff? You can try adding SCOPE_LOCK(_mutex) to the beginning of init_library() and seeing if it helps (our mutex is recursive, so it's fine for init_library to take the lock even if its caller has it).

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented May 8, 2018 via email

@wkozaczuk
Copy link
Collaborator Author

I just checkout before init_library() method was added for Golang and still getting error - stack trace different but I think in the same place:

  TEST tst-mmap.so                        OSv v0.24-510-gc604860c
eth0: 192.168.122.15
page fault outside application, addr: 0x0000100000e03f10
[registers]
RIP: 0x0000000000396e12 <elf::object::arch_relocate_jump_slot(unsigned int, void*, long)+34>
RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x00000000004b2f70  RBX: 0x0000100000e03f10  RCX: 0x000000000098aa38  RDX: 0x0000000000000002
RSI: 0xffff8000019d1a80  RDI: 0x00002000000ff270  RBP: 0x00002000000ff290  R8:  0xfffffffffffff8e0
R9:  0x0000000000c6d598  R10: 0x0000000000000007  R11: 0x0000000000000009  R12: 0xffff90000143f010
R13: 0x0000000000c6d598  R14: 0x0000000000000007  R15: 0x0000000000000009  RSP: 0x00002000000ff270
Aborted

[backtrace]
0x0000000000330f15 <???+3346197>
0x000000000033366a <mmu::vm_fault(unsigned long, exception_frame*)+234>
0x0000000000392c4e <page_fault+142>
0x0000000000391ae6 <???+3742438>
0x000000000034724c <elf::object::resolve_pltgot(unsigned int)+332>
0x0000000000347366 <elf_resolve_pltgot+70>
0x000000000039171a <???+3741466>
0x00002000000ff75f <???+1046367>
0x0000000000349068 <elf::program::get_library(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)+472>
0x00000000004225f2 <osv::application::application(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*)+674>
0x0000000000422dde <osv::application::run(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*)+110>
0x0000000000422feb <osv::application::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+27>
0x000000000021596c <do_main_thread(void*)+1964>
0x0000000000455dc5 <???+4545989>
0x00000000003f0ba6 <thread_main_c+38>
0x0000000000392a62 <???+3746402>
Test tst-mmap.so FAILED
Traceback (most recent call last):
  File "./scripts/test.py", line 187, in <module>
    main()
  File "./scripts/test.py", line 174, in main
    run_tests()
  File "./scripts/test.py", line 165, in run_tests
    run(tests_to_run)
  File "./scripts/test.py", line 85, in run
    run_test(test)
  File "./scripts/test.py", line 60, in run_test
    test.run()
  File "/git-repos/osv2/scripts/tests/testing.py", line 29, in run
    run_command_in_guest(self.command).join()
  File "/git-repos/osv2/scripts/tests/testing.py", line 163, in join
    raise Exception('Guest failed')
Exception: Guest failed

Wanted to add that I am doing it with KVM enabled (QEMU 2.11). Also from docker container running Ubuntu 18.04

@nyh
Copy link
Contributor

nyh commented May 8, 2018 via email

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented May 8, 2018 via email

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented May 9, 2018

Some extra debug:

Success with tst-mmap-file.so:

Calling load_object for libvdso.so
---> relocate_pltgot()
Calling init_library for libvdso.so
In init_library
Calling load_object for /tests/tst-mmap-file.so
Calling load_object for libc.so.6
---> relocate_pltgot()
Before callin init_library from main() for /tests/tst-mmap-file.so
In init_library
---> resolve_pltgot()
--> arch_relocate_jump_slot: b
---> resolve_pltgot()
--> arch_relocate_jump_slot: 5
PASS: open
---> resolve_pltgot()
--> arch_relocate_jump_slot: 6
PASS: ftruncate
---> resolve_pltgot()
--> arch_relocate_jump_slot: 4
---> resolve_pltgot()
--> arch_relocate_jump_slot: 9
PASS: write pattern to MAP_SHARED

failure with tst-mmap.so

Calling load_object for libvdso.so
---> relocate_pltgot()
Calling init_library for libvdso.so
In init_library 
Calling load_object for /tests/tst-mmap.so
Calling load_object for libstdc++.so.6
Calling load_object for libgcc_s.so.1
Calling load_object for libc.so.6
---> relocate_pltgot() 
Calling load_object for libc.so.6
---> relocate_pltgot()
Before callin init_library from main() for /tests/tst-mmap.so
In init_library 
---> resolve_pltgot() 
--> arch_relocate_jump_slot: 7
page fault outside application, addr: 0x0000100001603f10
[registers] 
RIP: 0x000000000039773e <elf::object::arch_relocate_jump_slot(unsigned int, void*, long)+62>
RFL: 0x0000000000010206  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x00000000004b3ec0  RBX: 0x0000000000000007  RCX: 0x000000000098d838  RDX: 0x0000000000000002
RSI: 0xffff8000019fff80  RDI: 0x00002000001ff9a0  RBP: 0x00002000001ff9d0  R8:  0xfffffffffffff8e0
R9:  0x0000000000c70598  R10: 0x0000000000000007  R11: 0x0000000000000009  R12: 0x0000100001603f10
R13: 0x0000000000c70598  R14: 0x0000000000000007  R15: 0x0000000000000009  RSP: 0x00002000001ff9a0
Aborted

[backtrace]
0x0000000000331045 <???+3346501>
0x000000000033379a <mmu::vm_fault(unsigned long, exception_frame*)+234>
0x000000000039355e <page_fault+142>
0x00000000003923f6 <???+3744758>
0x000000000034746c <elf::object::resolve_pltgot(unsigned int)+348>
0x0000000000347586 <elf_resolve_pltgot+70>
0x000000000039202a <???+3743786>
0x00002000001ffecf <???+2096847>
0x0000000000344272 <elf::program::init_library(int, char**)+402>
0x000000000020dcb9 <osv::application::main()+121>
0x0000000000423798 <???+4339608>
0x0000000000456095 <???+4546709>
0x00000000003f0ba6 <thread_main_c+38>
0x0000000000393372 <???+3748722>
 

Another success with similar linker settings:

Calling load_object for libvdso.so
---> relocate_pltgot()
Calling init_library for libvdso.so
In init_library 
Calling load_object for /tests/tst-elf-permissions.so
Calling load_object for libstdc++.so.6
Calling load_object for libc.so.6
---> relocate_pltgot()
Before callin init_library from main() for /tests/tst-elf-permissions.so
In init_library 
---> resolve_pltgot()
--> arch_relocate_jump_slot: 4
---> resolve_pltgot()
--> arch_relocate_jump_slot: 6
---> resolve_pltgot()
--> arch_relocate_jump_slot: 9
Running elf segments permissions tests
---> resolve_pltgot()
--> arch_relocate_jump_slot: e
---> resolve_pltgot()
--> arch_relocate_jump_slot: c
---> resolve_pltgot()
--> arch_relocate_jump_slot: 1
elf segments permissions tests succeeded

Note that tst-mmap.so loads one extra library - libgcc_s.so.1

Does it potentially have anything to do with #687 and the fact that we still take libgcc_s.so.1 from external?

@nyh
Copy link
Contributor

nyh commented May 9, 2018 via email

@wkozaczuk
Copy link
Collaborator Author

Extra detailed debug for failed scenario:

Calling load_object for libvdso.so ... and relocate() later
---> load_object for /libvdso.so - after setprivate()
---> load_object for /libvdso.so - before load_segments()
---> load_object for /libvdso.so - before relocate()
---> relocate_pltgot()
---> relocate_pltgot(): bind_now: 0
---> relocate_pltgot(): nrel: 3
Calling init_library for libvdso.so
In init_library
Calling load_object for /tests/tst-mmap.so ... and relocate() later
---> load_object for /tests/tst-mmap.so - after setprivate()
---> load_object for /tests/tst-mmap.so - before load_segments()
Calling load_object for libstdc++.so.6 ... and relocate() later
---> load_object for libstdc++.so.6 - returning 1 bc was in _files
Calling load_object for libgcc_s.so.1 ... and relocate() later
---> load_object for /usr/lib/libgcc_s.so.1 - after setprivate()
---> load_object for /usr/lib/libgcc_s.so.1 - before load_segments()
Calling load_object for libc.so.6 ... and relocate() later
---> load_object for libc.so.6 - returning 1 bc was in _files
---> load_object for /usr/lib/libgcc_s.so.1 - before relocate()
---> relocate_pltgot()
---> relocate_pltgot(): bind_now: 0
---> relocate_pltgot(): nrel: 44
Calling load_object for libc.so.6 ... and relocate() later
---> load_object for libc.so.6 - returning 1 bc was in _files
---> load_object for /tests/tst-mmap.so - before relocate()
---> relocate_pltgot()
---> relocate_pltgot(): bind_now: 0
---> relocate_pltgot(): nrel: 27
Before callin init_library from main() for /tests/tst-mmap.so
In init_library
---> resolve_pltgot()
--> arch_relocate_jump_slot: 7
page fault outside application, addr: 0x0000100001603f10
[registers]
RIP: 0x0000000000397ace <elf::object::arch_relocate_jump_slot(unsigned int, void*, long)+62>

I noticed (maybe wrong) - why bind_now is 0 (false)?

When I run
readelf -a build/release/tests/tst-mmap.so | grep NOW
0x000000000000001e (FLAGS) BIND_NOW
0x000000006ffffffb (FLAGS_1) Flags: NOW

Things look correct.

Could it be that our logic in loading/initializing ELF relies on certain order of information and it this case (the way things are linked in tst-mmap.so) is different?

@nyh
Copy link
Contributor

nyh commented May 9, 2018

Yes, I believe the fact that bind_now==0 for tst-mmap.so is the underlying cause of the bug (together
with your linker's asking this segment to be read-only after loading the object, which causes this crash instead of a more subtle problem).

bind_now is set as:

bool bind_now = dynamic_exists(DT_BIND_NOW);

This does NOT test the "BIND_NOW" flag you noticed, but rather a BIND_NOW section: On my machine (where this bug doesn't happen), I get this:

$ readelf -a build/release.x64/tests/tst-mmap.so
...
Dynamic section at offset 0x3cd0 contains 29 entries:
  Tag        Type                         Name/Value
...
 0x0000000000000018 (BIND_NOW)           

It seems in your machine, this section is missing, and instead you have a flag. I'll investigate and prepare a patch for you to try.

@nyh
Copy link
Contributor

nyh commented May 9, 2018

Hi @wkozaczuk thanks for all the useful debugging information. I just sent a patch to the mailing list hopefully fixing this bug. Can you please test it? Thanks.

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented May 9, 2018 via email

@nyh
Copy link
Contributor

nyh commented May 9, 2018

Excellent, thanks.
I guessed that yes, it's a different linker version (I have ld.bfd --version showing GNU ld version 2.29-13.fc27, what do you have?), although I don't know why and I can't find any evidence for such a change in ld's change log. Some documents I read suggest that DT_BIND_NOW is considered the "old way" and DF_BIND_NOW is the "new" way, but I can't for the life of me understand what might prompt them to do this change now, or to not leave both markers behind, just in case.

@nyh nyh closed this as completed in cc6e8f5 May 9, 2018
@nyh
Copy link
Contributor

nyh commented May 9, 2018

I was curious and looked at ld's code (GNU binutils, bfd/elflink.c). Indeed, If ld is run with "--enable-new-dtags", then a DT_FLAGS entry is created, and DT_BIND_NOW is not created - as in your machine.
This flag used to be off by default. But in June 2017, commit 6734f10ae9af6cfd47e70baa2cd3e326afabe5ed of the binutils repository, they added a "--enable-new-dtags" configure option, and when it is on - and it is on by default - it sets the runtime parameter to on by default.
Looking at the dates, I guess that GNU bintuils 2.29 (which I use) didn't have that commit yet, and binutils 2.30 (which you use?) did.
Sad that this change isn't mentioned in the ld "NEWS" file.
So this is why we got this new behavior. Mystery solved.

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented May 9, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants