Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project does not build with clang, and does not link with GCC 10 #13

Open
ned14 opened this issue Mar 1, 2021 · 4 comments
Open

Project does not build with clang, and does not link with GCC 10 #13

ned14 opened this issue Mar 1, 2021 · 4 comments

Comments

@ned14
Copy link

ned14 commented Mar 1, 2021

Firstly, onload's external headers don't parse correctly with clang, so even if you build onload with GCC and then everything else uses clang, you can't #include onload from clang without error. Here is our patch:

        # openonload headers included by XXX are not compatible with clang, so fix that now
        file(GLOB_RECURSE openonloadheaders "${XXX_ROOT_DIR}/thirdparty/openonload/*/debug.h")
        foreach(openonloadheader ${openonloadheaders})
          file(READ "${openonloadheader}" origcontents)
          string(REPLACE "[%s=\"IPX_FMT\"]\"" "[%s=\" IPX_FMT \"]\"" newcontents "${origcontents}")
          if(NOT origcontents STREQUAL newcontents)
            message(STATUS "NOTE: Fixed illegal C++ in '${openonloadheader}'")
            file(WRITE "${openonloadheader}" "${newcontents}")
          endif()
        endforeach()

Secondly, attempting to build onload with clang fails badly. You appear to be using quite a bit of GCC-specific stuff. It would be great if you fixed this, as we currently have to special case only onload when we are compiling with clang.

Thirdly, onload won't link when built with GCC 10 due to:

ci_bpf_oobpf_elf_sh.o:(.bss+0x0): multiple definition of `citp_signal_data'
ci_bpf_oobpf_sh.o:(.bss+0x0): first defined here
collect2: error: ld returned 1 exit status
make[3]: *** [oobpfintf0.so] Error 1

Also, your configure script can't handle GCC 10:

gcc_major_version() {
  gcc --version |head -1 |sed "s/^gcc.*) \([1-9]*\)\..*$/\1/"
}

This doesn't account for GCC 10.

@rhughes-xilinx
Copy link

Hi, thanks for trying Onload.

Your second and third issues are no longer relevant to this repository - the offending code was removed/changed some time ago. For support with formal Onload releases please use [email protected]

Your first issue is something we should definitely clean up. To help us prioritise the problem, can you please tell us what the top-level file you originally #included in your source file was?

@ned14
Copy link
Author

ned14 commented Mar 1, 2021

Good to hear about the second and third issues being fixed. I'll pass that on to the others.

For the first issue, it's a widespread problem best illustrated using https://godbolt.org/z/q65nxG

To fix, you need spaces between all macros and their surrounding string literals, in all your header files. That's what the above cmake patch does for all debug.h files in all our many versions of openonload.

@abower-amd
Copy link
Collaborator

Hi Niall,

My colleague was interested in knowing which of the external header files you were trying to include from your application to help direct our effort.

You are also most welcome to raise a pull request with the changes your recommend!

Thanks,
Andy

@ned14
Copy link
Author

ned14 commented Mar 19, 2021

We currently use these open onloads:

  • onload-7.0.0.176
  • onload-7.1.0.265
  • openonload-201310-u1
  • openonload-201405-u2
  • openonload-201502-u3
  • openonload-201509
  • openonload-201606
  • openonload-201606-u1.2
  • openonload-201606-u1.3
  • openonload-201805
  • openonload-201811
  • openonload-201811-u1

As a result, we cannot avoid source patching in any case, but I figured it would be nice to let you know of the issue.

In terms of which header files we use, I think these are the approximate set:

  • #include <etherfabric/vi.h>
  • #include <etherfabric/pd.h>
  • #include <etherfabric/memreg.h>
  • #include <etherfabric/packedstream.h>
  • #include <ci/tools.h>
  • #include <ci/tools/ippacket.h>
  • #include <ci/net/ipv4.h>

In any case, it doesn't matter: you can't have macros in between string literals in C++ 11 onwards without whitespace separating them. So, for example, you can't have this in valid C++ 11 code: https://github.com/Xilinx-CNS/onload/blob/master/src/include/ci/tools/debug.h#L220 where "where [%s="IPX_FMT"]" would need to become "where [%s=" IPX_FMT "]" to become valid C++ 11.

If you search your source code for the case sensitive regex "[^"]+"[A-Z_]+"[^"]+", that gives a nice list of candidates to fix.

cns-ci-onload-xilinx pushed a commit that referenced this issue Feb 1, 2023
Deferring oo_exit_hook() fixes a stuck C++ application:

    #0  0x00007fd2d7afb87b in ioctl () from /lib64/libc.so.6
    #1  0x00007fd2d80c0621 in oo_resource_op (cmd=3221510722, io=0x7ffd15be696c, fp=<optimized out>) at /home/iteterev/lab/onload_internal/src/include/onload/mmap.h:104
    #2  __oo_eplock_lock (timeout=<synthetic pointer>, maybe_wedged=0, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:35
    #3  __ef_eplock_lock_slow (ni=ni@entry=0x20c8480, timeout=timeout@entry=-1, maybe_wedged=maybe_wedged@entry=0) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:72
    #4  0x00007fd2d80d7dbf in ef_eplock_lock (ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/onload/eplock.h:61
    #5  __ci_netif_lock_count (stat=0x7fd2d5c5b62c, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/ci/internal/ip_shared_ops.h:79
    #6  ci_tcp_setsockopt (ep=ep@entry=0x20c8460, fd=6, level=level@entry=1, optname=optname@entry=9, optval=optval@entry=0x7ffd15be6acc, optlen=optlen@entry=4) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/tcp_sockopts.c:580
    #7  0x00007fd2d8010da7 in citp_tcp_setsockopt (fdinfo=0x20c8420, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/tcp_fd.c:1594
    #8  0x00007fd2d7fde088 in onload_setsockopt (fd=6, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/sockcall_intercept.c:737
    #9  0x00007fd2d7dcb7dd in ?? ()
    #10 0x00007fd2d83392e0 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #11 0x000000000060102c in data_start ()
    #12 0x00007fd2d8339540 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #13 0x00000001d85426c0 in ?? ()
    #14 0x00007fd2d7fcbe08 in ?? ()
    #15 0x00007fd2d7a433c7 in __cxa_finalize () from /lib64/libc.so.6
    #16 0x00007fd2d7dcb757 in ?? ()
    #17 0x00007ffd15be6be0 in ?? ()
    #18 0x00007fd2d834f2a6 in _dl_fini () from /lib64/ld-linux-x86-64.so.2

Here, _fini() is a function that calls all library destructors. The
problem is that _fini() decides to run the C++ library destructor
*after* Onload and makes it operate on an invalid Onload state.

The patch leverages the fact that Glibc sets up _fini() after running
the last library constructor, so by manually installing the exit handler
(instead of providing a library destructor), Onload wins the race with
_fini().

There's still an issue if the user library sets a custom exit handler
with atexit() or on_exit() and makes intercepted system calls from
there.

Tested:

* RHEL 7.9/glibc 2.17
* RHEL 8.2/glibc 2.28
* RHEL 9.1/glibc 2.34

Thanks-to: Richard Hughes <[email protected]>
Thanks-to: Siân James <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants