sfc driver does not load in a bare metal with i40e driver #7

Closed
shirshen12 opened this issue Jan 2, 2021 · 12 comments

Comments

@shirshen12

Hello @maciejj-xilinx

I have been trying to get Onload working on a:

40GbE NIC (X710, 4 x 10 bonded)
i40e driver, version 2.8.20-k (the stock version; same issue with the 2.13 i40e driver update as well)
64-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
CentOS 8, 4.18.0-240.1.1.el8_3.x86_64

Onload compiles fine, but when I reload it, I see this error:
Screenshot 2021-01-02 at 12 16 33 PM

Please see dmesg as well below:
Screenshot 2021-01-02 at 12 17 28 PM

Any help is appreciated.

@maciejj-xilinx
Contributor

maciejj-xilinx commented Jan 7, 2021

I have not been able to reproduce this issue with the RHEL 8.3 4.18.0-240.el8.x86_64 kernel and fad4522.
I have not tried CentOS 8 4.18.0-240.1.1.el8_3.x86_64 specifically.

It looks like a dependency on the mtd module. Is the mtd module loaded? Would modprobe mtd help?

@shirshen12
Author

Hi @maciejj-xilinx, it did not help. Running modprobe mtd results in the error below:
[root@shir-bare01 ~]# modprobe mtd
modprobe: FATAL: Module mtd not found in directory /lib/modules/4.18.0-240.1.1.el8_3.x86_64

I am not certain why we would need a Memory Technology Device for the sfc driver. If that were the cause, it should have affected ixgbe-based systems as well.

Are you trying to reproduce the error on an i40e-based system?

@maciejj-xilinx
Contributor

Thanks for trying.
Unfortunately, it looks like a kernel compatibility issue in our driver suite. I have raised the issue internally.

@shirshen12
Author

Hi @maciejj-xilinx ,

I tried building the i40e driver with Onload and the same issue persists. I saw the commit @abower-xilinx made, which is why I made the attempt.

This is an FYI.

@shirshen12
Author

Hello @maciejj-xilinx ,

I have tried to compile the latest Onload on top of the i40e driver again, and it looks like we are still hitting the same issue described in this ticket.

Can you please let me know the status of the issue you raised internally?

@maciejj-xilinx
Contributor

I have raised the priority of the issue. We depend on another internal team to have this fixed.

@shirshen12
Author

Hello @maciejj-xilinx, are there any updates on this issue? Just checking; if there are, I will test it again.

@maciejj-xilinx
Contributor

maciejj-xilinx commented Feb 25, 2021

Hey @shirshen12

We can reproduce the issue with 4.18.0-240.15.1.el8_3.x86_64 on rhel8.3
It is not fixed yet unfortunately.

UPDATE: we canNOT reproduce the issue with 4.18.0-240.15.1.el8_3.x86_64 on rhel8.3
Would you be able to try an updated kernel version?

Regards,
Maciej

@shirshen12
Author

Thanks for sharing the exact kernel version. @maciejj-xilinx, can you also share the exact i40e driver version? I am updating the driver to obtain zerocopy support; are you also updating the driver?

@maciejj-xilinx
Contributor

My earlier survey of the standalone i40e driver code led me to believe that zerocopy gets disabled when it is built against Red Hat/CentOS 8 kernels. The distro driver code did not seem to suffer from this issue.

@shirshen12
Author

Closing this ticket since the i40e driver is now compatible with Onload.

cns-ci-onload-xilinx pushed a commit that referenced this issue Feb 1, 2023
Deferring oo_exit_hook() fixes a stuck C++ application:

    #0  0x00007fd2d7afb87b in ioctl () from /lib64/libc.so.6
    #1  0x00007fd2d80c0621 in oo_resource_op (cmd=3221510722, io=0x7ffd15be696c, fp=<optimized out>) at /home/iteterev/lab/onload_internal/src/include/onload/mmap.h:104
    #2  __oo_eplock_lock (timeout=<synthetic pointer>, maybe_wedged=0, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:35
    #3  __ef_eplock_lock_slow (ni=ni@entry=0x20c8480, timeout=timeout@entry=-1, maybe_wedged=maybe_wedged@entry=0) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/eplock_slow.c:72
    #4  0x00007fd2d80d7dbf in ef_eplock_lock (ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/onload/eplock.h:61
    #5  __ci_netif_lock_count (stat=0x7fd2d5c5b62c, ni=0x20c8480) at /home/iteterev/lab/onload_internal/src/include/ci/internal/ip_shared_ops.h:79
    #6  ci_tcp_setsockopt (ep=ep@entry=0x20c8460, fd=6, level=level@entry=1, optname=optname@entry=9, optval=optval@entry=0x7ffd15be6acc, optlen=optlen@entry=4) at /home/iteterev/lab/onload_internal/src/lib/transport/ip/tcp_sockopts.c:580
    #7  0x00007fd2d8010da7 in citp_tcp_setsockopt (fdinfo=0x20c8420, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/tcp_fd.c:1594
    #8  0x00007fd2d7fde088 in onload_setsockopt (fd=6, level=1, optname=9, optval=0x7ffd15be6acc, optlen=4) at /home/iteterev/lab/onload_internal/src/lib/transport/unix/sockcall_intercept.c:737
    #9  0x00007fd2d7dcb7dd in ?? ()
    #10 0x00007fd2d83392e0 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #11 0x000000000060102c in data_start ()
    #12 0x00007fd2d8339540 in ?? () from /home/iteterev/lab/onload_internal/build/gnu_x86_64/lib/transport/unix/libcitransport0.so
    #13 0x00000001d85426c0 in ?? ()
    #14 0x00007fd2d7fcbe08 in ?? ()
    #15 0x00007fd2d7a433c7 in __cxa_finalize () from /lib64/libc.so.6
    #16 0x00007fd2d7dcb757 in ?? ()
    #17 0x00007ffd15be6be0 in ?? ()
    #18 0x00007fd2d834f2a6 in _dl_fini () from /lib64/ld-linux-x86-64.so.2

Here, _fini() is a function that calls all library destructors. The
problem is that _fini() decides to run the C++ library destructor
*after* Onload and makes it operate on an invalid Onload state.

The patch leverages the fact that Glibc sets up _fini() after running
the last library constructor, so by manually installing the exit handler
(instead of providing a library destructor), Onload wins the race with
_fini().

There's still an issue if the user library sets a custom exit handler
with atexit() or on_exit() and makes intercepted system calls from
there.

Tested:

* RHEL 7.9/glibc 2.17
* RHEL 8.2/glibc 2.28
* RHEL 9.1/glibc 2.34

Thanks-to: Richard Hughes <[email protected]>
Thanks-to: Siân James <[email protected]>