Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithread application sharing struct nl_cache objects using NL_AUTO_PROVIDE crashes in libnl3-3.4.0 #273

Open
amritpaldhillon opened this issue Mar 15, 2021 · 1 comment

Comments

@amritpaldhillon
Copy link

The test app is a simple application that launches multiple threads (2 by default). Each thread creates its own nl_cache_mngr object with nl_cache_mngr_alloc() API. Then each thread creates 2 caches, one for "route/link" and another for "route/addr". Then each thread polls the Netlink Cache Manager, using nl_cache_mngr_poll() API. The callback functions previously registered in the nl_cache_mngr_add() API are triggered when the corresponding events happen.

The callback functions don't do much except print the fact callback has been called every 100 multiples, along with the Thread ID for the callback.

The system information:

$ cat /etc/debian_version
10.3

$ uname -a
Linux <hostname-removed> 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

$ dpkg -l libnl* | grep "^ii"
ii  libnl-3-200:amd64       3.4.0-1      amd64        library for dealing with netlink sockets
ii  libnl-3-dev:amd64       3.4.0-1      amd64        development library and headers for libnl-3
ii  libnl-genl-3-200:amd64  3.4.0-1      amd64        library for dealing with netlink sockets - generic netlink
ii  libnl-route-3-200:amd64 3.4.0-1      amd64        library for dealing with netlink sockets - route interface
ii  libnl-route-3-dev:amd64 3.4.0-1      amd64        development library and headers for libnl-route-3

$ gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Command to compile the source code into test application

$ gcc -Wall -I/usr/include/libnl3 -O0 -ggdb  -lnl-3 -lnl-route-3 -lpthread  netlinkThreads.c   -o netlinkThreads

In a terminal (normal user) run the test app (below command launches 10 threads). I have found running 10 threads cause the crash to happen quickly as compared to when running 2 threads.

$ ./netlinkThreads 10

Now in a separate shell (root) create 1024 VLAN interfaces and constantly bring them up and down.

Command to create VLAN on eth0 (run as root)

# for f in {1..1024} ; do ip link add link eth0 name eth0.$f type vlan id $f; done

Command to bring VLAN interfaces up and down (also run as root)

# while true; do for f in {1..1024} ; do ip link set dev eth0.$f down; done; for f in {1..1024} ; do ip link set dev eth0.$f up; done; sleep 1;  done

Command to delete VLAN interfaces (run as root), once the test is complete.

# for f in {1..1024} ; do ip link set dev eth0.$f down; done
# for f in {1..1024} ; do ip link del dev eth0.$f; done

Sample output of the test.

$ ./netlinkThreads 10
main: Launching 10 threads
main: created 1 thread
main: created 2 thread
main: created 3 thread
main: created 4 thread
main: created 5 thread
main: created 6 thread
main: created 7 thread
main: created 8 thread
main: created 9 thread
main: created 10 thread
netlinkQuery: Thread #3 is ready
netlinkQuery: Thread #10 is ready
netlinkQuery: Thread #6 is ready
netlinkQuery: Thread #5 is ready
netlinkQuery: Thread #7 is ready
netlinkQuery: Thread #8 is ready
netlinkQuery: Thread #1 is ready
netlinkQuery: Thread #2 is ready
netlinkQuery: Thread #9 is ready
netlinkQuery: Thread #4 is ready
addrCallback: Thread #3 100
addrCallback: Thread #6 100
addrCallback: Thread #10 100
addrCallback: Thread #7 100
addrCallback: Thread #5 100
addrCallback: Thread #2 100
addrCallback: Thread #8 100
addrCallback: Thread #1 100
addrCallback: Thread #9 100
addrCallback: Thread #4 100
addrCallback: Thread #5 200
addrCallback: Thread #9 200
addrCallback: Thread #2 200
addrCallback: Thread #6 200
addrCallback: Thread #8 200
addrCallback: Thread #3 200
addrCallback: Thread #10 200
addrCallback: Thread #7 200
addrCallback: Thread #1 200
addrCallback: Thread #4 200
addrCallback: Thread #10 300
addrCallback: Thread #3 300
addrCallback: Thread #6 300
addrCallback: Thread #5 300
addrCallback: Thread #7 300
addrCallback: Thread #2 300
addrCallback: Thread #4 300
addrCallback: Thread #1 300
addrCallback: Thread #9 300
addrCallback: Thread #8 300
addrCallback: Thread #10 400
addrCallback: Thread #3 400
addrCallback: Thread #4 400
addrCallback: Thread #2 400
addrCallback: Thread #6 400
addrCallback: Thread #7 400
addrCallback: Thread #8 400
addrCallback: Thread #1 400
addrCallback: Thread #9 400
addrCallback: Thread #5 400
BUG at file position /build/libnl3-3.4.0/./lib/object.c:225:nl_object_put
BUG at file position /build/libnl3-3.4.0/./lib/object.c:225:nl_object_put
netlinkThreads: /build/libnl3-3.4.0/./lib/object.c:225: nl_object_put: Assertion '0' failed.
BUG at file position /build/libnl3-3.4.0/./lib/object.c:225:nl_object_put
netlinkThreads: /build/libnl3-3.4.0/./lib/object.c:225: nl_object_put: Assertion '0' failed.
Aborted

The crash goes away if NL_AUTO_PROVIDE is not used, i.e. replaced by 0. See commented code in the APP. Any thread can use the NL_AUTO_PROVIDE flag and the bug will be produced. All threads need to NOT use the NL_AUTO_PROVIDE in order to suppress the bug.

netlinkThreads.zip

amritpaldhillon added a commit to amritpaldhillon/libnl that referenced this issue Feb 11, 2023
The nl_object_get() and nl_object_put() APIs were not updating the
ce_refcnt in thread safe manner. This change added the ce_lock member
to the struct nl_object and makes sure ce_refcnt is accessed in thread
safe manner.

Fixes thom311#273
@thom311
Copy link
Owner

thom311 commented Jul 25, 2023

As you found out, the objects and data structures of libnl are not thread-safe.

What is supposed to work, is that you (for example) create two caches on two threads, and each thread handles the objects that belong to it. That implies, that no global data is used without locking. If you are careful, you can also pass an object from one thread to another, but make sure that at not time both threads access the objects. But you cannot access objects from multiple threads (unless you add your own locking on top of it).

Maybe (just maybe), the basic reference counting of objects should be thread-safe too. Then at least, you could easier pass one object to another thread (although, still two threads could not do anything with those objects except ref/get and unref/put). Still, it's not clear to me that that is desirable. What are you going to do with that? You pass a cache/object to another thread, and then what? I think applications should either do their own locking on top of libnl object, or avoid accessing data from multiple threads.

Also, if it were implemented, then I think this should be done using atomic counters (compare+exchange), and not a pthread lock. <pthreads> is a very popular library, so it probably would be OK to add that as an unconditional dependency, however, you don't need that for atomic operations (we do depend on gcc/clang already).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants