MeshIPC causes valgrind to hang #582
Comments
I wonder if there's some subtle race condition in the events where, if the timing happens just right, even a non-valgrind run would hang.
Is it just me or does it look like it's being triggered by crossbeam? Which would be... interesting...
Added some instrumentation into glommio
Under valgrind:
So it seems like the second thread is failing to initialize properly for some reason. @bryandmc I don't see crossbeam anywhere in the stack trace (via …).
Interesting. On a recent run where I added even more instrumentation:
The first log line is printed and then it hangs. The stack trace shows:
If I attach gdb and hit continue I get much farther. Not sure why stdio print is hanging...
Finally thread 2 hangs at:
and the hack of continuing via gdb doesn't work here. Not sure what's going on.
When I run valgrind with drd, it starts with a complaint about latch and then a bunch of other data race warnings follow: https://gist.github.com/vlovich/b553ce3a2450907b62b135e3bdf8038c. Helgrind has similar complaints: https://gist.github.com/vlovich/4162aefdaaa8dc4a477431d1f44fab5c
I suspect these are false positives though. @glommer any thoughts here?
I took a quick look at this, and I don't have a great explanation yet. It seems so far that the hang is not on the mesh, but on the shared channel (which is used internally to build the mesh).
I don't know exactly how valgrind does threading, but based on what it does, it makes sense that it multiplexes everything onto a single thread of execution internally. A quick search suggests that this may be the case: https://stackoverflow.com/questions/8663148/valgrind-stalls-in-multithreaded-socket-program I'll take a look at those articles and see if they yield something valuable. But tl;dr: the hang seems related to the second thread never starting, which in turn happens because thread 1 blocks waiting for it, and it wouldn't really happen outside valgrind.
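For intuition, here is a minimal, self-contained sketch of the pattern described above (plain std, not glommio code): thread 1 waits on a flag that only thread 2 sets. Under valgrind's serialized thread scheduling this kind of wait can starve the second thread for a long time, which is one reason --fair-sched=yes is commonly suggested; it is an illustration, not a guaranteed reproduction of this bug.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let ready_clone = Arc::clone(&ready);

    // "Thread 2": pretend to do some initialization, then signal readiness.
    let t = thread::spawn(move || {
        ready_clone.store(true, Ordering::Release);
    });

    // "Thread 1": busy-wait for thread 2. Under valgrind's serialized
    // scheduling a loop like this can hog the single execution slot and
    // delay the other thread; outside valgrind it completes immediately.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }

    t.join().unwrap();
    println!("both threads ran");
}
```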
The shared channel (channels/shared_channel.rs) has a class, … When the other side connects, …
Btw I tried …
Filed upstream too just in case: https://bugs.kde.org/show_bug.cgi?id=463859.
I wonder if there's a more direct repro case we can construct...
I made some progress on this. I narrowed it down to the creation of the companion blocking thread: in …, this blocks and never returns. This thread is used to execute expensive operations outside the main thread. I still don't know why, but you can easily verify with printlns whether this is the case for you as well - if yes, we made some progress!
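A self-contained sketch of the println-style check suggested above (plain std threads, not glommio's actual companion-thread code): print around the spawn and from inside the new thread, then see which marker never appears under valgrind.

```rust
use std::thread;

fn main() {
    eprintln!("about to spawn companion thread");
    let handle = thread::Builder::new()
        .name("companion".into())
        .spawn(|| {
            eprintln!("companion thread started");
            // ... the expensive/blocking work would live here ...
            eprintln!("companion thread exiting");
        })
        .expect("failed to spawn companion thread");
    eprintln!("spawn returned; waiting for the companion thread");
    handle.join().unwrap();
    eprintln!("companion thread joined");
}
```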
Hmm... not sure I'm seeing the same thing. What I did was move that …
Running this without valgrind:
Running this with valgrind:
without fair-sched:
What I do wonder, though, is that in the non-valgrind case the "Started thread N" message prints after "started blocking thread", whereas in the valgrind case there's typically some interleaving. Of course, I haven't investigated whether that's at all important; just something I thought I'd note. Now what's interesting is that if I leave a …
but fair-sched looks different:
Notice the above has no interleaving but still deadlocks...
Interesting. I'll keep poking. We're narrowing this down.
Actually, I think this doesn't necessarily have anything to do with MeshIPC / LocalExecutorPoolBuilder. Here's an even simpler scenario that also seems to hang: https://gist.github.com/vlovich/6876632b48df4289eb3e05716f9f431a - it only uses tokio sync and a single LocalExecutorBuilder in a background thread. If you remove glommio from being instantiated, this code runs fine under valgrind. If you replace glommio with the tokio runtime, it still runs under valgrind.
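For readers who don't want to open the gist, a rough reconstruction of that kind of repro might look like the sketch below. This is an approximation rather than the gist itself, and it assumes a glommio 0.8-style API (LocalExecutorBuilder::new taking a Placement) plus tokio's sync feature for the oneshot channel.

```rust
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    let (tx, rx) = tokio::sync::oneshot::channel::<u32>();

    // A single glommio executor running on a background thread; it just
    // completes the oneshot and exits.
    let handle = LocalExecutorBuilder::new(Placement::Unbound)
        .spawn(move || async move {
            tx.send(42).unwrap();
        })
        .expect("failed to spawn glommio executor");

    // The main thread blocks on the tokio oneshot until the executor-side
    // future sends the value.
    let value = rx.blocking_recv().expect("sender dropped");
    println!("got {value}");

    let _ = handle.join();
}
```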
It's something to do with the Context / Waker, but I thought that was a Rust thing? After instrumenting tokio oneshot internals, I see …
Well, the latest repro disappears if I use …
The implementation of the waker is the domain of the executor, so there could be a leak there. That's the code under src/task, and there's a lot of unsafe there...
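To make that remark concrete, here is a minimal, definitions-only illustration (not glommio's src/task code) of the contract involved: a future stores the Waker it receives through Context, and whoever completes the operation must call wake() on it later. If the executor's waker plumbing loses that wake, the task is never polled again, which would look exactly like the hangs described in this thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct WaitForSignal(Arc<Mutex<Shared>>);

impl Future for WaitForSignal {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.0.lock().unwrap();
        if s.done {
            Poll::Ready(())
        } else {
            // Store the executor-provided waker; correctness depends entirely
            // on the executor honoring wake() later.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

fn signal(shared: &Arc<Mutex<Shared>>) {
    let mut s = shared.lock().unwrap();
    s.done = true;
    if let Some(w) = s.waker.take() {
        w.wake(); // a lost or broken wake here == permanent hang
    }
}
```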
I also noticed problems with flume channels not processing for the blocking operations (e.g. copy_file_range) under valgrind. If I add an …
I haven't fully traced down what's happening, but valgrind seems to randomly hang, and only when using a mesh.
Normal without valgrind:
With valgrind:
The specific stack trace tends to vary:
main.rs: https://gist.github.com/vlovich/fddbd15c52a3b86648688e2fc3d66e30
If you change use_mesh to false the code works fine under valgrind, which makes me think it might be some kind of issue with the mesh. The interesting thing is that thread 1 never starts, but I don't know enough to think of why that would be impacted by the existence of a mesh... Similarly, even if I increase the number of threads, still only one prints as started. Additionally, the channel size doesn't have an impact either. In my specific code where I'm seeing this, ctrl-c doesn't even work; I have to pkill -9 -f valgrind.
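For context on the general shape of the failing setup, here is a rough, hedged sketch of a two-peer mesh similar to the linked main.rs; the gist is the authoritative repro. It assumes glommio ~0.8's channel_mesh API (MeshBuilder::full, join, send_to, recv_from) and that the two peers end up with ids 0 and 1 in spawn order.

```rust
use glommio::channels::channel_mesh::MeshBuilder;
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    // Two peers, channel size 32 (both values arbitrary for the sketch).
    let mesh = MeshBuilder::full(2, 32);

    let handles: Vec<_> = (0..2usize)
        .map(|i| {
            let mesh = mesh.clone();
            LocalExecutorBuilder::new(Placement::Unbound)
                .spawn(move || async move {
                    println!("Started thread {i}");
                    // join() waits for every peer to connect; under valgrind
                    // this is roughly where the hang is being observed.
                    let (sender, mut receiver) = mesh.join().await.unwrap();
                    let other = 1 - i; // assumes peer ids match spawn order
                    sender.send_to(other, i as u32).await.unwrap();
                    let got = receiver.recv_from(other).await.unwrap();
                    println!("peer {i} got {got:?}");
                })
                .unwrap()
        })
        .collect();

    for h in handles {
        let _ = h.join();
    }
}
```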