MeshIPC causes valgrind to hang #582
Comments
I wonder if there's some subtle race condition in the events where, if the timing happens just right, even a non-valgrind run would hang.
Is it just me or does it look like it's being triggered by crossbeam? Which would be... interesting...
Added some instrumentation into glommio
Under valgrind:
So it seems like the second thread is failing to initialize properly for some reason. @bryandmc I don't see crossbeam anywhere in the stack trace (via …).
Interesting. On a recent run where I added even more instrumentation:
The first log line is printed and then it hangs. The stack trace shows:
If I attach gdb and hit continue I get much farther. Not sure why stdio print is hanging...
Finally thread 2 hangs at:
and the hack of continuing via gdb doesn't work here. Not sure what's going on.
When I run valgrind with drd, it starts with a complaint about latch and then a bunch of other data race warnings follow: https://gist.github.com/vlovich/b553ce3a2450907b62b135e3bdf8038c. Helgrind has similar complaints: https://gist.github.com/vlovich/4162aefdaaa8dc4a477431d1f44fab5c
I suspect these are false positives though. @glommer any thoughts here?
I took a quick look at this, and I don't have a great explanation yet. It seems so far that the hang is not on the mesh, but on the shared channel (which is used internally to build the mesh).
I don't know exactly how valgrind does threading, but based on what it does, it makes sense that it multiplexes everything onto a single thread of execution internally. A quick search suggests that this may be the case: https://stackoverflow.com/questions/8663148/valgrind-stalls-in-multithreaded-socket-program I'll take a look at those articles and see if they yield something valuable. But tl;dr: the hang seems related to the second thread never starting, which in turn happens because thread 1 blocks waiting for it, and it wouldn't really happen outside valgrind.
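For intuition, here is a minimal, self-contained sketch of the pattern described above (plain std, not glommio code): thread 1 waits on a flag that only thread 2 sets. Under valgrind's serialized thread scheduling this kind of wait can starve the second thread for a long time, which is one reason --fair-sched=yes is commonly suggested; it is an illustration, not a guaranteed reproduction of this bug.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let ready_clone = Arc::clone(&ready);

    // "Thread 2": pretend to do some initialization, then signal readiness.
    let t = thread::spawn(move || {
        ready_clone.store(true, Ordering::Release);
    });

    // "Thread 1": busy-wait for thread 2. Under valgrind's serialized
    // scheduling a loop like this can hog the single execution slot and
    // delay the other thread; outside valgrind it completes immediately.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }

    t.join().unwrap();
    println!("both threads ran");
}
```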
The shared channel (channels/shared_channel.rs) has a class, … When the other side connects, …
Btw I tried …
Filed upstream too just in case: https://bugs.kde.org/show_bug.cgi?id=463859.
I wonder if there's a more direct repro case we can construct...
I made some progress on this. I narrowed it down to the creation of the companion blocking thread: in …, this blocks and never returns. This thread is used to execute expensive operations outside the main thread. I still don't know why, but you can easily verify with printlns whether this is the case for you as well - if yes, we made some progress!
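A self-contained sketch of the println-style check suggested above (plain std threads, not glommio's actual companion-thread code): print around the spawn and from inside the new thread, then see which marker never appears under valgrind.

```rust
use std::thread;

fn main() {
    eprintln!("about to spawn companion thread");
    let handle = thread::Builder::new()
        .name("companion".into())
        .spawn(|| {
            eprintln!("companion thread started");
            // ... the expensive/blocking work would live here ...
            eprintln!("companion thread exiting");
        })
        .expect("failed to spawn companion thread");
    eprintln!("spawn returned; waiting for the companion thread");
    handle.join().unwrap();
    eprintln!("companion thread joined");
}
```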
Hmm... not sure I'm seeing the same thing. What I did was move that …
Running this without valgrind:
Running this with valgrind:
without fair-sched:
What I do wonder, though, is that in the non-valgrind case the "Started thread N" message prints after "started blocking thread", whereas in the valgrind case there's typically some interleaving. Of course, I haven't investigated whether that's at all important; just something I thought I'd note. Now what's interesting is that if I leave a …
but fair-sched looks different:
Notice the above has no interleaving but still deadlocks...
Interesting. I'll keep poking. We're narrowing this down.
Actually, I think this doesn't necessarily have anything to do with MeshIPC / LocalExecutorPoolBuilder. Here's an even simpler scenario that also seems to hang: https://gist.github.com/vlovich/6876632b48df4289eb3e05716f9f431a - it only uses tokio sync and a single LocalExecutorBuilder in a background thread. If you remove glommio from being instantiated, this code runs fine under valgrind. If you replace glommio with the tokio runtime, it still runs under valgrind.
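For readers who don't want to open the gist, a rough reconstruction of that kind of repro might look like the sketch below. This is an approximation rather than the gist itself, and it assumes a glommio 0.8-style API (LocalExecutorBuilder::new taking a Placement) plus tokio's sync feature for the oneshot channel.

```rust
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    let (tx, rx) = tokio::sync::oneshot::channel::<u32>();

    // A single glommio executor running on a background thread; it just
    // completes the oneshot and exits.
    let handle = LocalExecutorBuilder::new(Placement::Unbound)
        .spawn(move || async move {
            tx.send(42).unwrap();
        })
        .expect("failed to spawn glommio executor");

    // The main thread blocks on the tokio oneshot until the executor-side
    // future sends the value.
    let value = rx.blocking_recv().expect("sender dropped");
    println!("got {value}");

    let _ = handle.join();
}
```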
It's something to do with the Context / Waker, but I thought that was a Rust thing? After instrumenting tokio oneshot internals, I see …
Well, the latest repro disappears if I use …
The implementation of the waker is the domain of the executor, so there could be a leak there. That's the code under src/task, and there's a lot of unsafe there...
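To make that remark concrete, here is a minimal, definitions-only illustration (not glommio's src/task code) of the contract involved: a future stores the Waker it receives through Context, and whoever completes the operation must call wake() on it later. If the executor's waker plumbing loses that wake, the task is never polled again, which would look exactly like the hangs described in this thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct WaitForSignal(Arc<Mutex<Shared>>);

impl Future for WaitForSignal {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.0.lock().unwrap();
        if s.done {
            Poll::Ready(())
        } else {
            // Store the executor-provided waker; correctness depends entirely
            // on the executor honoring wake() later.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

fn signal(shared: &Arc<Mutex<Shared>>) {
    let mut s = shared.lock().unwrap();
    s.done = true;
    if let Some(w) = s.waker.take() {
        w.wake(); // a lost or broken wake here == permanent hang
    }
}
```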
I also noticed problems with flume channels not processing for the blocking operations (e.g. copy_file_range) under valgrind. If I add an …
I haven't fully traced down what's happening, but valgrind seems to randomly hang, and only when using a mesh.
Normal without valgrind:
With valgrind:
The specific stack trace tends to vary:
main.rs: https://gist.github.com/vlovich/fddbd15c52a3b86648688e2fc3d66e30
If you change use_mesh to false the code works fine under valgrind, which makes me think it might be some kind of issue with the mesh. The interesting thing is that thread 1 never starts, but I don't know enough to think of why that would be impacted by the existence of a mesh... Similarly, even if I increase the number of threads, still only one prints as started. Additionally, the channel size doesn't have an impact either. In my specific code where I'm seeing this, ctrl-c doesn't even work; I have to pkill -9 -f valgrind.
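For context on the general shape of the failing setup, here is a rough, hedged sketch of a two-peer mesh similar to the linked main.rs; the gist is the authoritative repro. It assumes glommio ~0.8's channel_mesh API (MeshBuilder::full, join, send_to, recv_from) and that the two peers end up with ids 0 and 1 in spawn order.

```rust
use glommio::channels::channel_mesh::MeshBuilder;
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    // Two peers, channel size 32 (both values arbitrary for the sketch).
    let mesh = MeshBuilder::full(2, 32);

    let handles: Vec<_> = (0..2usize)
        .map(|i| {
            let mesh = mesh.clone();
            LocalExecutorBuilder::new(Placement::Unbound)
                .spawn(move || async move {
                    println!("Started thread {i}");
                    // join() waits for every peer to connect; under valgrind
                    // this is roughly where the hang is being observed.
                    let (sender, mut receiver) = mesh.join().await.unwrap();
                    let other = 1 - i; // assumes peer ids match spawn order
                    sender.send_to(other, i as u32).await.unwrap();
                    let got = receiver.recv_from(other).await.unwrap();
                    println!("peer {i} got {got:?}");
                })
                .unwrap()
        })
        .collect();

    for h in handles {
        let _ = h.join();
    }
}
```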