fix: connmgr: concurrent map access in connmgr #1860
Conversation
This looks very tricky indeed.
One general thought: did we evaluate the option of adopting the Communicating Sequential Processes (CSP) approach to serialize the operations, i.e., synchronizing through channels rather than locking with multiple mutexes? Is CSP a feasible option before we add more complexity here?
// lock this to protect from concurrent modifications from connect/disconnect events
leftSegment := segments.get(left.id)
leftSegment.Lock()
I know this might be very tricky, but I'm not sure about the fine-grained locking. Does locking at the bucketMu level make more sense?
- We need to get the segment locks because something else (like a connection notification event) can be writing to the conns map.
- We may need to get 2 segment locks, so the bucketMu protects us from a deadlock when grabbing the 2 locks.
If we only locked the bucketMu here we wouldn't fix anything, since writers can still modify the conns map. If we make writers also take the bucketMu, then there's no point in the segmented locks anymore and we'd possibly regress performance here: libp2p/go-libp2p-connmgr#40.
We may no longer need the segmented locks, but without some benchmarks I would be hesitant to change this.
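For illustration, here is a minimal sketch of the locking order described above. The names (peerInfo, segment, bucketsMu, withTwoSegments) are hypothetical stand-ins, not the actual connmgr types, and single-segment writers are assumed to take only their own segment lock:

```go
package main

import "sync"

// peerInfo and segment are illustrative stand-ins for the connmgr's
// internal types, not the real implementation.
type peerInfo struct{ conns map[string]struct{} }

type segment struct {
	sync.Mutex
	peers map[string]*peerInfo
}

// bucketsMu serializes every code path that needs more than one segment
// lock at once, so two goroutines can never end up holding one segment
// each while waiting for the other's. Writers that touch a single
// segment only take that segment's lock and skip bucketsMu.
var bucketsMu sync.Mutex

// withTwoSegments runs fn while both segments are locked.
func withTwoSegments(a, b *segment, fn func()) {
	bucketsMu.Lock()
	defer bucketsMu.Unlock()

	a.Lock()
	defer a.Unlock()
	if b != a {
		b.Lock()
		defer b.Unlock()
	}
	fn()
}

func main() {
	left := &segment{peers: map[string]*peerInfo{}}
	right := &segment{peers: map[string]*peerInfo{}}
	// e.g. comparing two peers while sorting: both segments stay
	// consistent for the duration of fn.
	withTwoSegments(left, right, func() {
		_ = len(left.peers) + len(right.peers)
	})
}
```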
Thanks for the pointer to the segment lock history. That makes sense now.
I did a quick search of the code and did not find other cases that need two segment locks at the same time (other than the sort here). Hopefully that search is exhaustive and the condition holds. Either way, we may want to come back and review this for a better option such as CSP.
Context around this segmented lock is here: libp2p/go-libp2p-connmgr#40.

We could avoid locks if we had a single channel, passed operations to it as messages, and had a single goroutine consume the channel and process the messages. I think this would do better than the original single-lock solution. I'm not sure how it would compare with the segmented locks; it depends on usage patterns, and I'm not sure I understand all the usage patterns here (for example, if we are updating many different peers this would be faster).

I'm definitely open to refactoring this code, but I'm a bit afraid to do that as part of this quick fix without spending the time to fully understand all the ways this is used, and without realistic benchmarks to make sure we don't regress on performance (which is the whole point of the segmented locks). It's much safer to focus on fixing this specific concurrency bug than to refactor the whole thing.
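For reference, a minimal sketch of the channel-based alternative described above, with hypothetical names (op, ops): every mutation is sent as a message to a single owner goroutine, so the map itself never needs a lock.

```go
package main

import "fmt"

// op is a hypothetical operation applied to the connection state by the
// single owner goroutine; no other goroutine touches the map directly.
type op func(conns map[string]int)

func main() {
	ops := make(chan op)
	done := make(chan struct{})

	// Single consumer: owns the map and applies operations in order,
	// serializing connect/disconnect events and trims without locks.
	go func() {
		conns := make(map[string]int)
		for f := range ops {
			f(conns)
		}
		close(done)
	}()

	// Producers send messages instead of taking mutexes.
	ops <- func(conns map[string]int) { conns["peerA"]++ } // connected
	ops <- func(conns map[string]int) { conns["peerA"]-- } // disconnected
	ops <- func(conns map[string]int) { fmt.Println("peerA conns:", conns["peerA"]) }

	close(ops)
	<-done
}
```

The trade-off noted above still applies: this serializes every operation through one goroutine, whereas segmented locks let updates to peers in different segments proceed in parallel.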
It looks OK for an urgent fix. Let's come back and revisit this to see if there are better solutions; adding more locks is making the situation more complicated.
* fix: return filtered addrs (#1855)
* Bump version
* Fix concurrent map access in connmgr (#1860)
* Add some guard rails and docs (#1863)

Co-authored-by: Dennis Trautwein <[email protected]>
Fixes #1847 by taking the "segment" lock for the relevant peerInfo we're using. Adds a "bucketsMu" to prevent deadlocks when concurrent processes each grab multiple segment locks (e.g. goroutine 1 takes lock A, then B; goroutine 2 takes lock B, then A; they both get their first lock and are now deadlocked).
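To make that deadlock scenario concrete, a minimal sketch (the mutexes a and b are placeholders for two segment locks): without a common outer lock, the two goroutines below can each grab their first mutex and then block forever on the second, which is exactly the interleaving bucketsMu rules out by forcing multi-lock paths to serialize.

```go
package main

import "sync"

var a, b sync.Mutex

func main() {
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1: takes a, then b.
	go func() {
		defer wg.Done()
		a.Lock()
		b.Lock() // may block forever if goroutine 2 already holds b
		b.Unlock()
		a.Unlock()
	}()

	// Goroutine 2: takes b, then a (opposite order).
	go func() {
		defer wg.Done()
		b.Lock()
		a.Lock() // may block forever if goroutine 1 already holds a
		a.Unlock()
		b.Unlock()
	}()

	// With the unlucky interleaving this hangs, and the Go runtime
	// aborts with "all goroutines are asleep - deadlock!".
	wg.Wait()
}
```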