fix: connmgr: concurrent map access in connmgr #1860
Conversation
This looks very tricky indeed.
One general thought: did we evaluate the option of adopting the Communicating Sequential Processes (CSP) approach to serialize the operations, i.e., synchronizing through channels rather than locking with multiple mutexes? Is CSP a feasible option before we add more complexity here?
// lock this to protect from concurrent modifications from connect/disconnect events
leftSegment := segments.get(left.id)
leftSegment.Lock()
I know this might be very tricky, but I'm not sure about the fine-grained locking. Does locking at the bucketMu level make more sense?
- We need to get the segment locks because something else (like a connection notification event) can be writing to the conns map.
- We may need to get 2 segment locks, so the bucketMu protects us from a deadlock when grabbing the 2 locks.
If we only locked the bucketMu here we wouldn't fix anything, since writers can still modify the conns map. If we make writers also take the bucketMu, then there's no point in the segmented locks anymore and we'd possibly regress performance here: libp2p/go-libp2p-connmgr#40.
We may no longer need the segmented locks, but without some benchmarks I would be hesitant to change this.
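For illustration, here is a minimal sketch of the locking order described above. The names (peerInfo, segment, bucketsMu, withTwoSegments) are hypothetical stand-ins, not the actual connmgr types, and single-segment writers are assumed to take only their own segment lock:

```go
package main

import "sync"

// peerInfo and segment are illustrative stand-ins for the connmgr's
// internal types, not the real implementation.
type peerInfo struct{ conns map[string]struct{} }

type segment struct {
	sync.Mutex
	peers map[string]*peerInfo
}

// bucketsMu serializes every code path that needs more than one segment
// lock at once, so two goroutines can never end up holding one segment
// each while waiting for the other's. Writers that touch a single
// segment only take that segment's lock and skip bucketsMu.
var bucketsMu sync.Mutex

// withTwoSegments runs fn while both segments are locked.
func withTwoSegments(a, b *segment, fn func()) {
	bucketsMu.Lock()
	defer bucketsMu.Unlock()

	a.Lock()
	defer a.Unlock()
	if b != a {
		b.Lock()
		defer b.Unlock()
	}
	fn()
}

func main() {
	left := &segment{peers: map[string]*peerInfo{}}
	right := &segment{peers: map[string]*peerInfo{}}
	// e.g. comparing two peers while sorting: both segments stay
	// consistent for the duration of fn.
	withTwoSegments(left, right, func() {
		_ = len(left.peers) + len(right.peers)
	})
}
```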
Thanks for the pointer to the segment lock history. That makes sense now.
I did a quick search of the code and did not find other cases that need two segment locks at the same time (other than the sort here). Hopefully that search is exhaustive and the condition holds. Either way, we may want to come back and review this for a better option such as CSP.
Context around this segmented lock is here: libp2p/go-libp2p-connmgr#40.

We could avoid locks if we had a single channel, passed operations to it as messages, and had a single goroutine consume the channel and process the messages. I think this would do better than the original single-lock solution. I'm not sure how it would compare with the segmented locks; it depends on usage patterns, and I'm not sure I understand all the usage patterns here (for example, if we are updating many different peers this would be faster).

I'm definitely open to refactoring this code, but I'm a bit afraid to do that as part of this quick fix without spending the time to fully understand all the ways this is used, and without realistic benchmarks to make sure we don't regress on performance (which is the whole point of the segmented locks). It's much safer to focus on fixing this specific concurrency bug than to refactor the whole thing.
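For reference, a minimal sketch of the channel-based alternative described above, with hypothetical names (op, ops): every mutation is sent as a message to a single owner goroutine, so the map itself never needs a lock.

```go
package main

import "fmt"

// op is a hypothetical operation applied to the connection state by the
// single owner goroutine; no other goroutine touches the map directly.
type op func(conns map[string]int)

func main() {
	ops := make(chan op)
	done := make(chan struct{})

	// Single consumer: owns the map and applies operations in order,
	// serializing connect/disconnect events and trims without locks.
	go func() {
		conns := make(map[string]int)
		for f := range ops {
			f(conns)
		}
		close(done)
	}()

	// Producers send messages instead of taking mutexes.
	ops <- func(conns map[string]int) { conns["peerA"]++ } // connected
	ops <- func(conns map[string]int) { conns["peerA"]-- } // disconnected
	ops <- func(conns map[string]int) { fmt.Println("peerA conns:", conns["peerA"]) }

	close(ops)
	<-done
}
```

The trade-off noted above still applies: this serializes every operation through one goroutine, whereas segmented locks let updates to peers in different segments proceed in parallel.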
It looks OK for an urgent fix. Let's come back and revisit this to see if there are better solutions; adding more locks is making the situation more complicated.
* fix: return filtered addrs (#1855)
* Bump version
* Fix concurrent map access in connmgr (#1860)
* Add some guard rails and docs (#1863)

Co-authored-by: Dennis Trautwein <[email protected]>
Fixes #1847 by taking the "segment" lock for the relevant peerInfo we're using. Adds a "bucketsMu" to prevent deadlocks when concurrent processes each grab multiple segment locks (e.g. goroutine 1 takes lock A, then B; goroutine 2 takes lock B, then A; they both get their first lock and are now deadlocked).
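To make that deadlock scenario concrete, a minimal sketch (the mutexes a and b are placeholders for two segment locks): without a common outer lock, the two goroutines below can each grab their first mutex and then block forever on the second, which is exactly the interleaving bucketsMu rules out by forcing multi-lock paths to serialize.

```go
package main

import "sync"

var a, b sync.Mutex

func main() {
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1: takes a, then b.
	go func() {
		defer wg.Done()
		a.Lock()
		b.Lock() // may block forever if goroutine 2 already holds b
		b.Unlock()
		a.Unlock()
	}()

	// Goroutine 2: takes b, then a (opposite order).
	go func() {
		defer wg.Done()
		b.Lock()
		a.Lock() // may block forever if goroutine 1 already holds a
		a.Unlock()
		b.Unlock()
	}()

	// With the unlucky interleaving this hangs, and the Go runtime
	// aborts with "all goroutines are asleep - deadlock!".
	wg.Wait()
}
```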