
fix: connmgr: concurrent map access in connmgr #1860

Merged: 1 commit into master on Nov 9, 2022

Conversation

MarcoPolo (Collaborator) commented:

Fixes #1847 by taking the "segment" lock for the relevant peerInfo we're using. Adds a "bucketsMu" to prevent deadlocks when concurrent goroutines each grab multiple segment locks (e.g. goroutine 1 takes lock A, then B; goroutine 2 takes lock B, then A; each gets its first lock and they are now deadlocked).
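For illustration, here is a minimal sketch of the locking pattern described above. The type names (peerID, peerInfo, segment, segments) and the helper withBothLocked are placeholders approximating the connmgr internals, not the exact code from this PR:

package main

import "sync"

type peerID string // placeholder for the real peer ID type

type peerInfo struct {
	conns map[string]struct{} // written by connect/disconnect notification handlers
}

type segment struct {
	sync.Mutex
	peers map[peerID]*peerInfo
}

type segments struct {
	bucketsMu sync.Mutex // serializes any acquisition of more than one segment lock
	buckets   [256]segment
}

func (ss *segments) get(p peerID) *segment {
	return &ss.buckets[p[len(p)-1]] // bucket by one byte of the peer ID (approximation)
}

// withBothLocked runs f while holding the segment locks of both peers.
// Taking bucketsMu first means two goroutines can never grab the same pair of
// segment locks in opposite orders, which is the deadlock described above.
func (ss *segments) withBothLocked(a, b peerID, f func()) {
	ss.bucketsMu.Lock()
	defer ss.bucketsMu.Unlock()

	sa, sb := ss.get(a), ss.get(b)
	sa.Lock()
	defer sa.Unlock()
	if sb != sa { // both peers may hash to the same segment
		sb.Lock()
		defer sb.Unlock()
	}
	f()
}

func main() {
	var ss segments
	ss.withBothLocked("peer-A", "peer-B", func() {
		// e.g. compare the two peerInfos while nothing else can mutate them
	})
}

Because bucketsMu is held only while acquiring the locks (and for the duration of the two-lock critical section), single-segment operations elsewhere are unaffected.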

@julian88110 (Contributor) left a comment:

This looks very tricky indeed.
One general thought: did we evaluate the option of adopting a Communicating Sequential Processes (CSP) approach to serialize the operations, i.e. synchronizing through channels rather than locking with multiple mutexes? Is CSP a feasible option before we add more complexity here?


// lock this to protect from concurrent modifications from connect/disconnect events
leftSegment := segments.get(left.id)
leftSegment.Lock()
@julian88110 (Contributor) commented on this diff on Nov 8, 2022:

I know this might be very tricky, but I am not sure about the fine-grained locking here. Does locking at the bucketMu level make more sense?

@MarcoPolo (Collaborator, Author) replied:

  1. We need to get the segment locks because something else (like a connection notification event) can be writing to the conns map.
  2. We may need to get 2 segment locks, so the bucketMu protects us from a deadlock when grabbing the 2 locks.

If we only locked the bucketMu here, we wouldn't fix anything, since writers could still modify the conns map. If we made writers also take the bucketMu, there would be no point in the segmented locks anymore, and we would possibly regress performance here: libp2p/go-libp2p-connmgr#40.

We may no longer need the segmented locks, but without some benchmarks I would be hesitant to change this.
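To make points (1) and (2) concrete, here is an illustrative continuation of the placeholder sketch from the PR description above (again, not the actual connmgr code): a connection notification writes to the conns map under just its own segment lock, which is why the trim/sort path must also hold the segment lock, while bucketsMu is only needed on the path that acquires two segment locks at once.

// Illustrative only, reusing the placeholder types from the earlier sketch.
func (ss *segments) connected(p peerID, connID string) {
	s := ss.get(p)
	s.Lock() // single-peer writers take only their own segment lock, no bucketsMu
	defer s.Unlock()

	if s.peers == nil {
		s.peers = make(map[peerID]*peerInfo)
	}
	pi, ok := s.peers[p]
	if !ok {
		pi = &peerInfo{conns: make(map[string]struct{})}
		s.peers[p] = pi
	}
	pi.conns[connID] = struct{}{}
}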

@julian88110 (Contributor) replied:

Thanks for the pointer to the segment-lock history. That makes sense now.

I did a quick search of the code and did not find any other case that needs two segment locks at the same time (other than the sort here). Hopefully that search was exhaustive and the condition holds. Either way, we may want to come back and review this for a better option such as CSP.

@MarcoPolo (Collaborator, Author) commented:

> This looks very tricky indeed.
> One general thought: did we evaluate the option of adopting a Communicating Sequential Processes (CSP) approach to serialize the operations, i.e. synchronizing through channels rather than locking with multiple mutexes? Is CSP a feasible option before we add more complexity here?

Context around this segmented lock is here: libp2p/go-libp2p-connmgr#40

We could avoid locks if we had a single channel, passed operations to it as messages, and had a single goroutine consume the channel and process the messages. I think this would do better than the original single-lock solution. I'm not sure how it would compare with the segmented locks (it depends on usage patterns, and I'm not sure I understand all the usage patterns here; for example, if we are updating many different peers this would be faster).
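For reference, here is a rough sketch of that channel-based alternative, reusing the placeholder types from the earlier sketch; this is purely illustrative and not part of this PR. Every operation is a message on one channel, and a single goroutine owns the map and applies the operations in order, so no mutexes are needed.

// Rough sketch of the CSP-style alternative being discussed (not part of this PR).
type op func(peers map[peerID]*peerInfo)

type cspConnMgr struct {
	ops chan op
}

func newCSPConnMgr() *cspConnMgr {
	m := &cspConnMgr{ops: make(chan op, 64)}
	go m.loop()
	return m
}

func (m *cspConnMgr) loop() {
	peers := make(map[peerID]*peerInfo) // owned exclusively by this goroutine
	for f := range m.ops {
		f(peers)
	}
}

// Connected is one example operation; connect/disconnect events and the trim
// logic would all be funneled through the same channel.
func (m *cspConnMgr) Connected(p peerID, connID string) {
	m.ops <- func(peers map[peerID]*peerInfo) {
		pi, ok := peers[p]
		if !ok {
			pi = &peerInfo{conns: make(map[string]struct{})}
			peers[p] = pi
		}
		pi.conns[connID] = struct{}{}
	}
}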

I'm definitely open to refactoring this code, but I'm a bit afraid to do that as part of this quick fix, without spending the time to fully understand all the ways this code is used and without realistic benchmarks to make sure we don't regress performance here (which is the whole point of the segmented locks). It's much safer to focus on fixing this specific concurrency bug than to refactor the whole thing.

MarcoPolo mentioned this pull request on Nov 8, 2022.
@julian88110 (Contributor) left a comment:

It looks OK as an urgent fix. Let's come back and revisit this to see if there are better solutions; adding more locks is making the situation more complicated.

MarcoPolo merged commit da3adbc into master on Nov 9, 2022.
MarcoPolo added a commit that referenced this pull request on Nov 9, 2022:
* fix: return filtered addrs (#1855)

* Bump version

* Fix concurrent map access in connmgr (#1860)

* Add some guard rails and docs (#1863)

Co-authored-by: Dennis Trautwein <[email protected]>
Merging this pull request closed the linked issue: connmgr: Concurrent Map Misuse (#1847).