Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bitswap: clean up ledgers when disconnecting #3437

Merged
merged 2 commits into from
May 20, 2017

Conversation

whyrusleeping
Copy link
Member

@whyrusleeping whyrusleeping commented Nov 29, 2016

License: MIT
Signed-off-by: Jeromy [email protected]

@whyrusleeping whyrusleeping added status/in-progress In progress need/review Needs a review labels Nov 29, 2016
@whyrusleeping whyrusleeping assigned ghost and Kubuxu Nov 29, 2016
// TODO: release ledger
e.lock.Lock()
defer e.lock.Unlock()
l, ok := e.ledgerMap[p]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not locking l.lk here, and again we have situation with two locks. It shouts to me deadlock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the ledger lock. Re locking concerns, these ones are well scoped. the engine lock is either always held first, or not held while taking the ledger lock. And the engine lock is never taken while holding a ledger lock.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 7, 2016

Other concern: what if PeerConnected gets the instance but can't acquire lock for ledger as it is locked by PeerDisconnected. Then PeerConnected will increase value on ledger that is not longer in ledger map.

@whyrusleeping
Copy link
Member Author

@Kubuxu hrm... for that to happen, PeerConnected would have to return from findOrCreate, and then PeerDisconnected would have to take the engine lock, pull the ledger out of the map, and take the ledger lock before the PeerConnected process is able to. This IS possible.

One option is to not use findOrCreate and instead take the engine lock ourselves throughout the entire call to PeerConnected, essentially reimplementing findOrCreate in that function.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 7, 2016

I know that it is something that might happen very rarely or never but those edge cases add up and create hard to track down bugs and instability. If chance of this bug occuring is 0.00001% then chance that it will occur across 100000 runs is more than 60% and if we don't stop possibly introducing bugs like that go-ipfs will be always unstable and unreliable.

@whyrusleeping
Copy link
Member Author

@Kubuxu Right, So i think the solution is to make PeerConnected hold the engine lock through the entire method.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 9, 2016

So now it is thread safe, but does function findOrCreate makes sense if we have introduced ref counting?

@Kubuxu
Copy link
Member

Kubuxu commented Dec 9, 2016

Also I am still not a fan of those two locks as some not really connected change can introduce deadlock (locking for engine while holding ledger some ledger) and we might not catch it when we introduce it. We should really look into Actor oriented communication and how bad/good it will be.

@Kubuxu Kubuxu force-pushed the feat/bitswap-cleanup-ledger branch from efb2e39 to 22ba9bb Compare December 9, 2016 17:23
@Kubuxu
Copy link
Member

Kubuxu commented Dec 9, 2016

I rebased it to run coverage on it.

@Kubuxu
Copy link
Member

Kubuxu commented Dec 9, 2016

It isn't tested anywhere, it might be worth to do that.

@whyrusleeping
Copy link
Member Author

whyrusleeping commented Dec 9, 2016 via email

@Kubuxu
Copy link
Member

Kubuxu commented Dec 9, 2016

I am positive that we can make it clean and not so complicated with enough layers of sugarcoating.

I am just almost sure that we will introduce deadlock around this place sooner or later and it won't be diagnosed for a long time as reproduction of this will be almost impossible.

Also for someone to report deadlock like this one he would have to 1. encounter this deadlock 2. don't try resetting the node 3. capture goroutine dump 4. have us find those blocked routines on this lock. I miss Java's features in this regard.

This change LGTM if I get some tests. In case of not directly sharness tested features I would like the codecov/patch build check to be green.

@Kubuxu Kubuxu unassigned ghost and Kubuxu Dec 14, 2016
@Kubuxu Kubuxu mentioned this pull request Dec 14, 2016
@Kubuxu Kubuxu added status/ready Ready to be worked status/in-progress In progress and removed status/in-progress In progress status/ready Ready to be worked labels Dec 19, 2016
@Kubuxu
Copy link
Member

Kubuxu commented Dec 19, 2016

Ok, it shows as if there was no coverage due to lack of cross package cover testing.

@ghost
Copy link

ghost commented Dec 20, 2016

Can I add the RFM label here? Let's continue the locking discussion in #3506.

License: MIT
Signed-off-by: Jeromy <[email protected]>
@whyrusleeping whyrusleeping merged commit ec43fe4 into master May 20, 2017
@whyrusleeping whyrusleeping deleted the feat/bitswap-cleanup-ledger branch May 20, 2017 02:23
@whyrusleeping whyrusleeping removed the status/in-progress In progress label May 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/review Needs a review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants