shutdown notifications engine when closing a bitswap session #4658

Merged
merged 8 commits on Feb 13, 2018

Conversation

whyrusleeping
Member

License: MIT
Signed-off-by: Jeromy [email protected]

@ghost ghost assigned whyrusleeping Feb 4, 2018
@ghost ghost added the status/in-progress label Feb 4, 2018
@whyrusleeping
Member Author

very strange test failure...

@whyrusleeping
Member Author

Not strange, I just wasn't paying attention.

@whyrusleeping
Member Author

@Stebalien I think my second commit is the right fix for the wantlist thing, but the test doesn't actually reproduce the issue. I think that's because the wantlist is accounted for in a couple of different places, and bitswap.Wantlist is not the same as the wantmanager, where the sessions track their wants.

cs, _ := cid.Cast([]byte(c))
live = append(live, cs)
}
bs.CancelWants(live, s.id)
Member

What happens if another session wants a cid which is cancelled here?

Member Author

Wants are tracked per session (note that we pass in the session ID)
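
A minimal sketch of that per-session accounting (illustrative names and types only, not the actual wantmanager code): each wanted CID is reference-counted by session ID, so cancelling for one session only produces a real cancel once no other session still wants the block.

    package bitswap

    // Illustrative only: wants keyed by CID and reference-counted per session.
    type sessionWants struct {
        wants map[string]map[uint64]struct{} // cid key -> set of session IDs
    }

    func (sw *sessionWants) add(key string, ses uint64) {
        if sw.wants == nil {
            sw.wants = make(map[string]map[uint64]struct{})
        }
        if sw.wants[key] == nil {
            sw.wants[key] = make(map[uint64]struct{})
        }
        sw.wants[key][ses] = struct{}{}
    }

    // cancel removes one session's interest and reports whether the want is
    // now completely dead, i.e. whether a Cancel should actually be sent.
    func (sw *sessionWants) cancel(key string, ses uint64) bool {
        delete(sw.wants[key], ses)
        if len(sw.wants[key]) == 0 {
            delete(sw.wants, key)
            return true
        }
        return false
    }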

@Stebalien
Member

Are you not running into the issue I have here: #4659 (comment)? It looks like unsubscribing after the session is shut down causes the unsubscribe function to hang (looks like a bug in the pubsub library we're using).

Actually, we should probably just replace that library (https://github.com/briantigerchow/pubsub). It's way too complicated for what it does (and we don't actually need to spawn a goroutine like that).
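
A simplified model of that hang (not the real briantigerchow/pubsub code, just its shape as described here): Sub/Unsub/Shutdown all go through one internal command channel drained by a single goroutine, so once Shutdown stops that goroutine, a later Unsub blocks forever on the send.

    package main

    type fakePubSub struct {
        cmds chan string
    }

    func newFakePubSub() *fakePubSub {
        ps := &fakePubSub{cmds: make(chan string)}
        go func() {
            for cmd := range ps.cmds {
                if cmd == "shutdown" {
                    return // command loop exits; nothing reads ps.cmds anymore
                }
                // handle sub/unsub commands...
            }
        }()
        return ps
    }

    func (ps *fakePubSub) Unsub()    { ps.cmds <- "unsub" }     // blocks forever after Shutdown
    func (ps *fakePubSub) Shutdown() { ps.cmds <- "shutdown" }

    func main() {
        ps := newFakePubSub()
        ps.Shutdown()
        ps.Unsub() // deadlocks: the unbuffered send has no receiver left
    }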

@whyrusleeping
Member Author

@Stebalien I didn't notice any issues; all the tests passed when I ran them. And I'm wary of replacing a library that has caused us zero problems in three years.

@Stebalien
Member

Calling Unsub after Shutdown will hang forever trying to write to the cmds channel. This patch has the same bug as mine. Try downloading something you don't have from the gateway, cancel, and then run:

> curl 'http://localhost:5001/debug/pprof/goroutine?debug=2' | grep 'pubsub.*Unsub'

> And I'm wary of replacing a library that has caused us zero problems in three years.

My primary motivation was getting rid of the goroutine but we need goroutines anyways (one per subscription, actually) to make our contexts work... However, we still need to fix the unsubscribe issue (and the easiest way would be to at least modify this library).
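
A rough sketch of the "one goroutine per subscription" shape mentioned above (assumed names, not the actual notifications code): the goroutine forwards published values to the caller until the context is done, which is what lets a plain pubsub channel respect a context in the first place.

    package bitswap

    import "context"

    // Sketch only: forward items from a raw subscription channel until the
    // caller's context ends. The deferred unsub is the call that must not
    // hang if the engine has already been shut down.
    func forward(ctx context.Context, sub <-chan interface{}, unsub func()) <-chan interface{} {
        out := make(chan interface{})
        go func() {
            defer close(out)
            defer unsub()
            for {
                select {
                case v, ok := <-sub:
                    if !ok {
                        return
                    }
                    select {
                    case out <- v:
                    case <-ctx.Done():
                        return
                    }
                case <-ctx.Done():
                    return
                }
            }
        }()
        return out
    }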

@Stebalien
Member

@whyrusleeping off the top of your head, do you recall why we have separate notification engines per session instead of just using the main one?

@whyrusleeping
Member Author

@Stebalien I actually don't remember off the top of my head. I think it was because I assumed multiple subscriptions wouldn't play nicely together. But they should... I'd say try it; it seems like it could work.

whyrusleeping and others added 4 commits February 9, 2018 12:18
…g it down

Otherwise, we'll deadlock and leak a goroutine. This fix is kind of crappy but
modifying the pubsub library would have been worse (and, really, it *is*
reasonable to say "don't use the pubsub instance after shutting it down").

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member

@whyrusleeping so, I fixed the unsubscribe-after-shutdown deadlock by just not unsubscribing after shutting down. I'm not happy with the fix, but cleanly fixing the pubsub library didn't seem possible either, due to its API (e.g., when a user calls AddSub(ch, topics...) on a pubsub instance after shutting it down, should we close ch? That could panic if we had already registered and closed it).

I figured if we couldn't have a nice fix, we might as well have a small fix internally and avoid maintaining a fork.

// Interrupt in-progress subscriptions.
close(ps.cancel)
// Wait for them to finish.
ps.wg.Wait()
Member Author

This will wait for all active wants to be cancelled, which happens if the caller closes the session, right? I would like to see a test around this.

Member

No, it'll just wait for all unsubscribes to finish (which should happen immediately after we close the ps.cancel channel).

However, I have added a test to ensure that shutting down the PubSub while a subscription is active works (and doesn't block as it did before). Is that what you're looking for?
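
A sketch of the kind of test described above (the import paths and exact notifications API are assumed from the go-ipfs layout at the time; this is not necessarily the test that was added): subscribe to a block that never arrives, shut the engine down, and fail if Shutdown doesn't return.

    package notifications_test

    import (
        "context"
        "testing"
        "time"

        blocks "github.com/ipfs/go-block-format"
        notifications "github.com/ipfs/go-ipfs/exchange/bitswap/notifications"
    )

    func TestShutdownWithActiveSubscription(t *testing.T) {
        n := notifications.New()
        blk := blocks.NewBlock([]byte("never published"))

        // Subscription stays active: nothing ever publishes this block.
        _ = n.Subscribe(context.Background(), blk.Cid())

        done := make(chan struct{})
        go func() {
            n.Shutdown()
            close(done)
        }()

        select {
        case <-done:
            // Shutdown interrupted the active subscription and returned.
        case <-time.After(5 * time.Second):
            t.Fatal("Shutdown hung while a subscription was still active")
        }
    }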

@Stebalien
Member

Real test failures:

--- FAIL: TestMultipleSessions (20.01s)
	session_test.go:284: bad juju
--- FAIL: TestWantlistClearsOnCancel (0.00s)
	session_test.go:318: expected empty wantlist

(will deadlock)

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member

So that I don't forget where I left off...

Canceling wants for one session sometimes prevents the other session from receiving its requested block. So far, I've narrowed this down to the mq.out.Cancel(e.Cid) line in the addMessage function of msgQueue in wantmanager.go. Commenting that line out fixes it.
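
Roughly the code path being pointed at (a paraphrase with assumed types, not the actual wantmanager.go source): addMessage folds a batch of entries into the queue's pending outgoing message, and a Cancel entry drops the CID from that pending message, which is where a cancel issued for one session can race with a want just re-added for another.

    // Paraphrased sketch, not the real implementation.
    func (mq *msgQueue) addMessage(entries []*bsmsg.Entry) {
        mq.outlk.Lock()
        defer mq.outlk.Unlock()
        for _, e := range entries {
            if e.Cancel {
                // The line under suspicion: this removes the CID from the
                // pending outgoing message even though another session's
                // fresh want for the same CID may be in flight.
                mq.out.Cancel(e.Cid)
            } else {
                mq.out.AddEntry(e.Cid, e.Priority)
            }
        }
    }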

@whyrusleeping
Member Author

@Stebalien status here?

@Stebalien
Member

@whyrusleeping still debugging.

…ages

Before, we weren't using a pointer so we were throwing away the update.

License: MIT
Signed-off-by: Steven Allen <[email protected]>
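
The class of bug described in that commit message, in miniature (a generic example, not the bitswap code): mutating a struct that was copied by value, for instance one pulled out of a map, updates the copy and silently throws the change away, while holding a pointer makes the update stick.

    package main

    import "fmt"

    type entry struct{ refcnt int }

    func main() {
        byValue := map[string]entry{"x": {refcnt: 1}}
        e := byValue["x"]
        e.refcnt++                         // increments a copy; the map still holds 1
        fmt.Println(byValue["x"].refcnt)   // 1: the update was thrown away

        byPointer := map[string]*entry{"x": {refcnt: 1}}
        byPointer["x"].refcnt++            // updates the shared entry
        fmt.Println(byPointer["x"].refcnt) // 2: the update sticks
    }
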
@Stebalien
Member

Well... that's an old bug. Unfortunately, that fix still doesn't fix everything. go test -v -count 100 -run TestMultipleSessions hangs after ~20 runs (better than the 1-5 before...).

@whyrusleeping
Member Author

Very weird behaviour here: putting a sleep after cancel1() makes the issue stop happening, but moving the cancel1() to after the next GetBlocks call also makes it stop happening.

@whyrusleeping
Member Author

Putting the cancel1() after the NewSession call still allows me to reproduce the issue.

@whyrusleeping
Member Author

Okay, looking at wantlist messages being sent from peer A to peer B. In success cases we have one of:

  • A -> B: Want X

  or

  • A -> B: Want X
  • A -> B: Want X

  or

  • A -> B: Want X
  • A -> B: Cancel X
  • A -> B: Want X

But in the failure case, I'm always seeing:

  • A -> B: Want X
  • A -> B: Want X
  • A -> B: Cancel X

Which explains why it hangs: we cancelled our request, and the other peer is respecting that. Now the question is, what makes us cancel?

@whyrusleeping
Member Author

So with the above pattern, addMessage gets called three times:

  • Want X (session 1)
  • Cancel X (session 1)
  • Want X (session 2)

This should result in three messages getting sent: a want, a cancel, and a want. My only thought now is that the messages are getting reordered in transit somehow.

@whyrusleeping
Member Author

Oh look: https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/testnet/virtual.go#L80

The messages are delivered by throwing them off into a goroutine...
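
A minimal illustration of why that matters (not the virtual testnet code itself): delivering each message in its own goroutine gives no ordering guarantee, so a Want sent after a Cancel can arrive before it, while a single per-peer queue drained by one goroutine preserves the sender's order.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func main() {
        msgs := []string{"Want X", "Cancel X", "Want X"}

        // Goroutine-per-message delivery: arrival order is arbitrary, so the
        // receiver can end up seeing Want, Want, Cancel and stop responding.
        unordered := make(chan string, len(msgs))
        for _, m := range msgs {
            go func(m string) {
                time.Sleep(time.Duration(rand.Intn(3)) * time.Millisecond) // network jitter
                unordered <- m
            }(m)
        }
        for range msgs {
            fmt.Println("unordered:", <-unordered)
        }

        // One queue per receiver, drained by a single goroutine, keeps the
        // Want/Cancel/Want sequence intact.
        ordered := make(chan string, len(msgs))
        done := make(chan struct{})
        go func() {
            for m := range ordered {
                fmt.Println("ordered:", m)
            }
            close(done)
        }()
        for _, m := range msgs {
            ordered <- m
        }
        close(ordered)
        <-done
    }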

@whyrusleeping
Member Author

Adding a random delay to every message send causes the bug to reproduce ~30% of the time.
