
improve query performance by limiting query width to KValue peers #291

Closed
Wants to merge 21 commits.

Conversation

@Stebalien (Member) commented Mar 8, 2019

This is actually still incorrect. We should be limiting our query to AlphaValue peers and then expanding to KValue peers once we run out of peers to query. However, this is still much better and we can do that in a followup commit (maybe in this PR, maybe in a new PR, we'll see).

This now correctly implements kademlia (everywhere).

TODO:

  • Multipath: punted
  • Avoid unnecessary RPCs. Provide/PutValue currently calls GetClosestPeers, which finds the 20 closest peers responding to queries. Unfortunately, validating this last part requires actually querying those peers. That means we query them twice: once to see if they're live, once to put/provide.
  • Reconcile with the "quorum" concept. Answer: The "quorum" now allows us to stop the query early.
  • Play with slop and dial parallelism. I can get some pretty fast (relatively, ~10s) queries this way. However, this "slop" is definitely not proper kademlia. punted
  • Deal with the fact that FindPeer now runs to the end of the query instead of returning early. We'll probably have to revert this unless we can get the routed host to use FindPeerAsync sometime soon.
  • Log Kademlia distance to the target with every peer that we return – allows us to verify the speed at which we're getting closer to the target as we progress.

fixes #290
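For illustration, here is a minimal sketch of the bounded-width lookup described above, with hypothetical helper names (this is not the code in this PR): keep a candidate set ordered by distance to the target, query at most AlphaValue candidates at a time, and stop once the KValue closest live candidates have all been queried.

```go
package kadsketch

import (
	"context"
	"sort"
)

const alphaValue, kValue = 3, 20 // illustrative stand-ins for the DHT's AlphaValue/KValue

type peerID string

// lookup sketches a bounded-width Kademlia query: at most alphaValue candidates
// are queried per round (sequentially here, for brevity), and the query stops
// once the kValue closest live candidates have all been queried.
func lookup(ctx context.Context, target string, seeds []peerID,
	distance func(peerID, string) uint64, // XOR distance to the target
	queryPeer func(context.Context, peerID) ([]peerID, error)) []peerID {

	queried, failed, seen := map[peerID]bool{}, map[peerID]bool{}, map[peerID]bool{}
	var candidates []peerID
	add := func(ps ...peerID) { // dedupe new candidates
		for _, p := range ps {
			if !seen[p] {
				seen[p] = true
				candidates = append(candidates, p)
			}
		}
	}
	add(seeds...)
	byDistance := func() {
		sort.Slice(candidates, func(i, j int) bool {
			return distance(candidates[i], target) < distance(candidates[j], target)
		})
	}

	for {
		byDistance()
		// Pick up to alphaValue of the closest candidates we haven't queried yet.
		var batch []peerID
		for _, p := range candidates {
			if len(batch) == alphaValue {
				break
			}
			if !queried[p] {
				batch = append(batch, p)
			}
		}
		if len(batch) == 0 {
			break // ran out of peers to query
		}
		for _, p := range batch {
			queried[p] = true
			if closer, err := queryPeer(ctx, p); err != nil {
				failed[p] = true // unresponsive peers don't count toward the result
			} else {
				add(closer...)
			}
		}
		// Drop failed peers so they neither appear in the result nor
		// satisfy the termination check.
		live := candidates[:0]
		for _, p := range candidates {
			if !failed[p] {
				live = append(live, p)
			}
		}
		candidates = live
		// Terminate once the kValue closest live candidates have all been queried.
		byDistance()
		done, n := true, kValue
		if len(candidates) < n {
			n = len(candidates)
		}
		for _, p := range candidates[:n] {
			if !queried[p] {
				done = false
				break
			}
		}
		if done {
			break
		}
	}

	byDistance()
	if len(candidates) > kValue {
		candidates = candidates[:kValue]
	}
	return candidates
}
```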

@ghost assigned Stebalien Mar 8, 2019
@ghost added the status/in-progress label Mar 8, 2019
@Stebalien force-pushed the fix/closest-peers branch 2 times, most recently from 72782a9 to b9edb2c on March 12, 2019 08:53
@anacrolix (Contributor)

I think if some of these can be dealt with in stand-alone PRs it will make it much more digestible.

While you're poking around in this area: it has always irked me that queries appear to be initialized with only the alpha closest peers, yet can expand to many more once closer peers come back in replies. If all of those initial alpha peers fail, the entire query fails. There's no reason not to start with a lot more peers, while only actively querying alpha to begin with.

@raulk (Member) commented Mar 13, 2019

@anacrolix rationale here: #192 (comment). But I agree we need to recover from a poisoned start -- in fact, I'd say that's urgent.

@Stebalien (Member Author)

Honestly, we could probably just seed with KValue peers. That should just work.

I can break this up into separate PRs for commenting, but it should be merged all at once. I've broken it into reasonably logical commits that can be reviewed separately but, well, GitHub reviews don't play well with that workflow.

@raulk (Member) commented Mar 13, 2019

@Stebalien I'd say keep it all in one PR, and group commits into logical changesets, and post a list of diff ranges like this to facilitate review: https://github.com/libp2p/go-libp2p-kad-dht/pull/291/files/2006602434583ea06634813330437f31df9300a1..1fcc9db35d65c32914c1b5bed4c8825437b697fe.

This is actually still incorrect. We _should_ be limiting our query to
AlphaValue peers and then expanding to KValue peers once we run out of peers to
query. However, this is still much better and we can do that in a followup
commit.

Considerations: We may not want to merge this until we get the multipath lookup
patch. It turns out, our current DHT effectively explores _all_ paths.

fixes #290
Returning early is _not_ valid, ever, for any reason.

Note: `query.Run` now returns the final peers. Any other values should be
exported via channels, etc (it's a _lot_ simpler).
Unfortunately, while returning early isn't valid, FindPeer would block for
_ages_ if we didn't. We should switch to a progressive FindPeerAsync but this'll
have to do for now.
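The commit message above mentions switching to a progressive FindPeerAsync. No such API exists in this PR; purely as an illustration of the shape it could take, the query would stream progressively closer candidates to the caller, which stops reading (or cancels) as soon as it sees the peer it wants instead of waiting for the query to terminate.

```go
package findpeersketch

import "context"

// Hypothetical stand-ins; this sketches the *shape* of a progressive
// FindPeerAsync, not an API that exists in this PR or in go-libp2p-kad-dht.
type peerID string

type addrInfo struct {
	ID    peerID
	Addrs []string
}

// findPeerAsync streams candidates to the caller as the query discovers them,
// instead of blocking until the query terminates. The caller cancels ctx (or
// stops reading) as soon as it has what it needs, e.g. the routed host could
// return on the first candidate matching the target ID.
func findPeerAsync(ctx context.Context, target peerID,
	runQuery func(ctx context.Context, target peerID, emit func(addrInfo))) <-chan addrInfo {
	out := make(chan addrInfo)
	go func() {
		defer close(out)
		runQuery(ctx, target, func(ai addrInfo) {
			select {
			case out <- ai: // forward each candidate to the caller
			case <-ctx.Done(): // caller gave up or found its peer
			}
		})
	}()
	return out
}
```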
	// setup concurrency rate limiting
-	for i := 0; i < r.query.concurrency; i++ {
+	for len(r.rateLimit) < cap(r.rateLimit) {
		r.rateLimit <- struct{}{}
Contributor

what is this doing exactly? Are we just trying to fill up the channel?

Member Author

Yes (unnecessary but I was already messing with this code).
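For readers following along: a buffered channel doubles as a counting semaphore, so pre-filling it to capacity hands out the initial permits, and workers take and return permits as they run. A standalone sketch of the pattern (not the PR's code):

```go
package main

import "fmt"

func main() {
	const concurrency = 3

	// A buffered channel used as a counting semaphore: each value in the
	// channel is a permit to run one query.
	rateLimit := make(chan struct{}, concurrency)

	// Fill the channel to capacity so the first `concurrency` workers can
	// proceed immediately; the rest block until a permit is returned.
	for len(rateLimit) < cap(rateLimit) {
		rateLimit <- struct{}{}
	}

	peers := []string{"peerA", "peerB", "peerC", "peerD", "peerE"}
	done := make(chan struct{})

	for _, p := range peers {
		go func(p string) {
			<-rateLimit // take a permit (blocks if none are free)
			fmt.Println("querying", p)
			rateLimit <- struct{}{} // return the permit
			done <- struct{}{}
		}(p)
	}

	for range peers {
		<-done
	}
}
```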

routing.go Outdated
@@ -323,7 +319,7 @@ func (dht *IpfsDHT) getValues(ctx context.Context, key string, nvals int) (<-cha
 	switch err {
 	case routing.ErrNotFound:
 		// in this case, they responded with nothing,
-		// still send a notification so listeners can know the
+		// still send a routingication so listeners can know the
Contributor

lol

Member Author

(oops)

This was causing us to do _slightly_ more work than necessary.
I.e., stop testing something that shouldn't work.
for i := 0; i < nDHTs; i++ {
	dhts[i].BootstrapOnce(ctx, DefaultBootstrapConfig)
}

Member Author

So, I spent a while trying to avoid having to fix this test without complicating the query logic too much. Then I realized that was just stupid.

@Stebalien marked this pull request as ready for review June 6, 2019 07:29
@Stebalien (Member Author)

So, I wanted this to be the "ultimate patch that fixed everything". Yeah...

This now correctly implements Kademlia, that's it.

@Stebalien (Member Author)

TODO: Consider increasing Alpha to 6 (or more). The recurse step is taking ages, and I believe trying more paths may help us find a faster path.

@dirkmc (Contributor) commented Jun 6, 2019

@jacobheun did some interesting experimentation with DHT configuration parameters for js-libp2p-kad-dht, including changing alpha to 6, but ended up reverting back to 3: libp2p/js-libp2p-kad-dht#107

@jacobheun could you summarize why an alpha of 3 worked better in the end?

@jacobheun

A higher alpha will probably work fine for go. The major issue I hit was performance of the js-ipfs node with an alpha of 6. I ended up using 4 for js-ipfs in Node.js and 3 in Browser, ipfs/js-ipfs#1994.

When I tested an alpha of 3 vs 6, 6 yielded a more normalized, lower range for the query times.

Originally we hit problems with the higher alphas due to dialing timeouts of the peers. If the queries to peers aren't being fairly aggressively limited, a higher alpha can result in a path taking a long time to finish because it basically trickles responses. This is also why we ended up with a "sloppy" approach to ending queries, which significantly improved query times: for a given path, if we finish one of the concurrent queries and there are no closer peers queued, we complete that path, even if there are queries in progress. This approach was pretty consistent at finding the top closest peers; see libp2p/js-libp2p-kad-dht#107 (comment).
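A rough sketch of that "sloppy" path-termination rule, assuming each path tracks a queue of not-yet-queried candidates and the peers found so far (illustrative names, not the js-libp2p-kad-dht code):

```go
package sloppysketch

type peerID string

// pathDone sketches the "sloppy" termination rule described above: when one
// of a path's concurrent queries finishes, the path completes if no queued
// candidate is closer to the target than the closest peer already found,
// even if other queries are still in flight.
func pathDone(queued, found []peerID, target string,
	distance func(peerID, string) uint64) bool {
	if len(found) == 0 {
		return false
	}
	// Distance of the closest peer found so far.
	best := distance(found[0], target)
	for _, p := range found[1:] {
		if d := distance(p, target); d < best {
			best = d
		}
	}
	for _, p := range queued {
		if distance(p, target) < best {
			// A closer candidate is still waiting to be queried.
			return false
		}
	}
	return true
}
```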

@jacobheun

Ah, it looks like this doesn't include disjoint paths, so the stuff I linked isn't going to help a lot here. Bumping the alpha to 6 here will only give us 6 concurrent RPC calls, IIUC. With disjoint paths in JS, a concurrency of 4 gets us (kValue / 2) * 4 = 10 * 4 = 40 concurrent calls.

If this is going to stay single-path, the alpha will need to increase pretty significantly, I think: maybe 20 or more until disjoint paths are added.

@raulk (Member) commented Jun 7, 2019

I'm going to run this with my enhanced logging and I'll report how it performs.

@raulk (Member) commented Jun 7, 2019

@Stebalien the results seem iffy in my case. Two tests managed to finish "get closer peers" and start providing in 1 minute or less:

1 minute:

❯ # start looking for closest peers
❯ grep QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc log2.log | head -1
16:04:10.453 DEBUG        dht: [outbound rpc] writing request outbound message; to_peer=QmSoLV4Bbm51jM9C4gDYZQ9Cy3U6aXMJDAbzgu2fzaDs64, type=FIND_NODE, cid_key=QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc, peer_key=QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc, raw_key=1220b506c0b62860ab8246c15e18514a13e032b794cc9566260dce7c295dd7a1c2a1, closer=[], providers=[] dht_net.go:

❯ # start providing
❯ grep QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc log2.log | grep fire-and-forget | head -1
16:05:10.537 DEBUG        dht: [outbound rpc] writing fire-and-forget outbound message; to_peer=QmXPqBVTPhUBpWWoVuGrAqAdCH8Av2Ash1rL9jLP1NwrWr, type=ADD_PROVIDER, cid_key=QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc, peer_key=QmaXJUWEjWK3ef7ySAX5PkugyniFA7xCmwEdoBYbSbYNEc, raw_key=1220b506c0b62860ab8246c15e18514a13e032b794cc9566260dce7c295dd7a1c2a1, closer=[], providers=[{QmdGmB5iczLtmxBrAk8NtJsPDP5E4kiF9w4rAKqsq6u7TE: [/ip6/::1/tcp/4001 /ip4/127.0.0.1/tcp/4001 /ip4/192.168.0.132/tcp/4001 /ip4/79.154.225.195/tcp/64905]}] dht_net.go:335

10 seconds:

❯ # start looking for closest peers
❯ grep QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv log2.log | head -1
16:06:29.849 DEBUG        dht: [outbound rpc] writing request outbound message; to_peer=QmNQT4Da4xZZbJoqVjjXcyKnDuzWeDodCXpJJmWvZAavRC, type=FIND_NODE, cid_key=QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv, peer_key=QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv, raw_key=1220d1574dab405ea3be062eea8fa16fb33155283e9d79853bafad4569818a8e4973, closer=[], providers=[] dht_net.go:393

❯ # start providing
❯ grep QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv log2.log | grep fire-and-forget | head -1
16:06:40.706 DEBUG        dht: [outbound rpc] writing fire-and-forget outbound message; to_peer=QmT4maLTXfvn7K1gMwdgV2wiKhn7BSLRG8ecYUYNAdkPGF, type=ADD_PROVIDER, cid_key=QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv, peer_key=QmcRq8VMRHRxNbQBhAmPbVeum25LQHUW2ZwnjYbL6dp7cv, raw_key=1220d1574dab405ea3be062eea8fa16fb33155283e9d79853bafad4569818a8e4973, closer=[], providers=[{QmdGmB5iczLtmxBrAk8NtJsPDP5E4kiF9w4rAKqsq6u7TE: [/ip4/127.0.0.1/tcp/4001 /ip4/192.168.0.132/tcp/4001 /ip6/::1/tcp/4001 /ip4/79.154.225.195/tcp/64905]}] dht_net.go:335

10 minutes (and counting):

❯ # start looking for closest peers
❯ grep QmdpY9Ee6hWsL98pYbU6CgHunkjEMPK9SEyD7aF9gE3oZc log2.log | head -1
16:08:31.320 DEBUG        dht: [outbound rpc] writing request outbound message; to_peer=QmYLRiRq1FiSdeL2AAX3rHgfmGKQRmgFgJhnskvEQyWmao, type=FIND_NODE, cid_key=QmdpY9Ee6hWsL98pYbU6CgHunkjEMPK9SEyD7aF9gE3oZc, peer_key=QmdpY9Ee6hWsL98pYbU6CgHunkjEMPK9SEyD7aF9gE3oZc, raw_key=1220e604243a8479c245b641df6f76de92be237913ddb9fb525f82b08fbccdb04f5b, closer=[], providers=[] dht_net.go:393

❯ # start providing
❯ grep QmdpY9Ee6hWsL98pYbU6CgHunkjEMPK9SEyD7aF9gE3oZc log2.log | grep fire-and-forget | head -1

❯ date
Fri Jun  7 16:19:45 WEST 2019

@raulk (Member) left a comment

I like the separation between the recurse function and the finish function. It wouldn't hurt us to have a PING message (even if a reflexive FIND_NODE moonlights as that right now).

I was hoping we could take this opportunity to simplify various aspects of the implementation.

For example, the separation between dhtQuery and dhtQueryRunner seems redundant.

Our peer management is all over the place. I was thinking KPeerSet could encapsulate the state of peer traversal. We'd instantiate with a target count (e.g. 16 KValue), and as we traverse the network, we would notify it of the state of each peer via methods: Failed(peer.ID), OK(peer.ID), Querying(peer.ID). We could also condense the todocounter functionality into it.

Whenever a worker was ready, it'd ask for the next peer via Next() peer.ID. It would also expose a channel via Done() chan struct{} to signal when the target was met, and we'd fetch the resulting peers via Peers() and execute the finish function on them.

Dunno whether I should take a stab at putting this approach together? WDYT, @Stebalien?
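A rough sketch of the KPeerSet interface proposed above; the method names come from the comment, while the signatures and the local ID type are illustrative assumptions:

```go
package kpeersetsketch

// ID stands in for libp2p's peer.ID in this sketch.
type ID string

// KPeerSet would encapsulate the state of peer traversal for a query with a
// given target count of closest peers (e.g. KValue).
type KPeerSet interface {
	// State notifications as the traversal progresses.
	Querying(p ID)
	OK(p ID)
	Failed(p ID)

	// Next returns the next peer a ready worker should query.
	Next() ID

	// Done signals that the target count of closest live peers has been met.
	Done() <-chan struct{}

	// Peers returns the resulting closest peers, against which the finish
	// function would run.
	Peers() []ID
}
```

A worker loop would then look roughly like: take Next(), call Querying, issue the RPC, and report OK or Failed based on the result, until Done fires.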

@raulk (Member) commented Jun 10, 2019

@Stebalien – I added an extra TODO point in the description for the Kademlia distance logging we discussed today.

@bonedaddy

I tested out this PR in combination with ipfs/go-ipfs-provider#8, and it seems to be giving substantial performance improvements.

My DHT provide announcements seem to be happening faster; additionally, it seems like my go-ipfs nodes are able to pick up on the provide announcements from my custom nodes faster than without this proposed change.

@Stebalien (Member Author)

> Dunno whether I should take a stab at putting this approach together? WDYT, @Stebalien?

I'd like to leave further refactors to a future PR unless this one makes things worse. I agree the query system has gotten a bit confusing but I wanted to save a larger refactor for a PR that only includes that refactor.

raulk added a commit to ipfs/kubo that referenced this pull request Jun 14, 2019
License: MIT
Signed-off-by: Raúl Kripalani <[email protected]>
@Stebalien (Member Author)

This has been replaced by #436

@Stebalien closed this Mar 10, 2020
Successfully merging this pull request may close these issues.

Terminate Queries Correctly