Dissociating routing table membership from connection state #283

Closed
anacrolix opened this issue Mar 4, 2019 · 24 comments
anacrolix (Contributor) commented Mar 4, 2019

Currently, routing table entries are evicted when a persistent connection to them is lost. Because connectivity is unpredictable, this causes several problems. Without constant activity the routing table becomes sparse over time, as peers disconnect and leave buckets partially full; this directly weakens a critical property of a functioning Kademlia-style network. Furthermore, widespread disconnects can empty the routing table entirely, and the implementation currently doesn't recover from a bad routing table state.

Edit (@jacobheun):

Design notes

  • Three states for entries in the routing table:
    • <0 | alive>
    • <1 | missing>
    • <2 | dead> (this is a virtual state, peers in this state are evicted from the routing table).
  • For each bucket, we keep a queue of recently seen peers that we can replace dead peers with (see the sketch after these notes).
  • On refresh:
    • Test all peers in the <1 | missing> state and transition them to either <0 | alive> or <2 | dead>.
      • If we can't reach a peer, evict it and start trying peers from the replacement queue.
    • If – after evictions – the bucket is full, move on (Don't refresh buckets that are full #434).
    • Otherwise, query the bucket, provided we haven't recently queried it (see the current logic).
  • We refresh buckets more frequently than we do now: every 10 (2?) minutes by default.
    • Note: we should test this under network churn.
    • Note: we could also reconnect every two minutes but only refresh every 10.
  • End-to-end behaviour (from the viewpoint of a client):
    • Clients will get pruned by servers frequently.
    • Despite the connection being killed, the server will remain in the routing table, marked with state <1 | missing>.
    • Frequent refreshes in closer buckets will test liveness, update the connection state, and will push the latest peer routing record via IDENTIFY.
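
A minimal Go sketch of the shapes these design notes imply. Everything here is illustrative: `rtsketch`, `peerID`, `bucket`, etc. are stand-ins, not the actual go-libp2p-kbucket API.

```go
package rtsketch

import "time"

// peerID stands in for libp2p's peer.ID in this sketch.
type peerID string

// entryState models the three states from the design notes.
type entryState int

const (
	stateAlive   entryState = iota // <0 | alive>
	stateMissing                   // <1 | missing>: disconnected, not yet proven dead
	stateDead                      // <2 | dead>: virtual; transitioning here evicts the entry
)

// entry is a routing table slot that survives disconnections.
type entry struct {
	id       peerID
	state    entryState
	lastSeen time.Time
}

// bucket holds up to k entries plus a bounded queue of recently seen
// candidates used to backfill slots freed by dead peers.
type bucket struct {
	k            int
	entries      []entry
	replacements []peerID // eldest candidate at the head
}

// markMissing flags a disconnected peer instead of evicting it outright.
func (b *bucket) markMissing(p peerID) {
	for i := range b.entries {
		if b.entries[i].id == p {
			b.entries[i].state = stateMissing
			return
		}
	}
}

// evict removes a peer proven dead and hands back the eldest replacement
// candidate (Kademlia prefers long-lived peers); the caller should test
// the candidate before promoting it into the bucket.
func (b *bucket) evict(p peerID) (candidate peerID, ok bool) {
	for i := range b.entries {
		if b.entries[i].id == p {
			b.entries = append(b.entries[:i], b.entries[i+1:]...)
			break
		}
	}
	if len(b.replacements) == 0 {
		return "", false
	}
	candidate = b.replacements[0]
	b.replacements = b.replacements[1:]
	return candidate, true
}
```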

Testing mechanics

  • We should test this in testground by:
    • Forming a well-connected, fully bootstrapped network.
    • Thrashing the network by performing repeated queries and connection manager trims.
    • Churning the network (removing/adding nodes) at a fixed rate.
  • We should measure:
    • Stability of buckets (i.e., how stable is the first bucket, second, third, etc.).
    • Number of peers in the routing tables.
    • Age of peers in the routing tables.
    • Query time.
  • Network overhead from frequent bootstrapping: if we switch from a 10m to a 2m bootstrap frequency, how much extra dialing will that cause?

Success Criteria

  • Routing tables, for both clients and servers, remain healthy as described above (i.e., low churn, long-lived peers, full tables, etc.).
anacrolix (Contributor, Author) commented:

There are some complications in removing the requirement that entries correspond to active connections. It's desirable to persist routing tables, but the entries are worthless without the associated peer addresses: it's not cost-effective to resume a saved routing table if the peers must be located by querying the network, and doing so may require bootstrapping all over again, defeating the purpose for the most part. Unfortunately, as far as I can tell, address expiries are controlled by the peer store, and the time for which a routing table entry may lie dormant is indeterminable (or at least should be treated that way for maximum reliability). The routing table should be able to maintain an up-to-date set of addresses for a peer, but I don't think that's trivial with the existing abstractions (there would need to be racy periodic updates and propagation). Possibly, by implementing routing table pinging/grooming (per a standard implementation), any entry still in the routing table when it is persisted could be relied upon to have unexpired addresses in the peer store.

anacrolix (Contributor, Author) commented:

https://github.com/libp2p/go-libp2p-kad-dht/tree/decouple-routing-table tracks my efforts to implement this.

jhiesey commented Mar 4, 2019

This is a very complex issue, but I'd say it's evidence that we need to rethink the interfaces between the connection manager, peer store, and routing table.

In general, peer store and routing table entries (for nodes that aren't client only) should have the same lifetime, right? If we have a routing table entry, there is no reason to throw out its addresses, and if we have addresses for a node (that isn't client only) we might as well have it in the routing table.

This really doesn't fit the current structure, but what about combining the peer store and routing table into one data structure to which we add any node we communicate with? The eviction policy would then be to remove nodes that have an expired TTL (as the peerstore does now) AND are in a sufficiently full bucket. There's really no need to allow routing to only at most K specific nodes, as in traditional Kademlia, as long as we aren't short on memory.

anacrolix (Contributor, Author) commented:

Thanks @jhiesey. Getting the peerstore addresses and routing table to play nicely together is definitely a tricky part here. The other tricky thing is where to handle prioritization of peers in each bucket; the current solution is inadequate.

raulk (Member) commented Jun 17, 2019

@bigs assigning this to you in case you wanna tackle it as discussed elsewhere! ;-)

bigs (Contributor) commented Jun 20, 2019

@raulk i've had some time to dig into this today. i'm thinking of tackling in stages. first, i'll take care of the routing table persistence. after that, it gets foggier. it's clear we need to persist dht peers and their addresses, but obviously we don't want to let that grow unbounded. perhaps in a first iteration, peers are kept in the peerstore with the "permanent" ttl until they're evicted from the routing table. it seems like we have an actual use for ping, now, as a culling mechanism.

yusefnapora (Contributor) commented Jul 17, 2019

I was reading through the wip PR for the kad-dht spec and the Kad paper, and it really drove home the importance of this issue. A full routing table in our DHT needs 5120 entries (256 buckets * k=20), but the practical connection limit is much lower than that.

I don't know what people are running in production, but the default high water mark for the connmgr in go-ipfs seems to be 900 connections, or ~17% of a full kad table. In practice, it seems nobody ever has a full routing table anyway, but since we're pulling peers out of the table randomly, we can't really satisfy this property from the paper:

"The Kademlia protocol ensures that every node knows of at least one node in
each of its subtrees, if that subtree contains a node"

There's nothing stopping us from removing the only peer in a bucket if it temporarily disconnects.

The quickest and dirtiest fix I can think of would be to still use connectedness as a heuristic for eviction, but move the check so that it only happens when deciding whether to evict a node from a full k-bucket. In other words, don't remove a peer from the table when they disconnect, but if a bucket is full, evict the least-recently-seen node that isn't currently connected.

Overall though, I think connectedness is the wrong way to think about this. We care about whether we can get a message to the peer, not if we're currently connected to them. But it might provide some short-term benefit while we work on a better eviction policy.
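
A hedged sketch of that quick fix, reusing the bucket/entry types from the earlier sketch. `connChecker` is a stand-in for something like the host's network connectedness check, and `tryAdd` is illustrative, not the actual kbucket API.

```go
package rtsketch

import "time"

// connChecker reports whether we currently hold a connection to a peer;
// in libp2p this would be backed by the host's network connectedness.
type connChecker func(p peerID) bool

// tryAdd implements the heuristic described above: never drop a peer just
// because it disconnected; only when the bucket is full, evict the
// least-recently-seen entry that is not currently connected.
func (b *bucket) tryAdd(p peerID, connected connChecker) bool {
	if len(b.entries) < b.k {
		b.entries = append(b.entries, entry{id: p, state: stateAlive, lastSeen: time.Now()})
		return true
	}
	victim := -1
	for i, e := range b.entries {
		if connected(e.id) {
			continue // never evict a peer we are still connected to
		}
		if victim == -1 || e.lastSeen.Before(b.entries[victim].lastSeen) {
			victim = i
		}
	}
	if victim == -1 {
		return false // everyone is connected; keep the old peers (Kademlia's age bias)
	}
	b.entries[victim] = entry{id: p, state: stateAlive, lastSeen: time.Now()}
	return true
}
```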

Chris-R-111 commented Jul 18, 2019

Agreed, IPFS-Kad requiring peers to stay connected misses a CRITICAL property of Kad design.

Kad has an extreme bias (the only property it cares about) towards long-lived peers. The only eviction policy for a peer is that the peer fails an explicit health check. Even a peer failing to respond to queries does not cause immediate eviction, in order to be resistant to temporary connection hiccups; truly dead peers quickly become eligible for the health check mechanism.

As for the health check, the Kad spec is kinda weird: its initial description in section 2.2 is soon altered in section 4.1 for being problematic, specifically because "the algorithm as described would require a large number of network messages". This is likely why most Kad implementations have different solutions for health checks.

  • uTorrent tests all "stale" peers every 15 minutes.
  • BitTorrent marks nodes as questionable/bad and either tests or replaces entries based on these flags.
  • NICE tests the last-seen peer every 6 seconds (similar to uTorrent but less bursty).
  • Kad section 2.2 tests the last seen peer for every new candidate found.
  • Kad section 4.1 defers these tests to attempt to piggyback off a future RPC to the peer.
  • IPFS-Kad piggybacks on the connection manager state (if peer is disconnected it is dead).

Except for IPFS-Kad, a property shared by all of these implementations is that they only remove peers that have been proven dead. The misattribution that disconnected = dead, combined with the connection manager itself often thrashing, makes the routing table quite unstable.

lanzafame (Contributor) commented:

> I don't know what people are running in production, but the default high water mark for the connmgr in go-ipfs seems to be 900 connections, or ~17% of a full kad table.

Just stopping by to provide some production values, the current IPFS gateway deployments have HighWater set to 20,000 and LowWater set to 19,000, with plans to increase that and I believe Pinata have theirs set to ~30k.

yusefnapora (Contributor) commented:

Thanks for those numbers @lanzafame, that's very interesting. It helps explain why the DHT works in practice, since there are bound to be some "infrastructural" nodes that have high connection limits and are generally well-connected.

Also thanks for the writeup on the different health check strategies @Chris-R-111, that's very interesting. I had been thinking along the lines of putting questionable nodes into a "testing" queue, which sounds like it's basically the Bittorrent approach.

The sad thing is that the connection limits are asymmetric. Even if our gateway nodes have high connection limits, the other nodes in their routing tables may not. If a "regular node" is in the routing table of a gateway node and needs to close the connection because they're at their limit, the gateway node will then forget about the regular node completely and damage its own routing table as a result.

If participating well in the DHT requires maintaining thousands of long-lived connections without interruption, it seems like that puts the bar too high for anyone not located in a data center 😞

meyer9 commented Oct 15, 2019

@yusefnapora - The routing table should never be full. That would mean there are 20 * 2^256 nodes on the network. The current 900 peer limit means we could have 45 full buckets or 20 * 2^45 nodes on the network.

However, that doesn't change the fact that the assumption that disconnected = dead causes some critical problems with the DHT.

@jacobheun jacobheun added the Epic label Jan 23, 2020
@jacobheun jacobheun added this to the Working Kademlia milestone Jan 23, 2020
@jacobheun jacobheun changed the title Don't require routing table entries to be connected Dissociating routing table membership from connection state Jan 23, 2020
aarshkshah1992 (Contributor) commented:

@jacobheun We always mark a peer we lose connection with as missing, right?

Irrespective of whether we or the other party is a client or a server, and irrespective of whether the disconnection was caused by the connection manager, a transient network issue, or the other peer actively closing the connection.

raulk (Member) commented Jan 30, 2020

@aarshkshah1992 we should only be having DHT servers in our routing table, following the other planned changes.

aarshkshah1992 (Contributor) commented Feb 10, 2020

Considering the design discussions above and this great blog post by Arvid Norberg on RT maintenance based on solving the same problem for libtorrent (which in turn is based on section V of this paper), I'm tempted to try out the following approach:

Approach:


  • Each peer in the RT has two additional fields:
    • Last Queried Time (LQT):
      When did we last hear any kind of response from the peer?
    • State:
      • Pinged:
        • We've heard at least one response from the peer.
        • A peer that we create an outbound connection to starts out in this state with LQT = time of connection. We then update the LQT whenever we hear any kind of response from it.
      • Unpinged:
        • We've never heard a response from the peer. A peer that we get an inbound connection from is in this state with LQT = 0 until we actually ask it for something and it responds, at which point we mark it as Pinged.
  • The RT hides the Unpinged peers from the DHT & never returns them as part of any lookup.
  • We now periodically run a Find_Peer(randPeerId) query against the Most Stale Peer, where randPeerId is a random peer ID falling in the bucket of the most stale peer (see the sketch after this list).
    • The query allows us to verify that the peer is alive and to update its LQT. As an added benefit, a successful response to Find_Peer enables us to discover more peers in the network.
    • If a peer fails to respond to a query two times in a row (can be changed based on testing), we evict it.
    • The post & paper mention a period of 6 seconds, but we can tune it based on testing.
    • The Most Stale Peer is the peer with the lowest LQT:
      • An Unpinged peer is always considered more stale than a Pinged one.
      • In case of ties, we always bias towards the peer closest to us, i.e. the peer in the "lowest" K-bucket. This is good because we discover closer peers/more peers in our Kad-vicinity, and it helps increase the size of the RT because we could end up unrolling the last buckets.
      • Once we've filled up 16 buckets, we do this from the 16th bucket to the top, as we can only generate random peer IDs which share a CPL of 16 with us.
  • A Pinged peer always trumps an Unpinged peer when a K-bucket is full, i.e. we always evict an Unpinged peer to make space for a Pinged peer.
  • We keep expanding the pool of peers in the RT by using peers that we learn about from DHT queries during the normal mode of operation. If we contact a peer among those for a query, it goes to the Pinged state; otherwise, it goes to the Unpinged state.
  • We DO NOT evict peers when we disconnect from them as we do now. The only times we evict a peer are when:
    • It fails to respond to two Find_Peer probes in a row.
    • We remove an Unpinged peer to make space for a Pinged peer.
  • We replace our current RT refresh mechanism with this, as the constant Find_Peer queries will keep our buckets fresh & loaded. We could still run "query for self" as we do currently, to ensure we keep searching for peers closer to us than we already know of.
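
A rough Go sketch of the staleness selection and the periodic Find_Peer probe described in the list above. `findPeer`, `randIDInBucket`, `peers` and `evict` are hypothetical stand-ins for the DHT query machinery and RT accessors, and the 2-failure threshold and probe interval are the tunables mentioned above.

```go
package rtsketch

import (
	"context"
	"time"
)

// peerInfo carries the two extra fields proposed above: last queried time
// (LQT) and the Pinged/Unpinged flag.
type peerInfo struct {
	id        peerID
	lqt       time.Time // zero for Unpinged peers
	pinged    bool
	failed    int // consecutive failed probes
	bucketIdx int // index of the k-bucket this peer lives in
}

// mostStale picks the next peer to probe: Unpinged before Pinged, then the
// lowest LQT, with ties broken towards the closest ("lowest") bucket.
func mostStale(peers []*peerInfo) *peerInfo {
	var stale *peerInfo
	for _, p := range peers {
		switch {
		case stale == nil:
			stale = p
		case !p.pinged && stale.pinged:
			stale = p
		case p.pinged == stale.pinged && p.lqt.Before(stale.lqt):
			stale = p
		case p.pinged == stale.pinged && p.lqt.Equal(stale.lqt) && p.bucketIdx < stale.bucketIdx:
			stale = p
		}
	}
	return stale
}

// maintain runs one Find_Peer probe per tick against the most stale peer,
// targeting a random ID in that peer's bucket so the probe doubles as
// discovery. Two consecutive failures evict the peer.
func maintain(ctx context.Context, every time.Duration,
	peers func() []*peerInfo,
	randIDInBucket func(bucketIdx int) peerID,
	findPeer func(ctx context.Context, via peerID, target peerID) error,
	evict func(p peerID),
) {
	ticker := time.NewTicker(every) // the libtorrent write-up uses ~6s; tune via testground
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		p := mostStale(peers())
		if p == nil {
			continue
		}
		if err := findPeer(ctx, p.id, randIDInBucket(p.bucketIdx)); err != nil {
			p.failed++
			if p.failed >= 2 {
				evict(p.id)
			}
			continue
		}
		p.failed, p.pinged, p.lqt = 0, true, time.Now()
	}
}
```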

Notes on address management


  • A peer address we learn of during a DHT query gets added to the peerstore with a TTL of 2 minutes, and it will remain so if we do not dial it.
  • When we disconnect from any peer, Identify updates the TTL of the address to 10 minutes.
  • This could potentially create problems in two scenarios:
    • We disconnect from a peer -> it isn't selected as the most stale peer within 10 minutes -> we don't have an address for it when we query.
    • We discover a peer during a query but don't dial it -> it isn't selected as the MSP within 2 minutes -> no address when querying.
  • One approach I can think of for the above is to cache addresses during disconnect/query response, like we do in the AutoNAT service, and evict the address when we evict the peer (see the sketch below).
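
A small sketch of that caching idea from the last bullet: keep a per-peer address snapshot for as long as the peer is in the RT, and drop it on eviction. The types are illustrative; a real version would store multiaddrs and hook the host's disconnect notifications.

```go
package rtsketch

// addrCache remembers the last known addresses of routing table members so
// that a stale peer can still be dialled after the peerstore TTL (2m/10m)
// has lapsed. Entries live exactly as long as the peer's RT membership.
type addrCache struct {
	addrs map[peerID][]string // multiaddrs kept as strings in this sketch
}

func newAddrCache() *addrCache {
	return &addrCache{addrs: make(map[peerID][]string)}
}

// remember is called on disconnect and on query responses.
func (c *addrCache) remember(p peerID, addrs []string) {
	c.addrs[p] = append([]string(nil), addrs...)
}

// forget is called when the peer is evicted from the routing table.
func (c *addrCache) forget(p peerID) {
	delete(c.addrs, p)
}

// dialAddrs supplies addresses for a maintenance probe.
func (c *addrCache) dialAddrs(p peerID) []string {
	return c.addrs[p]
}
```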

Why I like this approach


aarshkshah1992 (Contributor) commented Feb 10, 2020

Ping @Stebalien @aschmahmann @dirkmc

raulk (Member) commented Feb 10, 2020

@aarshkshah1992 Thanks for doing this research. Indeed these mechanisms are necessary since day 0 in pure Kademlia because the underlying protocol is unreliable (UDP); therefore you rely on PING messages to figure out if an apparently missing peer is actually alive.

Taking one step back, this problem/solution domain can be broken down into various abstract pieces that, together, make a coherent solution:

  • failure counting / tolerance: what events do we count as a failure, and how many of those are we willing to tolerate before we evict?
  • connectivity state: do we consider the peer to be online, disputed, or offline?
  • eviction policy: when do we evict the peer from the routing table?
  • replacements: when a peer is ultimately evicted, what do we replace it with? Do we keep a replacement set, so we can replace rapidly, or do we rely on network queries solely?
  • in/out quotas: I'm not sure we've explicitly mentioned this before. One problem with the solution in Arvid's blog post (which I'm sure was addressed later in implementation) is that an implementation adding all nodes to the RT it learnt of throughout a query (as unpinged) makes that implementation subject to poisoning/eclipse attacks, if the bucket is empty / mostly vacant. So the logic for selecting peers to add to a bucket should ideally take provenance into account, to avoid situations where peers returned by a single source can take over a bucket (quasi-)entirely.

Going back to the concrete proposal, @aarshkshah1992:

  • I think the solution you proposed is very much aligned in spirit with the solution that @Stebalien and I made. I don't see any aspect that's radically different from our solution. We attach states to peers, we do failure counting, we evict on a specific condition, then replace the peer. The actual mechanics may differ, but the concerns being addressed are the same.
  • There's no silver bullet here. What we end up with will be a custom solution inspired by our own ideas, and other precedents.
  • The post you cited is from 2014 -- the protocol/implementations have evolved since then, so it would be great to unpack the status quo. HOWEVER...
  • At this point, I would encourage us to get started implementing the original idea, keeping our minds open to new input like the one you've supplied.
  • All these things can be pluggable event-based strategies: how we do failure counting, how we transition connectivity states, etc. We could really benefit from making the Kad DHT a functional event-based system, where these functions respond to events and make transactional changes to the underlying system (routing table, scores, etc.); a rough sketch of such interfaces follows.
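
One illustrative way those pluggable pieces could be expressed as Go interfaces. The names and shapes are assumptions of this sketch, not an existing kad-dht API.

```go
package rtsketch

import "time"

// Event is a connectivity or query observation fed to the strategies.
type Event struct {
	Peer peerID
	Kind string // e.g. "connected", "disconnected", "query-ok", "query-timeout"
	At   time.Time
}

// FailureCounter decides which events count as failures and how many we
// tolerate before considering a peer dead.
type FailureCounter interface {
	Record(ev Event)
	Dead(p peerID) bool
}

// EvictionPolicy decides when a dead peer is actually removed from the RT.
type EvictionPolicy interface {
	ShouldEvict(p peerID) bool
}

// ReplacementSource yields candidates for a freed slot and is where
// provenance (in/out quotas) can be enforced, so peers learnt from a
// single source cannot take over a bucket.
type ReplacementSource interface {
	Candidate(bucketIdx int) (peerID, bool)
}
```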

raulk (Member) commented Feb 10, 2020

To paint the background a little bit more, there are items of (S/)Kademlia we can take as prescriptive, and others that we can't. The algorithmic elements we are taking as prescriptive, but the items related to connectivity and transports, we really can't.

One example: merely using LQT as a heuristic is suboptimal, because it assumes a disconnected scenario (UDP). In our case, we might actually hold a connection with that peer, which is kept alive by TCP and the multiplexer, so even if we haven't queried that peer in a long time, we have a high confidence that it's alive (albeit not healthy?).

aarshkshah1992 (Contributor) commented Feb 10, 2020

@raulk Thanks for the great reply.

> @aarshkshah1992 Thanks for doing this research. Indeed these mechanisms are necessary since day 0 in pure Kademlia because the underlying protocol is unreliable (UDP); therefore you rely on PING messages to figure out if an apparently missing peer is actually alive.
>
> Taking one step back, this problem/solution domain can be broken down into various abstract pieces that, together, make a coherent solution:
>
>   • failure counting / tolerance: what events do we count as a failure, and how many of those are we willing to tolerate before we evict?
>   • connectivity state: do we consider the peer to be online, disputed, or offline?
>   • eviction policy: when do we evict the peer from the routing table?
>   • replacements: when a peer is ultimately evicted, what do we replace it with? Do we keep a replacement set, so we can replace rapidly, or do we rely on network queries solely?

This is a great point. I agree that the problem has "generic" scaffolding and lends itself very well to a pluggable-strategies solution, and this should indeed be the design we go ahead with.

>   • in/out quotas: I'm not sure we've explicitly mentioned this before. One problem with the solution in Arvid's blog post (which I'm sure was addressed later in implementation) is that an implementation adding all nodes to the RT it learnt of throughout a query (as unpinged) makes that implementation subject to poisoning/eclipse attacks, if the bucket is empty / mostly vacant. So the logic for selecting peers to add to a bucket should ideally take provenance into account, to avoid situations where peers returned by a single source can take over a bucket (quasi-)entirely.

Even the current DHT implementation and the original idea suffer from the same problem. We do not currently limit how many peers suggested by a single peer are inserted into the K-bucket, as long as we are able to connect to them. IMO, this falls within the larger scope of work of making the DHT resistant to eclipse/sybil attacks. We could impose a limit here, but this problem shouldn't guide which approach we go ahead with, as it is a shortcoming of both approaches.

> Going back to the concrete proposal, @aarshkshah1992:
>
>   • I think the solution you proposed is very much aligned in spirit with the solution that @Stebalien and I made. I don't see any aspect that's radically different from our solution. We attach states to peers, we do failure counting, we evict on a specific condition, then replace the peer. The actual mechanics may differ, but the concerns being addressed are the same.
>   • There's no silver bullet here. What we end up with will be a custom solution inspired by our own ideas, and other precedents.
>   • The post you cited is from 2014 -- the protocol/implementations have evolved since then, so it would be great to unpack the status quo. HOWEVER...
>   • At this point, I would encourage us to get started implementing the original idea, keeping our minds open to new input like the one you've supplied.

Like I said, I am all in for going ahead with a pluggable event-based strategies solution. However, is there any specific reason we'd prefer the original idea over the one proposed here as the default implementation?

  • The latter has been tried in production
  • Maps nicely to how we currently refresh the RT & could even replace it entirely
  • Makes the traffic less bursty by spreading out the lookups/maintenance over a prolonged duration
  • I am not entirely sure about this, but running just one FindPeer query periodically could save us bandwidth, compared to pinging all missing peers and then all the candidates (if the missing ones aren't available) in addition to refreshing unused buckets periodically, as proposed in the original idea. I can assert this with more confidence once we have some testground numbers.

aarshkshah1992 (Contributor) commented:

> One example: merely using LQT as a heuristic is suboptimal, because it assumes a disconnected scenario (UDP). In our case, we might actually hold a connection with that peer, which is kept alive by TCP and the multiplexer,

  1. Moving to UDP is the longer term direction we want to take with DHT
  2. Just because we have a TCP connection with a peer does not mean it is dialable, as it could be behind a NAT (inbound connections).

> so even if we haven't queried that peer in a long time, we have a high confidence that it's alive (albeit not healthy?).

Sure, all the latter approach does is send it a FindPeer query to assert that it's still healthy (and discover more peers as a side effect), and it then won't bother it until it finishes cycling through all the other peers in the RT.

raulk (Member) commented Feb 10, 2020

We had some off-band discussion. Both solutions (Arvid's and ours) are fundamentally akin; they do not compete. They basically vary in two parameters: the frequency of validation/bootstrapping, and the peer selection function. We can make those aspects (and others) parameterisable.


On a second order of things, the points that you raise about the peerstore and address management are all valid. If we stay disconnected from a peer too long, two things can happen:

  1. we still remember their peer record, but it may be outdated (their addresses could’ve changed).
  2. we forget their peer record -- this can be prevented by bumping up the TTL of peers in the RT to permanent, and reducing it when we finally evict the peer (sketched below).

In both cases we’d need to find that peer. Kademlia favours older peers, so we'd rather validate that peer than find a new one.
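
A sketch of option 2 above: pin the addresses of RT members by moving them to an effectively permanent TTL and letting them age normally again on eviction. The addrBook interface below is a local stand-in that mirrors the semantics of the libp2p peerstore's address book, and the TTL values are assumptions of this sketch.

```go
package rtsketch

import "time"

// addrBook is a minimal stand-in for the peerstore's address book:
// UpdateAddrs moves all addresses of a peer from one TTL class to another.
type addrBook interface {
	UpdateAddrs(p peerID, oldTTL, newTTL time.Duration)
}

const (
	// recentlyConnectedTTL approximates what Identify leaves behind after a
	// disconnect (assumed ~10 minutes here).
	recentlyConnectedTTL = 10 * time.Minute
	// permanentTTL is effectively "never expires".
	permanentTTL = time.Duration(1 << 62)
)

// onAddToTable pins the peer's addresses for as long as it sits in the RT.
func onAddToTable(ab addrBook, p peerID) {
	ab.UpdateAddrs(p, recentlyConnectedTTL, permanentTTL)
}

// onEvictFromTable lets the addresses age out normally again.
func onEvictFromTable(ab addrBook, p peerID) {
	ab.UpdateAddrs(p, permanentTTL, recentlyConnectedTTL)
}
```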

Stebalien (Member) commented:

> Moving to UDP is the longer term direction we want to take with DHT

It's definitely not. This is a common misconception that I want to make sure doesn't get propagated.

The long-term directions are:

  1. Be transport agnostic.
  2. Support messaging (reliable, unreliable, etc.) transports.
  3. Support more efficient (in terms of connection setup latency, file descriptor usage, etc.) transports.

Really, the long term direction is likely packet-switched overlays over arbitrary network transports.

raulk (Member) commented Feb 12, 2020

Copying over notes from a call with @aarshkshah1992 earlier today:

  • We should kill the kbucket repo and migrate the data structure into kad-dht.
  • IMO, the RT should no longer use an imperative API, i.e. Insert, Remove, etc. as that’s misleading (removing a peer may not remove the peer, just mark it inactive).
    • Instead, it should expose an event-based API: MarkActive(peer), MarkInactive(peer).
    • kad-dht then notifies the routing table of state transitions, and lets the RT manage them by (conditionally) mutating the data.
  • kad-dht injects a peer validation predicate into the routing table (via the constructor; potential signature (peer.ID) => (Result<Alive|Dead|Unknown>, error)). This is the function that the RT will use to validate missing peers; it wires to host.Connect (see the sketch after these notes).
    • Rationale: we don’t want to inject the Host itself into the RT — that’s too leaky.
  • Peer validation and routing table refresh are two different, but related, background processes.
    • Peer validation is owned by the RT, and RT refresh is owned by kad-dht.
    • Rationale: the RT is a self-cleaning data structure that looks after its own health (peer validation); whereas table refresh is owned by the kad-dht implementation.
  • Replacing dead peers: using a bounded per-bucket replacement queue like @Stebalien and I proposed (instead of on-the-spot FIND_PEER queries) factors in the age heuristic of Kademlia (older is better).
    • During the course of queries, we encounter many well-behaving peers but have no capacity for them in the routing table. We record them in the replacement queue, and don’t care about their connection state.
    • When we need to replace a dead peer, we pop the head (eldest) and test it; if it’s alive we add it to the bucket. If not, we keep burning through the replacement cache.
    • If none of the peers was alive, there will be a gap in the bucket, and the routing table refresh goroutine will take care of doing a random walk in that bucket, when it's time for it.
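
A rough Go rendering of these call notes, reusing the bucket/entry types from the first sketch. The names are placeholders of this sketch, not the final go-libp2p-kbucket interface.

```go
package rtsketch

import "context"

// Liveness is the result type of the validation predicate injected by kad-dht.
type Liveness int

const (
	Alive Liveness = iota
	Dead
	Unknown
)

// ValidateFn is wired by kad-dht to something like host.Connect; the
// routing table never sees the Host itself.
type ValidateFn func(ctx context.Context, p peerID) (Liveness, error)

// RoutingTable exposes an event-based API instead of Insert/Remove; the
// table decides how to mutate its own data on each transition.
type RoutingTable interface {
	MarkActive(p peerID)
	MarkInactive(p peerID) // may keep the entry and merely flag it missing
}

// replaceDead burns through the replacement queue, eldest first (Kademlia's
// age bias), validating each candidate until one is alive. If none is, the
// slot stays empty for the refresh walk to fill later.
func replaceDead(ctx context.Context, b *bucket, validate ValidateFn) bool {
	for len(b.replacements) > 0 {
		cand := b.replacements[0]
		b.replacements = b.replacements[1:]
		if live, err := validate(ctx, cand); err == nil && live == Alive {
			b.entries = append(b.entries, entry{id: cand, state: stateAlive})
			return true
		}
	}
	return false
}
```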

aarshkshah1992 (Contributor) commented:

ping @Stebalien

Stebalien (Member) commented:

> We should kill the kbucket repo and migrate the data structure into kad-dht.

We split it off because we were planning to create (or did create?) alternative DHTs. However, I don't have any strong objections to merging it back in if the separation is causing problems.

Note, we're not the only users: https://godoc.org/github.com/libp2p/go-libp2p-kbucket?importers.

> IMO, the RT should no longer use an imperative API, i.e. Insert, Remove, etc. as that’s misleading (removing a peer may not remove the peer, just mark it inactive).

Makes total sense.

> kad-dht injects a peer validation predicate into the routing table (via the constructor; potential signature (peer.ID) => (Result<Alive|Dead|Unknown>, error)). This is the function that the RT will use to validate missing peers; it wires to host.Connect.

Great idea!

> Peer validation and routing table refresh are two different, but related, background processes.

👍

> Replacing dead peers

👍
