Discv5 factorize #480

cskiraly · 2022-02-23T20:41:32Z

This set of patches tries to break up protocols.nim and encoding.nim into smaller chunks of code,
reducing dependencies and simplifying modifications to the discv5 code.

The main part is to separate protocol.nim into two layers:

a lower-half handling authentication, encryption, and key exchange. This is now called Transport.
an upper-half handling DHT messages and request/response semantics. This is still called Protocol.

Another patch separates encoding related to the lower-half from encoding related to the upper-half.

Other patches try to improve semantics of various send messages, and remove some duplicated code.

The last patch removes some dependencies, but also an export from enr.nim. I can imagine this could create
issues, so it can be removed from the set.

Messages and message encoding has nothing to do with the underlying authenticated communications framework. Separate these two.

Protocol was actually made of two sub-protocols. * a lower-half handling authentication, encryption, key exchange, and request/response. This is now called Transport. * an upper-half handling DHT messages. This is still called Protocol. Separation of these two reduces dependencies and simplifies modifications to the protocol. Signed-off-by: Csaba Kiraly <[email protected]>

Note that the test does not compile, but it was also not compiling before Signed-off-by: Csaba Kiraly <[email protected]>

- interface between Transport and Protocol is at the encoded Message level

Node already has the address, so it does not make sense to pass it as a separate parameter.

This is a request part of a Request/Response, generating also a reques ID. So call it what it is.

This completes the Request/Response semantics.

It is better to rely on protocol.nim to do all the encoding, thus cleaning up dependencies.

kdeme

This set of patches tries to break up protocols.nim and encoding.nim into smaller chunks of code, reducing dependencies and simplifying modifications to the discv5 code.

The main part is to separate protocol.nim into two layers:

I'm fine with doing something like this. I'd like to hear a bit more about the context and your use case. What are you trying to achieve?

I've thought in the past about making parts of discv5 generic. But I was then specifically thinking about doing it at a layer which skips the whole wire protocol of the discv5 spec. So, generic nodes queries / lookups. (And on the other side of the spectrum, perhaps the possibility of switching also the actual UDP transport). This was in the context of reusing most of the code, except the wire protocol, for discv4. However, never got to that considering I was unsure if we would still need discv4 and thus if it was worthy the effort. This resurfaced however in the context of our Fluffy project.

With your suggestion in this PR, we split of the messages and their encoding (rlp), from the packet encoding. That sounds good, but it would only allow you to swap in/out either the packet encoding or the protocol messages. So I'm assuming you are planning to reuse only one of these 2 parts in the dagger project?

I think most of the other suggestions are fine too, but I'll do a more thorough review later on.
I would probably drop the message_encoding just in the messages file. There is already rlp encoding going on there anyhow, no need for a new file. Unless you want to move all rlp encoding out there too, and only keep the message definitions, in order to abstract that part also.

cskiraly · 2022-02-24T11:13:43Z

This set of patches tries to break up protocols.nim and encoding.nim into smaller chunks of code, reducing dependencies and simplifying modifications to the discv5 code.
The main part is to separate protocol.nim into two layers:

I'm fine with doing something like this. I'd like to hear a bit more about the context and your use case. What are you trying to achieve?

I've thought in the past about making parts of discv5 generic. But I was then specifically thinking about doing it at a layer which skips the whole wire protocol of the discv5 spec. So, generic nodes queries / lookups. (And on the other side of the spectrum, perhaps the possibility of switching also the actual UDP transport). This was in the context of reusing most of the code, except the wire protocol, for discv4. However, never got to that considering I was unsure if we would still need discv4 and thus if it was worthy the effort. This resurfaced however in the context of our Fluffy project.

With your suggestion in this PR, we split of the messages and their encoding (rlp), from the packet encoding. That sounds good, but it would only allow you to swap in/out either the packet encoding or the protocol messages. So I'm assuming you are planning to reuse only one of these 2 parts in the dagger project?

I think most of the other suggestions are fine too, but I'll do a more thorough review later on. I would probably drop the message_encoding just in the messages file. There is already rlp encoding going on there anyhow, no need for a new file. Unless you want to move all rlp encoding out there too, and only keep the message definitions, in order to abstract that part also.

Indeed, I see I’ve left some of the RPL part. I’m adding a patch to move that out as well. I think it would make sense.

Now, the broader context: The main thing we need is a fast DHT, that implements the primitives of the libp2p-dht spec, and integrates well with the libp2p world, implemented in Nim.

Ideally, It should be useful both for libp2p-dht (which relies on the underlying stream transport), and for a dht that skips the overhead coming with those streams and goes UDP.

As you can see your code ticks many of these boxes. It is fast, UDP based, with good datagram crypto, it is in Nim, and it already went through several iterations. There are however some things that would need changing, and there are some conflicting requirements:
1, we need some additional messaging (Add/GetProviders, Put/GetValue).
2, message encoding and message IDs are dictated by specs, and these do collide.
3, RLP encoding is not ideal for me, although if it is in another layer, it is not a big issue
4, a mix of 2 worlds of crypto primitives, one based on ENR, and another based on libp2p signed PeerRecord sounds and feels problematic
5, finally, libp2p-dht assumes a dht running on top of stream oriented libp2p transports, which complicates things

Now one by one:

Additional messages: These could be even done on top of talk, if considering functionality, but not considering encoding constraints. Anyway, focusing on functionality, these are all defined as starting with a lookup and then going direct. Talk is a bit restrictive in its single message Request/Response model not allowing multiple response datagrams like in findNode, but apart from that, it is almost like on-top-of-talk.
specs: this clearly requires separating encoding and the handling of message IDs. Also swapping out layers below, changing functionality, etc. It also means additional messages can’t be on top of talk, but rather next to it. 
RLP: This is why I try to separate all these binary encoding parts. As you say, your RPL encoding is there is several layers, and I've tried to separate these. It is there in transport, it is there in the message encoding on top, and it is there in the encoding of ENRs. Ideally one could swap any of these independent of each other. 
Crypto and peer records: this is complicated, with distinct used in both worlds and with slightly different primitives. However, if I would like to use libp2p peer records and your AES-GCM, I need some changes here at least in my version. 
There are many aspects here, for example differences in the messaging and operation of lookups. The dicv5 distance based lookup is not what libp2p-dht specifies. That one is closer to discv4 in its messages, but different in its iterations due to stream oriented transports.   I hope this gives you the context. Code-wise I think it would be good to do the factorisation, and then we can see how it allows us to use common generic parts. What do you think?

kdeme · 2022-02-25T21:57:10Z

I hope this gives you the context.

It certainly does, thanks!

Code-wise I think it would be good to do the factorisation, and then we can see how it allows us to use common generic parts. What do you think?

Moving the messages encoding is definitely fine. Moving what you call the Transport makes mostly sense in the case you would have custom Clients, which would be your use case. Not that needed/interesting for purely discv5, but its fine separating it out, they can be considered separate layers within discv5 I guess.
I'm somewhat more interested in separating out the wire protocol, but this could be a good first step in that direction.
Not so found on the naming used currently though.

It would be good to know how far this gets you already, some additional remarks/info on your points:

Additional messages: These could be even done on top of talk, if considering functionality, but not considering encoding constraints. Anyway, focusing on functionality, these are all defined as starting with a lookup and then going direct. Talk is a bit restrictive in its single message Request/Response model not allowing multiple response datagrams like in findNode, but apart from that, it is almost like on-top-of-talk.

Yes, additional messages could be done over talk. And you could choose your encoding, but the outer layer needs to be rlp encoded byte string. And as you mentioned, there are some constraints such as the multiple response problem.
We actually use this system in the Portal network (fluffy client in nimbus-eth1, WIP) to define an additional protocol on top. And the lack of multiple responses is a problem there. However, in Portal network, there are also several DHTs on top which use each this protocol over talk.
For a single DHT project, it is rather ugly to have messages on two different layers. So I'd understand you'd want to get your own custom messages defined at the base layer with the encoding fit for the project.

2. specs: this clearly requires separating encoding and the handling of message IDs.  Also swapping out layers below, changing functionality, etc. It also means additional messages can’t be on top of talk, but rather next to it.

Well, they could I guess, but it would be ugly, see above.

3. RLP: This is why I try to separate all these binary encoding parts. As you say, your RPL encoding is there is several layers, and I've tried to separate these. It is there in transport, it is there in the message encoding on top, and it is there in the encoding of ENRs. Ideally one could swap any of these independent of each other.

The packet encoding doesn't use rlp, only the messages and the ENR. For the messages it seems solved with this PR if you make a custom messages definition / encoding?
ENR would not be solved yet. The ENR is used in the messages, and thus could be replaced if you make that custom. However it can also be provided during the handshake mechanism, which would require additional changes. (aside from the fact that the ENR is stored in Node, which is what is stored in the routing table).

4. Crypto and peer records: this is complicated, with distinct used in both worlds and with slightly different primitives. However, if I would like to use libp2p peer records and your AES-GCM, I need some changes here at least in my version.

I'd have to read up what "libp2p signed PeerRecord" are, but assuming they are similar to ENRs in the sense that give same security guarantees?

5. There are many aspects here, for example differences in the messaging and operation of lookups. The dicv5 distance based lookup is not what libp2p-dht specifies. That one is closer to discv4 in its messages, but different in its iterations due to stream oriented transports.

Well, not sure what can be done here. Depends if the goal is to have a DHT that can be used efficiently to lookup, or if it is to follow purely libp2p-dht specs. Discv5 has indeed altered the FindNodes message slightly to be distance based, compared to just passing the target like in discv4 or just original Kademlia. IIRC this was adjusted so that a node would know when another node is returning not really its closest neighbours (whether due to bug or malicious).
In the case where you would have custom messages, you could also just adjust that again.

cskiraly · 2022-03-17T08:17:31Z

I hope this gives you the context.

It certainly does, thanks!

Sorry for coming back to this only now.

Code-wise I think it would be good to do the factorisation, and then we can see how it allows us to use common generic parts. What do you think?

Moving the messages encoding is definitely fine. Moving what you call the Transport makes mostly sense in the case you would have custom Clients, which would be your use case. Not that needed/interesting for purely discv5, but its fine separating it out, they can be considered separate layers within discv5 I guess. I'm somewhat more interested in separating out the wire protocol, but this could be a good first step in that direction. Not so found on the naming used currently though.

Naming was just a first shot. It is essentially a "SecureDatagramTransport" or "AuthenticatedEncryptedDatagramTransport", I guess.
What do you mean by separating out the wire protocol here?

It would be good to know how far this gets you already, some additional remarks/info on your points:

Additional messages: These could be even done on top of talk, if considering functionality, but not considering encoding constraints. Anyway, focusing on functionality, these are all defined as starting with a lookup and then going direct. Talk is a bit restrictive in its single message Request/Response model not allowing multiple response datagrams like in findNode, but apart from that, it is almost like on-top-of-talk.

Yes, additional messages could be done over talk. And you could choose your encoding, but the outer layer needs to be rlp encoded byte string. And as you mentioned, there are some constraints such as the multiple response problem. We actually use this system in the Portal network (fluffy client in nimbus-eth1, WIP) to define an additional protocol on top. And the lack of multiple responses is a problem there. However, in Portal network, there are also several DHTs on top which use each this protocol over talk. For a single DHT project, it is rather ugly to have messages on two different layers. So I'd understand you'd want to get your own custom messages defined at the base layer with the encoding fit for the project.
2. specs: this clearly requires separating encoding and the handling of message IDs.  Also swapping out layers below, changing functionality, etc. It also means additional messages can’t be on top of talk, but rather next to it.
Well, they could I guess, but it would be ugly, see above.
3. RLP: This is why I try to separate all these binary encoding parts. As you say, your RPL encoding is there is several layers, and I've tried to separate these. It is there in transport, it is there in the message encoding on top, and it is there in the encoding of ENRs. Ideally one could swap any of these independent of each other.
The packet encoding doesn't use rlp, only the messages and the ENR. For the messages it seems solved with this PR if you make a custom messages definition / encoding? ENR would not be solved yet. The ENR is used in the messages, and thus could be replaced if you make that custom. However it can also be provided during the handshake mechanism, which would require additional changes. (aside from the fact that the ENR is stored in Node, which is what is stored in the routing table).

Right, I think I was confused by still leaving some RLP code in decodeHandshakePacket.

For the Node, it really goes hand-in-hand with the ENR I think. I see it mostly as an in-memory representation of the ENR.

4. Crypto and peer records: this is complicated, with distinct used in both worlds and with slightly different primitives. However, if I would like to use libp2p peer records and your AES-GCM, I need some changes here at least in my version.
I'd have to read up what "libp2p signed PeerRecord" are, but assuming they are similar to ENRs in the sense that give same security guarantees?

They are similar in their base functionality of having the underlay address(es) info and the public key, all signed, although without ENR's generic extensible key/value pairs.

5. There are many aspects here, for example differences in the messaging and operation of lookups. The dicv5 distance based lookup is not what libp2p-dht specifies. That one is closer to discv4 in its messages, but different in its iterations due to stream oriented transports. 
Well, not sure what can be done here. Depends if the goal is to have a DHT that can be used efficiently to lookup, or if it is to follow purely libp2p-dht specs. Discv5 has indeed altered the FindNodes message slightly to be distance based, compared to just passing the target like in discv4 or just original Kademlia. IIRC this was adjusted so that a node would know when another node is returning not really its closest neighbours (whether due to bug or malicious). In the case where you would have custom messages, you could also just adjust that again.

Just curious, do you have a link to an evaluation of the cost of this in lookup speed?

arnetheduck · 2022-09-02T12:11:33Z

@cskiraly what is the state of this PR? In general, we have a lot of copies of the discovery and kademlia in particular spread out over the projects - it might be a good time to review these and see if they can be unified perhaps

kdeme · 2022-10-07T10:09:51Z

I might pick up some of this PR while I look into reusing more discv5 code in discv4

cskiraly · 2023-05-21T12:06:18Z

@kdeme, I'm now back at this codebase, so I will check how can we rebase this to the current master

cskiraly added 12 commits February 22, 2022 15:18

move message encoding to separate file

083e185

Messages and message encoding has nothing to do with the underlying authenticated communications framework. Separate these two.

fixup: fix imports in test_discoveryv5_encoding

f49801e

Note that the test does not compile, but it was also not compiling before Signed-off-by: Csaba Kiraly <[email protected]>

attempt to clean up send proc names

0843ea9

- interface between Transport and Protocol is at the encoded Message level

transport.sendMessage: avoid passing address twice

9b86ade

Node already has the address, so it does not make sense to pass it as a separate parameter.

protocol.sendMessage: allow destination as (NodeId, Address)

7f1fd9f

rename protocol.sendMessage to sendRequest

29584e1

This is a request part of a Request/Response, generating also a reques ID. So call it what it is.

protocol: introduce sendResponse

0d0b7ad

This completes the Request/Response semantics.

utp_discv5: use protocol.sendRequest

c923a76

It is better to rely on protocol.nim to do all the encoding, thus cleaning up dependencies.

cleaning up imports and dependencies

6b6146e

fixup: fix trace message

354165b

fixup: fix tests/fuzzing imports

cd848a8

kdeme reviewed Feb 24, 2022

View reviewed changes

moving all RLP enoding from messages to messages_encoding

99fc5fd

cskiraly mentioned this pull request May 21, 2023

WIP: Test DHT with large number of nodes codex-storage/nim-codex-dht#57

Draft

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discv5 factorize #480

Discv5 factorize #480

cskiraly commented Feb 23, 2022

kdeme left a comment •

edited

Loading

cskiraly commented Feb 24, 2022

kdeme commented Feb 25, 2022

cskiraly commented Mar 17, 2022

arnetheduck commented Sep 2, 2022

kdeme commented Oct 7, 2022

cskiraly commented May 21, 2023

Discv5 factorize #480

Are you sure you want to change the base?

Discv5 factorize #480

Conversation

cskiraly commented Feb 23, 2022

kdeme left a comment • edited Loading

Choose a reason for hiding this comment

cskiraly commented Feb 24, 2022

kdeme commented Feb 25, 2022

cskiraly commented Mar 17, 2022

arnetheduck commented Sep 2, 2022

kdeme commented Oct 7, 2022

cskiraly commented May 21, 2023

kdeme left a comment •

edited

Loading