
RFM 17.1 - Sharing Provider Records with Multiaddress #22

Merged: 17 commits merged into probe-lab:master on Jan 26, 2023

Conversation

@cortze (Contributor) commented Nov 11, 2022

This is the first draft of the report that extends RFM17 to measure whether the Multiaddresses of a content provider are being shared during the retrieval process of a CID.

It includes the study's motivation, the methodology we followed, a discussion of the results we got out of the study, and a conclusion.

All kinds of feedback are appreciated, so please go ahead and point out improvements!

Also, should I be running a more extensive set of CIDs for extended periods?

cc: @yiannisbot @guillaumemichel @dennis-tra

@yiannisbot (Member) left a comment:
Great work! Some typos and a couple of clarification comments.

RFMs.md (outdated excerpt):
#### Measurement Plan

- Spin up a node that generates random CIDs and publishes provider records.
- Periodically attempt to fetch the PR from the DHT, tracking whether they are retrievable and whether they are shared among the multiaddresses.
Member:

What do we mean "shared among the multiaddresses"? Whether the PR can be found in the multiaddress of the original node that stored the record?

Member:

As a third "Measurement Plan", how about just getting a bunch of PeerIDs and their multiaddresses and pinging them over time to see whether they listen to that multiaddress? The same as we do for PRs, but now for peer records.

Contributor Author:

What do we mean "shared among the multiaddresses"? Whether the PR can be found in the multiaddress of the original node that stored the record?

In the networking layer, when we ask for the PR of a CID, we just get as a reply an AddrInfo of each provider that the remote peer is aware of. So the PR, as we understand it, is just how we store it in the DHT.
This AddrInfo contains two fields: PeerID and Multiaddresses, and the Multiaddresses are only filled in if their TTL is still valid.
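
As a minimal sketch of what such a reply looks like at the API level (assuming the current `go-libp2p` `peer.AddrInfo` type; this is not the Hoarder's code):

```go
// Minimal sketch: a provider reply is just a peer.AddrInfo; when the remote
// peer's address TTL for that provider has expired, Addrs comes back empty
// and only the PeerID is shared.
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/peer"
)

func describeProviderReply(prov peer.AddrInfo) {
	if len(prov.Addrs) == 0 {
		// PeerID only: the requester needs a second lookup to find addresses.
		fmt.Printf("provider %s shared without multiaddresses\n", prov.ID)
		return
	}
	fmt.Printf("provider %s shared with %d multiaddresses\n", prov.ID, len(prov.Addrs))
}

func main() {
	describeProviderReply(peer.AddrInfo{}) // empty example reply
}
```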

how about just getting a bunch of PeerIDs and their multiaddresses and pinging them over time to see whether they listen to that multiaddress?

That can be a nice side experiment, yes. Although I think that we are indirectly doing it. In the hoarder, I keep the AddrInfo of each PR Holder with the first Multiaddresses that I got from the publication, and I only use those addresses to establish the connections. So if they were changing IPs, I wouldn't be able to connect to them.

Let me know anyway if you want me to set up a specific test for the IP rotation.

Member:

Re: "shared among the multiaddresses": ok, so IIUC, we either mean if the PR is available in the multiaddresses of all the content providers, or if all the multiaddresses of all content providers are included in the PR. :) Is it any of these two?

Re: IP Rotation: that's great! But for this experiment we're keeping the connection open for 30mins to check the TTL, right? Can we run an experiment where we keep those connections open for a time period equal to the Expiry Interval? It would be 24hrs according to the current setting and 48hrs according to our proposal. Ideally, we'd also need to do that for a large number of peers.

@cortze (Contributor Author) commented Nov 15, 2022:

we either mean that the PR is available in the multiaddresses of all the content providers, or that all the multiaddresses of all content providers are included in the PR. :) Is it either of these two?

It's the second one: we only get an AddrInfo for those providers that the remote peer is aware of, and whether the Multiaddresses are included in the provider's AddrInfo depends on their TTL. Should I say it the opposite way, "the multiaddress is shared among the PRs"?
Let me point you to the code; maybe it's easier to understand it that way:

  - here is the inner method used during the dht.FindProviders() call.
  - here is the networking method that asks a remote peer for the PRs.
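
For illustration, a minimal sketch of how such a traced lookup could be wired up (assumed names, not the Hoarder's actual implementation; `FindProvidersAsync` is the public entry point in `go-libp2p-kad-dht`):

```go
// Minimal sketch: stream providers for a CID and note which replies already
// carry multiaddresses alongside the PeerID.
package sketch

import (
	"context"
	"fmt"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// traceProviders prints every provider AddrInfo returned by the DHT lookup.
func traceProviders(ctx context.Context, kad *dht.IpfsDHT, c cid.Cid) {
	for prov := range kad.FindProvidersAsync(ctx, c, 20) { // up to k=20 providers
		fmt.Printf("provider %s, multiaddresses shared: %d\n", prov.ID, len(prov.Addrs))
	}
}
```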

Can we run an experiment where we keep those connections open for a time period equal to the Expiry Interval?

Absolutely! I can make a new run with 10k-20k CIDs over 60h if that is enough.

Member:

Rephrasing the first point to make it clearer. Regarding the extra experiment: that would be fantastic, yes!


Results are similar when we analyze the replies of the peers that report back the PR from the DHT lookup process. We increased the number of content providers we were looking for to track the multiple remote peers. Figure [3] represents the number of remote peers reporting the PR for the CIDs we were looking for, where we can see a stable 20 peers by median over the entire study.

For those wondering why more than 20 peers (k replication value when publishing the CIDs) are reporting the PR, we must remind you that Hydra-Boosters share the PR database among different `PeerID` heads. Which means that if one hydra hears about a PR, all the leaders of that common database will also share it.
Member:

If that were the case, then we'd see about 2k holders, i.e., approximately the same as the number of Hydra heads in the network. Could it instead be other peers that fetch and reprovide the content?

Member:

Suggested change ("leaders" → "heads"):
For those wondering why more than 20 peers (k replication value when publishing the CIDs) are reporting the PR, we must remind you that Hydra-Boosters share the PR database among different `PeerID` heads. Which means that if one hydra hears about a PR, all the heads of that common database will also share it.

Contributor Author:

In the hoarder, I always check if the PRs that are shared back match the PeerID of the content publisher (which, in this case, is my local host 1, the publisher one). So if someone tries to reprovide or ping the CID, it wouldn't affect these results.

About the hydras, I'm not aware of how many hydra "bellies" are out there. Is there a single big one or multiple small ones? Also, we have to keep in mind that the DHT lookup converges into a region of the SHA256 hash space, so it's quite unlikely that we will get connections and replies from hydras that are in the opposite part of the hash space.
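
A minimal sketch of that publisher check (an assumed helper, not the Hoarder's code):

```go
// Minimal sketch: keep only the provider replies whose PeerID matches the
// original publisher, so reproviders or pings from other peers don't inflate
// the counts.
package sketch

import "github.com/libp2p/go-libp2p/core/peer"

func fromPublisher(provs []peer.AddrInfo, publisher peer.ID) []peer.AddrInfo {
	matched := make([]peer.AddrInfo, 0, len(provs))
	for _, p := range provs {
		if p.ID == publisher {
			matched = append(matched, p)
		}
	}
	return matched
}
```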

Member:

Is there a single big one or multiple small ones?

Yup, there is a single one shared among all of them.

@yiannisbot (Member):

@cortze I've just done a thorough review of this - great work! My main worry is that the claim of: "if a multiaddress is returned together with the PeerID for the TTL period (10mins or 30mins), then we can extend the TTL to the PR expiry interval" doesn't really hold. Why would we arrive at this conclusion?

The main argument for increasing the multiaddress TTL to the PR expiry interval would be to show that the multiaddress of the PR holder doesn't usually change. It would be great to have some experiments along the lines of the comment I inserted above: #22 (comment)

I'd love to hear your thoughts on this. Basically, similar to the CID Hoarder, what we need here is a PeerID Hoarder :-D This tool would get a lot of PeerIDs, record the multiaddress by which we first saw the peer and then periodically ping the peer to figure out if it changed its Multiaddress within the PR Expiry Interval. I'm not sure if this functionality can easily be included in Nebula @dennis-tra ? This is what would give us a solid justification to argue for the extension of the TTL.

Other thoughts?
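
A minimal sketch of what such a periodic check could look like (assumed structure; no such tool exists in this repo):

```go
// Minimal sketch: remember the multiaddresses a peer was first seen with and
// periodically re-dial exactly those addresses to detect rotations within the
// PR expiry interval.
package sketch

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

func trackFirstSeenAddrs(ctx context.Context, h host.Host, firstSeen peer.AddrInfo, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			dialCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
			// Dial only the addresses recorded at first sight; a failure here
			// hints that the peer changed its multiaddress (or went offline).
			err := h.Connect(dialCtx, firstSeen)
			cancel()
			_ = err // a real tool would persist the outcome per peer
		}
	}
}
```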

Commit: Typos and rephrasings (Co-authored-by: Yiannis Psaras <[email protected]>)
@cortze (Contributor Author) commented Nov 15, 2022

Thanks for the feedback @yiannisbot , I really appreciate it!

My main worry is that the claim of: "if a multiaddress is returned together with the PeerID for the TTL period (10mins or 30mins), then we can extend the TTL to the PR expiry interval" doesn't really hold. Why would we arrive at this conclusion?

I will try to make it a bit more explicit in the conclusion (my bad). It's not an "it won't hold" statement. It is an "It won't have as much impact as we are expecting" statement.

As long as the network has different TTL values for Multiaddresses (like the current network does), the smallest TTL will be the one that negatively limits the final result of the DHT lookup process (at least in the go-libp2p-kad-dht implementation). So unless the largest part of the network updates to that TTL, we will still face the same problem, and there will still be sporadic issues originating from those remaining "old" clients. (The double-hashing implementation would be a nice incentive to force a full network update.)

Basically, similar to the CID Hoarder, what we need here is a PeerID Hoarder :-D This tool would get a lot of PeerIDs, record the multiaddress by which we first saw the peer and then periodically ping the peer to figure out if it changed its Multiaddress within the PR Expiry Interval.

I left you a comment as well in the #22 comment.
I think that we have a few options here. The hoarder already does this indirectly (it contacts the PR Holders at the Multiaddresses that we stored when publishing the PRs). Also, I think that Nebula already tracks IP rotation. We could have a deeper chat about this :)

I'll iterate over your comments and suggestions again, and will ping you back whenever I make a commit!

@dennis-tra (Contributor):

I'm not sure if this functionality can easily be included in Nebula @dennis-tra ?

Sorry for the late reply! The information is already recorded by Nebula and would just need to be analyzed :)

I'll iterate over your comments and suggestions again, and will ping you back whenever I make a commit!

Just ping here or in Discord and I'll also have a proper read. I just skimmed it in the past 🙈

@cortze (Contributor Author) commented Nov 21, 2022

I already added some explanations and most of the changes that @yiannisbot suggested. I set up another Hoarder run with 20k CIDs for 60 hours, so the plots and some numbers might change.

If you can go through and give me some thoughts @dennis-tra , I would appreciate your feedback as well 😄


@yiannisbot (Member) left a comment:

Minor edit to address one of my previous comments.

@yiannisbot (Member):

The hoarder already does this indirectly (it contacts the PR Holders at the Multiaddresses that we stored when publishing the PRs). Also, I think that Nebula already tracks IP rotation.

Great that the Hoarder contacts the original Multiaddress! That's what we need. So if we run the experiment for long enough and monitor that, then we have what we're looking for.

This ^ together with an analysis of logs from Nebula will tell us the rate of PR Holders that switch IP addresses over the republish interval. I think with those two, this will be complete and ready for merging.


_Figure 2: Number of PR Holders replying with the `PeerID` + `Multiaddress` combo._

### 4.2-Reply of peers reporting the PR during the DHT lookup
Contributor:

I am not sure I understood this correctly: is the only difference between 4.1 and 4.2 that Hydras appear in 4.2 but not in 4.1?

Contributor Author:

Since hydras are present in the set of PR holders, they appear in both 4.1 and 4.2.
However, since the DHT lookup wasn't stopped after the first retrieval of the PRs, I assume that most of the peers that report the PRs beyond those initial PR Holders are Hydras (because of their shared DB of PRs).

Contributor:

So what exactly is the performed operation? Is it only a FindProviders? And it may get more than 20 peers responding with the PR, because some peers on the path to the CID would be Hydra nodes?
As the number of hops in a DHT lookup is usually 3-5, we would expect at MOST 23-25 peers responding with a PR, if all of the peers helping to route the request (NOT PR holders) are Hydra nodes. According to the plot in 4.2 there are regularly many more than this number. How do you explain this?
Or maybe I missed something here ^^

Contributor Author:

So what exactly is the performed operation? Is it only a FindProviders?

Yes, it's a modification of the FindProviders() method that doesn't look in the local Provider DB of the host, and that directly performs the DHT lookup.

And it may get more than 20 peers responding with the PR, because some peers on the path to the CID would be Hydra nodes?

Exactly, that is the explanation that I gave for this phenomenon.

As the number of hops in a DHT lookup is usually 3-5, we would expect at MOST 23-25 peers responding with a PR

Can you give a bit more context on this statement? My understanding from RFM 17 is that we perform between 3 and 6 hops; however, that only determines the depth of the peer tree that is built during the lookup. We are not taking into account that the tree can also grow in width.

Contributor:

Can you give a bit more context on this statement? My understanding from RFM 17 is that we perform between 3 and 6 hops; however, that only determines the depth of the peer tree that is built during the lookup. We are not taking into account that the tree can also grow in width.

In Figure 3, we see that up to 60 peers respond with the PR during the DHT lookup. There are only 20 PR holders, and 2-5 intermediary DHT server nodes to which we send the request (2-5 as the last hop is a PR holder). How can we get responses from 60 peers?

In the case where we would expect the most answers, we would have the 20 PR holders + 5 intermediary nodes that are all Hydras, which is far from 60. Even if we add the concurrency factor $\alpha=3$ and suppose that the requests to the DHT intermediary nodes are performed at exactly the same time, that gives 15 Hydra nodes (5 hops * $\alpha$) + 20 PR holders, which only makes 35 answers in this very specific corner case.
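
Spelling the bound out as a back-of-the-envelope check, using the figures assumed above ($k = 20$ PR holders, at most $h = 5$ intermediary hops, concurrency $\alpha = 3$):

$$\text{max replies} \le k + h \cdot \alpha = 20 + 5 \cdot 3 = 35$$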

Member:

This is an interesting point worth digging into, but I want to understand a detail first:

However, since the DHT lookup wasn't stopped after the first retrieval of the PRs

@cortze how does the operation of the Hoarder differ from the vanilla version? When it gets a response with a PR, it doesn't stop and keeps looking, but up to which point? And when does it stop?

Contributor Author:

@yiannisbot The FindProviders() that I use in the hoarder slightly differs from the vanilla operation:
It removes the "Find in the ProvidersStore" operation, forcing it to look for the Providers only using the vanilla DHT lookup, and adds some traces to track when we receive a new Provider.

I've been relaunching the test with a two-minute timeout for the FindProviders operation, and the results seem to be in the range that @guillaumemichel suggests (keep in mind that the Hydras' shared DB has been switched off).
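
A minimal sketch of how such a bounded lookup could be wired (assumed names, not the Hoarder's actual code):

```go
// Minimal sketch: bound the whole provider lookup with a 2-minute context
// timeout, as in the re-launched runs described above.
package sketch

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func boundedFindProviders(ctx context.Context, kad *dht.IpfsDHT, c cid.Cid) error {
	lookupCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	provs, err := kad.FindProviders(lookupCtx, c)
	if err != nil {
		return err
	}
	_ = provs // a real run would record who replied and with which addresses
	return nil
}
```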

The number of remote peers replying with the PR during the DHT lookup (with the 2-minute timeout) is shown in the attached plot.

@yiannisbot (Member):

I set up another Hoarder run with 20k CIDs for 60 hours, so the plots and some numbers might change.

@cortze do we have any results from this experiment? I think with these results and addressing Guillaume's question, this should be ready to be merged, right?

@cortze (Contributor Author) commented Jan 18, 2023

@cortze do we have any results from this experiment? I think with these results and addressing Guillaume's question, this should be ready to be merged, right?

@yiannisbot The results of this run were not as good as I expected. To track such a large set of CIDs, I had to increase the concurrency parameters of the hoarder, and as we spotted in our last meeting (link to the Issue describing the bottleneck), the code is not prepared to support such a high degree of concurrency.

However, I think that even with such a low number of CIDs and a shorter interval between pings (3 minutes), we can conclude that increasing the provider Multiaddress TTL would improve content fetching times. And the impact would be much higher if we merge it with go-libp2p-kad-dht#802.

RFM17 already proved that IP rotation of PR Holders barely happens (see the attached plot).

@cortze (Contributor Author) commented Jan 18, 2023

@yiannisbot I've updated the document with your suggestions and with two extra paragraphs describing:

  1. The observed IP churn of DHT servers in IPFS (from RFM 17).
  2. A Contributions section, where I aggregated all the pull requests related to this RFM.

I've also updated the figures. The new ones have the DHT lookup limited to 2 minutes, which shows a reasonable number of peers returning the PRs, as pointed out by @guillaumemichel.

The new data still shows a lower number of online PR Holders due to a problem storing the records in a part of the network. However, I consider the results more than good enough to conclude that increasing the TTL of the Provider's Multiaddress would avoid the second DHT lookup needed to map the PeerID of the Provider to its Multiaddress.

Let me know what you think about the update :) Cheers!

@yiannisbot (Member) left a comment:

Great work! 👏🏼 I hope the suggested changes land in production soon. Thanks for making the final touches.

@yiannisbot yiannisbot merged commit 956a3bb into probe-lab:master Jan 26, 2023