
PutValue appears to not put to K peers #139

Closed
whyrusleeping opened this issue Apr 23, 2018 · 54 comments
Labels
kind/bug (A bug in existing code, including security flaws) · P1 (High: likely tackled by core team if no one steps up) · status/ready (Ready to be worked)

Comments

@whyrusleeping
Contributor

In order for queries to function properly, PutValue should ensure it puts out to K (20) peers. It currently seems that this is not happening, and that a much smaller number of records (less than 16, our GetValue threshold) is actually being put.

This was brought up in ipfs/kubo#3860

We should investigate and fix this.
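
For context, the put path in question has roughly the shape sketched below (a simplified sketch, not the actual routing.go code; peers and put stand in for the GetClosestPeers result and putValueToPeer). Each put is fired off in its own goroutine and failures are only logged, so if many of the "closest" peers are undialable, the record silently ends up on far fewer than K of them.

package sketch

import (
	"context"
	"fmt"
	"sync"
)

// KValue is the replication parameter (K = 20) discussed in this issue.
const KValue = 20

// putToClosest mirrors the shape of PutValue: one put per "closest" peer,
// with failures only logged. It returns how many puts actually succeeded,
// which (per this issue) often ends up well below KValue.
func putToClosest(ctx context.Context, peers []string, put func(context.Context, string) error) int {
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		succeeded int
	)
	for _, p := range peers {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			if err := put(ctx, p); err != nil {
				// The real code only logs this at debug level, so the caller
				// never learns that the put quietly fell short of K peers.
				fmt.Printf("failed putting value to peer %s: %v\n", p, err)
				return
			}
			mu.Lock()
			succeeded++
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return succeeded
}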

@inetic

inetic commented Apr 26, 2018

I've been testing go-ipfs as it currently is in master (sorry, I'm still quite new to go/gx so I don't know which commit that translates to here, but the ipfs version is QmY1y2M1aCcVh...).

I've added a few fmt.Printf(...) lines to routing.go/PutValue and routing.go/GetValues. Basically, I added this line:

fmt.Printf("routing.go/GetValues query reply from %v %q hasval:%v err:%v\n", p, key, rec.GetValue() != nil, err)

right after the call to dht.getValueOrPeers in routing.go/GetValues, and I added these lines:

if err != nil {
    fmt.Printf("routing.go/PutValue failed putting value to peer %v %q\n", p, err)
} else {
    fmt.Printf("routing.go/PutValue success putting value to peer %v\n", p)
}

right after the call to dht.putValueToPeer(ctx, p, key, rec) in routing.go/PutValue.

Then I started a program that does the publishing in a loop (with a few seconds between publishes) and got this output. That is, the publisher never actually succeeded in publishing the value to any node (am I reading it correctly?).

Once the first publish call in the loop was done, I fired up another program which started resolving that IPNS name (in a loop as well). The output is here. That is, for the first 4 attempts the only peer from which the resolver was able to get the value was the publisher itself (look for the hasval:true string). Later, one more node (ID atonFX) came to know that value as well, although the publisher's PutValue function never seems to have noticed that it succeeded with that one peer.

@whyrusleeping
Contributor Author

@inetic that's disconcerting... If your code is right then I'd say your observation is correct. There is a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.

@whyrusleeping
Contributor Author

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

@whyrusleeping
Contributor Author

@inetic question: how many peers show up in ipfs swarm peers?

@inetic

inetic commented Apr 26, 2018

There are a weirdly large number of 'default failure' errors there that really shouldnt be popping up like that.

It seems the default failure errors happen for nodes that previously failed with a different error.

@inetic

inetic commented Apr 26, 2018

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

I'll try to modify the ipfs executable directly tomorrow to make sure it's not my misuse of the go API, as I'm doing direct calls to some of the ipfs functions (1, 2). But in the meantime, do you guys experience better connectivity in the PutValue function?

@whyrusleeping
Contributor Author

Currently, my only internet connection is a tethered LTE connection on a 1GB sim card. I'm trying to conserve data.

Maybe @Kubuxu @vyzo or @magik6k could take a look?

@vyzo
Contributor

vyzo commented Apr 26, 2018

@inetic were you testing go-libp2p-kad-dht on its own?

@inetic

inetic commented Apr 27, 2018

@vyzo I did just now; nothing out of the ordinary seems to be there. With these modifications to QmY1y2M1aCcV.../go-libp2p-kad-dht/routing.go

+++ /tmp/go-libp2p-kad-dht.orig/routing.go	2018-04-27 13:56:08.998398626 +0200
@@ -83,11 +83,6 @@
 
 			err := dht.putValueToPeer(ctx, p, key, rec)
 			if err != nil {
-				fmt.Printf("PutValue failure %q %v %q\n", p, err, key)
-			} else {
-				fmt.Printf("PutValue success %q %q\n", p, key)
-			}
-			if err != nil {
 				log.Debugf("failed putting value to peer: %s", err)
 			}
 		}(p)
@@ -211,7 +206,6 @@
 		})
 
 		rec, peers, err := dht.getValueOrPeers(ctx, p, key)
-		fmt.Printf("getValueOrPeers peercount:%d hasval:%v %v\n", len(peers), rec.GetValue() != nil, err)
 		switch err {
 		case routing.ErrNotFound:
 			// in this case, they responded with nothing,

I'm seeing this output.

@inetic

inetic commented Apr 27, 2018

So I've done the same test with only go-ipfs sources. Basically what I've done is this:

  1. Install a fresh Go 1.9.5
  2. Compile and install the newest go-ipfs from GitHub
  3. Add the fmt.Printf debug printing to routing.go/PutValue
  4. Run the daemon, wait for it to get ready
  5. Run ipfs name publish Qm...

Here is the script in detail (and here is the printf.patch file which the script uses).

Unfortunately, my node still doesn't connect to any other peer when doing PutValue. The log can be found here.

EDIT: Improved the script

@inetic

inetic commented Apr 27, 2018

@whyrusleeping sorry, missed your question yesterday

question, how many peers show up in ipfs swarm peers ?

In the above tests, I started publishing very soon after the daemon printed "Daemon is ready".

Now I modified the script again so that it prints the swarm peers. I waited about a minute to get ~100 peers in the swarm and then started publishing. This time PutValue did succeed with 7 peers (out of 80). Here is the log.

@inetic

inetic commented Apr 27, 2018

One more test: I did the same as above, but this time I did it on a DigitalOcean instance and waited for ~200 peers. Unfortunately, it did not succeed in connecting to any peer. Here is the log.

@whyrusleeping
Contributor Author

So something is up here. Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something) or there are a huge number of undialable nodes in the network.

@vyzo
Contributor

vyzo commented Apr 27, 2018

I tried a test IPNS publish from my node -- it took 2m49s and I am seeing lots of failed putting value to peer errors.

Note I am running master (no printf patches) on my laptop node.

@inetic

inetic commented Apr 27, 2018

Either your node isnt able to dial properly for some reason (could be bad addresses in the network, or something)

That's what I was thinking as well, and why I did the test on the DigitalOcean instance, which has a public IP on the internet. All the other tests I did were on my home PC, which is behind a router.

@whyrusleeping added the kind/bug and P1 labels Apr 28, 2018
@whyrusleeping
Contributor Author

Escalating this...

@vyzo
Contributor

vyzo commented Apr 30, 2018

I wrote a debug crawler that crawls the dht and tries to connect to each discovered peer.

The results after an overnight run:

  • 2197 peers discovered
  • 1336 of them are undialable (1155 fail with context deadline exceeded)

In short, 61% of the dht is undialable!

Edit: Up to 2268 peers now, 1391 undialable.
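
The dialability check itself is roughly the following (a simplified sketch, not the actual crawler: the DHT walk that discovers the peers is elided, and the import paths follow the same pre-monorepo layout as the patches elsewhere in this thread).

package sketch

import (
	"context"
	"time"

	host "github.com/libp2p/go-libp2p-host"
	pstore "github.com/libp2p/go-libp2p-peerstore"
)

// countDialable tries to connect to every discovered peer with a bounded
// timeout and tallies successes vs. failures; most of the failures reported
// above were "context deadline exceeded".
func countDialable(ctx context.Context, h host.Host, peers []pstore.PeerInfo) (ok, failed int) {
	for _, pi := range peers {
		cctx, cancel := context.WithTimeout(ctx, 30*time.Second)
		err := h.Connect(cctx, pi)
		cancel()
		if err != nil {
			failed++
			continue
		}
		ok++
	}
	return ok, failed
}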

@MichaelMure
Contributor

FWIW, I get a similar result from a home network:

799 "ERROR"
503 "OK"

61% undialable.

@whyrusleeping
Contributor Author

Thanks @MichaelMure!

We're working on adding more data collection to this crawler. Current approaches we're thinking about are:

  • Use dial failure data to build a better heuristic for dialable addresses
  • Enable relay advertisement, put up relay infra nodes, and encourage others to do so as well
    • Going this route, we should make relay smarter
  • Have NAT detection libp2p services to aid nodes in discovering their own NAT situation

@florianlenz

florianlenz commented Jun 3, 2018

When I put a value to the DHT it's only put to 3 out of 20 peers (sometimes more, sometimes less). I think the error I get, panic: routing: not found, is related to this issue. The project I am working on is blocked by this issue. Is there something I can help with? Wouldn't fetching peers and putting the key to them until we are sure it was put to 20 peers fix the problem?

@whyrusleeping
Contributor Author

@florianlenz yeah, making sure we put it to 20 peers would help, but we need to make sure those are roughly the same 20 peers that someone doing a lookup of the value will contact. I think changing the code slightly to find the '20 closest peers we can connect to' might help dramatically. I'm doing some experiments right now with DHT resolution; let me know if you try anything and figure out something interesting.

@whyrusleeping
Contributor Author

Random thing I tried:

Reducing the dial timeout in go-libp2p-conn from 60s to 5s. It seems to have made things quite a bit faster overall, without reducing the number of records I get back. (I'm testing DHT 'Gets' for ipns records.)

@whyrusleeping
Contributor Author

Another interesting thing:

I changed the GetClosestPeers logic to only return peers we 'successfully connected' to (by calling host.Connect on each peer before adding it to our 'good set'). For some reason, I'm still getting puts failing because of 'connection reset'. This is really fishy...

cc @Stebalien

@whyrusleeping
Contributor Author

yamux returns ErrConnectionReset when the stream gets reset. Very misleading error message.

@whyrusleeping
Contributor Author

Two things.

First off: it turns out that if a peer doesn't have the public key to validate an ipns record it receives, it just tosses the record and resets the stream. That stream reset was producing the 'connection reset' errors I was seeing, which led me to this bug.
Basically, using an ipns key other than your peer's key will result in most peers not accepting the record you are putting (aside from the peers that overlap in the keyspace between the closest peers to your record and the closest peers to your public key). This is why both @florianlenz and I were seeing puts only going to something like three peers successfully. It's not that we weren't sending the record to them, it's that they were just dropping it. This is something we need to address ASAP.

The other thing: this patch makes things work reasonably well:

diff --git a/lookup.go b/lookup.go
index dd47516..d6f04a5 100644
--- a/lookup.go
+++ b/lookup.go
@@ -10,6 +10,7 @@ import (
 	pb "github.com/libp2p/go-libp2p-kad-dht/pb"
 	kb "github.com/libp2p/go-libp2p-kbucket"
 	peer "github.com/libp2p/go-libp2p-peer"
+	pstore "github.com/libp2p/go-libp2p-peerstore"
 	notif "github.com/libp2p/go-libp2p-routing/notifications"
 )

@@ -100,13 +101,19 @@ func (dht *IpfsDHT) GetClosestPeers(ctx context.Context, key string) (<-chan pee

 		if res != nil && res.finalSet != nil {
 			sorted := kb.SortClosestPeers(res.finalSet.Peers(), kb.ConvertKey(key))
-			if len(sorted) > KValue {
-				sorted = sorted[:KValue]
-			}

+			var found int
 			for _, p := range sorted {
-				out <- p
+				if err := dht.host.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
+					found++
+					out <- p
+				}
+				if found >= KValue {
+					break
+				}
 			}
 		}
 	}()

Basically "Get me the K closest peers that I can actually connect to".
Theres definitely room for some parallelism here, and also we should do some analysis on this, but in general, using this, my ipns resolves get all 16 expected values back, and they get them back pretty fast (especially on repeated resolves).
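
One way to add that parallelism while still preferring the closest peers (a sketch only, not the change that eventually landed): dial all candidates concurrently behind a small semaphore, then walk the distance-sorted list and emit the first K that actually connected. The trade-off versus the sequential loop above is that this dials every candidate instead of stopping at the first K successes.

package sketch

import (
	"context"
	"sync"

	host "github.com/libp2p/go-libp2p-host"
	peer "github.com/libp2p/go-libp2p-peer"
	pstore "github.com/libp2p/go-libp2p-peerstore"
)

// connectableClosest dials every candidate concurrently (bounded by sem),
// then walks the distance-sorted slice and returns the first k peers that
// were actually reachable.
func connectableClosest(ctx context.Context, h host.Host, sorted []peer.ID, k int) []peer.ID {
	var (
		wg sync.WaitGroup
		mu sync.Mutex
		ok = make(map[peer.ID]bool, len(sorted))
	)
	sem := make(chan struct{}, 16) // dial at most 16 peers at a time

	for _, p := range sorted {
		wg.Add(1)
		go func(p peer.ID) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			if err := h.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
				mu.Lock()
				ok[p] = true
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()

	out := make([]peer.ID, 0, k)
	for _, p := range sorted { // preserve closest-first ordering
		if ok[p] {
			out = append(out, p)
			if len(out) == k {
				break
			}
		}
	}
	return out
}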

@whyrusleeping
Contributor Author

(brb, getting some dinner)

@whyrusleeping
Contributor Author

So it appears that the lack of access to public keys is a pretty significant issue here... If we had already started switching over to embeddable ed25519 keys, we would be okay, but with RSA it's a problem.

One option would be to augment the DHT codebase in a pretty ugly way so that PutValue for IPNS records first puts the public key to the peer, then puts the ipns record. This is the easiest change to make from a protocol design standpoint, but actually writing that code will be... rather unpleasant.

Another option would be to add the public key to the ipns record itself so that we guarantee the peer has it. This would require no changes to the DHT codebase, but would require some changes in the IPNS validator to handle that. This change is easier from the implementation perspective, but requires pretty much everyone to update their nodes for it to take effect.

I'm leaning towards the first choice, but it will require some code mangling. I'll have a chat with @Stebalien when he wakes up; we need to get this fixed ASAP.
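
As a rough sketch, the first option would look something like this inside the per-peer put loop (pkKey/pkRecord and ipnsKey/ipnsRecord are assumed to be the already-built /pk/ and /ipns/ keys and records, and putValueToPeer is the existing per-peer put used by PutValue; the real change would need considerably more care than this):

// For each target peer, hand it the signer's public key first, then the
// IPNS record, so its validator can check the signature instead of
// resetting the stream. Sketch only.
if err := dht.putValueToPeer(ctx, p, pkKey, pkRecord); err != nil {
	return fmt.Errorf("putting public key to %s: %s", p, err)
}
if err := dht.putValueToPeer(ctx, p, ipnsKey, ipnsRecord); err != nil {
	return fmt.Errorf("putting ipns record to %s: %s", p, err)
}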

@florianlenz

@whyrusleeping really nice work :) Would it make sense to drop the connection after connecting? A host could end up having a lot of connections when doing a lot of lookups.

@whyrusleeping
Contributor Author

I think it's probably best to leave that up to the connection manager. It will be able to make decisions about connections informed by a broader picture than just the DHT.

@whyrusleeping
Contributor Author

@florianlenz question, which version of the DHT code are you testing with?

@florianlenz

I tested with version 4.0.4 (QmSBxn1eLMdViZRDGW9rRHRYwtqq5bqUgipqTMPuTim616)

@whyrusleeping
Contributor Author

Ah, interesting. So you weren't running with @Stebalien's latest changes (which resulted in gx version 4.1.0). There are some changes there that make this process work a little differently; thanks for the data point :)

@whyrusleeping
Contributor Author

This patch should help for ipns publishing in general: ipfs/kubo#5079
It will at least stop certain peers from throwing away the record you are trying to put (once peers update to that version of the code :/ )

The other, bigger problem I see is that (as discussed previously) 'GetClosestPeers' currently returns the 20 closest peers it has heard of. I have a patch locally that gathers all the 'close' peers, sorts them, then tries connecting to them in order of closeness. Once it succeeds in connecting to 20 of them, it returns those 20. This results in a very well working ipns (but my patch is pretty ugly). Most runs go through 40-70 peers before they find 20 connectable ones (implying somewhere between 30% and 50% connectivity, which roughly matches what our earlier experiments showed).

I'm going to try out a patch where we try to relay every connection and see how that affects those ratios.

@whyrusleeping
Contributor Author

Automatically trying to relay every connection definitely lowers the 'miss rate' on connecting to the closest peers. The last few tests I ran show it about 30% lower (we can connect to 30% more peers than before). But for some currently unknown reason, overall performance is pretty degraded compared to not trying relays. I also end up connected to a lot more peers at the end of a run than before, >300 as opposed to ~100... very odd.

@Stebalien
Member

@whyrusleeping what's the status of your closest peers patch?

@whyrusleeping
Contributor Author

@Stebalien should be much better: #159 (merged)

I haven't had great internet since then, but I do now, and can start debugging things a bit more. In any case though, we should start pushing these changes out and cut a new ipfs release. Let's get it all deployed!

@Stebalien
Member

Got it. I asked because I tried publishing an IPNS entry on two different temporary nodes and:

  1. On the first node, multiple publishes all succeeded but my other node only ever saw the first value (even after, e.g., restarting).
  2. On the second node, I was never able to complete a single publish.

@whyrusleeping
Contributor Author

@Stebalien was this on a testnet of nodes all using the proper value, or was this on the main network?

Getting back an old value seems a bit odd, though; we should investigate that.

@Stebalien
Member

Main network.

@whyrusleeping
Contributor Author

whyrusleeping commented Jun 21, 2018 via email

@Stebalien
Member

Oh. Yeah, I forgot about that.

stebalien slinks off into a corner to quietly continue breaking everything

@anacrolix self-assigned this Jan 17, 2019
@anacrolix added the status/ready label Jan 22, 2019
@anacrolix removed their assignment Dec 10, 2019
@markg85

markg85 commented Dec 17, 2019

Hi,

A couple of hours ago I made issue ipfs/kubo#6800 (name resolve takes exactly 1 minute and loses cache after 30 seconds). That issue is closed, which is fine. But reading through this whole thread, I'm led to believe this was (or is?) the actual core issue behind the observed behavior.

Now this does reference issue filecoin-project/venus#437 (filecoin?.. what does that have to do with this very issue?.. anyhow), where it's said to no longer be an issue (that was on Feb. 27, 2019) and where it was subsequently closed. This issue, however, remained open.

And then I read much of ipfs/kubo#3860, where it eventually turns out to be a feature in its current form (if I'm reading it correctly).

So now I'm totally lost as to whether the behavior I see in ipfs name resolve (that it takes exactly 1 minute for new publishes) is by design and working properly (thus meaning the issue in this thread is resolved too), or whether there still is an issue, as I can't imagine 1 minute is by design and "proper".

Please note, I'm trying to access a newly published website, so I'm using a browser to access that ipns link. Command-line trickery to speed it up won't work in the browser. I'm guessing I have to modify ipfs companion if I want to try the same command-line trickery in the browser?

I'd happily try out a go-ipfs 0.5.0 alpha + IPFS Companion to go along with it to see if the actual issue is resolved (i.e., a fast working IPNS), but I have no clue what I'm even supposed to install now. So just a friendly question: is this issue resolved in the current master that is going to be 0.5.0? Can I help (by testing stuff out, and how)?

@aschmahmann
Contributor

@markg85 your question about the current state of IPNS is better asked on discuss.ipfs.io than here. If you start an issue there I'll happily go through why things are working how they are currently, some of the ongoing work to improve performance, and some areas where having some help would be great (e.g. testing).

@markg85

markg85 commented Dec 17, 2019

@aschmahmann that's a weird and concerning request. I'm only wondering how it's supposed to work because of the "perceived bugs" mentioned and the "hints" I'm catching elsewhere that this is actually a feature. This very bug report should be the place, as it's the root cause. If I move to discuss, it will also add yet another place where "ipns performance" is discussed, when it's already been discussed in so many different places (multiple threads on discuss, multiple GitHub issues and even branches). Are you sure you want that?

If you still say yes to this then I'll happily open a new topic there; I'm just trying to prevent even more thinning out of information in this regard.

@markg85

markg85 commented Feb 17, 2020

Hi,

Any update on this one? I tried using IPFS again on some test site, which now works awesomely with an ipfs link in the DNS record. An ipns link still seemed way too ambitious (i.e., so slow that it times out). That being said, the latest IPFS release really works loads faster for me for ipfs data. While off topic here, thanks a lot for that :)

But on topic.
The initial topic here was that the PutValue used for ipns is either not putting the value at all or not putting it to enough nodes, which subsequently causes an ipns lookup to be as slow as the timeout set for ipfs publish (1 minute by default), as it can't find enough nodes to get the latest link.

Now, this issue is a prio 1 (it is tagged as such), so I'm really curious whether this is on the radar for 0.5.0.

@Stebalien
Member

This specific issue will be fixed in the next go-ipfs release, but the relevant changes haven't been merged to master yet (currently in testing).

Changes:

  1. We put to the K closest peers we can actually connect to.
  2. Undialable (e.g., behind NATs) peers won't join the DHT.
  3. We search until we find the K closest peers and then stop instead of backtracking indefinitely. Then, we return whatever results we get regardless of how many results we see.

Unfortunately, a large portion of this depends on (2), which means the new code won't actually perform any better than the existing code until it's deployed widely on the network. The current plan is:

  • Run two DHTs, the old kind of crappy one and a shiny new fast one.
  • New nodes query the fast DHT first, falling back on the old DHT.
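
A minimal sketch of that fallback (illustrative only, with simplified signatures; the actual go-ipfs routing composition will differ, and publishing to both DHTs is an assumption here rather than part of the stated plan): wrap both DHTs behind one value store that queries the new one first.

package sketch

import "context"

// valueStore is a simplified stand-in for the routing interface both DHTs expose.
type valueStore interface {
	GetValue(ctx context.Context, key string) ([]byte, error)
	PutValue(ctx context.Context, key string, val []byte) error
}

// fallbackDHT queries the new (fast) DHT first and only falls back to the
// old one when that fails; puts go to both so that nodes still on the old
// DHT can keep resolving records during the transition.
type fallbackDHT struct {
	fast, old valueStore
}

func (f *fallbackDHT) GetValue(ctx context.Context, key string) ([]byte, error) {
	if val, err := f.fast.GetValue(ctx, key); err == nil {
		return val, nil
	}
	return f.old.GetValue(ctx, key)
}

func (f *fallbackDHT) PutValue(ctx context.Context, key string, val []byte) error {
	if err := f.fast.PutValue(ctx, key, val); err != nil {
		return err
	}
	return f.old.PutValue(ctx, key, val)
}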

@markg85

markg85 commented Feb 17, 2020

@Stebalien that sounds like a lot of good news! Thank you very much!

But I do wonder what you mean by that last part.

I'm assuming that once enough nodes switch to 0.5.0 the ipns, in this regard, just slowly gets better as the new code path is being used by more and more nodes, right?

By running 2 DHT nodes you mean one IPFS instance before 0.5.0 and one at (or above) 0.5.0? Or will there be a config option to pick different DHT versions?

@aschmahmann
Contributor

I'm assuming that once enough nodes switch to 0.5.0 the ipns, in this regard, just slowly gets better as the new code path is being used by more and more nodes, right?

Yes, although if a provider and consumer of an IPNS record both upgrade to go-ipfs 0.5.0 they will immediately get benefits without waiting for the rest of the network to upgrade.

By running 2 DHT nodes you mean one IPFS instance before 0.5.0 and one at (or above) 0.5.0? Or will there be a config option to pick different DHT versions?

@markg85 the plan is that your go-ipfs 0.5.0 instance will be able to participate in a DHT with nodes running the current protocol (kad v1) while simultaneously participating in a DHT with the new protocol (kad v2). This will allow users of go-ipfs 0.5.0 to continue to find peers and information that are only in the v1 DHT while still benefiting from the fast-path of using the v2 DHT.
