NIP-29: Simple time-based Sync #826
Conversation
I'm glad to see this coming into the protocol. I think this could be simplified by just using a HASH verb, since filters are used anyway. That way, a client can window more or less granularly as needed.
Interesting choice to encode the week as the base unit of time in the protocol! The 7-day cycle we use today has run unbroken since at least the first century CE, around the reign of Augustus, and possibly much longer -- if synchronised with Judaism, then perhaps a thousand years earlier than that. This makes the week by far the most stable calendrical unit of time.

However, no matter the unit of time you choose, there will be cases where it is sub-optimal. Consider queries that match infrequent events. If they are posted once a week or less, then weekly hashing degrades into simply transferring the event IDs on every sync. Here it would be ideal to have something like monthly or yearly buckets. Alternatively, consider syncing the comments on a thread. Most of the time a thread will have a flurry of activity in the first couple of days after posting, and then go quiet. So every time a difference is detected by a sync, you will need to re-download the full thread (and potentially paginate it, since most relays limit the number of stored events that a single query will return).

As you suggest, it is possible to have custom intervals selected by the protocol. Requiring implementations to do calendar arithmetic seems far too complicated to me. With negentropy, the matching events are divided into (by default) 16 equal-sized (by number of events, not time) buckets, and the starting timestamp for each bucket is sent.

The second problem: once you detect a difference in the hash of a set of items, what do you do next? If you simply perform the entire query, then the amount of data transferred is linear in the entire set for each sync. In the worst case, where you run a separate sync for each individual event, it is quadratic. With @staab's suggestion, you could split the range into half-week windows, get each of their hashes, and recurse into the ones that differ. This would result in bandwidth overhead logarithmic in the set size. This is effectively how negentropy works, except that many ranges can be batched together and worked on concurrently (while strictly adhering to transport frame size limits), and when a small enough range is found, the item IDs are sent directly, rather than being hashed.
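For concreteness, here is a rough sketch of that halving strategy (my own illustration, with hypothetical `remoteHash`/`localHash` helpers; neither this PR nor negentropy defines this exact API):

```ts
// Hypothetical hash oracles: one asks the relay for the hash of events in
// [since, until), the other computes the same hash over the local store.
type HashFn = (since: number, until: number) => Promise<string>;

// Recursively narrow down mismatching windows. Each level halves the
// window, so the number of hash exchanges is logarithmic in the range,
// and only the small stale windows need to be re-downloaded.
async function findStaleWindows(
  remoteHash: HashFn,
  localHash: HashFn,
  since: number,
  until: number,
  minWindow: number,
  stale: Array<[number, number]>,
): Promise<void> {
  const [remote, local] = await Promise.all([
    remoteHash(since, until),
    localHash(since, until),
  ]);
  if (remote === local) return; // window already in sync
  if (until - since <= minWindow) {
    stale.push([since, until]); // small enough: just re-download it
    return;
  }
  const mid = since + Math.floor((until - since) / 2);
  await findStaleWindows(remoteHash, localHash, since, mid, minWindow, stale);
  await findStaleWindows(remoteHash, localHash, mid, until, minWindow, stale);
}
```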
I guess you are referring to the complexity of a negentropy implementation? Granted, this is non-trivial. However, there are 3 existing implementations and a pretty decent reference test suite. What language would be most useful for your app?
Yes, my hope is that it is ok to be suboptimal for simplicity's sake. We will see.
Clients would simply download the full week again (we already do that, so it shouldn't be that much of a problem). But I assume past discrepancies of hashes will be rare. They would only happen when an event from the past is re-broadcast to the relay, or when there was an EOSE issue somewhere and the client stopped asking for a given range of events. Most of the time, hashes should match.
Yes, I spent some time going through it. I think it is very powerful, but too hard/open-ended for two implementations to be able to declare full compliance with each other. Since we have so many people building relays from scratch, I think it is important to keep this as simple as possible. Anybody should be able to code and reply to these calls correctly, even if they are coding from scratch.
What about the option of relays storing, for each event, the moment they received it, to make sync work well? Maybe then clients could store, for each filter, just the last moment they requested events from each relay. If that is enough for your syncing use case, it would be much easier for relays to implement.
It's orthogonal to the solution here. This could help, but it doesn't solve the problem of making sure the past has been fully synced.
Going to restate my comment: why do we need to enforce weekly? Since we're accepting arbitrary filters here, the hardcoded window doesn't seem to do anything to help relays with performance, other than forcing clients to limit the scope of their request. But that's a pretty weak heuristic, since a year of one pubkey's data is going to be less than 10 minutes of global data. So why not just calculate the hash based on the filter, and relays can complain if the filter isn't restrictive enough? So the client sends a request:

`["HASH-REQ", <subscription ID string>, <nostr filter>, <nostr filter2>, <nostr filter3>]`

The relay calculates the hash and responds with the following:

`["HASH-RES", <subscription ID string>, <SHA256(JSON.stringify([event1.id, event2.id, event3.id, ...])) in hex>]`

The client then compares the received hashes with those stored locally and, if different, uses the same filter to download all desired events.

Ok, so thinking about this more, this is actually ok. A binary search of hashes over a long time period would require a relay to do more work, and clients can always impose their own narrower windows if desired.
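For illustration, a sketch of the hash computation the HASH-RES message describes, as a relay might implement it in Node.js (the sort step is my assumption; the messages above don't specify an ordering, but both sides must serialize the IDs identically for the hashes to match):

```ts
import { createHash } from "node:crypto";

// SHA256(JSON.stringify([event1.id, event2.id, ...])) in hex, as in HASH-RES.
function hashEventIds(eventIds: string[]): string {
  const canonical = JSON.stringify([...eventIds].sort()); // deterministic order
  return createHash("sha256").update(canonical).digest("hex");
}

// The relay hashes the IDs matching the filter; the client does the same
// over its local store and compares the two hex strings.
console.log(hashEventIds(["a1b2c3", "d4e5f6"]));
```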
All window sizes that I could imagine came with weird problems. Calendar formatting (timezones, leap years, leap seconds, etc.), for instance, is already an issue between implementations in multiple languages. Like @hoytech mentioned, "week of the year" seems to be the most stable metric among all calendar types. An option would be to group by a substring of the event's created_at timestamp: its first digits. That way, implementers don't need to format dates at all, but the groups are less intuitive. The goal, though, is to make a sync that is stupidly simple to implement.
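A minimal sketch of that digit-substring idea (my own illustration, not text from the PR): dropping the last k digits of a unix created_at groups events into fixed windows of 10^k seconds, with no calendar arithmetic at all.

```ts
// Bucket key: the created_at timestamp with its last k digits dropped.
// k = 5 gives 100000-second (~28 hour) windows; k = 6 gives ~11.6-day windows.
function bucketKey(createdAt: number, k: number): string {
  return String(Math.floor(createdAt / 10 ** k));
}

// Group event IDs by bucket; each bucket's ID list can then be hashed.
function groupByBucket(
  events: { id: string; created_at: number }[],
  k: number,
): Map<string, string[]> {
  const buckets = new Map<string, string[]>();
  for (const ev of events) {
    const key = bucketKey(ev.created_at, k);
    const list = buckets.get(key);
    if (list) list.push(ev.id);
    else buckets.set(key, [ev.id]);
  }
  return buckets;
}
```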
Right, I was suggesting omitting time-based chunking entirely. But I convinced myself.
Time-based chunking seems right because sync is only useful when reprocessing past events. We have since/until to deal with downloading events since the last time the user was online. For everything else, sync should be used.
I just reviewed the negentropy protocol, and while it's a little more complex and not as tidy, there are good libraries available for multiple languages. It's also already implemented in strfry, and probably elsewhere. How much better is negentropy in terms of time/space complexity compared with simple sync? I'm inclined to just use what already exists unless there's a good reason not to. @hoytech, I'd definitely like your opinion on how best to get this into a NIP.
Frankly, I don't see any relay dev coding a fully interoperable interface with what strfry has, much less all the other ad-hoc relays coded from scratch that we see out there. Even as a client, which only needs to implement one of the multiple ways to use the protocol, it took me forever to figure out how to code it. Imagine relays, which must support all negentropy options. That's why I made this PR. The goal is to create something where the end result is very similar in time/costs, at a fraction of the complexity of the full protocol.
On Fri, Oct 20, 2023 at 03:21:23PM -0700, hodlbod wrote:
> I just reviewed the negentropy protocol, and while it's a little more complex and not as tidy, there are good libraries available for multiple languages. It's also already implemented in strfry, and probably elsewhere. How much better is negentropy in terms of time/space complexity compared with simple sync? I'm inclined to just use what already exists unless there's a good reason not to. @hoytech, I'd definitely like your opinion on how best to get this into a NIP.
I had the same thought; I'm also planning on implementing negentropy in nostrdb. Why not just standardize around that? I trust hoy's tech.
If we can get people to actually code it, sure.
Sorting by created_at: if you only got one hash at a time, you could specify any time period with 'since' and 'until'. EDIT: I presume you want the time periods fixed so relays can have the hashes pre-calculated.
I'm still in favor of relays storing the moment they received each event. If I know that I got all the events from a relay up to a given received-at time, I only need to ask for what arrived after that.
Agreed.
So, basically, relays must store a received-at timestamp for each event?
Yep. It would prevent gaps. But I agree there are still other reasons to do what this NIP is suggesting.
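A tiny sketch of what that would enable on the client side, assuming a hypothetical `received_since` filter field backed by the relay-stored receive timestamp (this field is illustrative and not part of any NIP):

```ts
// Remember, per relay, the latest receive time we have fully synced up to.
const cursors = new Map<string, number>(); // relay URL -> last received_at

// Build the next catch-up request: only events the relay received after
// our cursor. `received_since` is a hypothetical filter extension.
function nextRequest(relayUrl: string, filter: Record<string, unknown>) {
  const lastReceivedAt = cursors.get(relayUrl) ?? 0;
  return { ...filter, received_since: lastReceivedAt };
}

// After EOSE, advance the cursor so future requests have no gaps.
function onCaughtUp(relayUrl: string, now: number) {
  cursors.set(relayUrl, now);
}
```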
Well, I don't want to sound repetitive, but this is exactly the NIP-34 PR, as I said above xD It only needs some repo maintainer review.
Thanks guys! Currently I still consider negentropy experimental, so I don't think we should turn it into a NIP yet. Based on my discussions with the author of the paper negentropy is based on, there is one more change I'm making to the protocol. I'm also going to simplify it slightly and remove the `idSize` parameter -- that's pretty much the only exposed option now.

I'm working on an article now about why Range-Based Set Reconciliation (RBSR) is so cool. I think it will be a very important building block for internet protocols in the next few years, since it has many compelling advantages over merkle search trees and other similar approaches. I'll let you all know when my article is ready.

A couple notes:
Interesting point on the bucketing. If negentropy clusters by a number of events in order, it feels like if there is an added event right at the beginning of the order, all clusters would be affected and the resulting hashes (or resulting created_ats) would be different. Is that correct? If that is true, then the algorithm on the client side that figures out which clusters to dive into is a bit more complex, isn't it?
This should be the vision for this NIP. Efficient sync is good for more than syncing relays; special-purpose clients or DVMs for fulfilling advanced queries across many relays (search, count, find replies, trending) would be well served by an efficient sync mechanism. If it were possible to do the sync in real time in response to a user request, that would open up a lot of really interesting use cases.
On Mon, Oct 23, 2023 at 05:30:14AM -0700, Doug Hoyte wrote:
> Thanks guys! Currently I still consider negentropy experimental, so I don't think we should turn it into a NIP yet. Based on my discussions with the author of the paper negentropy is based on, there is one more change I'm making to the protocol. I'm also going to simplify it slightly and remove the `idSize` parameter -- that's pretty much the only exposed option now.
Thanks for the heads up! You think it might be ready to spec after this simplification? I will hold off implementing until it's updated.
Yes, with the current negentropy version, any stored events that land in a range that you've pre-computed a hash for will need re-computing. This is what I am working on now: parameterising an incremental hash function with adequate security. This will let you add or subtract event IDs out of a hash without touching any other events. It is literally just treating the hashes as numbers and adding them together. There are more secure approaches, but they involve complicated elliptic curve dependencies, which I really want to keep out of the negentropy spec. Anyway, finding collisions in hash additions takes a fair amount of resources/time. I wrote a program to attack this as best I could: https://github.com/hoytech/birthday-collisions (and there are several additional countermeasures that will make it harder still).
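A toy sketch of that addition idea (my own illustration; the parameters hoytech lands on may differ): treat each event ID's hash as a 256-bit integer and keep a running sum mod 2^256, so individual IDs can be added or removed without touching the rest of the set.

```ts
import { createHash } from "node:crypto";

const MOD = 1n << 256n; // work modulo 2^256

// Map an event ID to a 256-bit integer via SHA-256.
function idToInt(eventId: string): bigint {
  const hex = createHash("sha256").update(eventId).digest("hex");
  return BigInt("0x" + hex);
}

// Incremental set hash: add and remove are O(1) per event, and the
// result is independent of insertion order.
function addId(acc: bigint, eventId: string): bigint {
  return (acc + idToInt(eventId)) % MOD;
}

function removeId(acc: bigint, eventId: string): bigint {
  return (acc - idToInt(eventId) + MOD) % MOD;
}

// Example: removing an ID restores the hash without recomputing anything else.
let acc = 0n;
acc = addId(acc, "event-one");
acc = addId(acc, "event-two");
acc = removeId(acc, "event-one"); // acc now equals addId(0n, "event-two")
```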
Yes, definitely hold off for now. I think it should be good to go after this update, but maybe some people will have good feedback after I publish the article, so we'll see!
Then, I assume that because it is pre-computed, the base negentropy algo doesn't really have a way to specify the filter you want to sync. Or does it? For instance, Amethyst would like to sync only the event set where the user is the author or p-tagged + all Kind:0s and all Kind:10002s for follows + follows of follows of the user. I wasn't sure there was a smart algo somewhere to allow the pre-computation for different filters, so this PR requires computing hashes on the fly all the time. But maybe there is a way to pre-compute them.
Yes, it allows syncing an arbitrary filter. It always computes the hashes when needed; as far as I know, there isn't really a way to avoid this given arbitrary filters. With incremental hashes you could build this into a tree, and that will improve the efficiency of the sync somewhat, but you'll still need to run the query, read in the matching IDs, and re-build the tree on each query, which probably isn't worth it unless filters will match millions of results.
Considering the under-development status of @hoytech's approach, do we want to advocate for a simpler version now and deprecate it next year, or do we want to wait for it? I could really use a sync right now, but I understand if people want to wait.
Depends on the timeline. I'd be ok with two versions, especially if they're very substantially different in complexity/capability, which it seems these are.
It would be great to see this and other solutions (like #579) being tried now to save battery and bandwidth for everyone on mobile. A GitHub Draft of negentropy-sync would also be awesome so that discussion can be easy to follow later. To the question of implementation complexity, this is what libraries are for and they will come, but first we need specifications.
I am not sure if I agree with this. Depending on very common libraries is ok, like with SHA-256 for instance. Depending on large, complex, and opinionated libraries is not good. They might come, but it could take years to get a good chunk of languages covered by good library options that are fully interoperable with one another. I would prefer if we made sure Nostr can be easily implemented/assembled in any language.
Why not use negentropy, and/or simplify the negentropy spec?
After some testing, I now think the flexibility in window size is important. So, I moved away from a fixed spec with weekly hashes to a spec based on the first digits of each event's created_at timestamp. It looks more complicated, but it is actually simpler than the previous one.
This is a simple way to implement a sync between the event databases of a client and a relay.
The goal is to be so simple that relays and clients can actually implement it from scratch.
It is similar to what strfry does, but strfry's algorithm has too many options to code, which makes an interoperable implementation very difficult/complex.
Curious to hear from @jb55, @hoytech, @mikedilger, @cameri and others who have spent time on this.
Read: https://github.com/vitorpamplona/nips/blob/negentropy-sync/29.md