
NIP-29: Simple time-based Sync #826

Draft · wants to merge 8 commits into master

Conversation

@vitorpamplona (Collaborator) commented Oct 16, 2023

This is a simple way to implement a sync between event databases of a client and a relay.

The goal is to be so simple that relays and clients can actually implement it from scratch.

It is similar to what StrFry does, but strfry's algorithm has too many options to code, which makes an interoperable implementation very difficult/complex.

Curious to hear from @jb55, @hoytech, @mikedilger, @cameri and others who have spent time on this.

Read: https://github.com/vitorpamplona/nips/blob/negentropy-sync/29.md

@staab (Member) left a comment


I'm glad to see this coming into the protocol. I think this could be simplified by just using a HASH verb, since filters are used anyway. That way, a client can window more or less granularly as needed.

@hoytech (Contributor) commented Oct 17, 2023

Interesting choice to encode the week as the base unit of time in the protocol! The 7 day cycle of days we use today has been unbroken since at least 60 CE during the reign of Augustus, and possibly much longer -- if synchronised with Judaism then perhaps a thousand years earlier than that. This makes the week by far the most stable calendrical unit of time.

However, no matter the unit of time you choose, there will be cases where it is sub-optimal. Consider queries that match infrequent events. If they are posted once a week or less, then weekly hashes degrade into simply transferring the event IDs on every sync. Here it would be ideal to have something like monthly or yearly buckets.

Alternatively, consider syncing the comments on a thread. Most of the time a thread will have a flurry of activity in the first couple days after posting, and then go quiet after. So every time a difference is detected by a sync you will need to re-download the full thread (and potentially paginate it, since most relays limit the number of stored events that a REQ can return). Here ideally you'd use hourly or minutely buckets.

As you suggest, it is possible to have custom intervals selected by the protocol. Requiring implementations to do calendar arithmetic seems far too complicated to me. With negentropy, the matching events are divided into (by default) 16 equal-sized (by number of events, not time) buckets, and the starting timestamp for each bucket is sent.
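As a rough sketch of that equal-count bucketing (the Event shape, bucket count, and fingerprint here are illustrative only, not negentropy's actual wire format):

```typescript
import { createHash } from "node:crypto";

interface Event {
  id: string;         // hex event id
  created_at: number; // unix timestamp in seconds
}

// Split the locally matching events into `bucketCount` buckets of (roughly)
// equal size by event count -- not by time -- and report each bucket's
// starting timestamp plus a fingerprint of the ids it contains.
function bucketByCount(events: Event[], bucketCount = 16) {
  // Both sides must agree on ordering: created_at ascending, ties broken by id.
  const sorted = [...events].sort(
    (a, b) => a.created_at - b.created_at || a.id.localeCompare(b.id)
  );
  const size = Math.ceil(sorted.length / bucketCount);
  const buckets: { since: number; fingerprint: string }[] = [];
  for (let i = 0; i < sorted.length; i += size) {
    const chunk = sorted.slice(i, i + size);
    buckets.push({
      since: chunk[0].created_at,
      fingerprint: createHash("sha256")
        .update(chunk.map((e) => e.id).join(""))
        .digest("hex"),
    });
  }
  return buckets;
}
```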

The second problem: Once you detect a difference in the hash of a set of items, what do you do next? If you simply perform the entire query then the amount of data transferred is linear in the entire set for each sync. In the worst case where you run a separate sync for each individual event: quadratic.

With @staab's suggestion, you could split the range into half-week windows, and get each of their hashes, and recurse into the ones that differ. This would result in bandwidth overhead logarithmic in the set size.
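A minimal sketch of that recursion, with the HASH round trip, the local hash, and the plain-REQ download left as hypothetical stubs:

```typescript
// Hypothetical helpers: one HASH round trip with the relay, the same hash
// computed over the local database, and a plain REQ download of a window.
declare function askRelayForHash(since: number, until: number): Promise<string>;
declare function localHash(since: number, until: number): Promise<string>;
declare function fetchWindow(since: number, until: number): Promise<void>;

// Recurse into halves of the time range until hashes match or the window is
// small enough to just re-download; transferred data grows roughly
// logarithmically with the size of the synced set.
async function syncWindow(since: number, until: number, minWindow = 3600): Promise<void> {
  if ((await askRelayForHash(since, until)) === (await localHash(since, until))) {
    return; // this window already matches
  }
  if (until - since <= minWindow) {
    await fetchWindow(since, until); // small enough: re-download it outright
    return;
  }
  const mid = Math.floor((since + until) / 2);
  await syncWindow(since, mid, minWindow);
  await syncWindow(mid, until, minWindow);
}
```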

This is effectively how negentropy works, except that many ranges can be batched together and worked on concurrently (while strictly adhering to transport frame size limits), and when a small enough range is found, the item IDs are sent directly, rather than being hashed.

strfry's algorithm has way too many options, which makes an interoperable implementation very difficult

I guess you are referring to the complexity of a negentropy implementation? Granted, this is non-trivial. However, there are 3 existing implementations and a pretty decent reference test suite. What language would be most useful for your app?

@vitorpamplona (Collaborator, Author)

no matter the unit of time you choose, there will be cases where it is sub-optimal.

Yes, my hope is that it is ok to be suboptimal for simplicity's sake. We will see.

Once you detect a difference in the hash of a set of items, what do you do next?

Clients would simply download the full week again (we already do that, so it shouldn't be that much of a problem). But I expect discrepancies in past hashes to be rare. They would only happen when an event from the past is re-broadcast to the relay, or when there was an EOSE issue somewhere and the client stopped asking for a given range of events. Most of the time, hashes should match.

I guess you are referring to the complexity of a negentropy implementation? Granted, this is non-trivial.

Yes, I spent some time going through it. I think it is very powerful, but too hard and too open-ended for two implementations to be able to declare full compliance with each other.

Since we have so many people building relays from scratch, I think it is important to keep this as simple as possible. Anybody should be able to code and reply to these calls correctly, even if they are coding from scratch.

@arthurfranca (Contributor)

What about the option of relays storing a .seen_at field on events? Then clients would filter using since/until/limit based on that field instead of on .created_at.

Maybe then clients could store, for each filter, just the last moment they requested events from each relay. If that is enough for your syncing use case, it would be much easier for relays to implement.

@vitorpamplona (Collaborator, Author) commented Oct 17, 2023

What about the option of relays storing a .seen_at field on events? Then clients would filter using since/until/limit based on that field instead of on .created_at.

It's orthogonal to the solution here. This could help, but it doesn't solve the problem of making sure the past has been fully synced.

@staab (Member) commented Oct 20, 2023

Going to restate my comment: why do we need to enforce weekly windows? Since we're accepting arbitrary filters here, the hardcoded window doesn't seem to do anything to help relays with performance, other than forcing clients to limit the scope of their request. But that's a pretty weak heuristic, since a year of one pubkey's data is going to be less than 10 minutes of global data. So why not just calculate the hash based on the filter, and relays can complain if the filter isn't restrictive enough?

So:

The client sends a HASH message to the relay with a subscription ID and appropriate filters for the content to be synced.

Request:

["HASH-REQ", <subscription ID string>, <nostr filter>, <nostr filter2>, <nostr filter3>]

The relay calculates the hash and responds with the following:

Response:

["HASH-RES", <subscription ID string>, <SHA256(JSON.stringify([event1.id, event2.id, event3.id, ...])) in hex>]

The client then compares the received hashes with those stored locally and, if they differ, uses the same filter to download all desired events.


Ok, so thinking about this more, this is actually ok. A binary search of hashes over a long time period would require a relay to do more work, and clients can always impose their own narrower windows if desired.
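For concreteness, a relay-side sketch of computing that HASH-RES value from the formula above (the event-store lookup is a hypothetical stub):

```typescript
import { createHash } from "node:crypto";

// Hypothetical store lookup: ids of all stored events matching the filters,
// in a deterministic order both sides agree on (e.g. created_at, then id).
declare function queryEventIds(filters: object[]): Promise<string[]>;

// ["HASH-REQ", <sub id>, <filter>, ...]  ->  ["HASH-RES", <sub id>, <hash>]
async function handleHashReq(subId: string, filters: object[]): Promise<string> {
  const ids = await queryEventIds(filters);
  const hash = createHash("sha256")
    .update(JSON.stringify(ids)) // SHA256(JSON.stringify([id1, id2, ...]))
    .digest("hex");
  return JSON.stringify(["HASH-RES", subId, hash]);
}
```

Both sides would also need to agree on the exact ordering of the ids for the hashes to be comparable (see @mikedilger's note below about breaking created_at ties by id).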

@vitorpamplona (Collaborator, Author) commented Oct 20, 2023

All window sizes that I could imagine came with weird problems. Calendar formatting (time zones, leap years, leap seconds, etc.), for instance, is already an issue between implementations in multiple languages. As @hoytech mentioned, "week of the year" seems to be the most stable metric among all calendar types.

An option would be to group by a substring of the first n chars of a stringified .created_at. The client sends the number of chars to be grouped by:

  • 1: * -> group by periods of 31 years
  • 2: ** -> group by periods of 3 years
  • 3: *** -> group by periods of 16 and a half weeks
  • 4: **** -> group by periods of 11 days and 13 hours
  • 5: ***** -> group by periods of 1 day and 3 hours
  • 6: ****** -> group by periods of 2 hours and 46 minutes
  • 7: ******* -> group by periods of ~16 minutes
  • 8: ******** -> group by periods of 1 minute and 40 seconds
  • 9: ********* -> group by periods of 10 seconds
  • 10: ********** -> group by seconds

That way, implementers don't need to format dates at all, but the groups are less intuitive.

But the goal is to make a Sync that is stupidly simple to implement.
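A minimal sketch of that prefix grouping, assuming ten-digit Unix-second .created_at values (keeping n characters yields buckets of 10^(10-n) seconds):

```typescript
import { createHash } from "node:crypto";

interface Event {
  id: string;
  created_at: number; // unix timestamp in seconds (currently ten digits)
}

// Group events by the first `n` characters of the stringified created_at and
// hash each group's sorted ids. No calendar arithmetic is involved: keeping n
// characters simply means buckets of 10^(10-n) seconds.
function hashByPrefix(events: Event[], n: number): Map<string, string> {
  const groups = new Map<string, string[]>();
  for (const e of events) {
    const key = String(e.created_at).slice(0, n);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(e.id);
  }
  const hashes = new Map<string, string>();
  for (const [key, ids] of groups) {
    ids.sort(); // deterministic order so client and relay hash identically
    hashes.set(key, createHash("sha256").update(ids.join("")).digest("hex"));
  }
  return hashes;
}
```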

@staab (Member) commented Oct 20, 2023

Right, I was suggesting omitting time-based chunking entirely. But I've since convinced myself it's fine.

@vitorpamplona (Collaborator, Author)

Time-based chunking seems right because sync is only useful when reprocessing past events.

We have since/until to deal with downloading events since the last time the user was online. For everything else, sync should be used.

@staab (Member) commented Oct 20, 2023

I just reviewed the negentropy protocol, and while it's a little more complex and not as tidy, there are good libraries available for multiple languages. It's also already implemented in strfry, and probably elsewhere. How much better is negentropy in terms of time/space complexity compared with simple sync? I'm inclined to just use what already exists unless there's a good reason not to. @hoytech, I would definitely like your opinion on how best to get this into a NIP.

@vitorpamplona (Collaborator, Author)

Frankly, I don't see any relay dev coding a fully interoperable interface with what strfry has, much less all the other ad-hoc relays out there that were coded from scratch. Even as a client, which only needs to implement one of the multiple ways to use the protocol, it took me forever to figure out how to code it. Imagine relays that must support all negentropy options.

That's why I made this PR. The goal is to create something where the end result is very similar in time/costs at a fraction of the complexity of the full protocol.

@jb55 (Contributor) commented Oct 21, 2023 via email

@vitorpamplona (Collaborator, Author)

Why not just standardize around that?

If we can get people to actually code it, sure.

@mikedilger (Contributor) commented Oct 21, 2023

Sorting by created_at is not sufficient as you'll get multiple events with the same timestamp. You should subsequently sort by id.

If you only get one hash at a time, you could specify any time period with 'since' and 'until'. EDIT: I presume you want the time periods fixed so relays can have the hashes pre-calculated.

What about the option of relays storing a .seen_at field on events? Then clients would filter using since/until/limit based on that field instead of on .created_at.

It's orthogonal to the solution here. This could help, but it doesn't solve the problem of making sure the past has been fully synced.

I'm still in favor of seen_at filters and maybe I should re-open that.

If I know that I got all the events from a relay up to seen_at time X, then when I ask again later I know that ALL events I might be missing must have flowed in after seen_at time X, even if the created_at times are all over the place. The only slightly tricky bit is that IF the timestamp I put in for seen_at is in the future according to the relay's clock, then the relay won't have seen everything up to that stamp just yet... so the relay should cap the request at the relay's now and indicate that somehow in the reply. This was suggested long, long ago.
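A small client-side sketch of that bookkeeping; the seen_at-based request and the relay-reported cap are hypothetical, since no NIP defines them yet:

```typescript
// Hypothetical: ask the relay for everything it received after `seenAtSince`,
// getting back the events plus the relay-clock timestamp the query was capped at.
declare function reqBySeenAt(
  relay: string,
  seenAtSince: number
): Promise<{ events: object[]; cappedAt: number }>;

const watermarks = new Map<string, number>(); // per-relay seen_at watermark

async function catchUp(relay: string): Promise<object[]> {
  const since = watermarks.get(relay) ?? 0;
  const { events, cappedAt } = await reqBySeenAt(relay, since);
  // ... store the events locally ...
  // Advance the watermark only to the relay's own cap, never to our clock,
  // so events arriving around "now" are not skipped on the next catch-up.
  watermarks.set(relay, cappedAt);
  return events;
}
```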

@vitorpamplona (Collaborator, Author) commented Oct 21, 2023

Sorting by created_at is not sufficient as you'll get multiple events with the same timestamp. You should subsequently sort by id.

Agreed.

I'm still in favor of seen_at filters and maybe I should re-open that.

So, basically, relays must store a received_at date for each event, and then the seen_at filter is just like since but for the received_at dates instead of created_at?

@mikedilger (Contributor)

So, basically, relays must store a received_at date for each event, and then the seen_at filter is just like since but for the received_at dates instead of created_at?

Yep. It would prevent gaps. But I agree there are still other reasons to do what this NIP is suggesting.

@arthurfranca (Contributor)

Well, I don't want to sound repetitive, but this is exactly what the NIP-34 PR does, as I said above xD It only needs a review from a repo maintainer.

@hoytech (Contributor) commented Oct 23, 2023

Thanks guys! Currently I still consider negentropy experimental, so I don't think we should turn it into a NIP yet. Based on my discussions with the author of the paper negentropy is based on, there is one more change I'm making to the protocol. I'm also going to simplify it slightly and remove the idSize parameter -- that's pretty much the only exposed option now.

I'm working on an article now about why Range-Based Set Reconciliation (RBSR) is so cool. I think it will be a very important building block for internet protocols in the next few years, since it has many compelling advantages over merkle search trees and other similar approaches. I'll let you all know when my article is ready.

A couple notes:

  • Bucketing by time: Negentropy does not directly bucket by time. For example, imagine you are using a week interval, and detect there are changes, so you try bucketing it by days. But say all the matching events were created in one particular day. In this case you have made no progress (haven't identified any missing events, or recursed into actual event-containing ranges). Negentropy always makes progress in every message in either direction, because it creates equal-sized buckets according to the local number of events (independent of their timestamps).
  • seen_at event metadata: Using this relies on keeping relay-specific state, whereas the content-addressed sync types (both time-bucketing and RBSR) do not. If you connect to many relays this becomes less efficient (because you may be downloading the same event N times). Also, relays can unexpectedly re-build their DBs (potentially throwing away or resetting seen_at metadata), or multiple relays can be behind a load-balancer and not have consistent seen_ats, or their clocks can be changed, or events can be deleted and re-inserted, or any number of other things. In a chaotic environment like nostr, I think syncing by content is preferable.

@vitorpamplona (Collaborator, Author) commented Oct 23, 2023

Interesting point on the seen_at.

If Negentropy clusters by a number of events in order, it feels like if there is an added event right at the beginning of the order, all clusters would be affected and the resulting hashes (or resulting created_ats) would be different. Is that correct?

If that is true, then the algorithm on the client side that figures out which clusters to dive into is a bit more complex. Isn't it?

@staab (Member) commented Oct 23, 2023

I think it will be a very important building block for internet protocols in the next few years

This should be the vision for this NIP. Efficient sync is good for more than syncing relays; special-purpose clients or DVMs for fulfilling advanced queries across many relays (search, count, find replies, trending) would be well served by an efficient sync mechanism. If it were possible to do the sync in real time in response to a user request, that would open up a lot of really interesting use cases.

@jb55 (Contributor) commented Oct 24, 2023 via email

@hoytech (Contributor) commented Oct 24, 2023

If Negentropy clusters by a number of events in order, it feels like if there is an added event right at the beginning of the order, all clusters would be affected and the resulting hashes (or resulting created_ats) would be different. Is that correct?

If that is true, then the algorithm on the client side that figures out which clusters to dive into is a bit more complex. Isn't it?

Yes, with the current negentropy version, any stored events that land in a range that you've pre-computed a hash for will need re-computing. This is what I am working on now: parameterising an incremental hash function with adequate security. This will let you add or subtract event IDs from a hash without touching any other events. It is literally just treating the hashes as numbers and adding them. There are more secure approaches, but they involve complicated elliptic curve dependencies, which I really want to keep out of the negentropy spec. Anyway, finding collisions in hash additions takes a fair amount of resources/time; I wrote a program to attack this as best I could: https://github.com/hoytech/birthday-collisions (and there are several additional countermeasures that will make it harder still).
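A toy illustration of that arithmetic, treating each hashed id as a number and adding or subtracting it modulo 2^256 (just the trick being described, not negentropy's actual fingerprint scheme):

```typescript
import { createHash } from "node:crypto";

const MOD = 1n << 256n;

// Hash the id first so an attacker cannot pick the summands directly.
function idToNumber(id: string): bigint {
  return BigInt("0x" + createHash("sha256").update(id).digest("hex"));
}

// Order-independent running fingerprint of a set of event ids: an id can be
// added or removed later without touching any of the other members.
class IncrementalFingerprint {
  private acc = 0n;
  add(id: string)    { this.acc = (this.acc + idToNumber(id)) % MOD; }
  remove(id: string) { this.acc = (this.acc - idToNumber(id) + MOD) % MOD; }
  hex(): string      { return this.acc.toString(16).padStart(64, "0"); }
}
```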

@hoytech (Contributor) commented Oct 24, 2023

Thanks for the heads up! You think it might be ready to spec after this simplification? I will hold off implementing until it's updated.

Yes definitely hold off for now. I think it should be good to go after this update, but maybe some people will have good feedback after I publish the article, so we'll see!

@vitorpamplona (Collaborator, Author)

Yes, with the current negentropy version, any stored events that land in a range that you've pre-computed a hash for will need re-computing.

Then, I assume that because it is pre-computed, the base negentropy algo doesn't really have a way to specify the filter you want to sync. Or does it?

For instance, Amethyst would like to sync only the event set where the user is the author or p-tagged + all Kind:0s and all Kind:10002s for follows + follows of follows of the user.

I wasn't sure there was a smart algo somewhere to allow the pre-computation for different filters, so this PR requires computing hashes on the fly all the time. But maybe there is a way to pre-compute them.

@hoytech (Contributor) commented Oct 24, 2023

Yes, it allows syncing an arbitrary filter. It always computes the hashes when needed: As far as I know there isn't really a way to avoid this given arbitrary filters.

With incremental hashes you could build this into a tree, and that will improve the efficiency of the sync somewhat, but you'll still need to run the query, read in the matching IDs and re-build the tree on each query, which probably isn't worth it unless filters will match millions of results.

@vitorpamplona (Collaborator, Author) commented Oct 24, 2023

Considering the under-development status of @hoytech's approach, do we want to advocate for a simpler version now and deprecate it next year or do we want to wait for it?

I could really use a sync right now but I understand if people want to wait.

@staab (Member) commented Oct 24, 2023

Depends on the timeline; I'd be ok with two versions, especially if they're substantially different in complexity and capability, which it seems these are.

@weex commented Oct 26, 2023

I could really use a sync right now but I understand if people want to wait.

It would be great to see this and other solutions (like #579) being tried now to save battery and bandwidth for everyone on mobile.

A GitHub Draft of negentropy-sync would also be awesome so that the discussion is easier to follow later.

To the question of implementation complexity, this is what libraries are for and they will come, but first we need specifications.

@vitorpamplona (Collaborator, Author)

To the question of implementation complexity, this is what libraries are for and they will come, but first we need specifications.

I am not sure if I agree with this. Depending on very common libraries is ok, like with SHA-256 for instance. Depending on large, complex, and opinionated libraries is not good. They might come, but it could take years to get a good chunk of languages covered by good library options that are fully interoperable with one another.

I would prefer if we made sure Nostr can be easily implemented/assembled in any language.

@Semisol self-requested a review on October 30, 2023 at 17:45
@Semisol (Collaborator) left a comment


Why not use negentropy, and/or simplify the negentropy spec?

@vitorpamplona (Collaborator, Author) commented Nov 8, 2023

After some testing, I now think the flexibility in window size is important. So, I moved away from a fixed spec with weekly hashes to a spec based on the first n-chars of .created_at. The client can specify how to truncate the timestamp to create groups of multiple sizes, bound by the filters in the subscription.

It looks more complicated, but it is actually simpler than the previous one.
