Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync goes into infinite loop? #427

Open
grrrwaaa opened this issue Aug 17, 2021 · 17 comments
Open

Sync goes into infinite loop? #427

grrrwaaa opened this issue Aug 17, 2021 · 17 comments

Comments

@grrrwaaa
Copy link

grrrwaaa commented Aug 17, 2021

import Automerge from "https://cdn.skypack.dev/[email protected]";

Apologies in advance if I (very likely) did something obviously wrong here.

I was trying to set up the most minimal example of using the syncState mechanism, in order to demonstrate to our research team. I found that my example would sometimes fall into an infinite loop, with peers constantly messaging each other even though their documents were ostensibly already synced. I was able to boil this down to a very minimal example, using only two peers and not even any changes. That leads me to suspect that I'm doing something wrong here, but the docs are a bit opaque to me and I couldn't find any example that was lean enough to compare to.

The codepen below has the demo. It displays the states of the docs and logs messages being sent & received. Networking is "emulated" using setTimeout with random delays, and a simple FIFO inbox for each peer.

https://codepen.io/grrrwaaa/pen/ZEKNeRO?editors=0010

Is it something about how the sync objects returned from receiveSyncMessage or generateSyncMessage are applied? I was unclear whether those should be applied immediately.

I realize that syncing two already-synced documents might be an edge case to ignore, but it can also happen with additional edits. If you uncomment the "simulate an edit" lines, then sometimes the sync messages resolve to stability, but sometimes they still keep talking to each other, which implies there's some order-of-operation effect here that the randomized simulated network delay is picking up.

Again, apologies if I'm missing something here, and thanks for paying attention.

EDIT: Oh -- should I be simulating different actor IDs here?

@pvh
Copy link
Member

pvh commented Aug 17, 2021

You do need different actor IDs though it probably shouldn't enter an infinite loop if you don't have them.

Your example code is a bit complicated but doesn't have anything obviously wrong going on, so the next step would be to figure out why the sync protocol believes it needs to send another message.

If you have a look at the sync code, calling generateSyncMessage will produce a message if there are local changes we know the other peer doesn't have, or if our heads and the other peers' heads don't match (indicating they have changes we need, still.) I've not got a chance to dig into what specifically is going wrong, but I would suspect it's something like:

  • your sync states are being lost / reset between messages, so peers think the other might still have something new for them
  • the messages aren't being applied correctly, so we keep repeating ourselves

On the whole though, it looks from casual inspection like it should work, but since it isn't, I hope this gives you some improved ability to debug the problem. In general I would say that a well designed API should make it very difficult to get into infinite loops in the first place, so we probably still have some work to do here, and once we figure out the problem I certainly welcome your input into how we can either improve the error messages, the API, or failing that, improve the documentation.

@grrrwaaa
Copy link
Author

grrrwaaa commented Aug 17, 2021

Thank you for the very quick reply @pvh !

I simplified the example code & added more comments, hopefully this is now much easier to understand. I'm not sure how much simpler I could make it.

https://codepen.io/grrrwaaa/pen/qBmGMQV?editors=0010

Some new observations:

  • I verified that the two peers' docs have different actorIDs already, so it isn't that.
  • Even with an empty doc the sync loop happens, so it seems to have nothing to do with the content of changes
  • I also checked, and it doesn't matter which peer starts talking first, the same loop happens.

I did discover that if the update() function performs generateSyncMessage before it handles incoming messages, then no sync loop happens. That seems counter-intuitive (e.g. good habit of git pull before git push to resolve locally first), and I thought than in theory the order shouldn't matter. It also makes me nervous about how this will behave with many peers, not just a one-to-one.

(EDIT: relying on send() before receive() didn't stop the infinite loop once I added more changes to the docs, so that's not really a viable workaround).

The sync messages being sent are always [66,0,0,1,0,0,0]. I'm not sure how to interpret the byte arrays yet (though that would be very useful to understand -- is it documented anywhere?)

I also log the current docs, syncStates, and actorIDs to the page, if it helps.

@dan-weaver
Copy link
Contributor

dan-weaver commented Aug 17, 2021

I'm having a bit of trouble understanding exactly what your code does wrong. Like @pvh says at a glance it seems correct. I'm not sure if you've discovered a bug or if the code itself is buggy.

This gist fixes the infinite loop https://gist.github.com/dan-weaver/9d2839e0ed12a03be74724c5c61198a3

I've hoisted the initial "send()" calls out of the setInterval and then embedded the actual send inside receive(). In a real scenario when a peer makes a change you'd just call send() once more and it should trigger a new sync and everything should converge.

I think there's just a weird race condition somewhere in your initial code but I had trouble tracking it down. Peer2 was never receiving any messages so it just kept sending an "initial" message to peer1.

@grrrwaaa
Copy link
Author

Thanks for looking into it!

The trouble with this solution is what happens if the network drops the initial send packet?
I tried this: I call the generateSyncMessage(), but don't forward the message to the other peer (to simulate dropped packet). The result is an infinite loop again.

https://codepen.io/grrrwaaa/pen/NWjVoKW?editors=0010

That is, it seems like things break if you receive & handle a packet from a peer before that peer has received & handled a packet from you. Working around this would require some kind of additional ACK or handshaking logic. Can't the syncStates handle this case without needing extra handshaking?

@dan-weaver
Copy link
Contributor

dan-weaver commented Aug 17, 2021

So you're technically not actually getting an infinite loop. Or at least not one that is a bug in the protocol. If you place some logs in both the receive and send functions you'll notice that only 1 peer is ever actually sending a non null message.

What you have is not an infinite loop you just have 1 peer continuously trying to initiate a sync with one who never responds. Not exactly sure why but peer2 never transmits a single message to peer1. However, one of the peers is just constantly sending a sync message to the other and because it never gets one back it just keeps doing this. That's unavoidable I think. In a real life situation though you would only call update() when u know something locally has changed or when you've received remote changes that you know other peers are interested in.

In the case of a network failure, when a message is actually dropped, things should eventually converge the next time one of the peers sends the next message. The code above doesn't actually simulate that though. Also depending on what kind of network protocol you're using you may be able to detect when a peer has disconnected (web socket for example). If it has done so it should call send to all the peers again once it reconnects.

EDIT:

What's happening is peer 1 is sending the initial sync message to peer2. Peer2 is then noticing that both itself and peer1 have an empty document so it doesn't send anything back. There's no changes to communicate. However, because you call send() every 500 ms this process just repeats.

Again, in a real situation there'd be no reason for peer2 to keep sending an update if in fact nothing has actually changed. That's why moving the initial "send" out of setInterval is a better representation of how this would work irl.

@grrrwaaa
Copy link
Author

Yes, it does seem that if I apply a change to the document before networking starts, then the syncs resolve without looping.

I wonder, is there a way to parse the msg result of generateSyncMessage() to know when it believes it has no changes to communicate? Automerge.decodeChange() seems to be expecting a different kind of buffer.

Again, apologies if this is documented somewhere, I couldn't find it, but I'd really like to know the spec of the change protocol (or protocols?) that Automerge is using. I'm going to need that anyway to create the app-specific side-effects of changes to the doc. (I'm hoping that this is documented and stable -- but I understand if it isn't, and I guess I can fall back to doing a JSONPatch diff if necessary.)

@dan-weaver
Copy link
Contributor

dan-weaver commented Aug 17, 2021

the message returned by generateSyncMessage should be null if the peer considers the doc already synced. I think this is the most authoritative doc: https://github.com/automerge/automerge/blob/main/SYNC.md

as far as side effects. receiveSyncMessage returns a patch object as one of the values in the tuple. Similarly so does applyChanges. It should contain enough information to know what was changed although it might not be all that convenient a format :). I've never tried that.

@grrrwaaa
Copy link
Author

Thank you!

"the message returned by generateSyncMessage should be null if the peer considers the doc already synced." -- That's what I figured, but I guess my test script is showing that this isn't always the case, at least in the situation of a new document. But if this is valid in other cases, I think I can work around it among the general network handshake logic. I'll test a few variants and see if it remains stable.

Yes, I was just looking into the patch and change structures. It seems like that should work. I'm also looking at the Observable route too, which might be easier -- and very easily delineates local and remote origin changes.

Thanks so much for answering my questions, and amazing response time! Apologies that it drifted into another topic.
I have to say, the library itself is really inspiring, and I'm hoping that it will be something we can use in our project(s) for quite a long time to come.

@pvh
Copy link
Member

pvh commented Aug 17, 2021

It only knows that it's synced if it has heard from the other side, so if you don't hear from the other side it will continue to generate a "query".

Also, if you want to rummage around inside the SyncMessage, just use decodeSyncMessage. I think it's exported from Backend?

A small note that strictly none of the sync functions have side effects*. All inputs and outputs are explicit.

The sync protocol should behave correctly regardless of what order the peers initiate contact. Both sides can reach out simultaneously or one side can wait to hear from the other. (There are even tests for this.) Still, if you call "generateSyncMessage" without changing the document or hearing back from the other peer, you're going to get the same syncMessage to send again! This is, I think, correct behaviour, and supports retries in case of lost messages. I could probably be persuaded otherwise, but we wouldn't want to return null since that (as @dan-weaver says) indicates that the local node is sure that it has everything it needs and that its peer hasn't asked for anything else (aka: fully sync'd).

One other note, which is that the unified Frontend/Backend version of the sync protocol has an annoying behaviour which is that when it receives sync messages it's hard to tell whether the document has actually changed. In the case of split frontend/backend operation it's easier -- if you don't get a patch to emit, then there's nothing to render. Unfortunately, the merged version maintains some internal state which makes this harder. I don't really have a good solution for this yet, but practically speaking it leads to a bunch of wasted render() calls in React apps during synchronization. Sorry.

*: There is a side-effect. Although it doesn't always change the backend state, it can result in the old object being marked as "out of date" and returning a new backend object in its place. This is an optimization to allow the Rust backend implementation to do in-place data structure updates for performance reasons.

@grrrwaaa
Copy link
Author

Thanks!

So in the 2nd example (https://codepen.io/grrrwaaa/pen/qBmGMQV?editors=0010), each side is generating a "query" and sending it, and both sides are receiving each others' messages, so it is hearing from the other side, just that there were no changes to report, since all it hears back is a query, which it treats as still not knowing what state the other side is in, so it can't respond with any updates. So in this case, it generates a new query again (and same for the other peer). They will keep querying each other like this until somebody makes some kind of change, and therefore has something concrete to reply, and then the endless queries stop. Is this understading correct?

If so, I'm not really sure that this is correct behaviour. It has heard back from the peer, so it shouldn't need to retry querying. It's just that the peer had nothing to say, because it had an empty doc. I think the response it got back should have informed it what state the peer was in. Like John Cage: "I have nothing to say, and I'm saying it". That is, perhaps what is missing is a sync reply content that says the equivalent of "my state is an empty doc". (I don't think this is a feature for generating retries, as you did actually hear back, so there's no logical requirement for retries on this basis. The peer will get in touch with you if it changes something.) If the sync protocol has to be stricly delta-oriented, then this would mean a delta that identifies the creation of a new document. I realize this could pollute logic with an edge case though, and it probably is not worth it. At least for my use case all new docs will already have some changes so I'm not concerned for my app. I just thought it was worth identifying.

And thanks for the pointer to decodeSyncMessage. I looked into that and also the Observable API. I'm tempted to work with the Observable one, as it handily reports with a boolean whether the change was local, and has the before/after states (handy for a verification step). Would focusing on the Observable avoid the frontend-backend problem you identified?

@dan-weaver
Copy link
Contributor

dan-weaver commented Aug 18, 2021

It doesn't look to me that peer2 is hearing back from peer1 in this example. Peer2 is initiating the sync. Peer1 is "ending" it by producing a null value and then never responding (correctly) to peer2. place a log of console.log(self.name) around line 147 after the check if msg is truthy. You'll see that peer1 never sends anything. So it's not that both peers are generating queries every 500ms it's just that peer1 never did but in essence your code is incorrectly reinitiating the sync again even though it should be considered "complete" by firing again every 500ms.

I guess the question is whether this is actually "incorrect". I think it could go both ways but to your point on network failure. A network failure could also cause a hypothetical ack (non null but some other means of telling the peer that it's complete from the other side) message to be lost too.

The correct thing for peer2 to be doing in my opinion is to not be sending the sync message every 500ms. Instead it should just assume things are fine unless it experiences a local change or hears back from the other peer.

What you're doing though is not necessarily a bad thing. It's just sort of akin to long polling. This "extra" message is relatively harmless and will eventually resolve if peer1 experiences a change and subsequently sends a new message to peer2. However, instead of continuously "polling" the other peer it could just wait for peer1 to experience a change and send it's message anyway.

That said, if you're interested in knowing whether a sync is "complete" in on peer2 maybe for the situation where u want to show / hide a loading indicator I've had success with actually sending the null message and building logic into my app to both sense the outgoing null on one peer and receiving it on the other peer. This kind of serves as an ack message but it still won't influence the results of generateSyncMessage(). It's simply a good way to tell a user that we think the doc is synced. You'd probably want to build in a timeout though since, again the null value could get lost and you'd be left forever spinning.

The protocol is "optimistic" in a way. No news is good news. Eventually things will converge and a reliable ack isn't supposed to be strictly necessary.

Network failure is definitely a thing to consider. Imho it's something that should be handled on each respective peer. If a peer senes it's disconnected it should reconnect and reinitialize a sync with any peers it has been talking to. This is doable via websockets at least. It's not something built into the Automerge sync protocol.

Of course, this is all somewhat up for debate! But the above reflects my experience building an app that can recover from message loss.

@grrrwaaa
Copy link
Author

grrrwaaa commented Aug 19, 2021

Thank you for the extensive response again!

I think I'm following this correctly, but it requires a shift in perspective that is different from what I'm used to for synchronizing over a network. Specifically, the "optimistic" angle of assuming everything is OK until I hear otherwise! But then why does generateSyncMessage() generate a "query"? It should only generate something if (it thinks that) it has a change to share, no? The "query" itself isn't a change.

The periodic sending of messages was only happening because I was incorrectly assuming that generateSyncMessage() would tell me whether I need to send something or not. The generation of a "query" when there is no local change, or, the lack of response to that query, seems to break that assumption. If either of those were not the case (i.e. if generateSyncMessage() returned a null message rather than a query, or if the other peer sent back a "query response all is ok" rather than null, then the messaging back & forth would not happen). Something seems a bit asymmetric here. Maybe the empty "query" is the problem?

Anyway, shifting my perspective to assuming everyone else is in sync unless they tell me otherwise: I guess it's my job as app designer to maintain my own separate knowledge about whether I (think that I) have something to say or not, and not rely on generateSyncMessage() to do that for me. Or is there another way to ask my syncState whether it (thinks that it) has something it needs to send?

As I said, it's not a big deal and easy to workaround in my specific project, I'm only raising this in case it is interesting to the team.

@ept
Copy link
Member

ept commented Aug 19, 2021

Thanks @pvh and @dan-weaver for being so responsive. I hope you don't mind me chiming in.

Let's try to reduce your issue to a single message exchange, without all the polling and queueing and stuff. If I understand correctly, your scenario is like this:

let doc1 = Automerge.init(), state1 = Automerge.initSyncState();
let doc2 = Automerge.init(), state2 = Automerge.initSyncState();
let msg1, msg2;
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);
// msg1 is [66, 0, 0, 1, 0, 0, 0]
[doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);
// msg2 is null

That is, peer1 uses msg1 to tell peer2 which state it's in. peer2 determines that the two nodes are in sync, and that nothing further needs to be said (msg2 is null).

Without msg1, peer2 would have no way of knowing whether the two nodes are in sync. For that reason, the first message of a sync is always non-null.

It's not quite accurate to say that msg1 is a query. The message [66, 0, 0, 1, 0, 0, 0] essentially says: "Hello, I'm an empty document (with zero changes), there have been no new changes since our last sync, and I don't need anything from you. What about you?" Since the recipient is also an empty document, it knows they are in sync. If the recipient had some changes that the sender doesn't, it would know what it needs to send back. In that sense, the message acts as a query. But it also acts as a notification telling the other node what the state of the sender is.

I suggest you call generateSyncMessage in two circumstances: 1. you have just locally updated your document and want to tell the other nodes about the new change; 2. you have just reconnected to a peer after being disconnected for a while, and want to find out whether the other peer has something new for you. When two peers have an active connection and no changes are happening, there is no need to call generateSyncMessage. In particular, there is no need to keep polling.

@grrrwaaa
Copy link
Author

grrrwaaa commented Aug 19, 2021

Perfectly clear, thank you!

I think that what I was doing was the equivalent of this:

import Automerge from "https://cdn.skypack.dev/[email protected]";

let doc1 = Automerge.init(),
  state1 = Automerge.initSyncState();
let doc2 = Automerge.init(),
  state2 = Automerge.initSyncState();
let msg1, msg2;

// both peers initiate sync:
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);

// msg1: {"0":66,"1":0,"2":0,"3":1,"4":0,"5":0,"6":0}  
// == "hello I'm an empty & unchanged doc, how about you?"
// msg2: {"0":66,"1":0,"2":0,"3":1,"4":0,"5":0,"6":0}
// == "hello I'm an empty & unchanged doc, how about you?"

// LOOP START

// peer 1 activity:
if (msg2) [doc1, state1] = Automerge.receiveSyncMessage(doc1, state1, msg2);
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);

// peer 2 activity:
if (msg1) [doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);

// msg1: {"0":66,"1":0,"2":0,"3":1,"4":0,"5":0,"6":0}   // **why is this not null?**
// == "hello I'm an empty & unchanged doc, how about you?"
// msg2: null
// == "I think we are in sync"

// LOOP END:    it repeats ad infinitum from this point on; doc1 keeps generating non-null sync messages

https://codepen.io/grrrwaaa/pen/eYWqgJV?editors=0010

That is, sync1 generates the message "Hello, I'm an empty document (with zero changes), there have been no new changes since our last sync, and I don't need anything from you. What about you?", but after it hears the exact same message from sync2, it generates the message again. Shouldn't it now know the state of doc2, and therefore generate null?

I understand that I could use the fact that msg1 == null as an indicator that I don't need to call Automerge.generateSyncMessage(doc1, state1), and that's what I'll do in my code; so this isn't a bug report or anything like that. It's simply surprising to me that Automerge.generateSyncMessage(doc1, state1) didn't generate null in this case; it feels somehow incongruous or asymmetric.

@dan-weaver
Copy link
Contributor

dan-weaver commented Aug 19, 2021

I'm seeing msg1 as null in the below code the only thing I changed was adding the log. msg2 is non null but that's what I'd expect and is inline with Martin's code sample and explanation I think right? Peer1 see's peer2's initial message and doesn't respond (aka sets msg1 to null). Peer2 then is just generating the same sync message it did on the last "turn" because it has no reason to do otherwise.

Remember that this code if (msg1) [doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1); is not executing because msg1 is null so the code calling generateSyncMessage directly under it is just calling it with it's old arguments. When using the protocol you just would not make this call again.

const Automerge = require("automerge");

let doc1 = Automerge.init(),
  state1 = Automerge.initSyncState();
let doc2 = Automerge.init(),
  state2 = Automerge.initSyncState();
let msg1, msg2;

// both peers initiate sync:
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);

// msg1: {"0":66,"1":0,"2":0,"3":1,"4":0,"5":0,"6":0}  
// == "hello I'm an empty & unchanged doc, how about you?"
// msg2: {"0":66,"1":0,"2":0,"3":1,"4":0,"5":0,"6":0}
// == "hello I'm an empty & unchanged doc, how about you?"

// LOOP START

// peer 1 activity:
if (msg2) [doc1, state1] = Automerge.receiveSyncMessage(doc1, state1, msg2);
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);

// peer 2 activity:
if (msg1) [doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);

console.log({msg1, msg2});

@pvh
Copy link
Member

pvh commented Aug 19, 2021

Just chiming in to say that if there is a defect in the API design here it's that it might be reasonable for us to conclude that we could probably return null for repeated generateSyncMessage calls, thus:

[doc, state1] = Automerge.receiveSyncMessage(doc, state0, receivedMsg);
[state2, outboundMsg1] = Automerge.generateSyncMessage(doc, state1);
[state3, outboundMsg1] = Automerge.generateSyncMessage(doc, state2); // arguably, this could be null
[state2, outboundMsg1] = Automerge.generateSyncMessage(doc, state1); // this should not be null (no side effects)

I'm not entirely sure we want to do this. I agree that the correct behaviour is to only generateSyncMessage in response to a new connection, receiving a message, or making a change. Still, it seems like it is arguably unnecessary to generate the extra messages. On the other hand, the current behaviour gives nice recovery behaviour in environments with lossy communication channels so... I don't have a strong opinion about the right path forward (aside from improving the docs.)

@ept
Copy link
Member

ept commented Aug 20, 2021

That is, sync1 generates the message "Hello, I'm an empty document (with zero changes), there have been no new changes since our last sync, and I don't need anything from you. What about you?", but after it hears the exact same message from sync2, it generates the message again. Shouldn't it now know the state of doc2, and therefore generate null?

It does generate null! You are overwriting the msg1 variable with null, so peer 2 never receives the message sent by peer 1:

// peer 1 activity:
if (msg2) [doc1, state1] = Automerge.receiveSyncMessage(doc1, state1, msg2); 
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1); // here msg1 is set to null, overwriting the original non-null msg1

// peer 2 activity:
if (msg1) [doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1); // since msg1 is null, this line is not executed
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2); // here peer 2 did not receive any message from peer 1, so it doesn't know they are in sync

It works as expected if you fix this bug:

let msg1, msg2, msg3, msg4;

// both peers initiate sync:
[state1, msg1] = Automerge.generateSyncMessage(doc1, state1);
[state2, msg2] = Automerge.generateSyncMessage(doc2, state2);

// peer 1 activity:
if (msg2) [doc1, state1] = Automerge.receiveSyncMessage(doc1, state1, msg2);
[state1, msg3] = Automerge.generateSyncMessage(doc1, state1); // writes to msg3, not msg1

// peer 2 activity:
if (msg1) [doc2, state2] = Automerge.receiveSyncMessage(doc2, state2, msg1);
[state2, msg4] = Automerge.generateSyncMessage(doc2, state2);
// now msg3 and msg4 are both null, as expected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants