Spontaneous leaving/rejoining rooms ["state resets"] #1953
Comments
Example: @enckse:epiphyte.network left HQ on 2017-02-24 in event id This appears to be an artifact of the state resolution algorithm. |
More details about the example above:
His state after these was, of course, C (left). |
FYI - this just happened again to me on 0.20.0 (recognizing this issue isn't closed, just still an issue in 0.20.0) |
Indeed, 0.20 will have started to make this slightly rarer, thanks to #2025, but it's still going to happen when servers turn up and serve new events which refer to old points in the event graph. Using the example above, the event graph looks like this. (Time flows downwards. ASCII-art arrows represent 'prev_event' links, omitting irrelevant events. A/B/C represent enckse's state after the event).
I asked @NegativeMjark for thoughts on how we could fix this; I think what he said amounts to this: We need to change the state resolution algorithm so that two leaves (as we see at The problem this introduces is that different servers in the network could reach different conclusions about the state of the room. This in turn increases the likelihood of events being rejected. However:
An alternative approach would be for a "fixed" server to notice that this sort of state reversion would have happened, and to make sure that, in the next event it sends, it includes |
We just had this on Synapse 0.21 (updated today). Please PM me for logs ( Background: We have a board of directors chat (fully intended to be private/secret), and a director recently left the board, and was naturally removed from the room about a month ago. They suddenly started receiving notifications from the room today, without causing any kind of event or without any action on our part (besides the upgrade?). We subsequently banned the person in an attempt to stop their notifications. |
Ftr, this has been seen in #offtopic. Mijris kicked: https://matrix.to/#/!UcYsUzyxTGDxLBEvLz:matrix.org/$1495057823766vaFre:krtdex.com Mijris talked: https://matrix.to/#/!UcYsUzyxTGDxLBEvLz:matrix.org/%241495116380125811LpGEh:chat.weho.st No join in between. This also happened alongside a permissions revert, where PL limits and user PLs were rolled back (quite common in #offtopic). |
Reports of this happening in HQ for users on the matrix.org HS: https://matrix.to/#/!XqBunHwQIXUiqCaoxq:matrix.org/$14953870101384360LjBep:matrix.org |
https://docs.google.com/document/d/1UqcfS72qh8V7iXJ4oEkMIQ_E63xybRkpLxMEWw-EHBo contains some notes on what we might do about this |
In #offtopic:matrix.org today we had the worst state reset we've yet seen (and This resulted in only one person having an admin account. Luckily, he's still Furthermore, if that user wasn't around anymore, #offtopic would be |
I had a look at what happened in #offtopic recently. #offtopic is particularly problematic, because the powerlevels have been changed a lot over time, by a number of users, some of whom have only been moderators in the past (so would not have had permission to set the powerlevel). It appears that the reset in this case happened on 5th September, when event ID Some relevant power-level events:
Here's a subset of the event graph for #offtopic. (Time flows downwards. ASCII-art arrows represent 'prev_event' links, omitting irrelevant events. [square brackets] represent graph depth. A/B/C represent the power_level state after the event):
Note how the new event references old events which are in state A, and state C, but not any events which are in state B which is a necessary intermediate step to get from A to C. The state resolution algorithm therefore rejects C, and ends up at state A. In technical terms, this is a lack of associativity in the state resolution algorithm. Treating state resolution as a function R, we have:
|
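To make the associativity point concrete, here is a minimal toy sketch. It is not Synapse's state resolution algorithm: the `resolve` function, the `AUTHORISED_BY` table and the single-letter states are all assumptions made up for illustration, modelling only the rule that a power_levels state is accepted when the state that authorised it is the one currently accepted.

```python
# Toy model (hypothetical, NOT Synapse's algorithm): each power_levels
# state records which earlier state authorised it.
AUTHORISED_BY = {"A": None, "B": "A", "C": "B"}
ORDER = ["A", "B", "C"]  # oldest first


def resolve(*states):
    """Start from the oldest candidate, then accept each newer candidate
    only if it was authorised by the currently accepted state."""
    candidates = sorted(set(states), key=ORDER.index)
    accepted = candidates[0]
    for state in candidates[1:]:
        if AUTHORISED_BY[state] == accepted:
            accepted = state
    return accepted


# With the intermediate state B present, resolution reaches C.
full = resolve("A", "B", "C")
# Without B, C cannot be authorised from A, so resolution falls back to A.
gap = resolve("A", "C")
# Grouping changes the answer, i.e. the toy resolver is not associative.
left = resolve(resolve("A", "B"), "C")
right = resolve("A", resolve("B", "C"))
```

In this toy model, resolving A and C without B lands back on A, which is the shape of the reset described above, and the two groupings of the same three states disagree.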
It was November 2016, just for the record. |
Something else maybe worth recording: in general, it is not the case that |
Following further into this discussion:
|
If an HS missed the event after the krtdex.com event (and wasn't able to backfill it later), then it would end up with a gap in the DAG, and the krtdex.com event would be a forward extremity. Then, next time it sent an event, it would try to heal the DAG by sending an event which referenced all of its known forward extremities as prev_events.
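As a rough sketch of that healing step (assuming a simplified model in which a server's view of the room is just a mapping from event ID to its prev_events; the function name and event IDs below are hypothetical, not Synapse code):

```python
def forward_extremities(known_events):
    """Return the known events that no other known event lists as a prev_event.

    known_events maps event_id -> list of prev_event ids. Referenced ids that
    are missing from the mapping are gaps in this server's copy of the DAG.
    """
    referenced = {prev for prevs in known_events.values() for prev in prevs}
    return sorted(eid for eid in known_events if eid not in referenced)


# Hypothetical view of a room: e3 was never received, so e4 dangles off a gap.
known = {
    "e1": [],
    "e2": ["e1"],
    "e4": ["e3"],
}

# The next event this server sends would list both extremities as prev_events,
# tying an old point in the graph (e2) to a newer one (e4).
prev_events_for_next_event = forward_extremities(known)
```

Because the server never saw e3, it has no way to tell from its own data whether e2 is an ancestor of e4, so both end up as prev_events of its next event.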
It would do if it knew that one was a parent of another, but - as above - in this case
I don't really know what you mean here. |
Fair points for the missing DAG, even though that seems implementation-dependent: one might choose not to let events be created until the missing DAG has been fetched - with unforeseen consequences to me at this point, so putting that one aside for now.
I would expect the state conflict resolution algo to only be applied to different branches of the DAG, but in this case it's the same one, and I don't see how that will not cause issues if they are intermediary states. But the real question is why would you apply this to the same DAG branch? Can't a HS ignore the parent events in the resolution? |
Maybe I'm not wording this correctly, but what I mean to say is: it seems to me like this is unsolvable if you allow state conflict resolution algo to be applied to events in the same branch, since I don't see how you can guarantee avoiding a missing intermediary state in the missing part, so fetching the DAG between the events you try to apply the state to is also needed, whatever is done. |
I'm not sure what you mean by "events in the same branch", but I guess you mean events where one is an ancestor of another? I don't think it gets any easier if you restrict it so that state resolution can only be performed between events which are not ancestors of one another.
It should be noted that you will receive any state updates one way or another, even if you don't have the complete DAG. The problem is that you don't (necessarily) have the complete history of how the state got that way, which is (one of) the reasons the state resolution algorithm can't require you to walk the entire DAG. |
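A small sketch of why "is one event an ancestor of another?" cannot always be answered from a partial DAG. The data model (event ID mapped to prev_events) and the three-valued return are assumptions chosen for illustration, not how any homeserver implements this:

```python
def is_ancestor(known_events, ancestor, descendant):
    """Walk prev_events back from `descendant` looking for `ancestor`.

    Returns True if found, False if the whole known ancestry was searched
    without finding it, and None if the walk hit an event we do not have,
    in which case the answer is unknown.
    """
    stack = list(known_events.get(descendant, []))
    seen = set()
    hit_gap = False
    while stack:
        event_id = stack.pop()
        if event_id == ancestor:
            return True
        if event_id in seen:
            continue
        seen.add(event_id)
        if event_id not in known_events:
            hit_gap = True  # referenced but never received
            continue
        stack.extend(known_events[event_id])
    return None if hit_gap else False


complete = {"A": [], "B": ["A"], "C": ["B"]}
with_gap = {"A": [], "C": ["B"]}  # B was never received
```

With the complete graph, `is_ancestor(complete, "A", "C")` is `True`; with B missing, the same question returns `None`, so a rule such as "skip resolution when one event is an ancestor of the other" has nothing to act on.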
So, let's get back to basics. The state resolution algorithm is used when you need to find out the room state for an event which has 2 or more references in From the discussion so far, you have at least two possibilities:
For 1), I don't see an issue at first glance with the current algorithm, since you effectively need to merge together two views of the rooms that diverged and have not interacted with each other. Now, for 2), taking back your example in offtopic, State (A) is an ancestor of (C), which means (A) is already "included" in (C). If you try to merge them (or replay them on top of each other), you will of course end up with a reset. It doesn't even make sense to want to merge A and C as there is no conflict to start with! At this point,
Use 1) is informative, and does not force anything on the other homeservers, and does not rely on a homeserver having the full DAG subset to make an informed decision about state. So you can put anything you want in there. The more the better at this point. On the other hand, 2) is authoritative and since events are immutable once stored (putting redact aside for this), you need to get the state bit right. At best, the event Therefore, I think the whole algo is fundamentally flawed with:
So, assuming the
Hopefully this makes sense. |
Not necessarily. Consider if the graph was like this:
kamp.site has two events in its prev_events, neither of which is an ancestor of another, yet we still have a problem when resolving the state between the two.
It's almost all about case 2. 1 is just a side-effect.
well, that would be one approach to solving this problem; however Matrix has to date operated under the premise that you can continue to see events and be affected by state changes even if there is a hole in the DAG, and the effects of changing that premise aren't entirely clear to me right now. Furthermore, it's not clear to me that it's the only approach to solving this problem: for instance, if we could ensure that the state res algorithm was associative, then it would also solve the problem. Conflict-free Replicated Data Types might offer one way of ensuring that - but it might also be hard to map the power of our power_levels onto a CRDT. Anyway the point is that there may be more than one way to do this. |
I don't see an issue. You do have potential conflicts, of course, but the algorithm would propose one way to solve it without going back to a previous state. Instead you would ignore an invalid state. My understanding is: Am I missing something? |
you said:
My diagram above is (I think) an example of 1). If we follow the current algorithm, then the power_levels are still unexpectedly reset to A on state resolution. So I am saying that the current algorithm is not satisfactory even in this case.
Well, that sounds like a different algorithm.
We do ignore an invalid state. The question is over how you define "invalid".
It's an algorithm which produces surprising, and often unsatisfactory, results. There may exist additional information which could be made available to the state resolution algorithm, and a different algorithm might produce better results. Calling the current algorithm "invalid" is maybe a bit strong, but it's certainly not optimal.
I don't really understand what you mean here. |
Indeed :( so the current algo is just doomed
If you have a netsplit, and that occurring at the same time (same depth, same
Then one state is no better than the other, in terms of intention from the users, but they can't both be true. One event needs to be ignored. But in #offtopic, it's not that any event is invalid - they are all valid and possible - and the state shouldn't change; it's that state A is evaluated compared to C, when they shouldn't be in the first place. This is what I mean: you have state change, which is due to a bug in an algorithm but should not have happened, and you have state choice, where no algorithm can give you an outcome closer to a regular person's expectation. I'm not trying to tell how to solve it at this point, I am only trying to understand what is really going on and why so I can actually implement it and then maybe help fix it (even though algorithms are not my forte) |
Meanwhile I've put together a test case for mxhsd after I've implemented the current state resolution algo which should include state resets. Hopefully it will provide more clues for people to grasp the issue at hand. |
Got this on MatrixHQ recently. Yay :| |
Is this considered fixed in v3 (and v2)? |
I'm in at least one v3 room that has inconsistent state between homeservers if not true "state resets". |
@non-Jedi this is news to us; are these rooms something we can debug? |
Note that "state resets" per se do not cause inconsistent state between homeservers. All servers agree on what the state is; it just doesn't match what a human expects it to be. So if you're seeing inconsistency between servers please consider it a separate issue. @aaronraimist it's believed to be fixed, or at least significantly improved, but since the problem is due to emergent behaviour when the basic algorithms are applied to complex systems, it's hard to be certain at this stage. |
We believe this to be fixed (or at least massively improved) as of more recent room versions. |
This might be closely related to #1940, but in this case it's definitely not due to rejected events - and it appears the user can interact with the room as if they had never left