Thoughts on improving chat systems #5

Open
utterances-bot opened this issue Sep 21, 2019 · 3 comments

@utterances-bot

Thoughts on improving chat systems

I’m quite interested in instant messaging technology! I’ve been an avid user of internet relay chat (IRC) –probably the oldest chat protocol in existence – f...

https://theta.eu.org/2019/09/10/chat-systems.html

ara4n commented Sep 21, 2019

"Oh boy" indeed. Speaking as the Matrix project lead, some of the opinions here on Matrix feel unhelpfully harsh / dated / inaccurate - I wish there'd been the option to discuss prior to publication.

> Essentially, the problem is that the protocol requires a “state resolution” algorithm to verify permissions in a chatroom. This is the root of all sorts of performance issues, and also has been the source of security issues in the past (q.v.)

This is like saying "the source of all the problems in git is that you have to merge branches together". There is nothing intrinsically wrong with Matrix's state resolution algorithm. Sure, the initial beta version had bugs which we fixed for 1.0 - and sure, a naive implementation is a performance dog. Just like the beta versions of git were probably unperformant and buggy too. But we spent ages making damn sure that we got the thinkos solved and also did the work to make damn sure that performant implementations are possible. As it happens performance right now is okay enough that we haven't prioritised implementing incremental state res as per that design, but it's not an intrinsic "problem" in the protocol.
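
For readers unfamiliar with the idea, the property a state resolution algorithm needs is that every server, given the same set of conflicting events, picks the same winner regardless of the order in which it received them. The sketch below illustrates only that property. It is not Matrix's state resolution algorithm, which also has to check authorisation rules against the event graph; the `StateEvent` fields and the ranking rule in `_rank` are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

StateKey = Tuple[str, str]  # (event type, state key)


@dataclass(frozen=True)
class StateEvent:
    # Hypothetical, simplified event shape for this sketch only.
    event_id: str
    key: StateKey
    sender_power: int
    timestamp: int
    content: str


def _rank(ev: StateEvent) -> Tuple[int, int, str]:
    # Illustrative total order: higher sender power first, then older
    # timestamp, then event ID as a final tie-break.
    return (-ev.sender_power, ev.timestamp, ev.event_id)


def resolve(events: Iterable[StateEvent]) -> Dict[StateKey, StateEvent]:
    """Pick one winning event per state key, independent of input order."""
    resolved: Dict[StateKey, StateEvent] = {}
    for ev in events:
        current = resolved.get(ev.key)
        if current is None or _rank(ev) < _rank(current):
            resolved[ev.key] = ev
    return resolved
```

Because `_rank` is a total order over events with unique IDs, two servers that saw the same events in a different order arrive at the same resolved state.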

> There are long-standing related issues like #1760 that still aren’t fixed and have lots of users complaining.

We fixed 1760 in matrix-org/synapse#5480 in June. At the time the fix was considered experimental, but having tested it in the wild it actually works fine and we're planning to turn it on by default in Synapse 1.4 (due next week). The only reason it hasn't been turned on earlier was because of an edge case which should land for 1.4.

> (although there’s now a somewhat hacky workaround for this that involves sending dummy events into the room)

No, it's not a "somewhat hacky workaround". It's the equivalent of checkpointing a database or incrementally defragmenting a filesystem in order to keep the datastructure in check. We don't "send" dummy events but instead insert them as checkpoint events into the local copy of the room to mitigate fragmentation. I may be biased given I proposed it.
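
To make the checkpointing idea concrete, here is a minimal sketch of a room modelled as a graph of events, where each event lists its parents. The events that nothing else references are the forward extremities, and a single checkpoint event that points at all of them collapses them into one. The function names and the toy `room` are hypothetical and only illustrate the concept described above; this is not Synapse's code.

```python
from typing import Dict, List, Set

# Maps each event ID to the IDs of the events it references as parents.
EventGraph = Dict[str, List[str]]


def forward_extremities(graph: EventGraph) -> Set[str]:
    """Return the events that no other event references as a parent."""
    referenced = {parent for parents in graph.values() for parent in parents}
    return set(graph) - referenced


def insert_checkpoint(graph: EventGraph, checkpoint_id: str) -> EventGraph:
    """Return a copy of the graph with a checkpoint event added.

    The checkpoint's parents are all current forward extremities, so it
    becomes the only extremity afterwards.
    """
    updated = dict(graph)
    updated[checkpoint_id] = sorted(forward_extremities(graph))
    return updated


# Toy room: one root event followed by three forks that were never merged.
room: EventGraph = {
    "$root": [],
    "$a": ["$root"],
    "$b": ["$root"],
    "$c": ["$root"],
}
```

In this toy room, `forward_extremities(room)` contains the three unmerged forks, and after `insert_checkpoint(room, "$dummy")` only the checkpoint itself remains as an extremity.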

> In my personal experience, I ran a Matrix homeserver for the best part of the year until I got fed up with it

From memory, this was prior to 1.0, when we did indeed have some pretty bad performance issues. Since then we have been constantly plugging away at performance, which has been slowly but steadily improving. It is excruciatingly frustrating to be judged on bad performance and experience from when we were in beta :(

> When I used to use it, I’d frequently encounter problems with messages just not being delivered, or push notifications breaking.

This is genuinely surprising. I don't believe I have ever experienced a situation on any of the servers I have visibility on (matrix.org, my personal ones, *.modular.im) where messages have not been delivered. The only scenario I know where this could have happened is matrix-org/synapse#2528 - i.e. after an outage, other servers may not retry sending to a previously dead server until the next time someone speaks in the room. Most other protocols (email, activitypub, etc) suffer the same. Is this what happened? If not, please can you point me at the bug you presumably filed so we can investigate?

In terms of push notifications breaking: Riot/iOS & Android have both suffered issues in the past where failure to refresh push tokens with the push server (due to server unavailability) could cause the app to sulk and turn off push entirely. However, I believe these failure modes were fixed over a year ago. Again, any further information so we can check that this was fixed would be great. (Otherwise, it'd be great not to be judged on bugs which were fixed ages ago).

> …has a questionable security story

> The official matrix.org server has been hacked in the past, although this is nothing to do with the protocol.

...precisely :/

> The original version of the protocol had a bug known as “state resets”, where room state would be reset back to some earlier version.

Yes, the pre-1.0 beta had a bug in the state resolution algorithm which we fixed. And yes, the way to fix bugs in a state resolution algorithm is to migrate your data to use a new version of the algorithm. I'm not sure this counts as a 'questionable security story'.

> looking at their list of open and closed security bugs isn’t exactly reassuring.

Weirdly enough we don't use our public github issue tracker for serious security issues. The security tag there almost by definition refers to issues which are sufficiently non-impacting that they can be left in public view. Bashing a project based on a stats analysis of its bugtracker ends up telling you a lot more about the bug filing philosophy of the project than its actual bugginess, imo. We deliberately try to put anything & everything in our bugtrackers, to use them as a knowledge repository of all conceivable defects of the system. It's the diametric opposite of something like (say) postgres, where last time I checked they didn't even have a bugtracker.

Our internal security bug tracker gives a much clearer view, which you can see the results of in synapse's changelog by grepping for security (or looking at the hall of fame at https://matrix.org/security-disclosure-policy).

> …has a questionable data model for chat purposes

I guess it depends on whether you consider conversation history a first class citizen or not. A good analogy is IMAP versus POP. IMAP is a serverside knowledge store you can depend on; POP was an awful hack to queue up your messages so you could get them onto your client. I'd rather use IMAP for my email than POP, and for the same reasons I think Matrix has the right data model here.

That said, if someone wants to do chat over ActivityPub, go wild - we'll just go ahead and bridge it into Matrix :)

@ara4n

> As it happens performance right now is okay enough that we haven't prioritised implementing incremental state res as per that design, but it's not an intrinsic "problem" in the protocol.

It absolutely is a problem, but that isn't necessarily a negative thing. It's a problem you have spent a lot of time and resources dealing with, and have solved to your satisfaction -- but the fact that you had to spend those resources on it means that it absolutely is a problem. You said it's as much of a problem as "merging branches in git", but the last time I did real enterprise work I had a lot of problems with merging branches, and most of the time ended up just pulling down the repo fresh and copying my work into it, because that was less work.

> and sure, a naive implementation is a performance dog. Just like the beta versions of git were probably unperformant and buggy too.

Right, and that is a problem for people who are implementing the protocol -- i.e. any implementation other than yours. Who will wish to reimplement these things from source?

Well, according to your SDKs page there isn't any C implementation (The C implementation listed actually links to Objective-C, "Matrix.org's reusable UI interfaces for iOS"). So anyone making their own language, or anyone that wants to use a niche language, has to implement that from scratch.

> No, it's not a "somewhat hacky workaround". It's the equivalent of checkpointing a database or incrementally defragmenting a filesystem in order to keep the datastructure in check.

Filesystem defragmentation is fundamentally a hack to cope with a badly designed file system. Literally every other filesystem that isn't NTFS does not require defragmentation. Maybe that's a bad example, though?

> It is excruciatingly frustrating to be judged on bad performance and experience from when we were in beta :(

Right, but this is an article about replacing chat systems. If someone wants to write their own interface to Matrix, or their own Matrix server, it's pretty clear to see that if it goes anything like the primary interface, it's difficult to implement successfully, and slow even if you somehow manage to make it bug-free.

> Weirdly enough we don't use our public github issue tracker for serious security issues. The security tag there almost by definition refers to issues which are sufficiently non-impacting that they can be left in public view.

Right, so what you're saying is that the bug tracker, as of now, does not contain the most serious security problems, and yet somehow based on a look of the non serious bugs, it still manages to look pretty hellish. That doesn't seem like a good thing.

> Bashing a project based on a stats analysis of its bugtracker ends up telling you a lot more about the bug filing philosophy of the project than its actual bugginess,

If the project can't be bothered to update the bugtracker, or has bad practice around that, then it's fundamentally hostile to both collaborators and users. I've dealt with bad bugtrackers with a lot of chat applications. It's complete and utter hell and has been the reason outright why I have not adopted certain chat systems.

> I guess it depends on whether you consider conversation history a first class citizen or not.

Conversation history has to be a first class citizen, or you are, on behalf of your users, throwing their data away without their choice. I know many people do not like to keep chat history, but that's not an excuse to throw out chat history full stop.

ara4n commented Sep 22, 2019

@AlexandriaOL - thanks for responding to my points, although I think you might have misunderstood where I was coming from on some of them (probably my fault for not being clearer):

> You said it's as much of a problem as "merging branches in git", but the last time I did real enterprise work I had a lot of problems with merging branches, and most of the time ended up just pulling down the repo fresh and copying my work into it, because that was less work.

In git, every time more than one person works on a repository, git is effectively merging conflicted branches (their clone of the repo with your clone of the repo) together under the hood. i.e. every time you git pull, you are running a decentralised merge resolution algorithm to merge the remote changes into your local repository. It's the fundamental way that a decentralised version control system like git works, just like state resolution is the fundamental way a decentralised DB like Matrix works. I'm sure that in the early days of git (or its predecessors) it was hard to get this algorithm correct and performant, but once they figured it out and documented it (e.g. https://github.com/git/git/blob/master/Documentation/technical/trivial-merge.txt) then other implementations could just stand on their shoulders and implement it in other languages & environments without much trouble.
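
As a rough illustration of the kind of rule such a merge applies, here is a minimal three-way merge over a toy "tree" that maps paths to content hashes. It is the generic textbook rule (take whichever side changed relative to the common base, and flag a conflict only when both changed differently), not git's implementation; the document linked above covers more cases. The `Tree` type and `three_way_merge` name are made up for this sketch.

```python
from typing import Dict, Set, Tuple

# Toy tree: maps a file path to an identifier for its content.
Tree = Dict[str, str]


def three_way_merge(base: Tree, ours: Tree, theirs: Tree) -> Tuple[Tree, Set[str]]:
    """Merge two trees that diverged from a common base.

    Returns the merged tree and the set of paths that need manual resolution.
    A missing path is treated as None, so deletions are handled like edits.
    """
    merged: Tree = {}
    conflicts: Set[str] = set()
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            chosen = o  # both sides agree (including both deleting the path)
        elif o == b:
            chosen = t  # only their side changed it
        elif t == b:
            chosen = o  # only our side changed it
        else:
            conflicts.add(path)  # both sides changed it differently
            continue
        if chosen is not None:
            merged[path] = chosen
    return merged, conflicts
```

Most paths fall into the first three branches and merge without anyone noticing, which is the sense in which every pull is a merge; only the last branch surfaces as a conflict.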

> Right, and that is a problem for people who are implementing the protocol -- i.e. any implementation other than yours. Who will wish to reimplement these things from source?

Well, empirically people are, just as there are independent implementations of Git. The C/ObjC stuff you mentioned refers to client SDKs, which are super simple to write and don't do state res; only server implementations do state res when talking to each other. But a working independent server implementation of state res exists at https://github.com/matrix-construct/construct/ (in C++, if you're focusing on languages).

> Filesystem defragmentation is fundamentally a hack to cope with a badly designed file system. Literally every other filesystem that isn't NTFS does not require defragmentation. Maybe that's a bad example, though?

My point was that modern filesystems don't need manual defragmentation because they effectively defragment themselves in the background, shuffling data around to avoid fragmentation as they go. This is a direct analogy to the situation with matrix-org/synapse#1760 (which started off effectively forcing users to manually prune their extremities whenever performance got bad) and the fix which landed in matrix-org/synapse#5319, which stops the problem building up by solving it transparently in the background. So I think it's an appropriate example.

> If someone wants to write their own interface to Matrix, or their own Matrix server, it's pretty clear to see that if it goes anything like the primary interface, it's difficult to implement successfully, and slow even if you somehow manage to make it bug-free.

Writing 'interfaces to Matrix' (i.e. clients or client SDKs) is trivial - there are literally hundreds of successful ones now. Writing a Matrix server is indeed harder, but the reason the reference implementation took us ages and isn't perfect isn't because "Matrix is hard to implement well" but because we were figuring it out for the first time as we went, complete with missteps, while also optimising for client developers rather than server developers, given clients are where the users are at. But I think the project should be judged on the end result, not the journey.

> Right, so what you're saying is that the bug tracker, as of now, does not contain the most serious security problems, and yet somehow based on a look of the non serious bugs, it still manages to look pretty hellish. That doesn't seem like a good thing.

No, what I'm saying is that the public bug tracker has the least serious security issues on it... which is why there should be no surprise that there are more open than closed, because by definition they're less important(!!) Judging a project by the number of open bugs is like judging a book by its number of pages. It doesn't tell you much other than the size of the book.

> If the project can't be bothered to update the bugtracker, or has bad practice around that, then it's fundamentally hostile to both collaborators and users.

The bugtracker is kept pretty well up-to-date. However, we keep unsolved issues open, however minor or deprioritised they are, rather than arbitrarily closing them to "make things look better". If you consider that bad practice, then :/

> Conversation history has to be a first class citizen, or you are, on behalf of your users, throwing their data away without their choice. I know many people do not like to keep chat history, but that's not an excuse to throw out chat history full stop.

My point was that Matrix is unusual among chat protocols because it treats conversation history as a first class citizen (and decentralises ownership of it over the participating users). This is why it ends up being a decentralised DB, which seems to be the root of eta's complaints. I am not proposing throwing out chat history; just the opposite - treating it as the first class citizen it should be, which by extrapolation means you end up with a protocol that looks like Matrix.
