diff --git a/proposals/3898-sfu.md b/proposals/3898-sfu.md
new file mode 100644
index 00000000000..4755f103974
--- /dev/null
+++ b/proposals/3898-sfu.md
@@ -0,0 +1,443 @@
+# MSC3898: Native Matrix VoIP signalling for cascaded SFUs
+
+[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
+specifies how full-mesh group calls work in Matrix. While that MSC works well
+for small group calls, it does not work so well for large conferences due to
+bandwidth (and other) issues.
+
+Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
+between peers (which could be clients or SFUs or both). To make use of them
+effectively, peers need to be able to tell the SFU which streams they want to
+receive and at what resolutions.
+
+To solve the issue of centralization, the SFUs are also allowed to connect to
+each other ("cascade") and therefore the peers also need a way to tell an SFU
+which other SFUs to connect to.
+
+## Proposal
+
+- **TODO: spell out how this works with active speaker detection & associated
+signalling**
+
+### Diagrams
+
+The diagrams of how this all looks can be found in
+[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).
+
+### Additions to the `m.call.member` state event
+
+This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
+`m.foci.preferred` and `m.foci.active`.
+
+Informational: This attempts to avoid the situation where a conference is ongoing
+with several users in, for example, New York. These users are all connected to the
+focus in New York. Alice joins from London: rather than connecting to the focus
+in London, she connects directly to the one in New York since that's where all the
+other participants are connected. If more users then join from London, however, they
+will all make the same decision and connect to the New York focus rather than the
+optimal configuration of the London users connected to the London focus.
+With active and preferred foci, the second user that joins from London will know that although
+Alice's active focus is New York, her preferred is London, and can therefore choose
+the London focus instead.
+
+For instance:
+
+```json
+{
+    "type": "m.call.member",
+    "state_key": "@matthew:matrix.org",
+    "content": {
+        "m.calls": [
+            {
+                "m.call_id": "cvsiu2893",
+                "m.devices": [{
+                    "device_id": "U738KDF9WJ",
+                    "m.foci.active": [
+                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
+                    ],
+                    "m.foci.preferred": [
+                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
+                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
+                    ]
+                }]
+            }
+        ],
+        "m.expires_ts": 1654616071686
+    }
+}
+```
+
+#### `m.foci.active`
+
+This field is a list of foci the user's device is publishing to. Usually, this
+list will have a length of 1, but a client might publish to multiple foci if
+they are on different networks, for instance, or to simultaneously fan-out in
+different directions from the client if there is no nearby focus. If the client
+is participating full-mesh, it should either omit this field from the state
+event or leave the list empty.
+
+#### `m.foci.preferred`
+
+This field is a list of foci the client would prefer to switch to from the
+current active focus, if any other client also starts using the given focus. If
+the client is already using one of its preferred foci, it should either omit
+this field from the state event or leave the list empty.
+
+### Choosing a focus
+
+#### Discovering foci
+
+- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**
+
+Foci are identified by a tuple of `user_id` and `device_id`.
+
+#### Determining the best focus
+
+There are many ways to determine the best focus; this MSC recommends the
+following:
+
+- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
+- Is the quickest to rapidly reject a spurious HTTPS request to a high-numbered
+  port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to
+  the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to
+  measure media-path latency rather than signalling path latency.
+- Has the best latency of data-channel traffic flows.
+- Has the best latency and bandwidth determined by sending a small splurge of
+  media down the pipe to probe.
+
+#### Joining a call
+
+The following diagram explains how a client chooses a focus when joining a call.
+
+```mermaid
+flowchart TD;
+wantsToJoin[Wants to join a call];
+hasPreferred(Has preferred focus?);
+callPreferred[Calls preferred foci without media to grab a slot];
+publishPreferred[Publishes `m.foci.preferred`];
+checkMembers(Call has more than 2 members including the client itself?);
+callFullMesh[Calls other member full-mesh];
+callMembersFoci[Tries calling foci from `m.call.member` events];
+orderFoci[Orders foci from best to worst];
+findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
+callBestPreferred[Calls the focus];
+callBestActive[Calls the best active focus in room];
+publishActive[Publishes `m.foci.active`];
+
+wantsToJoin-->hasPreferred;
+hasPreferred--->|Yes|callPreferred;
+hasPreferred--->|No|checkMembers;
+callPreferred--->publishPreferred;
+publishPreferred--->checkMembers;
+checkMembers--->|Yes|callMembersFoci;
+checkMembers--->|No|callFullMesh;
+callMembersFoci--->orderFoci;
+orderFoci--->findFocusPreferredByOtherMember;
+findFocusPreferredByOtherMember--->|Found|callBestPreferred;
+callBestPreferred--->publishActive;
+findFocusPreferredByOtherMember--->|Not found|callBestActive;
+callBestActive--->publishActive;
+```
+
+#### Mid-call changes
+
+Once in a call, the client listens for changes to `m.call.member` state events
+and if another member starts using one of the client's preferred foci, the client
+switches to that focus.
+
+**TODO: other cases?**
+
+### Initial offer/answer dance
+
+During the initial offer/answer dance, the client establishes a data-channel
+between itself and the SFU to use later for rapid signalling.
+
+### Simulcast
+
+#### RTP munging
+
+#### VP8 munging
+
+### RTCP re-transmission
+
+### Data-channel messaging
+
+The client uses the established data channel connection to the SFU to perform
+low-latency signalling to rapidly (un)subscribe/(un)publish streams, send
+ping messages and metadata, cascade, and perform re-negotiation.
+
+See the section about the [rationale](#the-use-of-the-data-channels-for-signaling)
+behind the use of the data channels for signaling.
+
+- **TODO: Spell out how the DC traffic interacts with application-layer
+traffic**
+
+#### SDP Stream Metadata extension
+
+The client will be receiving multiple streams from the SFU and it will need to
+be able to distinguish them. This MSC therefore builds on
+[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
+[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
+provide the client with the necessary metadata. Some of the data-channel events
+include an `sdp_stream_metadata` field containing a description of the stream
+being sent either from the SFU to the client or from the client to the SFU.
+
+Other than mute information and stream purpose, the metadata includes video
+track resolution. The SFU may not be able to determine the resolution of the
+track itself but it does need to know for simulcast; therefore, we include this
+in the metadata.
+
+```json
+{
+    "streamId1": {
+        "purpose": "m.usermedia",
+        "audio_muted": false,
+        "video_muted": true,
+        "tracks": {
+            "trackId1": {
+                "width": 1920,
+                "height": 1080
+            },
+            "trackId2": {}
+        }
+    }
+}
+```
+
+#### Event types
+
+This MSC adds a few new `m.call.*` events and extends a few of the existing ones.
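
As an illustration of the `type`/`content` envelope that the events in this section share, here is a minimal sketch of how a client or focus might validate an incoming data-channel message before acting on it. The names `KNOWN_EVENT_TYPES`, `DataChannelEvent` and `parseDataChannelEvent` are hypothetical and not part of this MSC; only the event type strings are taken from the proposal, and the sketch assumes messages arrive as JSON text.

```typescript
// Illustrative only: every identifier here is hypothetical.
// Only the event type strings come from this MSC.
const KNOWN_EVENT_TYPES = [
    "m.call.track_subscription",
    "m.call.negotiate",
    "m.call.sdp_stream_metadata_changed",
    "m.call.ping",
    "m.call.pong",
    "m.call.connect_to_focus",
] as const;

type KnownEventType = (typeof KNOWN_EVENT_TYPES)[number];

interface DataChannelEvent {
    type: KnownEventType;
    content: Record<string, unknown>;
}

// Returns the parsed event, or null if the message is not valid JSON,
// has an unrecognised `type`, or lacks an object-valued `content`.
function parseDataChannelEvent(raw: string): DataChannelEvent | null {
    let parsed: unknown;
    try {
        parsed = JSON.parse(raw);
    } catch (e) {
        return null;
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const candidate = parsed as { type?: unknown; content?: unknown };

    // Look the type up in the list of events this sketch knows about.
    let known: KnownEventType | undefined;
    for (let i = 0; i < KNOWN_EVENT_TYPES.length; i++) {
        if (KNOWN_EVENT_TYPES[i] === candidate.type) known = KNOWN_EVENT_TYPES[i];
    }
    if (known === undefined) return null;

    const content = candidate.content;
    if (typeof content !== "object" || content === null || Array.isArray(content)) {
        return null;
    }
    return { type: known, content: content as Record<string, unknown> };
}
```

Silently dropping unrecognised types is an assumption of this sketch; the MSC does not say how unknown data-channel events should be handled.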
+
+##### `m.call.track_subscription`
+
+This event is sent to the focus to let it know about the tracks the client would
+like to start/stop subscribing to.
+
+Upon receiving this event, a focus should make the subscription changes based on
+the `subscribe` and `unsubscribe` arrays and respond with an `m.call.negotiate` event.
+
+In the case of video tracks, in the `subscribe` array the client may also request a
+specific resolution for a given track; this resolution is a resolution the
+client wishes to receive but the SFU may send a lower one due to bandwidth etc.
+
+If the user for example switches from "spotlight" (one large tile) to "grid"
+(multiple small tiles) view, the client should also send this event with the updated
+resolution in the `subscribe` array to let the focus know of the resolution change.
+
+Clients may request each track only once: foci should ignore repeated requests
+for the same track.
+
+- **TODO: how do we prove to the focus that we have the right to subscribe to
+a track?**
+
+```json
+{
+    "type": "m.call.track_subscription",
+    "content": {
+        "subscribe": [
+            {
+                "stream_id": "streamId1",
+                "track_id": "trackId1",
+                "width": 1920,
+                "height": 1080
+            },
+            {
+                "stream_id": "streamId2",
+                "track_id": "trackId2",
+                "width": 256,
+                "height": 144
+            }
+        ],
+        "unsubscribe": [
+            {
+                "stream_id": "streamId3",
+                "track_id": "trackId4"
+            },
+            {
+                "stream_id": "streamId4",
+                "track_id": "trackId4"
+            }
+        ]
+    }
+}
+```
+
+##### `m.call.negotiate`
+
+This event works exactly like the `m.call.negotiate` event in 1:1 calls.
+
+```json
+{
+    "type": "m.call.negotiate",
+    "content": {
+        "description": {
+            "type": "offer",
+            "sdp": "..."
+        },
+        "sdp_stream_metadata": {...} // As specified in the Metadata section
+    }
+}
+```
+
+##### `m.call.sdp_stream_metadata_changed`
+
+This event works very similarly to the 1:1 call `m.call.sdp_stream_metadata_changed`.
+
+- **TODO: Spec how foci actually use this to advertise tracks**
+
+```json
+{
+    "type": "m.call.sdp_stream_metadata_changed",
+    "content": {
+        "sdp_stream_metadata": {...} // As specified in the Metadata section
+    }
+}
+```
+
+##### `m.call.ping`, `m.call.pong`
+
+A ping message must be sent by the focus to the client at an interval
+no greater than 30 seconds. On receiving a ping message, a client must respond
+immediately with a pong message. A client may therefore detect that the
+connection has failed after an amount of time of its choosing (greater than
+30 seconds) has elapsed since it last saw a ping message. A server may deem a
+client unresponsive after not receiving a pong some amount of time after it
+has sent a ping; again, the amount of time the server waits is up to the
+implementation. Either side should hang up once deeming the other side
+unresponsive.
+
+focus -> client:
+
+```json
+{
+    "type": "m.call.ping",
+    "content": {}
+}
+```
+
+client -> focus:
+
+```json
+{
+    "type": "m.call.pong",
+    "content": {}
+}
+```
+
+##### `m.call.connect_to_focus`
+
+If a user is using their focus in a call, that focus will need to know how to connect to
+other foci present in order to participate in the full-mesh of SFU traffic (if
+any). The client is responsible for doing this using the
+`m.call.connect_to_focus` event.
+
+```json
+{
+    "type": "m.call.connect_to_focus",
+    "content": {
+        // TODO: How should this look?
+    }
+}
+```
+
+### Notes
+
+#### Hiding behind foci
+
+We do not recommend that users utilise a focus to hide behind for privacy, but
+instead use a TURN server, only providing relay candidates, rather than
+consuming focus resources and unnecessarily mandating the presence of a focus.
+
+## Potential issues
+
+The SFUs participating in a conference end up in a full mesh.
+Rather than inventing our own spanning-tree system for SFUs however, we should fix it for
+Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
+similar to decide what better-than-full-mesh topology to use. In practice, full
+mesh cascade between SFUs is probably not that bad (especially if SFUs only
+request over the trunk the streams their clients care about) - and in aggregate
+will be less obnoxious than all the clients hitting a single SFU.
+
+Too many foci will chew bandwidth due to full-mesh between them. In the worst
+case, if every user is on their own HS and picks a different focus, it degenerates
+to a full-mesh call (just server-side rather than client-side). Hopefully this
+shouldn't happen as you will converge on using a single SFU with the most
+clients, but need to check how this works in practice.
+
+SFrame mandates its own ratchet currently which is almost the same as megolm but
+not quite. Switching it out for megolm seems reasonable right now (at least
+until MLS comes along).
+
+## Alternatives
+
+An option would be to treat 1:1 (and full mesh) entirely differently to SFU
+based calling rather than trying to unify them. Also, it's debatable whether
+supporting full mesh is useful at all. In the end, it feels like unifying 1:1
+and SFU calling is for the best though, as it then gives you the ability to
+trivially upgrade 1:1 calls to group calls and vice versa, and avoids
+maintaining two separate hunks of spec. It also forces 1:1 calls to take
+multi-stream calls seriously, which is useful for more exotic capture devices
+(stereo cameras; 3D cameras; surround sound; audio fields etc).
+
+### The use of the data channels for signaling
+
+The current specification assumes that signaling works over Matrix, but
+side-chains to the data channel once the peer connection is established
+in order to perform low-latency signaling.
+
+In an ideal scenario the use of the data channels would not be required and
+the usage of native Matrix signaling would be sufficient. However, due to
+the fact that regular Matrix signaling may need to traverse different
+servers, e.g. `client <-> home server <-> home server <-> sfu`, our
+signaling would not be quite as fast as we need it to be. The effect will
+be even greater when coupled with the fact that certain protocols like
+HTTP would not be as efficient for real-time communication as e.g. WebRTC
+data channels or WebSockets.
+
+The problem would be solved if the clients could connect to the SFU
+**directly** and communicate via Matrix for all signaling messages. This
+would allow us to use a faster transport (WebSockets, QUIC etc) to transmit
+signaling messages. However, this is **currently** not possible due to the fact
+that it would require the support of P2P Matrix, which is still under
+development at the time of writing this MSC.
+
+To read more about the problem and get more context, please refer to the
+[discussion](https://github.com/matrix-org/matrix-spec-proposals/pull/3898#discussion_r1019098025).
+
+### Cascading
+
+One option here is for SFUs to act as an AS and sniff the `m.call.member`
+traffic of their associated server, and automatically call any other `m.foci`
+which appear. (They don't need to make outbound calls to clients, as clients
+always dial in).
+
+## Security considerations
+
+Malicious users could try to DoS SFUs by specifying them as their foci.
+
+SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
+enough to all the participants when a participant leaves (and meanwhile if we
+keep using the old session, we're technically leaking call media to the parted
+participant until we manage to rotate).
+
+Need to ensure there's no scope for media forwarding loops through SFUs.
+
+In order to authenticate that only legitimate users are allowed to subscribe to
+a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
+sniff the `m.call` events on their associated server, and only act on to-device
+`m.call.*` events which come from a user who is confirmed to be in the room for
+that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
+connect to the SFU without having the keys to decrypt the traffic, but this
+feature is desirable for non-E2EE confs and to stop bandwidth DoS)
+
+## Unstable prefixes
+
+We probably don't care for this for the data-channel?
+
+While this MSC is not considered stable, implementations should use
+`org.matrix.msc3898` as a namespace.
+
+|Stable (post-FCP) |Unstable                           |
+|------------------|-----------------------------------|
+|`m.foci.active`   |`org.matrix.msc3898.foci.active`   |
+|`m.foci.preferred`|`org.matrix.msc3898.foci.preferred`|
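
To illustrate the prefix mapping in the table above, here is a minimal sketch of how a client might read a device's foci from an `m.call.member` device entry while both names are in circulation. The helper names (`readFoci`, `isFocus`, `UNSTABLE_FIELD_NAMES`) are hypothetical. Preferring the stable name and falling back to the unstable one is an assumption of this sketch, not something this MSC specifies.

```typescript
// Illustrative only: helper and type names are hypothetical.
// The field names themselves come from the unstable prefix table.
const UNSTABLE_FIELD_NAMES = {
    "m.foci.active": "org.matrix.msc3898.foci.active",
    "m.foci.preferred": "org.matrix.msc3898.foci.preferred",
} as const;

type StableFieldName = keyof typeof UNSTABLE_FIELD_NAMES;

// Foci are identified by a tuple of `user_id` and `device_id`.
interface Focus {
    user_id: string;
    device_id: string;
}

function isFocus(value: unknown): value is Focus {
    if (typeof value !== "object" || value === null) return false;
    const candidate = value as { user_id?: unknown; device_id?: unknown };
    return typeof candidate.user_id === "string" && typeof candidate.device_id === "string";
}

// Assumption: use the stable field if present, otherwise the unstable one.
// An omitted field and an empty list are treated the same way, matching the
// "omit this field or leave the list empty" wording of the proposal.
function readFoci(device: Record<string, unknown>, field: StableFieldName): Focus[] {
    const stable = device[field];
    const value = stable !== undefined ? stable : device[UNSTABLE_FIELD_NAMES[field]];
    if (!Array.isArray(value)) return [];
    return value.filter(isFocus);
}
```

For example, `readFoci(device, "m.foci.active")` would return the entries of `org.matrix.msc3898.foci.active` when only the unstable field is present on the device object.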