matrix-org · SimonBrandner · Oct 21, 2022 · Sep 25, 2022 · Sep 25, 2022 · Oct 2, 2022
diff --git a/proposals/3401-group-voip.md b/proposals/3401-group-voip.md
@@ -1,5 +1,9 @@
 # MSC3401: Native Group VoIP signalling
 
+Note: previously this MSC included SFU signalling which has now been moved to
+[MSC3898](https://github.com/matrix-org/matrix-spec-proposals/pull/3898) to
+avoid making this MSC too large.
+
 ## Problem
 
 VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room.
@@ -15,7 +19,11 @@ Meanwhile we have no native signalling for group calls at all, forcing you to in
 
 ## Proposal
 
-This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls, SFU calls, cascaded SFU calls and in future MCU calls, and hybrid SFU/MCU approaches. It replaces the early flawed sketch at [MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).
+This proposal provides a signalling framework using to-device messages which can
+be applied to native Matrix 1:1 calls, full-mesh calls and in the future SFU
+calls, cascaded SFU calls MCU calls, and hybrid SFU/MCU approaches. It replaces
+the early flawed sketch at
+[MSC2359](https://github.com/matrix-org/matrix-doc/pull/2359).
 
 This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.
 
@@ -44,7 +52,6 @@ SFU (aka Focus):
                | 
                |
                C
-
 Where F is an SFU focus
 ```
 
@@ -74,7 +81,6 @@ The user who wants to initiate a call sends a `m.call` state event into the room
  * `m.type` to say whether the initial type of call is voice only (`m.voice`) or video (`m.video`).  This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
  * `m.terminated` if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).  
  * `m.name` as an optional human-visible label for the call (e.g. "Conference call").
- * `m.foci` as an optional list of recommended SFUs that the call initiator can recommend to users who do not want to use their own SFU (because they don't have one, or because they would be the only person on their SFU for their call, and so choose to connect direct to save bandwidth).
  * The State key is a unique ID for that call. (We can't use the event ID, given `m.type` and `m.terminated` is mutable).  If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.
 
 For instance:
@@ -86,11 +92,7 @@ For instance:
     "content": {
         "m.intent": "m.room",
         "m.type": "m.voice",
-        "m.name": "Voice room",
-        "m.foci": [
-            "@sfu-lon:matrix.org",
-            "@sfu-nyc:matrix.org",
-        ],
+        "m.name": "Voice room"
     }
 }
 ```
@@ -108,7 +110,6 @@ When sending an `m.call.member` event, clients should choose a reasonable value
 The fields within the item in the `m.calls` contents are:
 
  * `m.call_id` - the ID of the conference the user is claiming to participate in.  If this doesn't match an unterminated `m.call` event, it should be ignored.
- * `m.foci` - Optionally, if the user wants to be contacted via an SFU rather than called directly (either 1:1 or full mesh), the user can also specify the SFUs their client(s) are connecting to.
  * `m.devices` - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
    * `device_id` - The device id to use for to-device messages when establishing a call
    * `session_id` - A unique identifier used for resolving duplicate sessions from a given device. When the `session_id` field changes from an incoming `m.call.member` event, any existing calls from this device in this call should be terminated. `session_id` should be generated once per client session on application load.
@@ -125,10 +126,6 @@ For instance:
         "m.calls": [
             {
                 "m.call_id": "cvsiu2893",
-                "m.foci": [
-                    "@sfu-lon:matrix.org",
-                    "@sfu-nyc:matrix.org",
-                ],
                 "m.devices": [
                     {
                         "device_id": "ASDUHDGFYUW", // Used to target to-device messages
@@ -191,8 +188,11 @@ For instance:
 }
 ```
 
-This builds on MSC #3077, which describes streams in `m.call.*` events via a `sdp_stream_metadata` field, but providing the full set of information needed for all devices in the room to know what feeds are available in the group call without having to independently discover them from the SFU.
+This builds on [MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077), which describes streams in `m.call.*` events via a `sdp_stream_metadata` field.
 
+** TODO: Do we need all of this data? Why would we need it? **
+** TODO: This doesn't follow the MSC3077 format very well - can we do something
+about that? **
 ** TODO: Add tracks field **
 ** TODO: Add bitrate/format fields **
 
@@ -210,56 +210,6 @@ Call setup then uses the normal `m.call.*` events, except they are sent over to-
    * `device_id` - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the `m.call.member` event has not been received yet.
    * `sender_session_id` - Like the `device_id` the `sender_session_id` is used by the opponent member to filter out messages unrelated to the sender's session even if the `m.call.member` event has not been received yet.
  * For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the `m.call` event propagates, to minimise latency.  Therefore we'll need to include an `m.intent` on the `m.call.invite` too.
- * When initiating a group call, we need to decide which devices to actually talk to.
-     * If the client has no SFU configured, we try to use the `m.foci` in the `m.call` event.
-         * If there are multiple `m.foci`, we select the closest one based on latency, e.g. by trying to connect to all of them simultaneously and discarding all but the first call to answer.
-         * If there are no `m.foci` in the `m.call` event, then we look at which foci in `m.call.member` that are already in use by existing participants, and select the most common one.  (If the foci is overloaded it can reject us and we should then try the next most populous one, etc).
-         * If there are no `m.foci` in the `m.call.member`, then we connect full mesh.
-         * If subsequently `m.foci` are introduced into the conference, then we should transfer the call to them (effectively doing a 1:1->group call upgrade).
-     * If the client does have an SFU configured, then we decide whether to use it. 
-         * If other conf participants are already using it, then we use it.
-         * If there are other users from our homeserver in the conference, then we use it (as presumably they should be using it too)
-         * If there are no other `m.foci` (either in the `m.call` or in the participant state) then we use it.
-         * Otherwise, we save bandwidth on our SFU by not cascading and instead behaving as if we had no SFU configured.
- * We do not recommend that users utilise an SFU to hide behind for privacy, but instead use a TURN server, only providing relay candidates, rather than consuming SFU resources and unnecessarily mandating the presence of an SFU.
-
-TODO: spec how clients discover their homeserver's preferred SFU foci
-
-Originally this proposal suggested that foci should be identified by their `(user_id, device_id)` rather than just their user_id, in order to ensure convergence on the same device.  In practice, this is unnecessary complication if we make it the SFU implementor's problem to ensure that either only one device is logged in per SFU user - or instead you cluster the SFU devices together for the same user.  It's important to note that when calling an SFU you should call `*` devices.
-
-### SFU control
-
-SFUs are Selective Forwarding Units - a server which forwarding WebRTC streams between peers (which could be clients or SFUs or both).  To make use of them effectively, peers need to be able to tell the SFU which streams they want to receive, and the SFU must tell the peers which streams it wants to be sent.  We also need a way of telling SFUs which other SFUs to connect ("cascade") to.
-
-The client does this by establishing an optional datachannel connection to the SFU using normal `m.call.invite`, in order to perform low-latency signalling to rapidly select streams.
-
-To select a stream over this channel, the peer sends:
-
-```jsonc
-{
-    "op": "select",
-    "conf_id": "cvsiu2893",
-    "start": [
-        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhwe" }
-        { "stream_id": "qegwy64121wqw", "track_id": "zbhsbdhzs" }
-    ],
-    "stop": [
-        { "stream_id": "suigv372y8378", "track_id": "xbhsbdhzs" }
-    ]    
-}
-```
-
-All streams are sent within a single media session (rather than us having multiple sessions or calls), and there is no difference between a peer sending simulcast streams from a webcam versus two SFUs trunking together.
-
-If no DC is established, then 1:1 calls should send all streams without prompting, and SFUs should send no streams by default.
-
-If you are using your SFU in a call, it will need to know how to connect to other SFUs present in order to participate in the fullmesh of SFU traffic (if any).  One option here is for SFUs to act as an AS and sniff the `m.call.member` traffic of their associated server, and automatically call any other `m.foci` which appear.  (They don't need to make outbound calls to clients, as clients always dial in).  Otherwise, we could consider an `"op": "connect"` command sent by clients, but then you have the problem of deciding which client(s) are responsible for reminding the SFU to connect to other SFUs.  Much better to trust the server.
-
-Also, in order to authenticate that only legitimate users are allowed to subscribe to a given conf_id on an SFU, it would make sense for the SFU to act as an AS and sniff the `m.call` events on their associated server, and only act on to-device `m.call.*` events which come from a user who is confirmed to be in the room for that `m.call`.  (In practice, if the conf is E2EE then it's of limited use to connect to the SFU without having the keys to decrypt the traffic, but this feature is desirable for non-E2EE confs and to stop bandwidth DoS)
-
-XXX: define how these DC messages muxes with other traffic, and consider what message encoding to actually use.
-
-TODO: spell out how this works with active speaker detection & associated signalling
 
 ## Example Diagrams
 
@@ -270,7 +220,6 @@ TODO: spell out how this works with active speaker detection & associated signal
 | Solid | [State Event](https://spec.matrix.org/latest/client-server-api/#types-of-room-events) |
 | Dashed | [Event (sent as to-device message)](https://spec.matrix.org/latest/client-server-api/#send-to-device-messaging) |
 
-
 ### Basic Call
 
 ```mermaid
@@ -290,46 +239,22 @@ sequenceDiagram
     Alice-->>Bob: m.call.select_answer
 ```
 
-## Encryption
-
-We get E2EE for 1:1 and full mesh calls automatically in this model.
-
-However, when SFUs are on the media path, the SFU will necessarily terminate the SRTP traffic from the peer, breaking E2EE.  To address this, we apply an additional end-to-end layer of encryption to the media using [WebRTC Encoded Transform](https://github.com/w3c/webrtc-encoded-transform/blob/main/explainer.md) (formerly Insertable Streams) via [SFrame](https://datatracker.ietf.org/doc/draft-omara-sframe/).
-
-In order to provide PFS, The symmetric key used for these stream from a given participating device is a megolm key. Unlike a normal megolm key, this is shared via `m.room_key` over Olm to the devices participating in the conference including an `m.call_id` and `m.room_id` field on the key to correlate it to the conference traffic, rather than using the `session_id` event field to correlate (given the encrypted traffic is SRTP rather than events, and we don't want to have to send fake events from all senders every time the megolm session is replaced).
-
-The megolm key is ratcheted forward for every SFrame, and shared with new participants at the current index via `m.room_key` over Olm as per above.  When participants leave, a new megolm session is created and shared with all participants over Olm.  The new session is only used once all participants have received it.
-
 ## Potential issues
 
 To-device messages are point-to-point between servers, whereas today's `m.call.*` messages can transitively traverse servers via the room DAG, thus working around federation problems.  In practice if you are relying on that behaviour, you're already in a bad place.
 
-The SFUs participating in a conference end up in a full mesh.  Rather than inventing our own spanning-tree system for SFUs however, we should fix it for Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or similar to decide what better-than-full-mesh topology to use.  In practice, full mesh cascade between SFUs is probably not that bad (especially if SFUs only request the streams over the trunk their clients care about) - and on aggregate will be less obnoxious than all the clients hitting a single SFU.
-
-SFrame mandates its own ratchet currently which is almost the same as megolm but not quite.  Switching it out for megolm seems reasonable right now (at least until MLS comes along)
-
 ## Alternatives
 
 There are many many different ways to do this.  The main other alternative considered was not to use state events to track membership, but instead gossip it via either to-device or DC messages between participants.  This fell apart however due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC.  So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.
 
-Another option is to treat 1:1 (and full mesh) entirely differently to SFU based calling rather than trying to unify them.  Also, it's debatable whether supporting full mesh is useful at all.  In the end, it feels like unifying 1:1 and SFU calling is for the best though, as it then gives you the ability to trivially upgrade 1:1 calls to group calls and vice versa, and avoids maintaining two separate hunks of spec.  It also forces 1:1 calls to take multi-stream calls seriously, which is useful for more exotic capture devices (stereo cameras; 3D cameras; surround sound; audio fields etc).
-
 An alternative to to-device messages is to use DMs.  You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls.  It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.
 
 ## Security considerations
 
 State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.
 
-Malicious users could try to DoS SFUs by specifying them as their foci.
-
-SFrame E2EE may go horribly wrong if we can't send the new megolm session fast enough to all the participants when a participant leave (and meanwhile if we keep using the old session, we're technically leaking call media to the parted participant until we manage to rotate).
-
-Need to ensure there's no scope for media forwarding loops through SFUs.
-
 Malicious users in a room could try to sabotage a conference by overwriting the `m.call` state event of the current ongoing call.
 
-Too many foci will chew bandwidth due to fullmesh between them.  In the worst case, if every use is on their own HS and picks a different foci, it degenerates to a fullmesh call (just serverside rather than clientside).  Hopefully this shouldn't happen as you will converge on using a single SFU with the most clients, but need to check how this works in practice.
-
 ## Unstable prefix
 
 | stable event type | unstable event type |