
fix(ice): do not disconnect whilst we still check new candidates #489

Merged: 2 commits into algesten:main from feat/transition-checking on Aug 15, 2024

Conversation

@thomaseizinger (Collaborator) commented on Mar 21, 2024

When the active candidate pair gets invalidated, str0m currently emits a Disconnected event immediately. This happens, for example, if a candidate pair with a peer-reflexive candidate gets nominated and is later replaced. The result is a Disconnected event despite us not actually being disconnected (a peer-reflexive candidate that gets replaced still refers to the exact same socket).

To avoid this false-positive disconnection, we transition back to Checking in case we are still evaluating other candidates.
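For illustration, here is a minimal self-contained sketch of that observation; the `Candidate` and `CandidateKind` types below are made up for this example and are not str0m's API:

```rust
use std::net::SocketAddr;

// Hypothetical, simplified candidate representation for this example only
// (not str0m's actual types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CandidateKind {
    Host,
    PeerReflexive,
}

#[derive(Debug, Clone, Copy)]
struct Candidate {
    kind: CandidateKind,
    addr: SocketAddr,
}

fn main() {
    let remote: SocketAddr = "192.0.2.10:52000".parse().unwrap();

    // Learned from an early STUN binding request, before signalling arrived.
    let prflx = Candidate { kind: CandidateKind::PeerReflexive, addr: remote };
    // The same transport address, later received via signalling.
    let signalled = Candidate { kind: CandidateKind::Host, addr: remote };

    // Replacing the prflx candidate with the signalled one swaps the
    // candidate pair but does not change the remote socket we talk to.
    assert_eq!(prflx.addr, signalled.addr);
    assert_ne!(prflx.kind, signalled.kind);
}
```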

@thomaseizinger changed the title from "feat: transition to Checking if we no nominated but possible candidates" to "feat: transition to Checking before Disconnected" on Jul 5, 2024
@thomaseizinger reopened this on Aug 9, 2024
@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

@algesten What is your appetite for merging this? I am hitting this again now in the context of a peer-reflexive candidate getting replaced with a regular one.

In case the peer-reflexive candidate was already nominated, we flip to Disconnected even though we are still checking the new candidate.

I remember that your latest stance was that we should "wait" a bit before acting on Disconnected to see if we recover from it. Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

The consequence of this behaviour is a flaky integration test on our end:

  • Peer 1 starts sending STUN requests before the candidates arrive at peer 2
  • Peer 2 creates a peer-reflexive candidate
  • Peer 2 nominates the candidate
  • Peer 2 receives actual candidates from peer 1
  • Nominated pair gets replaced, causing ICE timeout in peer 2
  • Peer 1 runs into ICE timeout because peer 2 discarded the agent as a result of the timeout

@algesten (Owner) commented on Aug 9, 2024

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

(from #486)

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand, which is why I'd like to keep it.

My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

My preference is to stick close to WebRTC, but if it's super important to you, I don't think this PR would break anything, so we could potentially land it.

@xnorpx (Collaborator) commented on Aug 9, 2024

For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

If our clients disconnect through other means, we tear down the connections directly, but not on an ICE disconnect.
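For context, a minimal sketch of such a grace-period approach. It assumes nothing about str0m's API; the `Connection` type and its methods are made up for illustration:

```rust
use std::time::{Duration, Instant};

const GRACE_PERIOD: Duration = Duration::from_secs(30);

// Hypothetical wrapper around one client connection; the names here are
// made up for illustration and are not part of str0m.
struct Connection {
    disconnected_since: Option<Instant>,
}

impl Connection {
    // ICE reported Disconnected: start the grace period instead of
    // tearing the connection down immediately.
    fn on_ice_disconnected(&mut self, now: Instant) {
        self.disconnected_since.get_or_insert(now);
    }

    // ICE reported Connected again: cancel the grace period.
    fn on_ice_connected(&mut self) {
        self.disconnected_since = None;
    }

    // Polled regularly; returns true once the connection should be dropped.
    fn should_tear_down(&self, now: Instant) -> bool {
        self.disconnected_since
            .map(|since| now.duration_since(since) >= GRACE_PERIOD)
            .unwrap_or(false)
    }
}

fn main() {
    let start = Instant::now();
    let mut conn = Connection { disconnected_since: None };

    conn.on_ice_disconnected(start);
    assert!(!conn.should_tear_down(start + Duration::from_secs(10)));

    // The client came back within the grace period: nothing is torn down.
    conn.on_ice_connected();
    assert!(!conn.should_tear_down(start + Duration::from_secs(60)));
}
```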

@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

> Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

100%. Reliable roaming is something that we really care about. Especially because for relayed connections, deploying new relays is basically like roaming so we need this to work reliably to not interrupt our users every time we deploy. The way it works is that the client applications listen for system-specific events whenever network interfaces change. When that happens, we reset all connections and establish new ones, i.e. we discard all IceAgents and rebind our sockets to new ports (and potentially new interfaces).

Because our application is an IP-tunnel, this means the layered connection is essentially "reconnecting" (i.e. TCP packets will start to get retransmitted by the kernel etc). We use those packets as "intents" that we need to set up a new ICE connection.

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand, which is why I'd like to keep it.

> My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

I don't think it is against the ICE spec per se. Admittedly, I've not dealt with it as much as you have but in my experience, ICE isn't very good at repairing connection state during roaming or when network conditions actually change, at least not without being told about it. An ICE restart is basically what we are doing except that we just make an entire new agent to begin with. What happens in most cases is that the network link is dead and you can only detect it because X STUN messages have timed out.

For those cases where it is really just intermittent packet loss, the bit that we actually care about is for how long we tolerate that. This is where we get to this:

> For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

The bit that I don't understand about this approach is: what is the difference between the configured STUN timeout plus 30 seconds on top, versus re-configuring the IceAgent timeouts to give you the duration you want?

str0m's defaults are 9 STUN messages with a 3s interval, and after 50% missing responses we probe more quickly (or something like that). Effectively, it results in a timeout of ~12s, I think. For us, that is a bit too long, so we tweak those values in our application. If you can tolerate a longer partition, you can tweak them upwards and allow up to 15 or 20 missed STUN requests.

Eventually though, there needs to be a timeout. It seems redundant to me to re-implement this on top of str0m when str0m has it already built in.
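To make that comparison concrete, here is a toy calculation of how the number of tolerated missed STUN requests translates into an effective disconnect timeout. The "probe faster after half the requests go unanswered" rule and the concrete intervals are assumptions based on the rough description above, not str0m's actual timing logic:

```rust
use std::time::Duration;

// Toy model: tolerate `max_missed` unanswered STUN requests, probing at
// `normal_interval` until half of them are missed, then at `fast_interval`.
// These numbers and the rule itself are illustrative only.
fn approx_disconnect_timeout(
    max_missed: u32,
    normal_interval: Duration,
    fast_interval: Duration,
) -> Duration {
    let slow_phase = max_missed / 2;
    let fast_phase = max_missed - slow_phase;
    normal_interval * slow_phase + fast_interval * fast_phase
}

fn main() {
    // Roughly the defaults described above: 9 missed requests at a 3s
    // interval, probing faster (here assumed 500ms) after half are missed.
    let default_ish = approx_disconnect_timeout(
        9,
        Duration::from_secs(3),
        Duration::from_millis(500),
    );
    // A more tolerant configuration: allow 20 missed requests instead.
    let tolerant = approx_disconnect_timeout(
        20,
        Duration::from_secs(3),
        Duration::from_millis(500),
    );
    println!("default-ish: ~{default_ish:?}, tolerant: ~{tolerant:?}");
}
```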


The main thing this PR changes is:

  • if we have a nominated candidate pair
  • and we invalidate that pair for whatever reason (e.g. because it was based on a peer-reflexive candidate)
  • and we still have candidate pairs whose status we don't know yet

Then we transition back to Checking instead of Disconnected. I'd argue that is reasonable because it means str0m will operate more deterministically and not emit Disconnected as a result of slight timing differences in how STUN requests vs candidates arrive.
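To make the intended behaviour concrete, here is a minimal, self-contained sketch of that decision. The types are made up for illustration and are not str0m's internals:

```rust
// Hypothetical, heavily simplified bookkeeping; str0m's real ICE agent is
// more involved. This only models the decision described above.
#[derive(Debug, PartialEq, Eq)]
enum IceState {
    Checking,
    Disconnected,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PairStatus {
    // No STUN response received for this pair yet; still being checked.
    InProgress,
    Failed,
}

// Decide the next state after the nominated pair has been invalidated:
// only report Disconnected if no remaining pair could still succeed.
fn state_after_nomination_lost(remaining_pairs: &[PairStatus]) -> IceState {
    if remaining_pairs.contains(&PairStatus::InProgress) {
        IceState::Checking
    } else {
        IceState::Disconnected
    }
}

fn main() {
    // The scenario from this PR: the nominated (peer-reflexive based) pair
    // was invalidated, but a pair built from the freshly received candidates
    // is still being checked, so we must not report Disconnected yet.
    let pairs = [PairStatus::InProgress, PairStatus::Failed];
    assert_eq!(state_after_nomination_lost(&pairs), IceState::Checking);

    // Only once every remaining pair has failed do we actually disconnect.
    let pairs = [PairStatus::Failed, PairStatus::Failed];
    assert_eq!(state_after_nomination_lost(&pairs), IceState::Disconnected);
}
```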

@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

If this PR is not desired, would you accept a PR that exposes a more fine-grained state on the IceAgent that allows me to check whether or not we are still testing candidates? Then I could implement a check myself to wait for Completed before acting on the Disconnected event.

@algesten (Owner)

Let's land this PR. I think it's ok.

@thomaseizinger (Collaborator, Author)

> Let's land this PR. I think it's ok.

Awesome! Would you like me to add a test for it?

Also, I am just changing our dependency over to the fork to test-drive this a bit more. We used to have this patch in our fork and the tests were pretty stable, so I am pretty confident that the flakiness came from switching to str0m mainline a couple of days ago.

Better be sure though. Will ping once ready!

@xnorpx (Collaborator) commented on Aug 10, 2024

Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Aug 12, 2024
My attempts at fixing the flaky integration test with #6200 failed.
Unfortunately, we need to go back to a fork of str0m to fix this. I am
in discussion with the other maintainers on whether or not we can land
this patch.

Diff to upstream `str0m`:
algesten/str0m@main...firezone:str0m:main
Example of a flaky test run:
https://github.com/firezone/firezone/actions/runs/10328744448/job/28595705615?pr=6237

Related: algesten/str0m#489.
@thomaseizinger (Collaborator, Author) commented on Aug 15, 2024

> Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

I added a failing test. The setup is a bit peculiar because we need an additional set of candidates. It should demonstrate the problem nicely though.

Regarding the documentation, I think we may need an additional PR to expose the ICE timeout config added in #537 on the RtcConfig before it can be used there. Unless you are using the IceAgent directly too, in which case you can just use the exposed config options.

@thomaseizinger changed the title from "feat: transition to Checking before Disconnected" to "fix: do not disconnect whilst checking new candidates" on Aug 15, 2024
@thomaseizinger changed the title from "fix: do not disconnect whilst checking new candidates" to "fix: do not disconnect whilst we still check new candidates" on Aug 15, 2024
@thomaseizinger marked this pull request as ready for review on August 15, 2024 04:13
@thomaseizinger changed the title from "fix: do not disconnect whilst we still check new candidates" to "fix(ice): do not disconnect whilst we still check new candidates" on Aug 15, 2024
@algesten (Owner)

Looks good! Let's merge it!

@thomaseizinger merged commit b800a7b into algesten:main on Aug 15, 2024
22 checks passed
github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Aug 16, 2024
algesten/str0m#489 got merged, we can thus
depend on upstream again.
@thomaseizinger deleted the feat/transition-checking branch on August 16, 2024 08:11