
fix(ice): do not disconnect whilst we still check new candidates #489

Merged: 2 commits into algesten:main from feat/transition-checking on Aug 15, 2024

Conversation

@thomaseizinger (Collaborator) commented on Mar 21, 2024

When the active candidate pair gets invalidated, str0m currently emits a Disconnected event immediately. This happens, for example, if a candidate pair with a peer-reflexive candidate gets nominated and is later replaced. The result is a Disconnected event despite us not actually being disconnected (a peer-reflexive candidate that gets replaced still refers to the exact same socket).

To avoid this false-positive disconnection, we transition back to Checking in case we are still evaluating other candidates.
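For illustration, here is a minimal self-contained sketch of that observation; the `Candidate` and `CandidateKind` types below are made up for this example and are not str0m's API:

```rust
use std::net::SocketAddr;

// Hypothetical, simplified candidate representation for this example only
// (not str0m's actual types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CandidateKind {
    Host,
    PeerReflexive,
}

#[derive(Debug, Clone, Copy)]
struct Candidate {
    kind: CandidateKind,
    addr: SocketAddr,
}

fn main() {
    let remote: SocketAddr = "192.0.2.10:52000".parse().unwrap();

    // Learned from an early STUN binding request, before signalling arrived.
    let prflx = Candidate { kind: CandidateKind::PeerReflexive, addr: remote };
    // The same transport address, later received via signalling.
    let signalled = Candidate { kind: CandidateKind::Host, addr: remote };

    // Replacing the prflx candidate with the signalled one swaps the
    // candidate pair but does not change the remote socket we talk to.
    assert_eq!(prflx.addr, signalled.addr);
    assert_ne!(prflx.kind, signalled.kind);
}
```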

@thomaseizinger changed the title from "feat: transition to Checking if we no nominated but possible candidates" to "feat: transition to Checking before Disconnected" on Jul 5, 2024
@thomaseizinger reopened this on Aug 9, 2024
@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

@algesten What is your appetite for merging this? I am hitting this again now in the context of a peer-reflexive candidate getting replaced with a regular one.

In case the peer-reflexive candidate was already nominated, we flip to Disconnected even though we are still checking the new candidate.

I remember that your latest stance was that we should "wait" a bit before acting on Disconnected to see if we recover from it. Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

The consequence of this behaviour is a flaky integration test on our end:

  • Peer 1 starts sending STUN requests before the candidates arrive at peer 2
  • Peer 2 creates a peer-reflexive candidate
  • Peer 2 nominates the candidate
  • Peer 2 receives actual candidates from peer 1
  • Nominated pair gets replaced, causing ICE timeout in peer 2
  • Peer 1 runs into ICE timeout because peer 2 discarded the agent as a result of the timeout

@algesten (Owner) commented on Aug 9, 2024

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

(from #486)

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand, which is why I'd like to keep it.

My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

My preference is to stick close to WebRTC, but if it's super important to you, I don't think this PR would break anything, so we could potentially land it.

@xnorpx (Collaborator) commented on Aug 9, 2024

For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

If our clients disconnect through other means, we tear down the connections directly, but not on an ICE disconnect.
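For context, a minimal sketch of such a grace-period approach. It assumes nothing about str0m's API; the `Connection` type and its methods are made up for illustration:

```rust
use std::time::{Duration, Instant};

const GRACE_PERIOD: Duration = Duration::from_secs(30);

// Hypothetical wrapper around one client connection; the names here are
// made up for illustration and are not part of str0m.
struct Connection {
    disconnected_since: Option<Instant>,
}

impl Connection {
    // ICE reported Disconnected: start the grace period instead of
    // tearing the connection down immediately.
    fn on_ice_disconnected(&mut self, now: Instant) {
        self.disconnected_since.get_or_insert(now);
    }

    // ICE reported Connected again: cancel the grace period.
    fn on_ice_connected(&mut self) {
        self.disconnected_since = None;
    }

    // Polled regularly; returns true once the connection should be dropped.
    fn should_tear_down(&self, now: Instant) -> bool {
        self.disconnected_since
            .map(|since| now.duration_since(since) >= GRACE_PERIOD)
            .unwrap_or(false)
    }
}

fn main() {
    let start = Instant::now();
    let mut conn = Connection { disconnected_since: None };

    conn.on_ice_disconnected(start);
    assert!(!conn.should_tear_down(start + Duration::from_secs(10)));

    // The client came back within the grace period: nothing is torn down.
    conn.on_ice_connected();
    assert!(!conn.should_tear_down(start + Duration::from_secs(60)));
}
```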

@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

> Whilst doable, it is a bit annoying to implement and in the case of a real disconnection, harms the user experience because we can't give feedback until the additional timer expires.

> Does your solution deal with road warriors on mobile data? If so, you could easily be in a situation where you reach Disconnected without being disconnected (and it coming back). We see the exact same thing when implementing browser WebRTC clients. Knowing that something is definitely disconnected is not something ICE can help you with. It's better to go "Reconnecting…" after some timeout to avoid flapping.

100%. Reliable roaming is something that we really care about. Especially because for relayed connections, deploying new relays is basically like roaming so we need this to work reliably to not interrupt our users every time we deploy. The way it works is that the client applications listen for system-specific events whenever network interfaces change. When that happens, we reset all connections and establish new ones, i.e. we discard all IceAgents and rebind our sockets to new ports (and potentially new interfaces).

Because our application is an IP-tunnel, this means the layered connection is essentially "reconnecting" (i.e. TCP packets will start to get retransmitted by the kernel etc). We use those packets as "intents" that we need to set up a new ICE connection.

> Currently, we treat Disconnected as a hard error and clean up our connection state. That is simple to understand, which is why I'd like to keep it.

> My worry here is that your need comes from a design decision where you want to treat Disconnected as a hard error when that's not what it is even in the ICE spec. I don't think we should land functionality to try and fix something that goes against how ICE works.

I don't think it is against the ICE spec per se. Admittedly, I've not dealt with it as much as you have but in my experience, ICE isn't very good at repairing connection state during roaming or when network conditions actually change, at least not without being told about it. An ICE restart is basically what we are doing except that we just make an entire new agent to begin with. What happens in most cases is that the network link is dead and you can only detect it because X STUN messages have timed out.

For those cases where it is really just intermittent packet loss, the bit that we actually care about is for how long we tolerate that. This is where we get to this:

> For example, in our codebase we let the connection stick around for 30 seconds when we get an ICE disconnect, in case the client shows up again.

The bit that I don't understand about this approach is: what is the difference between the configured STUN timeout plus 30 seconds on top, versus re-configuring the IceAgent timeouts to give you the duration you want?

str0m's defaults are 9 STUN messages with a 3s interval, and after 50% missing responses we probe more quickly (or something like that). Effectively, it results in a timeout of ~12s, I think. For us, that is a bit too long, so we tweak those values in our application. If you can tolerate a longer partition, you can tweak them upwards and allow up to 15 or 20 missed STUN requests.

Eventually though, there needs to be a timeout. It seems redundant to me to re-implement this on top of str0m when str0m has it already built in.
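To make that comparison concrete, here is a toy calculation of how the number of tolerated missed STUN requests translates into an effective disconnect timeout. The "probe faster after half the requests go unanswered" rule and the concrete intervals are assumptions based on the rough description above, not str0m's actual timing logic:

```rust
use std::time::Duration;

// Toy model: tolerate `max_missed` unanswered STUN requests, probing at
// `normal_interval` until half of them are missed, then at `fast_interval`.
// These numbers and the rule itself are illustrative only.
fn approx_disconnect_timeout(
    max_missed: u32,
    normal_interval: Duration,
    fast_interval: Duration,
) -> Duration {
    let slow_phase = max_missed / 2;
    let fast_phase = max_missed - slow_phase;
    normal_interval * slow_phase + fast_interval * fast_phase
}

fn main() {
    // Roughly the defaults described above: 9 missed requests at a 3s
    // interval, probing faster (here assumed 500ms) after half are missed.
    let default_ish = approx_disconnect_timeout(
        9,
        Duration::from_secs(3),
        Duration::from_millis(500),
    );
    // A more tolerant configuration: allow 20 missed requests instead.
    let tolerant = approx_disconnect_timeout(
        20,
        Duration::from_secs(3),
        Duration::from_millis(500),
    );
    println!("default-ish: ~{default_ish:?}, tolerant: ~{tolerant:?}");
}
```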


The main thing this PR changes is:

  • if we have a nominated candidate pair
  • and we invalidate that pair for whatever reason (e.g. because it was based on a peer-reflexive candidate)
  • and we still have candidate pairs whose status we don't know yet

Then we transition back to Checking instead of Disconnected. I'd argue that is reasonable because it means str0m will operate more deterministically and not emit Disconnected as a result of slight timing differences in how STUN requests vs candidates arrive.
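To make the intended behaviour concrete, here is a minimal, self-contained sketch of that decision. The types are made up for illustration and are not str0m's internals:

```rust
// Hypothetical, heavily simplified bookkeeping; str0m's real ICE agent is
// more involved. This only models the decision described above.
#[derive(Debug, PartialEq, Eq)]
enum IceState {
    Checking,
    Disconnected,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PairStatus {
    // No STUN response received for this pair yet; still being checked.
    InProgress,
    Failed,
}

// Decide the next state after the nominated pair has been invalidated:
// only report Disconnected if no remaining pair could still succeed.
fn state_after_nomination_lost(remaining_pairs: &[PairStatus]) -> IceState {
    if remaining_pairs.contains(&PairStatus::InProgress) {
        IceState::Checking
    } else {
        IceState::Disconnected
    }
}

fn main() {
    // The scenario from this PR: the nominated (peer-reflexive based) pair
    // was invalidated, but a pair built from the freshly received candidates
    // is still being checked, so we must not report Disconnected yet.
    let pairs = [PairStatus::InProgress, PairStatus::Failed];
    assert_eq!(state_after_nomination_lost(&pairs), IceState::Checking);

    // Only once every remaining pair has failed do we actually disconnect.
    let pairs = [PairStatus::Failed, PairStatus::Failed];
    assert_eq!(state_after_nomination_lost(&pairs), IceState::Disconnected);
}
```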

@thomaseizinger (Collaborator, Author) commented on Aug 9, 2024

If this PR is not desired, would you accept a PR that exposes a more fine-grained state on the IceAgent that allows me to check whether or not we are still testing candidates? Then I could implement a check myself to wait for Completed before acting on the Disconnected event.

@algesten (Owner)

Let's land this PR. I think it's ok.

@thomaseizinger (Collaborator, Author)

> Let's land this PR. I think it's ok.

Awesome! Would you like me to add a test for it?

Also, I am just changing our dependency over to the fork to test-drive this a bit more. We used to have this patch in our fork and the tests were pretty stable, so I am pretty confident that the flakiness came from switching to str0m mainline a couple of days ago.

Better be sure though. Will ping once ready!

@xnorpx (Collaborator) commented on Aug 10, 2024

Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Aug 12, 2024
My attempts at fixing the flaky integration test with #6200 failed.
Unfortunately, we need to go back to a fork of str0m to fix this. I am
in discussion with the other maintainers on whether or not we can land
this patch.

Diff to upstream `str0m`:
algesten/str0m@main...firezone:str0m:main
Example of a flaky test run:
https://github.com/firezone/firezone/actions/runs/10328744448/job/28595705615?pr=6237

Related: algesten/str0m#489.
@thomaseizinger (Collaborator, Author) commented on Aug 15, 2024

> Having an integration test that validates the disconnect would be nice, together with an example of how to configure the timeouts.

I added a failing test. The setup is a bit peculiar because we need an additional set of candidates. It should demonstrate the problem nicely though.

Regarding the documentation, I think we may need an additional PR to expose the ICE timeout config added in #537 on the RtcConfig before it can be used there. Unless you are using the IceAgent directly too, in which case you can just use the exposed config options.

@thomaseizinger changed the title from "feat: transition to Checking before Disconnected" to "fix: do not disconnect whilst checking new candidates" on Aug 15, 2024
@thomaseizinger changed the title from "fix: do not disconnect whilst checking new candidates" to "fix: do not disconnect whilst we still check new candidates" on Aug 15, 2024
@thomaseizinger marked this pull request as ready for review on August 15, 2024 04:13
@thomaseizinger changed the title from "fix: do not disconnect whilst we still check new candidates" to "fix(ice): do not disconnect whilst we still check new candidates" on Aug 15, 2024
@algesten (Owner)

Looks good! Let's merge it!

@thomaseizinger merged commit b800a7b into algesten:main on Aug 15, 2024
22 checks passed
github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Aug 16, 2024
algesten/str0m#489 got merged, we can thus
depend on upstream again.
@thomaseizinger deleted the feat/transition-checking branch on August 16, 2024 08:11