Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make bootstrap more robust (and slower 😢) #1351

Merged
merged 6 commits into from
Oct 9, 2023
Merged

Conversation

bassosimone
Copy link
Contributor

@bassosimone bassosimone commented Oct 9, 2023

This diff ensures that the boostrap is more robust in the scenario described by ooni/probe#2544.

Because we're close to a release, I did not want to write a heavy refactor of the boostrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551.

This diff has been tested in a problematic network under two distinct conditions:

  1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority;

  2. hot bootstrap with very fucked up sessionresolver.state, where the correct fix is to use a large coarse grained timeout, such that eventually we try the system resolver.

The fix that causes a canceled context to prevent marking a resolver as failed helps in both scenarios.

As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting in ooni/probe#2551 what I'd do had I had more time to work on this topic.

BTW, I had to create an overall 45 seconds timeout for IP lookups because we have 7 DNS over HTTPS resolvers plus the system resolver. This means that, in the worst case where the system resolver has the least priority, we expect 7*4 = 28 seconds of timeout before eventually using the system resolver. The rest of the timeout accounts for operations happening after the DNS lookup has succeeded.

This diff ensures that the boostrap is more robust in the scenario
described by ooni/probe#2544.

Because we're close to a release, I did not want to write a heavy
refactor of the boostrap, and specifically of the iplookup part, but
I documented what is needed at ooni/probe#2551.

This diff has been tested in a problematic network under two
distinct conditions:

1. cold bootstrap (i.e., w/o prior knowledge), where the correct
fix is to give the system resolver intermediate priority;

2. hot bootstrap with fscked up config, where the correct fix
is to use a large coarse grained timeout.

The fix that causes a canceled context to prevent marking a resolver
as failed helps in both scenarios.

As you can see, I removed a "reliability fix", which actually was
more of an optimization. This removal means that the probe bootstrap
is now slower than it could be, hence my choice of documenting
what I'd do had I had more time to work on this topic.
@bassosimone bassosimone changed the title fix: make bootstrap more robust (and slower) fix: make bootstrap more robust (and slower 😢) Oct 9, 2023
@bassosimone bassosimone merged commit 6b6271a into master Oct 9, 2023
8 checks passed
@bassosimone bassosimone deleted the issue/2544 branch October 9, 2023 15:08
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this pull request Feb 13, 2024
This diff ensures that the boostrap is more robust in the scenario
described by ooni/probe#2544.

Because we're close to a release, I did not want to write a heavy
refactor of the boostrap, and specifically of the iplookup part, but I
documented what is needed at ooni/probe#2551.

This diff has been tested in a problematic network under two distinct
conditions:

1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is
to give the system resolver intermediate priority (i.e., 0.5);

2. hot bootstrap with very f***** up `sessionresolver.state`, where the
correct fix is to use a larger coarse grained timeout, such that
_eventually_ we try the system resolver.

The fix that prevents marking a resolver when the external context has been
canceled is required in both of these scenarios.

As you can see, I removed a "reliability fix", which actually was more
of an optimization. This removal means that the probe bootstrap is now
slower than it could be, hence my choice of documenting in
ooni/probe#2551 what I'd do had I had more
time to work on this topic.

BTW, I had to create an overall 45 seconds timeout for IP lookups
because we have 7 DNS over HTTPS resolvers plus the system resolver.
This means that, in the worst case where the system resolver has the
least priority, we expect 7*4 = 28 seconds of timeout before eventually
using the system resolver. The rest of the 45s timeout accounts for
operations happening after the DNS lookup has completed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant