fix: make bootstrap more robust (and slower 😢) #1351
Merged
This diff ensures that the bootstrap is more robust in the scenario described by ooni/probe#2544.
Because we're close to a release, I did not want to write a heavy refactor of the bootstrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551.
This diff has been tested in a problematic network under two distinct conditions:
- cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority;
- hot bootstrap with a badly messed up `sessionresolver.state`, where the correct fix is to use a large coarse-grained timeout, such that eventually we try the system resolver.
The fix that causes a canceled context to prevent marking a resolver as failed helps in both scenarios.
As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting in ooni/probe#2551 what I'd do had I had more time to work on this topic.
BTW, I had to create an overall 45-second timeout for IP lookups because we have 7 DNS over HTTPS resolvers plus the system resolver. This means that, in the worst case where the system resolver has the lowest priority, we expect 7*4 = 28 seconds of timeouts (i.e., 4 seconds for each DNS over HTTPS resolver) before eventually using the system resolver. The rest of the timeout accounts for operations happening after the DNS lookup has succeeded.