"Happy Eyeballs" failure reporting #175

Open
enygren opened this issue Jul 20, 2024 · 6 comments

Comments

@enygren

enygren commented Jul 20, 2024

RFC 8305 defines a "Happy Eyeballs" behavior that allows clients to try IPv6 first but fail over to IPv4 when IPv6 does not connect in a timely manner. There is a proposal to extend this further to handle the increasing number of endpoint candidates (e.g., QUIC, multiple SVCB records, etc.) which clients might try (see https://datatracker.ietf.org/doc/html/draft-pauly-v6ops-happy-eyeballs-v3-01).

An ongoing operational concern is that while this Happy Eyeballs behavior greatly improves user experience (e.g., when a user has multiple protocols, IP versions, and IP addresses to try), it can hide partial failures in the network. For example, if a content provider (or ISP) has broken IPv6, they may not notice, as dual-stack users will fall back to IPv4.

It would be highly valuable if NEL could report these sorts of issues where a client had multiple endpoints it could connect to and some were unreachable.

Some considerations:

  • Would this want to have multiple IP addresses and protocols in a single report (e.g., the list of those that were tried and failed and those that succeeded, along with timing information)? Or would this want to be a new type of report that just says the failure is for an alternate endpoint that was tried, but that the client succeeded with a different endpoint?
  • Normally alternate connection attempts are abandoned once one attempt wins the race. When NEL is enabled, would we want to keep running the other attempt longer to see if it would have succeeded given more time? Reporting the timing for when it was abandoned, when the alternate that was used succeeded, and when/if this failed attempt would have succeeded might give valuable insight (e.g., to differentiate a case where IPv6 just doesn't work from one where it is a few milliseconds slower and loses a race).
  • It will be important to balance privacy against providing useful data, and also to make sure that reporting works even in the face of failures that might have caused Happy Eyeballs to use an alternate. For example, it would be nice to know both the client IPv6 and IPv4 addresses, but this might not be possible in all cases (e.g., if either is missing/broken entirely) or in cases where doing so would leak too much privacy-wise.
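
To make the first two considerations concrete, here is a minimal sketch of what a single combined report could carry. Everything in it is an assumption for illustration: the `Attempt` fields, the `summarize` helper, the `"happy-eyeballs-fallback"` type string, and the verdict labels are hypothetical and are not part of NEL or RFC 8305. Client addresses are deliberately left out, given the privacy consideration above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    family: str                  # "ipv6" or "ipv4"
    protocol: str                # e.g. "h2" or "h3"
    started_ms: int              # offset from the start of the race
    finished_ms: Optional[int]   # when it succeeded or failed; None if abandoned
    outcome: str                 # "succeeded", "failed", or "abandoned"


def summarize(attempts):
    """Build a hypothetical report body for one Happy Eyeballs race.

    Returns None when nothing succeeded, since that case is an ordinary
    connection failure rather than a masked one.
    """
    succeeded = [a for a in attempts if a.outcome == "succeeded"]
    if not succeeded:
        return None
    # The attempt the client actually used is the first one to succeed.
    winner = min(succeeded, key=lambda a: a.finished_ms)
    alternates = []
    for a in attempts:
        if a is winner:
            continue
        if a.outcome == "succeeded":
            # Kept running past the race: it works, it was just slower.
            detail = {"verdict": "slower",
                      "lost_by_ms": a.finished_ms - winner.finished_ms}
        elif a.outcome == "failed":
            detail = {"verdict": "failed", "failed_at_ms": a.finished_ms}
        else:
            # Abandoned when the race ended, so the result is unknown.
            detail = {"verdict": "abandoned"}
        alternates.append({"family": a.family, "protocol": a.protocol, **detail})
    return {
        "type": "happy-eyeballs-fallback",  # hypothetical report type
        "used": {"family": winner.family, "protocol": winner.protocol,
                 "connected_at_ms": winner.finished_ms},
        "alternates": alternates,
    }
```

The "slower" verdict is only available if the client keeps the losing attempt running after the race is decided, which is exactly the open question in the second bullet.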
@simon-friedberger

simon-friedberger commented Jul 22, 2024

I think this is an interesting idea. The original problem for NEL was "a client cannot reach me and I want to know about this event", and the report then includes information that might help with debugging issues. One concept being "If the client can reach me, I don't need NEL." Even though it is, AFAIR, not correct anymore, the spec still states:

To prevent information leakage, NEL reports about a request do not contain any information that is not visible to the server when processing the request.

With IPv4/IPv6 and h2/h3 and HTTPS upgrades there might be useful information which clients are not making available today, like "I tried IPv6 but it didn't work." Although you could argue that in this case maybe you should have gotten a report for your IPv6 endpoint.

The hard part for this and #176 will be to balance utility and privacy. IMHO it's harder to judge than for the original NEL, because "the user wants to connect but cannot" is some motivation to make users participate in network debugging, but if the user does connect with IPv4, why should they provide any kind of privacy-sensitive information so somebody else can debug their IPv6 network? And looking at the utility, if you see only IPv4 connections from a certain area, can't you already deduce that IPv6 might not be working?
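
As a rough illustration of that last question, a server could estimate the IPv6 share per area from the logs it already has. This is only a sketch under assumptions: `ipv6_share_by_region` is a hypothetical helper, and the input is assumed to be `(region, client_ip)` pairs that have already been extracted from access logs.

```python
import ipaddress
from collections import Counter


def ipv6_share_by_region(log):
    """Return the fraction of successful connections per region that used IPv6.

    `log` is an iterable of (region, client_ip) pairs (assumed format).
    """
    totals, v6 = Counter(), Counter()
    for region, ip in log:
        totals[region] += 1
        if ipaddress.ip_address(ip).version == 6:
            v6[region] += 1
    return {region: v6[region] / totals[region] for region in totals}
```

A share of zero shows that nobody in that area connected over IPv6, but it cannot tell whether IPv6 was attempted and failed or was never available to those clients in the first place.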

@LPardue

LPardue commented Jul 22, 2024

Speaking for the server side: it feels like there's a lot of nuance in the client's decision making. It's not necessarily that IP version X was outright blocked or failed, nor that HTTP version Y was blocked or failed, but that the user agent rolled a set of dice for enough rounds to find a combo result that was good enough. Servers can infer the final dice roll leading to a successful connection, but get no insight into the lead-up. In actuality, ignoring such failures could prevent identifying systematic issues, which hurt both client and server.

@pmeenan

pmeenan commented Jul 22, 2024

If I recall correctly, NEL is meant to help site owners identify infrastructure issues for the parts of the infrastructure that are under their control that happen at a point in time when they have no visibility into the connection attempt.

It feels like ISP IPv6 issues fall outside of the scope of NEL. Same goes for any nuance in the client's decision making that isn't tied to something like the contents of the HTTPS DNS records that is under the control of the site (or CDN they are using).

@clelland
Contributor

Let's discuss this at the next WG call; I think that there are things that we can do here without trying to turn NEL into a general-purpose network troubleshooting tool.

@LPardue

LPardue commented Jul 30, 2024

From my perspective, site owners delegate these details to their IaaS provider. Then they might review request logs or other telemetry and ask "why was this used over that", "is my website config actually being used", etc. Fallbacks, by their nature, mask problems. This can be compounded by multi-CDN setups where configs might differ.

Not providing network error information when there was a network error seems counterintuitive to me :)

@nicjansma

This issue was discussed at W3C TPAC 2024:

  • Presentation
  • Minutes
  • Summary:
    • There was confirmation from other CDNs that getting this data can be valuable for diagnosing network issues
    • As always, we would have to evaluate what could be exposed in a privacy-safe manner (maybe through differential privacy / aggregate reporting)
    • Further discussions were suggested in IETF


6 participants