"Happy Eyeballs" failure reporting #175

enygren · 2024-07-20T22:21:45Z

RFC 8305 defines a "Happy Eyeballs" behavior that allows clients to try IPv6 first but fail over to IPv4 when IPv6 does not connect in a timely manner. There is a proposal to extend this further to handle the increasing number of endpoint candidates (e.g., QUIC, multiple SVCB records, etc.) which clients might try (see https://datatracker.ietf.org/doc/html/draft-pauly-v6ops-happy-eyeballs-v3-01).

An ongoing operational concern is that while this Happy Eyeballs behavior greatly improves user experience (e.g., a user has multiple protocols, IP versions, and IP addresses to try), it can hide partial failures in the network. For example, if a content provider (or ISP) has broken IPv6, they may not notice, because dual-stack users will fall back to IPv4.
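
To make the masking effect concrete, here is a minimal simulation of a staggered race. It is a sketch, not RFC 8305's full algorithm: the `Candidate` type and `race` function are hypothetical names, the 250 ms stagger is the RFC's recommended default Connection Attempt Delay, and a candidate with `connect_ms=None` stands in for a path that silently drops packets.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    family: str                  # "IPv6" or "IPv4"
    connect_ms: Optional[float]  # time needed to connect; None = never connects


def race(candidates: List[Candidate],
         stagger_ms: float = 250.0) -> Tuple[Optional[Candidate], Optional[float]]:
    """Simplified model: attempt i starts at i * stagger_ms; earliest finish wins."""
    best: Optional[Tuple[Candidate, float]] = None
    for i, cand in enumerate(candidates):
        if cand.connect_ms is None:
            continue  # this attempt never completes
        finish = i * stagger_ms + cand.connect_ms
        if best is None or finish < best[1]:
            best = (cand, finish)
    return best if best is not None else (None, None)


# Broken IPv6: the user still connects, just later and over IPv4.
winner, finish_ms = race([Candidate("IPv6", None), Candidate("IPv4", 40.0)])
print(winner.family, finish_ms)
```

In this model the server only ever observes the winning IPv4 connection; nothing tells it that an IPv6 attempt was made and lost.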

It would be highly valuable if NEL could report these sorts of issues where a client had multiple endpoints it could connect to and some were unreachable.

Some considerations:

- Would this want to have multiple IP addresses and protocols in a single report (e.g., the list of those that were tried and failed and which succeeded, along with timing information)? Or would this want to be a new type of report that just says the failure is for an alternate endpoint that was tried, but that the client succeeded with a different endpoint?
- Normally, alternate connection attempts are abandoned as soon as one attempt wins the race. When NEL is enabled, would we want to keep running the other attempts longer to see if they would have succeeded given more time? Reporting the timing for when an attempt was abandoned, when the alternate that was used succeeded, and when/if the failed attempt would have succeeded might give valuable insight (e.g., to differentiate a case where IPv6 just doesn't work from one where it is a few milliseconds slower and loses the race).
- It will be important to balance privacy against providing useful data, and also to make sure that reporting works even in the face of the failures that might have caused Happy Eyeballs to use an alternate. For example, it would be nice to know both the client IPv6 and IPv4 addresses, but this might not be possible in all cases (e.g., if either is missing/broken entirely) or in cases where doing so would leak too much privacy-wise.
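
As a starting point for discussion, the first option (one report listing every attempt) could be sketched roughly as follows. Everything here is hypothetical and not part of the NEL spec: the `Attempt` fields, the `build_report` helper, and the report `type` string are invented for illustration. Client addresses are deliberately left out, and only the address family is recorded, to reflect the privacy concern above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Attempt:
    family: str      # "ipv6" or "ipv4"; addresses are deliberately omitted
    protocol: str    # e.g. "h3" or "h2"
    outcome: str     # "succeeded", "failed", or "abandoned"
    started_ms: int  # offset from the start of the race
    ended_ms: int    # when the attempt finished or was given up


def build_report(attempts: List[Attempt]) -> Optional[dict]:
    """Return a report only when one attempt won and at least one did not."""
    winners = [a for a in attempts if a.outcome == "succeeded"]
    losers = [a for a in attempts if a.outcome != "succeeded"]
    if not winners or not losers:
        return None  # nothing was masked, or nothing worked at all
    return {
        "type": "hypothetical.happy_eyeballs",  # placeholder, not a NEL type
        "used": asdict(winners[0]),
        "not_used": [asdict(a) for a in losers],
    }


attempts = [
    Attempt("ipv6", "h3", "abandoned", started_ms=0, ended_ms=290),
    Attempt("ipv4", "h2", "succeeded", started_ms=250, ended_ms=290),
]
print(json.dumps(build_report(attempts), indent=2))
```

Recording `ended_ms` for abandoned attempts is what would let a collector tell "IPv6 never answered" apart from "IPv6 lost by a few milliseconds", provided the client keeps the losing attempt running long enough to find out.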

simon-friedberger · 2024-07-22T08:14:57Z

I think this is an interesting idea. The original problem for NEL was "a client cannot reach me and I want to know about this event", and the report then includes information that might help with debugging issues. One underlying concept is "If the client can reach me, I don't need NEL." Even though it is, AFAIR, no longer correct, the spec still states:

> To prevent information leakage, NEL reports about a request do not contain any information that is not visible to the server when processing the request.

With IPv4/IPv6, h2/h3, and HTTPS upgrades, there might be useful information which clients are not making available today, like "I tried IPv6 but it didn't work." Although you could argue that in this case you should have gotten a report for your IPv6 endpoint.

The hard part for this and #176 will be to balance utility and privacy. IMHO it's harder to judge than for the original NEL, because "the user wants to connect but cannot" is some motivation for users to participate in network debugging. But if the user does connect with IPv4, why should they provide any kind of privacy-sensitive information so somebody else can debug their IPv6 network? And looking at the utility, if you see only IPv4 connections from a certain area, can't you already deduce that IPv6 might not be working?
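
The server-side deduction mentioned at the end could look something like the sketch below. It assumes a hypothetical log format of `(region, client_address)` pairs and a made-up helper name, `ipv6_share_by_region`; the addresses are from the documentation ranges.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ipv6_share_by_region(entries: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of connections per region that arrived over IPv6."""
    counts = defaultdict(lambda: [0, 0])  # region -> [ipv6 count, total count]
    for region, addr in entries:
        if ipaddress.ip_address(addr).version == 6:
            counts[region][0] += 1
        counts[region][1] += 1
    return {region: v6 / total for region, (v6, total) in counts.items()}


log = [
    ("region-a", "192.0.2.1"),
    ("region-a", "198.51.100.7"),
    ("region-b", "2001:db8::1"),
    ("region-b", "192.0.2.9"),
]
print(ipv6_share_by_region(log))
```

A share of zero for a region is suggestive, but on its own it cannot tell broken IPv6 apart from networks that never offered IPv6 in the first place, which is the gap a client-side report would be trying to fill.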

LPardue · 2024-07-22T10:10:06Z

Speaking for the server side: it feels like there's a lot of nuance in the client's decision making. It is not necessarily that IP version X was outright blocked or failed, nor that HTTP version Y was blocked or failed, but that the user agent rolled a set of dice for enough rounds to find a combination good enough. Servers can infer the final dice roll that led to a successful connection, but get no insight into the lead-up. In practice, ignoring such failures could prevent identifying systematic issues, which hurts both client and server.

pmeenan · 2024-07-22T14:30:31Z

If I recall correctly, NEL is meant to help site owners identify infrastructure issues for the parts of the infrastructure that are under their control that happen at a point in time when they have no visibility into the connection attempt.

It feels like ISP IPv6 issues fall outside of the scope of NEL. Same goes for any nuance in the client's decision making that isn't tied to something like the contents of the HTTPS DNS records that is under the control of the site (or CDN they are using).

clelland · 2024-07-30T14:45:45Z

Let's discuss this at the next WG call; I think that there are things that we can do here, without trying to turn NEL into a general-purpose network troubleshooting tool

LPardue · 2024-07-30T15:10:51Z

From my perspective, site owners delegate these details to their IaaS provider. Then they might review request logs or other telemetry and ask "why was this used over that", "is my website config actually being used", etc. Fallbacks, by their nature, mask problems. This can be compounded by multi-CDN setups where configs might differ.

Not providing network error information when there was a network error seems counterintuitive to me :)

nicjansma · 2024-10-21T20:31:20Z

This issue was discussed at W3C TPAC 2024:

Presentation
Minutes
Summary:
- There was confirmation from other CDNs that getting this data can be valuable for diagnosing network issues
- As always, we would have to evaluate what could be exposed in a privacy-safe manner (maybe through differential privacy / aggregate reporting)
- Further discussions were suggested in IETF

bashi mentioned this issue Sep 12, 2024

Better HTTP/2 and HTTP/3 errors #134

Open

nicjansma mentioned this issue Oct 21, 2024

Network failures after connection establishment #176

Open
