Better HTTP/2 and HTTP/3 errors #134

Open
LPardue opened this issue Apr 4, 2023 · 9 comments

Comments

@LPardue

LPardue commented Apr 4, 2023

HTTP/2 and HTTP/3 have many cases where, after a successful handshake, a request can fail with one of a variety of error codes. Request streams can be canceled or abruptly terminated either before the server returns a status code, or after a status code is returned but before the response is fully delivered. These are stream errors, not connection errors; the two are subtly different, although it can be useful to observe both.
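
To make the distinction concrete, here is a minimal sketch of how a failure might be labeled by scope (stream or connection) and by whether a status code had already arrived. `Scope`, `RequestFailure`, and `describe` are hypothetical names for illustration only; they are not part of NEL or of any browser implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    STREAM = "stream"          # one request stream is reset; the connection survives
    CONNECTION = "connection"  # the whole connection is closed


@dataclass
class RequestFailure:
    scope: Scope
    error_code: str              # protocol error name, e.g. "H3_REQUEST_CANCELLED"
    status_code: Optional[int]   # None if no status code arrived before the failure


def describe(failure: RequestFailure) -> str:
    """Label where in the request lifecycle the failure happened."""
    if failure.status_code is None:
        phase = "before_status"
    else:
        # A status code was seen, but the response was not fully delivered.
        phase = "after_status_incomplete_body"
    return f"{failure.scope.value}:{phase}:{failure.error_code}"


early = RequestFailure(Scope.STREAM, "H3_REQUEST_CANCELLED", None)
late = RequestFailure(Scope.STREAM, "H3_INTERNAL_ERROR", 200)
print(describe(early))
print(describe(late))
```

In the second case a log that records only the status code would show a 200, even though the response was never fully delivered.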

Conventional access logs and HTTP request logs really struggle to capture these types of failures, so website operators do not get good insight into what is happening.

Cloudflare's public stats at https://radar.cloudflare.com/adoption-and-usage show that HTTP/2 and HTTP/3 comprise over 90% of all requests to the Cloudflare edge (for "likely human" traffic). The scale of this NEL gap is pretty tremendous by request volume.

NEL provides an opportunity to improve the situation for various stakeholders. Better visibility may help to spot difficult-to-reproduce problems that affect only a small fraction of requests. For example, at a meeting during IETF 116 we discussed an implementation bug discovered recently that would be detected as a connection error with code FINAL_SIZE_ERROR. This occurred at an aggregate rate of about 0.001%, but much more often in networks with higher rates of loss. Once found, the root cause of this problem was identified and quickly fixed. But it had been that way for several years.

Related to #119 here and Chrome bug https://bugs.chromium.org/p/chromium/issues/detail?id=1121658

@clelland
Contributor

It looks like NEL is on the WebPerfWG agenda this week. Should we discuss this issue there?

@LPardue
Author

LPardue commented Apr 24, 2023

SGTM, let me know if it would help to prepare anything.

@yoavweiss
Contributor

@LPardue - a short presentation or simply walking folks through the issue and use cases would probably be useful. Thanks! :)

@neilstuartcraig
Contributor

For the record, we would +1 this proposal. We support both h2 and h3 for www.bbc.co.uk & www.bbc.com and the current set of h2/h3.protocol.error (and h2.ping_failed) don't provide us anything actionable. We do see all of those h2/h3 event reports so something is going wrong somewhere but we don't have any way to know how to tackle them.

@LPardue
Author

LPardue commented May 4, 2023

Discussed on the Web Perf WG call last week, and there seems to be some interest in doing this.

Part of my hope for standardizing the NEL error codes is that we can get a shared agreement about the type of case that each one represents: expected cases (where the error is really no error, if that's even worth a report), or more unexpected cases where, for example, the server detects the client making a specific protocol violation and closes the connection with that code.

Today, it's probably an exercise in spelunking through Chromium source code and trying to reverse engineer the conditions. It would be good to have codes that are more expressive of a general problem area, which can motivate client or server operators to do some targeted analysis.

@neilstuartcraig can you share any data about how many instances of these errors you see?

@neilstuartcraig
Contributor

@LPardue yep, no problem. Hopefully this'll make sense:

We have a sample rate (failure_fraction) set at 5%. We have h2 enabled globally, and h3 enabled on our CDN, which serves everywhere outside the UK as BAU.
Typically, on a normal day, we'll serve in the region of 300-350M web pages on www.bbc.co.uk & www.bbc.com.

Looking at our NEL data for www.bbc.co.uk and www.bbc.com, we see roughly:

  • 3000-5000 h2.ping_failed per day
  • Single-digit (0-10) h2.protocol.error per day
  • 1000-2000 h3.protocol.error per day
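
Since reports are sampled, these figures undercount the underlying events. The sketch below scales them back up, assuming the counts above are reports received and that failure_fraction is the only sampling step. It ignores clients that do not implement NEL and reports that are never delivered, so the results are at best a loose lower bound. `estimate_events` is a hypothetical helper, not part of any NEL tooling.

```python
def estimate_events(reports_per_day: int, failure_fraction: float) -> int:
    """Scale a sampled report count up to an estimated number of failures."""
    if not 0 < failure_fraction <= 1:
        raise ValueError("failure_fraction must be in (0, 1]")
    return round(reports_per_day / failure_fraction)


FAILURE_FRACTION = 0.05  # the 5% sample rate quoted above

# Daily report ranges quoted above, as (low, high).
daily_reports = {
    "h2.ping_failed": (3000, 5000),
    "h2.protocol.error": (0, 10),
    "h3.protocol.error": (1000, 2000),
}

for nel_type, (low, high) in daily_reports.items():
    est_low = estimate_events(low, FAILURE_FRACTION)
    est_high = estimate_events(high, FAILURE_FRACTION)
    print(f"{nel_type}: roughly {est_low}-{est_high} estimated events per day")
```

Under those assumptions, 3000-5000 sampled h2.ping_failed reports would correspond to roughly 60,000-100,000 events per day, and 1000-2000 h3.protocol.error reports to roughly 20,000-40,000.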

I wonder if the difference in h2 and h3 protocol error reports is down to the level of implementation maturity and/or complexity in those. Definitely highlights to me that we'd like to know more about what's going on so we can discuss with our CDN vendor.

Let me know if you need anything more, we have good data which is easy to access and I'm keen for this so happy to contribute.

@LPardue
Author

LPardue commented May 4, 2023

Thanks Neil. I took a very brief look at our data and I can't share numbers but what I do seem to observe is that

  • h2.ping_failed seems to correlate with TCP issues (only correlation, not causation)
  • h2.protocol.error seems to happen in the connection phase, whereas I would expect it to happen in the application phase
  • h3.protocol.error also has the connection phase issue. It's not clear if that means an error establishing a QUIC session, or an error in the HTTP/3 layer.

Based on this brief analysis, h2.ping_failed is potentially actionable (depending on who you are, it could highlight a network problem that could be addressed by contacting some NOC). However, if it is just a different format for articulating TCP connectivity issues, then it's duplicative and a distraction. Maybe someone on the client side can speak to what it means when this error happens.

Drilling into the actual h2 or h3 errors seems useful in order to separate benign issues from real problems. We should also probably consider breaking out QUIC transport errors separately, so that HTTP/3 errors are clearly errors in that layer, not anything else.
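
As a rough illustration of that split, the sketch below uses the QUIC CONNECTION_CLOSE frame type to decide which layer an error code belongs to (0x1c carries a transport error code, 0x1d an application one, per RFC 9000). The code tables are a small subset of those in RFC 9000 and RFC 9114. The report type strings it produces (`quic.transport.*`, `h3.*`), the `classify_close` function, and the choice of which codes count as benign are all assumptions for illustration; none of them are defined by NEL today. Stream-level resets are not covered.

```python
from typing import NamedTuple

# QUIC CONNECTION_CLOSE frame types (RFC 9000).
TRANSPORT_CLOSE = 0x1C    # error code is a QUIC transport error
APPLICATION_CLOSE = 0x1D  # error code belongs to the application protocol (HTTP/3)

# Subset of QUIC transport error codes (RFC 9000).
QUIC_TRANSPORT_ERRORS = {
    0x00: "NO_ERROR",
    0x01: "INTERNAL_ERROR",
    0x03: "FLOW_CONTROL_ERROR",
    0x06: "FINAL_SIZE_ERROR",
    0x0A: "PROTOCOL_VIOLATION",
}

# Subset of HTTP/3 error codes (RFC 9114).
H3_ERRORS = {
    0x0100: "H3_NO_ERROR",
    0x0101: "H3_GENERAL_PROTOCOL_ERROR",
    0x0102: "H3_INTERNAL_ERROR",
    0x010B: "H3_REQUEST_REJECTED",
    0x010C: "H3_REQUEST_CANCELLED",
}

# Assumption: only the explicit "no error" codes are treated as expected.
ASSUMED_BENIGN = {"NO_ERROR", "H3_NO_ERROR"}


class Classified(NamedTuple):
    nel_type: str  # hypothetical report type string
    benign: bool


def classify_close(frame_type: int, error_code: int) -> Classified:
    """Map a connection close to a layer-specific, hypothetical report type."""
    if frame_type == TRANSPORT_CLOSE:
        prefix, table = "quic.transport", QUIC_TRANSPORT_ERRORS
    elif frame_type == APPLICATION_CLOSE:
        prefix, table = "h3", H3_ERRORS
    else:
        raise ValueError(f"not a CONNECTION_CLOSE frame type: {frame_type:#x}")
    name = table.get(error_code, f"unknown_0x{error_code:x}")
    suffix = name.lower()
    if suffix.startswith("h3_"):
        suffix = suffix[3:]  # avoid a doubled "h3.h3_" prefix
    return Classified(f"{prefix}.{suffix}", name in ASSUMED_BENIGN)


print(classify_close(TRANSPORT_CLOSE, 0x06))
print(classify_close(APPLICATION_CLOSE, 0x0100))
```

With a split like this, the FINAL_SIZE_ERROR case mentioned earlier in the thread would surface as a transport-layer report instead of being folded into a generic h3.protocol.error.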

@enygren

enygren commented Jul 20, 2024

This might fit well into the HappyEyeballsV3 discussions starting up in the IETF.
(e.g., https://datatracker.ietf.org/doc/html/draft-pauly-v6ops-happy-eyeballs-v3-01)
That may be a broader issue of which this is one part.

@bashi

bashi commented Sep 12, 2024

In the past I tried to improve HTTP/2 and HTTP/3 errors in Chromium (https://chromium-review.googlesource.com/c/chromium/src/+/5400899), and now I'm trying to implement HappyEyeballs v3 in Chromium.

I agree that this could be a part of HEv3 reporting discussion (#175, #176).


6 participants