Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

It is unlikely that the spec's two definitions of valid URL strings are equal #704

Open
domenic opened this issue Sep 2, 2022 · 3 comments
Labels
topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing)

Comments

@domenic
Copy link
Member

domenic commented Sep 2, 2022

(Background note: valid URL strings are completely different from parseable URL strings. See https://url.spec.whatwg.org/#writing and https://url.spec.whatwg.org/#urls , especially the table in the latter, for background.)

The spec currently gives two ways of determining whether an input string is considered "valid":

The latter definition is not precisely specified anywhere; the only thing you can really infer from the spec is that running the parser on some strings sometimes gives as a side effect, one or more validation errors. There's no actual bridging text to a concept like "valid URL string", that I can find.

But I think the intent is for these two to be equivalent.

The problem is, there's absolutely no guarantee this is the case. One instance of a mismatch was found in #437, but I suspect there are many more. As I commented there, I think it would be a fun project to first write an implementation of the parser which tracks validation errors, and then write an implementation of the #url-writing section, and then compare what results those two pieces of code get on a bunch of different (URL string, base URL) inputs. (Sounds like a job for a fuzzer!)

We could solve this in one of two ways:

  • Delete one of the two definitions of valid URL string
  • Make sure they align, and maybe try to maintain that via automated fuzzing somehow.

The latter sounds very silly and redundant to me at first glance. However:

  • The parser-produced validation errors have the nice property that they can be gotten almost for free as a side effect of parsing. (The hardest part is implementing "is a URL code point", which a non-validating parser doesn't need at all.) They also could easily be given names that allow you to track down exactly what went wrong; see Named validation errors #406. Whereas the grammar-ish #url-writing section just gives you a boolean predicate. So that argues for keeping the parser-produced validation errors...

  • On the other hand, Provide a grammar for the URL parser #479 indicates people do seem to like the idea of a short grammar-ish algorithm for determining validity. It's not 100% clear whether they are distinguishing validity from parseability; in particular, the fact that #url-writing has existed forever but people have spent 77 comments complaining about the spec lacking a grammar, indicates maybe #url-writing is not that helpful and what they're really complaining about is how the URL parser is complicated to implement. But I'm not sure.

So, I'm not sure where that really leaves us. But I wanted to log this issue so we had a canonical place to record this unfortunate fact about the current standard.

See also jsdom/whatwg-url#156.

@annevk
Copy link
Member

annevk commented Sep 2, 2022

The HTML Standard does the same thing for the HTML syntax. Mainly because it serves different audiences. I liked that idea and based this on that. I do agree that there is definitely a chance for issues and it would be nice to automatically check them both.

@annevk annevk added the topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) label Sep 2, 2022
@alercah
Copy link

alercah commented Sep 2, 2022

This is an excellent point and really cuts, in a way, to the core of what #479 was originally about. Thank you!

Given that this is a well-explored area of theory, I think we can do better than a fuzzer and use a formally verified implementation of the parser. Then the spec could be arranged as follows, perhaps:

  1. Make a formal grammar defining the valid URLs.
  2. Add a section with a "web algorithm" which parsea URLs and determines validity.
  3. Write an implementation of the algorithm in a language with formal verification (Idris would be my choice but only because or familiarity) that proves equivalence of the two approaches.

There could potentially be other benefits:

  1. Proving the equivalence of various portions of the URL as parsed by the algorithm with the equivalent grammar productions, validating the use of other algorithms to produce the same result, at least on valid input.
  2. Exploring whether there is a more succinct way to describe a "parseable" URL and the corresponding valid one.
  3. With a formally verified implementation validating against a grammar, it would be possible to propose changes to the structure of the algorithm to improve clarity or fix bugs with confidence that they do not have unexpected consequences, which would make the spec easier to maintain.

I do have thoughts on the "named validation errors" issue but I think those belong in the other thread.

@alwinb
Copy link
Contributor

alwinb commented Sep 2, 2022

in particular, the fact that #url-writing has existed forever but people have spent 77 comments complaining about the spec lacking a grammar.

There is a lot of useful information in those comments. People are not just complaining, they are taking the time to think about this and they are providing constructive feedback.

The HTML Standard does the same thing for the HTML syntax. Mainly because it serves different audiences. I liked that idea and based this on that.

I find this really nice. When a specification has both a formal grammar and an algorithm that matches it. And the courage to make the claim that they are equivalent. Multiple perspectives on the same thing help a lot of you want to really understand it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing)
Development

No branches or pull requests

4 participants