Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow dashes in comments #1356

Merged
merged 3 commits into from
Jun 20, 2016
Merged

Allow dashes in comments #1356

merged 3 commits into from
Jun 20, 2016

Conversation

zcorpan
Copy link
Member

@zcorpan zcorpan commented May 31, 2016

  • Make dashes in comments no longer be parse errors

    The spec disallowed dashes in comments in order to have conformance
    checkers point out the problem for pre-2006 Firefox that implemented
    SGML-style comment parsing, and to match the syntax requirements in
    HTML4 and XML. These are no longer relevant and instead people are
    getting unhelpful error messages when validating for markup like:

    </div><!-- end .foo-bar--baz -->
    • A comment that ends with a dash is now allowed: <!-- --->
    • A comment that contains -- is also allowed: <!-- -- -->
    • A comment that contains --! is also allowed: <!-- --! -->
    • The following are still parse errors: <!-->, <!--->, <!-- --!>

    Fixes Remove parse error for dashes in comments <!-- -- --> #1351.

    Also remove obsolete comments in the spec's source about rewinding
    when hitting EOF (see [Tokenization] Question about the security comment in comment end state #1058).

  • Editorial: use more 'reconsume' in comment states
    This makes it easier to add new states to introduce a parse error
    for <!-- <!-- -->.

  • Make nested comments a parse error
    Nesting comments like <!--<!---->--> is an authoring mistake since
    the comment will end on the first -->. Now that -- in comments is
    allowed, we need new states to make "<!--" in a comment a parse
    error.


cc @sideshowbarker @hsivonen @RReverser @inikulin @gsnedders

@zcorpan
Copy link
Member Author

zcorpan commented May 31, 2016

Stuff to convert to test cases:

errors
<!----!>
<!----!
<!----
<!--->
<!-----
<!-->
<!--
<!--x
<!--<
<!--<!
<!--<!-
<!--<!--
<!--<!--!
<!--<!--!>
<!--<!---
<!--<!--->
<!--<!--x
<!--<!--x-
<!--<!--x--
<!--<!--x-->
<!--<!-x
<!--<!-x-
<!--<!-x--
<!--<!x
<!--<!x-
<!--<!x--
<!--<<!--x-->
<!--<!<!--x-->
<!--<!-<!--x-->
<!----!->
<!----!x>
<!-----x>


no errors
<!---->
<!----!-->
<!----!x-->
<!----->
<!-----x-->
<!--x-->
<!--<!-x-->
<!--<!x-->
<!--<<!x-->
<!--<<!-x-->
<!--<x-->
<!--<>-->
<!--<-->
<!--<--->
<!--<!-->

(And also a variant of each test that has an "x" where the "x" is replaced with U+0000, and those are parse errors.)

(Edit: moved <!--<!--> to no errors per #1456 )

@kosek
Copy link

kosek commented May 31, 2016

How this could be reserialized into XHTML? It just introduces additional discrepancy between HTML and XML syntaxes of HTML5.

@zcorpan
Copy link
Member Author

zcorpan commented May 31, 2016

https://html.spec.whatwg.org/multipage/xhtml.html#xml-fragment-serialisation-algorithm says to throw InvalidStateError. Happy to tweak those rules though if there is impl interest.

https://html.spec.whatwg.org/multipage/syntax.html#coercing-an-html-dom-into-an-infoset has suggestions for how to deal with that situation outside a browser.

@zcorpan
Copy link
Member Author

zcorpan commented May 31, 2016

Also need to update this bit

Errors that can result in infoset coercion
When a user agent based on XML is connected to an HTML parser, it is possible that certain invariants that XML enforces, such as comments never containing two consecutive hyphens, will be violated by an HTML file. Handling this can require that the parser coerce the HTML DOM into an XML-compatible infoset. Most syntax constructs that require such handling are considered invalid.

https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors

(Edit: now fixed.)

@zcorpan
Copy link
Member Author

zcorpan commented May 31, 2016

<!--<!-->

This was previously without parse error, but with the new spec it is a parse error. I think that's OK.

I also realize now that the syntax section needs to disallow the comment data ending with <! or <!-. (Edit: now fixed.)

@RReverser
Copy link
Member

Looks interesting to me, will take a more detailed look tomorrow when less sleepy :)

The spec disallowed dashes in comments in order to have conformance
checkers point out the problem for pre-2006 Firefox that implemented
SGML-style comment parsing, and to match the syntax requirements in
HTML4 and XML. These are no longer relevant and instead people are
getting unhelpful error messages when validating for markup like:

</div><!-- end .foo-bar--baz -->

* A comment that ends with a dash is now allowed: <!-- --->
* A comment that contains -- is also allowed: <!-- -- -->
* A comment that contains --! is also allowed: <!-- --! -->
* The following are still parse errors: <!-->, <!--->, <!-- --!>

Fixes #1351.

Also remove obsolete comments in the spec's source about rewinding
when hitting EOF (see #1058).
This makes it easier to add new states to introduce a parse error
for <!-- <!-- -->.
Nesting comments like <!--<!---->--> is an authoring mistake since
the comment will end on the first -->. Now that -- in comments is
allowed, we need new states to make "<!--" in a comment a parse
error.
@hsivonen
Copy link
Member

hsivonen commented Jun 1, 2016

Making more things that can't be serialized in XHTML conforming in HTML makes me sad. I do realize that serializability to XHTML is of interest to fewer people than not getting errors for consecutive dashes in comments.

Still, I wish this change wasn't made and in general I wish we didn't tweak stuff like this anymore.

@RReverser
Copy link
Member

@kosek @hsivonen Note that reserialization of the parsed HTML even into HTML already doesn't result in the same structure (see #1280). Those particular changes just change parse errors, but similar content was already allowed, and in 1.8 spec notes explicitly:

The DOM, the HTML syntax, and the XHTML syntax cannot all represent the same content. ... Comments that contain the string "-->" can only be represented in the DOM, not in the HTML and XHTML syntaxes.

So when serializing comments back, you probably already need to sanitize dashes inside.

@kosek
Copy link

kosek commented Jun 1, 2016

@hsivonen I agree that changes to parsing rules should be very rare now. This is something what should be realy stable. Change should be supported by really strong use-case, and allowing two dashes is not IMHO strong use-case.

@zcorpan Strong use-case could be allowing nested comments. This is something what is really missing in both HTML and XML. In XML it will never be added. But this could be added to HTML and this would be of benefit to users. But change to parser algorithm would be much more distrupting, of course.

@RReverser
Copy link
Member

Change should be supported by really strong use-case, and allowing two dashes is not IMHO strong use-case.

@kosek Note that this is not a change to the parsing algorithm in any way - this was already allowed and correctly parsed by any HTML5 compliant implementation; this just removes parsing error for validators which is obsolete due to the reasons outlined above; outside of the validators, everything works as earlier.

@zcorpan
Copy link
Member Author

zcorpan commented Jun 1, 2016

@kosek adding support for nested comments would be a change to the parsed tree, which most likely would be breaking Web content. So we can't do that.

@hsivonen OK. It is indeed a tradeoff between needing to do infoset coercion even for conforming HTML syntax when using an XML pipeline, and having conformance checkers not emit messages that are useless for the vast majority of authors (only have anecdotal evidence but from Twitter responses and on stackoverflow, it seems people generally don't understand why -- in comments is not allowed in the first place).

I am under the assumption that when you have a tool or UA with an XML pipeline with an HTML parser in one end, you will in general want to accept broken HTML also, which can already have dashes in comments and so you need to do infoset coercion anyway. But I could be wrong about this of course. Are there examples where this is not the case?

@hsivonen
Copy link
Member

hsivonen commented Jun 1, 2016

@RReverser, as you say, #1280 was already a conformance error. In general, it's sad to have non-round-trippable conforming documents. We should treat such cases as spec bugs if feasible (i.e. make those cases non-conforming if feasible).

@zcorpan, yes, it's likely that HTML consumers that use an XML pipeline internally would want to do infoset coercion regardless of the conformance status of the input.

@inikulin
Copy link
Member

inikulin commented Jun 1, 2016

I like the change. The bad part is that we need to introduce 3 new state just to support parse error (about which AFAIK most implementers don't care). But I guess it's the only way to go, so lgtm.

@zcorpan
Copy link
Member Author

zcorpan commented Jun 1, 2016

An implementation that doesn't care about parse errors can indeed optimize away the new states (and a bunch of other states elsewhere in the parser, too).

@domenic
Copy link
Member

domenic commented Jun 1, 2016

I am not terribly involved in the finer points of document conformance, and so happy to defer to @zcorpan and others as the experts. But as an editor my general thoughts are:

  • We should feel free to change document conformance requirements, even ones involving parser states and such, in the service of making authors' lives better by giving them less spurious errors. That is, the fact that this is parser-related should not make us more hesitant to change it.
  • Thus the deciding factor should be whether we think this error is spurious or important.
  • We should make a principled stand on whether it's important to defining "conforming HTML" such that conforming HTML can be round-tripped into serialized XHTML.
    • As @hsivonen pointed out, in general HTML -> XHTML conversion tools will not actually use the conformance requirements to reject non-conforming documents, so this does not simplify their operation.
    • I guess the hypothetical scenario here is someone consuming other peoples' HTML (not XHTML) documents, who cares about round-trippability to XHTML, but doesn't care about non-conformant HTML documents?
    • In general that population seems likely very small to me so I am not sure we should take a principled stand in favor of their use cases.

So I am in favor of this change, although again I defer to those more involved than I.

@zcorpan
Copy link
Member Author

zcorpan commented Jun 6, 2016

As a general rule, I think anything that can't be serialized to XML should also not be allowed in the HTML syntax, because compat with XML tools is sometimes useful and aligning with XML's rules typically doesn't hurt (e.g. restricting the allowed characters in element names or attribute names). But in this case it appears that this rule hurts more than it helps, so I think we should deviate from it.

I'd like to wait for @sideshowbarker to have an experimental implementation before merging this though.

zcorpan referenced this pull request in validator/htmlparser Jun 10, 2016
This is experiment and may be reverted later.
@sideshowbarker
Copy link
Contributor

https://checker.html5.org/ now has experimental support conforming to the spec text in this PR branch.

Specifically https://checker.html5.org/ behaves as expected for all the test cases in #1356 (comment).

errors
<!----!>

For that case https://checker.html5.org/ emits a new specific error:

Error: --!> found at end of comment (should just be -->).

Suggestions welcome for any improvements to error-message wording.

The error-message wording for the <!---in-comment case is:

Saw <!-- within a comment. Probable cause: Nested comment (not allowed).

Suggestions for any improvements to that wording also welcome.

@sideshowbarker
Copy link
Contributor

https://checker.html5.org/parsetree/ can be used to view a serialization of the resulting parse tree for, e.g., any of the cases in #1356 (comment).

@zcorpan
Copy link
Member Author

zcorpan commented Jun 14, 2016

Nice work @sideshowbarker! I've confirmed the errors/no errors are reported as expected.

I found a bug in the parsed tree:

https://checker.html5.org/parsetree/?parser=html5&content=%3C%21--%3C%21-&submit=Print+Tree

vs.

http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4265

i.e. the comment data is <!- instead of <!.

sideshowbarker added a commit to validator/validator that referenced this pull request Jun 14, 2016
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Feb 6, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Mar 16, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Mar 16, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Mar 16, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Nov 16, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Dec 15, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Dec 17, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Dec 18, 2018
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jan 15, 2019
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Apr 28, 2019
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jul 5, 2019
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Nov 26, 2019
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jan 15, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Feb 12, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Feb 13, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Mar 4, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Apr 23, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Apr 24, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Apr 25, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jun 18, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
hsivonen pushed a commit to hsivonen/gecko that referenced this pull request Jul 3, 2020
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jul 4, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Jul 4, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Aug 3, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Aug 5, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Aug 10, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
sideshowbarker added a commit to validator/htmlparser that referenced this pull request Aug 21, 2020
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Aug 16, 2021
hsivonen pushed a commit to validator/htmlparser that referenced this pull request Aug 17, 2021
Also allow `<!-->` at (IE conditional) comment end

See whatwg/html#1356
See whatwg/html#1456
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants