-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow dashes in comments #1356
Allow dashes in comments #1356
Conversation
Stuff to convert to test cases:
(And also a variant of each test that has an "x" where the "x" is replaced with U+0000, and those are parse errors.) (Edit: moved |
f6bba47
to
3453cb4
Compare
How this could be reserialized into XHTML? It just introduces additional discrepancy between HTML and XML syntaxes of HTML5. |
https://html.spec.whatwg.org/multipage/xhtml.html#xml-fragment-serialisation-algorithm says to throw InvalidStateError. Happy to tweak those rules though if there is impl interest. https://html.spec.whatwg.org/multipage/syntax.html#coercing-an-html-dom-into-an-infoset has suggestions for how to deal with that situation outside a browser. |
Also need to update this bit
https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors (Edit: now fixed.) |
This was previously without parse error, but with the new spec it is a parse error. I think that's OK. I also realize now that the syntax section needs to disallow the comment data ending with |
3453cb4
to
6600a27
Compare
Looks interesting to me, will take a more detailed look tomorrow when less sleepy :) |
The spec disallowed dashes in comments in order to have conformance checkers point out the problem for pre-2006 Firefox that implemented SGML-style comment parsing, and to match the syntax requirements in HTML4 and XML. These are no longer relevant and instead people are getting unhelpful error messages when validating for markup like: </div><!-- end .foo-bar--baz --> * A comment that ends with a dash is now allowed: <!-- ---> * A comment that contains -- is also allowed: <!-- -- --> * A comment that contains --! is also allowed: <!-- --! --> * The following are still parse errors: <!-->, <!--->, <!-- --!> Fixes #1351. Also remove obsolete comments in the spec's source about rewinding when hitting EOF (see #1058).
This makes it easier to add new states to introduce a parse error for <!-- <!-- -->.
Nesting comments like <!--<!---->--> is an authoring mistake since the comment will end on the first -->. Now that -- in comments is allowed, we need new states to make "<!--" in a comment a parse error.
6600a27
to
b4d672b
Compare
Making more things that can't be serialized in XHTML conforming in HTML makes me sad. I do realize that serializability to XHTML is of interest to fewer people than not getting errors for consecutive dashes in comments. Still, I wish this change wasn't made and in general I wish we didn't tweak stuff like this anymore. |
@kosek @hsivonen Note that reserialization of the parsed HTML even into HTML already doesn't result in the same structure (see #1280). Those particular changes just change parse errors, but similar content was already allowed, and in 1.8 spec notes explicitly:
So when serializing comments back, you probably already need to sanitize dashes inside. |
@hsivonen I agree that changes to parsing rules should be very rare now. This is something what should be realy stable. Change should be supported by really strong use-case, and allowing two dashes is not IMHO strong use-case. @zcorpan Strong use-case could be allowing nested comments. This is something what is really missing in both HTML and XML. In XML it will never be added. But this could be added to HTML and this would be of benefit to users. But change to parser algorithm would be much more distrupting, of course. |
@kosek Note that this is not a change to the parsing algorithm in any way - this was already allowed and correctly parsed by any HTML5 compliant implementation; this just removes parsing error for validators which is obsolete due to the reasons outlined above; outside of the validators, everything works as earlier. |
@kosek adding support for nested comments would be a change to the parsed tree, which most likely would be breaking Web content. So we can't do that. @hsivonen OK. It is indeed a tradeoff between needing to do infoset coercion even for conforming HTML syntax when using an XML pipeline, and having conformance checkers not emit messages that are useless for the vast majority of authors (only have anecdotal evidence but from Twitter responses and on stackoverflow, it seems people generally don't understand why I am under the assumption that when you have a tool or UA with an XML pipeline with an HTML parser in one end, you will in general want to accept broken HTML also, which can already have dashes in comments and so you need to do infoset coercion anyway. But I could be wrong about this of course. Are there examples where this is not the case? |
@RReverser, as you say, #1280 was already a conformance error. In general, it's sad to have non-round-trippable conforming documents. We should treat such cases as spec bugs if feasible (i.e. make those cases non-conforming if feasible). @zcorpan, yes, it's likely that HTML consumers that use an XML pipeline internally would want to do infoset coercion regardless of the conformance status of the input. |
I like the change. The bad part is that we need to introduce 3 new state just to support parse error (about which AFAIK most implementers don't care). But I guess it's the only way to go, so lgtm. |
An implementation that doesn't care about parse errors can indeed optimize away the new states (and a bunch of other states elsewhere in the parser, too). |
I am not terribly involved in the finer points of document conformance, and so happy to defer to @zcorpan and others as the experts. But as an editor my general thoughts are:
So I am in favor of this change, although again I defer to those more involved than I. |
As a general rule, I think anything that can't be serialized to XML should also not be allowed in the HTML syntax, because compat with XML tools is sometimes useful and aligning with XML's rules typically doesn't hurt (e.g. restricting the allowed characters in element names or attribute names). But in this case it appears that this rule hurts more than it helps, so I think we should deviate from it. I'd like to wait for @sideshowbarker to have an experimental implementation before merging this though. |
This is experiment and may be reverted later.
https://checker.html5.org/ now has experimental support conforming to the spec text in this PR branch. Specifically https://checker.html5.org/ behaves as expected for all the test cases in #1356 (comment).
For that case https://checker.html5.org/ emits a new specific error:
Suggestions welcome for any improvements to error-message wording. The error-message wording for the
Suggestions for any improvements to that wording also welcome. |
https://checker.html5.org/parsetree/ can be used to view a serialization of the resulting parse tree for, e.g., any of the cases in #1356 (comment). |
Nice work @sideshowbarker! I've confirmed the errors/no errors are reported as expected. I found a bug in the parsed tree: https://checker.html5.org/parsetree/?parser=html5&content=%3C%21--%3C%21-&submit=Print+Tree vs. http://software.hixie.ch/utilities/js/live-dom-viewer/saved/4265 i.e. the comment data is |
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456 Differential Revision: https://phabricator.services.mozilla.com/D82000
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456 Differential Revision: https://phabricator.services.mozilla.com/D82000
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456
Also allow `<!-->` at (IE conditional) comment end See whatwg/html#1356 See whatwg/html#1456 Differential Revision: https://phabricator.services.mozilla.com/D82000
Make dashes in comments no longer be parse errors
The spec disallowed dashes in comments in order to have conformance
checkers point out the problem for pre-2006 Firefox that implemented
SGML-style comment parsing, and to match the syntax requirements in
HTML4 and XML. These are no longer relevant and instead people are
getting unhelpful error messages when validating for markup like:
<!-- --->
--
is also allowed:<!-- -- -->
--!
is also allowed:<!-- --! -->
<!-->
,<!--->
,<!-- --!>
Fixes Remove parse error for dashes in comments <!-- -- --> #1351.
Also remove obsolete comments in the spec's source about rewinding
when hitting EOF (see [Tokenization] Question about the security comment in comment end state #1058).
Editorial: use more 'reconsume' in comment states
This makes it easier to add new states to introduce a parse error
for
<!-- <!-- -->
.Make nested comments a parse error
Nesting comments like
<!--<!---->-->
is an authoring mistake sincethe comment will end on the first
-->
. Now that--
in comments isallowed, we need new states to make "
<!--
" in a comment a parseerror.
cc @sideshowbarker @hsivonen @RReverser @inikulin @gsnedders