DTD ENTITY definitions can themselves have entities in them #103

Closed
bucklereed opened this issue Jun 8, 2017 · 4 comments · Fixed by #161

Comments

@bucklereed

Prelude Text.XML> parseText def "<!DOCTYPE foo [<!ENTITY A \"&#65;\" >]><foo>&A;</foo>"
Right (Document {documentPrologue = Prologue {prologueBefore = [], prologueDoctype = Just (Doctype {doctypeName = "foo", doctypeID = Nothing}), prologueAfter = []}, documentRoot = Element {elementName = Name {nameLocalName = "foo", nameNamespace = Nothing, namePrefix = Nothing}, elementAttributes = fromList [], elementNodes = [NodeContent "&#65;"]}, documentEpilogue = []})

Note the NodeContent; when this is rendered, it becomes &amp;#65;, rather than A.

Also note that entities can reference other entities, which is the root of the infamous 'billion laughs' attack; here be dragons. Character entities are safe, though.

This might not be worth supporting properly, but it should definitely explicitly error out rather than producing garbage.

@bucklereed
Author

Even just supporting character entities in entities gets crazy, because they are expanded once at the definition site and then again when the entity is substituted in: see appendix D of the XML spec for an example. Considering that apparently no-one else has tripped up on the broken entity handling yet, I think that this is almost certainly not worth supporting properly.

Also, an entity's replacement text can itself be parsed as markup (!), which is also unsupported, and nothing detects when that would happen.
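To make the two-stage expansion above concrete, here is a minimal sketch in plain Haskell. It is not xml-conduit code; the names `expandCharRefs`, `replacementText` and `included` are made up for illustration, and it only handles character references (no general entities, no markup in replacement text). The input `&#38;#38;` is the literal the XML spec uses when it declares the predefined `amp` entity.

```haskell
import Data.Char (chr)
import Numeric (readDec, readHex)

-- Decode the part between "&#" and ";" (hypothetical helper).
-- Only checks the upper code point bound, not full XML Char validity.
decode :: String -> Maybe Char
decode ('x' : hex)
  | [(n, "")] <- readHex hex, n <= (0x10FFFF :: Integer) = Just (chr (fromInteger n))
decode dec
  | [(n, "")] <- readDec dec, n <= (0x10FFFF :: Integer) = Just (chr (fromInteger n))
decode _ = Nothing

-- One pass: replace well-formed character references, leave everything
-- else (including general-entity references like "&foo;") untouched.
expandCharRefs :: String -> String
expandCharRefs [] = []
expandCharRefs ('&' : '#' : rest) =
  case break (== ';') rest of
    (digits, ';' : after)
      | Just c <- decode digits -> c : expandCharRefs after
    _ -> '&' : '#' : expandCharRefs rest
expandCharRefs (c : cs) = c : expandCharRefs cs

-- Stage 1: what gets stored when the <!ENTITY ...> declaration is read.
replacementText :: String -> String
replacementText = expandCharRefs

-- Stage 2: the stored text is parsed again where the entity is referenced.
included :: String -> String
included = expandCharRefs . replacementText
```

With this sketch, `replacementText "&#38;#38;"` gives `"&#38;"` and `included "&#38;#38;"` gives `"&"`, while the `"&#65;"` from the original report comes out as `"A"` after stage 1 already.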

@jgm
Contributor

jgm commented Feb 24, 2021

I've tripped on this just now... KDE XML syntax definitions have entities in the DOCTYPE that use numeric character references.
It is absolutely worth supporting!

@jgm
Contributor

jgm commented Feb 24, 2021

Here's an example from scheme.xml:

<!DOCTYPE language SYSTEM "language.dtd"
[
  <!ENTITY xmlattrs "\s+([^&quot;/>]++|&quot;[^&quot;]*+&quot;)*+">
  <!ENTITY tab      "&#009;">
  <!ENTITY regex    "(?:[^\\(\[/]++|\\.|\[\^?\]?([^\\\[\]]++|\\.|\[(:[^:]+:\])?)++\]|\((?R)\))+">

  <!ENTITY initial_ascii_set "a-zA-Z!$&#37;&amp;*/:&lt;=&gt;?~_^">
  <!ENTITY initial_unicode_set "\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Mn}\p{Nl}\p{No}\p{Pd}\p{Pc}\p{Po}\p{Sc}\p{Sm}\p{Sk}\p{So}\p{Co}">
  <!ENTITY initial_others "\\x[0-9a-fA-F]++;|(?![\x01-\x7f])[&initial_unicode_set;]">
  <!ENTITY initial "(?:[&initial_ascii_set;]|&initial_others;)">
  <!ENTITY subsequent "(?:[&initial_ascii_set;0-9-@.+\p{Nd}\p{Mc}\p{Me}]|&initial_others;)">
  <!ENTITY symbol "(?:&initial;&subsequent;*+)">
]>

We have the numeric character reference &#009;, the predefined &quot;, and also entities that are only defined in this very block, like &initial_unicode_set;.

@jgm
Contributor

jgm commented Feb 24, 2021

If you're worried about malicious recursive expansions, you can just put some small finite limit on recursive entity expansion (say 5).
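A rough sketch of what such a limit could look like, in plain Haskell and independent of xml-conduit. The `Entities` type, `expand`, and the error strings are all hypothetical. It also simplifies the spec: character references and the five predefined entities are turned into literal characters in a single pass and are not re-scanned, and replacement text is treated as plain text, never as markup.

```haskell
import Data.Char (chr)
import Numeric (readDec, readHex)

-- Hypothetical entity table: name -> raw text from the <!ENTITY ...> declaration.
type Entities = [(String, String)]

-- The five entities every XML processor must know, as literal characters.
predefined :: [(String, Char)]
predefined = [("amp", '&'), ("lt", '<'), ("gt", '>'), ("quot", '"'), ("apos", '\'')]

-- Only checks the upper code point bound, not full XML Char validity.
charRef :: Integer -> Either String Char
charRef n
  | n <= 0x10FFFF = Right (chr (fromInteger n))
  | otherwise = Left "character reference out of range"

-- Expand every reference in the input, allowing at most 'limit' levels of
-- nested general-entity expansion. Fails with Left instead of emitting
-- unexpanded references.
expand :: Int -> Entities -> String -> Either String String
expand limit ents = go limit
  where
    go _ [] = Right []
    go d ('&' : rest) =
      case break (== ';') rest of
        (name, ';' : after) -> do
          piece <- resolve d name
          (piece ++) <$> go d after
        _ -> Left "unterminated reference"
    go d (c : cs) = (c :) <$> go d cs

    resolve d name = case name of
      '#' : 'x' : hex | [(n, "")] <- readHex hex -> pure <$> charRef n
      '#' : dec | [(n, "")] <- readDec dec -> pure <$> charRef n
      '#' : _ -> Left ("malformed character reference &" ++ name ++ ";")
      _ | Just c <- lookup name predefined -> Right [c]
        | d <= 0 -> Left ("expansion limit exceeded at &" ++ name ++ ";")
        | otherwise -> case lookup name ents of
            Nothing -> Left ("unknown entity &" ++ name ++ ";")
            Just body -> go (d - 1) body
```

With a limit of 5, the `&symbol;` entity from scheme.xml fits (it nests four levels deep), and a self-referencing entity comes back as a `Left` rather than looping. A depth limit only bounds nesting, not the total size of the output, so a cap on the expanded length would probably be wanted as well.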
