-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Put W3C DOM elements in HTML namespace by default for jhy/jsoup#1837. #1848
Changes from 1 commit
350aadf
89d022d
859b8aa
d45b6e9
a7f9236
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,7 +38,8 @@ | |
import java.util.Map; | ||
import java.util.Properties; | ||
import java.util.Stack; | ||
|
||
import java.util.regex.Matcher; | ||
import java.util.regex.Pattern; | ||
import static javax.xml.transform.OutputKeys.METHOD; | ||
import static org.jsoup.nodes.Document.OutputSettings.Syntax; | ||
|
||
|
@@ -96,6 +97,35 @@ public static Document convert(org.jsoup.nodes.Document in) { | |
return (new W3CDom().fromJsoup(in)); | ||
} | ||
|
||
/** | ||
* Pattern to detect the <code>xmlns="http://www.w3.org/1999/xhtml"</code> default namespace | ||
* declaration when serializing the DOM to HTML. This pattern is "good enough", relying in part | ||
* on the output of the {@link Transformer} used in the implementation, but is not a complete | ||
* solution for all the serializations possible; that is, if one constructed an XML string | ||
* manually, it might be possible to find an obscure variation that this pattern would not | ||
* match. | ||
*/ | ||
static final Pattern HTML_DEFAULT_NAMESPACE_PATTERN = | ||
Pattern.compile("<html[^>]*(\\sxmlns=['\"]http://www.w3.org/1999/xhtml['\"])"); | ||
|
||
/** | ||
* Removes the default <code>xmlns="http://www.w3.org/1999/xhtml"</code> HTML namespace | ||
* declaration if present in the string. | ||
* | ||
* @param html The serialized HTML. | ||
* @return A string without the default <code>xmlns="http://www.w3.org/1999/xhtml"</code> HTML | ||
* namespace declaration. | ||
* @see <a href="https://github.com/jhy/jsoup/issues/1837">Issue #1837: Bug: DOM elements not | ||
* being placed in (X)HTML namespace.</a> | ||
*/ | ||
static String removeDefaultHtmlNamespaceDeclaration(String html) { | ||
Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html); | ||
if (matcher.find()) { | ||
html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1)); | ||
} | ||
return html; | ||
} | ||
|
||
/** | ||
* Serialize a W3C document to a String. Provide Properties to define output settings including if HTML or XML. If | ||
* you don't provide the properties ({@code null}), the output will be auto-detected based on the content of the | ||
|
@@ -140,7 +170,16 @@ else if (doctype.getName().equalsIgnoreCase("html") | |
} | ||
|
||
transformer.transform(domSource, result); | ||
return writer.toString(); | ||
String resultString = writer.toString(); | ||
String outputMethod = properties != null ? properties.get(METHOD) : null; | ||
// Remove any default xmlns="http://www.w3.org/1999/xhtml" namespace declaration, | ||
// but only if it was added during serialization (not if the document already had it). | ||
// Monitor https://stackoverflow.com/q/73919306 for a better approach. | ||
if ((outputMethod == null || outputMethod.equals("html")) //only for HTML output | ||
&& !doc.getDocumentElement().hasAttribute("xmlns")) { | ||
resultString = removeDefaultHtmlNamespaceDeclaration(resultString); | ||
} | ||
return resultString; | ||
|
||
} catch (TransformerException e) { | ||
throw new IllegalStateException(e); | ||
|
@@ -348,6 +387,7 @@ protected static class W3CBuilder implements NodeVisitor { | |
public W3CBuilder(Document doc) { | ||
this.doc = doc; | ||
namespacesStack.push(new HashMap<>()); | ||
namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel that we should only be setting this default namespace if the input document is HTML. If it's XML, then the HTML -> XML compat note doesn't apply, and we shouldn't be applying it. The impl could be something like
Would need to move this to the appropriate I'm not sure if checking the treebuilder is the best way to see if it's HTML vs XML. The output syntax is how we normally check that, but that doesn't tell us particularly if the input was HTML, and people may be setting the syntax to XML before calling these methods. So am leaning towards checking treebuilder. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That makes sense. I don't know enough about your library's API to know when it thinks it's parsing XML. I had assumed it only parsed things considered to be non-XML HTML. (Otherwise I would think you could just use an off-the-shelf XML parser because the input would not be "soup".)
You know best here. All I care about is when I parse an HTML document, the HTML namespace gets imputed as per the space. If the tree builder is always set to
That sounds like a lot of shuffling. I note that the So what about this inside the final org.jsoup.nodes.Document inDoc = contextElement.ownerDocument();
if (inDoc == null || inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder) {
namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document
} (I see that I had forgotten to document the actual namespace imputation line. I'll add that once I get an OK from you on how to proceed.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jhy could you respond here and let me know which way to go, so that I can finish this PR and get it accepted? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
OK, using the contextElement is a good idea. And yes let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up them). I would not assume that an Element without an ownerDocument is HTML though, so we should not add the namespace if it is null. There will be no ownerDoc for Elements constructed via I plan on making this namespace inference a defaulted-on option in the W3C DOM. This will allow existing uses who have working code without a namespace (for e.g. simple xpath queries) a migration path, or for new uses that prefer it otherwise. |
||
dest = doc; | ||
contextElement = (org.jsoup.nodes.Element) doc.getUserData(ContextProperty); // Track the context jsoup Element, so we can save the corresponding w3c element | ||
} | ||
|
@@ -366,9 +406,9 @@ public void head(org.jsoup.nodes.Node source, int depth) { | |
tagname to something safe, because that isn't going to be meaningful downstream. This seems(?) to be | ||
how browsers handle the situation, also. https://github.com/jhy/jsoup/issues/1093 */ | ||
try { | ||
Element el = namespace == null && tagName.contains(":") ? | ||
doc.createElementNS("", tagName) : // doesn't have a real namespace defined | ||
doc.createElementNS(namespace, tagName); | ||
// use an empty namespace if none is present but the tag name has a prefix | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't change any functionality here. The modification makes it clear that the only thing happening is that you're "imputing" a namespace of the empty string under certain conditions — you're not calling a separate method. The original code duplicated the call to |
||
String imputedNamespace = namespace == null && tagName.contains(":") ? "" : namespace; | ||
Element el = doc.createElementNS(imputedNamespace, tagName); | ||
copyAttributes(sourceEl, el); | ||
append(el, sourceEl); | ||
if (sourceEl == contextElement) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to leave the introduced namespace in the output. It makes the output XML clearer as to what's happened, and may help other downstream parses. So the regex components etc can be removed.
It's part of the API.
I consider jsoup to be also about pretty-printing. And we do introduce other tokens throughout jsoup as a convenience (e.g. xml declarations, meta charsets, the HTML document structure, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my purposes I don't care one way or another, and it's certainly easier just to leave it in. At the time, due to lack of response to the ticket, I had to make a choice and do what I thought would raise the chance of the PR being requested. (You can see all the time and painful research on Stack Overflow to remove the declaration.) I'll remove the code that removed the declaration in the output.