-
-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node#line heuristic under JRuby is hacky and should be smarter #1223
Comments
Hello @camertron, Thanks for reporting this! I've captured this in a slightly different repro script: #! /usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::XML <<EOXML
<root>
<foo>
bar
</foo>
<foo>
baz
</foo>
</root>
EOXML
puts doc.to_xml
doc.css("foo").each { |node| puts node.line } MRI outputs:
JRuby outputs:
As you can see, when compared to your script, JRuby appears to use a different heuristic when reporting the line number than MRI does. This is likely due to the underlying implementations being different XML libraries (libxml2 for MRI, and xerces for JRuby). Indeed, looking at the code, in MRI we call a libxml2 method to get the internal line number from the parser:
And in JRuby, we look at a counter maintained in the JRuby extension (not the Xerces parser): Part of the issue here is that the w3c has not defined for what serialization a Node's "line number" might correlate:
You can output an XML doc in a couple of different formats (e.g., ignoring whitespace, canonicalized, as-parsed), so which "line number" would we mean? Libxml2 stores the line number for each node as-parsed: that is, "on what line of the original document was this element encountered?". Xerces doesn't have similar functionality, and so you get the kludgey calculation that I linked to above -- it's a "best guess", and in most cases it's wrong, unfortunately. I think what we might be able to do is implement @jvshahid or @yokolet -- does this strategy seem sound to you? I'm also going to rename this issue to reflect the actual work that needs to be done. |
Hey @flavorjones thanks for the awesome and thorough diagnosis! Glad to know the issue is getting some attention :) Let me know if I can be of any help. |
Any update on this, @flavorjones? |
Started a proof-of-concept branch at |
Ok, in the interest of closing this out, I dug in today. The approach I started with is: commit f435981 (origin/flavorjones-1223-jruby-line-numbers)
Author: Mike Dalessio <[email protected]>
Date: 2015-09-30 12:20:04 -0400
WIP for properly tracking line numbers in JRuby.
Related to #1223.
diff --git a/ext/java/nokogiri/internals/NokogiriDomParser.java b/ext/java/nokogiri/internals/NokogiriDomParser.java
index cefae5e..550ab93 100644
--- a/ext/java/nokogiri/internals/NokogiriDomParser.java
+++ b/ext/java/nokogiri/internals/NokogiriDomParser.java
@@ -39,6 +39,12 @@ import nokogiri.XmlDocument;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.parsers.XIncludeParserConfiguration;
import org.apache.xerces.xni.parser.XMLParserConfiguration;
+import org.apache.xerces.xni.XMLLocator;
+import org.apache.xerces.xni.NamespaceContext;
+import org.apache.xerces.xni.Augmentations;
+import org.apache.xerces.xni.QName;
+import org.apache.xerces.xni.XMLAttributes;
+import org.apache.xerces.xni.Augmentations;
import org.cyberneko.dtd.DTDConfiguration;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
@@ -57,6 +63,7 @@ public class NokogiriDomParser extends DOMParser {
protected DOMParser dtd;
protected boolean xInclude;
protected XMLParserConfiguration config;
+ protected XMLLocator locator;
public NokogiriDomParser(XMLParserConfiguration config) {
super(config);
@@ -113,4 +120,21 @@ public class NokogiriDomParser extends DOMParser {
return source;
}
}
+
+ @Override
+ public void startDocument(XMLLocator locator,
+ String encoding,
+ NamespaceContext namespaceContext,
+ Augmentations augs) {
+ this.locator = locator;
+ super.startDocument(locator, encoding, namespaceContext, augs);
+ }
+
+ @Override
+ public void startElement(QName element,
+ XMLAttributes attributes,
+ Augmentations augs) {
+ System.out.println(element + ": " + locator.getLineNumber());
+ super.startElement(element, attributes, augs);
+ }
} but here I got stuck, because I don't know of any way to link the locator metadata with the final Node in the parsed Document. If someone can help me out, I'm all ears. So I'm falling back on simply tweaking the existing algorithm to handle newlines better. |
Note that Java's w3c.dom.Node interface does not have any way to get (or set) the line number for that node. The original JRuby implementation counted newlines, but did it poorly -- see #1223. This commit improves the newline-counting approach. But the solution itself -- counting newlines! -- is still questionable IMHO and absolutely inefficient. I had played around with an approach that I wrote about at #1223, where the SAX Parser knows what line it's on when `startElement` is invoked (via the XMLLocator). But I couldn't figure out how to preserve that information in the final Document or Node. If you, like me, think this approach is terrible; and if you *also* understand how to set this metadata on the Node or the Document, then please help us out.
…ne-numbers (jruby) Fix Node#line --- **What problem is this PR intended to solve?** #1223 to make `Node#line` on JRuby smarter. For background, the Java DOM interfaces don't provide any way for a Node to know or find the line number in the original document at which it appeared. I'm unhappy with this approach, but was unable to figure out how to make the original approach at #1223 work. If you're familiar with the Java DOM interfaces and want to help, please do. **Have you included adequate test coverage?** Yes. **Does this change affect the behavior of either the C or the Java implementations?** The Java implementation's behavior should be closer to the C implementation's behavior now.
Closed by #2177. |
With MRI 2.1.5, line numbers are correctly reported. With any recent version of JRuby, they aren't:
And then:
As you can see, under JRuby nokogiri incorrectly returns
1
for the line number of the second child underroot
.The text was updated successfully, but these errors were encountered: