Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node#line heuristic under JRuby is hacky and should be smarter #1223

Closed
camertron opened this issue Jan 16, 2015 · 6 comments
Closed

Node#line heuristic under JRuby is hacky and should be smarter #1223

camertron opened this issue Jan 16, 2015 · 6 comments

Comments

@camertron
Copy link

With MRI 2.1.5, line numbers are correctly reported. With any recent version of JRuby, they aren't:

$> rbenv local ruby-2.1.5
$> bundle console
irb> require 'nokogiri'
irb> puts Gem.loaded_specs['nokogiri'].version
1.6.5
irb> doc = Nokogiri::XML('<root><foo>bar</foo>\n\n<foo>baz</foo></root>')
irb> root = doc.xpath('/root').first
irb> root.child.first.line
1
irb> root.child.last.line
3

And then:

$> rbenv local jruby-1.7.15
$> bundle console
irb> require 'nokogiri'
irb> puts Gem.loaded_specs['nokogiri'].version
1.6.5
irb> doc = Nokogiri::XML('<root><foo>bar</foo>\n\n<foo>baz</foo></root>')
irb> root = doc.xpath('/root').first
irb> root.child.first.line
1
irb> root.child.last.line
1

As you can see, under JRuby nokogiri incorrectly returns 1 for the line number of the second child under root.

@flavorjones
Copy link
Member

Hello @camertron,

Thanks for reporting this! I've captured this in a slightly different repro script:

#! /usr/bin/env ruby

require 'nokogiri'

doc = Nokogiri::XML <<EOXML
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
EOXML
puts doc.to_xml
doc.css("foo").each { |node| puts node.line }

MRI outputs:

<?xml version="1.0"?>
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
2
5

JRuby outputs:

<?xml version="1.0"?>
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
2
4

As you can see, when compared to your script, JRuby appears to use a different heuristic when reporting the line number than MRI does. This is likely due to the underlying implementations being different XML libraries (libxml2 for MRI, and xerces for JRuby).

Indeed, looking at the code, in MRI we call a libxml2 method to get the internal line number from the parser:

https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L1247

And in JRuby, we look at a counter maintained in the JRuby extension (not the Xerces parser):

https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/XmlNode.java#L1430-L1452

Part of the issue here is that the w3c has not defined for what serialization a Node's "line number" might correlate:

http://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Node.html

You can output an XML doc in a couple of different formats (e.g., ignoring whitespace, canonicalized, as-parsed), so which "line number" would we mean?

Libxml2 stores the line number for each node as-parsed:

http://xmlsoft.org/html/libxml-tree.html#xmlNode

that is, "on what line of the original document was this element encountered?". Xerces doesn't have similar functionality, and so you get the kludgey calculation that I linked to above -- it's a "best guess", and in most cases it's wrong, unfortunately.

I think what we might be able to do is implement startDocument() to store off the XMLLocator, then implement startElement() to query the element's line number from the locator and store it in a hash table for later use by the #line method. But I don't have a ton of experience with Xerces, so I might be totally off base, or there might be a more-clever solution.

@jvshahid or @yokolet -- does this strategy seem sound to you?

I'm also going to rename this issue to reflect the actual work that needs to be done.

@flavorjones flavorjones changed the title Line numbers are incorrect under JRuby Node#line heuristic under JRuby is hacky and should be smarter Jan 20, 2015
@camertron
Copy link
Author

Hey @flavorjones thanks for the awesome and thorough diagnosis! Glad to know the issue is getting some attention :) Let me know if I can be of any help.

camertron added a commit to rosette-proj/rosette-extractor-xml that referenced this issue Feb 19, 2015
@camertron
Copy link
Author

Any update on this, @flavorjones?

flavorjones added a commit that referenced this issue Sep 30, 2015
@flavorjones
Copy link
Member

Started a proof-of-concept branch at flavorjones-1223-jruby-line-numbers

@flavorjones
Copy link
Member

flavorjones commented Jan 18, 2021

Ok, in the interest of closing this out, I dug in today. The approach I started with is:

commit f435981 (origin/flavorjones-1223-jruby-line-numbers)
Author: Mike Dalessio <[email protected]>
Date:   2015-09-30 12:20:04 -0400

    WIP for properly tracking line numbers in JRuby.
    
    Related to #1223.

diff --git a/ext/java/nokogiri/internals/NokogiriDomParser.java b/ext/java/nokogiri/internals/NokogiriDomParser.java
index cefae5e..550ab93 100644
--- a/ext/java/nokogiri/internals/NokogiriDomParser.java
+++ b/ext/java/nokogiri/internals/NokogiriDomParser.java
@@ -39,6 +39,12 @@ import nokogiri.XmlDocument;
 import org.apache.xerces.parsers.DOMParser;
 import org.apache.xerces.parsers.XIncludeParserConfiguration;
 import org.apache.xerces.xni.parser.XMLParserConfiguration;
+import org.apache.xerces.xni.XMLLocator;
+import org.apache.xerces.xni.NamespaceContext;
+import org.apache.xerces.xni.Augmentations;
+import org.apache.xerces.xni.QName;
+import org.apache.xerces.xni.XMLAttributes;
+import org.apache.xerces.xni.Augmentations;
 import org.cyberneko.dtd.DTDConfiguration;
 import org.w3c.dom.Document;
 import org.xml.sax.Attributes;
@@ -57,6 +63,7 @@ public class NokogiriDomParser extends DOMParser {
     protected DOMParser dtd;
     protected boolean xInclude;
     protected XMLParserConfiguration config;
+    protected XMLLocator locator;
 
     public NokogiriDomParser(XMLParserConfiguration config) {
         super(config);
@@ -113,4 +120,21 @@ public class NokogiriDomParser extends DOMParser {
             return source;
         }
     }
+
+    @Override
+    public void startDocument(XMLLocator locator,
+                              String encoding,
+                              NamespaceContext namespaceContext,
+                              Augmentations augs) {
+        this.locator = locator;
+        super.startDocument(locator, encoding, namespaceContext, augs);
+    }
+
+    @Override
+    public void startElement(QName element,
+                             XMLAttributes attributes,
+                             Augmentations augs) {
+        System.out.println(element + ": " + locator.getLineNumber());
+        super.startElement(element, attributes, augs);
+    }
 }

but here I got stuck, because I don't know of any way to link the locator metadata with the final Node in the parsed Document. If someone can help me out, I'm all ears.

So I'm falling back on simply tweaking the existing algorithm to handle newlines better.

flavorjones added a commit that referenced this issue Jan 18, 2021
Note that Java's w3c.dom.Node interface does not have any way to
get (or set) the line number for that node. The original JRuby
implementation counted newlines, but did it poorly -- see #1223.

This commit improves the newline-counting approach. But the solution
itself -- counting newlines! -- is still questionable IMHO and
absolutely inefficient.

I had played around with an approach that I wrote about at #1223,
where the SAX Parser knows what line it's on when `startElement` is
invoked (via the XMLLocator). But I couldn't figure out how to
preserve that information in the final Document or Node.

If you, like me, think this approach is terrible; and if you *also*
understand how to set this metadata on the Node or the Document, then
please help us out.
flavorjones added a commit that referenced this issue Jan 19, 2021
…ne-numbers

(jruby) Fix Node#line

---

**What problem is this PR intended to solve?**

#1223 to make `Node#line` on JRuby smarter.

For background, the Java DOM interfaces don't provide any way for a Node to know or find the line number in the original document at which it appeared.

I'm unhappy with this approach, but was unable to figure out how to make the original approach at #1223 work. If you're familiar with the Java DOM interfaces and want to help, please do.


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java implementations?**

The Java implementation's behavior should be closer to the C implementation's behavior now.
@flavorjones
Copy link
Member

Closed by #2177.

This was referenced Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants