Node#line heuristic under JRuby is hacky and should be smarter #1223

camertron · 2015-01-16T21:48:14Z

With MRI 2.1.5, line numbers are correctly reported. With any recent version of JRuby, they aren't:

$> rbenv local ruby-2.1.5
$> bundle console
irb> require 'nokogiri'
irb> puts Gem.loaded_specs['nokogiri'].version
1.6.5
irb> doc = Nokogiri::XML('<root><foo>bar</foo>\n\n<foo>baz</foo></root>')
irb> root = doc.xpath('/root').first
irb> root.child.first.line
1
irb> root.child.last.line
3

And then:

$> rbenv local jruby-1.7.15
$> bundle console
irb> require 'nokogiri'
irb> puts Gem.loaded_specs['nokogiri'].version
1.6.5
irb> doc = Nokogiri::XML('<root><foo>bar</foo>\n\n<foo>baz</foo></root>')
irb> root = doc.xpath('/root').first
irb> root.child.first.line
1
irb> root.child.last.line
1

As you can see, under JRuby nokogiri incorrectly returns 1 for the line number of the second child under root.

The text was updated successfully, but these errors were encountered:

flavorjones · 2015-01-20T16:43:28Z

Hello @camertron,

Thanks for reporting this! I've captured this in a slightly different repro script:

#! /usr/bin/env ruby

require 'nokogiri'

doc = Nokogiri::XML <<EOXML
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
EOXML
puts doc.to_xml
doc.css("foo").each { |node| puts node.line }

MRI outputs:

<?xml version="1.0"?>
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
2
5

JRuby outputs:

<?xml version="1.0"?>
<root>
  <foo>
    bar
  </foo>
  <foo>
    baz
  </foo>
</root>
2
4

As you can see, when compared to your script, JRuby appears to use a different heuristic when reporting the line number than MRI does. This is likely due to the underlying implementations being different XML libraries (libxml2 for MRI, and xerces for JRuby).

Indeed, looking at the code, in MRI we call a libxml2 method to get the internal line number from the parser:

https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L1247

And in JRuby, we look at a counter maintained in the JRuby extension (not the Xerces parser):

https://github.com/sparklemotion/nokogiri/blob/master/ext/java/nokogiri/XmlNode.java#L1430-L1452

Part of the issue here is that the w3c has not defined for what serialization a Node's "line number" might correlate:

http://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Node.html

You can output an XML doc in a couple of different formats (e.g., ignoring whitespace, canonicalized, as-parsed), so which "line number" would we mean?

Libxml2 stores the line number for each node as-parsed:

http://xmlsoft.org/html/libxml-tree.html#xmlNode

that is, "on what line of the original document was this element encountered?". Xerces doesn't have similar functionality, and so you get the kludgey calculation that I linked to above -- it's a "best guess", and in most cases it's wrong, unfortunately.

I think what we might be able to do is implement startDocument() to store off the XMLLocator, then implement startElement() to query the element's line number from the locator and store it in a hash table for later use by the #line method. But I don't have a ton of experience with Xerces, so I might be totally off base, or there might be a more-clever solution.

@jvshahid or @yokolet -- does this strategy seem sound to you?

I'm also going to rename this issue to reflect the actual work that needs to be done.

camertron · 2015-01-20T20:26:15Z

Hey @flavorjones thanks for the awesome and thorough diagnosis! Glad to know the issue is getting some attention :) Let me know if I can be of any help.

…rs right now See: sparklemotion/nokogiri#1223

camertron · 2015-07-01T23:40:57Z

Any update on this, @flavorjones?

Related to #1223.

flavorjones · 2015-09-30T16:21:27Z

Started a proof-of-concept branch at flavorjones-1223-jruby-line-numbers

flavorjones · 2021-01-18T21:45:04Z

Ok, in the interest of closing this out, I dug in today. The approach I started with is:

commit f435981 (origin/flavorjones-1223-jruby-line-numbers)
Author: Mike Dalessio <[email protected]>
Date:   2015-09-30 12:20:04 -0400

    WIP for properly tracking line numbers in JRuby.
    
    Related to #1223.

diff --git a/ext/java/nokogiri/internals/NokogiriDomParser.java b/ext/java/nokogiri/internals/NokogiriDomParser.java
index cefae5e..550ab93 100644
--- a/ext/java/nokogiri/internals/NokogiriDomParser.java
+++ b/ext/java/nokogiri/internals/NokogiriDomParser.java
@@ -39,6 +39,12 @@ import nokogiri.XmlDocument;
 import org.apache.xerces.parsers.DOMParser;
 import org.apache.xerces.parsers.XIncludeParserConfiguration;
 import org.apache.xerces.xni.parser.XMLParserConfiguration;
+import org.apache.xerces.xni.XMLLocator;
+import org.apache.xerces.xni.NamespaceContext;
+import org.apache.xerces.xni.Augmentations;
+import org.apache.xerces.xni.QName;
+import org.apache.xerces.xni.XMLAttributes;
+import org.apache.xerces.xni.Augmentations;
 import org.cyberneko.dtd.DTDConfiguration;
 import org.w3c.dom.Document;
 import org.xml.sax.Attributes;
@@ -57,6 +63,7 @@ public class NokogiriDomParser extends DOMParser {
     protected DOMParser dtd;
     protected boolean xInclude;
     protected XMLParserConfiguration config;
+    protected XMLLocator locator;
 
     public NokogiriDomParser(XMLParserConfiguration config) {
         super(config);
@@ -113,4 +120,21 @@ public class NokogiriDomParser extends DOMParser {
             return source;
         }
     }
+
+    @Override
+    public void startDocument(XMLLocator locator,
+                              String encoding,
+                              NamespaceContext namespaceContext,
+                              Augmentations augs) {
+        this.locator = locator;
+        super.startDocument(locator, encoding, namespaceContext, augs);
+    }
+
+    @Override
+    public void startElement(QName element,
+                             XMLAttributes attributes,
+                             Augmentations augs) {
+        System.out.println(element + ": " + locator.getLineNumber());
+        super.startElement(element, attributes, augs);
+    }
 }

but here I got stuck, because I don't know of any way to link the locator metadata with the final Node in the parsed Document. If someone can help me out, I'm all ears.

So I'm falling back on simply tweaking the existing algorithm to handle newlines better.

Note that Java's w3c.dom.Node interface does not have any way to get (or set) the line number for that node. The original JRuby implementation counted newlines, but did it poorly -- see #1223. This commit improves the newline-counting approach. But the solution itself -- counting newlines! -- is still questionable IMHO and absolutely inefficient. I had played around with an approach that I wrote about at #1223, where the SAX Parser knows what line it's on when `startElement` is invoked (via the XMLLocator). But I couldn't figure out how to preserve that information in the final Document or Node. If you, like me, think this approach is terrible; and if you *also* understand how to set this metadata on the Node or the Document, then please help us out.

…ne-numbers (jruby) Fix Node#line --- **What problem is this PR intended to solve?** #1223 to make `Node#line` on JRuby smarter. For background, the Java DOM interfaces don't provide any way for a Node to know or find the line number in the original document at which it appeared. I'm unhappy with this approach, but was unable to figure out how to make the original approach at #1223 work. If you're familiar with the Java DOM interfaces and want to help, please do. **Have you included adequate test coverage?** Yes. **Does this change affect the behavior of either the C or the Java implementations?** The Java implementation's behavior should be closer to the C implementation's behavior now.

flavorjones · 2021-01-19T13:03:06Z

Closed by #2177.

flavorjones changed the title ~~Line numbers are incorrect under JRuby~~ Node#line heuristic under JRuby is hacky and should be smarter Jan 20, 2015

flavorjones added platform/jruby meta/discussion labels Jan 20, 2015

camertron added a commit to rosette-proj/rosette-extractor-xml that referenced this issue Feb 19, 2015

#line_number_from should return nil since we don't support line numbe…

80b462b

…rs right now See: sparklemotion/nokogiri#1223

flavorjones added a commit that referenced this issue Sep 30, 2015

WIP for properly tracking line numbers in JRuby.

f435981

Related to #1223.

flavorjones added meta/feature-request and removed meta/discussion labels Jan 5, 2019

stevecheckoway mentioned this issue Aug 10, 2019

Add Nokogiri::XML::Node#line= #1918

Merged

flavorjones mentioned this issue Jan 18, 2021

(jruby) Fix Node#line #2177

Merged

flavorjones closed this as completed Jan 19, 2021

sync-by-unito bot mentioned this issue Apr 9, 2021

Bump nokogiri from 1.10.2 to 1.11.3 yujiloh0816/sample_app#10

Closed

dabdine mentioned this issue Dec 5, 2021

Resolve #389: Add line numbers to recog_verify output rapid7/recog#390

Merged

3 tasks

This was referenced Dec 8, 2021

Add tests to cover line numbering edge cases with JRuby #2379

Merged

[bug] Nokogiri/JRuby line numbering off by one when XML prolog is in the source document #2380

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node#line heuristic under JRuby is hacky and should be smarter #1223

Node#line heuristic under JRuby is hacky and should be smarter #1223

camertron commented Jan 16, 2015

flavorjones commented Jan 20, 2015

camertron commented Jan 20, 2015

camertron commented Jul 1, 2015

flavorjones commented Sep 30, 2015

flavorjones commented Jan 18, 2021 •

edited

Loading

flavorjones commented Jan 19, 2021

Node#line heuristic under JRuby is hacky and should be smarter #1223

Node#line heuristic under JRuby is hacky and should be smarter #1223

Comments

camertron commented Jan 16, 2015

flavorjones commented Jan 20, 2015

camertron commented Jan 20, 2015

camertron commented Jul 1, 2015

flavorjones commented Sep 30, 2015

flavorjones commented Jan 18, 2021 • edited Loading

flavorjones commented Jan 19, 2021

flavorjones commented Jan 18, 2021 •

edited

Loading