REPP raises an IndexError when an input ends with a double-quote #250

arademaker · 2019-08-15T20:13:30Z

I am trying to use pydelphin for tokenizing a corpus. It seems that sentences ending with " causes an error:

Input:

The oil and gas industry uses wireline logging to obtain a continuous record of a formation's rock properties.
Wireline logging can be defined as being "The acquisition and analysis of geophysical data performed as a function of well bore depth, together with the provision of related services."
Note that "wireline logging" and "mud logging" are not the same, yet are closely linked through the integration of the data sets.
The measurements are made referenced to "TAH" - True Along Hole depth: these and the associated analysis can then be used to infer further properties, such as hydrocarbon saturation and formation pressure, and to make further drilling and production decisions.

Code:

#!/bin/python3

from delphin.repp import REPP
import string
import os, sys

tkz = REPP.from_config("/Users/ar/hpsg/terg/pet/repp.set")
for line in open(sys.argv[1]):
    i = line.rstrip('\n')
    latt = tkz.tokenize(i)
    tks = latt.to_list()

output:

$ python lixo.py lixo.txt
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 5
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 3
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 11
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 15
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 2
  self._re = re.compile(pattern)
Traceback (most recent call last):
  File "lixo.py", line 10, in <module>
    latt = tkz.tokenize(i)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 466, in tokenize
    return self.group.tokenize(s, pattern=pattern, active=active)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 93, in tokenize
    res = self.apply(s, active=active)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 72, in apply
    for step in self.trace(s, active=active):
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 86, in trace
    startmap = _mergemap(startmap, step.startmap)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 481, in _mergemap
    merged[i] = shift + map1[i + shift]
IndexError: array index out of range

The text was updated successfully, but these errors were encountered:

goodmami · 2019-08-16T02:19:20Z

Thanks, I have confirmed the bug with a smaller example:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi."'
Traceback (most recent call last):
  [...]
  File "/home/mwg/repos/pydelphin/delphin/repp.py", line 473, in _mergemap
    merged[i] = shift + map1[i + shift]
IndexError: array index out of range

compared with:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi".'
(0, 0, 1, <0:3>, 1, "Kim", 0, "null") (1, 1, 2, <4:8>, 1, "said", 0, "null") (2, 2, 3, <8:9>, 1, ",", 0, "null") (3, 3, 4, <10:11>, 1, "“", 0, "null") (4, 4, 5, <11:13>, 1, "hi", 0, "null") (5, 5, 6, <13:14>, 1, "”", 0, "null") (6, 6, 7, <14:15>, 1, ".", 0, "null")

(aside: I cut out the FutureWarnings but filed a bug with the ERG about them: delph-in/erg#17)

Based on these examples I think you're correct that the quote characters are causing the error, but furthermore it is only double-quotes:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< "Kim said, 'hi.'"
(0, 0, 1, <0:3>, 1, "Kim", 0, "null") (1, 1, 2, <4:8>, 1, "said", 0, "null") (2, 2, 3, <8:9>, 1, ",", 0, "null") (3, 3, 4, <10:11>, 1, "‘", 0, "null") (4, 4, 5, <11:13>, 1, "hi", 0, "null") (5, 5, 6, <13:14>, 1, ".", 0, "null") (6, 6, 7, <14:15>, 1, "’", 0, "null")

goodmami · 2019-08-16T03:07:34Z

I found a couple possible causes:

When rewriting, the string |Kim said, "hi."| gets some extra characters (initial space and an extra quote), yielding | Kim said, "hi."”| (the | characters are just to show the bounds), which change the length of the string. If the length of the modified string is used with an array with the length of the original string, there could be an index error when computing offsets.
A unicode quote character ” is used in the rewrites. While it is 1 character, it is 3 bytes in UTF-8. I'm pretty sure I'm computing the length of strings by characters and not bytes, but this could also explain the IndexError.

Of course there could be other reasons beyond these two.

arademaker · 2019-08-16T15:49:21Z

You said

A unicode quote character ” is used in the rewrites.

why? Is it something in pydelphin or in REPP rules?

goodmami · 2019-08-17T00:44:39Z

It's in the REPP rules. I'm not sure why they do that, exactly. Maybe they are trying to replace regular, directionally-ambiguous quotes with begin and end quotes.

I think it was from this block in rpp/quotes.rpp:

;;
;; the ‘raw’ WSJ texts every now and again contain sentence-final quotes that
;; are preceded by whitespace (many of these look like quotes that actually
;; should be part of the following sentence, i.e. were moved across a sentence
;; boundary).  for robustness, force directionality of quotes in sentence-
;; initial and final positions.
;;
!^([[({“‘' ]*)("|``)                                            \1“
!^([[({“‘ ]*)'(?![0-9]{2}s?)                                    \1‘
!("|'')([])}”’' ]*)$                                            ”\2
!'([])}”’ ]*)$

But regardless of the output of the rewrite, PyDelphin should be able to maintain the surface alignment properly.

arademaker · 2019-08-17T13:47:25Z

Hum, definitely the rules are replacing ambiguous quote:

$ ./ace -g erg-2018-osx-0.9.30.dat -E
Wireline logging can be defined as being "The acquisition and analysis of geophysical data performed as a function of well bore depth, together with the provision of related services."
Wireline logging can be defined as being “ The acquisition and analysis of geophysical data performed as a function of well bore depth , together with the provision of related services . ”

but the comment above is confusing, it is about 'quotes that are preceded buy whitespace...', not the case here.

goodmami · 2019-08-18T00:26:01Z

Yes. The whitespace is inserted by another rule, I think.

Also it looks like it wasn't really disambiguating the quotes either, both the open-quote and close-quote are the same in the end. I'm not really sure what is the motivation for those rules.

goodmami · 2019-08-21T07:46:06Z

Turns out the problem wasn't double-quotes or unicode issues, it was a replacement pattern that didn't use some capturing groups. Specifically this pattern:

!("|'')([])}”’' ]*)$                                            ”\2

With REPP, accurate surface alignments ("characterization") depends on group references (\2 above) being in-order and without gaps. So it would work if \1 appeared instead, or also in front of \2. But since \2 appears without \1 before it, REPP "gives up" on getting accurate alignments and says that all replaced characters start at the first character of the match and end on the last character. My code had a bug when it was trying to calculate this incrementally for each "segment" (group reference or literal replacement substring). Now, once I get to the point where REPP "gives up", I join all remaining segments into one big replacement, which makes it much easier to calculate the proper alignments.

As a bonus, I sped up processing 2-3x by avoiding some unnecessary computation and also fixed that extra-character bug (which was probably related to the main fix).

arademaker mentioned this issue Aug 15, 2019

touch.py tokenization bug? own-pt/sensetion.el#135

Closed

goodmami mentioned this issue Aug 16, 2019

Update REPP patterns for nested sets delph-in/erg#17

Closed

goodmami added the bug label Aug 16, 2019

goodmami changed the title ~~using pydelphin for tokenization~~ REPP raises an IndexError when an input ends with a double-quote Aug 16, 2019

goodmami closed this as completed in 841ebb6 Aug 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REPP raises an IndexError when an input ends with a double-quote #250

REPP raises an IndexError when an input ends with a double-quote #250

arademaker commented Aug 15, 2019

goodmami commented Aug 16, 2019

goodmami commented Aug 16, 2019

arademaker commented Aug 16, 2019

goodmami commented Aug 17, 2019

arademaker commented Aug 17, 2019

goodmami commented Aug 18, 2019

goodmami commented Aug 21, 2019

REPP raises an IndexError when an input ends with a double-quote #250

REPP raises an IndexError when an input ends with a double-quote #250

Comments

arademaker commented Aug 15, 2019

goodmami commented Aug 16, 2019

goodmami commented Aug 16, 2019

arademaker commented Aug 16, 2019

goodmami commented Aug 17, 2019

arademaker commented Aug 17, 2019

goodmami commented Aug 18, 2019

goodmami commented Aug 21, 2019