Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REPP raises an IndexError when an input ends with a double-quote #250

Closed
arademaker opened this issue Aug 15, 2019 · 7 comments
Closed

REPP raises an IndexError when an input ends with a double-quote #250

arademaker opened this issue Aug 15, 2019 · 7 comments
Labels

Comments

@arademaker
Copy link
Member

I am trying to use pydelphin for tokenizing a corpus. It seems that sentences ending with " causes an error:

Input:

The oil and gas industry uses wireline logging to obtain a continuous record of a formation's rock properties.
Wireline logging can be defined as being "The acquisition and analysis of geophysical data performed as a function of well bore depth, together with the provision of related services."
Note that "wireline logging" and "mud logging" are not the same, yet are closely linked through the integration of the data sets.
The measurements are made referenced to "TAH" - True Along Hole depth: these and the associated analysis can then be used to infer further properties, such as hydrocarbon saturation and formation pressure, and to make further drilling and production decisions.

Code:

#!/bin/python3

from delphin.repp import REPP
import string
import os, sys

tkz = REPP.from_config("/Users/ar/hpsg/terg/pet/repp.set")
for line in open(sys.argv[1]):
    i = line.rstrip('\n')
    latt = tkz.tokenize(i)
    tks = latt.to_list()

output:

$ python lixo.py lixo.txt
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 5
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 3
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 11
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 15
  self._re = re.compile(pattern)
/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py:119: FutureWarning: Possible nested set at position 2
  self._re = re.compile(pattern)
Traceback (most recent call last):
  File "lixo.py", line 10, in <module>
    latt = tkz.tokenize(i)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 466, in tokenize
    return self.group.tokenize(s, pattern=pattern, active=active)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 93, in tokenize
    res = self.apply(s, active=active)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 72, in apply
    for step in self.trace(s, active=active):
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 86, in trace
    startmap = _mergemap(startmap, step.startmap)
  File "/Users/ar/envs/sensetion/lib/python3.7/site-packages/delphin/repp.py", line 481, in _mergemap
    merged[i] = shift + map1[i + shift]
IndexError: array index out of range
@goodmami
Copy link
Member

Thanks, I have confirmed the bug with a smaller example:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi."'
Traceback (most recent call last):
  [...]
  File "/home/mwg/repos/pydelphin/delphin/repp.py", line 473, in _mergemap
    merged[i] = shift + map1[i + shift]
IndexError: array index out of range

compared with:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi".'
(0, 0, 1, <0:3>, 1, "Kim", 0, "null") (1, 1, 2, <4:8>, 1, "said", 0, "null") (2, 2, 3, <8:9>, 1, ",", 0, "null") (3, 3, 4, <10:11>, 1, "“", 0, "null") (4, 4, 5, <11:13>, 1, "hi", 0, "null") (5, 5, 6, <13:14>, 1, "”", 0, "null") (6, 6, 7, <14:15>, 1, ".", 0, "null")

(aside: I cut out the FutureWarnings but filed a bug with the ERG about them: delph-in/erg#17)

Based on these examples I think you're correct that the quote characters are causing the error, but furthermore it is only double-quotes:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< "Kim said, 'hi.'"
(0, 0, 1, <0:3>, 1, "Kim", 0, "null") (1, 1, 2, <4:8>, 1, "said", 0, "null") (2, 2, 3, <8:9>, 1, ",", 0, "null") (3, 3, 4, <10:11>, 1, "‘", 0, "null") (4, 4, 5, <11:13>, 1, "hi", 0, "null") (5, 5, 6, <13:14>, 1, ".", 0, "null") (6, 6, 7, <14:15>, 1, "’", 0, "null")

@goodmami goodmami added the bug label Aug 16, 2019
@goodmami goodmami changed the title using pydelphin for tokenization REPP raises an IndexError when an input ends with a double-quote Aug 16, 2019
@goodmami
Copy link
Member

I found a couple possible causes:

  1. When rewriting, the string |Kim said, "hi."| gets some extra characters (initial space and an extra quote), yielding | Kim said, "hi."”| (the | characters are just to show the bounds), which change the length of the string. If the length of the modified string is used with an array with the length of the original string, there could be an index error when computing offsets.

  2. A unicode quote character is used in the rewrites. While it is 1 character, it is 3 bytes in UTF-8. I'm pretty sure I'm computing the length of strings by characters and not bytes, but this could also explain the IndexError.

Of course there could be other reasons beyond these two.

@arademaker
Copy link
Member Author

You said

A unicode quote character ” is used in the rewrites.

why? Is it something in pydelphin or in REPP rules?

@goodmami
Copy link
Member

It's in the REPP rules. I'm not sure why they do that, exactly. Maybe they are trying to replace regular, directionally-ambiguous quotes with begin and end quotes.

I think it was from this block in rpp/quotes.rpp:

;;
;; the ‘raw’ WSJ texts every now and again contain sentence-final quotes that
;; are preceded by whitespace (many of these look like quotes that actually
;; should be part of the following sentence, i.e. were moved across a sentence
;; boundary).  for robustness, force directionality of quotes in sentence-
;; initial and final positions.
;;
!^([[({“‘' ]*)("|``)                                            \1“
!^([[({“‘ ]*)'(?![0-9]{2}s?)                                    \1‘
!("|'')([])}”’' ]*)$                                            ”\2
!'([])}”’ ]*)$   

But regardless of the output of the rewrite, PyDelphin should be able to maintain the surface alignment properly.

@arademaker
Copy link
Member Author

Hum, definitely the rules are replacing ambiguous quote:

$ ./ace -g erg-2018-osx-0.9.30.dat -E
Wireline logging can be defined as being "The acquisition and analysis of geophysical data performed as a function of well bore depth, together with the provision of related services."
Wireline logging can be defined as being “ The acquisition and analysis of geophysical data performed as a function of well bore depth , together with the provision of related services . ”

but the comment above is confusing, it is about 'quotes that are preceded buy whitespace...', not the case here.

@goodmami
Copy link
Member

Yes. The whitespace is inserted by another rule, I think.

Also it looks like it wasn't really disambiguating the quotes either, both the open-quote and close-quote are the same in the end. I'm not really sure what is the motivation for those rules.

@goodmami
Copy link
Member

Turns out the problem wasn't double-quotes or unicode issues, it was a replacement pattern that didn't use some capturing groups. Specifically this pattern:

!("|'')([])}”’' ]*)$                                            ”\2

With REPP, accurate surface alignments ("characterization") depends on group references (\2 above) being in-order and without gaps. So it would work if \1 appeared instead, or also in front of \2. But since \2 appears without \1 before it, REPP "gives up" on getting accurate alignments and says that all replaced characters start at the first character of the match and end on the last character. My code had a bug when it was trying to calculate this incrementally for each "segment" (group reference or literal replacement substring). Now, once I get to the point where REPP "gives up", I join all remaining segments into one big replacement, which makes it much easier to calculate the proper alignments.

As a bonus, I sped up processing 2-3x by avoiding some unnecessary computation and also fixed that extra-character bug (which was probably related to the main fix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants