REPP raises an IndexError when an input ends with a double-quote #250
Comments
Thanks, I have confirmed the bug with a smaller example:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi."'
Traceback (most recent call last):
[...]
  File "/home/mwg/repos/pydelphin/delphin/repp.py", line 473, in _mergemap
    merged[i] = shift + map1[i + shift]
IndexError: array index out of range

compared with:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< 'Kim said, "hi".'
(0, 0, 1, <0:3>, 1, "Kim", 0, "null")
(1, 1, 2, <4:8>, 1, "said", 0, "null")
(2, 2, 3, <8:9>, 1, ",", 0, "null")
(3, 3, 4, <10:11>, 1, "“", 0, "null")
(4, 4, 5, <11:13>, 1, "hi", 0, "null")
(5, 5, 6, <13:14>, 1, "”", 0, "null")
(6, 6, 7, <14:15>, 1, ".", 0, "null")

(aside: I cut out the FutureWarnings but filed a bug with the ERG about them: delph-in/erg#17)

Based on these examples I think you're correct that the quote characters are causing the error, but furthermore it is only double-quotes:

$ delphin repp -c ~/grammars/erg-trunk/pet/repp.set <<< "Kim said, 'hi.'"
(0, 0, 1, <0:3>, 1, "Kim", 0, "null")
(1, 1, 2, <4:8>, 1, "said", 0, "null")
(2, 2, 3, <8:9>, 1, ",", 0, "null")
(3, 3, 4, <10:11>, 1, "‘", 0, "null")
(4, 4, 5, <11:13>, 1, "hi", 0, "null")
(5, 5, 6, <13:14>, 1, ".", 0, "null")
(6, 6, 7, <14:15>, 1, "’", 0, "null") |
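For anyone who wants to sanity-check the character spans in token output like the above, here is a minimal sketch in plain Python. It assumes the `<from:to>` field is a character span into the input and that the first quoted string is the token form; `surface_spans` and `YY_TOKEN` are names made up for this illustration and are not part of PyDelphin.

```python
import re

# Matches the leading fields of one token tuple and captures the
# <from:to> span and the quoted token form (assumed field layout).
YY_TOKEN = re.compile(r'\(\d+, \d+, \d+, <(\d+):(\d+)>, \d+, "([^"]*)"')

def surface_spans(yy_line, text):
    """Pair each token form with the input substring its span points at."""
    return [
        (form, text[int(start):int(end)])
        for start, end, form in YY_TOKEN.findall(yy_line)
    ]

text = 'Kim said, "hi".'
yy = (
    '(0, 0, 1, <0:3>, 1, "Kim", 0, "null") '
    '(1, 1, 2, <4:8>, 1, "said", 0, "null") '
    '(3, 3, 4, <10:11>, 1, "“", 0, "null")'
)
for form, surface in surface_spans(yy, text):
    print(repr(form), "<-", repr(surface))
```

When the alignment is intact, every span should land on the input characters the token was derived from, even when the form itself has been rewritten (here the straight quote in the input becomes a curly quote in the token).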
I found a couple possible causes:
Of course there could be other reasons beyond these two. |
You said
why? Is it something in pydelphin or in REPP rules? |
It's in the REPP rules. I'm not sure why they do that, exactly. Maybe they are trying to replace regular, directionally-ambiguous quotes with begin and end quotes. I think it was from this block in
But regardless of the output of the rewrite, PyDelphin should be able to maintain the surface alignment properly. |
Hm, the rules are definitely replacing ambiguous quotes:
but the comment above is confusing; it is about 'quotes that are preceded by whitespace...', which is not the case here. |
Yes. The whitespace is inserted by another rule, I think. Also, it looks like it wasn't really disambiguating the quotes either: the open-quote and close-quote are the same in the end. I'm not really sure what the motivation for those rules is. |
Turns out the problem wasn't double-quotes or Unicode issues; it was a replacement pattern that didn't use some capturing groups. Specifically this pattern:
With REPP, accurate surface alignments ("characterization") depend on group references ([...])
As a bonus, I sped up processing 2-3x by avoiding some unnecessary computation and also fixed that extra-character bug (which was probably related to the main fix). |
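To illustrate why unused capturing groups matter for alignment, here is a minimal sketch using only Python's `re` module. It is not PyDelphin's actual code: `sub_with_alignment` is a hypothetical helper, it only handles single-digit group references, and the example pattern is made up for illustration rather than taken from the ERG's REPP rules. The idea is that output characters copied through a group reference can be traced back to input positions, while literal replacement text cannot.

```python
import re

def sub_with_alignment(pattern, template, text):
    """Apply one regex rewrite and record, for each output character,
    the index of the input character it was copied from, or None when
    the character is literal replacement text."""
    # Split the template into literals and group references:
    # r" \1 \2" -> [" ", "1", " ", "2", ""]; odd positions are group numbers.
    parts = re.split(r"\\(\d)", template)
    out, origin = [], []
    pos = 0
    for m in re.finditer(pattern, text):
        # Text between matches is copied through one-to-one.
        out.extend(text[pos:m.start()])
        origin.extend(range(pos, m.start()))
        for k, part in enumerate(parts):
            if k % 2:  # group reference: characters keep their source index
                g = int(part)
                if m.group(g) is not None:
                    out.extend(m.group(g))
                    origin.extend(range(m.start(g), m.end(g)))
            else:      # literal text: no source index is known
                out.extend(part)
                origin.extend([None] * len(part))
        pos = m.end()
    out.extend(text[pos:])
    origin.extend(range(pos, len(text)))
    return "".join(out), origin

# Both groups referenced: the quote can be traced back to input index 3.
print(sub_with_alignment(r'(\.)(")', r' \1 \2', 'hi."'))
# ('hi . "', [0, 1, None, 2, None, 3])

# Second group captured but replaced by a literal: its position is lost.
print(sub_with_alignment(r'(\.)(")', r' \1 ”', 'hi."'))
# ('hi . ”', [0, 1, None, 2, None, None])
```

In the second call the closing quote has no recorded source position, so any later step that relies on per-character offsets has to guess or fall back to the extent of the whole match.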
I am trying to use pydelphin for tokenizing a corpus. It seems that sentences ending with " cause an error:
Input:
Code:
Output: