Tokenization of punctuation in Hebrew and other non-latin languages #995

Closed
beneyal opened this issue Apr 19, 2017 · 7 comments
Labels
help wanted (easy) Contributions welcome! (also suited for spaCy beginners) 🌙 nightly Discussion and contributions related to nightly builds

Comments

@beneyal
Contributor

beneyal commented Apr 19, 2017

When tokenizing Hebrew, a full stop at the end of a sentence is not split off as a separate token, while a question mark, an exclamation mark, or an ellipsis at the end of a sentence is split off correctly.

Example:

from spacy.he import Hebrew

tokenizer = Hebrew().tokenizer

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה.')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה.']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה?')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה!')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה..')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה...')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...']

Info about spaCy

  • spaCy version: 1.8.0
  • Platform: Linux-3.13.0-74-generic-x86_64-with-debian-jessie-sid
  • Python version: 3.6.1
  • Installed models:
@ines
Member

ines commented Apr 19, 2017

Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.

But now that we're adding more and more languages, this keeps coming up so we should fix this. (If I remember correctly, this was already causing problems for people working with Bengali and developing Russian integration.)

I was going to open a separate issue about this for spaCy v2.0, but never mind, I'm just making this the master issue. The steps are:

  • get rid of the explicit character lists
  • use the regex library to compile the correct character classes (see the sketch below)
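
A minimal sketch of that direction, assuming the regex library is available (the pattern list and the way it's compiled below are illustrative, not spaCy's actual internals): with a Unicode property class like \p{L}, a trailing full stop is split off after letters in any script, which is exactly what a Latin-only character list fails to do for Hebrew.

import regex

# "." only counts as a suffix when it follows a letter in *any* script (\p{L});
# question marks, exclamation marks and ellipses are handled as before.
SUFFIX_PATTERNS = [
    r"(?<=\p{L})\.",   # "." after any Unicode letter, including Hebrew
    r"\?|!|\.\.+",     # "?", "!", ".." / "..."
]

suffix_regex = regex.compile("|".join("(?:{})".format(p) for p in SUFFIX_PATTERNS))

print(suffix_regex.search('המדינה.'))  # now matches the trailing full stop
print(suffix_regex.search('state.'))   # still matches for Latin text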

@ines ines added performance 🌙 nightly Discussion and contributions related to nightly builds labels Apr 19, 2017
@ines ines changed the title Hebrew tokenization of punctuation Tokenization of punctuation in Hebrew and other non-latin languages Apr 19, 2017
@ines ines added the help wanted (easy) Contributions welcome! (also suited for spaCy beginners) label Apr 19, 2017
@beneyal
Contributor Author

beneyal commented Apr 19, 2017

I'll take a shot at fixing it.

@ines
Member

ines commented Apr 19, 2017

Thanks a lot! I also added your examples to the tests for Hebrew btw (see commit above) and xfailed the one that ends with a full stop.

I think our overall test coverage for the tokenizer and prefixes/suffixes/infixes is pretty good by now, so this should hopefully help with testing the fix.
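
For reference, such a test could look roughly like this (a sketch, not the exact test added in the commit above; it assumes a he_tokenizer fixture along the lines of the fixtures spaCy's test suite uses for other languages):

import pytest

# Cases that already pass: sentence-final "?" and "!" are split off.
PUNCT_TESTS = [
    ('עקבת אחריו בכל רחבי המדינה?',
     ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']),
    ('עקבת אחריו בכל רחבי המדינה!',
     ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
]


@pytest.mark.parametrize('text,expected_tokens', PUNCT_TESTS)
def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
    tokens = he_tokenizer(text)
    assert expected_tokens == [token.text for token in tokens]


# The full-stop case is expected to fail until the punctuation rules
# cover non-Latin scripts.
@pytest.mark.xfail
def test_he_tokenizer_splits_final_full_stop(he_tokenizer):
    tokens = he_tokenizer('עקבת אחריו בכל רחבי המדינה.')
    assert [token.text for token in tokens] == ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '.']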

@beneyal
Contributor Author

beneyal commented Apr 19, 2017

It seems I can't run the tests. Both on Windows and on a fresh Lubuntu VM, pytest screams that there's no module named spacy.gold. I'm using Python 3.6.1 and ran pip install -r requirements.txt.

While I'm asking: when you said to remove the explicit character list, did you mean everything from _ALPHA_LOWER to _HYPHENS, or something else?

Thanks :)

@ines
Member

ines commented Apr 19, 2017

Ah, have you tried installing the current directory in development mode and then rebuilding spaCy from source?

pip install -e .

If it still complains, you might be running the wrong version of pytest by accident (i.e. the system one or something – this is always super frustrating, because it produces incredibly confusing errors).

About the characters: the main focus should be _ALPHA_LOWER and _ALPHA_UPPER. As for hyphens and other characters, it might be best to keep these a little more explicit. There aren't that many, and there might always be a case where we want to exclude certain characters on purpose.
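
To illustrate that split, here is a rough sketch (illustrative, not the actual char_classes module): the broad alphabetic classes are defined via Unicode properties from the regex library, while a small set like hyphens stays spelled out so individual characters can still be excluded deliberately.

import regex

# Broad alphabetic classes via Unicode properties instead of spelled-out ranges:
_ALPHA_LOWER = r"\p{Ll}"   # lowercase letters in any script
_ALPHA_UPPER = r"\p{Lu}"   # uppercase letters in any script
_ALPHA = r"\p{L}"          # all letters, including Hebrew, Bengali, Cyrillic

# Smaller sets like hyphens remain explicit on purpose:
_HYPHENS = ["-", "–", "—", "--", "---"]

# Quick sanity check that the property class really covers Hebrew:
print(bool(regex.fullmatch(_ALPHA + "+", "המדינה")))  # True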

@ines ines closed this as completed in df64e8d Apr 20, 2017
@dror-kris

Hey, how did you manage to import Hebrew? I'm trying spacy.he but can't find it.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018