Tokenization of punctuation in Hebrew and other non-latin languages #995

Closed
beneyal opened this issue Apr 19, 2017 · 7 comments
Labels
help wanted (easy) Contributions welcome! (also suited for spaCy beginners) 🌙 nightly Discussion and contributions related to nightly builds

Comments

@beneyal
Contributor

beneyal commented Apr 19, 2017

When tokenizing Hebrew, a full stop at the end of a sentence is not split off as a separate token, while a question mark, an exclamation mark, or an ellipsis at the end of a sentence is split off correctly.

Example:

from spacy.he import Hebrew

tokenizer = Hebrew().tokenizer

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה.')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה.']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה?')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה!')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה..')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה...')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...']

Info about spaCy

  • spaCy version: 1.8.0
  • Platform: Linux-3.13.0-74-generic-x86_64-with-debian-jessie-sid
  • Python version: 3.6.1
  • Installed models:
@ines
Member

ines commented Apr 19, 2017

Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.

But now that we're adding more and more languages, this keeps coming up so we should fix this. (If I remember correctly, this was already causing problems for people working with Bengali and developing Russian integration.)

I was going to open a separate issue about this for spaCy v2.0, but never mind, I'm just making this the master issue. The steps are:

  • get rid of the explicit character lists
  • use the regex library to compile the correct character classes (see the sketch below)
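
A minimal sketch of that direction, assuming the regex library is available (the pattern list and the way it's compiled below are illustrative, not spaCy's actual internals): with a Unicode property class like \p{L}, a trailing full stop is split off after letters in any script, which is exactly what a Latin-only character list fails to do for Hebrew.

import regex

# "." only counts as a suffix when it follows a letter in *any* script (\p{L});
# question marks, exclamation marks and ellipses are handled as before.
SUFFIX_PATTERNS = [
    r"(?<=\p{L})\.",   # "." after any Unicode letter, including Hebrew
    r"\?|!|\.\.+",     # "?", "!", ".." / "..."
]

suffix_regex = regex.compile("|".join("(?:{})".format(p) for p in SUFFIX_PATTERNS))

print(suffix_regex.search('המדינה.'))  # now matches the trailing full stop
print(suffix_regex.search('state.'))   # still matches for Latin text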

@ines ines added performance 🌙 nightly Discussion and contributions related to nightly builds labels Apr 19, 2017
@ines ines changed the title Hebrew tokenization of punctuation Tokenization of punctuation in Hebrew and other non-latin languages Apr 19, 2017
@ines ines added the help wanted (easy) Contributions welcome! (also suited for spaCy beginners) label Apr 19, 2017
@beneyal
Contributor Author

beneyal commented Apr 19, 2017

I'll take a shot at fixing it.

@ines
Member

ines commented Apr 19, 2017

Thanks a lot! I also added your examples to the tests for Hebrew btw (see commit above) and xfailed the one that ends with a full stop.

I think our overall test coverage for the tokenizer and prefixes/suffixes/infixes is pretty good by now, so this should hopefully help with testing the fix.
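
For reference, such a test could look roughly like this (a sketch, not the exact test added in the commit above; it assumes a he_tokenizer fixture along the lines of the fixtures spaCy's test suite uses for other languages):

import pytest

# Cases that already pass: sentence-final "?" and "!" are split off.
PUNCT_TESTS = [
    ('עקבת אחריו בכל רחבי המדינה?',
     ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']),
    ('עקבת אחריו בכל רחבי המדינה!',
     ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
]


@pytest.mark.parametrize('text,expected_tokens', PUNCT_TESTS)
def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
    tokens = he_tokenizer(text)
    assert expected_tokens == [token.text for token in tokens]


# The full-stop case is expected to fail until the punctuation rules
# cover non-Latin scripts.
@pytest.mark.xfail
def test_he_tokenizer_splits_final_full_stop(he_tokenizer):
    tokens = he_tokenizer('עקבת אחריו בכל רחבי המדינה.')
    assert [token.text for token in tokens] == ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '.']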

@beneyal
Contributor Author

beneyal commented Apr 19, 2017

It seems I can't run the tests. Both on Windows and on a fresh Lubuntu VM, pytest screams that there's no module named spacy.gold. I'm using Python 3.6.1 and ran pip install -r requirements.txt.

While I'm asking: when you said to remove the explicit character list, did you mean everything from _ALPHA_LOWER to _HYPHENS, or something else?

Thanks :)

@ines
Member

ines commented Apr 19, 2017

Ah, have you tried installing the current directory in development mode and then rebuilding spaCy from source?

pip install -e .

If it still complains, you might be running the wrong version of pytest by accident (i.e. the system one or something – this is always super frustrating, because it produces incredibly confusing errors).

About the characters: the main focus should be _ALPHA_LOWER and _ALPHA_UPPER. As for hyphens and other characters, it might be best to keep these a little more explicit. There aren't that many, and there might always be a case where we want to exclude certain characters on purpose.
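
To illustrate that split, here is a rough sketch (illustrative, not the actual char_classes module): the broad alphabetic classes are defined via Unicode properties from the regex library, while a small set like hyphens stays spelled out so individual characters can still be excluded deliberately.

import regex

# Broad alphabetic classes via Unicode properties instead of spelled-out ranges:
_ALPHA_LOWER = r"\p{Ll}"   # lowercase letters in any script
_ALPHA_UPPER = r"\p{Lu}"   # uppercase letters in any script
_ALPHA = r"\p{L}"          # all letters, including Hebrew, Bengali, Cyrillic

# Smaller sets like hyphens remain explicit on purpose:
_HYPHENS = ["-", "–", "—", "--", "---"]

# Quick sanity check that the property class really covers Hebrew:
print(bool(regex.fullmatch(_ALPHA + "+", "המדינה")))  # True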

@ines ines closed this as completed in df64e8d Apr 20, 2017
@dror-kris

Hey, how did you manage to import Hebrew? I'm trying spacy.he but can't find it.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018