Tokenization of punctuation in Hebrew and other non-latin languages #995
When tokenizing Hebrew, a full stop at the end of a sentence is not split off as a separate token, while a sentence ending in a question mark, an exclamation mark, or an ellipsis has that mark tokenized correctly.

Example:
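Roughly, the reported behavior looks like this (a minimal sketch, assuming a spaCy version with Hebrew support, where `spacy.blank("he")` gives a tokenizer-only pipeline; the sentences are illustrative, not the reporter's originals):

```python
import spacy

nlp = spacy.blank("he")  # tokenizer-only Hebrew pipeline

# The question mark is split off correctly ...
print([t.text for t in nlp("מה קורה?")])   # ['מה', 'קורה', '?']

# ... but according to the report, the full stop stays attached
# to the final word instead of becoming its own token.
print([t.text for t in nlp("מה קורה.")])   # reported: ['מה', 'קורה.']
```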
Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute. But now that we're adding more and more languages, this keeps coming up so we should fix this. (If I remember correctly, this was already causing problems for people working with Bengali and developing Russian integration.)
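The kind of rule in question looks roughly like this (a simplified sketch with made-up pattern names, not spaCy's actual source):

```python
import re

# A suffix rule that only spells out latin letters: the lookbehind never
# matches after a Hebrew word, so the trailing "." is not split off.
LATIN_ONLY_SUFFIX = re.compile(r"(?<=[a-zA-Z])\.")

# Widening the class to cover other scripts (here, the Hebrew block)
# makes the same rule work for non-latin text as well.
WIDER_SUFFIX = re.compile(r"(?<=[a-zA-Z\u0590-\u05FF])\.")

print(LATIN_ONLY_SUFFIX.search("שלום."))   # None - full stop not matched
print(WIDER_SUFFIX.search("שלום."))        # matches the trailing "."
```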
I'll take a shot at fixing it.
Thanks a lot! I also added your examples to the tests for Hebrew btw (see commit above) and xfailed the one that ends with a full stop. I think our overall test coverage for the tokenizer and prefixes/suffixes/infixes is pretty good by now, so this should hopefully help with testing the fix.
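Such an xfailed test might look roughly like this (a sketch in pytest style; the `he_tokenizer` fixture name and the sentences are assumptions, not the actual committed tests):

```python
import pytest

@pytest.mark.parametrize("text,expected", [
    ("מה קורה?", ["מה", "קורה", "?"]),   # question mark is split off
])
def test_he_tokenizer_splits_terminal_punct(he_tokenizer, text, expected):
    tokens = he_tokenizer(text)
    assert [t.text for t in tokens] == expected

# The full-stop case is marked as an expected failure until the
# punctuation rules cover non-latin characters.
@pytest.mark.xfail
def test_he_tokenizer_splits_full_stop(he_tokenizer):
    tokens = he_tokenizer("מה קורה.")
    assert [t.text for t in tokens] == ["מה", "קורה", "."]
```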
It seems I can't run the tests. Both on Windows and a fresh Lubuntu VM, pytest screams that there's no module named …

While I'm asking: when you said to remove the explicit character list, did you mean everything from …?

Thanks :)
Ah, have you tried installing the current directory in development mode and then rebuilding spaCy from source?

pip install -e .

If it still complains, you might be running the wrong version of pytest by accident (i.e. the system one or something – this is always super frustrating, because it produces incredibly confusing errors).

About the characters: the main focus should be …
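One way to avoid spelling out characters per script at all is a Unicode letter class, for instance via the third-party `regex` package (a sketch of the general idea, not the actual spaCy change):

```python
import regex  # third-party; supports Unicode properties, unlike stdlib re

# \p{L} matches any Unicode letter, so one rule covers latin, Hebrew,
# Bengali, Cyrillic, ... without listing characters explicitly.
SUFFIX_FULL_STOP = regex.compile(r"(?<=\p{L})\.")

print(SUFFIX_FULL_STOP.search("Hello."))   # matches
print(SUFFIX_FULL_STOP.search("שלום."))    # matches too
```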
Hey, how did you manage to import the Hebrew data? I'm trying spacy.he but not finding it.
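For what it's worth, in spaCy v2 and later the language data moved under `spacy.lang`, so Hebrew can be loaded like this (assuming a version that ships Hebrew support):

```python
# Assumes a spaCy version that ships Hebrew language data (v2+).
from spacy.lang.he import Hebrew

nlp = Hebrew()  # tokenizer-only pipeline; spacy.blank("he") is equivalent
print([token.text for token in nlp("שלום עולם")])
```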
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |