Tokenization process: Allow zero-width 'infix' token (French tokenization) #768

Closed
raphael0202 opened this issue Jan 23, 2017 · 4 comments
Labels
enhancement Feature requests and improvements lang / fr French language data and models

Comments

@raphael0202
Contributor

raphael0202 commented Jan 23, 2017

Feature request

I've started working on the tokenization of French texts in spaCy. In French, the apostrophe ' is a breaking character and is used for elision (e.g., le avion is written l'avion, je ai is written j'ai). There are a few exceptions to this rule, such as aujourd'hui, which should not be split.

As elision is very common in French, it is not feasible to list every elided form as an exception in TOKENIZER_EXCEPTIONS. It would be much better to treat ' as a word-breaking character in fr/punctuations.py and add r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA) to TOKENIZER_INFIXES.
However, in Tokenizer._attach_tokens, spaCy raises an error when the regex matches a zero-width infix token.

For this to work, the tokenizer should not add an infix token when infix_start == infix_end (in tokenizer.pyx, line 292); the zero-width match should only mark where the string is split.
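
To illustrate the proposal, here is a minimal sketch in plain Python using only the `re` module. It is not spaCy's actual `_attach_tokens` code: the `ALPHA` class is a simplified stand-in for spaCy's, and `split_infixes` is a hypothetical helper. It shows the idea of splitting at every infix match while emitting an infix token only when the match has non-zero width.

```python
import re

# Assumption: a simplified stand-in for spaCy's ALPHA character class.
ALPHA = "a-zA-ZàâäçéèêëîïôöùûüÿœÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŸŒ"

# The regex from the issue: a zero-width position right after
# "<letter>'" and right before another letter.
INFIX_RE = re.compile(r"(?<=[{a}]')(?=[{a}])".format(a=ALPHA))


def split_infixes(chunk, infix_re=INFIX_RE):
    """Split `chunk` at each infix match (hypothetical helper).

    A zero-width match only marks a split point; no empty infix
    token is added for it.
    """
    tokens = []
    start = 0
    for match in infix_re.finditer(chunk):
        infix_start, infix_end = match.span()
        # Text between the previous split point and this infix.
        if infix_start > start:
            tokens.append(chunk[start:infix_start])
        # Only add the infix itself if it actually consumed characters.
        if infix_start != infix_end:
            tokens.append(chunk[infix_start:infix_end])
        start = infix_end
    # Whatever remains after the last infix.
    if start < len(chunk):
        tokens.append(chunk[start:])
    return tokens


print(split_infixes("l'avion"))
print(split_infixes("j'ai"))
```

Note that this regex on its own would also split aujourd'hui after the apostrophe, so words like it would still need to be handled as exceptions.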

@honnibal honnibal added the enhancement Feature requests and improvements label Jan 23, 2017
@honnibal
Member

That sounds like a reasonable approach.

One quick suggestion though: Would it be possible to enumerate the prefixes, and put them in the prefix regex?
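
A rough sketch of what enumerating the prefixes could look like, again in plain Python rather than spaCy's own data files. The `ELISION_PREFIXES` list is an illustrative, non-exhaustive assumption, and `split_elision` is a hypothetical helper, not part of spaCy.

```python
import re

# Assumption: an illustrative, non-exhaustive list of elided forms.
ELISION_PREFIXES = ["l", "d", "j", "m", "t", "s", "n", "c",
                    "qu", "jusqu", "lorsqu", "puisqu"]

# Longest alternatives first, anchored at the start of the word,
# and followed immediately by the apostrophe.
PREFIX_RE = re.compile(
    r"^(?:{})'".format(
        "|".join(sorted(ELISION_PREFIXES, key=len, reverse=True))
    ),
    re.IGNORECASE,
)


def split_elision(word):
    """Split off a leading elided form, if any (hypothetical helper)."""
    match = PREFIX_RE.match(word)
    # Require something after the apostrophe so "l'" alone is not split.
    if match and match.end() < len(word):
        return [word[:match.end()], word[match.end():]]
    return [word]


print(split_elision("l'avion"))
print(split_elision("qu'il"))
print(split_elision("aujourd'hui"))
```

Because the pattern is anchored at the start of the word, aujourd'hui is left intact in this sketch without a separate exception entry; the trade-off is that the prefix list has to be maintained by hand.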

@raphael0202
Contributor Author

Yes of course. I can work on this and submit a PR if you wish.

ines added a commit that referenced this issue Jan 23, 2017
@ines
Member

ines commented Jan 23, 2017

Thanks for the PR. Just merged it and added regression tests – all works fine and as expected 👍 I think this will be a pretty useful feature for other languages as well.

@ines ines closed this as completed Jan 23, 2017
@ines ines added the lang / fr French language data and models label Apr 23, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018

3 participants