I've started to work on the tokenization of French texts in spaCy. In French, the apostrophe `'` is a breaking character and is used for elision (e.g. `le avion` is written `l'avion`, `je ai` is written `j'ai`, ...). There are a few exceptions to this rule, like `aujourd'hui`, which should not be split.
As elision is very common in French, it is not feasible to list all of these elisions as exceptions in `TOKENIZER_EXCEPTIONS`. It would be much better to treat `'` as a word-breaking character in `fr/punctuations.py` and add `r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)` to `TOKENIZER_INFIXES`.
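For illustration, here is a minimal sketch (not the spaCy source itself) of how the elision pattern can be appended to a tokenizer's infix rules at runtime. It assumes spaCy v2+ utilities (`spacy.util.compile_infix_regex`, `spacy.lang.char_classes.ALPHA`, `spacy.blank`); once zero-width infix matches are supported, `l'avion` should come out as `l'` + `avion`:

```python
# Sketch only: append the elision pattern to the default infixes and rebuild
# the tokenizer's infix matcher. Assumes spaCy v2+ APIs.
import spacy
from spacy.util import compile_infix_regex
from spacy.lang.char_classes import ALPHA

nlp = spacy.blank("fr")

# Zero-width split point between a letter + apostrophe and a following letter.
elision_infix = r"(?<=[{a}]\')(?=[{a}])".format(a=ALPHA)
infixes = list(nlp.Defaults.infixes) + [elision_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("l'avion")])  # expected: ["l'", "avion"]
```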
However, in `Tokenizer._attach_tokens`, spaCy raises an error when the regex matches a zero-width infix token.
For this to work, the tokenizer should not add an infix token when `infix_start == infix_end` (in `tokenizer.pyx`, line 292).
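The logic of the proposed guard, sketched in plain Python rather than the actual Cython code in `tokenizer.pyx` (the helper name `split_on_infixes` is hypothetical): a zero-width infix match still marks a split point, but no token is emitted for the match itself.

```python
import re

# Simplified sketch of the infix-splitting logic; NOT the actual tokenizer.pyx code.
def split_on_infixes(text, infix_finditer):
    tokens = []
    start = 0
    for match in infix_finditer(text):
        infix_start, infix_end = match.start(), match.end()
        if infix_start != start:
            tokens.append(text[start:infix_start])     # text before the split point
        if infix_start != infix_end:                   # proposed guard: skip zero-width matches
            tokens.append(text[infix_start:infix_end])  # the infix itself (e.g. a hyphen)
        start = infix_end
    if start < len(text):
        tokens.append(text[start:])                    # trailing text
    return tokens

elision = re.compile(r"(?<=[a-z]')(?=[a-z])").finditer
print(split_on_infixes("l'avion", elision))  # ["l'", "avion"]
print(split_on_infixes("j'ai", elision))     # ["j'", "ai"]
```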
Thanks for the PR. Just merged it and added regression tests – all works fine and as expected 👍 I think this will be a pretty useful feature for other languages as well.