I've started to work on the tokenization of French texts in spaCy. In French, the apostrophe `'` is a breaking character and is used for elision (e.g. `le avion` is written `l'avion`, `je ai` is written `j'ai`, ...). There are a few exceptions to this rule, like `aujourd'hui`, which should not be split.
As elision is very common in French, it is not feasible to list all of these elisions as exceptions in `TOKENIZER_EXCEPTIONS`. It would be much better to treat `'` as a word-breaking character in `fr/punctuations.py` and add `r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)` to `TOKENIZER_INFIXES`.
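For illustration, here is a minimal sketch (not the spaCy source itself) of how the elision pattern can be appended to a tokenizer's infix rules at runtime. It assumes spaCy v2+ utilities (`spacy.util.compile_infix_regex`, `spacy.lang.char_classes.ALPHA`, `spacy.blank`); once zero-width infix matches are supported, `l'avion` should come out as `l'` + `avion`:

```python
# Sketch only: append the elision pattern to the default infixes and rebuild
# the tokenizer's infix matcher. Assumes spaCy v2+ APIs.
import spacy
from spacy.util import compile_infix_regex
from spacy.lang.char_classes import ALPHA

nlp = spacy.blank("fr")

# Zero-width split point between a letter + apostrophe and a following letter.
elision_infix = r"(?<=[{a}]\')(?=[{a}])".format(a=ALPHA)
infixes = list(nlp.Defaults.infixes) + [elision_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("l'avion")])  # expected: ["l'", "avion"]
```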
However, in `Tokenizer._attach_tokens`, spaCy raises an error when the regex matches a zero-width infix token.
For this to work, the tokenizer should not add an infix token when `infix_start == infix_end` (in `tokenizer.pyx`, line 292).
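The logic of the proposed guard, sketched in plain Python rather than the actual Cython code in `tokenizer.pyx` (the helper name `split_on_infixes` is hypothetical): a zero-width infix match still marks a split point, but no token is emitted for the match itself.

```python
import re

# Simplified sketch of the infix-splitting logic; NOT the actual tokenizer.pyx code.
def split_on_infixes(text, infix_finditer):
    tokens = []
    start = 0
    for match in infix_finditer(text):
        infix_start, infix_end = match.start(), match.end()
        if infix_start != start:
            tokens.append(text[start:infix_start])     # text before the split point
        if infix_start != infix_end:                   # proposed guard: skip zero-width matches
            tokens.append(text[infix_start:infix_end])  # the infix itself (e.g. a hyphen)
        start = infix_end
    if start < len(text):
        tokens.append(text[start:])                    # trailing text
    return tokens

elision = re.compile(r"(?<=[a-z]')(?=[a-z])").finditer
print(split_on_infixes("l'avion", elision))  # ["l'", "avion"]
print(split_on_infixes("j'ai", elision))     # ["j'", "ai"]
```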
Thanks for the PR. Just merged it and added regression tests – all works fine and as expected 👍 I think this will be a pretty useful feature for other languages as well.