preprocessing.strip_punctuation does not handle Unicode #2962
Comments
You are right. All these functions in We're still deciding whether to axe |
As a batteries-included set of first-pass utilities, I find them very useful when starting any project. It's really just this issue of punctuation that comes up with any regularity for me. |
OK good. A PR to improve the preprocessing functionality (~better punctuation) is welcome. As long as you're aware the future of |
In terms of improving. Guess question is in what way. Of the three conceptual directions I outlined under Possible solutions above:
TL;DR:
|
Option 1) sounds good to me, as a first step. As always, a pull request with the fix (fixed documentation) is welcome :) |
@piskvorky Perhaps we should make this obvious in the module docstring? |
Maybe. We talk about it in the core tutorials. |
Add ASCII as qualification on `strip_punctuation` doc string. This is "option 1" fix for issue piskvorky#2962
Code comment added linking to issue piskvorky#2962 as a reminder of enhancement possibilities.
* Clarifying strip_punctuation limited to ASCII
  Add ASCII as qualification on `strip_punctuation` doc string. This is "option 1" fix for issue #2962
* Added code comment pointing to issue 2962
  Code comment added linking to issue #2962 as a reminder of enhancement possibilities.
* update CHANGELOG.md
Co-authored-by: Michael Penkov <[email protected]>
Problem description
`RE_PUNCT` in `parsing/preprocessing.py`, which is the substance of `preprocessing.strip_punctuation`, does not consider Unicode punctuation. `RE_PUNCT` (= `re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)`) depends on the standard library `string` module's `punctuation` string, which is limited to ASCII punctuation.
Steps/code/corpus to reproduce
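A minimal sketch of the behaviour described above. The pattern is rebuilt from the definition quoted in the problem description rather than imported from gensim. The helper name `strip_punctuation_ascii`, the sample string, and the choice to substitute a single space are illustrative assumptions.

```python
import re
import string

# Rebuilt from the definition quoted above; this is not imported from gensim.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)


def strip_punctuation_ascii(s):
    """Replace each run of ASCII punctuation with one space (hypothetical helper)."""
    return RE_PUNCT.sub(" ", s)


# Sample text wrapped in typographic quotes (U+201C and U+201D).
sample = "\u201cHello,\u201d she said."
result = strip_punctuation_ascii(sample)

# The ASCII comma and full stop are replaced, but the typographic
# quotes pass through untouched because they are not in string.punctuation.
print(repr(result))
```

The `re.UNICODE` flag does not help here: it changes how classes such as `\w` match, but the character class is still built only from the ASCII characters in `string.punctuation`.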
For the above input I think the correct output would be:
Possible solutions
In the above example my choice of typographic quotes was unimportant, but it dodges the hard part of a solution: a suitable definition of punctuation, given the number of possibilities in Unicode and the ambiguity around how some of them are used.
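To make that ambiguity concrete, here is a rough sketch of one candidate definition: treat a character as punctuation when its Unicode general category starts with `P`. The helper names `is_unicode_punct` and `strip_unicode_punct` are hypothetical, and this is only one possible definition, not what gensim does.

```python
import string
import unicodedata


def is_unicode_punct(ch):
    """True if the character's Unicode general category is one of the P* classes."""
    return unicodedata.category(ch).startswith("P")


def strip_unicode_punct(s):
    """Replace every P* character with a space (hypothetical helper; runs are not collapsed)."""
    return "".join(" " if is_unicode_punct(ch) else ch for ch in s)


# ASCII characters that string.punctuation includes but Unicode classes
# as symbols (S*) rather than punctuation (P*), e.g. '$' and '+'.
ascii_symbols = [ch for ch in string.punctuation if not is_unicode_punct(ch)]
print(ascii_symbols)

# Typographic quotes fall in the P* categories, so they are removed here.
print(repr(strip_unicode_punct("\u201cHello,\u201d she said.")))
```

Under this definition the typographic quotes are stripped, but characters such as `$` and `+` are kept, which changes behaviour for existing ASCII input. Any Unicode-aware definition has to settle a trade-off of this kind.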
I can think of three large classes of response:
I found this helpful in exploring possible answers to my particular use case:
Thanks for all the hard work on this great library.