Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preprocessing.strip_punctuation does not handle Unicode #2962

Open
sciatro opened this issue Sep 28, 2020 · 7 comments
Open

preprocessing.strip_punctuation does not handle Unicode #2962

sciatro opened this issue Sep 28, 2020 · 7 comments
Labels
feature Issue described a new feature

Comments

@sciatro
Copy link
Contributor

sciatro commented Sep 28, 2020

Problem description

RE_PUNCT in parsing/preprocessing.py, which is the substance of preprocessing.strip_punctuation does not consider Unicode punctuation.

RE_PUNCT (=re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)) depends on the standard library string module's punctuation string which is limited to ascii punctuation.

Steps/code/corpus to reproduce

>>> from gensim.parsing import preprocessing
>>> preprocessing.strip_punctuation('This is a “quoted string” which has the typographic quotes whereas "this one" does not.')
'This is a “quoted string” which has the typographic quotes whereas  this one  does not '

For the above input I think the correct output would be:

'This is a  quoted string  which has the typographic quotes whereas  this one  does not '

Possible solutions

In the above example my choice of typographic quotes was unimportant but dodges the hard part of a solution which will be a suitable definition of punctuation given the number of possibilities in unicode and ambiguity around some associated uses of those possibilities.

I can think of three large classes of response:

  • Anything other than ascii is ambiguous, leave it as is and document
  • Use a hand defined map of Unicode equivalency to ascii punctuation, e.g. Unidecode
  • Exclude based on database category, e.g. this SO answer

I found this helpful in exploring possible answers to my particular use case:

import unicodedata
import sys

to_look_at = [(i, chr(i)) for i in range(sys.maxunicode) if unicodedata.category(chr(i))[0] in ('P', 'S')]

Thanks for all the hard work on this great library.

@piskvorky
Copy link
Owner

piskvorky commented Sep 28, 2020

You are right. All these functions in gensim.preprocessing are fairly naive, they won't stand up to deep industry use. My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this. And then for the rest of your pipeline, it becomes garbage-in, garbage-out…

We're still deciding whether to axe gensim.preprocessing completely, so as not to mislead users into unrealistic expectations about its ability, or keep & improve it incrementally. CC @mpenkov .

@piskvorky piskvorky added the feature Issue described a new feature label Sep 28, 2020
@sciatro
Copy link
Contributor Author

sciatro commented Sep 28, 2020

As a batteries includes set of first-pass utilities I find them very useful when starting any project. It's really just this issue of punctuation that comes up with any regularity for me.

@piskvorky
Copy link
Owner

piskvorky commented Sep 28, 2020

OK good.

A PR to improve the preprocessing functionality (~better punctuation) is welcome. As long as you're aware the future of preprocessing is uncertain, feel free to use & improve it!

@sciatro
Copy link
Contributor Author

sciatro commented Sep 28, 2020

In terms of improving. Guess question is in what way.

Of the three conceptual directions I outlined under Possible solutions above:

  1. Document the limitation and let it be may be best given uncertainty. The patch would just be to add the qualification of "ASCII" to the doc for strip_punctuation, i.e. ("""Replace punctuation characters... becomes """Replace ASCII punctuation characters...).

  2. Using equivalency tables to mutate the input and then applying ASCII rules works well as a simple first pass in my experience but does require a new dependency or a commitment to maintaining the tables. Neither a new dependency nor a data maintenance project seems inline with the ambivalence about the future of this functionality. If you're open to a new dependency the patch is just to pass the input string through unidecode.unidecode.

  3. Character removal based on unicode categories is easy enough to do (± special Emoji handling 🤷‍♀️). Doing so is mostly an architectural question about whether you want to put the literal in the source or enumerate the instances of each category at runtime, i.e. do you want to put the to_look_at literal value under version control or put for i in range(sys.maxunicode) if unicodedata.category(chr(i)) under version control? That architectural choice would require perspective on maintainability of a big important library (which I don't have).

TL;DR:

  • If you're going to pull out the functionality: I'd vote option 1
  • If you want to keep the functionality: I'd vote option 3 (and I'm happy to participate in conversation about pros and cons of two approaches / produce patch for preferred approach)
  • If you're an end user looking for a low effort next step, look at point 2 above

@piskvorky
Copy link
Owner

piskvorky commented Sep 28, 2020

Option 1) sounds good to me, as a first step. As always, a pull request with the fix (fixed documenation) is welcome :)

@mpenkov
Copy link
Collaborator

mpenkov commented Sep 29, 2020

My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this.

@piskvorky Perhaps we should make this obvious in the module docstring?

@piskvorky
Copy link
Owner

Maybe. We talk about it in the core tutorials.

sciatro added a commit to sciatro/gensim that referenced this issue Sep 29, 2020
Add ASCII as qualification on `strip_punctuation` doc string. 
This is "option 1" fix for issue piskvorky#2962
sciatro added a commit to sciatro/gensim that referenced this issue Sep 30, 2020
Code comment added linking to issue piskvorky#2962 as a reminder of enhancement possibilities.
mpenkov added a commit that referenced this issue Jun 29, 2021
)

* Clarifying strip_punctuation limited to ASCII

Add ASCII as qualification on `strip_punctuation` doc string. 
This is "option 1" fix for issue #2962

* Added code comment pointing to issue 2962

Code comment added linking to issue #2962 as a reminder of enhancement possibilities.

* update CHANGELOG.md

Co-authored-by: Michael Penkov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Issue described a new feature
Projects
None yet
Development

No branches or pull requests

3 participants