preprocessing.strip_punctuation does not handle Unicode #2962

sciatro · 2020-09-28T13:33:46Z

Problem description

RE_PUNCT in parsing/preprocessing.py, which is the substance of preprocessing.strip_punctuation, does not consider Unicode punctuation.

RE_PUNCT (=re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)) depends on the standard library string module's punctuation string, which is limited to ASCII punctuation.
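The ASCII limitation can be checked with the standard library alone. The sketch below rebuilds the pattern from the expression quoted above (the local name RE_PUNCT mirrors gensim's but is defined here, not imported) and checks it against an ASCII quote and a typographic one. The three flag names are illustrative.

```python
import re
import string

# Rebuilt locally from the expression quoted above; not imported from gensim.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)

# string.punctuation contains only ASCII characters.
all_ascii = all(ord(ch) < 128 for ch in string.punctuation)

# The ASCII double quote matches; the typographic one (U+201C) does not.
matches_ascii_quote = RE_PUNCT.search('"') is not None
matches_curly_quote = RE_PUNCT.search('\u201c') is not None

print(all_ascii, matches_ascii_quote, matches_curly_quote)  # True True False
```

The re.UNICODE flag does not help here: it changes how classes such as \w behave, not which literal characters are listed inside the character set.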

Steps/code/corpus to reproduce

>>> from gensim.parsing import preprocessing
>>> preprocessing.strip_punctuation('This is a “quoted string” which has the typographic quotes whereas "this one" does not.')
'This is a “quoted string” which has the typographic quotes whereas  this one  does not '

For the above input I think the correct output would be:

'This is a  quoted string  which has the typographic quotes whereas  this one  does not '

Possible solutions

The choice of typographic quotes in the example above is arbitrary, and it dodges the hard part of any solution: settling on a suitable definition of punctuation, given the number of candidate characters in Unicode and the ambiguity around how some of them are used.

I can think of three large classes of response:

1. Anything other than ASCII is ambiguous: leave it as is and document the limitation.
2. Use a hand-defined map of Unicode equivalents to ASCII punctuation, e.g. Unidecode.
3. Exclude characters based on Unicode database category, e.g. this SO answer.
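A minimal sketch of the second class of response, assuming a hand-defined table. UNICODE_TO_ASCII and strip_punctuation_mapped are hypothetical names, and the table is deliberately tiny; a maintained table such as the one in Unidecode covers far more characters. RE_PUNCT is rebuilt locally from the expression in the problem description.

```python
import re
import string

# Hypothetical, deliberately small equivalency table for illustration only.
UNICODE_TO_ASCII = str.maketrans({
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u2026': '...',  # horizontal ellipsis
})

# Same pattern as quoted in the problem description, rebuilt locally.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)


def strip_punctuation_mapped(text):
    """Map known Unicode punctuation to ASCII, then apply the ASCII-only rule."""
    return RE_PUNCT.sub(' ', text.translate(UNICODE_TO_ASCII))


print(strip_punctuation_mapped('a \u201cquoted string\u201d here'))
```

Anything missing from the table passes through untouched, which is the maintenance cost this approach carries.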

I found this helpful in exploring possible answers to my particular use case:

import sys
import unicodedata

# (code point, character) pairs in the punctuation ('P*') or symbol ('S*') categories
to_look_at = [(i, chr(i)) for i in range(sys.maxunicode + 1)
              if unicodedata.category(chr(i))[0] in ('P', 'S')]

Thanks for all the hard work on this great library.


piskvorky · 2020-09-28T14:14:40Z

You are right. All these functions in gensim.preprocessing are fairly naive, they won't stand up to deep industry use. My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this. And then for the rest of your pipeline, it becomes garbage-in, garbage-out…

We're still deciding whether to axe gensim.preprocessing completely, so as not to mislead users into unrealistic expectations about its ability, or keep & improve it incrementally. CC @mpenkov .

sciatro · 2020-09-28T16:01:03Z

As a batteries-included set of first-pass utilities, I find them very useful when starting any project. It's really just this issue of punctuation that comes up with any regularity for me.

piskvorky · 2020-09-28T16:05:51Z

OK good.

A PR to improve the preprocessing functionality (~better punctuation) is welcome. As long as you're aware the future of preprocessing is uncertain, feel free to use & improve it!

sciatro · 2020-09-28T18:15:04Z

In terms of improving, I guess the question is in what way.

Of the three conceptual directions I outlined under Possible solutions above:

1. Documenting the limitation and letting it be may be best given the uncertainty. The patch would just add the qualification "ASCII" to the docstring for strip_punctuation, i.e. """Replace punctuation characters... becomes """Replace ASCII punctuation characters...
2. Using equivalency tables to mutate the input and then applying the ASCII rules works well as a simple first pass in my experience, but it requires either a new dependency or a commitment to maintaining the tables. Neither seems in line with the ambivalence about the future of this functionality. If you're open to a new dependency, the patch is just to pass the input string through unidecode.unidecode.
3. Character removal based on Unicode categories is easy enough to do (± special Emoji handling 🤷‍♀️). It is mostly an architectural question of whether to put the literal in the source or enumerate the members of each category at runtime, i.e. do you want to put the to_look_at literal value under version control, or the expression for i in range(sys.maxunicode) if unicodedata.category(chr(i))? That choice requires perspective on the maintainability of a big, important library (which I don't have).
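The runtime-enumeration variant of the third direction might look like the sketch below. PUNCT_TABLE and strip_unicode_punctuation are hypothetical names, and the definition of punctuation (general category starting with 'P') is an assumption made for illustration, not a settled answer to the question raised above.

```python
import sys
import unicodedata

# Assumption: "punctuation" means every code point whose general category
# starts with 'P'. Symbols ('S*', which include most emoji) are left alone.
PUNCT_TABLE = {
    cp: ' '
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}


def strip_unicode_punctuation(text):
    """Replace each Unicode punctuation character with a single space."""
    return text.translate(PUNCT_TABLE)


print(strip_unicode_punctuation('a \u201cquoted string\u201d and "this one".'))
```

Two differences from the current behaviour are worth noting. The regex collapses a run of punctuation into one space, while this table replaces character by character. And several members of string.punctuation, such as $, +, < and ~, fall under symbol categories rather than 'P', so a category-only rule would stop stripping them unless 'S' is included as well.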

TL;DR:

If you're going to pull out the functionality: I'd vote option 1
If you want to keep the functionality: I'd vote option 3 (and I'm happy to participate in conversation about pros and cons of two approaches / produce patch for preferred approach)
If you're an end user looking for a low effort next step, look at point 2 above

piskvorky · 2020-09-28T19:13:16Z

Option 1) sounds good to me, as a first step. As always, a pull request with the fix (fixed documentation) is welcome :)

mpenkov · 2020-09-29T07:27:20Z

> My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this.

@piskvorky Perhaps we should make this obvious in the module docstring?

piskvorky · 2020-09-29T07:30:24Z

Maybe. We talk about it in the core tutorials.

* Clarifying strip_punctuation limited to ASCII: Add ASCII as qualification on `strip_punctuation` doc string. This is "option 1" fix for issue #2962
* Added code comment pointing to issue 2962: Code comment added linking to issue #2962 as a reminder of enhancement possibilities.
* update CHANGELOG.md

Co-authored-by: Michael Penkov

piskvorky added the "feature" label (issue describes a new feature) Sep 28, 2020

sciatro added a commit to sciatro/gensim that referenced this issue Sep 29, 2020

Clarifying strip_punctuation limited to ASCII

ae6c4c9

Add ASCII as qualification on `strip_punctuation` doc string. This is "option 1" fix for issue piskvorky#2962

sciatro mentioned this issue Sep 29, 2020

Document that preprocessing.strip_punctuation is limited to ASCII punctuation characters #2964

Merged

sciatro added a commit to sciatro/gensim that referenced this issue Sep 30, 2020

Added code comment pointing to issue 2962

a24b227

Code comment added linking to issue piskvorky#2962 as a reminder of enhancement possibilities.
