Fixed UNICODE processing with the strip_non_letters flag in src/preprocessing.jl #265

Merged
3 commits merged into JuliaText:master on Oct 26, 2023

Conversation

sigmundv
Contributor

Changed the regex for strip_non_letters in src/preprocessing.jl to [^…\p{L}\\s], because [^a-zA-Z\\s] also matches non-ASCII letters and therefore removes characters with diacritics, for example.
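To make the difference between the two character classes concrete, here is a minimal sketch using only Base Julia. The helper names `strip_ascii` and `strip_unicode` are hypothetical, and `[^\p{L}\s]` is a simplified stand-in: the part of the merged pattern elided as `…` above is not reproduced, so this should not be read as the package's actual regex.

```julia
# Hypothetical helpers for illustration only; this is not the TextAnalysis.jl source.
const ASCII_ONLY = r"[^a-zA-Z\s]"       # class described above as the old behaviour
const UNICODE_LETTERS = r"[^\p{L}\s]"   # simplified Unicode-aware class (assumption)

# Delete every character matched by the negated class.
strip_ascii(s::AbstractString) = replace(s, ASCII_ONLY => "")
strip_unicode(s::AbstractString) = replace(s, UNICODE_LETTERS => "")

# "café naïve", written with escapes so the accented letters are precomposed.
word = "caf\u00e9 na\u00efve"

println(strip_ascii(word))    # "caf nave": the accented letters are dropped
println(strip_unicode(word))  # "café naïve": the accented letters are kept
```

Both classes still remove digits and punctuation; only the treatment of letters outside `a-z`/`A-Z` differs.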

@aviks
Member

aviks commented Mar 1, 2023

Thanks. Would be nice to get a testcase (something that fails in the current version, but works with this fix) so that we're confident of not breaking this in the future.

Added use cases with Unicode for the Corpus preprocessing with `strip_non_letters` flag.
@rssdev10
Collaborator

This change breaks the previous logic of the `strip_non_letters` flag. The initial implementation of `strip_non_letters` retains only basic Latin characters; even other European letters are removed. However, the current preprocessing test includes an explicit check for the removal of the Greek letters "υπ":

```julia
@testset "Preprocessing" begin
    sample_text1 = "This is 1 MESSED υπ string!"
    sample_text1_wo_punctuation = "This is 1 MESSED υπ string"
    sample_text1_wo_punctuation_numbers = "This is  MESSED υπ string"
    sample_text1_wo_punctuation_numbers_case = "this is  messed υπ string"
    sample_text1_wo_punctuation_numbers_case_az = "this is  messed  string"
    #...
end
```

I am not sure whether that check reflects a real use case. The ability to handle Unicode is more useful.
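The conflict with the existing test can be sketched with the `Test` standard library. The function `keep_letters` is a hypothetical stand-in for the flag's behaviour, and `[^\p{L}\s]` is again a simplified assumption rather than the exact merged pattern. Under the ASCII-only class the sample loses "υπ", which is what the `_az` expectation above encodes; under a Unicode-aware class the same input passes through unchanged.

```julia
using Test

# Hypothetical stand-in for the flag's behaviour; the real logic lives in src/preprocessing.jl.
keep_letters(s::AbstractString, pattern::Regex) = replace(s, pattern => "")

# Same text as sample_text1_wo_punctuation_numbers_case in the test quoted above.
sample = "this is  messed υπ string"

@testset "strip_non_letters sketch" begin
    # ASCII-only class: the Greek letters are removed.
    @test keep_letters(sample, r"[^a-zA-Z\s]") == "this is  messed  string"
    # Unicode-aware class (assumption): the Greek letters survive.
    @test keep_letters(sample, r"[^\p{L}\s]") == "this is  messed υπ string"
end
```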

rssdev10 requested a review from aviks on October 24, 2023 02:43
rssdev10 changed the title from "Fixed the regex for strip_non_letters in src/preprocessing.jl" to "Fixed UNICODE processing with the strip_non_letters flag in src/preprocessing.jl" on Oct 25, 2023
rssdev10 merged commit 6d00310 into JuliaText:master on Oct 26, 2023
7 checks passed