PR: To address performance issues with stopword removal #141

Merged (2 commits) on Apr 18, 2019

@asbisen (Contributor) commented Apr 10, 2019

This PR addresses the performance regression reported in #140. It brings the run time on my test dataset (~3.4 MB) down from 940 s to 0.27 s.

The primary change is a replacement for the remove_patterns method, which in turn forced a change to the strip_whitespace handling in the prepare! method:

function remove_patterns(s::AbstractString, rex::Regex) 
  return replace(s, rex => "")
end
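
As a rough illustration of how a single-regex remove_patterns can be used for stopword removal, here is a minimal sketch. The stopword list and the way the regex is built are assumptions made for this example; the PR description does not show how the package constructs its pattern.

```julia
# Function as given in the PR description.
function remove_patterns(s::AbstractString, rex::Regex)
    return replace(s, rex => "")
end

# Hypothetical stopword list, for illustration only.
stopwords = ["the", "a", "of"]

# Assumption: all stopwords are combined into one alternation,
# anchored on word boundaries so "a" does not match inside "parser".
rex = Regex("\\b(?:" * join(stopwords, "|") * ")\\b")

text = "the speed of a parser"
result = remove_patterns(text, rex)

# Matches are deleted, so the surrounding spaces are left behind:
# result == " speed   parser"
println(repr(result))
```

Because every match is replaced with an empty string, the text is scanned once per call, and the leftover spaces have to be cleaned up by a separate whitespace step.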

I have also modified the test cases to make them consistent: stripping punctuation or stripping a pattern now replaces the match with a zero-length string, i.e. deletes it.

This required special handling for whitespace removal: any run of one or more spaces is replaced with a single space, and all leading and trailing spaces are stripped.
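
The whitespace rule described above could look roughly like the following. This is a sketch, not the package's implementation: the helper name squeeze_whitespace is hypothetical, and matching any whitespace character (\s) rather than only the space character is an assumption.

```julia
# Hypothetical helper; the name is illustrative and not part of the package.
function squeeze_whitespace(s::AbstractString)
    # Collapse each run of whitespace into a single space.
    # Assumption: tabs and newlines are treated like spaces.
    collapsed = replace(s, r"\s+" => " ")
    # Drop leading and trailing whitespace.
    return strip(collapsed)
end

cleaned = squeeze_whitespace("  fast   stopword  removal ")
println(repr(cleaned))  # "fast stopword removal"
```

Unlike pattern and punctuation stripping, this step cannot simply delete its matches, since that would glue neighbouring words together.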

I don't think there is a single right way to do certain pre-processing tasks. For example, with strip_punctuation, what is the correct way to handle the following strings when removing punctuation?

  • don't mind! => don t mind or dont mind
  • Intel(tm) Core i5-3300k => Intel tm Core i5 3300k or Inteltm Core i53300k
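
The two candidate behaviours in the examples above can be compared side by side. Both helpers below are hypothetical names written for this comparison, and the [[:punct:]] character class is an assumption about what counts as punctuation; the package may use a different pattern.

```julia
# Policy 1: delete punctuation outright (replace with a zero-length string).
delete_punct(s::AbstractString) = replace(s, r"[[:punct:]]" => "")

# Policy 2: replace punctuation with a space, then collapse runs of
# whitespace and trim the ends.
function space_punct(s::AbstractString)
    spaced = replace(s, r"[[:punct:]]" => " ")
    return strip(replace(spaced, r"\s+" => " "))
end

for s in ["don't mind!", "Intel(tm) Core i5-3300k"]
    println(repr(s), " => ", repr(delete_punct(s)), " or ", repr(space_punct(s)))
end
```

Deleting keeps contractions together ("dont") but merges tokens such as "i53300k", while substituting a space keeps tokens apart at the cost of fragments like "t" and "tm".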
