-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Tweet Tokenizer #13
Conversation
Codecov Report
@@ Coverage Diff @@
## master #13 +/- ##
==========================================
- Coverage 80.86% 75.88% -4.99%
==========================================
Files 9 10 +1
Lines 277 651 +374
==========================================
+ Hits 224 494 +270
- Misses 53 157 +104
Continue to review full report at Codecov.
|
for record keeping purposes this is to address #3 |
My feeling is that tokenizers will remain simple enough that they can all go in the readme. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
I've just done a first pass for style and basic optimisation.
Looks pretty Ok.
but I'll do another more careful pass over once these changes are made,
and once tests are written.
All globals need to be declared const
.
You seem to use a lot of inner functions which are called only once,
and I am not seeing much gain in clarity from them.
The functionality could be moved to where it is used,
or they coukd be made outer functions
as suits.
Now only the tests and documentation part remain :) |
I have made the suggested changes, you may review this PR. |
Last tiny things then we can merge this. |
The changes have been made and pushed. The CI tests also pass. |
🎉 |
The tweet tokenizer has been added.
The following is how the tokenizer works:
and EMOTICONS_REGEX.
replace_html_entities
is used to replace the html_entities (eg: "Price: £100
" becomes "Price: £100
")......
" becomes "...
" and "waaaaay
" becomes "waaay
" also the twitter handles are optionally removed.preserve_case
by default is set totrue
. If it is set tofalse
,then the tokenizer will downcase everything except for emoticons.
I have 2 questions related to this -
I am currently matching the regex in
replace_html_entities
function twice for the ones that match the outer capturing group. Is there a way that I can use replace and pass all the matched subgroups into theconvert_entity_function
inside it? Refer https://github.com/Ayushk4/WordTokenizers.jl/blob/853331c43a1122690fdd32a78fd040034433c8d0/src/words/tweet_tokenizer.jl#L188 and 132-134 lines on the same file.Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers will be added, so maybe we can have examples on each.