Add Tweet Tokenizer #13

Ayushk4 · 2019-02-03T05:47:49Z

The tweet tokenizer has been added.

Tokenizer added
Documentation
Tests

The following is how the tokenizer works:

Regular expressions are defined for WORD_REGEX (the core tokenizer), HANG_REGEX and EMOTICONS_REGEX.
The function replace_html_entities replaces HTML entities (e.g. "Price: &pound;100" becomes "Price: £100").
Optionally, the string is processed to shorten lengthened runs of characters ("......" becomes "..." and "waaaaay" becomes "waaay"), and Twitter handles are optionally removed.
The string is tokenized.
preserve_case is set to true by default. If it is set to false, the tokenizer downcases everything except emoticons.
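
A minimal sketch of the steps above, for illustration only. It is not the code in this PR: the two patterns are deliberately tiny stand-ins for WORD_REGEX and EMOTICONS_REGEX, and the names `sketch_tweet_tokenize`, `reduce_len` and `strip_handle` are assumptions made for this example (only `preserve_case` is named above).

```julia
# Hypothetical, simplified patterns; the PR's WORD_REGEX, HANG_REGEX and
# EMOTICONS_REGEX are far more thorough than these.
const SKETCH_EMOTICON = r"^[:;=][-']?[()DPp]$"
const SKETCH_TOKEN = r"[:;=][-']?[()DPp]|@\w+|\w+|[^\w\s]+"

# Collapse any run of 3 or more identical characters down to exactly 3.
reduce_lengthening_sketch(s) = replace(s, r"(.)\1{2,}" => s"\1\1\1")

# Remove Twitter handles such as "@user" (assumed keyword: strip_handle).
strip_handles_sketch(s) = replace(s, r"(?<![A-Za-z0-9_])@\w{1,20}" => "")

function sketch_tweet_tokenize(text; preserve_case=true, reduce_len=false, strip_handle=false)
    strip_handle && (text = strip_handles_sketch(text))
    reduce_len && (text = reduce_lengthening_sketch(text))
    tokens = [String(m.match) for m in eachmatch(SKETCH_TOKEN, text)]
    preserve_case && return tokens
    # Downcase everything except emoticons, so ":D" does not become ":d".
    return [occursin(SKETCH_EMOTICON, t) ? t : lowercase(t) for t in tokens]
end
```

With all three options switched on, `sketch_tweet_tokenize("@user Waaaaay COOL :D", preserve_case=false, reduce_len=true, strip_handle=true)` drops the handle, shortens "Waaaaay", lowercases the words and leaves the emoticon untouched.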

I have two questions related to this:

In the replace_html_entities function, I am currently matching the regex twice for the matches that hit the outer capturing group. Is there a way to use replace and pass all the matched subgroups into the convert_entity function inside it? See https://github.com/Ayushk4/WordTokenizers.jl/blob/853331c43a1122690fdd32a78fd040034433c8d0/src/words/tweet_tokenizer.jl#L188 and lines 132-134 of the same file.
Should we host documentation for this repository, similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that it will grow as more tokenizers are added, so maybe we can have examples for each.
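
On the first question, a possible approach is sketched here under stated assumptions. As far as I know, when `replace` is given a function it passes only the matched substring, not the `RegexMatch`, so the captures cannot be reached without matching again. One workaround is to iterate with `eachmatch`, which does expose the captures, and rebuild the string in a single pass. The regex, the small entity table and the `*_sketch` names are hypothetical and are not the ones used in this PR.

```julia
# Hypothetical, tiny entity table; real code would cover the full HTML entity set.
const SKETCH_ENTITIES = Dict("amp" => "&", "lt" => "<", "gt" => ">", "pound" => "£")

# Capture 1: "x" for hex references, capture 2: the digits, capture 3: an entity name.
const SKETCH_ENTITY_REGEX = r"&(?:#(x?)([0-9A-Fa-f]+)|(\w+));"

# Convert a single match using its captures; unknown entities are left unchanged.
function convert_entity_sketch(m::RegexMatch)
    hexflag, digits, name = m.captures
    if name !== nothing
        return get(SKETCH_ENTITIES, String(name), String(m.match))
    end
    code = tryparse(Int, digits; base = isempty(hexflag) ? 10 : 16)
    valid = code !== nothing && code <= 0x10FFFF && !(0xD800 <= code <= 0xDFFF)
    return valid ? string(Char(code)) : String(m.match)
end

# Each match is found once; the text between matches is copied through as-is.
function replace_entities_sketch(text::AbstractString)
    out = IOBuffer()
    pos = 1
    for m in eachmatch(SKETCH_ENTITY_REGEX, text)
        print(out, SubString(text, pos, prevind(text, m.offset)))
        print(out, convert_entity_sketch(m))
        pos = m.offset + ncodeunits(m.match)
    end
    print(out, SubString(text, pos))
    return String(take!(out))
end
```

Here `replace_entities_sketch("Price: &pound;100")` gives "Price: £100". The trade-off is a little manual bookkeeping of positions in exchange for running the regex only once per entity.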

codecov-io · 2019-02-03T06:09:28Z

Codecov Report

Merging #13 into master will decrease coverage by 4.98%.
The diff coverage is 72%.

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
- Coverage   80.86%   75.88%   -4.99%     
==========================================
  Files           9       10       +1     
  Lines         277      651     +374     
==========================================
+ Hits          224      494     +270     
- Misses         53      157     +104

Impacted Files	Coverage Δ
src/split_api.jl	`0% <ø> (ø)`	⬆️
src/words/fast.jl	`81.81% <66.66%> (+0.56%)`	⬆️
src/words/tweet_tokenizer.jl	`72.04% <72.04%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d877dc...c7bd296.

oxinabox · 2019-02-03T21:21:21Z

For record-keeping purposes: this addresses #3.

oxinabox · 2019-02-03T21:23:11Z

> Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers will be added, so maybe we can have examples on each.

My feeling is that tokenizers will remain simple enough that they can all go in the README.
If that changes, we can look at a solution like Documenter.jl and a documentation page. But I find that, in general, dealing with that is a surprisingly constant source of complexity for a project.

src/words/tweet_tokenizer.jl

oxinabox

Cool!
I've just done a first pass for style and basic optimisation.
Looks pretty OK,
but I'll do another, more careful pass once these changes are made
and once tests are written.

All globals need to be declared const.

You seem to use a lot of inner functions which are called only once,
and I am not seeing much gain in clarity from them.
The functionality could be moved to where it is used,
or they could be made outer functions,
as suits.
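
A small illustration of both points, as a sketch rather than the PR's code (the names are hypothetical). A non-const global can be rebound to a value of any type, so functions that read it cannot be specialised on its type. Declaring it `const` fixes that, and a top-level helper can replace a closure that is only called once.

```julia
# Declared const: the binding cannot change type, so functions that use it
# can be compiled against a known Regex.
const SKETCH_HANDLE_REGEX = r"(?<![A-Za-z0-9_])@\w+"

# An outer (top-level) helper instead of a function defined inside the tokenizer.
remove_handles_sketch(text::AbstractString) = replace(text, SKETCH_HANDLE_REGEX => "")

# The tokenizer body stays short and simply calls the helper.
function tokenize_sketch(text::AbstractString)
    return split(remove_handles_sketch(text))
end
```

For example, `tokenize_sketch("@bob hello @alice world")` splits what is left after the handles are removed.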

Ayushk4 · 2019-06-01T17:39:30Z

Now only the tests and documentation part remain :)

Ayushk4 · 2019-06-05T10:01:52Z

I have made the suggested changes; you may review this PR.

oxinabox · 2019-06-05T10:19:24Z

Last tiny things then we can merge this.

Ayushk4 · 2019-06-05T10:47:39Z

The changes have been made and pushed. The CI tests also pass.

oxinabox · 2019-06-05T13:12:34Z

🎉

Ayushk4 added 8 commits January 22, 2019 22:14

Add Regex

cd26b75

Add function to replace HTML entities

d7532ec

Add tweet tokenizer

5c87dae

Add docstrings for functions

fd927d1

Add support for tweet tokenizer

4da422e

Update README

2e6b4c2

Fix bug for optional argurments

853331c

Add dependencies to REQUIRE

e3d2fa0

oxinabox reviewed Feb 3, 2019

Finish nltk url function

ce7c74b

Ayushk4 commented Jun 1, 2019

Ayushk4 added 4 commits June 2, 2019 20:58

Add tests

d9a019f

Fix Bugs in tweet tokenizer

bebe5bd

Fix indentation

e2120ad

Update README for TokenBuffer

1dc5445

oxinabox reviewed Jun 3, 2019

Update Docs for custom token TokenBuffer tokenizers, functions

f18ae44

Ayushk4 added 2 commits June 3, 2019 18:52

Minor doc changes

b0d8dd4

Clean up code for tweet Tokenizer

fcfd107

oxinabox reviewed Jun 5, 2019

Change vectors into tuples

c7bd296

oxinabox merged commit db3707b into JuliaText:master Jun 6, 2019
