Add Tweet Tokenizer #13

Merged
45 commits merged on Jun 6, 2019
Commits
cd26b75
Add Regex
Ayushk4 Jan 22, 2019
d7532ec
Add function to replace HTML entities
Ayushk4 Jan 25, 2019
5c87dae
Add tweet tokenizer
Ayushk4 Jan 31, 2019
fd927d1
Add docstrings for functions
Ayushk4 Jan 31, 2019
4da422e
Add support for tweet tokenizer
Ayushk4 Feb 2, 2019
2e6b4c2
Update README
Ayushk4 Feb 2, 2019
853331c
Fix bug for optional arguments
Ayushk4 Feb 2, 2019
e3d2fa0
Add dependencies to REQUIRE
Ayushk4 Feb 3, 2019
c06ae26
Minor Code fixes
Ayushk4 Feb 4, 2019
1b65d8e
Improve code clarity
Ayushk4 Feb 4, 2019
320ce4d
Add comments and better variable naming
Ayushk4 Feb 8, 2019
94542ef
Add first series of tests
Ayushk4 Feb 8, 2019
1999e27
Add second series of tests
Ayushk4 Feb 8, 2019
ad94e30
Add tests and fix bugs
Ayushk4 Feb 9, 2019
4ec3f0a
Add final set of tests, fix links, typo
Ayushk4 Feb 9, 2019
164974b
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.jl
Ayushk4 Mar 7, 2019
8007a17
Make Replace entities 30x faster
Ayushk4 Mar 10, 2019
9aac4b2
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Mar 10, 2019
50539ba
Use TokenBuffer to speed up pre_processing functions
Ayushk4 Mar 12, 2019
59f8b0c
Fix indentation and bugs
Ayushk4 Mar 13, 2019
aef0efe
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Apr 11, 2019
cbb01e8
Add regex-free emoticons via TokenBuffer
Ayushk4 Apr 12, 2019
703ebc4
Add ascii arrows and html tags
Ayushk4 Apr 18, 2019
77b505a
Add functions for twitter hashtags and email addresses
Ayushk4 May 17, 2019
7440301
Fix Bugs
Ayushk4 May 18, 2019
6e20d5d
Add functions for twitterusernames and ellipses
Ayushk4 May 18, 2019
ecae2b9
Fix bugs in emailaddresses
Ayushk4 May 19, 2019
7661b8d
Update fast.jl, Support signs (+,-) in numbers
Ayushk4 May 19, 2019
7d6fa21
Switch to TokenBuffer for Tweet Tokenizer
Ayushk4 May 21, 2019
66adbf8
Add TokenBuffer function for nltk's tweet tokenizer - phone numbers
Ayushk4 May 22, 2019
a6de434
Add nltk_url1
Ayushk4 May 24, 2019
b9f0c44
Finish nltk_url1
Ayushk4 May 24, 2019
040368b
Add urls to tweet Tokenizer
Ayushk4 May 24, 2019
697dee4
Remove option of converting to lowercase
Ayushk4 May 24, 2019
927e4b3
Remove regex patterns
Ayushk4 May 24, 2019
75db813
Fix Bugs in tweet tokenizing functions
Ayushk4 May 31, 2019
ce7c74b
Finish nltk url function
Ayushk4 May 31, 2019
d9a019f
Add tests
Ayushk4 Jun 2, 2019
bebe5bd
Fix Bugs in tweet tokenizer
Ayushk4 Jun 2, 2019
e2120ad
Fix indentation
Ayushk4 Jun 3, 2019
1dc5445
Update README for TokenBuffer
Ayushk4 Jun 3, 2019
f18ae44
Update Docs for custom token TokenBuffer tokenizers, functions
Ayushk4 Jun 3, 2019
b0d8dd4
Minor doc changes
Ayushk4 Jun 3, 2019
fcfd107
Clean up code for tweet Tokenizer
Ayushk4 Jun 3, 2019
c7bd296
Change vectors into tuples
Ayushk4 Jun 5, 2019
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -71,6 +71,8 @@ The word tokenizers basically assume sentence splitting has already been done.

(To me it seems like a weird historical thing that NLTK has 2 successive variation on improving the Penn tokenizer, but for now I am matching it and having both. See [[NLTK#2005]](https://github.com/nltk/nltk/issues/2005))

- **Tweet Tokenizer:** (`tweet_tokenize`) NLTK's casual tokenizer, designed specifically for tweets. Beyond Twitter-specific tokens, it handles emoticons and other web artifacts such as HTML entities well. It closely matches NLTK's `nltk.tokenize.TweetTokenizer`.


# Sentence Splitters
We currently only have one sentence splitter.
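The new tokenizer is exported as `tweet_tokenize`. A minimal usage sketch, assuming a version of the package with this PR merged is installed (exact token boundaries follow NLTK's casual tokenizer and may vary by version):

```julia
using WordTokenizers

# Hashtags, usernames, emoticons, and URLs should survive as single tokens,
# unlike under the whitespace/punctuation-based tokenizers above.
tweet = "@JuliaText rocks!! Check https://julialang.org #nlp :-D"
println(tweet_tokenize(tweet))
```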
2 changes: 2 additions & 0 deletions REQUIRE
@@ -1 +1,3 @@
julia 0.7
HTML_Entities
StrTables
7 changes: 7 additions & 0 deletions src/WordTokenizers.jl
@@ -1,8 +1,14 @@

module WordTokenizers

using HTML_Entities
using StrTables
using Unicode


export poormans_tokenize, punctuation_space_tokenize,
       penn_tokenize, improved_penn_tokenize, nltk_word_tokenize,
       tweet_tokenize,
       tokenize,
       rulebased_split_sentences,
       split_sentences,
@@ -16,6 +22,7 @@ include("words/simple.jl")
include("words/nltk_word.jl")
include("words/reversible_tokenize.jl")
include("words/sedbased.jl")
include("words/tweet_tokenizer.jl")
include("sentences/sentence_splitting.jl")
include("words/TokTok.jl")

2 changes: 1 addition & 1 deletion src/split_api.jl
@@ -3,7 +3,7 @@
export Words, Sentences

 const tokenizers = [poormans_tokenize, punctuation_space_tokenize,
-                    penn_tokenize, improved_penn_tokenize, nltk_word_tokenize]
+                    penn_tokenize, improved_penn_tokenize, nltk_word_tokenize, tweet_tokenize]
const sentence_splitters = [rulebased_split_sentences]

const Words = tokenize
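Since `tweet_tokenize` is now registered in `tokenizers`, it can also be made the process-wide default. A sketch assuming the package's documented `set_tokenizer` / `split` API:

```julia
using WordTokenizers

# `Words` routes Base.split through whichever tokenizer is currently the
# default (nltk_word_tokenize out of the box).
set_tokenizer(tweet_tokenize)
tokens = split("Best. Tokenizer. Ever. #hype", Words)
println(tokens)
```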
9 changes: 6 additions & 3 deletions src/words/fast.jl
@@ -214,9 +214,13 @@ end

Matches numbers such as `10,000.5`, preserving formatting.
"""
-function number(ts, sep = (':', ',', '\'', '.'))
-    isdigit(ts[]) || return false
+function number(ts, sep = (':', ',', '\'', '.'); check_sign = false)
+    i = ts.idx
+    if check_sign && ts[] ∈ ['+', '-'] && (i == 1 || isspace(ts[i-1]))
+        i += 1
+    end
+
+    i <= length(ts.input) && isdigit(ts[i]) || return false
     while i <= length(ts.input) && (isdigit(ts[i]) ||
           (ts[i] in sep && i < length(ts.input) && isdigit(ts[i+1])))
         i += 1
@@ -225,4 +229,3 @@ function number(ts, sep = (':', ',', '\'', '.'))
    ts.idx = i
    return true
end
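The sign handling added above is compact; a rough Python sketch of the same scan may help (illustrative only — `scan_number` and its parameters are not part of the library):

```python
def scan_number(s, start, seps=(":", ",", "'", "."), check_sign=False):
    """Return the index just past a number beginning at s[start], else None.

    A leading '+'/'-' is consumed only when check_sign is set and the sign
    is at the start of the string or preceded by whitespace, mirroring the
    diff above. A separator is kept only when followed by another digit,
    so "10,000.5" matches in full but a trailing '.' is left behind.
    """
    i = start
    if check_sign and i < len(s) and s[i] in "+-" and (i == 0 or s[i - 1].isspace()):
        i += 1
    if i >= len(s) or not s[i].isdigit():
        return None  # no digit where a number was expected
    while i < len(s) and (s[i].isdigit() or
                          (s[i] in seps and i + 1 < len(s) and s[i + 1].isdigit())):
        i += 1
    return i

text = "-10,000.5 apples"
end = scan_number(text, 0, check_sign=True)
print(text[:end])  # -10,000.5
```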
