Add Tweet Tokenizer #13

Merged: 45 commits, Jun 6, 2019

Commits
cd26b75
Add Regex
Ayushk4 Jan 22, 2019
d7532ec
Add function to replace HTML entities
Ayushk4 Jan 25, 2019
5c87dae
Add tweet tokenizer
Ayushk4 Jan 31, 2019
fd927d1
Add docstrings for functions
Ayushk4 Jan 31, 2019
4da422e
Add support for tweet tokenizer
Ayushk4 Feb 2, 2019
2e6b4c2
Update README
Ayushk4 Feb 2, 2019
853331c
Fix bug for optional argurments
Ayushk4 Feb 2, 2019
e3d2fa0
Add dependencies to REQUIRE
Ayushk4 Feb 3, 2019
c06ae26
Minor Code fixes
Ayushk4 Feb 4, 2019
1b65d8e
Improve code clarity
Ayushk4 Feb 4, 2019
320ce4d
Add comments and better variable naming
Ayushk4 Feb 8, 2019
94542ef
Add first series of tests
Ayushk4 Feb 8, 2019
1999e27
Add second series of tests
Ayushk4 Feb 8, 2019
ad94e30
Add tests and fix bugs
Ayushk4 Feb 9, 2019
4ec3f0a
Add final set of tests, fix links,typo
Ayushk4 Feb 9, 2019
164974b
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.jl
Ayushk4 Mar 7, 2019
8007a17
Make Replace entities 30x faster
Ayushk4 Mar 10, 2019
9aac4b2
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Mar 10, 2019
50539ba
Use TokenBuffer to speed up pre_processing functions
Ayushk4 Mar 12, 2019
59f8b0c
Fix indentation and bugs
Ayushk4 Mar 13, 2019
aef0efe
Merge branch 'master' of https://github.com/JuliaText/WordTokenizers.…
Ayushk4 Apr 11, 2019
cbb01e8
Add regex-free emoticons via TokenBuffer
Ayushk4 Apr 12, 2019
703ebc4
Add ascii arrows and html tags
Ayushk4 Apr 18, 2019
77b505a
Add functions for twitter hashtags and email addresses
Ayushk4 May 17, 2019
7440301
Fix Bugs
Ayushk4 May 18, 2019
6e20d5d
Add functions for twitterusernames and ellipses
Ayushk4 May 18, 2019
ecae2b9
Fix bugs in emailaddresses
Ayushk4 May 19, 2019
7661b8d
Update fast.jl, Support signs (+,-) in numbers
Ayushk4 May 19, 2019
7d6fa21
Switch to TokenBuffer for Tweet Tokenizer
Ayushk4 May 21, 2019
66adbf8
Add TokenBuffer function for nltk's tweet tokenizer - phone numbers
Ayushk4 May 22, 2019
a6de434
Add nltk_url1
Ayushk4 May 24, 2019
b9f0c44
Finish nltk_url1
Ayushk4 May 24, 2019
040368b
Add urls to tweet Tokenizer
Ayushk4 May 24, 2019
697dee4
Remove option of converting to lowercase
Ayushk4 May 24, 2019
927e4b3
Remove regex patterns
Ayushk4 May 24, 2019
75db813
Fix Bugs in tweet tokenizing functions
Ayushk4 May 31, 2019
ce7c74b
Finish nltk url function
Ayushk4 May 31, 2019
d9a019f
Add tests
Ayushk4 Jun 2, 2019
bebe5bd
Fix Bugs in tweet tokenizer
Ayushk4 Jun 2, 2019
e2120ad
Fix indentation
Ayushk4 Jun 3, 2019
1dc5445
Update README for TokenBuffer
Ayushk4 Jun 3, 2019
f18ae44
Update Docs for custom token TokenBuffer tokenizers, functions
Ayushk4 Jun 3, 2019
b0d8dd4
Minor doc changes
Ayushk4 Jun 3, 2019
fcfd107
Clean up code for tweet Tokenizer
Ayushk4 Jun 3, 2019
c7bd296
Change vectors into tuples
Ayushk4 Jun 5, 2019
183 changes: 180 additions & 3 deletions README.md
@@ -65,12 +65,14 @@ The word tokenizers basically assume sentence splitting has already been done.

- **Penn Tokenizer:** (`penn_tokenize`) This is Robert MacIntyre's original tokenizer used for the Penn Treebank. Splits contractions.
- **Improved Penn Tokenizer:** (`improved_penn_tokenize`) NLTK's improved Penn Treebank Tokenizer. Very similar to the original, some improvements on punctuation and contractions. This matches to NLTK's `nltk.tokenize.TreeBankWordTokenizer.tokenize`
- **NLTK Word tokenizer:** (`nltk_word_tokenize`) NLTK's even more improved version of the Penn Tokenizer. This version has better unicode handling and some other changes. This matches to the most commonly used `nltk.word_tokenize`, minus the sentence tokenizing step.

(To me it seems like a weird historical thing that NLTK has 2 successive variation on improving the Penn tokenizer, but for now I am matching it and having both. See [[NLTK#2005]](https://github.com/nltk/nltk/issues/2005))

- **Reversible Tokenizer:** (`rev_tokenize` and `rev_detokenize`) This tokenizer splits on punctuations, space and special symbols. The generated tokens can be de-tokenized by using the `rev_detokenizer` function into the state before tokenization.
- **TokTok Tokenizer:** (`toktok_tokenize`) This tokenizer is a simple, general tokenizer, where the input has one sentence per line; thus only final period is tokenized. Tok-tok has been tested on and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. **(default tokenizer)**
- **Tweet Tokenizer:** (`tweet_tokenize`) NLTK's casual tokenizer, designed specifically for tweets. Beyond Twitter-specific tokens, it handles emoticons well and supports other web constructs such as HTML entities. This closely matches NLTK's `nltk.tokenize.TweetTokenizer`; a usage sketch follows the list below.
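
A minimal usage sketch (the input is made up and the commented output is only illustrative of the kind of split produced, not verbatim library output):

```julia
using WordTokenizers

tweet_tokenize("Loving #JuliaLang :-) @JuliaText https://julialang.org")
# illustrative: ["Loving", "#JuliaLang", ":-)", "@JuliaText", "https://julialang.org"]
```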


# Sentence Splitters
We currently only have one sentence splitter.
@@ -112,3 +114,178 @@ So
`split(foo, Words)` is the same as `tokenize(foo)`,
and
`split(foo, Sentences)` is the same as `split_sentences(foo)`.
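
For example, a quick sketch of the equivalence described above (any short text works):

```julia
using WordTokenizers

text = "The quick brown fox jumps over the lazy dog. It barked."
split(text, Words) == tokenize(text)              # true
split(text, Sentences) == split_sentences(text)   # true
```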

## Using TokenBuffer API for Custom Tokenizers
We offer a `TokenBuffer` API and supporting utility lexers
for high-speed tokenization.

#### Writing your own TokenBuffer tokenizers

`TokenBuffer` turns a string into a readable stream, used for building tokenizers.
Utility lexers such as `spaces` and `number` read characters from the
stream and append the tokens they match to the token array.

Lexers return `true` or `false` to indicate whether they matched
in the input stream. They can therefore be combined easily, e.g.

```julia
spacesornumber(ts) = spaces(ts) || number(ts)
```

either skips whitespace or parses a number token, if possible.

The simplest useful tokenizer splits on spaces.

```julia
using WordTokenizers: TokenBuffer, isdone, spaces, character

function tokenise(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) || character(ts)
    end
    return ts.tokens
end

tokenise("foo bar baz") # ["foo", "bar", "baz"]
```

Many prewritten components for building custom tokenizers
can be found in `src/words/fast.jl` and `src/words/tweet_tokenizer.jl`.
These components can be mixed and matched to create more complex tokenizers.

Here is a more complex example.

```julia
julia> using WordTokenizers: TokenBuffer, isdone, character, spaces # Present in fast.jl

julia> using WordTokenizers: nltk_url1, nltk_url2, nltk_phonenumbers # Present in tweet_tokenizer.jl

julia> function tokenize(input)
           urls(ts) = nltk_url1(ts) || nltk_url2(ts)

           ts = TokenBuffer(input)
           while !isdone(ts)
               spaces(ts) && continue
               urls(ts) ||
               nltk_phonenumbers(ts) ||
               character(ts)
           end
           return ts.tokens
       end
tokenize (generic function with 1 method)

julia> tokenize("A url https://github.com/JuliaText/WordTokenizers.jl/ and phonenumber +0 (987) - 2344321")
6-element Array{String,1}:
"A"
"url"
"https://github.com/JuliaText/WordTokenizers.jl/" # URL detected.
"and"
"phonenumber"
"+0 (987) - 2344321" # Phone number detected.
```

#### Tips for writing custom tokenizers and your own TokenBuffer Lexer

1. The order in which the lexers are tried needs to be taken care of in some cases.

   For example, `987-654-3210` matches as a phone number
   as well as plain numbers, but `number` alone will only match up to `987`
   and split the input around it.

```julia
julia> using WordTokenizers: TokenBuffer, isdone, character, spaces, nltk_phonenumbers, number

julia> order1(ts) = number(ts) || nltk_phonenumbers(ts)
order1 (generic function with 1 method)

julia> order2(ts) = nltk_phonenumbers(ts) || number(ts)
order2 (generic function with 1 method)

julia> function tokenize1(input)
           ts = TokenBuffer(input)
           while !isdone(ts)
               order1(ts) ||
               character(ts)
           end
           return ts.tokens
       end
tokenize1 (generic function with 1 method)

julia> function tokenize2(input)
           ts = TokenBuffer(input)
           while !isdone(ts)
               order2(ts) ||
               character(ts)
           end
           return ts.tokens
       end
tokenize2 (generic function with 1 method)

julia> tokenize1("987-654-3210") # number(ts) || nltk_phonenumbers(ts)
5-element Array{String,1}:
"987"
"-"
"654"
"-"
"3210"

julia> tokenize2("987-654-3210") # nltk_phonenumbers(ts) || number(ts)
1-element Array{String,1}:
"987-654-3210"
```

2. `BoundsError` and other errors on edge cases are the most common problems
   and need to be taken care of while writing TokenBuffer lexers.

3. For some TokenBuffer `ts`, use `flush!(ts, token)`
   over `push!(ts.tokens, input[i:j])`, to make sure that characters
   already sitting in the buffer also get flushed out as separate tokens.

```julia
julia> using WordTokenizers: TokenBuffer, flush!, spaces, character, isdone

julia> function tokenize(input)
           ts = TokenBuffer(input)

           while !isdone(ts)
               spaces(ts) && continue
               my_pattern(ts) ||
               character(ts)
           end
           return ts.tokens
       end

julia> function my_pattern(ts) # Matches two consecutive `_` characters
           ts.idx + 1 <= length(ts.input) || return false

           if ts[ts.idx] == '_' && ts[ts.idx + 1] == '_'
               flush!(ts, "__") # Using flush!
               ts.idx += 2
               return true
           end

           return false
       end
my_pattern (generic function with 1 method)

julia> tokenize("hi__hello")
3-element Array{String,1}:
"hi"
"__"
"hello"

julia> function my_pattern(ts) # Matches two consecutive `_` characters
           ts.idx + 1 <= length(ts.input) || return false

           if ts[ts.idx] == '_' && ts[ts.idx + 1] == '_'
               push!(ts.tokens, "__") # Without using flush!
               ts.idx += 2
               return true
           end

           return false
       end
my_pattern (generic function with 1 method)

julia> tokenize("hi__hello")
2-element Array{String,1}:
"__"
"hihello"
```
2 changes: 2 additions & 0 deletions REQUIRE
@@ -1 +1,3 @@
julia 0.7
HTML_Entities
StrTables
7 changes: 7 additions & 0 deletions src/WordTokenizers.jl
@@ -1,8 +1,14 @@

module WordTokenizers

using HTML_Entities
using StrTables
using Unicode


export poormans_tokenize, punctuation_space_tokenize,
       penn_tokenize, improved_penn_tokenize, nltk_word_tokenize,
       tweet_tokenize,
       tokenize,
       rulebased_split_sentences,
       split_sentences,
@@ -16,6 +22,7 @@ include("words/simple.jl")
include("words/nltk_word.jl")
include("words/reversible_tokenize.jl")
include("words/sedbased.jl")
include("words/tweet_tokenizer.jl")
include("sentences/sentence_splitting.jl")
include("words/TokTok.jl")

2 changes: 1 addition & 1 deletion src/split_api.jl
@@ -3,7 +3,7 @@
export Words, Sentences

const tokenizers = [poormans_tokenize, punctuation_space_tokenize,
                    penn_tokenize, improved_penn_tokenize, nltk_word_tokenize, tweet_tokenize]
const sentence_splitters = [rulebased_split_sentences]

const Words = tokenize
29 changes: 16 additions & 13 deletions src/words/fast.jl
@@ -17,23 +17,23 @@ either skips whitespace or parses a number token, if possible.
The simplest possible tokeniser accepts any `character` with no token breaks:

function tokenise(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        character(ts)
    end
    return ts.tokens
end

tokenise("foo bar baz") # ["foo bar baz"]

The second simplest splits only on spaces:

function tokenise(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) || character(ts)
    end
    return ts.tokens
end

tokenise("foo bar baz") # ["foo", "bar", "baz"]
@@ -214,9 +214,13 @@ end

Matches numbers such as `10,000.5`, preserving formatting.
"""
function number(ts, sep = (':', ',', '\'', '.'); check_sign = false)
    i = ts.idx
    if check_sign && ts[] ∈ ['+', '-'] && (i == 1 || isspace(ts[i-1]))
        i += 1
    end

    i <= length(ts.input) && isdigit(ts[i]) || return false
    while i <= length(ts.input) && (isdigit(ts[i]) ||
            (ts[i] in sep && i < length(ts.input) && isdigit(ts[i+1])))
        i += 1
@@ -225,4 +229,3 @@ function number(ts, sep = (':', ',', '\'', '.'))
    ts.idx = i
    return true
end
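
Below is a short sketch of how the new `check_sign` keyword might be used from a custom tokenizer. It assumes `number` is importable as in the README examples above and that it flushes the matched span (including the leading sign) as a single token; the commented output is illustrative.

```julia
using WordTokenizers: TokenBuffer, isdone, spaces, character, number

# Treat a leading + or - (at the start of the input or after a space)
# as part of the numeric token.
signed_number(ts) = number(ts, check_sign = true)

function tokenise(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) || signed_number(ts) || character(ts)
    end
    return ts.tokens
end

tokenise("from -40 to +35")  # illustrative: ["from", "-40", "to", "+35"]
```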
