From 39d2e499606e1eb63ab089b8a11807f4f2639cc8 Mon Sep 17 00:00:00 2001 From: Roman S Samarev Date: Thu, 26 Oct 2023 21:31:46 -0700 Subject: [PATCH] documentation: refresh, adding dynamical calculation, moving from .md to juliadoc --- README.md | 2 +- docs/make.jl | 3 + docs/src/LM.md | 94 ++++--------- docs/src/classify.md | 47 ++----- docs/src/corpus.md | 70 ++-------- docs/src/documents.md | 300 ++++++++++------------------------------ docs/src/example.md | 3 +- docs/src/features.md | 185 +++++++------------------ docs/src/semantic.md | 81 +++++------ src/LM/api.jl | 18 ++- src/LM/preprocessing.jl | 2 +- src/bayes.jl | 13 +- src/corpus.jl | 2 + src/summarizer.jl | 5 + 14 files changed, 250 insertions(+), 575 deletions(-) diff --git a/README.md b/README.md index 8d6f95c2..84cd9e28 100644 --- a/README.md +++ b/README.md @@ -43,5 +43,5 @@ Contributions, in the form of bug-reports, pull requests, additional documentati ## Support -Feel free to ask for help on the [Julia Discourse forum](https://discourse.julialang.org/), or in the `#natural-language` channel on [julia-slack](https://julialang.slack.com). (Which you can [join here](https://slackinvite.julialang.org/)). You can also raise issues in this repository to request new features and/or improvements to the documentation and codebase. +Feel free to ask for help on the [Julia Discourse forum](https://discourse.julialang.org/), or in the `#natural-language` channel on [julia-slack](https://julialang.slack.com). (Which you can [join here](https://julialang.org/slack/)). Or, [select what do you like here](https://julialang.org/community/). You can also raise issues in this repository to request new features and/or improvements to the documentation and codebase. diff --git a/docs/make.jl b/docs/make.jl index 41812cf6..f27612f4 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -19,3 +19,6 @@ makedocs( ], ) +deploydocs(; + repo="github.com/JuliaText/TextAnalysis.jl", +) diff --git a/docs/src/LM.md b/docs/src/LM.md index f0f2e6da..0289516d 100644 --- a/docs/src/LM.md +++ b/docs/src/LM.md @@ -84,8 +84,8 @@ julia> masked_score = maskedscore(model,fit,"is","alien") used to evaluate the probability of word given context (*P(word | context)*) -```julia -score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString) +```@docs +score ``` Arguments: @@ -100,91 +100,51 @@ Arguments: - In Interpolated language model, provide `Kneserney` and `WittenBell` smoothing ### `maskedscore` +```@docs +maskedscore +``` -- It is used to evaluate *score* with masks out of vocabulary words - -- The arguments are the same as for `score` - -### `logscore` - -- Evaluate the log score of this word in this context. +### `logscore` +```@docs +logscore +``` -- The arguments are the same as for `score` and `maskedscore` ### `entropy` -```julia -entropy(m::Langmodel,lm::DefaultDict,text_ngram::word::Vector{T}) where { T <: AbstractString} +```@docs +entropy ``` -- Calculate *cross-entropy* of model for given evaluation text. - -- Input text must be Array of ngram of same lengths - -### `perplexity` - -- Calculates the perplexity of the given text. - -- This is simply 2 ** cross-entropy(`entropy`) for the text, so the arguments are the same as `entropy`. +### `perplexity` +```@docs +perplexity +``` ## Preprocessing For Preprocessing following functions: - -1. 
`everygram`: Return all possible ngrams generated from sequence of items, as an Array{String,1} - -```julia -julia> seq = ["To","be","or","not"] -julia> a = everygram(seq,min_len=1, max_len=-1) - 10-element Array{Any,1}: - "or" - "not" - "To" - "be" - "or not" - "be or" - "be or not" - "To be or" - "To be or not" +```@docs +everygram +padding_ngram ``` -2. `padding_ngrams`: padding _ngram is used to pad both left and right of sentence and out putting ngrmas of order n - - It also pad the original input Array of string - -```julia -julia> example = ["1","2","3","4","5"] -julia> padding_ngrams(example,2,pad_left=true,pad_right=true) - 6-element Array{Any,1}: - " 1" - "1 2" - "2 3" - "3 4" - "4 5" - "5 " -``` ## Vocabulary Struct to store Language models vocabulary checking membership and filters items by comparing their counts to a cutoff value -It also Adds a special "unkown" tokens which unseen words are mapped to - -```julia -julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"] +It also Adds a special "unknown" tokens which unseen words are mapped to -julia> vocabulary = Vocabulary(words, 2) - Vocabulary(Dict(""=>1,"c"=>3,"a"=>3,"d"=>2), 2, "") +```@repl +using TextAnalysis +words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"] +vocabulary = Vocabulary(words, 2) # lookup a sequence or words in the vocabulary -julia> word = ["a", "-", "d", "c", "a"] - -julia> lookup(vocabulary ,word) - 5-element Array{Any,1}: - "a" - "" - "d" - "c" - "a" + +word = ["a", "-", "d", "c", "a"] + +lookup(vocabulary ,word) ``` diff --git a/docs/src/classify.md b/docs/src/classify.md index 8436cbce..51957451 100644 --- a/docs/src/classify.md +++ b/docs/src/classify.md @@ -11,42 +11,25 @@ To load the Naive Bayes Classifier, use the following command - Its usage can be done in the following 3 steps. 1- Create an instance of the Naive Bayes Classifier model - - - model = NaiveBayesClassifier(dict, classes) - - -It takes two arguments- - -* `classes`: An array of possible classes that the concerned data could belong to. -* `dict`:(Optional Argument) An Array of possible tokens (words). This is automatically updated if a new token is detected in the Step 2) or 3) - +```@docs +NaiveBayesClassifier +``` 2- Fitting the model weights on input - - - fit!(model, str, class) - +```@docs +fit! +``` 3- Predicting for the input case - - - predict(model, str) - -## Example - -```julia -julia> m = NaiveBayesClassifier([:legal, :financial]) -NaiveBayesClassifier{Symbol}(String[], Symbol[:legal, :financial], Array{Int64}(0,2)) +```@docs +predict ``` -```julia -julia> fit!(m, "this is financial doc", :financial) -NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc"], Symbol[:legal, :financial], [1 2; 1 2; 1 2; 1 2]) - -julia> fit!(m, "this is legal doc", :legal) -NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc", "legal"], Symbol[:legal, :financial], [1 2; 2 2; … ; 2 2; 2 1]) -``` +## Example -```julia -julia> predict(m, "this should be predicted as a legal document") -Dict{Symbol,Float64} with 2 entries: - :legal => 0.666667 - :financial => 0.333333 +```@repl +using TextAnalysis +m = NaiveBayesClassifier([:legal, :financial]) +fit!(m, "this is financial doc", :financial) +fit!(m, "this is legal doc", :legal) +predict(m, "this should be predicted as a legal document") ``` diff --git a/docs/src/corpus.md b/docs/src/corpus.md index 4fe8ab6e..eff92421 100644 --- a/docs/src/corpus.md +++ b/docs/src/corpus.md @@ -3,58 +3,18 @@ Working with isolated documents gets boring quickly. 
We typically want to work with a collection of documents. We represent collections of documents using the Corpus type: - -```julia -julia> crps = Corpus([StringDocument("Document 1"), - StringDocument("Document 2")]) -A Corpus with 2 documents: - * 2 StringDocument's - * 0 FileDocument's - * 0 TokenDocument's - * 0 NGramDocument's - -Corpus's lexicon contains 0 tokens -Corpus's index contains 0 tokens +```@docs +Corpus ``` ## Standardizing a Corpus -A `Corpus` may contain many different types of documents: - -```julia -julia> crps = Corpus([StringDocument("Document 1"), - TokenDocument("Document 2"), - NGramDocument("Document 3")]) -A Corpus with 3 documents: - * 1 StringDocument's - * 0 FileDocument's - * 1 TokenDocument's - * 1 NGramDocument's - -Corpus's lexicon contains 0 tokens -Corpus's index contains 0 tokens -``` - -It is generally more convenient to standardize all of the documents in a +A `Corpus` may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the `standardize!` function: -```julia -julia> standardize!(crps, NGramDocument) -``` - -After this step, you can check that the corpus only contains `NGramDocument`'s: - -```julia -julia> crps -A Corpus with 3 documents: - * 0 StringDocument's - * 0 FileDocument's - * 0 TokenDocument's - * 3 NGramDocument's - -Corpus's lexicon contains 0 tokens -Corpus's index contains 0 tokens +```@docs +standardize! ``` ## Processing a Corpus @@ -62,17 +22,13 @@ Corpus's index contains 0 tokens We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once: -```julia -julia> crps = Corpus([StringDocument("Document ..!!"), - StringDocument("Document ..!!")]) - -julia> prepare!(crps, strip_punctuation) - -julia> text(crps[1]) -"Document " - -julia> text(crps[2]) -"Document " +```@repl +using TextAnalysis +crps = Corpus([StringDocument("Document ..!!"), + StringDocument("Document ..!!")]) +prepare!(crps, strip_punctuation) +text(crps[1]) +text(crps[2]) ``` These operations are run on each document in the corpus individually. @@ -109,7 +65,7 @@ Dict{String,Int64} with 3 entries: But once this work is done, you can easier address lots of interesting questions about a corpus: -``` +```julia julia> lexical_frequency(crps, "Name") 0.5 diff --git a/docs/src/documents.md b/docs/src/documents.md index 8b3bc353..53b00ab8 100644 --- a/docs/src/documents.md +++ b/docs/src/documents.md @@ -13,107 +13,29 @@ allows one to work with documents stored in a variety of formats: Creating any of the four basic types of documents is very easy: -```julia -julia> str = "To be or not to be..." -"To be or not to be..." - -julia> sd = StringDocument(str) -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: To be or not to be... - -julia> pathname = "/usr/share/dict/words" -"/usr/share/dict/words" - -julia> fd = FileDocument(pathname) -A FileDocument - * Language: Languages.English() - * Title: /usr/share/dict/words - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah - -julia> my_tokens = String["To", "be", "or", "not", "to", "be..."] -6-element Array{String,1}: - "To" - "be" - "or" - "not" - "to" - "be..." 
- -julia> td = TokenDocument(my_tokens) -A TokenDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** - - -julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2, - "or" => 1, "not" => 1, - "to" => 1, "be..." => 1) -Dict{String,Int64} with 6 entries: - "or" => 1 - "be..." => 1 - "not" => 1 - "to" => 1 - "To" => 1 - "be" => 2 - -julia> ngd = NGramDocument(my_ngrams) -A NGramDocument{AbstractString} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** +```@docs +StringDocument +FileDocument +TokenDocument +NGramDocument ``` An NGramDocument consisting of bigrams or any higher order representation `N` can be easily created by passing the parameter `N` to `NGramDocument` -```julia -julia> NGramDocument("To be or not to be ...", 2) -A NGramDocument{AbstractString} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** +```@repl +using TextAnalysis +NGramDocument("To be or not to be ...", 2) ``` For every type of document except a `FileDocument`, you can also construct a new document by simply passing in a string of text: -```julia -julia> sd = StringDocument("To be or not to be...") -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: To be or not to be... - -julia> td = TokenDocument("To be or not to be...") -A TokenDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** - -julia> ngd = NGramDocument("To be or not to be...") -A NGramDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** +```@repl +using TextAnalysis +sd = StringDocument("To be or not to be...") +td = TokenDocument("To be or not to be...") +ngd = NGramDocument("To be or not to be...") ``` The system will automatically perform tokenization or n-gramization in order @@ -165,18 +87,12 @@ This constructor is very convenient for working in the REPL, but should be avoid Once you've created a document object, you can work with it in many ways. The most obvious thing is to access its text using the `text()` function: -```julia -julia> sd = StringDocument("To be or not to be...") -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: To be or not to be... - -julia> text(sd) -"To be or not to be..." +```@repl +using TextAnalysis +sd = StringDocument("To be or not to be..."); +text(sd) ``` + !!! note This function works without warnings on `StringDocument`'s and `FileDocument`'s. For `TokenDocument`'s it is not possible to know if the @@ -189,75 +105,38 @@ julia> text(sd) Instead of working with the text itself, you can work with the tokens or n-grams of a document using the `tokens()` and `ngrams()` functions: -```julia -julia> tokens(sd) -7-element Array{String,1}: - "To" - "be" - "or" - "not" - "to" - "be.." - "." - - julia> ngrams(sd) - Dict{String,Int64} with 7 entries: - "or" => 1 - "not" => 1 - "to" => 1 - "To" => 1 - "be" => 1 - "be.." 
=> 1 - "." => 1 +```@repl +using TextAnalysis +sd = StringDocument("To be or not to be..."); +tokens(sd) +ngrams(sd) ``` By default the `ngrams()` function produces unigrams. If you would like to produce bigrams or trigrams, you can specify that directly using a numeric argument to the `ngrams()` function: -```julia -julia> ngrams(sd, 2) -Dict{AbstractString,Int64} with 13 entries: - "To be" => 1 - "or not" => 1 - "be or" => 1 - "not to" => 1 - "to be.." => 1 - "be.. ." => 1 +```@repl +using TextAnalysis +sd = StringDocument("To be or not to be..."); +ngrams(sd, 2) ``` The `ngrams()` function can also be called with multiple arguments: -```julia -julia> ngrams(sd, 2, 3) -Dict{AbstractString,Int64} with 11 entries: - "or not to" => 1 - "be or" => 1 - "not to" => 1 - "be or not" => 1 - "not to be.." => 1 - "To be" => 1 - "or not" => 1 - "to be.. ." => 1 - "to be.." => 1 - "be.. ." => 1 - "To be or" => 1 +```@repl +using TextAnalysis +sd = StringDocument("To be or not to be..."); +ngrams(sd, 2, 3) ``` If you have a `NGramDocument`, you can determine whether an `NGramDocument` contains unigrams, bigrams or a higher-order representation using the `ngram_complexity()` function: -```julia -julia> ngd = NGramDocument("To be or not to be ...", 2) -A NGramDocument{AbstractString} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** - -julia> ngram_complexity(ngd) -2 +```@repl +using TextAnalysis +ngd = NGramDocument("To be or not to be ...", 2); +ngram_complexity(ngd) ``` This information is not available for other types of `Document` objects @@ -278,43 +157,25 @@ including the following pieces of information: Try these functions out on a `StringDocument` to see how the defaults work in practice: -```julia -julia> StringDocument("This document has too foo words") -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: This document has too foo words - -julia> language(sd) -Languages.English() - -julia> title(sd) -"Untitled Document" - -julia> author(sd) -"Unknown Author" - -julia> timestamp(sd) -"Unknown Time" +```@repl +using TextAnalysis +sd = StringDocument("This document has too foo words") +language(sd) +title(sd) +author(sd) +timestamp(sd) ``` If you need reset these fields, you can use the mutating versions of the same functions: -```julia -julia> language!(sd, Languages.Spanish()) -Languages.Spanish() - -julia> title!(sd, "El Cid") -"El Cid" - -julia> author!(sd, "Desconocido") -"Desconocido" - -julia> timestamp!(sd, "Desconocido") -"Desconocido" +```@repl +using TextAnalysis, Languages +sd = StringDocument("This document has too foo words") +language!(sd, Languages.Spanish()) +title!(sd, "El Cid") +author!(sd, "Desconocido") +timestamp!(sd, "Desconocido") ``` ## Preprocessing Documents @@ -325,44 +186,33 @@ important, but most text analysis tasks require some amount of preprocessing. At a minimum, your text source may contain corrupt characters. You can remove these using the `remove_corrupt_utf8!()` function: - remove_corrupt_utf8!(sd) +```@docs +remove_corrupt_utf8! +``` Alternatively, you may want to edit the text to remove items that are hard to process automatically. For example, our sample text sentence taken from Hamlet has three periods that we might like to discard. 
We can remove this kind of punctuation using the `prepare!()` function: -```julia -julia> str = StringDocument("here are some punctuations !!!...") - -julia> prepare!(str, strip_punctuation) - -julia> text(str) -"here are some punctuations " +```@repl +using TextAnalysis +str = StringDocument("here are some punctuations !!!...") +prepare!(str, strip_punctuation) +text(str) ``` * To remove case distinctions, use `remove_case!()` function: * At times you'll want to remove specific words from a document like a person's name. To do that, use the `remove_words!()` function: -```julia -julia> sd = StringDocument("Lear is mad") -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: Lear is mad - -julia> remove_case!(sd) - -julia> text(sd) -"lear is mad" - -julia> remove_words!(sd, ["lear"]) - -julia> text(sd) -" is mad" +```@repl +using TextAnalysis +sd = StringDocument("Lear is mad") +remove_case!(sd) +text(sd) +remove_words!(sd, ["lear"]) +text(sd) ``` At other times, you'll want to remove whole classes of words. To make this @@ -381,12 +231,12 @@ These special classes can all be removed using specially-named parameters: * `prepare!(sd, strip_articles)` * `prepare!(sd, strip_indefinite_articles)` * `prepare!(sd, strip_definite_articles)` -* `prepare!(sd, strip_preposition)` +* `prepare!(sd, strip_prepositions)` * `prepare!(sd, strip_pronouns)` * `prepare!(sd, strip_stopwords)` * `prepare!(sd, strip_numbers)` * `prepare!(sd, strip_non_letters)` -* `prepare!(sd, strip_spares_terms)` +* `prepare!(sd, strip_sparse_terms)` * `prepare!(sd, strip_frequent_terms)` * `prepare!(sd, strip_html_tags)` @@ -401,17 +251,9 @@ closely related like "dog" and "dogs" and stem them in order to produce a smaller set of words for analysis. We can do this using the `stem!()` function: -```julia -julia> sd = StringDocument("They write, it writes") -A StringDocument{String} - * Language: Languages.English() - * Title: Untitled Document - * Author: Unknown Author - * Timestamp: Unknown Time - * Snippet: They write, it writes - -julia> stem!(sd) - -julia> text(sd) -"They write , it write" +```@repl +using TextAnalysis +sd = StringDocument("They write, it writes") +stem!(sd) +text(sd) ``` diff --git a/docs/src/example.md b/docs/src/example.md index 23b122e4..a1444414 100644 --- a/docs/src/example.md +++ b/docs/src/example.md @@ -4,6 +4,7 @@ To show you how text analysis might work in practice, we're going to work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition. +```julia using TextAnalysis, MultivariateStats, Clustering crps = DirectoryCorpus("sotu") @@ -27,4 +28,4 @@ as part of the State of the Union Address tradition. T = tf_idf(D) cl = kmeans(T, 5) - +``` diff --git a/docs/src/features.md b/docs/src/features.md index d74f337b..12e90976 100644 --- a/docs/src/features.md +++ b/docs/src/features.md @@ -4,14 +4,12 @@ Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. 
Before we do this, we need to update the lexicon: -```julia -julia> crps = Corpus([StringDocument("To be or not to be"), - StringDocument("To become or not to become")]) - -julia> update_lexicon!(crps) - -julia> m = DocumentTermMatrix(crps) -A 2 X 6 DocumentTermMatrix +```@repl +using TextAnalysis +crps = Corpus([StringDocument("To be or not to be"), + StringDocument("To become or not to become")]) +update_lexicon!(crps) +m = DocumentTermMatrix(crps) ``` A `DocumentTermMatrix` object is a special type. If you would like to use @@ -121,36 +119,26 @@ julia> hash_dtv(crps[1]) Often we need to find out the proportion of a document is contributed by each term. This can be done by finding the term frequency function - tf(dtm) +```@docs +tf +``` The parameter, `dtm` can be of the types - `DocumentTermMatrix` , `SparseMatrixCSC` or `Matrix` -```julia -julia> crps = Corpus([StringDocument("To be or not to be"), - StringDocument("To become or not to become")]) - -julia> update_lexicon!(crps) - -julia> m = DocumentTermMatrix(crps) - -julia> tf(m) -2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries: - [1, 1] = 0.166667 - [2, 1] = 0.166667 - [1, 2] = 0.333333 - [2, 3] = 0.333333 - [1, 4] = 0.166667 - [2, 4] = 0.166667 - [1, 5] = 0.166667 - [2, 5] = 0.166667 - [1, 6] = 0.166667 - [2, 6] = 0.166667 +```@repl +using TextAnalysis +crps = Corpus([StringDocument("To be or not to be"), + StringDocument("To become or not to become")]) +update_lexicon!(crps) +m = DocumentTermMatrix(crps) +tf(m) ``` ## TF-IDF (Term Frequency - Inverse Document Frequency) - tf_idf(dtm) - +```@docs +tf_idf +``` In many cases, raw word counts are not appropriate for use because: * (A) Some documents are longer than other documents @@ -158,37 +146,13 @@ In many cases, raw word counts are not appropriate for use because: You can work around this by performing TF-IDF on a DocumentTermMatrix: -```julia -julia> crps = Corpus([StringDocument("To be or not to be"), - StringDocument("To become or not to become")]) - -julia> update_lexicon!(crps) - -julia> m = DocumentTermMatrix(crps) -DocumentTermMatrix( - [1, 1] = 1 - [2, 1] = 1 - [1, 2] = 2 - [2, 3] = 2 - [1, 4] = 1 - [2, 4] = 1 - [1, 5] = 1 - [2, 5] = 1 - [1, 6] = 1 - [2, 6] = 1, ["To", "be", "become", "not", "or", "to"], Dict("or"=>5,"not"=>4,"to"=>6,"To"=>1,"be"=>2,"become"=>3)) - -julia> tf_idf(m) -2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries: - [1, 1] = 0.0 - [2, 1] = 0.0 - [1, 2] = 0.231049 - [2, 3] = 0.231049 - [1, 4] = 0.0 - [2, 4] = 0.0 - [1, 5] = 0.0 - [2, 5] = 0.0 - [1, 6] = 0.0 - [2, 6] = 0.0 +```@repl +using TextAnalysis +crps = Corpus([StringDocument("To be or not to be"), + StringDocument("To become or not to become")]) +update_lexicon!(crps) +m = DocumentTermMatrix(crps) +tf_idf(m) ``` As you can see, TF-IDF has the effect of inserting 0's into the columns of @@ -211,23 +175,18 @@ The parameters κ and β default to 2 and 0.75 respectively. 
Here is an example usage - -```julia -julia> crps = Corpus([StringDocument("a a a sample text text"), StringDocument("another example example text text"), StringDocument(""), StringDocument("another another text text text text")]) - -julia> update_lexicon!(crps) - -julia> m = DocumentTermMatrix(crps) - -julia> bm_25(m) -4×5 SparseArrays.SparseMatrixCSC{Float64,Int64} with 8 stored entries: - [1, 1] = 1.29959 - [2, 2] = 0.882404 - [4, 2] = 1.40179 - [2, 3] = 1.54025 - [1, 4] = 1.89031 - [1, 5] = 0.405067 - [2, 5] = 0.405067 - [4, 5] = 0.676646 +```@repl +using TextAnalysis +crps = Corpus([ + StringDocument("a a a sample text text"), + StringDocument("another example example text text"), + StringDocument(""), + StringDocument("another another text text text text") +]) +update_lexicon!(crps) +m = DocumentTermMatrix(crps) + +bm_25(m) ``` ## Co occurrence matrix (COOM) @@ -250,49 +209,12 @@ the matrix can be extracted using `coom(::CooMatrix)`. The `terms` can also be extracted from this. Here is an example usage - -```julia - -julia> crps = Corpus([StringDocument("this is a string document"), - -julia> C = CooMatrix(crps, window=1, normalize=false) -CooMatrix{Float64}( - [2, 1] = 2.0 - [6, 1] = 2.0 - [1, 2] = 2.0 - [3, 2] = 2.0 - [2, 3] = 2.0 - [6, 3] = 2.0 - [5, 4] = 4.0 - [4, 5] = 4.0 - [6, 5] = 4.0 - [1, 6] = 2.0 - [3, 6] = 2.0 - [5, 6] = 4.0, ["string", "document", "token", "this", "is", "a"], OrderedDict("string"=>1,"document"=>2,"token"=>3,"this"=>4,"is"=>5,"a"=>6)) - -julia> coom(C) -6×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 12 stored entries: - [2, 1] = 2.0 - [6, 1] = 2.0 - [1, 2] = 2.0 - [3, 2] = 2.0 - [2, 3] = 2.0 - [6, 3] = 2.0 - [5, 4] = 4.0 - [4, 5] = 4.0 - [6, 5] = 4.0 - [1, 6] = 2.0 - [3, 6] = 2.0 - [5, 6] = 4.0 - -julia> C.terms -6-element Array{String,1}: - "string" - "document" - "token" - "this" - "is" - "a" - +```@repl +using TextAnalysis +crps = Corpus([StringDocument("this is a string document")]) +C = CooMatrix(crps, window=1, normalize=false) +coom(C) +C.terms ``` It can also be called to calculate the terms for @@ -339,19 +261,6 @@ ERROR: The tokens of an NGramDocument cannot be reconstructed TextAnalysis offers a simple text-rank based summarizer for its various document types. - summarize(d, ns) - -It takes 2 arguments: - -* `d` : A document of type `StringDocument`, `FileDocument` or `TokenDocument` -* `ns` : (Optional) Mention the number of sentences in the Summary, defaults to `5` sentences. - -```julia -julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.") - -julia> summarize(s, ns=2) -2-element Array{SubString{String},1}: - "Assume this Short Document as an example." - "This has too foo sentences." -``` - +```@docs +summarize +``` \ No newline at end of file diff --git a/docs/src/semantic.md b/docs/src/semantic.md index 7fa6a22e..6eb2d7d6 100644 --- a/docs/src/semantic.md +++ b/docs/src/semantic.md @@ -4,29 +4,33 @@ Often we want to think about documents from the perspective of semantic content. One standard approach to doing this, is to perform Latent Semantic Analysis or LSA on the corpus. - - lsa(crps) - lsa(dtm) +```@docs +lsa +``` lsa uses `tf_idf` for statistics. 
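+
+The return value is a `LinearAlgebra.SVD` factorization. As a minimal sketch
+(assuming a corpus `crps` built as in the example below, and binding the result
+to a variable `F` purely for illustration), its factors can be inspected directly:
+
+```julia
+F = lsa(crps)   # SVD factorization computed from the corpus' tf-idf statistics
+F.U             # left singular vectors; rows correspond to documents
+F.S             # singular values
+F.Vt            # right singular vectors; columns correspond to terms
+```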
-```julia -julia> crps = Corpus([StringDocument("this is a string document"), TokenDocument("this is a token document")]) -julia> F1.lsa(crps) -LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}([1.0 0.0; 0.0 1.0], [0.138629, 0.138629], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 1.0]) +```@repl +using TextAnalysis +crps = Corpus([ + StringDocument("this is a string document"), + TokenDocument("this is a token document") +]) +lsa(crps) ``` - lsa can also be performed on a `DocumentTermMatrix`. +```@repl +using TextAnalysis +crps = Corpus([ + StringDocument("this is a string document"), + TokenDocument("this is a token document") +]); +update_lexicon!(crps) -```julia -julia> update_lexicon!(crps) - -julia> m = DocumentTermMatrix(crps) -A 2 X 6 DocumentTermMatrix +m = DocumentTermMatrix(crps) -julia> F2 = lsa(m) -SVD{Float64,Float64,Array{Float64,2}}([1.0 0.0; 0.0 1.0], [0.138629, 0.138629], [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 1.0]) +lsa(m) ``` @@ -36,34 +40,25 @@ Another way to get a handle on the semantic content of a corpus is to use [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation): First we need to produce the DocumentTermMatrix -```julia -julia> crps = Corpus([StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words")]) -julia> update_lexicon!(crps) -julia> m = DocumentTermMatrix(crps) +```@docs +lda ``` - -Latent Dirichlet Allocation has two hyper parameters - -* _α_ : The hyperparameter for topic distribution per document. `α<1` yields a sparse topic mixture for each document. `α>1` yields a more uniform topic mixture for each document. -- _β_ : The hyperparameter for word distribution per topic. `β<1` yields a sparse word mixture for each topic. `β>1` yields a more uniform word mixture for each topic. - -```julia -julia> k = 2 # number of topics -julia> iterations = 1000 # number of gibbs sampling iterations - -julia> α = 0.1 # hyper parameter -julia> β = 0.1 # hyper parameter - -julia> ϕ, θ = lda(m, k, iterations, α, β) -( - [2 , 1] = 0.333333 - [2 , 2] = 0.333333 - [1 , 3] = 0.222222 - [1 , 4] = 0.222222 - [1 , 5] = 0.111111 - [1 , 6] = 0.111111 - [1 , 7] = 0.111111 - [2 , 8] = 0.333333 - [1 , 9] = 0.111111 - [1 , 10] = 0.111111, [0.5 1.0; 0.5 0.0]) +```@repl +using TextAnalysis +crps = Corpus([ + StringDocument("This is the Foo Bar Document"), + StringDocument("This document has too Foo words") +]); +update_lexicon!(crps) +m = DocumentTermMatrix(crps) + +k = 2 # number of topics +iterations = 1000 # number of gibbs sampling iterations +α = 0.1 # hyper parameter +β = 0.1 # hyper parameter + +ϕ, θ = lda(m, k, iterations, α, β); +ϕ +θ ``` See `?lda` for more help. diff --git a/src/LM/api.jl b/src/LM/api.jl index 382e00b9..066a21d7 100644 --- a/src/LM/api.jl +++ b/src/LM/api.jl @@ -1,7 +1,9 @@ -#TO DO -# Doc string """ $(TYPEDSIGNATURES) + +It is used to evaluate score with masks out of vocabulary words + +The arguments are the same as for [`score`](@ref) """ function maskedscore(m::Langmodel, temp_lm::DefaultDict, word, context)::Float64 score(m, temp_lm, lookup(m.vocab, [word])[begin], lookup(m.vocab, [context])[begin]) @@ -9,6 +11,10 @@ end """ $(TYPEDSIGNATURES) + +Evaluate the log score of this word in this context. 
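+
+This is the base-2 logarithm of [`maskedscore`](@ref), i.e. `log2(maskedscore(m, temp_lm, word, context))`.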
+
+The arguments are the same as for [`score`](@ref) and [`maskedscore`](@ref)
 """
 function logscore(m::Langmodel, temp_lm::DefaultDict, word, context)::Float64
     log2(maskedscore(m, temp_lm, word, context))
 end
@@ -16,6 +22,10 @@ end
 
 """
     $(TYPEDSIGNATURES)
+
+Calculate the *cross-entropy* of the model for the given evaluation text.
+
+The input text must be a `Vector` of ngrams of the same length.
 """
 function entropy(m::Langmodel, lm::DefaultDict, text_ngram::AbstractVector)::Float64
     n_sum = sum(text_ngram) do ngram
@@ -27,6 +37,10 @@ end
 
 """
     $(TYPEDSIGNATURES)
+
+Calculates the perplexity of the given text.
+
+This is simply 2^cross-entropy for the text, so the arguments are the same as for [`entropy`](@ref)
 """
 function perplexity(m::Langmodel, lm::DefaultDict, text_ngram::AbstractVector)::Float64
     return 2^(entropy(m, lm, text_ngram))
diff --git a/src/LM/preprocessing.jl b/src/LM/preprocessing.jl
index 12f5a222..d20540cd 100644
--- a/src/LM/preprocessing.jl
+++ b/src/LM/preprocessing.jl
@@ -44,7 +44,7 @@ padding _ngram is used to pad both left and right of sentence and out putting ng
 
 ```julia-repl
 julia> example = ["1","2","3","4","5"]
-julia> padding_ngrams(example,2,pad_left=true,pad_right=true)
+julia> padding_ngram(example,2,pad_left=true,pad_right=true)
  6-element Array{Any,1}:
   " 1"
   "1 2"
diff --git a/src/bayes.jl b/src/bayes.jl
index 44168ffc..5f2a0928 100644
--- a/src/bayes.jl
+++ b/src/bayes.jl
@@ -45,20 +45,25 @@ end
 
 A Naive Bayes Classifier for classifying documents.
 
+It takes two arguments:
+* `classes`: An array of the possible classes that the data could belong to.
+* `dict`: (optional) An array of possible tokens (words). It is updated automatically when a new token is detected during fitting or prediction.
+
 # Example
 ```julia-repl
 julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
+
 julia> m = NaiveBayesClassifier([:spam, :non_spam])
-NaiveBayesClassifier{Symbol}(String[], Symbol[:spam, :non_spam], Array{Int64}(0,2))
+NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
 
 julia> fit!(m, "this is spam", :spam)
-NaiveBayesClassifier{Symbol}(["this", "is", "spam"], Symbol[:spam, :non_spam], [2 1; 2 1; 2 1])
+NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])
 
 julia> fit!(m, "this is not spam", :non_spam)
-NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], Symbol[:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
+NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
 
 julia> predict(m, "is this a spam")
-Dict{Symbol,Float64} with 2 entries:
+Dict{Symbol, Float64} with 2 entries:
   :spam => 0.59883
   :non_spam => 0.40117
 ```
diff --git a/src/corpus.jl b/src/corpus.jl
index df8b5f79..926bac36 100644
--- a/src/corpus.jl
+++ b/src/corpus.jl
@@ -280,6 +280,8 @@ Corpus's index contains 0 tokens
 
 julia> standardize!(crps, NGramDocument)
 
+# After this step, you can check that the corpus only contains NGramDocument's:
+
 julia> crps
 A Corpus with 3 documents:
  * 0 StringDocument's
diff --git a/src/summarizer.jl b/src/summarizer.jl
index de3d63c4..b9d83f3a 100644
--- a/src/summarizer.jl
+++ b/src/summarizer.jl
@@ -2,6 +2,11 @@
     summarize(doc [, ns])
 
 Summarizes the document and returns `ns` number of sentences.
+It takes two arguments:
+
+* `d` : A document of type `StringDocument`, `FileDocument` or `TokenDocument`
+* `ns` : (optional) The number of sentences to include in the summary.
+
 By default `ns` is set to the value 5.
 
 # Example