diff --git a/docs/src/corpus.md b/docs/src/corpus.md index 876027c8..4e3f7da7 100644 --- a/docs/src/corpus.md +++ b/docs/src/corpus.md @@ -4,35 +4,51 @@ Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type: - crps = Corpus([StringDocument("Document 1"), - StringDocument("Document 2")]) +```julia +julia> crps = Corpus([StringDocument("Document 1"), + StringDocument("Document 2")]) +Corpus{StringDocument{String}}(StringDocument{String}[StringDocument{String}("Document 1", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), StringDocument{String}("Document 2", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100)) +``` ## Standardizing a Corpus A `Corpus` may contain many different types of documents: - crps = Corpus([StringDocument("Document 1"), - TokenDocument("Document 2"), - NGramDocument("Document 3")]) - +```julia +julia> crps = Corpus([StringDocument("Document 1"), + TokenDocument("Document 2"), + NGramDocument("Document 3")]) +Corpus{AbstractDocument}(AbstractDocument[StringDocument{String}("Document 1", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), TokenDocument{String}(["Document", "2"], DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("Document"=>1,"3"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100)) +``` It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the `standardize!` function: - standardize!(crps, NGramDocument) +```julia +julia> standardize!(crps, NGramDocument) +``` After this step, you can check that the corpus only contains `NGramDocument`'s: - crps +```julia +julia> crps +Corpus{AbstractDocument}(AbstractDocument[NGramDocument{String}(Dict("1"=>1,"Document"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("2"=>1,"Document"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("Document"=>1,"3"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100)) +``` ## Processing a Corpus We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once: - crps = Corpus([StringDocument("Document 1"), - StringDocument("Document 2")]) - remove_punctuation!(crps) +```julia +julia> crps = Corpus([StringDocument("Document ..!!"), + StringDocument("Document ..!!")]) + +julia> prepare!(crps, strip_punctuation) + +julia> crps +Corpus{StringDocument{String}}(StringDocument{String}[StringDocument{String}("Document ", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), StringDocument{String}("Document ", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100)) +``` These operations are run on each document in the corpus individually. 
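+
+Several preprocessing flags can also be combined with `|` so that one
+`prepare!` call runs multiple steps over the whole corpus. A minimal sketch
+(the `strip_numbers` flag and the `|` composition are assumptions carried
+over from how `prepare!` flags behave for individual documents, not output
+reproduced from a session):
+
+```julia
+julia> crps = Corpus([StringDocument("Document 1 ..!!"),
+                      StringDocument("Document 2 ..!!")])
+
+julia> prepare!(crps, strip_punctuation | strip_numbers)
+```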
@@ -47,34 +63,71 @@ In particular, we want to work with two constructs:
 
 Because computations involving the lexicon can take a long time, a
 `Corpus`'s default lexicon is blank:
 
-    lexicon(crps)
+```julia
+julia> crps = Corpus([StringDocument("Name Foo"),
+                      StringDocument("Name Bar")])
+
+julia> lexicon(crps)
+Dict{String,Int64} with 0 entries
+```
 
 In order to work with the lexicon, you have to update it and then access it:
 
-    update_lexicon!(crps)
-    lexicon(crps)
+```julia
+julia> update_lexicon!(crps)
+
+julia> lexicon(crps)
+Dict{String,Int64} with 3 entries:
+  "Bar"  => 1
+  "Foo"  => 1
+  "Name" => 2
+```
 
-But once this work is done, you can easier address lots of interesting
-questions about a corpus:
+But once this work is done, you can easily address lots of interesting
+questions about a corpus:
 
-    lexical_frequency(crps, "Summer")
-    lexical_frequency(crps, "Document")
+```julia
+julia> lexical_frequency(crps, "Name")
+0.5
+
+julia> lexical_frequency(crps, "Foo")
+0.25
+```
 
 Like the lexicon, the inverse index for a corpus is blank by default:
 
-    inverse_index(crps)
+```julia
+julia> inverse_index(crps)
+Dict{String,Array{Int64,1}} with 0 entries
+```
 
 Again, you need to update it before you can work with it:
 
-    update_inverse_index!(crps)
-    inverse_index(crps)
+```julia
+julia> update_inverse_index!(crps)
+
+julia> inverse_index(crps)
+Dict{String,Array{Int64,1}} with 3 entries:
+  "Bar"  => [2]
+  "Foo"  => [1]
+  "Name" => [1, 2]
+```
 
 But once you've updated the inverse index, you can easily search the entire
 corpus:
 
-    crps["Document"]
-    crps["1"]
-    crps["Summer"]
+```julia
+julia> crps["Name"]
+2-element Array{Int64,1}:
+ 1
+ 2
+
+julia> crps["Foo"]
+1-element Array{Int64,1}:
+ 1
+
+julia> crps["Summer"]
+0-element Array{Int64,1}
+```
 
-## Converting a DataFrame from a Corpus
+## Converting a Corpus to a DataFrame
 
@@ -83,3 +136,56 @@ corpus. The easiest way to do this is to convert a
 `Corpus` object into a `DataFrame`:
 
-    convert(DataFrame, crps)
+```julia
+julia> convert(DataFrame, crps)
+```
+
+## Corpus Metadata
+
+You can also retrieve the metadata for every document in a `Corpus` at once:
+
+* `languages()`: What language is the document in? Defaults to `Languages.English()`, a Language instance defined by the Languages package.
+* `titles()`: What is the title of the document? Defaults to `"Untitled Document"`.
+* `authors()`: Who wrote the document? Defaults to `"Unknown Author"`.
+* `timestamps()`: When was the document written? Defaults to `"Unknown Time"`.
+
+```julia
+julia> crps = Corpus([StringDocument("Name Foo"),
+                      StringDocument("Name Bar")])
+
+julia> languages(crps)
+2-element Array{Languages.English,1}:
+ Languages.English()
+ Languages.English()
+
+julia> titles(crps)
+2-element Array{String,1}:
+ "Untitled Document"
+ "Untitled Document"
+
+julia> authors(crps)
+2-element Array{String,1}:
+ "Unknown Author"
+ "Unknown Author"
+
+julia> timestamps(crps)
+2-element Array{String,1}:
+ "Unknown Time"
+ "Unknown Time"
+```
+
+It is possible to change the metadata fields for each document in a `Corpus`.
+These functions use the same metadata value for every document:
+
+```julia
+julia> languages!(crps, Languages.German())
+julia> titles!(crps, "")
+julia> authors!(crps, "Me")
+julia> timestamps!(crps, "Now")
+```
+
+Additionally, you can specify the metadata fields for each document in
+a `Corpus` individually:
+
+```julia
+julia> languages!(crps, [Languages.German(), Languages.English()])
+julia> titles!(crps, ["", "Untitled"])
+julia> authors!(crps, ["Ich", "You"])
+julia> timestamps!(crps, ["Unbekannt", "2018"])
+```
diff --git a/docs/src/documents.md b/docs/src/documents.md
index 097c87a4..7ba964f7 100644
--- a/docs/src/documents.md
+++ b/docs/src/documents.md
@@ -8,7 +8,8 @@ allows one to work with documents stored in a variety of formats:
 * _TokenDocument_ : A document represented as a sequence of UTF8 tokens
 * _NGramDocument_ : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts
 
-These format represent a hierarchy: you can always move down the hierachy, but can generally not move up the hierachy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.
+!!! note
+    These formats represent a hierarchy: you can always move down the hierarchy, but generally cannot move up the hierarchy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.
 
 Creating any of the four basic types of documents is very easy:
 
@@ -17,7 +18,7 @@ julia> str = "To be or not to be..."
 "To be or not to be..."
 
 julia> sd = StringDocument(str)
-StringDocument{String}("To be or not to be...", TextAnalysis.DocumentMetadata(Languages.English(), "Unnamed Document", "Unknown Author", "Unknown Time"))
+StringDocument{String}("To be or not to be...", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
 
 julia> pathname = "/usr/share/dict/words"
 "/usr/share/dict/words"
 
@@ -35,7 +36,7 @@ julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
 "be..."
 
 julia> td = TokenDocument(my_tokens)
-TokenDocument{String}(["To", "be", "or", "not", "to", "be..."], TextAnalysis.DocumentMetadata(Languages.English(), "Unnamed Document", "Unknown Author", "Unknown Time"))
+TokenDocument{String}(["To", "be", "or", "not", "to", "be..."], TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
 
 julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                      "or" => 1, "not" => 1,
@@ -49,15 +50,30 @@ Dict{String,Int64} with 6 entries:
   "be"    => 2
 
 julia> ngd = NGramDocument(my_ngrams)
-NGramDocument{AbstractString}(Dict{AbstractString,Int64}("or"=>1,"be..."=>1,"not"=>1,"to"=>1,"To"=>1,"be"=>2), 1, TextAnalysis.DocumentMetadata(Languages.English(), "Unnamed Document", "Unknown Author", "Unknown Time"))
+NGramDocument{AbstractString}(Dict{AbstractString,Int64}("or"=>1,"be..."=>1,"not"=>1,"to"=>1,"To"=>1,"be"=>2), 1, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
+
+An `NGramDocument` consisting of bigrams or any higher-order representation
+can easily be created by passing the order `N` to the `NGramDocument`
+constructor:
+
+```julia
+julia> ngd = NGramDocument("To be or not to be ...", 2)
+NGramDocument{AbstractString}(Dict{AbstractString,Int64}("to be"=>1,"not"=>1,"be or"=>1,"or"=>1,"not to"=>1,"To"=>1,".."=>1,"."=>1,"be .."=>1,"be"=>2…), 2, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
 ```
 
 For every type of document except a `FileDocument`, you can also construct a
 new document by simply passing in a string of text:
 
-    sd = StringDocument("To be or not to be...")
-    td = TokenDocument("To be or not to be...")
-    ngd = NGramDocument("To be or not to be...")
+```julia
+julia> sd = StringDocument("To be or not to be...")
+StringDocument{String}("To be or not to be...", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> td = TokenDocument("To be or not to be...")
+TokenDocument{String}(["To", "be", "or", "not", "to", "be..", "."], TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> ngd = NGramDocument("To be or not to be...")
+NGramDocument{String}(Dict("or"=>1,"not"=>1,"to"=>1,"To"=>1,"be"=>1,"be.."=>1,"."=>1), 1, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
 
 The system will automatically perform tokenization or n-gramization in order
 to produce the required data.
Unfortunately, `FileDocument`'s cannot be
@@ -68,10 +84,19 @@ That said, there is one way around this restriction: you can use the
 generic `Document()` constructor function, which will guess at the type of
 the inputs and construct the appropriate type of document object:
 
-    Document("To be or not to be...")
-    Document("/usr/share/dict/words")
-    Document(String["To", "be", "or", "not", "to", "be..."])
-    Document(Dict{String, Int}("a" => 1, "b" => 3))
+```julia
+julia> Document("To be or not to be...")
+StringDocument{String}("To be or not to be...", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> Document("/usr/share/dict/words")
+FileDocument("/usr/share/dict/words", TextAnalysis.DocumentMetadata(Languages.English(), "/usr/share/dict/words", "Unknown Author", "Unknown Time"))
+
+julia> Document(String["To", "be", "or", "not", "to", "be..."])
+TokenDocument{String}(["To", "be", "or", "not", "to", "be..."], TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> Document(Dict{String, Int}("a" => 1, "b" => 3))
+NGramDocument{AbstractString}(Dict{AbstractString,Int64}("b"=>3,"a"=>1), 1, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
 
 This constructor is very convenient for working in the REPL, but should be
 avoided in permanent code because, unlike the other constructors, the return
 type of the `Document` function cannot be known at compile-time.
 
@@ -80,32 +105,79 @@ This constructor is very convenient for working in the REPL, but should be avoid
 
 Once you've created a document object, you can work with it in many ways. The
 most obvious thing is to access its text using the `text()` function:
 
-    text(sd)
+```julia
+julia> sd = StringDocument("To be or not to be...")
+StringDocument{String}("To be or not to be...", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
 
-This function works without warnings on `StringDocument`'s and
-`FileDocument`'s. For `TokenDocument`'s it is not possible to know if the
-text can be reconstructed perfectly, so calling
-`text(TokenDocument("This is text"))` will produce a warning message before
-returning an approximate reconstruction of the text as it existed before
-tokenization. It is entirely impossible to reconstruct the text of an
-`NGramDocument`, so `text(NGramDocument("This is text"))` raises an error.
+julia> text(sd)
+"To be or not to be..."
+```
+!!! note
+    This function works without warnings on `StringDocument`'s and
+    `FileDocument`'s. For `TokenDocument`'s it is not possible to know if the
+    text can be reconstructed perfectly, so calling
+    `text(TokenDocument("This is text"))` will produce a warning message before
+    returning an approximate reconstruction of the text as it existed before
+    tokenization. It is entirely impossible to reconstruct the text of an
+    `NGramDocument`, so `text(NGramDocument("This is text"))` raises an error.
 
 Instead of working with the text itself, you can work with the tokens or
 n-grams of a document using the `tokens()` and `ngrams()` functions:
 
-    tokens(sd)
-    ngrams(sd)
+```julia
+julia> tokens(sd)
+7-element Array{String,1}:
+ "To"
+ "be"
+ "or"
+ "not"
+ "to"
+ "be.."
+ "."
+
+julia> ngrams(sd)
+Dict{String,Int64} with 7 entries:
+  "or"   => 1
+  "not"  => 1
+  "to"   => 1
+  "To"   => 1
+  "be"   => 1
+  "be.." => 1
+  "."    => 1
+```
 
 By default the `ngrams()` function produces unigrams.
If you would like to produce bigrams or trigrams, you can specify that
directly using a numeric argument to the `ngrams()` function:
 
-    ngrams(sd, 2)
+```julia
+julia> ngrams(sd, 2)
+Dict{AbstractString,Int64} with 13 entries:
+  "not"     => 1
+  "be.."    => 1
+  "be or"   => 1
+  "or"      => 1
+  "not to"  => 1
+  "To"      => 1
+  "."       => 1
+  "be"      => 1
+  "To be"   => 1
+  "or not"  => 1
+  "to be.." => 1
+  "be.. ."  => 1
+  "to"      => 1
+```
 
 If you have a `NGramDocument`, you can determine whether an `NGramDocument`
 contains unigrams, bigrams or a higher-order representation using the
 `ngram_complexity()` function:
 
-    ngram_complexity(ngd)
+```julia
+julia> ngd = NGramDocument("To be or not to be ...", 2)
+NGramDocument{AbstractString}(Dict{AbstractString,Int64}("to be"=>1,"not"=>1,"be or"=>1,"or"=>1,"not to"=>1,"To"=>1,".."=>1,"."=>1,"be .."=>1,"be"=>2…), 2, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> ngram_complexity(ngd)
+2
+```
 
 This information is not available for other types of `Document` objects
 because it is possible to produce any level of complexity when constructing
@@ -118,48 +190,46 @@ document, every document object also stores basic metadata about itself,
 including the following pieces of information:
 
 * `language()`: What language is the document in? Defaults to `Languages.English()`, a Language instance defined by the Languages package.
-* `title()`: What is the name of the document? Defaults to `"Untitled Document"`.
+* `title()`: What is the title of the document? Defaults to `"Untitled Document"`.
 * `author()`: Who wrote the document? Defaults to `"Unknown Author"`.
 * `timestamp()`: When was the document written? Defaults to `"Unknown Time"`.
 
 Try these functions out on a `StringDocument` to see how the defaults work in
 practice:
 
-    language(sd)
-    title(sd)
-    author(sd)
-    timestamp(sd)
+```julia
+julia> sd = StringDocument("This document has too foo words")
+StringDocument{String}("This document has too foo words", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> language(sd)
+Languages.English()
+
+julia> title(sd)
+"Untitled Document"
+
+julia> author(sd)
+"Unknown Author"
+
+julia> timestamp(sd)
+"Unknown Time"
+```
 
-If you need reset these fields, you can use the mutating versions of the same
-functions:
-
-    language!(sd, Languages.Spanish())
-    title!(sd, "El Cid")
-    author!(sd, "Desconocido")
-    timestamp!(sd, "Desconocido")
-
-You can also retrieve the metadata for every document in a `Corpus` at once:
-
-    languages(crps)
-    titles(crps)
-    authors(crps)
-    timestamps(crps)
-
-It is possible to change the metadata fields for each document in a `Corpus`.
-These functions use the same metadata value for every document:
-
-    languages!(crps, Languages.German())
-    titles!(crps, "")
-    authors!(crps, "Me")
-    timestamps!(crps, "Now")
-
-Additionally, you can specify the metadata fields for each document in
-a `Corpus` individually:
-
-    languages!(crps, [Languages.German(), Languages.English()])
-    titles!(crps, ["", "Untitled"])
-    authors!(crps, ["Ich", "You"])
-    timestamps!(crps, ["Unbekannt", "2018"])
+If you need to reset these fields, you can use the mutating versions of the
+same functions:
+
+```julia
+julia> language!(sd, Languages.Spanish())
+Languages.Spanish()
+
+julia> title!(sd, "El Cid")
+"El Cid"
+
+julia> author!(sd, "Desconocido")
+"Desconocido"
+
+julia> timestamp!(sd, "Desconocido")
+"Desconocido"
+```
 
 ## Preprocessing Documents
 
@@ -176,21 +246,33 @@ to process automatically.
-For example, our sample text sentence taken from Hamle has three periods
-that we might like to discard. We can remove this kind of punctuation using
-the `prepare!()` function:
+For example, our sample text sentence taken from Hamlet has three periods
+that we might like to discard. We can remove this kind of punctuation using
+the `prepare!()` function:
 
-    prepare!(sd, strip_punctuation)
+```julia
+julia> str = StringDocument("here are some punctuations !!!...")
+StringDocument{String}("here are some punctuations !!!...", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> prepare!(str, strip_punctuation)
+
+julia> str
+StringDocument{String}("here are some punctuations ", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
 
-Like punctuation, numbers and case distinctions are often easier removed than
-dealt with. To remove numbers or case distinctions, use the
-`remove_numbers!()` and `remove_case!()` functions:
-
-    remove_numbers!(sd)
-    remove_case!(sd)
-
-At times you'll want to remove specific words from a document like a person's
+To remove case distinctions, use the `remove_case!()` function. At times
+you'll want to remove specific words from a document like a person's
 name. To do that, use the `remove_words!()` function:
 
-    sd = StringDocument("Lear is mad")
-    remove_words!(sd, ["Lear"])
+```julia
+julia> sd = StringDocument("Lear is mad")
+StringDocument{String}("Lear is mad", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> remove_case!(sd)
+
+julia> sd
+StringDocument{String}("lear is mad", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> remove_words!(sd, ["lear"])
+
+julia> sd
+StringDocument{String}(" is mad", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
 
 At other times, you'll want to remove whole classes of words. To make this
 easier, we can use several classes of basic words defined by the Languages.jl
 package:
 
@@ -226,4 +308,11 @@ closely related like "dog" and "dogs" and stem them in order to produce a
 smaller set of words for analysis. We can do this using the `stem!()`
 function:
 
-    stem!(sd)
+```julia
+julia> sd = StringDocument("Foo writes and foo bar write")
+StringDocument{String}("Foo writes and foo bar write", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+
+julia> stem!(sd)
+
+julia> sd
+StringDocument{String}("Foo write and foo bar write", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
+```
diff --git a/docs/src/features.md b/docs/src/features.md
index 37c07cbc..a6813a84 100644
--- a/docs/src/features.md
+++ b/docs/src/features.md
@@ -4,18 +4,53 @@ Often we want to represent documents as a matrix of word counts so that we
 can apply linear algebra operations and statistical techniques. Before
 we do this, we need to update the lexicon:
 
-    update_lexicon!(crps)
-    m = DocumentTermMatrix(crps)
+```julia
+julia> crps = Corpus([StringDocument("To be or not to be"),
+                      StringDocument("To become or not to become")])
+
+julia> update_lexicon!(crps)
+
+julia> m = DocumentTermMatrix(crps)
+DocumentTermMatrix(
+  [1, 1]  =  1
+  [2, 1]  =  1
+  [1, 2]  =  2
+  [2, 3]  =  2
+  [1, 4]  =  1
+  [2, 4]  =  1
+  [1, 5]  =  1
+  [2, 5]  =  1
+  [1, 6]  =  1
+  [2, 6]  =  1, ["To", "be", "become", "not", "or", "to"], Dict("or"=>5,"not"=>4,"to"=>6,"To"=>1,"be"=>2,"become"=>3))
+```
 
 A `DocumentTermMatrix` object is a special type.
If you would like to use a simple sparse matrix, call `dtm()` on this object:
 
-    dtm(m)
+```julia
+julia> dtm(m)
+2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
+  [1, 1]  =  1
+  [2, 1]  =  1
+  [1, 2]  =  2
+  [2, 3]  =  2
+  [1, 4]  =  1
+  [2, 4]  =  1
+  [1, 5]  =  1
+  [2, 5]  =  1
+  [1, 6]  =  1
+  [2, 6]  =  1
+```
 
 If you would like to use a dense matrix instead, you can pass this as
 an argument to the `dtm` function:
 
-    dtm(m, :dense)
+```julia
+julia> dtm(m, :dense)
+2×6 Array{Int64,2}:
+ 1  2  0  1  1  1
+ 1  0  2  1  1  1
+```
 
 ## Creating Individual Rows of a Document Term Matrix
 
@@ -24,7 +59,11 @@ make do with just a single row. You can get this using the `dtv` function.
-Because individual's document do not have a lexicon associated with them, we
-have to pass in a lexicon as an additional argument:
+Because individual documents do not have a lexicon associated with them, we
+have to pass in a lexicon as an additional argument:
 
-    dtv(crps[1], lexicon(crps))
+```julia
+julia> dtv(crps[1], lexicon(crps))
+1×6 Array{Int64,2}:
+ 1  2  0  1  1  1
+```
 
 ## The Hash Trick
 
@@ -33,35 +72,61 @@ The need to create a lexicon before we can construct a document term matrix is o
 function that outputs integers from 1 to N. To construct such a hash
 function, you can use the `TextHashFunction(N)` constructor:
 
-    h = TextHashFunction(10)
+```julia
+julia> h = TextHashFunction(10)
+TextHashFunction(hash, 10)
+```
 
 You can see how this function maps strings to numbers by calling the
 `index_hash` function:
 
-    index_hash("a", h)
-    index_hash("b", h)
+```julia
+julia> index_hash("a", h)
+8
+
+julia> index_hash("b", h)
+7
+```
 
 Using a text hash function, we can represent a document as a vector with N
 entries by calling the `hash_dtv` function:
 
-    hash_dtv(crps[1], h)
+```julia
+julia> hash_dtv(crps[1], h)
+1×10 Array{Int64,2}:
+ 0  2  0  0  1  3  0  0  0  0
+```
 
 This can be done for a corpus as a whole to construct a DTM without
 defining a lexicon in advance:
 
-    hash_dtm(crps, h)
+```julia
+julia> hash_dtm(crps, h)
+2×10 Array{Int64,2}:
+ 0  2  0  0  1  3  0  0  0  0
+ 0  2  0  0  1  1  0  0  2  0
+```
 
 Every corpus has a hash function built-in, so this function can be called
 using just one argument:
 
-    hash_dtm(crps)
+```julia
+julia> hash_dtm(crps)
+2×100 Array{Int64,2}:
+ 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
+ 0  0  0  0  0  0  0  0  2  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
+```
 
 Moreover, if you do not specify a hash function for just one row of the hash
 DTM, a default hash function will be constructed for you:
 
-    hash_dtv(crps[1])
+```julia
+julia> hash_dtv(crps[1])
+1×100 Array{Int64,2}:
+ 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
+```
 
-## TF-IDF
+## TF-IDF (Term Frequency - Inverse Document Frequency)
 
 In many cases, raw word counts are not appropriate for use because:
 
diff --git a/docs/src/semantic.md b/docs/src/semantic.md
index cfd47c43..f9dc957a 100644
--- a/docs/src/semantic.md
+++ b/docs/src/semantic.md
@@ -11,11 +11,35 @@ Analysis or LSA on the corpus. You can do this using the `lsa` function:
 
 Another way to get a handle on the semantic content of a corpus is to use
 [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation):
 
-    m = DocumentTermMatrix(crps)
-    k = 2 # number of topics
-    iteration = 1000 # number of gibbs sampling iterations
-    alpha = 0.1 # hyper parameter
-    beta = 0.1 # hyber parameter
-    ϕ, θ = lda(m, k, iteration, alpha, beta) # ϕ is k x word matrix.
-    # value is probablity of occurrence of a word in a topic.
+First we need to produce the `DocumentTermMatrix`:
+
+```julia
+julia> crps = Corpus([StringDocument("This is the Foo Bar Document"),
+                      StringDocument("This document has too Foo words")])
+julia> update_lexicon!(crps)
+julia> m = DocumentTermMatrix(crps)
+```
+
+Latent Dirichlet Allocation has two hyperparameters:
+
+* _α_ : The hyperparameter for topic distribution per document. `α<1` yields a sparse topic mixture for each document. `α>1` yields a more uniform topic mixture for each document.
+* _β_ : The hyperparameter for word distribution per topic. `β<1` yields a sparse word mixture for each topic. `β>1` yields a more uniform word mixture for each topic.
+
+```julia
+julia> k = 2             # number of topics
+julia> iterations = 1000 # number of Gibbs sampling iterations
+
+julia> α = 0.1 # hyperparameter
+julia> β = 0.1 # hyperparameter
+
+julia> ϕ, θ = lda(m, k, iterations, α, β)
+(
+  [2 ,  1]  =  0.333333
+  [2 ,  2]  =  0.333333
+  [1 ,  3]  =  0.222222
+  [1 ,  4]  =  0.222222
+  [1 ,  5]  =  0.111111
+  [1 ,  6]  =  0.111111
+  [1 ,  7]  =  0.111111
+  [2 ,  8]  =  0.333333
+  [1 ,  9]  =  0.111111
+  [1 ,  10] =  0.111111, [0.5 1.0; 0.5 0.0])
+```
 
 See `?lda` for more help.
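+
+In the returned tuple, ϕ is the k x word matrix: each row is a topic, and
+each value is the probability of that word occurring in the topic, while θ
+holds the per-document topic distributions. As a minimal sketch of how you
+might list the most probable words per topic (this loop is illustrative, not
+part of the `lda` API, and it assumes the columns of ϕ follow the same term
+ordering as `m.terms`):
+
+```julia
+julia> for topic in 1:k
+           weights = Vector(ϕ[topic, :])            # densify this topic's word probabilities
+           top = sortperm(weights, rev = true)[1:3] # indices of the three most probable words
+           println("Topic $topic: ", join(m.terms[top], ", "))
+       end
+```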