Major Documentation Revamp #134

Merged: 9 commits, Mar 24, 2019
150 changes: 128 additions & 22 deletions docs/src/corpus.md
@@ -4,35 +4,51 @@ Working with isolated documents gets boring quickly. We typically want to
work with a collection of documents. We represent collections of documents
using the Corpus type:

crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
```julia
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
Corpus{StringDocument{String}}(StringDocument{String}[StringDocument{String}("Document 1", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), StringDocument{String}("Document 2", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100))
```

## Standardizing a Corpus

A `Corpus` may contain many different types of documents:

crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])

```julia
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
Corpus{AbstractDocument}(AbstractDocument[StringDocument{String}("Document 1", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), TokenDocument{String}(["Document", "2"], DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("Document"=>1,"3"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100))
```
It is generally more convenient to standardize all of the documents in a
corpus using a single type. This can be done using the `standardize!`
function:

standardize!(crps, NGramDocument)
```julia
julia> standardize!(crps, NGramDocument)
```

After this step, you can check that the corpus contains only `NGramDocument`s:

crps
```julia
julia> crps
Corpus{AbstractDocument}(AbstractDocument[NGramDocument{String}(Dict("1"=>1,"Document"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("2"=>1,"Document"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), NGramDocument{String}(Dict("Document"=>1,"3"=>1), 1, DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100))
```
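
To make the `Dict("1"=>1,"Document"=>1)` values in the output above more concrete, here is a minimal sketch of counting unigrams in a string with plain Julia. It only illustrates the idea; `unigram_counts` is a hypothetical helper, not part of the package, and whitespace splitting is an assumption about tokenization.

```julia
# Illustrative sketch only; `unigram_counts` is a hypothetical helper,
# not a function provided by the package.
function unigram_counts(text::AbstractString)
    counts = Dict{String,Int}()
    # Assumption: tokens are separated by whitespace.
    for tok in split(text)
        key = String(tok)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

counts = unigram_counts("Document 1")
```

Each document in the standardized corpus carries a token-to-count dictionary of this shape.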

## Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for
individual documents to an entire corpus at once:

crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
remove_punctuation!(crps)
```julia
julia> crps = Corpus([StringDocument("Document ..!!"),
StringDocument("Document ..!!")])

julia> prepare!(crps, strip_punctuation)

julia> crps
Corpus{StringDocument{String}}(StringDocument{String}[StringDocument{String}("Document ", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time")), StringDocument{String}("Document ", DocumentMetadata(English(), "Untitled Document", "Unknown Author", "Unknown Time"))], 0, Dict{String,Int64}(), Dict{String,Array{Int64,1}}(), TextHashFunction(hash, 100))
```

These operations are run on each document in the corpus individually.
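
As a rough illustration of what "run on each document individually" means, the sketch below maps a punctuation-stripping function over a plain vector of strings. `strip_punct` is a hypothetical stand-in, and the regular expression is an assumption rather than the rule the package applies.

```julia
# Illustrative sketch only; `strip_punct` is a hypothetical stand-in
# for a per-document preprocessing step.
# Assumption: "punctuation" means the POSIX [:punct:] character class.
strip_punct(text::AbstractString) = replace(text, r"[[:punct:]]" => "")

docs = ["Document ..!!", "Document ..!!"]

# Apply the same step to every document, one at a time.
cleaned = map(strip_punct, docs)
```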

@@ -47,34 +63,71 @@ In particular, we want to work with two constructs:
Because computations involving the lexicon can take a long time, a
`Corpus`'s default lexicon is blank:

lexicon(crps)
```julia
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
```

In order to work with the lexicon, you have to update it and then access it:

update_lexicon!(crps)
lexicon(crps)
```julia
julia> update_lexicon!(crps)

julia> lexicon(crps)
Dict{String,Int64} with 3 entries:
"Bar" => 1
"Foo" => 1
"Name" => 2
```

Once this work is done, you can more easily answer many interesting
questions about a corpus:
```julia
julia> lexical_frequency(crps, "Name")
0.5

lexical_frequency(crps, "Summer")
lexical_frequency(crps, "Document")
julia> lexical_frequency(crps, "Foo")
0.25
```
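
The values `0.5` and `0.25` above are consistent with dividing a term's count by the total number of tokens in the corpus. The following sketch reproduces that arithmetic in plain Julia. `build_lexicon` and `relative_frequency` are hypothetical names, and whitespace tokenization is an assumption.

```julia
# Illustrative sketch only; these helpers are hypothetical and are not
# the package's implementation.

# Count how often each whitespace-separated token occurs across all documents.
function build_lexicon(docs::Vector{String})
    lex = Dict{String,Int}()
    for doc in docs, tok in split(doc)
        key = String(tok)
        lex[key] = get(lex, key, 0) + 1
    end
    return lex
end

# Relative frequency of a term: its count divided by the total token count.
function relative_frequency(lex::Dict{String,Int}, term::AbstractString)
    total = sum(values(lex))
    return total == 0 ? 0.0 : get(lex, term, 0) / total
end

lex = build_lexicon(["Name Foo", "Name Bar"])
freq_name = relative_frequency(lex, "Name")  # 2 of 4 tokens
freq_foo = relative_frequency(lex, "Foo")    # 1 of 4 tokens
```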

Like the lexicon, the inverse index for a corpus is blank by default:

inverse_index(crps)
```julia
julia> inverse_index(crps)
Dict{String,Array{Int64,1}} with 0 entries
```

Again, you need to update it before you can work with it:

update_inverse_index!(crps)
inverse_index(crps)
```julia
julia> update_inverse_index!(crps)

julia> inverse_index(crps)
Dict{String,Array{Int64,1}} with 3 entries:
"Bar" => [2]
"Foo" => [1]
"Name" => [1, 2]
```

But once you've updated the inverse index, you can easily search the entire
corpus:

crps["Document"]
crps["1"]
crps["Summer"]
```julia
julia> crps["Name"]

2-element Array{Int64,1}:
1
2

julia> crps["Foo"]
1-element Array{Int64,1}:
1

julia> crps["Summer"]
0-element Array{Int64,1}
```
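
Conceptually, an inverse index maps each term to the positions of the documents that contain it, which is why a lookup for an unknown term returns an empty array. A minimal plain-Julia sketch under that assumption follows. `build_inverse_index` and `lookup` are hypothetical names, not package functions.

```julia
# Illustrative sketch only; these helpers are hypothetical and are not
# the package's implementation.

# Map each whitespace-separated token to the indices of the documents containing it.
function build_inverse_index(docs::Vector{String})
    idx = Dict{String,Vector{Int}}()
    for (i, doc) in enumerate(docs)
        # `unique` ensures a document index is recorded once per term.
        for tok in unique(split(doc))
            push!(get!(idx, String(tok), Int[]), i)
        end
    end
    return idx
end

# Terms that never occur yield an empty result instead of an error.
lookup(idx::Dict{String,Vector{Int}}, term::AbstractString) = get(idx, term, Int[])

idx = build_inverse_index(["Name Foo", "Name Bar"])
hits_name = lookup(idx, "Name")
hits_summer = lookup(idx, "Summer")
```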

## Converting a Corpus to a DataFrame

@@ -83,3 +136,56 @@ corpus. The easiest way to do this is to convert a `Corpus` object into
a `DataFrame`:

convert(DataFrame, crps)

## Corpus Metadata

You can also retrieve the metadata for every document in a `Corpus` at once:

* `languages()`: What language is the document in? Defaults to `Languages.English()`, a Language instance defined by the Languages package.
* `titles()`: What is the title of the document? Defaults to `"Untitled Document"`.
* `authors()`: Who wrote the document? Defaults to `"Unknown Author"`.
* `timestamps()`: When was the document written? Defaults to `"Unknown Time"`.

```julia
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])

julia> languages(crps)
2-element Array{Languages.English,1}:
Languages.English()
Languages.English()

julia> titles(crps)
2-element Array{String,1}:
"Untitled Document"
"Untitled Document"

julia> authors(crps)
2-element Array{String,1}:
"Unknown Author"
"Unknown Author"

julia> timestamps(crps)
2-element Array{String,1}:
"Unknown Time"
"Unknown Time"
```

It is possible to change the metadata fields for each document in a `Corpus`.
These functions use the same metadata value for every document:

```julia
julia> languages!(crps, Languages.German())
julia> titles!(crps, "")
julia> authors!(crps, "Me")
julia> timestamps!(crps, "Now")
```
Additionally, you can specify the metadata fields for each document in
a `Corpus` individually:

```julia
julia> languages!(crps, [Languages.German(), Languages.English()])
julia> titles!(crps, ["", "Untitled"])
julia> authors!(crps, ["Ich", "You"])
julia> timestamps!(crps, ["Unbekannt", "2018"])
```