# documentation update #277

Merged 1 commit on Oct 27, 2023.
README.md (2 changes: 1 addition & 1 deletion)

@@ -43,5 +43,5 @@ Contributions, in the form of bug-reports, pull requests, additional documentation…

## Support

- Feel free to ask for help on the [Julia Discourse forum](https://discourse.julialang.org/), or in the `#natural-language` channel on [julia-slack](https://julialang.slack.com). (Which you can [join here](https://slackinvite.julialang.org/)). You can also raise issues in this repository to request new features and/or improvements to the documentation and codebase.
+ Feel free to ask for help on the [Julia Discourse forum](https://discourse.julialang.org/), or in the `#natural-language` channel on [julia-slack](https://julialang.slack.com) (which you can [join here](https://julialang.org/slack/)). Or [select whichever channel you like here](https://julialang.org/community/). You can also raise issues in this repository to request new features and/or improvements to the documentation and codebase.

docs/make.jl (3 changes: 3 additions & 0 deletions)

@@ -19,3 +19,6 @@ makedocs(
],
)

+ deploydocs(;
+     repo="github.com/JuliaText/TextAnalysis.jl",
+ )
docs/src/LM.md (94 changes: 27 additions & 67 deletions)

@@ -84,8 +84,8 @@ julia> masked_score = maskedscore(model,fit,"is","alien")

used to evaluate the probability of a word given its context (*P(word | context)*)

- ```julia
- score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
- ```
+ ```@docs
+ score
+ ```

Arguments:
@@ -100,91 +100,51 @@ Arguments:

- In the Interpolated language model, `Kneserney` and `WittenBell` smoothing are provided
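
A minimal usage sketch, assuming the `MLE` model and the `model(train, 2, 2)` bigram fit shown in the example earlier in this file:

```julia
using TextAnalysis

# vocabulary and training tokens, following the MLE example earlier in LM.md
voc   = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]
train = ["khan","is","my","good","friend","and","He","is","my","brother"]

model = MLE(voc)           # maximum-likelihood-estimation language model
fit   = model(train, 2, 2) # fit counts for ngrams of order 2 to 2 (bigrams only)

score(model, fit, "is", "<unk>") # P("is" | "<unk>")
```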

### `maskedscore`
+ ```@docs
+ maskedscore
+ ```

- It is used to evaluate the *score* with out-of-vocabulary words masked out

- The arguments are the same as for `score`
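
For instance, reusing `model` and `fit` from the sketch above, an out-of-vocabulary context word is masked to the unknown token before scoring:

```julia
# "alien" is not in the vocabulary, so it is treated as "<unk>";
# the result should match score(model, fit, "is", "<unk>")
maskedscore(model, fit, "is", "alien")
```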

- ### `logscore`
- 
- - Evaluate the log score of this word in this context.
+ ### `logscore`
+ ```@docs
+ logscore
+ ```

- The arguments are the same as for `score` and `maskedscore`
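
A one-line sketch with the same `model` and `fit` as above:

```julia
# log of the masked score for the same word/context pair
logscore(model, fit, "is", "alien")
```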

### `entropy`

- ```julia
- entropy(m::Langmodel,lm::DefaultDict,text_ngram::word::Vector{T}) where { T <: AbstractString}
- ```
+ ```@docs
+ entropy
+ ```

- Calculate the *cross-entropy* of the model for the given evaluation text.

- The input text must be an Array of ngrams of the same length
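
A sketch, again reusing `model` and `fit` from the `score` example; the evaluation text here is a hypothetical array of bigrams (space-joined strings, all of the same order):

```julia
# hypothetical evaluation set: bigrams of equal length
test_bigrams = ["khan is", "is my", "my good"]
entropy(model, fit, test_bigrams)
```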

- ### `perplexity`
- 
- - Calculates the perplexity of the given text.
- 
- - This is simply 2 ** cross-entropy (`entropy`) for the text, so the arguments are the same as `entropy`.
+ ### `perplexity`
+ ```@docs
+ perplexity
+ ```
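
Since perplexity is 2 raised to the cross-entropy, the arguments are the same as for `entropy`; continuing the sketch above:

```julia
# perplexity == 2^entropy for the same hypothetical evaluation set
perplexity(model, fit, test_bigrams)
```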

## Preprocessing

For preprocessing, the following functions are provided:

- 1. `everygram`: Return all possible ngrams generated from a sequence of items, as an Array{String,1}
- 
- ```julia
- julia> seq = ["To","be","or","not"]
- julia> a = everygram(seq,min_len=1, max_len=-1)
- 10-element Array{Any,1}:
-  "or"
-  "not"
-  "To"
-  "be"
-  "or not"
-  "be or"
-  "be or not"
-  "To be or"
-  "To be or not"
- ```
+ ```@docs
+ everygram
+ padding_ngram
+ ```

- 2. `padding_ngrams`: `padding_ngram` is used to pad both left and right of a sentence, outputting ngrams of order n
- 
-  It also pads the original input Array of strings
- 
- ```julia
- julia> example = ["1","2","3","4","5"]
- julia> padding_ngrams(example,2,pad_left=true,pad_right=true)
- 6-element Array{Any,1}:
-  "<s> 1"
-  "1 2"
-  "2 3"
-  "3 4"
-  "4 5"
-  "5 </s>"
- ```
## Vocabulary

Struct to store the language model's vocabulary

It checks membership and filters items by comparing their counts to a cutoff value

- It also Adds a special "unkown" tokens which unseen words are mapped to
+ It also adds a special "unknown" token to which unseen words are mapped

- ```julia
- julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
- 
- julia> vocabulary = Vocabulary(words, 2)
- Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")
- 
- # lookup a sequence or words in the vocabulary
- julia> word = ["a", "-", "d", "c", "a"]
- 
- julia> lookup(vocabulary ,word)
- 5-element Array{Any,1}:
-  "a"
-  "<unk>"
-  "d"
-  "c"
-  "a"
- ```
+ ```@repl
+ using TextAnalysis
+ words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
+ vocabulary = Vocabulary(words, 2)
+ 
+ # look up a sequence of words in the vocabulary
+ word = ["a", "-", "d", "c", "a"]
+ 
+ lookup(vocabulary, word)
+ ```
docs/src/classify.md (47 changes: 15 additions & 32 deletions)

@@ -11,42 +11,25 @@ To load the Naive Bayes Classifier, use the following command -
It can be used in the following three steps.

1- Create an instance of the Naive Bayes Classifier model -

-     model = NaiveBayesClassifier(dict, classes)

It takes two arguments:

* `classes`: An array of possible classes that the concerned data could belong to.
* `dict` (optional): An Array of possible tokens (words). This is automatically updated if a new token is detected in Step 2) or 3)

+ ```@docs
+ NaiveBayesClassifier
+ ```

2- Fitting the model weights on input -

-     fit!(model, str, class)

+ ```@docs
+ fit!
+ ```

3- Predicting for the input case -

-     predict(model, str)

- ## Example
- 
- ```julia
- julia> m = NaiveBayesClassifier([:legal, :financial])
- NaiveBayesClassifier{Symbol}(String[], Symbol[:legal, :financial], Array{Int64}(0,2))
- ```
+ ```@docs
+ predict
+ ```

- ```julia
- julia> fit!(m, "this is financial doc", :financial)
- NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc"], Symbol[:legal, :financial], [1 2; 1 2; 1 2; 1 2])
- 
- julia> fit!(m, "this is legal doc", :legal)
- NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc", "legal"], Symbol[:legal, :financial], [1 2; 2 2; … ; 2 2; 2 1])
- ```
+ ## Example

- ```julia
- julia> predict(m, "this should be predicted as a legal document")
- Dict{Symbol,Float64} with 2 entries:
-   :legal     => 0.666667
-   :financial => 0.333333
- ```
+ ```@repl
+ using TextAnalysis
+ m = NaiveBayesClassifier([:legal, :financial])
+ fit!(m, "this is financial doc", :financial)
+ fit!(m, "this is legal doc", :legal)
+ predict(m, "this should be predicted as a legal document")
+ ```
docs/src/corpus.md (70 changes: 13 additions & 57 deletions)

@@ -3,76 +3,32 @@
Working with isolated documents gets boring quickly. We typically want to
work with a collection of documents. We represent collections of documents
using the Corpus type:

- ```julia
- julia> crps = Corpus([StringDocument("Document 1"),
-               StringDocument("Document 2")])
- A Corpus with 2 documents:
-  * 2 StringDocument's
-  * 0 FileDocument's
-  * 0 TokenDocument's
-  * 0 NGramDocument's
- 
- Corpus's lexicon contains 0 tokens
- Corpus's index contains 0 tokens
- ```
+ ```@docs
+ Corpus
+ ```

## Standardizing a Corpus

- A `Corpus` may contain many different types of documents:
- 
- ```julia
- julia> crps = Corpus([StringDocument("Document 1"),
-               TokenDocument("Document 2"),
-               NGramDocument("Document 3")])
- A Corpus with 3 documents:
-  * 1 StringDocument's
-  * 0 FileDocument's
-  * 1 TokenDocument's
-  * 1 NGramDocument's
- 
- Corpus's lexicon contains 0 tokens
- Corpus's index contains 0 tokens
- ```
- 
- It is generally more convenient to standardize all of the documents in a
+ A `Corpus` may contain many different types of documents. It is generally more convenient to standardize all of the documents in a
corpus using a single type. This can be done using the `standardize!`
function:

- ```julia
- julia> standardize!(crps, NGramDocument)
- ```
- 
- After this step, you can check that the corpus only contains `NGramDocument`'s:
- 
- ```julia
- julia> crps
- A Corpus with 3 documents:
-  * 0 StringDocument's
-  * 0 FileDocument's
-  * 0 TokenDocument's
-  * 3 NGramDocument's
- 
- Corpus's lexicon contains 0 tokens
- Corpus's index contains 0 tokens
- ```
+ ```@docs
+ standardize!
+ ```

## Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for
individual documents to an entire corpus at once:

- ```julia
- julia> crps = Corpus([StringDocument("Document ..!!"),
-               StringDocument("Document ..!!")])
- 
- julia> prepare!(crps, strip_punctuation)
- 
- julia> text(crps[1])
- "Document "
- 
- julia> text(crps[2])
- "Document "
- ```
+ ```@repl
+ using TextAnalysis
+ crps = Corpus([StringDocument("Document ..!!"),
+               StringDocument("Document ..!!")])
+ prepare!(crps, strip_punctuation)
+ text(crps[1])
+ text(crps[2])
+ ```

These operations are run on each document in the corpus individually.
@@ -109,7 +65,7 @@ Dict{String,Int64} with 3 entries:

But once this work is done, you can more easily address lots of interesting
questions about a corpus:

- ```
+ ```julia
julia> lexical_frequency(crps, "Name")
0.5

```