
Commit

documentation: refresh, adding dynamical calculation, moving from .md to juliadoc
rssdev10 committed Oct 27, 2023
1 parent 77f1abb commit 0ae19c9
Showing 13 changed files with 249 additions and 574 deletions.
3 changes: 3 additions & 0 deletions docs/make.jl
@@ -19,3 +19,6 @@ makedocs(
],
)

deploydocs(;
repo="github.com/JuliaText/TextAnalysis.jl",
)
94 changes: 27 additions & 67 deletions docs/src/LM.md
@@ -84,8 +84,8 @@ julia> masked_score = maskedscore(model,fit,"is","alien")

Used to evaluate the probability of a word given its context (*P(word | context)*).

```julia
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
```

```@docs
score
```

Arguments:
@@ -100,91 +100,51 @@ Arguments:
- In the Interpolated language model, provide `Kneserney` and `WittenBell` smoothing
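
Below is a minimal, hypothetical usage sketch of `score`. The vocabulary, training sequence, and the `MLE`/`fit` pattern follow the example shown earlier on this page; the returned probability depends entirely on the training data.

```julia
using TextAnalysis

# Hypothetical vocabulary and training sequence, as in the MLE example above
voc   = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]
train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]

model = MLE(voc)           # maximum-likelihood language model over the vocabulary
fit   = model(train, 2, 2) # bigram counts from the training sequence

score(model, fit, "is", "khan")   # P("is" | "khan"), returned as a Float64
```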

### `maskedscore`
```@docs
maskedscore
```

- It is used to evaluate the *score* with out-of-vocabulary words masked

- The arguments are the same as for `score`
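
Continuing the hypothetical sketch above, a context word that is absent from the vocabulary is masked to the `"<unk>"` token before scoring:

```julia
# "alien" is out of vocabulary, so maskedscore treats it as "<unk>"
maskedscore(model, fit, "is", "alien")
score(model, fit, "is", "<unk>")   # the equivalent unmasked call
```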

### `logscore`

```@docs
logscore
```

- Evaluates the log score of a word in a given context.

- The arguments are the same as for `score` and `maskedscore`
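
Continuing the same hypothetical sketch, `logscore` returns the logarithm of the probability that `score` would return:

```julia
logscore(model, fit, "is", "khan")   # log of score(model, fit, "is", "khan")
```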

### `entropy`

```julia
entropy(m::Langmodel, lm::DefaultDict, text_ngram::Vector{T}) where {T <: AbstractString}
```

```@docs
entropy
```

- Calculates the *cross-entropy* of the model for the given evaluation text.

- The input text must be an Array of ngrams of the same length
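
As a sketch only (reusing the hypothetical `model` and `fit` from above, and assuming ngrams are passed as space-joined strings, as produced by `padding_ngram`):

```julia
# Cross-entropy of the model over a set of bigrams (all ngrams must have the same order)
eval_bigrams = ["khan is", "is my"]
entropy(model, fit, eval_bigrams)
```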

### `perplexity`

```@docs
perplexity
```

- Calculates the perplexity of the given text.

- This is simply 2^cross-entropy (see `entropy`) for the text, so the arguments are the same as for `entropy`.
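
Continuing that sketch, perplexity is two raised to the cross-entropy of the same evaluation ngrams:

```julia
perplexity(model, fit, eval_bigrams)   # equals 2^entropy(model, fit, eval_bigrams)
```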

## Preprocessing

The following functions are used for preprocessing:

1. `everygram`: Returns all possible ngrams generated from a sequence of items, as an `Array{String,1}`

```julia
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Array{Any,1}:
 "or"
 "not"
 "To"
 "be"
 "or not"
 "be or"
 "To be"
 "be or not"
 "To be or"
 "To be or not"
```

```@docs
everygram
padding_ngram
```

2. `padding_ngram`: used to pad both the left and the right of a sentence, outputting ngrams of order n

It also pads the original input Array of strings

```julia
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Array{Any,1}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"
```
## Vocabulary

Struct to store the language model's vocabulary.

It checks membership and filters items by comparing their counts to a cutoff value.

It also adds a special "unknown" token to which unseen words are mapped.

```julia
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]

julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")

# lookup a sequence of words in the vocabulary
julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary, word)
5-element Array{Any,1}:
 "a"
 "<unk>"
 "d"
 "c"
 "a"
```

```@repl
using TextAnalysis
words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
vocabulary = Vocabulary(words, 2)
# lookup a sequence of words in the vocabulary
word = ["a", "-", "d", "c", "a"]
lookup(vocabulary, word)
```
47 changes: 15 additions & 32 deletions docs/src/classify.md
@@ -11,42 +11,25 @@ To load the Naive Bayes Classifier, use the following command -
It is used in the following 3 steps.

1- Create an instance of the Naive Bayes Classifier model -

model = NaiveBayesClassifier(dict, classes)


It takes two arguments-

* `classes`: An array of possible classes that the concerned data could belong to.
* `dict`: (optional) An Array of possible tokens (words). It is automatically updated if a new token is detected in Step 2) or 3).

```@docs
NaiveBayesClassifier
```
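
As a hedged sketch (the class names and the seed token list below are made up for illustration), the classifier can be constructed with or without an initial dictionary:

```julia
using TextAnalysis

# Classes only: the token dictionary starts empty and grows as fit!/predict see new tokens
m1 = NaiveBayesClassifier([:legal, :financial])

# Optional initial dictionary of tokens plus the classes
m2 = NaiveBayesClassifier(["doc", "contract", "payment"], [:legal, :financial])
```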

2- Fitting the model weights on input -

fit!(model, str, class)

```@docs
fit!
```
3- Predicting for the input case -

predict(model, str)

```@docs
predict
```

## Example

```julia
julia> m = NaiveBayesClassifier([:legal, :financial])
NaiveBayesClassifier{Symbol}(String[], Symbol[:legal, :financial], Array{Int64}(0,2))
```

```julia
julia> fit!(m, "this is financial doc", :financial)
NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc"], Symbol[:legal, :financial], [1 2; 1 2; 1 2; 1 2])

julia> fit!(m, "this is legal doc", :legal)
NaiveBayesClassifier{Symbol}(["financial", "this", "is", "doc", "legal"], Symbol[:legal, :financial], [1 2; 2 2; 2 2; 2 2; 2 1])
```

```julia
julia> predict(m, "this should be predicted as a legal document")
Dict{Symbol,Float64} with 2 entries:
  :legal     => 0.666667
  :financial => 0.333333
```

```@repl
using TextAnalysis
m = NaiveBayesClassifier([:legal, :financial])
fit!(m, "this is financial doc", :financial)
fit!(m, "this is legal doc", :legal)
predict(m, "this should be predicted as a legal document")
```
70 changes: 13 additions & 57 deletions docs/src/corpus.md
@@ -3,76 +3,32 @@
Working with isolated documents gets boring quickly. We typically want to
work with a collection of documents. We represent collections of documents
using the Corpus type:

```julia
julia> crps = Corpus([StringDocument("Document 1"),
                      StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
```

```@docs
Corpus
```

## Standardizing a Corpus

A `Corpus` may contain many different types of documents:

```julia
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
```

It is generally more convenient to standardize all of the documents in a
corpus using a single type. This can be done using the `standardize!`
function:

```julia
julia> standardize!(crps, NGramDocument)
```

After this step, you can check that the corpus only contains `NGramDocument`'s:

```julia
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
```

```@docs
standardize!
```

## Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for
individual documents to an entire corpus at once:

```julia
julia> crps = Corpus([StringDocument("Document ..!!"),
                      StringDocument("Document ..!!")])

julia> prepare!(crps, strip_punctuation)

julia> text(crps[1])
"Document "

julia> text(crps[2])
"Document "
```

```@repl
using TextAnalysis
crps = Corpus([StringDocument("Document ..!!"),
               StringDocument("Document ..!!")])
prepare!(crps, strip_punctuation)
text(crps[1])
text(crps[2])
```

These operations are run on each document in the corpus individually.
@@ -109,7 +65,7 @@ Dict{String,Int64} with 3 entries:

But once this work is done, you can easily address lots of interesting
questions about a corpus:
```julia
julia> lexical_frequency(crps, "Name")
0.5

