Language Model Interface #210

Merged
merged 51 commits into from
Sep 2, 2020
tejasvaidhyadev
Member

@tejasvaidhyadev tejasvaidhyadev commented Apr 27, 2020

Language model interface inspired by NLTK's nltk.lm package.

Roadmap

  • vocabulary - implemented; contains the vocabulary structure

  • Counter - contains language model counter

  • Structure and methods for counting n-grams; it will count any n-gram sequence you give it

  • models - contains Interpolated Language model

  • KneserNeyInterpolated ✔️

  • Laplace ✔️

  • Lidstone ✔️

  • MLE ✔️

  • WittenBellInterpolated ✔️

  • preprocessing - contains all functions needed for preprocessing

  • smoothing - Smoothing algorithms for language modeling.

  • utility

  • documentation and tests
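
To make the Counter item in the roadmap concrete, here is a minimal sketch of n-gram counting in plain Julia. The function name `count_ngrams` and the use of a `Dict` keyed by token vectors are assumptions for illustration only; they are not the API added in this PR.

```julia
# Hypothetical helper (not the PR's API): count every n-gram in a token vector.
function count_ngrams(tokens::AbstractVector{<:AbstractString}, n::Integer)
    counts = Dict{Vector{String}, Int}()
    # Slide a window of width n over the tokens; the range is empty if n > length(tokens).
    for i in 1:(length(tokens) - n + 1)
        gram = String.(tokens[i:(i + n - 1)])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

tokens = ["a", "b", "a", "b", "c"]
bigrams = count_ngrams(tokens, 2)
```

With this toy input, the bigram `["a", "b"]` is seen twice and three distinct bigrams are stored.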

it will also include

  • Cross entropy of model for given eval text
  • logscore - evaluates the log score of a word in a given context
  • Perplexity - perplexity of the given text
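
The three evaluation quantities listed above are closely related, and a small sketch may help. It follows the definitions used in nltk.lm (log score is the base-2 log of the probability, cross-entropy is the negative mean log score, perplexity is 2 raised to the cross-entropy). The probability table and all function names below are made up for illustration and are not this PR's API.

```julia
# Toy conditional probabilities P(word | context); the values are invented.
probs = Dict(("a", "b") => 0.5, ("b", "a") => 0.25)

# Hypothetical stand-in for a model's score function; unseen pairs get probability 0.
score(probs, word, context) = get(probs, (context, word), 0.0)

# Base-2 log of the probability of `word` given `context`.
logscore(probs, word, context) = log2(score(probs, word, context))

# Cross-entropy: negative mean log score over (context, word) pairs.
function cross_entropy(probs, bigrams)
    return -sum(logscore(probs, w, c) for (c, w) in bigrams) / length(bigrams)
end

# Perplexity is 2 raised to the cross-entropy.
perplexity(probs, bigrams) = 2.0 ^ cross_entropy(probs, bigrams)

eval_bigrams = [("a", "b"), ("b", "a")]
h = cross_entropy(probs, eval_bigrams)
ppl = perplexity(probs, eval_bigrams)
```

Here the log scores are -1 and -2, so the cross-entropy is 1.5 and the perplexity is 2^1.5. An unsmoothed model assigns probability 0 to unseen pairs, which makes the perplexity infinite; that is one motivation for the smoothing models in the roadmap.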

@tejasvaidhyadev
Member Author

We now have a working MLE (base n-gram) model.

Member

@Ayushk4 Ayushk4 left a comment

Some small code-style changes are needed throughout the code. It would be great if you could go through the following suggestions and fix all instances of them in this PR.
Examples: docstrings, indentation, spaces before and after +, -, /, *, = etc. (except around '=' in the arguments of function calls), and overly specified types.

The code style looks good in general, especially the use of type hierarchies for Langmodel. Once you are done with the entire PR and the following suggestions, I will do a more thorough review.
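
As a rough illustration of the style points above (one reading of the reviewer's list, not an official style guide), the sketch below shows a docstring, spaces around binary operators, no spaces around '=' in call arguments, and abstract rather than overly specific argument types. The function `pad_left` is hypothetical and does not come from this PR.

```julia
"""
    pad_left(tokens, n; sym="<s>")

Prepend `n - 1` copies of `sym` to `tokens`.
"""
function pad_left(tokens::AbstractVector{<:AbstractString}, n::Integer; sym::AbstractString="<s>")
    # Abstract types above accept any string vector and any integer,
    # instead of pinning the signature to Vector{String} and Int64.
    npad = n - 1                                  # spaces around binary operators
    return vcat(fill(String(sym), npad), tokens)
end

padded = pad_left(["a", "b"], 3; sym="<s>")       # no spaces around '=' in call arguments
```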

src/LM/counter.jl (outdated, resolved)
src/LM/counter.jl (resolved)
src/LM/langmodel.jl (outdated, resolved)
src/LM/preprocessing.jl (outdated, resolved)
src/LM/counter.jl (outdated, resolved)
@Ayushk4
Member

Ayushk4 commented May 27, 2020

Once you are finished with the PR, try to incorporate above suggestions. Let me know once you are done.

@tejasvaidhyadev
Member Author

Thanks, @Ayushk4
I will incorporate the changes asap.

@Ayushk4
Member

Ayushk4 commented Jun 5, 2020

Hi, Tejas. Thanks for incorporating the changes.
Please let me know once this PR is ready for review.

src/LM/preprocessing.jl (outdated, resolved)
@Ayushk4
Member

Ayushk4 commented Jun 7, 2020

Travis seems to fail on 1.3: ERROR: Ambiguous dependency on StatsBase. Can you look at this once?

@tejasvaidhyadev
Member Author

tejasvaidhyadev commented Jun 7, 2020

Travis seems to fail on 1.3: ERROR: Ambiguous dependency on StatsBase. Can you look at this once?

updated

src/LM/counter.jl (outdated, resolved)
Member

@Ayushk4 Ayushk4 left a comment

Looks good to me overall.

@Ayushk4
Member

Ayushk4 commented Jun 8, 2020

@aviks Can you have a look and suggest changes? Maybe then we can have this merged?

@tejasvaidhyadev
Member Author

tejasvaidhyadev commented Jun 18, 2020

@aviks
Can we merge it?

@aviks
Member

aviks commented Jun 18, 2020

Taking a look now

@tejasvaidhyadev tejasvaidhyadev changed the title from "Language Model Interface [WIP]" to "Language Model Interface" Aug 20, 2020
@aviks aviks merged commit cd44039 into JuliaText:master Sep 2, 2020
Implementation of Base Ngram Model.

"""
function MLE(word, unk_cutoff=1, unk_label="<unk>")
Member

word needs a type tag here. Since the other parameters of this function have default values, it can reduce to a one-argument form, which is then ambiguous with the struct definition. Is word meant to be a vector? The docstring suggests so, and the function definition should be amended accordingly.

Member Author

Yes, word is a Vector here.
I will make a small PR that adds type tags to all the word arguments.
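
The concern in this thread can be sketched with stand-in types. `ToyVocab` and `ToyMLE` below are hypothetical and only mimic the shape of a one-field model struct; the PR's real definitions may differ. Julia generates default constructors for a struct, including one that accepts an argument of any type, so an untyped one-argument `ToyMLE(word)` would share a signature with that default. Tagging `word` as a vector keeps the two apart.

```julia
# Hypothetical stand-ins; not the types defined in this PR.
struct ToyVocab
    words::Vector{String}
    unk_cutoff::Int
    unk_label::String
end

struct ToyMLE
    vocab::ToyVocab
end

# Typed outer constructor: because `word` must be a vector of strings, the
# one-argument method produced by the default values is distinct from the
# struct's own one-argument constructors.
function ToyMLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
    return ToyMLE(ToyVocab(word, unk_cutoff, unk_label))
end

model = ToyMLE(["a", "b", "a"])
```

The same pattern would apply to the other constructors that take word as their first positional argument.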

In addition to initialization arguments from BaseNgramModel also requires
a number by which to increase the counts, gamma.
"""
function Lidstone(word, gamma = 1.0, unk_cutoff=1, unk_label="<unk>")
Member

same...

gamma::Float64
end

function Laplace(word, unk_cutoff=1, unk_label="<unk>")
Member

same ...
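
For context on the gamma parameter in the Lidstone docstring above: additive (Lidstone) smoothing adds gamma to every count, and Laplace smoothing is the special case gamma = 1. The sketch below works on raw counts; the function names and argument layout are assumptions for illustration, not the constructors reviewed here.

```julia
# Add-gamma (Lidstone) estimate of P(word | context) from raw counts.
# count_word: times `word` followed the context; count_context: times the context was seen.
function lidstone_score(count_word::Integer, count_context::Integer,
                        vocab_size::Integer, gamma::Real)
    return (count_word + gamma) / (count_context + gamma * vocab_size)
end

# Laplace smoothing is Lidstone smoothing with gamma fixed to 1.
laplace_score(count_word, count_context, vocab_size) =
    lidstone_score(count_word, count_context, vocab_size, 1.0)

p_seen = lidstone_score(2, 4, 3, 0.5)   # (2 + 0.5) / (4 + 0.5 * 3)
p_unseen = laplace_score(0, 4, 3)       # (0 + 1) / (4 + 1 * 3)
```

An unseen word still gets a nonzero probability, which keeps log scores finite.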

3 participants