This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Baseline model notebook and embeddings trainer notebook #47

Merged: 51 commits into staging on May 11, 2019

Conversation

cocochrane
Contributor

No description provided.

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/nlp/pull/47

Visit www.reviewnb.com to learn how we simplify your Jupyter Notebook workflows.

@cocochrane
Contributor Author

@AbhiramE @caseyhong @irshaffe @catherine667 Opened this PR against the staging branch. You've all already reviewed this work, but tagging you in case you want to make any more edits!

@saidbleik
Collaborator

We might want to keep the utils in preprocess.py general-purpose.
For example:

  • pass the column names that need to be lower-cased into to_lowercase()
  • to_nltk_tokens() and rm_nltk_stopwords() accept an arbitrary number of columns, but only two are actually operated on (a sketch of the generalized version follows below)
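
To make the first point concrete, here is a minimal sketch of a column-aware to_lowercase(), assuming pandas DataFrames; the signature is illustrative, not the current preprocess.py API:

```python
import pandas as pd

def to_lowercase(df: pd.DataFrame, column_names) -> pd.DataFrame:
    """Lower-case only the named columns, leaving all others untouched.

    Illustrative sketch; the actual preprocess.py signature may differ.
    """
    df = df.copy()
    for col in column_names:
        df[col] = df[col].str.lower()
    return df
```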

@cocochrane
Contributor Author

> We might want to keep the utils in preprocess.py general-purpose. For example:
>
> • pass the column names that need to be lower-cased into to_lowercase()
> • to_nltk_tokens() and rm_nltk_stopwords() accept an arbitrary number of columns, but only two are actually operated on

@caseyhong @AbhiramE @janhavi13 Just keeping you updated on the comments that correspond to changes to the utils

@saidbleik
Collaborator

> @caseyhong @AbhiramE @janhavi13 Just keeping you updated on the comments that correspond to changes to the utils

We can also have two versions, like to_lowercase_all(df) and to_lowercase(df, col_names).
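
A hedged sketch of how the two variants could relate, with to_lowercase_all() as a thin wrapper over the column-aware version (the names come from the comment above; the implementation details are assumptions):

```python
def to_lowercase_all(df):
    """Lower-case every string-typed column by delegating to to_lowercase()."""
    string_cols = df.select_dtypes(include="object").columns
    return to_lowercase(df, string_cols)
```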

@jisooghd
Contributor

jisooghd commented May 8, 2019

@saidbleik makes sense for the lowercase. For your second point, do you mean we should explicitly enforce the 2-column limit, since that is what is actually happening under the hood? So

`to_spacy_tokens(df, sent_col_1, sent_col_2, tok_col_1, tok_col_2)`
instead of
`to_spacy_tokens(df, sentence_cols=["sentence1", "sentence2"], token_cols=["sentence1_tokens", "sentence2_tokens"])`?

@saidbleik
Collaborator

> @saidbleik makes sense for the lowercase. For your second point, do you mean we should explicitly enforce the 2-column limit, since that is what is actually happening under the hood? So
> `to_spacy_tokens(df, sent_col_1, sent_col_2, tok_col_1, tok_col_2)`
> instead of
> `to_spacy_tokens(df, sentence_cols=["sentence1", "sentence2"], token_cols=["sentence1_tokens", "sentence2_tokens"])`?

No, I meant the current implementation only applies to token_cols[0] and token_cols[1]. It should allow an arbitrary number of columns.
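
In other words, the function would pair each sentence column with its output token column and loop over the pairs, for however many columns are passed. A rough sketch, assuming the en_core_web_sm spaCy model (the exact model and signature in the repo may differ):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the repo may use another

def to_spacy_tokens(df, sentence_cols, token_cols):
    """Tokenize each sentence column into its paired token column.

    Handles any number of column pairs, not just the first two.
    """
    if len(sentence_cols) != len(token_cols):
        raise ValueError("sentence_cols and token_cols must have equal length")
    df = df.copy()
    for sent_col, tok_col in zip(sentence_cols, token_cols):
        df[tok_col] = df[sent_col].apply(lambda s: [t.text for t in nlp(s)])
    return df
```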

@janhavi13
Collaborator

> No, I meant the current implementation only applies to token_cols[0] and token_cols[1]. It should allow an arbitrary number of columns.

@saidbleik - so loop through the token_cols list that's passed and apply the function to each column. Is that correct? That way it's not restricted to [0] and [1]?

@saidbleik
Collaborator

> @saidbleik - so loop through the token_cols list that's passed and apply the function to each column. Is that correct? That way it's not restricted to [0] and [1]?

Yes, the argument is a list (not restricted to 2).
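
So the fix is just to iterate over the list instead of indexing [0] and [1]. A minimal sketch for rm_nltk_stopwords() under that assumption (to_nltk_tokens() would follow the same pattern):

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def rm_nltk_stopwords(df, token_cols):
    """Remove English stopwords from every token column in the list."""
    stop_words = set(stopwords.words("english"))
    df = df.copy()
    for col in token_cols:  # loop over all columns, not just [0] and [1]
        df[col] = df[col].apply(
            lambda tokens: [t for t in tokens if t.lower() not in stop_words]
        )
    return df
```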

examples/embeddings/embedding_trainer.ipynb (7 review threads, all resolved)
@miguelgfierro
Member

miguelgfierro commented May 9, 2019

You guys are doing an amazing job. Sorry I broke your folder structure. If you have any problems, please let me know and I will help :-)

I did a pass over the notebooks and will do another pass later.

AbhiramE and others added 15 commits May 9, 2019 14:58
1. Refactored the word2vec loader to check for existing files before downloading or extracting.

2. Added unit tests for the load, download, and extract functions.
1. Added methods to download, extract, and load GloVe vectors.
2. Added unit tests for the public methods.

Other changes
 1. Made the download and extract methods private.
 2. Refactored the Word2vec unit tests to exclude private methods.
1. Added methods to download, extract, and load GloVe vectors.
2. Added unit tests for the public method.

Other changes
 1. Refactored files to add return types to docstrings.
 2. Minor changes to path variables.
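
The "check before download or extract" pattern those commits describe might look roughly like this (helper names and parameters are illustrative, not the repo's actual API):

```python
import os
import urllib.request
import zipfile

def _maybe_download(url, dest_path):
    """Download the file only if it is not already on disk."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path

def _maybe_extract(zip_path, extract_dir):
    """Extract the archive only if the target directory does not exist yet."""
    if not os.path.exists(extract_dir):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(extract_dir)
    return extract_dir
```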
@janhavi13
Collaborator

> Yes, the argument is a list (not restricted to 2).

@saidbleik - Take a look at the fixed nltk utils; I also added the to_lowercase_all and to_lowercase variants in preprocess.py. Let me know if it's good enough.

Collaborator

@saidbleik left a comment


Thanks.

@saidbleik merged commit 07ca05d into staging on May 11, 2019
@jisooghd deleted the maidap-sentence-similarity branch on May 22, 2019