This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Baseline model notebook and embeddings trainer notebook #47

Merged: 51 commits into staging on May 11, 2019

Conversation

cocochrane
Contributor

No description provided.

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/nlp/pull/47

Visit www.reviewnb.com to learn how we simplify your Jupyter Notebook workflows.

@cocochrane
Contributor Author

@AbhiramE @caseyhong @irshaffe @catherine667 Opened this PR against the staging branch. You've all already reviewed this work, but tagging you in case you want to make any more edits!

@saidbleik
Collaborator

We might want to keep the utils in preprocess.py general-purpose.
For example:

  • pass the column names that need to be lower-cased into to_lowercase()
  • to_nltk_tokens() and rm_nltk_stopwords() accept an arbitrary number of columns, but only two are actually operated on (a sketch of the generalized version follows below)
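
To make the first point concrete, here is a minimal sketch of a column-aware to_lowercase(), assuming pandas DataFrames; the signature is illustrative, not the current preprocess.py API:

```python
import pandas as pd

def to_lowercase(df: pd.DataFrame, column_names) -> pd.DataFrame:
    """Lower-case only the named columns, leaving all others untouched.

    Illustrative sketch; the actual preprocess.py signature may differ.
    """
    df = df.copy()
    for col in column_names:
        df[col] = df[col].str.lower()
    return df
```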

@cocochrane
Contributor Author

> We might want to keep the utils in preprocess.py general-purpose. For example:
>
> • pass the column names that need to be lower-cased into to_lowercase()
> • to_nltk_tokens() and rm_nltk_stopwords() accept an arbitrary number of columns, but only two are actually operated on

@caseyhong @AbhiramE @janhavi13 Just keeping you updated on the comments that correspond to changes to the utils

@saidbleik
Collaborator

> @caseyhong @AbhiramE @janhavi13 Just keeping you updated on the comments that correspond to changes to the utils

We can also have two versions, like to_lowercase_all(df) and to_lowercase(df, col_names).
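
A hedged sketch of how the two variants could relate, with to_lowercase_all() as a thin wrapper over the column-aware version (the names come from the comment above; the implementation details are assumptions):

```python
def to_lowercase_all(df):
    """Lower-case every string-typed column by delegating to to_lowercase()."""
    string_cols = df.select_dtypes(include="object").columns
    return to_lowercase(df, string_cols)
```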

@jisooghd
Contributor

jisooghd commented May 8, 2019

@saidbleik makes sense for the lowercase. For your second point, do you mean we should explicitly enforce the 2-column limit, since that is what is actually happening under the hood? So

`to_spacy_tokens(df, sent_col_1, sent_col_2, tok_col_1, tok_col_2)`
instead of
`to_spacy_tokens(df, sentence_cols=["sentence1", "sentence2"], token_cols=["sentence1_tokens", "sentence2_tokens"])`?

@saidbleik
Collaborator

> @saidbleik makes sense for the lowercase. For your second point, do you mean we should explicitly enforce the 2-column limit, since that is what is actually happening under the hood? So
> `to_spacy_tokens(df, sent_col_1, sent_col_2, tok_col_1, tok_col_2)`
> instead of
> `to_spacy_tokens(df, sentence_cols=["sentence1", "sentence2"], token_cols=["sentence1_tokens", "sentence2_tokens"])`?

No, I meant the current implementation only applies to token_cols[0] and token_cols[1]. It should allow an arbitrary number of columns.
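
In other words, the function would pair each sentence column with its output token column and loop over the pairs, for however many columns are passed. A rough sketch, assuming the en_core_web_sm spaCy model (the exact model and signature in the repo may differ):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the repo may use another

def to_spacy_tokens(df, sentence_cols, token_cols):
    """Tokenize each sentence column into its paired token column.

    Handles any number of column pairs, not just the first two.
    """
    if len(sentence_cols) != len(token_cols):
        raise ValueError("sentence_cols and token_cols must have equal length")
    df = df.copy()
    for sent_col, tok_col in zip(sentence_cols, token_cols):
        df[tok_col] = df[sent_col].apply(lambda s: [t.text for t in nlp(s)])
    return df
```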

@janhavi13
Collaborator

> No, I meant the current implementation only applies to token_cols[0] and token_cols[1]. It should allow an arbitrary number of columns.

@saidbleik - so loop through the token_cols list that's passed and apply the function to each column. Is that correct? That way it's not restricted to [0] and [1]?

@saidbleik
Collaborator

> @saidbleik - so loop through the token_cols list that's passed and apply the function to each column. Is that correct? That way it's not restricted to [0] and [1]?

Yes, the argument is a list (not restricted to 2).
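
So the fix is just to iterate over the list instead of indexing [0] and [1]. A minimal sketch for rm_nltk_stopwords() under that assumption (to_nltk_tokens() would follow the same pattern):

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def rm_nltk_stopwords(df, token_cols):
    """Remove English stopwords from every token column in the list."""
    stop_words = set(stopwords.words("english"))
    df = df.copy()
    for col in token_cols:  # loop over all columns, not just [0] and [1]
        df[col] = df[col].apply(
            lambda tokens: [t for t in tokens if t.lower() not in stop_words]
        )
    return df
```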

examples/embeddings/embedding_trainer.ipynb (7 review threads, all resolved)
@miguelgfierro
Member

miguelgfierro commented May 9, 2019

You guys are doing an amazing job. Sorry I broke your folder structure. If you have any problems, please let me know and I will help :-)

I did a pass over the notebooks and will do another pass later.

AbhiramE and others added 15 commits May 9, 2019 14:58
1. Refactored the word2vec loader to check for existing files before downloading or extracting.

2. Added unit tests for the load, download, and extract functions.
1. Added methods to download, extract, and load GloVe vectors.
2. Added unit tests for the public methods.

Other changes
 1. Made the download and extract methods private.
 2. Refactored the Word2vec unit tests to exclude private methods.
1. Added methods to download, extract, and load GloVe vectors.
2. Added unit tests for the public method.

Other changes
 1. Refactored files to add return types to docstrings.
 2. Minor changes to path variables.
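
The "check before download or extract" pattern those commits describe might look roughly like this (helper names and parameters are illustrative, not the repo's actual API):

```python
import os
import urllib.request
import zipfile

def _maybe_download(url, dest_path):
    """Download the file only if it is not already on disk."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path

def _maybe_extract(zip_path, extract_dir):
    """Extract the archive only if the target directory does not exist yet."""
    if not os.path.exists(extract_dir):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(extract_dir)
    return extract_dir
```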
@janhavi13
Collaborator

> Yes, the argument is a list (not restricted to 2).

@saidbleik - Take a look at the fixed nltk utils; I also added the to_lowercase_all and to_lowercase variants in preprocess.py. Let me know if it's good enough.

Collaborator

@saidbleik left a comment


Thanks.

@saidbleik merged commit 07ca05d into staging on May 11, 2019
@jisooghd deleted the maidap-sentence-similarity branch on May 22, 2019