[MRG] Keras wrapper for Word2Vec model in Gensim #1248
Conversation
@tmylk I have tried to use the wrapper for a smaller version of the 20NewsGroups task. The code used is as follows (it is based on the code used here):

```python
from __future__ import print_function

import os
import sys

import numpy as np

from gensim.sklearn_integration.keras_wrapper_gensim_word2vec import KerasWrapperWord2VecModel
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.models import Model

BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + './path/to/text/data/dir'
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128

# prepare text samples and their labels
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# train the word2vec model and get its embedding layer
data1 = word2vec.LineSentence('./path/to/input/data')
keras_w2v = KerasWrapperWord2VecModel(data1, min_count=1)
embedding_layer = keras_w2v.get_embedding_layer()

# 1D convnet on top of the word embeddings
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=BATCH_SIZE)
```
Please add unit tests and an ipynb
@tmylk Sure.
@tmylk I have added unit tests for the word similarity task (using cosine distance) as well as a smaller version of the 20NewsGroups classification task. I have also created an IPython notebook for the wrapper explaining both these examples.
Addresses of Atheist Organizations
USA
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US.
Please put data into test/test_data
@tmylk The data used in the unit tests is already present in the test/test_data folder. The data used in the IPython notebooks is in the docs/notebooks/datasets folder. We are using the same data in both places, so to avoid unnecessary duplication, should I use the data in test/test_data (i.e. set the path accordingly in the ipynb) in the notebooks as well?
Yes, test/test_data is a better location for data used in both.
Got it. So, I'll change the path set in the ipynb for this functionality.
gensim/keras_integration/__init__.py (outdated)
#
# Copyright (C) 2011 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
"""Keras wrappers for gensim.
"Wrapper to allow gensim word2vec as input into Keras." is clearer. And in other docstrings and ipynbs too.
@tmylk The file __init__.py is common for the entire folder gensim/keras_integration (which in turn would have the files for integration of the various models with Keras). So shouldn't the docstring here be more generic (like it already is)?
It can be more general, just pointing out the difference in meaning between "keras wrapper for gensim" vs "gensim wrapper for keras". Which one is it?
I think it is a "Wrapper to allow gensim models as input into Keras."
Okay. Understood it now. Thanks for pointing this out. So, I'll change "Keras wrappers for gensim" to "Wrappers to allow gensim models as input into Keras".
Thanks for the feature. The change seems to be so small - just 5 lines. Instead of creating a new class, could you please just add this one method to the existing code?
@tmylk Sure. I'll add the function there.
@chinmayapancholi13 This is more appropriate to have in the KeyedVectors class.
@tmylk I have moved the function accordingly.
@tmylk Thanks a lot for your feedback. I have incorporated your suggestions as follows:
gensim/keras_integration/__init__.py (outdated)
@@ -0,0 +1,8 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
This file and folder are no longer needed
Right! Removing this folder right away.
setup.py (outdated)
@@ -233,6 +233,9 @@ def finalize_options(self):
    'scikit-learn',
    'pyemd',
    'annoy',
    'theano',
Would only one of them be enough? tf is preferred.
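One common way to handle this kind of optional dependency, as a sketch (this is not the PR's actual setup.py; the extra name `keras` is hypothetical), is a setuptools `extras_require` entry rather than a hard dependency:

```python
# sketch of a setup.py fragment: the deep-learning backend becomes an
# opt-in extra instead of a hard dependency of the base package
extras_require = {
    # a hypothetical `pip install gensim[keras]` would pull in Keras plus
    # TensorFlow (the preferred backend); a plain install stays backend-free
    'keras': ['keras', 'tensorflow'],
}

print(sorted(extras_require['keras']))
```

This keeps the base install light while still letting CI install the extra explicitly.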
Let's add a note about shorttext to the ipynb. And then LGTM.
Need to fix the problems with Travis (so strange, because Keras is installed, but the log suggests otherwise).
@menshikh-iv This is because Keras also needs TensorFlow to be installed (when TensorFlow is the default backend), and we are not installing TensorFlow there. Thus, we (similar to Keras) expect the users to install TensorFlow by themselves.
@chinmayapancholi13 Thanks for your clarification, but what will we do with Travis?
@menshikh-iv One solution can be to include TF in the installation and add a note that users would have to re-install the TF of their choice (i.e. in case it gets overwritten). For the tests to pass in Travis, I believe we must install TF somehow, so this could be a solution in my opinion.
@chinmayapancholi13 Sounds good to me, let's do it.
@chinmayapancholi13 Great 🥇
Thanks for the interesting feature and notebook!
There are some code style issues -- can you fix that? @menshikh-iv
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we call the wrapper and pass appropriate parameters."
What wrapper?
Here and below, the text refers to some "wrapper", but I see no wrapper.
"source": [
"word_a = 'graph'\n",
"word_b = 'trees'\n",
"output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])]) # output is the cosine distance between the two words (as a similarity measure)\n",
The comment would be better moved after the code, on a separate line, to improve readability.
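The suggested style, as a generic runnable sketch (the plain dot product here is a stand-in for the notebook's keras_model.predict call, not the actual code):

```python
def dot(a, b):
    # plain dot product of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

vec_a = [1.0, 0.0]
vec_b = [1.0, 0.0]

# the explanatory comment sits on its own line after the statement,
# so the code line itself stays short and easy to scan
similarity = dot(vec_a, vec_b)
# `similarity` is the cosine similarity of the two (unit-length) vectors

print(similarity)
```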
"source": [
"# global variables\n",
"\n",
"nb_filters=1200 # number of filters\n",
PEP8: space around the assignment operator (`x = 1`). Here and in other places in the notebook.
" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
Will this work across Python 2 / Python 3?
Yes. It is working fine for both Python 2 and 3.
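For what it's worth, an `isinstance` check is the more idiomatic way to write this test (a generic sketch, not the notebook's code; note that on Python 2 a `basestring` check would additionally cover `unicode` values):

```python
def is_text(value):
    # isinstance is preferred over `type(value) == str`: it also accepts
    # str subclasses, and is the idiomatic type check in Python
    return isinstance(value, str)

print(is_text("a short description"))  # a plain string is accepted
print(is_text(42))                     # a non-string is rejected
```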
" shorttextdict = defaultdict(lambda : [])\n",
" for category, descp in zip(df[category_col], df[descp_col]):\n",
" if type(descp)==str:\n",
" shorttextdict[category] += [descp]\n",
Wouldn't `append` be simpler, faster and more readable? Also, if you need to convert to a plain `dict` anyway below, a plain `setdefault(key, []).append(x)` may be easier than a defaultdict.
" \"\"\"\n",
" df = pd.read_csv(filepath)\n",
" category_col, descp_col = df.columns.values.tolist()\n",
" shorttextdict = defaultdict(lambda : [])\n",
defaultdict(list)
" Return an example data set, with three subjects and corresponding keywords.\n",
" This is in the format of the training input.\n",
" \"\"\"\n",
" data_path = './datasets/keras_classifier_training_data.csv'\n",
`os.path.join` is better (it will work on Windows too). Here and elsewhere.
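A minimal sketch of the suggested change, using the path components from the notebook:

```python
import os

# os.path.join uses the platform's separator, so the resulting path works
# on Windows as well as on POSIX systems, unlike a hard-coded '/' string
data_path = os.path.join('datasets', 'keras_classifier_training_data.csv')

print(data_path.split(os.sep))
```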
"cell_type": "markdown",
"metadata": {},
"source": [
"The result above clearly suggests (~ 98% probability!) that the input `artificial intellegence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n",
`intellegence` => `intelligence`
@piskvorky Thanks a lot for your comprehensive feedback! :) I'd be happy to make these changes in a new PR. I'll also try to keep in mind these code-style issues in the future.
This PR adds a Keras wrapper for Word2Vec Model in Gensim.