-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert to absolute paths in wordrank #1503
Conversation
gensim/models/wrappers/wordrank.py
Outdated
@@ -118,14 +117,14 @@ def train(cls, wr_path, corpus_file, out_name, size=100, window=15, symmetric=1, | |||
utils.check_output(w, args=command, stdin=r) | |||
|
|||
logger.info("Deleting frequencies from vocab file") | |||
with smart_open(vocab_file, 'wb') as w: | |||
with smart_open(join(meta_dir, vocab_file), 'wb') as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move join
to definition of vocab_file
(line 91) and same changes for all smart_open
arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code style suggestions.
gensim/models/wrappers/wordrank.py
Outdated
# prepare training data (cooccurrence matrix and vocab) | ||
model_dir = os.path.join(wr_path, out_name) | ||
meta_dir = os.path.join(model_dir, 'meta') | ||
model_dir = join(wr_path, out_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using full namespace os.path.join
is preferable.
There are many join
s in Python and its various libraries, and the context makes the code immediately easier to read and understand for other readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gensim/models/wrappers/wordrank.py
Outdated
cmd_del_vocab_freq = ['cut', '-d', " ", '-f', '1', temp_vocab_file] | ||
|
||
commands = [cmd_vocab_count, cmd_cooccurence_count, cmd_shuffle_cooccurences] | ||
input_fnames = [corpus_file.split('/')[-1], corpus_file.split('/')[-1], cooccurrence_file] | ||
input_fnames = [join(meta_dir, corpus_file.split('/')[-1]), join(meta_dir, corpus_file.split('/')[-1]), cooccurrence_file] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string.split('/')
is not portable -- see os.path.split
, os.path.basename
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -126,7 +125,7 @@ def train(cls, wr_path, corpus_file, out_name, size=100, window=15, symmetric=1, | |||
with smart_open(cooccurrence_shuf_file, 'rb') as f: | |||
numlines = sum(1 for line in f) | |||
with smart_open(meta_file, 'wb') as f: | |||
meta_info = "{0} {1}\n{2} {3}\n{4} {5}".format(numwords, numwords, numlines, cooccurrence_shuf_file, numwords, vocab_file) | |||
meta_info = "{0} {1}\n{2} {3}\n{4} {5}".format(numwords, numwords, numlines, cooccurrence_shuf_file.split('/')[-1], numwords, vocab_file.split('/')[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dtto on split
.
Elsewhere in the file (and in gensim) the standard C-style %s %d %f
string formatting is used; best to keep it consistent here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky formatting with {}.format
more preferable for Python now. I think we should use format
method instead of C-style formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kept {}.format
for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code style comments.
gensim/models/wrappers/wordrank.py
Outdated
cmd_del_vocab_freq = ['cut', '-d', " ", '-f', '1', temp_vocab_file] | ||
|
||
commands = [cmd_vocab_count, cmd_cooccurence_count, cmd_shuffle_cooccurences] | ||
input_fnames = [join(meta_dir, corpus_file.split('/')[-1]), join(meta_dir, corpus_file.split('/')[-1]), cooccurrence_file] | ||
input_fnames = [os.path.join(meta_dir, os.path.split(corpus_file)[-1]), os.path.join(meta_dir, os.path.split(corpus_file)[-1]), cooccurrence_file] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is a little hard to navigate -- any way to restructure the logic to make it more readable? Maybe factor out some of the arguments into separate lines?
gensim/models/wrappers/wordrank.py
Outdated
os.makedirs(meta_dir) | ||
logger.info("Dumped data will be stored in '%s'", model_dir) | ||
copyfile(corpus_file, join(meta_dir, corpus_file.split('/')[-1])) | ||
copyfile(corpus_file, os.path.join(meta_dir, corpus_file.split('/')[-1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't os.path.split()[-1]
simply os.path.basename()
?
Converted relative paths to absolute for every wordrank command.