Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert to absolute paths in wordrank #1503

Merged
merged 5 commits into from
Jul 25, 2017

Conversation

parulsethi
Copy link
Contributor

Converted relative paths to absolute for every wordrank command.

@@ -118,14 +117,14 @@ def train(cls, wr_path, corpus_file, out_name, size=100, window=15, symmetric=1,
utils.check_output(w, args=command, stdin=r)

logger.info("Deleting frequencies from vocab file")
with smart_open(vocab_file, 'wb') as w:
with smart_open(join(meta_dir, vocab_file), 'wb') as w:
Copy link
Contributor

@menshikh-iv menshikh-iv Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move join to definition of vocab_file (line 91) and same changes for all smart_open arguments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code style suggestions.

# prepare training data (cooccurrence matrix and vocab)
model_dir = os.path.join(wr_path, out_name)
meta_dir = os.path.join(model_dir, 'meta')
model_dir = join(wr_path, out_name)
Copy link
Owner

@piskvorky piskvorky Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using full namespace os.path.join is preferable.

There are many joins in Python and its various libraries, and the context makes the code immediately easier to read and understand for other readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cmd_del_vocab_freq = ['cut', '-d', " ", '-f', '1', temp_vocab_file]

commands = [cmd_vocab_count, cmd_cooccurence_count, cmd_shuffle_cooccurences]
input_fnames = [corpus_file.split('/')[-1], corpus_file.split('/')[-1], cooccurrence_file]
input_fnames = [join(meta_dir, corpus_file.split('/')[-1]), join(meta_dir, corpus_file.split('/')[-1]), cooccurrence_file]
Copy link
Owner

@piskvorky piskvorky Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string.split('/') is not portable -- see os.path.split, os.path.basename etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -126,7 +125,7 @@ def train(cls, wr_path, corpus_file, out_name, size=100, window=15, symmetric=1,
with smart_open(cooccurrence_shuf_file, 'rb') as f:
numlines = sum(1 for line in f)
with smart_open(meta_file, 'wb') as f:
meta_info = "{0} {1}\n{2} {3}\n{4} {5}".format(numwords, numwords, numlines, cooccurrence_shuf_file, numwords, vocab_file)
meta_info = "{0} {1}\n{2} {3}\n{4} {5}".format(numwords, numwords, numlines, cooccurrence_shuf_file.split('/')[-1], numwords, vocab_file.split('/')[-1])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dtto on split.

Elsewhere in the file (and in gensim) the standard C-style %s %d %f string formatting is used; best to keep it consistent here as well.

Copy link
Contributor

@menshikh-iv menshikh-iv Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky formatting with {}.format more preferable for Python now. I think we should use format method instead of C-style formatting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kept {}.format for now

@menshikh-iv menshikh-iv merged commit 7a9e98e into piskvorky:develop Jul 25, 2017
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code style comments.

cmd_del_vocab_freq = ['cut', '-d', " ", '-f', '1', temp_vocab_file]

commands = [cmd_vocab_count, cmd_cooccurence_count, cmd_shuffle_cooccurences]
input_fnames = [join(meta_dir, corpus_file.split('/')[-1]), join(meta_dir, corpus_file.split('/')[-1]), cooccurrence_file]
input_fnames = [os.path.join(meta_dir, os.path.split(corpus_file)[-1]), os.path.join(meta_dir, os.path.split(corpus_file)[-1]), cooccurrence_file]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is a little hard to navigate -- any way to restructure the logic to make it more readable? Maybe factor out some of the arguments into separate lines?

os.makedirs(meta_dir)
logger.info("Dumped data will be stored in '%s'", model_dir)
copyfile(corpus_file, join(meta_dir, corpus_file.split('/')[-1]))
copyfile(corpus_file, os.path.join(meta_dir, corpus_file.split('/')[-1]))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't os.path.split()[-1] simply os.path.basename()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants