Replace open() with smart_open() in notebooks. Fix #1789 #1812
Thanks for PR @sharanry, two questions:
I have replaced all open calls in
@sharanry yes
@menshikh-iv Could you review this?
Thanks @sharanry, have you already tested it (i.e. re-run all notebooks locally with your changes)?
"class MyCorpus(object):\n",
" def __iter__(self):\n",
" for line in open('datasets/mycorpus.txt'):\n",
" for line in smart_open('datasets/mycorpus.txt'):\n",
Better to explicitly add the mode, e.g. 'rb', 'r', etc.
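To illustrate the suggestion above, here is a minimal sketch of the corpus iterator with an explicit mode. It uses the builtin `open` as a stand-in so that it runs without extra dependencies. The `opener` parameter, the temporary file, and the sample text are all hypothetical, and it is an assumption that `smart_open` accepts the same `(path, mode)` arguments.

```python
import os
import tempfile


class MyCorpus(object):
    """Stream a text file one tokenized line at a time."""

    def __init__(self, path, opener=open):
        # `opener` is a hypothetical hook: pass smart_open here if it is installed.
        self.path = path
        self.opener = opener

    def __iter__(self):
        # Explicit 'rb': every line arrives as bytes, regardless of defaults.
        with self.opener(self.path, 'rb') as fin:
            for line in fin:
                # Decode once, then do all string work on text.
                yield line.decode('utf-8').lower().split()


# Hypothetical sample file standing in for datasets/mycorpus.txt.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as fout:
    fout.write(u'Human machine interface\nGraph minors survey\n'.encode('utf-8'))

docs = list(MyCorpus(sample_path))
os.remove(sample_path)
print(docs)
```

Spelling out the mode means the reader of the notebook does not have to know which default a given opener uses.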
@menshikh-iv I made the changes you asked for. Next I will try with the provided Docker image.
@sharanry Also please merge
…into smart_open

* 'develop' of https://github.com/RaRe-Technologies/gensim:
  Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)
  Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)
  Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)
  Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)
  Fix docstrings for `gensim.matutils` (piskvorky#1804)
  Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)
  Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)
  Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)
  Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)
  Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)
  Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)
  Fix docstrings for `gensim.utils` (piskvorky#1797)
  Fix tox.ini/setup.cfg configuration (piskvorky#1815)
  Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
Hey @sharanry, how is the notebook run going?
Hey @menshikh-iv, sorry for the late reply, I haven't been able to build the Docker image, I tried
@sharanry replace
@sharanry how is it going? Have you finished running them?
@menshikh-iv Apologies for the delay. I am running into multiple errors with the Docker image and am unable to set it up. Is there any alternative way I could run the notebooks?
@sharanry what kind of errors, can you show them here? The alternative is to run them locally on your machine.
Big thanks @sharanry, but we need to fix the notebooks first (unfortunately, this doesn't work correctly).
@menshikh-iv I have yet to check all the notebooks. I can help fix them if possible.
@sharanry yes, that would be really nice, feel free to fix the problems.
@menshikh-iv Can we merge this?
@sharanry I want to, but we need to run all of these notebooks first and check that everything is fine :( UPD: let me check it myself
@sharanry it was much harder than I thought.
I'll try to run, fix, and merge this stuff when I have enough time for it (probably not soon), sorry for the wait. Also, please move the commits from #1893 to the current PR (I will not merge this until we fix it anyway); this also lets us avoid a painful merge conflict with the notebooks.
Per the discussion in #1964 (we'll rework most of the notebooks and provide small, clear examples instead of most of the "long-running" stuff), and because this is really hard to check (I spent a lot of time on it but did not even check half, because of the need to test on 3 different Python versions), I am closing the current PR. Sorry @sharanry, and thanks for the work!
I see no reason to wait for some undefined future tests or refactorings just to change I propose merging this (useful) PR.
@piskvorky I closed it because it takes a really long time to check that we didn't break something (and that everything works as expected) with this change. P.S. this change is not that useful; everything works without it.
I agree it may not be a critical feature, but it is a welcome one -- please merge if the changes themselves are OK.
Why these changes in binary/text semantics?
@@ -67,7 +68,7 @@
" files = os.listdir(data_dir + yr_dir)\n",
" for filen in files:\n",
" # Note: ignoring characters that cause encoding errors.\n",
" with open(data_dir + yr_dir + '/' + filen, errors='ignore') as fid:\n",
" with smart_open(data_dir + yr_dir + '/' + filen, 'rb') as fid:\n",
Bug (not in this PR, notebook already bad): use os.path.join for joining filesystem paths.
Also, why the change in errors='ignore'?
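Both points in this comment can be sketched together. The example below uses only the builtin `open` (how `smart_open` treats an `errors=` keyword is not assumed here), and the directory name, file name, and file contents are hypothetical. It builds the path with `os.path.join` and shows one way to keep the "ignore undecodable characters" behaviour when a file is read in binary mode.

```python
import os
import shutil
import tempfile

# Hypothetical layout standing in for data_dir / yr_dir / filen in the notebook.
data_dir = tempfile.mkdtemp()
yr_dir = 'year_dir'
filen = 'doc.txt'
os.makedirs(os.path.join(data_dir, yr_dir))

# os.path.join inserts the right separator, unlike data_dir + yr_dir + '/' + filen.
path = os.path.join(data_dir, yr_dir, filen)

# Write bytes containing \xff, which is never valid in UTF-8.
with open(path, 'wb') as fout:
    fout.write(b'caf\xff latte\n')

# Text mode: the undecodable byte is dropped while reading.
with open(path, encoding='utf-8', errors='ignore') as fid:
    text_mode = fid.read()

# Binary mode: the same policy has to be applied explicitly when decoding.
with open(path, 'rb') as fid:
    binary_mode = fid.read().decode('utf-8', errors='ignore')

shutil.rmtree(data_dir)
print(repr(text_mode), repr(binary_mode))
```

Switching to 'rb' without the explicit decode step would silently drop the error handling the original line had.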
@@ -114,7 +115,7 @@
" # as well as pages about a single year.\n",
" # As a result, this preprocessing differs from the paper.\n",
" \n",
" with open(os.path.join(data_dir, fname)) as f:\n",
" with smart_open(os.path.join(data_dir, fname), 'rb') as f:\n",
This looks like a bug. If you open the file in binary mode, operations like split and lower have a different meaning compared to text mode.
I see the same problem in many places in this PR.
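A small sketch of the difference, using a hypothetical sample string and plain Python 3 with no file I/O: on `bytes`, `lower` only touches ASCII letters and `split` returns `bytes` tokens, so the result differs from the text-mode result until the data is decoded.

```python
# Hypothetical sample line; the accented capital is the interesting part.
text = 'Émile Zola WROTE Germinal'
raw = text.encode('utf-8')  # what a file opened with 'rb' would yield

# Text semantics: Unicode-aware lowercasing, str tokens.
text_tokens = text.lower().split()

# Binary semantics: ASCII-only lowercasing, bytes tokens.
raw_tokens = raw.lower().split()

# Decoding first restores the text-mode behaviour.
decoded_tokens = raw.decode('utf-8').lower().split()

print(text_tokens)
print(raw_tokens)
print(decoded_tokens == text_tokens)
```

In the binary case the first token keeps its capital 'É' and every token is a `bytes` object, so any later comparison against `str` tokens will not match.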
Replace open() with smart_open() in notebooks to make it compatible with Python 3.x
Issue: #1789