Add random_seed to LdaMallet #2153
Thanks for the PR @Zohaggie, please also add:
- loading an old Mallet model with the new code
- training with random_seed
@@ -122,6 +124,7 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
self.workers = workers
self.optimize_interval = optimize_interval
self.iterations = iterations
self.random_seed = random_seed
Define a custom load function for old Mallet models (saved without this option); see an example at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py#L348-L355
Please check I have coded and placed this correctly... Should I include logging as in the example?
Training with random_seed - what is required there?
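For readers following the review, the backward-compatibility request above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the gensim implementation: ToyModel, its pickle-based save/load, and the fallback value of 0 are all assumptions made for the example (later in this thread the merged default is described as 0).

```python
import logging
import pickle

logger = logging.getLogger(__name__)


class ToyModel:
    """Hypothetical stand-in for a persisted model (not the real LdaMallet)."""

    def __init__(self, num_topics=2, random_seed=0):
        self.num_topics = num_topics
        self.random_seed = random_seed

    def save(self, path):
        # Persist only the attribute dictionary, as a simple illustration.
        with open(path, "wb") as fout:
            pickle.dump(self.__dict__, fout)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as fin:
            state = pickle.load(fin)
        model = cls.__new__(cls)  # skip __init__, restore the saved state as-is
        model.__dict__.update(state)
        # Models saved before the option existed have no random_seed attribute,
        # so fill in an assumed default instead of failing later.
        if not hasattr(model, "random_seed"):
            logger.info("random_seed attribute not found, setting it to 0")
            model.random_seed = 0
        return model
```

The point of the pattern is that code written against the new attribute keeps working on files saved by an older version, and the log message answers the question about logging: it tells the user that a default was filled in.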
gensim/models/wrappers/ldamallet.py
Outdated
@@ -100,6 +100,8 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
Number of training iterations.
topic_threshold : float, optional
Threshold of the probability above which we consider a topic.
random_seed: int, optional
Random seed to ensure consistent results, default is None
No need to write the default value in the docstring description, and please add a period at the end of the sentence.
I spelled out the default so that users know that they need not enter a random_seed parameter at all.
Default parameter values are shown automatically in the documentation, so there is no need to duplicate them in the docstring.
gensim/models/wrappers/ldamallet.py
Outdated
@@ -268,11 +271,16 @@ def train(self, corpus):
cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '\
'--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '\
'--num-iterations %s --inferencer-filename %s --doc-topics-threshold %s'

if self.random_seed != None:
A plain if self.random_seed check is enough.
The reason for comparing against None is so that the random seed is not passed unless a value is explicitly given, and that value can be zero. If I test with just if self.random_seed and zero is passed, will it enter this if? Zero is a valid random_seed value.
So, in this case, it should be if self.random_seed is not None (see best practices).
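The difference between the two checks in this thread can be shown with a small sketch. The helper name append_seed and the base command string are made up for illustration; only the None comparison and the --random-seed option come from the discussion.

```python
def append_seed(cmd, random_seed=None):
    """Append Mallet's --random-seed option only when a seed was given."""
    # `is not None` keeps 0 as a legitimate explicit value; a bare
    # `if random_seed:` would silently skip it because 0 is falsy.
    if random_seed is not None:
        cmd += " --random-seed %s" % random_seed
    return cmd


base = "mallet train-topics --num-topics 2"
print(append_seed(base))      # unchanged: no seed requested
print(append_seed(base, 0))   # ends with "--random-seed 0"
print(append_seed(base, 42))  # ends with "--random-seed 42"
```

Comparing with is not None (rather than != None) is the identity test PEP 8 recommends for singletons.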
gensim/models/wrappers/ldamallet.py
Outdated
cmd = cmd % (
self.fcorpusmallet(), self.num_topics, self.alpha, self.optimize_interval,
self.workers, self.fstate(), self.fdoctopics(), self.ftopickeys(), self.iterations,
self.finferencer(), self.topic_threshold
)
why this?
I just inserted a blank line after the cmd definition for readability; it is not required.
I use Anaconda and have not been confident enough to install virtualenv and tox, so I cannot check myself, but the Travis CI tox check for Python 2.7 is failing. Are you able to tell me what I need to do to fix it?
@Zohaggie next time check the Travis log; here is the current state (PEP8 issues).
Also, please don't forget to write tests.
I am unable to determine how to pass the Travis test, as it seems to be highlighting old issues with whitespace in blank lines, which I think I have fixed in the latest commit. Also, this is my first time contributing to a project, and I am unsure what's required to write tests. Please assist with some more guidance!
So, PEP8 can be easily fixed; the tests are more important here. Let me try to describe what's needed: add 2 tests (methods) to this class https://github.com/RaRe-Technologies/gensim/blob/17fa0dcea8bb7824f0e709fd3ff60007bcdd85f6/gensim/test/test_ldamallet_wrapper.py#L30
OK, thanks. I might need a day or two to complete this, as I am juggling it with other commitments. This program loads common_texts as a sample text by default. Should I use that? It might be necessary to use something a bit bigger, as there are only 9 documents and 12 unique words in common_texts.
@Zohaggie just use the corpus that is already used in the test class https://github.com/RaRe-Technologies/gensim/blob/17fa0dcea8bb7824f0e709fd3ff60007bcdd85f6/gensim/test/test_ldamallet_wrapper.py#L36
OK, thanks, I will use it. I have already experimented with what the current test does: it saves the Mallet model and data into the user\temp directory. But in this test you want me to save into a sub-directory under gensim\test\test_data? I say sub-directory because a Mallet model comprises a number of files and I don't want to clutter up the test_data directory. Can you confirm my interpretation please?
@Zohaggie you should create a folder in
Thanks...
ping @Zohaggie, are you planning to finish the PR?
Hi Ivan
Yes, I am planning to, but I have had some urgent commitments to complete a first draft of a paper over the last few weeks. I hope to get back to the PR next week. If you have the time and it's straightforward for you to complete the testing piece and get it to me, then that would help, obviously :)
Kind Regards
Chris
Hi Ivan
Finally able to get back to this, I hope to have something to you soon
Regards
Chris
Hi Ivan
I am finishing off adding these 2 functions, but I am not sure about training and testing the models for equality.
I have tried creating the 2 models for comparison with the same parameters, including of course the random_seed value. If I compare them using the pattern in the testPersistence function for comparing word_topics, then they are the same: self.assertTrue(np.allclose(model.word_topics, model2.word_topics)).
However, if I try the pattern for comparing the output of model[doc] (to model2[doc]) which I see in the testTransform function, then they are not the same, unless I adopt the approach in testTransform and train them in parallel, so that they both converge to deliver the expected dense vector of [0.49, 0.51]. But doing this seems incorrect, as I am forcing them to be the same by using restarts.
Can you please advise me about what I should be doing here?
Kind Regards
Chris
On Sat, Aug 11, 2018 at 11:23 PM, Ivan Menshikh wrote:
> So, PEP8 can be easily fixed; the tests are more important here. Let me try to describe what's needed.
> Add 2 tests (methods) to this class https://github.com/RaRe-Technologies/gensim/blob/17fa0dcea8bb7824f0e709fd3ff60007bcdd85f6/gensim/test/test_ldamallet_wrapper.py#L30
> - test_load_model
>   1. (before writing the test) train LdaMallet with the latest gensim release (simply install it from PyPI) on a very small dataset
>   2. (before writing the test) save the model and add this data to the gensim/test/test_data folder
>   3. write a test where you simply load this model and check that it works (for example, try to apply it and update)
> - test_random_seed
>   1. Define the seed
>   2. Train the first model with the seed
>   3. Train the second model with the seed
>   4. Check that the models are the same
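The four test_random_seed steps quoted above can be sketched as a unit test. Because running Mallet needs an external binary, this sketch substitutes a hypothetical toy_train function (a seeded random matrix) for real training; in the actual test the comparison would be made on model attributes such as word_topics, as described in the message above.

```python
import random
import unittest


def toy_train(num_topics, num_words, random_seed):
    """Stand-in for model training: returns a seeded random topic-word matrix."""
    rng = random.Random(random_seed)
    return [[rng.random() for _ in range(num_words)] for _ in range(num_topics)]


class TestRandomSeed(unittest.TestCase):
    def test_random_seed(self):
        seed = 10                                    # 1. define the seed
        first = toy_train(2, 12, random_seed=seed)   # 2. train the first model
        second = toy_train(2, 12, random_seed=seed)  # 3. train the second model
        # 4. check that the models are the same (element-wise comparison)
        for row_a, row_b in zip(first, second):
            for a, b in zip(row_a, row_b):
                self.assertAlmostEqual(a, b)
```

No restart loop is needed in this pattern: if the seed fully determines training, two runs agree on the first attempt, and a mismatch is a genuine failure rather than a convergence issue.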
Sorry for the wait @Zohaggie, can you please submit the code first (it sounds like you are doing it correctly, but something goes wrong) and we will look at the concrete pieces together.
Hi Ivan. I have added code, but I need guidance from you on the test for model equality. I didn't add it, but I was wondering if something like this is required:

# fragment: relies on names defined elsewhere in the test
# (mallet_path, corpus, dictionary, prefix, model_save_name)
passed = False
passed2 = False
for i in range(5):  # restart at most 5 times
    # create the transformation models, both with the same random_seed
    model = ldamallet.LdaMallet(
        mallet_path, corpus, id2word=dictionary, num_topics=2,
        iterations=200, prefix=prefix, random_seed=10
    )
    model2 = ldamallet.LdaMallet(
        mallet_path, corpus, id2word=dictionary, num_topics=2,
        iterations=200, prefix=prefix, random_seed=10
    )
    # transform one document
    doc = list(corpus)[0]
    transformed = model[doc]
    transformed2 = model2[doc]
    vec = matutils.sparse2full(transformed, 2)  # convert to dense vector, for easier equality tests
    vec2 = matutils.sparse2full(transformed2, 2)
    expected = [0.49, 0.51]
    # must contain the same values, up to re-ordering
    passed = np.allclose(sorted(vec), sorted(expected), atol=1e-1)
    passed2 = np.allclose(sorted(vec2), sorted(expected), atol=1e-1)
    if passed and passed2:
        model.save(model_save_name)
        break
    if not passed:
        logging.warning(
            "LDA model failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec), sorted(expected)
        )
    if not passed2:
        logging.warning(
            "LDA model2 failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec2), sorted(expected)
        )
@Zohaggie thanks for the code, here are my notes about it
Hi Ivan When you say "Good idea" is it regarding to my idea for the code posted in my comments here? The reason for me asking you about this was because I was not sure if something like this was needed in addition to what I have already in place in my function Are you happy with the function I will get back to you perhaps for clarification, but in the meantime, no the code has not failed at all with my testing. Can you tell me why the checks are failing? Regarding use of sorted, or any other designs in my test code, I did not know what code to use so I copied existing code. Are you able to point me to better patterns to copy?
Thanks for adding this feature. Any ideas what version it will go into and when?
Hi Ivan Still waiting to hear back from you with clarification about what I need to add and change... Are you happy with Thanks
Ivan and Chris, our team is waiting on the random_seed functionality. When do you think it will be released?
Hi John
This is my first time contributing via a pull request with so many added conditions placed on me, including adding other features and testing, and it's not clear to me exactly what's required for the testing. I am waiting for clarification from Ivan. I found his last directions to be unclear.
I will in the meantime do a little more towards the testing function for random_seed, despite not having heard from Ivan. Regarding the core functionality: from my experience it works, which is after all why I put it in, so I guess you could take it as it is and satisfy yourself about it. :)
Kind Regards
Chris
@Zohaggie really sorry for the wait (busy month), I'll try to take a deeper look this week. @Jcerwin the next release is not scheduled yet; my feeling is that it will be this year. If you need this feature ASAP, you can install gensim from source (GitHub) after we merge the current PR.
Thank you both Ivan and Chris!
Big thanks @Zohaggie, congrats on your first contribution 🥇
Ivan, I see that you have a default of 0 rather than None, and have stated that 0 will use the system clock. Can you give me a URL to where the Mallet documentation says the system clock is used with a 0 value in random-seed? Also, I have tried to install 3.6.0 using conda, and it's not offering it, just 3.5.0-py36h830ac7b_1000. When will it be updated?
https://github.com/mimno/Mallet/blob/af1fcb1f3e6561afac28f4331e4e0d735b3d11b4/src/cc/mallet/topics/tui/TopicTrainer.java#L138-L139
Same for topic inference, etc.
When the release happens :) The ETA is the end of January. BTW, it is better to use conda-forge (we support this) or classic PyPI (through pip).
Thanks Ivan - this was with conda-forge that I tried...
Thanks for the reference re zero parameter! Wish I had known that earlier :)
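The zero-seed convention discussed above can be illustrated with a short sketch. It mirrors, in Python, the behaviour described in this thread (a seed of 0 means "not set", so the generator falls back to a clock-based seed); make_rng is a hypothetical helper and is not part of gensim or Mallet.

```python
import random
import time


def make_rng(random_seed=0):
    """Return a generator; 0 is treated as 'no seed' (assumed convention)."""
    if random_seed != 0:
        return random.Random(random_seed)  # reproducible across runs
    return random.Random(time.time_ns())   # clock-based, varies between runs


# Two generators built from the same non-zero seed produce the same stream.
print(make_rng(10).random() == make_rng(10).random())
```

Under this convention, passing 0 does not give reproducible runs, so a caller who wants consistent results should choose any non-zero seed.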
Including a random_seed parameter enables consistent results from Mallet.