
Add random_seed to LdaMallet #2153

Merged
17 commits merged into piskvorky:develop on Jan 10, 2019

Conversation

ChrisPalmerNZ
Contributor

Including a random_seed parameter enables consistent results from Mallet.
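For context, a minimal usage sketch of what this parameter enables (the mallet path and seed value below are placeholders, not taken from this PR):

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]

# passing the same random_seed on two runs should make Mallet reproduce the same topics
model = LdaMallet(
    '/path/to/mallet-2.0.8/bin/mallet',  # placeholder path to the mallet binary
    corpus=corpus, id2word=dictionary, num_topics=2, random_seed=42,
)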

Contributor

@menshikh-iv left a comment


thanks for the PR @Zohaggie, also please add

  • loading an old Mallet model with the new code
  • training with random_seed

@@ -122,6 +124,7 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
self.workers = workers
self.optimize_interval = optimize_interval
self.iterations = iterations
self.random_seed = random_seed
Contributor

Define a custom load function for old mallet models (without this option), see an example: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py#L348-L355
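For reference, a minimal sketch of such a load override, assuming the same pattern as the linked tfidfmodel example (the final merged code may differ):

@classmethod
def load(cls, *args, **kwargs):
    """Load a previously saved LdaMallet model, filling in `random_seed` for models saved before this attribute existed."""
    model = super(LdaMallet, cls).load(*args, **kwargs)
    if not hasattr(model, 'random_seed'):
        # an older model trained without this option; fall back to whichever default the constructor ends up using
        model.random_seed = 0
    return model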

Contributor Author

Please check I have coded and placed this correctly... Should I include logging as in the example?

Training with random_seed - what is required there?

@@ -100,6 +100,8 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
Number of training iterations.
topic_threshold : float, optional
Threshold of the probability above which we consider a topic.
random_seed: int, optional
Random seed to ensure consistent results, default is None
Contributor

No need to write the default value in the docstring description, and add a "." at the end of the sentence.

Contributor Author

I spelled out the default so that users know that they need not enter a random_seed parameter at all.

Contributor

Default parameters are automatically shown in the documentation, so there is no need to duplicate them in the docstring.

@@ -268,11 +271,16 @@ def train(self, corpus):
cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '\
'--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '\
'--num-iterations %s --inferencer-filename %s --doc-topics-threshold %s'

if self.random_seed != None:
Contributor

simply if self.random_seed is enough

Contributor Author

The reason for None is so that the random seed is not invoked unless a value is explicitly passed, which can be zero. If I test just with if self.random_seed and zero is passed, it will not enter this if - yet zero is a valid random_seed value.

Contributor

so, in this case, it should be if self.random_seed is not None (see best practices)
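A minimal sketch of that pattern as it might apply here, assuming Mallet's --random-seed command-line option (the exact wiring in the merged code may differ):

# only add the seed option when a seed was explicitly passed;
# identity check against None, because 0 is a valid seed value
if self.random_seed is not None:
    cmd += ' --random-seed %s' % self.random_seed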

cmd = cmd % (
self.fcorpusmallet(), self.num_topics, self.alpha, self.optimize_interval,
self.workers, self.fstate(), self.fdoctopics(), self.ftopickeys(), self.iterations,
self.finferencer(), self.topic_threshold
)

Contributor

why this?

Contributor Author

Actually I just inserted a blank line after the cmd definition for readability - it is not required.

@ChrisPalmerNZ
Contributor Author

I use Anaconda and have not been confident enough to install virtualenv and tox, so I cannot check this myself, but the Travis CI tox check for Python 2.7 is failing - are you able to tell me what I need to do to fix it?

@menshikh-iv
Contributor

@Zohaggie next time see the Travis log; here is the current state (PEP8 issues):

  /home/travis/build/RaRe-Technologies/gensim$ /home/travis/build/RaRe-Technologies/gensim/.tox/flake8/bin/flake8 gensim/ 
gensim/models/wrappers/ldamallet.py:104:54: W291 trailing whitespace
        """
        Parameters
        ----------
        mallet_path : str
            Path to the mallet binary, e.g. `/home/username/mallet-2.0.7/bin/mallet`.
        corpus : iterable of iterable of (int, int), optional
            Collection of texts in BoW format.
        num_topics : int, optional
            Number of topics.
        alpha : int, optional
            Alpha parameter of LDA.
        id2word : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            Mapping between tokens ids and words from corpus, if not specified - will be inferred from `corpus`.
        workers : int, optional
            Number of threads that will be used for training.
        prefix : str, optional
            Prefix for produced temporary files.
        optimize_interval : int, optional
            Optimize hyperparameters every `optimize_interval` iterations
            (sometimes leads to Java exception 0 to switch off hyperparameter optimization).
        iterations : int, optional
            Number of training iterations.
        topic_threshold : float, optional
            Threshold of the probability above which we consider a topic.
        random_seed: int, optional
            Random seed to ensure consistent results.   
        """
                                                     ^
gensim/models/wrappers/ldamallet.py:277:1: W293 blank line contains whitespace
        
^
gensim/models/wrappers/ldamallet.py:579:1: W293 blank line contains whitespace
    
^
gensim/models/wrappers/ldamallet.py:582:1: E302 expected 2 blank lines, found 1
def malletmodel2ldamodel(mallet_model, gamma_threshold=0.001, iterations=50):
^

Also, don't forget about writing tests please

@ChrisPalmerNZ
Contributor Author

I am unable to determine how to pass the Travis test, as it seems to be highlighting old issues with whitespace in blank lines, which I think I have fixed in the latest commit.

Also, this is the first time I have contributed to a project, so I am unsure what's required to write tests.

Please assist with some more guidance!

@menshikh-iv
Contributor

So, PEP8 can be easily fixed; more important here are the tests. Let me try to describe what's needed.

Add 2 tests (methods) to this class https://github.com/RaRe-Technologies/gensim/blob/17fa0dcea8bb7824f0e709fd3ff60007bcdd85f6/gensim/test/test_ldamallet_wrapper.py#L30

  • test_load_model
    1. (before writing the test) train an LDA Mallet model with the last gensim version (simply install it from PyPI) on a very small dataset
    2. (before writing the test) save the model and add this data to the gensim/test/test_data folder
    3. write a test where you simply load this model & check that it works (for example - try to apply it and update it)
  • test_random_seed (a rough sketch follows this list)
    1. Define a seed
    2. Train the first model with the seed
    3. Train the second model with the seed
    4. Check that the models are the same
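A rough sketch of test_random_seed (the test-class attributes such as self.mallet_path and self.corpus are assumptions based on the existing test file; the real test may differ):

def test_random_seed(self):
    # two models trained with the same seed should learn identical word-topic matrices
    model_1 = ldamallet.LdaMallet(
        self.mallet_path, self.corpus, num_topics=2, iterations=100, random_seed=42)
    model_2 = ldamallet.LdaMallet(
        self.mallet_path, self.corpus, num_topics=2, iterations=100, random_seed=42)
    self.assertTrue(np.allclose(model_1.word_topics, model_2.word_topics))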

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Aug 11, 2018

OK, thanks. I might take a day or two to complete this, juggling it with other commitments.

This program by default loads common_texts as a sample text - should I use that? It might be necessary to use something a bit bigger, as there are only 9 documents and 12 unique words in common_texts.

@menshikh-iv
Contributor

@ChrisPalmerNZ
Contributor Author

OK, thanks - I will use it. I have already experimented with what the current test does: it saves the mallet model and data into the user\temp directory, but in this test you want me to save into a sub-directory under gensim\test\test_data? I say sub-directory as there are a number of files comprising a mallet model and I don't want to clutter up the test_data directory. Can you confirm my interpretation of things please?

@menshikh-iv
Contributor

@Zohaggie you should create a folder in test_data for this model and store the model there (the dataset is already in test_data, no additional actions needed).

@ChrisPalmerNZ
Contributor Author

Thanks...

@menshikh-iv
Contributor

ping @Zohaggie, are you planning to finish the PR?

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Aug 29, 2018 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 12, 2018 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 14, 2018 via email

@menshikh-iv
Contributor

Sorry for the wait @Zohaggie, can you please submit the code first (it sounds like you've made it correctly, but something is going wrong) and we'll have a look at the concrete pieces together.

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 26, 2018

Hi Ivan. I have added code but I need guidance from you regarding the test for model equality. I didn't add it yet, but was wondering if something like this is required:

passed = False
passed2 = False
for i in range(5):  # restart at most 5 times
    # create the transformation models
    model = ldamallet.LdaMallet(mallet_path, corpus, id2word=dictionary, num_topics=2, iterations=200, prefix=prefix, random_seed=10)
    model2 = ldamallet.LdaMallet(mallet_path, corpus, id2word=dictionary, num_topics=2, iterations=200, prefix=prefix, random_seed=10)
    # transform one document
    doc = list(corpus)[0]
    transformed = model[doc]
    transformed2 = model2[doc]
    vec = matutils.sparse2full(transformed, 2)  # convert to dense vector, for easier equality tests
    vec2 = matutils.sparse2full(transformed2, 2)
    expected = [0.49, 0.51]
    # must contain the same values, up to re-ordering
    passed = np.allclose(sorted(vec), sorted(expected), atol=1e-1)
    passed2 = np.allclose(sorted(vec2), sorted(expected), atol=1e-1)
    if passed and passed2:
        model.save(model_save_name)
        break
    if not passed:
        logging.warning(
            "LDA model failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec), sorted(expected)
        )
    if not passed2:
        logging.warning(
            "LDA model2 failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec2), sorted(expected)
        )

@menshikh-iv
Contributor

@Zohaggie thanks for the code, here are my notes about it:

  • Good idea: just compare 2 matrices:
    • Words X Topics (extracted from the model)
    • Topics X Docs - same as you do now, but with the whole corpus instead of 1 document (a more precise approach); a rough sketch follows this list
  • Why can the order change (so that you need sorted) if you pin a seed? Is this an issue with Mallet?
  • Does your current code fail? At what moment? Maybe you need to explicitly pin some other parameters (like the number of epochs or something similar)?
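For example, the Topics X Docs comparison over the whole corpus could look roughly like this (a sketch only, reusing model and model2 from the snippet above):

# compare document-topic distributions for the whole corpus, not just one document
dense = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
dense2 = matutils.corpus2dense(model2[corpus], num_terms=model2.num_topics)
assert np.allclose(dense, dense2)
# and the Words X Topics matrices extracted directly from the models
assert np.allclose(model.word_topics, model2.word_topics)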

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 28, 2018

Hi Ivan

When you say "Good idea", is it regarding my idea for the code posted in my comments here? The reason I asked about this was because I was not sure whether something like this was needed in addition to what I already have in place in my function test_random_seed. Can you give me some evaluation of that code and let me know if it needs more (along the lines of the suggested code)?

Are you happy with the function test_load_model?

I may get back to you for clarification, but in the meantime, no, the code has not failed at all in my testing.

Can you tell me why the checks are failing?

Regarding the use of sorted, or any other design in my test code, I did not know what code to use so I copied existing code. Are you able to point me to better patterns to copy?

@Jcerwin

Jcerwin commented Sep 30, 2018

Thanks for adding this feature. Any ideas what version it will go into and when?

@ChrisPalmerNZ
Contributor Author

Hi Ivan

Still waiting to hear back from you with clarification about what I need to add and change... Are you happy with test_load_model? Do I need to add extra tests to test_random_seed or replace the existing tests?

Thanks
Chris

@Jcerwin

Jcerwin commented Oct 13, 2018

Ivan and Chris, our team is waiting on the random_seed functionality. When do you think it will be released?

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Oct 14, 2018 via email

@menshikh-iv
Contributor

menshikh-iv commented Oct 15, 2018

@Zohaggie really sorry for the wait (busy month), I'll try to take a deeper look this week.

@Jcerwin the next release is not scheduled yet; my feeling is that it will be within the current year. If you need to use this feature ASAP, you can just install gensim from source (GitHub) after we merge the current PR.

@Jcerwin

Jcerwin commented Oct 15, 2018

Thank you both Ivan and Chris!

@menshikh-iv changed the title from "Added random_seed parameter to ldamallet wrapper" to "Add random_seed to LdaMallet" on Jan 10, 2019
@menshikh-iv
Contributor

big thanks @Zohaggie, congrats on your first contribution 🥇

@menshikh-iv merged commit 01f4ac8 into piskvorky:develop on Jan 10, 2019
@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 10, 2019

Ivan, I see that you have a default of 0 rather than None, and have stated that 0 will use the system clock. Can you give me a URL to where the Mallet documentation says that a 0 value for random-seed uses the system clock?

Also, I have tried to install 3.6.0 using conda, and it's not offering it - just 3.5.0-py36h830ac7b_1000 - when will it be updated?

@menshikh-iv
Contributor

Ivan, I see that you have a default of 0 rather than None, and have stated the 0 will use the system clock. Can you give me a URL to where the Mallet documentation uses the system clock with a 0 value in random-seed?

https://github.com/mimno/Mallet/blob/af1fcb1f3e6561afac28f4331e4e0d735b3d11b4/src/cc/mallet/topics/tui/TopicTrainer.java#L138-L139

same for topic inference, etc.

Also, I have tried to install 3.6.0 using conda, and its not offering it - just 3.5.0-py36h830ac7b_1000 - when will it be updated?

When the release happens :) ETA - end of Jan. BTW, it's better to use conda-forge (we support this) or classic PyPI (through pip).

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 11, 2019 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 11, 2019 via email
