
Add random_seed to LdaMallet #2153

Merged
17 commits merged into piskvorky:develop on Jan 10, 2019

Conversation

ChrisPalmerNZ
Contributor

Including a random_seed parameter enables consistent results from Mallet.
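For context, a minimal usage sketch of what this parameter enables (the mallet path and seed value below are placeholders, not taken from this PR):

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]

# passing the same random_seed on two runs should make Mallet reproduce the same topics
model = LdaMallet(
    '/path/to/mallet-2.0.8/bin/mallet',  # placeholder path to the mallet binary
    corpus=corpus, id2word=dictionary, num_topics=2, random_seed=42,
)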

Contributor

@menshikh-iv left a comment


thanks for the PR @Zohaggie, also please add

  • loading an old Mallet model with the new code
  • training with random_seed

@@ -122,6 +124,7 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
self.workers = workers
self.optimize_interval = optimize_interval
self.iterations = iterations
self.random_seed = random_seed
Contributor

Define a custom load function for old mallet models (without this option), see an example: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py#L348-L355
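For reference, a minimal sketch of such a load override, assuming the same pattern as the linked tfidfmodel example (the final merged code may differ):

@classmethod
def load(cls, *args, **kwargs):
    """Load a previously saved LdaMallet model, filling in `random_seed` for models saved before this attribute existed."""
    model = super(LdaMallet, cls).load(*args, **kwargs)
    if not hasattr(model, 'random_seed'):
        # an older model trained without this option; fall back to whichever default the constructor ends up using
        model.random_seed = 0
    return model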

Contributor Author

Please check I have coded and placed this correctly... Should I include logging as in the example?

Training with random_seed - what is required there?

@@ -100,6 +100,8 @@ def __init__(self, mallet_path, corpus=None, num_topics=100, alpha=50, id2word=N
Number of training iterations.
topic_threshold : float, optional
Threshold of the probability above which we consider a topic.
random_seed: int, optional
Random seed to ensure consistent results, default is None
Contributor

No need to write the default value in the docstring description, and add a "." at the end of the sentence.

Contributor Author

I spelled out the default so that users know that they need not enter a random_seed parameter at all.

Contributor

Default parameters are automatically shown in the documentation, so there is no need to duplicate them in the docstring.

@@ -268,11 +271,16 @@ def train(self, corpus):
cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '\
'--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '\
'--num-iterations %s --inferencer-filename %s --doc-topics-threshold %s'

if self.random_seed != None:
Contributor

simply if self.random_seed is enough

Contributor Author

The reason for None is so that the random seed is not invoked unless a value is explicitly passed, which can be zero. If I test just with if self.random_seed and zero is passed, it will not enter this if - yet zero is a valid random_seed value.

Contributor

so, in this case, it should be if self.random_seed is not None (see best practices)
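A minimal sketch of that pattern as it might apply here, assuming Mallet's --random-seed command-line option (the exact wiring in the merged code may differ):

# only add the seed option when a seed was explicitly passed;
# identity check against None, because 0 is a valid seed value
if self.random_seed is not None:
    cmd += ' --random-seed %s' % self.random_seed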

cmd = cmd % (
self.fcorpusmallet(), self.num_topics, self.alpha, self.optimize_interval,
self.workers, self.fstate(), self.fdoctopics(), self.ftopickeys(), self.iterations,
self.finferencer(), self.topic_threshold
)

Contributor

why this?

Contributor Author

Actually I just inserted a blank line after the cmd definition for readability - it is not required.

@ChrisPalmerNZ
Contributor Author

I use Anaconda and have not been confident enough to install virtualenv and tox, so I cannot check this myself, but the Travis CI tox check for Python 2.7 is failing - are you able to tell me what I need to do to fix it?

@menshikh-iv
Contributor

@Zohaggie next time see the Travis log; here is the current state (PEP8 issues):

  /home/travis/build/RaRe-Technologies/gensim$ /home/travis/build/RaRe-Technologies/gensim/.tox/flake8/bin/flake8 gensim/ 
gensim/models/wrappers/ldamallet.py:104:54: W291 trailing whitespace
        """
        Parameters
        ----------
        mallet_path : str
            Path to the mallet binary, e.g. `/home/username/mallet-2.0.7/bin/mallet`.
        corpus : iterable of iterable of (int, int), optional
            Collection of texts in BoW format.
        num_topics : int, optional
            Number of topics.
        alpha : int, optional
            Alpha parameter of LDA.
        id2word : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            Mapping between tokens ids and words from corpus, if not specified - will be inferred from `corpus`.
        workers : int, optional
            Number of threads that will be used for training.
        prefix : str, optional
            Prefix for produced temporary files.
        optimize_interval : int, optional
            Optimize hyperparameters every `optimize_interval` iterations
            (sometimes leads to Java exception 0 to switch off hyperparameter optimization).
        iterations : int, optional
            Number of training iterations.
        topic_threshold : float, optional
            Threshold of the probability above which we consider a topic.
        random_seed: int, optional
            Random seed to ensure consistent results.   
        """
                                                     ^
gensim/models/wrappers/ldamallet.py:277:1: W293 blank line contains whitespace
        
^
gensim/models/wrappers/ldamallet.py:579:1: W293 blank line contains whitespace
    
^
gensim/models/wrappers/ldamallet.py:582:1: E302 expected 2 blank lines, found 1
def malletmodel2ldamodel(mallet_model, gamma_threshold=0.001, iterations=50):
^

Also, don't forget about writing tests please

@ChrisPalmerNZ
Contributor Author

I am unable to determine how to pass the Travis test, as it seems to be highlighting old issues with whitespace in blank lines, which I think I have fixed in the latest commit.

Also, this is the first time I have contributed to a project, so I am unsure what's required to write tests.

Please assist with some more guidance!

@menshikh-iv
Contributor

So, PEP8 can be easily fixed; more important here are the tests. Let me try to describe what's needed.

Add 2 tests (methods) to this class https://github.com/RaRe-Technologies/gensim/blob/17fa0dcea8bb7824f0e709fd3ff60007bcdd85f6/gensim/test/test_ldamallet_wrapper.py#L30

  • test_load_model
    1. (before writing the test) train an LDA Mallet model with the last gensim version (simply install it from PyPI) on a very small dataset
    2. (before writing the test) save the model and add this data to the gensim/test/test_data folder
    3. write a test where you simply load this model & check that it works (for example - try to apply it and update it)
  • test_random_seed (a rough sketch follows this list)
    1. Define a seed
    2. Train the first model with the seed
    3. Train the second model with the seed
    4. Check that the models are the same
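A rough sketch of test_random_seed (the test-class attributes such as self.mallet_path and self.corpus are assumptions based on the existing test file; the real test may differ):

def test_random_seed(self):
    # two models trained with the same seed should learn identical word-topic matrices
    model_1 = ldamallet.LdaMallet(
        self.mallet_path, self.corpus, num_topics=2, iterations=100, random_seed=42)
    model_2 = ldamallet.LdaMallet(
        self.mallet_path, self.corpus, num_topics=2, iterations=100, random_seed=42)
    self.assertTrue(np.allclose(model_1.word_topics, model_2.word_topics))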

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Aug 11, 2018

OK, thanks. I might take a day or two to complete this, juggling it with other commitments.

This program by default loads common_texts as a sample text - should I use that? It might be necessary to use something a bit bigger, as there are only 9 documents and 12 unique words in common_texts.

@menshikh-iv
Contributor

@ChrisPalmerNZ
Contributor Author

OK, thanks - I will use it. I have already experimented with what the current test does: it saves the mallet model and data into the user\temp directory, but in this test you want me to save into a sub-directory under gensim\test\test_data? I say sub-directory as there are a number of files comprising a mallet model and I don't want to clutter up the test_data directory. Can you confirm my interpretation of things please?

@menshikh-iv
Contributor

@Zohaggie you should create a folder in test_data for this model and store the model there (the dataset is already in test_data, no additional actions needed).

@ChrisPalmerNZ
Contributor Author

Thanks...

@menshikh-iv
Contributor

ping @Zohaggie, are you planning to finish the PR?

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Aug 29, 2018 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 12, 2018 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 14, 2018 via email

@menshikh-iv
Contributor

Sorry for the wait @Zohaggie, can you please submit the code first (it sounds like you've made it correctly, but something is going wrong) and we'll have a look at the concrete pieces together.

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 26, 2018

Hi Ivan. I have added code but I need guidance from you regarding the test for model equality. I didn't add it yet, but was wondering if something like this is required:

passed = False
passed2 = False
for i in range(5):  # restart at most 5 times
    # create the transformation models
    model = ldamallet.LdaMallet(mallet_path, corpus, id2word=dictionary, num_topics=2, iterations=200, prefix=prefix, random_seed=10)
    model2 = ldamallet.LdaMallet(mallet_path, corpus, id2word=dictionary, num_topics=2, iterations=200, prefix=prefix, random_seed=10)
    # transform one document
    doc = list(corpus)[0]
    transformed = model[doc]
    transformed2 = model2[doc]
    vec = matutils.sparse2full(transformed, 2)  # convert to dense vector, for easier equality tests
    vec2 = matutils.sparse2full(transformed2, 2)
    expected = [0.49, 0.51]
    # must contain the same values, up to re-ordering
    passed = np.allclose(sorted(vec), sorted(expected), atol=1e-1)
    passed2 = np.allclose(sorted(vec2), sorted(expected), atol=1e-1)
    if passed and passed2:
        model.save(model_save_name)
        break
    if not passed:
        logging.warning(
            "LDA model failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec), sorted(expected)
        )
    if not passed2:
        logging.warning(
            "LDA model2 failed to converge on attempt %i (got %s, expected %s)",
            i, sorted(vec2), sorted(expected)
        )

@menshikh-iv
Contributor

@Zohaggie thanks for the code, here are my notes about it:

  • Good idea: just compare 2 matrices:
    • Words X Topics (extracted from the model)
    • Topics X Docs - same as you do now, but with the whole corpus instead of 1 document (a more precise approach); a rough sketch follows this list
  • Why can the order change (so that you need sorted) if you pin a seed? Is this an issue with Mallet?
  • Does your current code fail? At what moment? Maybe you need to explicitly pin some other parameters (like the number of epochs or something similar)?
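For example, the Topics X Docs comparison over the whole corpus could look roughly like this (a sketch only, reusing model and model2 from the snippet above):

# compare document-topic distributions for the whole corpus, not just one document
dense = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
dense2 = matutils.corpus2dense(model2[corpus], num_terms=model2.num_topics)
assert np.allclose(dense, dense2)
# and the Words X Topics matrices extracted directly from the models
assert np.allclose(model.word_topics, model2.word_topics)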

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Sep 28, 2018

Hi Ivan

When you say "Good idea", is it regarding my idea for the code posted in my comments here? The reason I asked about this was because I was not sure whether something like this was needed in addition to what I already have in place in my function test_random_seed. Can you give me some evaluation of that code and let me know if it needs more (along the lines of the suggested code)?

Are you happy with the function test_load_model?

I may get back to you for clarification, but in the meantime, no, the code has not failed at all in my testing.

Can you tell me why the checks are failing?

Regarding the use of sorted, or any other design in my test code, I did not know what code to use so I copied existing code. Are you able to point me to better patterns to copy?

@Jcerwin

Jcerwin commented Sep 30, 2018

Thanks for adding this feature. Any ideas what version it will go into and when?

@ChrisPalmerNZ
Contributor Author

Hi Ivan

Still waiting to hear back from you with clarification about what I need to add and change... Are you happy with test_load_model? Do I need to add extra tests to test_random_seed or replace the existing tests?

Thanks
Chris

@Jcerwin

Jcerwin commented Oct 13, 2018

Ivan and Chris, our team is waiting on the random_seed functionality. When do you think it will be released?

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Oct 14, 2018 via email

@menshikh-iv
Contributor

menshikh-iv commented Oct 15, 2018

@Zohaggie really sorry for the wait (busy month), I'll try to take a deeper look this week.

@Jcerwin the next release is not scheduled yet; my feeling is that it will be within the current year. If you need to use this feature ASAP, you can just install gensim from source (GitHub) after we merge the current PR.

@Jcerwin

Jcerwin commented Oct 15, 2018

Thank you both Ivan and Chris!

@menshikh-iv changed the title from "Added random_seed parameter to ldamallet wrapper" to "Add random_seed to LdaMallet" on Jan 10, 2019
@menshikh-iv
Contributor

big thanks @Zohaggie, congrats on your first contribution 🥇

@menshikh-iv merged commit 01f4ac8 into piskvorky:develop on Jan 10, 2019
@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 10, 2019

Ivan, I see that you have a default of 0 rather than None, and have stated that 0 will use the system clock. Can you give me a URL to where the Mallet documentation says that a 0 value for random-seed uses the system clock?

Also, I have tried to install 3.6.0 using conda, and it's not offering it - just 3.5.0-py36h830ac7b_1000 - when will it be updated?

@menshikh-iv
Contributor

Ivan, I see that you have a default of 0 rather than None, and have stated the 0 will use the system clock. Can you give me a URL to where the Mallet documentation uses the system clock with a 0 value in random-seed?

https://github.com/mimno/Mallet/blob/af1fcb1f3e6561afac28f4331e4e0d735b3d11b4/src/cc/mallet/topics/tui/TopicTrainer.java#L138-L139

same for topic inference, etc.

Also, I have tried to install 3.6.0 using conda, and its not offering it - just 3.5.0-py36h830ac7b_1000 - when will it be updated?

When the release happens :) ETA - end of Jan. BTW, it's better to use conda-forge (we support this) or classic PyPI (through pip).

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 11, 2019 via email

@ChrisPalmerNZ
Contributor Author

ChrisPalmerNZ commented Jan 11, 2019 via email
