Correctly process empty documents in AuthorTopicModel #2133

Merged: 6 commits, Aug 2, 2018

Contributor

@probinso probinso commented Jul 18, 2018

This is a fix for #1589.

Empty numpy arrays default to dtype=np.float64 when no dtype is given, which makes them ineligible for use as index arrays (index arrays must have an integer or boolean dtype).
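
The failure mode can be reproduced with NumPy alone. The following is a minimal sketch under stated assumptions, not the gensim code itself: `table` and `doc` are made-up names standing in for a per-term array and an empty bag-of-words document.

```python
import numpy as np

# Hypothetical lookup table standing in for a per-term statistics array.
table = np.arange(10.0)

# An empty bag-of-words document: no (term_id, count) pairs.
doc = []

# Without an explicit dtype, an empty array falls back to float64.
ids_default = np.array([idx for idx, _ in doc])

# Float arrays are rejected as index arrays, even when they are empty.
try:
    table[ids_default]
    default_failed = False
except IndexError:
    default_failed = True

# Requesting an integer dtype up front keeps the empty case indexable.
ids_int = np.array([idx for idx, _ in doc], dtype=int)
selected = table[ids_int]
```

With the integer dtype, indexing with the empty array simply yields an empty selection instead of raising.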

@piskvorky piskvorky changed the title Fix 1589 [WIP] Correctly process empty documents in AuthorTopicModel Jul 21, 2018
@probinso
Contributor Author

@piskvorky is there anything else I need to do for this pull request?

cts = np.array([cnt for _, cnt in doc])
ids = [id for id, _ in doc]
ids = np.array(ids, dtype=np.integer)
cts = np.array([cnt for _, cnt in doc], dtype=np.integer)
Owner

@piskvorky piskvorky Jul 26, 2018

I'm not familiar with this np.integer type. How does it differ from normal np.int? What's the difference, why use one or the other?

Contributor

No difference in our case

import numpy as np

arr1, arr2 = [1, 2, 3], []

assert np.array(arr1, dtype=np.int).dtype == \
       np.array(arr1, dtype=np.integer).dtype == \
       np.array(arr2, dtype=np.int).dtype == \
       np.array(arr2, dtype=np.integer).dtype

Contributor

All of them are cast to int64 on my x64 Linux machine.

@piskvorky
Owner

piskvorky commented Jul 26, 2018

It looks good, thanks @probinso. Just a little clarification around np.integer / np.int for my sake, please.

Then we wait for @menshikh-iv to get back from holiday, review & merge :)

@probinso
Contributor Author

probinso commented Jul 30, 2018

@piskvorky

That is a good question. I'll read through the numpy code. I used what I expected to be the most general correct type. However I can tell that they are different because (np.int is np.integer) == False.
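
For context, the NumPy scalar type hierarchy treats `np.integer` as the abstract base class of all NumPy integer scalar types rather than as a concrete type, which would explain why the identity check is false. A small sketch of those relationships follows. It uses `np.int_` and `np.int64` instead of `np.int`, on the assumption that a recent NumPy release is installed, where the `np.int` alias no longer exists.

```python
import numpy as np

# Concrete integer scalar types all derive from the abstract np.integer.
concrete_types = [np.int8, np.int32, np.int64, np.uint8, np.int_]
all_derive = all(issubclass(t, np.integer) for t in concrete_types)

# np.integer itself is not the same object as a concrete type.
is_same_object = np.int_ is np.integer

# Floating-point types sit outside the integer branch of the hierarchy.
float_is_integer = issubclass(np.float64, np.integer)

# np.issubdtype answers the same question for the dtype of an actual array.
counts = np.array([1, 2, 3], dtype=np.int64)
counts_are_integer = np.issubdtype(counts.dtype, np.integer)
```

Read this way, `np.integer` looks better suited to type checks, while a concrete type such as `int` or `np.int64` is arguably the clearer choice for a `dtype=` argument.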

Contributor

@menshikh-iv menshikh-iv left a comment

Thanks @probinso, please address the current review comments and I'll merge your PR.

@@ -460,11 +460,12 @@ def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, c
# make sure the term IDs are ints, otherwise np will get upset
ids = [int(idx) for idx, _ in doc]
else:
ids = [idx for idx, _ in doc]
cts = np.array([cnt for _, cnt in doc])
ids = [id for id, _ in doc]
Contributor

Please revert to idx (id is the name of a built-in function).
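
To illustrate the naming concern, here is a small sketch with two hypothetical helpers (neither is taken from gensim). The first rebinds `id` inside the function, so the built-in is no longer reachable under that name there. The second uses `idx` and leaves the built-in untouched.

```python
def term_ids_shadowing(doc):
    """Hypothetical helper that reuses the name `id` as a loop variable."""
    ids = []
    for id, _ in doc:  # rebinds `id` for the whole function body
        ids.append(id)
    # After the loop, `id` is the last term ID, not the built-in function.
    return ids, callable(id)


def term_ids(doc):
    """Hypothetical helper that uses `idx`, leaving the built-in `id` alone."""
    ids = [idx for idx, _ in doc]
    return ids, callable(id)


doc = [(0, 2), (3, 1)]
shadowed_ids, builtin_visible_when_shadowed = term_ids_shadowing(doc)
plain_ids, builtin_visible = term_ids(doc)
```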

@@ -110,6 +109,19 @@ def testBasic(self):
jill_topics = matutils.sparse2full(jill_topics, model.num_topics)
self.assertTrue(all(jill_topics > 0))

def testEmptyDocument(self):
_local_texts = common_texts + [['only_occurs_once_in_corpus_and_alone_in_doc']]
Contributor

Why do the variable names start with an underscore? Please remove the leading underscores.

_corpus = [_dictionary.doc2bow(text) for text in _local_texts]
_a2d = author2doc.copy()
_a2d['joaquin'] = [len(_local_texts) - 1]
try:
Contributor

There is no need for a try/except block in the test: if the test raises an unexpected exception, that already means the test is broken.

try:
_ = self.class_(_corpus, author2doc=_a2d, id2word=_dictionary, num_topics=2)
except IndexError:
raise IndexError("error occurs in 1.0.0 release tag")
Contributor

???

a2d['joaquin'] = [len(local_texts) - 1]

_ = self.class_(corpus, author2doc=a2d, id2word=dictionary, num_topics=2)
assert(_)
Contributor

It would be better to retrieve a vector for some document or corpus as a "sanity check" (instead of the assertion), because _ will always be initialized.
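
The two testing suggestions above (no try/except, and checking a retrieved result rather than the object itself) can be sketched with the standard `unittest` module. This is an illustration under stated assumptions, not the test that was merged: `doc_to_arrays` and the `topic_term` matrix are hypothetical stand-ins so the example runs with NumPy alone.

```python
import unittest

import numpy as np


def doc_to_arrays(doc):
    """Hypothetical helper: split a bag-of-words doc into integer arrays."""
    ids = np.array([idx for idx, _ in doc], dtype=int)
    cts = np.array([cnt for _, cnt in doc], dtype=int)
    return ids, cts


class TestEmptyDocument(unittest.TestCase):
    def test_empty_document(self):
        # Hypothetical topics-by-terms matrix (2 topics, 5 terms).
        topic_term = np.ones((2, 5))
        ids, cts = doc_to_arrays([])

        # No try/except: an unexpected IndexError here fails the test itself.
        selected = topic_term[:, ids]

        # Check a retrieved result instead of asserting on the object.
        self.assertEqual(selected.shape, (2, 0))
        self.assertEqual(cts.sum(), 0)
```

Letting the exception propagate keeps the original traceback in the test report, which is usually more informative than a re-raised error with a custom message.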

@menshikh-iv menshikh-iv changed the title [WIP] Correctly process empty documents in AuthorTopicModel Correctly process empty documents in AuthorTopicModel Aug 2, 2018
@menshikh-iv
Contributor

menshikh-iv commented Aug 2, 2018

Thanks @probinso, congrats on your first contribution 🥇!

@menshikh-iv menshikh-iv merged commit 61728a0 into piskvorky:develop Aug 2, 2018
@probinso probinso deleted the fix_1589 branch August 2, 2018 04:32
@piskvorky
Owner

piskvorky commented Aug 2, 2018

I'm still -1 on using np.integer -- what is that and why should we use it, instead of the standard int / np.int?

Unless this change is well-understood, it sounds like a recipe for type-casting and serialization trouble.

@menshikh-iv
Contributor

@piskvorky fixed in #2145

Successfully merging this pull request may close these issues.

Problem using bound function in Author Topic model!!
3 participants