Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(LDA): Search complexity in SparseVector (word-topic and doc-topic vector) is log(K). Please consider to use HashVector in Breeze (or OpenAddressHashArray in Breeze) that is with O(1) #28

Open
hucheng opened this issue Aug 26, 2015 · 2 comments

Comments

@hucheng
Copy link
Contributor

hucheng commented Aug 26, 2015

During sampleToken, when computing probability of Nkd*(Nkw+beta), given a topic k in Nkd, find the corresponding Nkw of that topic k is O(Kw). If we choose to use HashVector, the complexity can be reduced to O(1), even with a little bit space overhead.
Since HashVector in Breeze is not serializable, so please consider to directly use OpenAddressHashArray in Breeze.

@hucheng
Copy link
Contributor Author

hucheng commented Aug 26, 2015

Note that the aggregating (sum) of two HashVectors cost more than sum of two sparseVector. So the key is to avoid the new operation but with in-place add. Please be referred to aggregateByKey.

@bhoppi
Copy link
Contributor

bhoppi commented Sep 16, 2015

I tried several open address Hashmap implementations used as HashVector backend, but surprisingly they are all slower than SparseVector (Java's HashMap >> Spark's OpenHashMap > Breeze's OpenAddressHashArray > fastutil's int-int map). Maybe because the conflicts happen at the average of log(K) times.
But now this issue is perfectly solved using a very easy trick: before we sample all the edges of one term, we transform the term's count vector into DenseVector first. Then finding n_kw will just need O(1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants