Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing MinHash retrieval from keys for MinHashLSHForest #233

Closed
123epsilon opened this issue Feb 7, 2024 · 2 comments · Fixed by #234
Closed

Implementing MinHash retrieval from keys for MinHashLSHForest #233

123epsilon opened this issue Feb 7, 2024 · 2 comments · Fixed by #234

Comments

@123epsilon
Copy link
Contributor

Hi, I've been using the MinHashLSHForest class to do some deduplication, and part of that pipeline is to retrieve the top-k items and then estimate the Jaccard similarities with each of those items, obviously this requires reconstructing the MinHash object related to the given key returned by MinHashLSHForest.query().

This seems like a decently common use-case since we often screen the results of LSH Forest using the Jaccard Similarity, my question is do you feel that this is common enough to implement such a function as a part of the MinHashLSHForest class?

I've implemented a simple way to recompute the original hashvalues array from the keys dictionary in MinHashLSHForest as follows, I'd be happy to submit a PR but just wanted to know how this aligned with the vision for this package

"""
Takes the list of bytes-like generated by the LSH Forest 
that corresponds to some given key and recovers the hashvalues
which can be converted back into a MinHash to compute Jaccard Similarity

Given a number of prefix trees, L, when we insert a (key, MinHash) pair 
the LSH Forest creates L byteslike items each corresponding to a range of 
hashvalues from the original MinHash object for a given key. Each range is of
size num_perm / L. Therefore here we convert these items from byteslikes back into
arrays of unsigned integers and then concatenate them so that they are in a representation
that we can build a MinHash object with. Namely, we return an array of unsigned integers
of length num_perm that represent hashvalues from each of num_perm hash functions
chosen during the MinHash creation.
"""
def byteslist_to_hashvalues(byteslist):
    hashvalues = np.array([], dtype=np.uint64)
    for item in byteslist:
        # unswap the bytes, as their representation is flipped during storage
        hv_segment = np.frombuffer(item, dtype=np.uint64).byteswap()
        hashvalues = np.append(hashvalues, hv_segment)
    
    return hashvalues

where we might call this by using

lsh = MinHashLSHForest(...)
hv = byteslist_to_hashvalues(lsh.keys[mykey])
mh = MinHash(hashvalues=hv)

# now use mh.jaccard() ...
...

A unit test would involve inserting a MinHash into MinHashLSHForest and then reconstructing it and checking for jaccard_sim == 1.0.

@ekzhu
Copy link
Owner

ekzhu commented Mar 1, 2024

Sounds like a good plan. I think it would be good to have something like this.

@123epsilon
Copy link
Contributor Author

Sounds like a good plan. I think it would be good to have something like this.

@ekzhu Great! I just opened a PR: #234

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants