Implementing MinHash retrieval from keys for MinHashLSHForest #233

123epsilon · 2024-02-07T00:02:50Z

Hi, I've been using the MinHashLSHForest class to do some deduplication, and part of that pipeline is to retrieve the top-k items and then estimate the Jaccard similarities with each of those items, obviously this requires reconstructing the MinHash object related to the given key returned by MinHashLSHForest.query().

This seems like a decently common use-case since we often screen the results of LSH Forest using the Jaccard Similarity, my question is do you feel that this is common enough to implement such a function as a part of the MinHashLSHForest class?

I've implemented a simple way to recompute the original hashvalues array from the keys dictionary in MinHashLSHForest as follows, I'd be happy to submit a PR but just wanted to know how this aligned with the vision for this package

"""
Takes the list of bytes-like generated by the LSH Forest 
that corresponds to some given key and recovers the hashvalues
which can be converted back into a MinHash to compute Jaccard Similarity

Given a number of prefix trees, L, when we insert a (key, MinHash) pair 
the LSH Forest creates L byteslike items each corresponding to a range of 
hashvalues from the original MinHash object for a given key. Each range is of
size num_perm / L. Therefore here we convert these items from byteslikes back into
arrays of unsigned integers and then concatenate them so that they are in a representation
that we can build a MinHash object with. Namely, we return an array of unsigned integers
of length num_perm that represent hashvalues from each of num_perm hash functions
chosen during the MinHash creation.
"""
def byteslist_to_hashvalues(byteslist):
    hashvalues = np.array([], dtype=np.uint64)
    for item in byteslist:
        # unswap the bytes, as their representation is flipped during storage
        hv_segment = np.frombuffer(item, dtype=np.uint64).byteswap()
        hashvalues = np.append(hashvalues, hv_segment)
    
    return hashvalues

where we might call this by using

lsh = MinHashLSHForest(...)
hv = byteslist_to_hashvalues(lsh.keys[mykey])
mh = MinHash(hashvalues=hv)

# now use mh.jaccard() ...
...

A unit test would involve inserting a MinHash into MinHashLSHForest and then reconstructing it and checking for jaccard_sim == 1.0.

The text was updated successfully, but these errors were encountered:

ekzhu · 2024-03-01T06:42:03Z

Sounds like a good plan. I think it would be good to have something like this.

123epsilon · 2024-03-02T20:55:05Z

Sounds like a good plan. I think it would be good to have something like this.

@ekzhu Great! I just opened a PR: #234

123epsilon mentioned this issue Mar 2, 2024

Retrieve MinHash from LSHForest #234

Merged

ekzhu closed this as completed in #234 Mar 11, 2024

DavidNemeskey mentioned this issue Aug 22, 2024

MinHash retrieval by key #242

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing MinHash retrieval from keys for MinHashLSHForest #233

Implementing MinHash retrieval from keys for MinHashLSHForest #233

123epsilon commented Feb 7, 2024

ekzhu commented Mar 1, 2024

123epsilon commented Mar 2, 2024

Implementing MinHash retrieval from keys for MinHashLSHForest #233

Implementing MinHash retrieval from keys for MinHashLSHForest #233

Comments

123epsilon commented Feb 7, 2024

ekzhu commented Mar 1, 2024

123epsilon commented Mar 2, 2024