Hi, I've been using the `MinHashLSHForest` class to do some deduplication. Part of that pipeline is to retrieve the top-k items and then estimate the Jaccard similarity with each of those items, which requires reconstructing the `MinHash` object associated with each key returned by `MinHashLSHForest.query()`.
This seems like a fairly common use case, since the results of LSH Forest are often screened by Jaccard similarity. Do you feel this is common enough to implement such a function as part of the `MinHashLSHForest` class?
I've implemented a simple way to recompute the original `hashvalues` array from the `keys` dictionary in `MinHashLSHForest`, shown below. I'd be happy to submit a PR, but I wanted to check first how this aligns with the vision for the package.
"""Takes the list of bytes-like generated by the LSH Forest that corresponds to some given key and recovers the hashvalueswhich can be converted back into a MinHash to compute Jaccard SimilarityGiven a number of prefix trees, L, when we insert a (key, MinHash) pair the LSH Forest creates L byteslike items each corresponding to a range of hashvalues from the original MinHash object for a given key. Each range is ofsize num_perm / L. Therefore here we convert these items from byteslikes back intoarrays of unsigned integers and then concatenate them so that they are in a representationthat we can build a MinHash object with. Namely, we return an array of unsigned integersof length num_perm that represent hashvalues from each of num_perm hash functionschosen during the MinHash creation."""defbyteslist_to_hashvalues(byteslist):
hashvalues=np.array([], dtype=np.uint64)
foriteminbyteslist:
# unswap the bytes, as their representation is flipped during storagehv_segment=np.frombuffer(item, dtype=np.uint64).byteswap()
hashvalues=np.append(hashvalues, hv_segment)
returnhashvalues
where we might call this by using:

```python
lsh = MinHashLSHForest(...)
hv = byteslist_to_hashvalues(lsh.keys[mykey])
mh = MinHash(hashvalues=hv)
# now use mh.jaccard() ...
```
A unit test would involve inserting a `MinHash` into a `MinHashLSHForest`, reconstructing it, and checking that `jaccard_sim == 1.0`.
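As a library-free sanity check of the byte-swap round trip, the storage and recovery steps can be simulated with plain NumPy. The segment layout below (`num_perm` hashvalues split into `l` equal ranges, each stored byteswapped) is my assumption about what the forest does, mirroring the description in the docstring above:

```python
import numpy as np

# Hypothetical parameters: num_perm permutations split across l prefix trees.
num_perm, l = 128, 8
rng = np.random.default_rng(42)
original = rng.integers(0, 2**32, size=num_perm, dtype=np.uint64)

# Simulate storage: each of the l ranges is byteswapped before being
# serialized, which is why recovery must swap the bytes back.
r = num_perm // l
byteslist = [original[i * r:(i + 1) * r].byteswap().tobytes() for i in range(l)]

# Recover, using the same frombuffer + byteswap logic as byteslist_to_hashvalues.
recovered = np.concatenate(
    [np.frombuffer(item, dtype=np.uint64).byteswap() for item in byteslist]
)
assert np.array_equal(original, recovered)
```

The double `byteswap()` makes the round trip exact regardless of the host machine's endianness, since the second swap undoes the first.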