I would like MinHash to support alternative hash algorithms, such as using the first four bytes of SHA1 as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py
The deduplication rate is empirically much better in that PySpark implementation, which I am guessing is due to the higher collision rate that comes from truncating the hash.
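For reference, here is a minimal sketch of how a SHA1-truncated 32-bit hash could be plugged into a plain MinHash signature computation. This is not the bigcode implementation; the function names, permutation scheme, and shingle examples below are illustrative assumptions, shown only to make the request concrete:

```python
import hashlib
import random
import struct

MAX_HASH = (1 << 32) - 1
PRIME = (1 << 61) - 1  # Mersenne prime used for the random affine permutations.


def sha1_hash32(data: bytes) -> int:
    """Map `data` to a 32-bit integer using the first four bytes of its SHA1 digest."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def minhash_signature(shingles, num_perm=256, seed=42):
    """Compute a MinHash signature for a non-empty iterable of byte-string shingles."""
    rng = random.Random(seed)
    perms = [(rng.randint(1, PRIME - 1), rng.randint(0, PRIME - 1)) for _ in range(num_perm)]
    hashes = [sha1_hash32(s) for s in shingles]
    signature = []
    for a, b in perms:
        # Minimum over all shingles of the permuted, 32-bit-truncated hash value.
        signature.append(min(((a * h + b) % PRIME) & MAX_HASH for h in hashes))
    return signature


# Example: estimate the Jaccard similarity of two small shingle sets.
doc_a = {b"the quick brown", b"quick brown fox", b"brown fox jumps"}
doc_b = {b"the quick brown", b"quick brown fox", b"brown fox sleeps"}
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
similarity = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(f"Estimated Jaccard similarity: {similarity:.2f}")
```

The point of the request is the `sha1_hash32` step: swapping the default hash for a truncated SHA1 changes the collision behavior of the underlying 32-bit values, which is presumably what drives the difference in deduplication rate.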