Exposing number of bytes to keep & hashing algorithm in the expression minhash() #2958

MisterKloudy · 2024-09-27T15:31:51Z

I would like to compute a MinHash with alternative hash algorithms, such as the first four bytes of SHA1, as implemented in https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication_spark.py

The deduplication rate is empirically much better with this PySpark implementation, which I am guessing is due to the higher collision rate caused by truncating the hash.
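To make the request concrete, here is a minimal pure-Python sketch of the kind of hashing I mean: each word n-gram is hashed with SHA1, only the first four bytes are kept, and a set of random affine permutations produces the signature. This is only an illustration under my own assumptions. The names `sha1_hash32`, `word_ngrams`, and `minhash_signature`, the parameters, and the choice of modulus are hypothetical and are not the current `minhash()` API.

```python
import hashlib
import random
import struct

# Assumed constants for this sketch: a large prime modulus for the
# permutation family, and a 32-bit mask matching the 4-byte truncation.
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def sha1_hash32(data: bytes) -> int:
    """Return the first four bytes of the SHA1 digest as an unsigned 32-bit int."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def word_ngrams(text: str, n: int) -> set:
    """Split on whitespace and return the set of word n-grams, UTF-8 encoded."""
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < n:
        # Too short for a full n-gram: treat the whole text as one shingle.
        return {" ".join(tokens).encode("utf-8")}
    return {
        " ".join(tokens[i:i + n]).encode("utf-8")
        for i in range(len(tokens) - n + 1)
    }


def minhash_signature(text: str, num_perm: int = 16, ngram_size: int = 3, seed: int = 42) -> list:
    """Compute a MinHash signature using truncated SHA1 as the base hash."""
    rng = random.Random(seed)
    # One (a, b) pair per permutation: h -> ((a * h + b) mod p) & MAX_HASH.
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    hashes = [sha1_hash32(shingle) for shingle in word_ngrams(text, ngram_size)]
    if not hashes:
        # Empty input: fall back to the maximum value in every slot.
        return [MAX_HASH] * num_perm
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


if __name__ == "__main__":
    sig_a = minhash_signature("the quick brown fox jumps over the lazy dog")
    sig_b = minhash_signature("the quick brown fox jumps over the lazy cat")
    # Fraction of matching slots estimates the Jaccard similarity of the n-gram sets.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    print(f"estimated similarity: {matches / len(sig_a):.2f}")
```

Exposing the base hash function and the number of bytes to keep as parameters of `minhash()` would let both behaviours be compared on the same data.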

jaychia · 2024-11-07T00:06:13Z

Has this been completed @andrewgazelka? I see #3052 has been merged!

jaychia assigned andrewgazelka Oct 15, 2024
