Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from smhasher to mmh3 #6

Merged
merged 1 commit into from
May 30, 2017
Merged

Conversation

dan-blanchard
Copy link
Contributor

A number of our tests needed to have updated results after this change, so I'm not sure if this is correct or not. The main issue with this change is that for some reason smhasher treats all results from MurmurHash3 as unsigned big-endian integers, whereas mm3's 64 bit hash function returns a pair of little-endian signed integers. smhasher supported specifying unsigned seeds of any size, but the conversion to C long values would only take the first 32-bits worth of the number because it uses the I format specifier, which has no overflow checking. There's a pull request to switch mmh3 to this same method that has been open for a long time, but I don't think it will be merged (hajimes/mmh3#6), so I manually try to do the same overflow handling by dividing by 2^32. I believe the results still differ because of the endianness difference in how the two libraries treat their results, but I couldn't quite figure out if any of our code needs to be updated for that (or to handle mmh3 returning signed ints).

@kbourgoin
Copy link

What are we still using probably for and what motivated this change?

@conradlee
Copy link

conradlee commented May 12, 2017 via email

@kbourgoin
Copy link

@dan-blanchard this is probably fine, but it isn't my area of expertise. This is definitely going to change the results of hashes, so will that have consequences elsewhere? If this operates in structures that only exist in memory, it'll be fine. But if this is going to load an old HLL from somewhere and start adding hashes, then we're going to have problems.

@dan-blanchard
Copy link
Contributor Author

Surprisingly, hajimes/mmh3#6 got merged, so I'm going to tweak this to just require mmh3>=2.4 and not need the workarounds I was using before.

@dan-blanchard
Copy link
Contributor Author

I've updated this to just workaround the endianness differences between how mmh3 and smhasher interface with the C code and I'm getting exactly the same results as before, so this is way better now, and I'm going to merge.

@kbourgoin
Copy link

That's great!

@dan-blanchard dan-blanchard merged commit ee4972f into master May 30, 2017
@dan-blanchard dan-blanchard deleted the feature/switch_to_mmh3 branch May 30, 2017 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants