Fix: native hashed-hash for MutableDictionaryArray #1564
Conversation
src/array/dictionary/value_map.rs (outdated)

```rust
fn write(&mut self, value: &[u8]) {
    #[cfg(target_endian = "little")]
    {
        for (index, item) in value.iter().enumerate().take(8) {
            // ...
```
This is also sort of overkill, tbh... because we know that `value.len() == 8` always (or rather, "almost surely"...).
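For illustration, here's a minimal sketch of that simplification (hypothetical names, not the PR's final code), under the invariant the reviewer states: if `write` is only ever fed the 8 bytes of an already-computed `u64` hash, the per-byte loop collapses into a single native-endian read.

```rust
use std::hash::Hasher;

struct PassthroughHasher {
    hash: u64,
}

impl Hasher for PassthroughHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, value: &[u8]) {
        // relies on the invariant discussed above: value.len() == 8
        debug_assert_eq!(value.len(), 8);
        self.hash = u64::from_ne_bytes(value.try_into().unwrap());
    }
}
```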
Codecov Report

Patch coverage: […]. Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main    #1564     +/-   ##
=========================================
- Coverage   83.02%   83.02%    -0.01%
=========================================
  Files         391      391
  Lines       42786    42809      +23
=========================================
+ Hits        35523    35542      +19
- Misses       7263     7267       +4
```

☔ View full report in Codecov by Sentry.
(Forgot to clarify: the added test currently fails on main.)
@aldanor I believe we can make it a little bit simpler by having a

```rust
type IdHashmap<K, V> = hashbrown::HashMap<K, V, IdBuildHasher>;
type IdBuildHasher = BuildHasherDefault<IdHasher>;
```

where

```rust
// Default is needed so BuildHasherDefault can construct the hasher.
#[derive(Default)]
pub struct IdHasher {
    hash: u64,
}

impl Hasher for IdHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, _bytes: &[u8]) {
        // required by the Hasher trait; this hasher is only ever fed
        // precomputed u64 hashes via write_u64
        unimplemented!("IdHasher should only be used through write_u64")
    }

    #[inline]
    fn write_u64(&mut self, i: u64) {
        self.hash = i;
    }
}
```

And then we have a

```rust
// hash + value
#[derive(Eq, Copy, Clone)]
struct Key<T: Copy> {
    hash: u64,
    value: T,
}

impl<T: Copy> Hash for Key<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash)
    }
}

impl<T: Copy + PartialEq> PartialEq for Key<T> {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}
```

And then we are responsible for initializing the `hash` field ourselves.
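To make that last point concrete, here's a hypothetical usage sketch (my own illustration building on the definitions above, not code from the PR): the real hash is computed once up front and stored inside the key, and `IdHasher` merely passes it through, so the map reproduces the same hash whenever it rehashes, e.g. on resize.

```rust
use std::hash::{Hash, Hasher};

fn get_or_insert<T: Copy + Eq + Hash>(
    map: &mut IdHashmap<Key<T>, u32>,
    value: T,
    next_index: u32,
) -> u32 {
    // compute the "real" hash exactly once, with an ordinary hasher
    // (DefaultHasher is just for the sketch; any real hasher works)
    let hash = {
        let mut h = std::collections::hash_map::DefaultHasher::new();
        value.hash(&mut h);
        h.finish()
    };
    // Key::hash feeds this precomputed value into IdHasher via write_u64,
    // so lookups and resizes always see the same hash
    *map.entry(Key { hash, value }).or_insert(next_index)
}
```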
@ritchie46 Oh, very good point indeed, I totally forgot about the fact that […]
(updated with a passthrough hasher that uses `write_u64`)
I think we can do with using the index. That's what we do in polars for strings:

```rust
#[derive(Copy, Clone)]
struct Key {
    pub(super) hash: u64,
    pub(super) idx: u32,
}
```

And then the comparison is something like this:

```rust
table.raw_entry_mut().from_hash(h, |key| {
    // first compare the hash before we incur the cache miss
    key.hash == h && {
        let idx = key.idx as usize;
        unsafe { keys.get_unchecked_release(idx).as_deref() == key_val }
    }
})
```
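For context, here's a self-contained sketch of that index-based pattern as a tiny string interner (hypothetical names and structure, not polars' actual code; it assumes a hashbrown version that still exposes the raw-entry API):

```rust
use hashbrown::hash_map::RawEntryMut;
use hashbrown::HashMap;
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

#[derive(Copy, Clone)]
struct Key {
    hash: u64,
    idx: u32,
}

struct Interner {
    // unit values: the payload is the idx stored inside Key itself
    table: HashMap<Key, ()>,
    strings: Vec<String>,
    random_state: RandomState,
}

impl Interner {
    /// Returns the index of `s`, inserting it on first sight.
    fn intern(&mut self, s: &str) -> u32 {
        let h = self.random_state.hash_one(s);
        let strings = &mut self.strings;
        let entry = self.table.raw_entry_mut().from_hash(h, |key| {
            // cheap hash comparison first, then the potential cache miss
            key.hash == h && strings[key.idx as usize] == s
        });
        match entry {
            RawEntryMut::Occupied(o) => o.key().idx,
            RawEntryMut::Vacant(v) => {
                let idx = strings.len() as u32;
                strings.push(s.to_owned());
                // on resize the table re-hashes keys with this closure,
                // which simply reuses the stored hash
                v.insert_with_hasher(h, Key { hash: h, idx }, (), |k| k.hash);
                idx
            }
        }
    }
}
```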
@ritchie46 That's almost exactly what we already have here :) (aside from the additional hash-equality check, which I can add too if it improves things; and with the key being of […])

It's actually quite hilarious, I think, that I had to go through all circles of hell in the last PRs here, introduce more bugs while fixing previous bugs, then optimize it while fixing them, and finally arrive at pretty much precisely the polars solution 😆
I see now. Aside from the naming, very similar indeed. :) I have left one comment about the assert. Other than that, I think this is good to go. 👍
Here's yet another bugfix / refactor of mutable dictionary arrays (due to an incorrect solution in #1561)... tested it on some arrays of 100M+ in size, and this time it seems to actually work correctly.

It was previously correct on small maps, but as soon as the hashmap resized itself, hash consistency would vanish: on resize, the map recomputes key hashes, so unless the hasher reproduces the originally stored hash exactly, entries migrate to the wrong buckets.
There's a somewhat flimsy part (a flimsy invariant, rather) in assuming that the hash of a u64 hash computed via `HashHasherNative` will be identical to the hash itself, given that it's built in an endian-dependent way. I've tested it on M1 (LE) and it's correct; I'm guessing it will be correct on BE as well.

Not using `hash_hasher::HashHasher` directly because it's too slow on little-endian platforms: it flips the endianness needlessly (for our use case, that is). We can bypass that step entirely (except for internal hashmap resizing), which avoids a huge slowdown.

@ritchie46
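To illustrate the invariant being relied on (my own sketch with hypothetical names, not the PR's code): `u64`'s `Hash` impl calls `write_u64`, whose default implementation forwards the value's native-endian bytes to `write`, so a hasher that reads those bytes back native-endian returns the precomputed hash unchanged, on LE and BE alike.

```rust
use std::hash::{Hash, Hasher};

#[derive(Default)]
struct NativePassthrough {
    hash: u64,
}

impl Hasher for NativePassthrough {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write(&mut self, bytes: &[u8]) {
        // fed via the default write_u64, i.e. i.to_ne_bytes()
        self.hash = u64::from_ne_bytes(bytes.try_into().unwrap());
    }
}

fn main() {
    let precomputed: u64 = 0xDEAD_BEEF_CAFE_F00D;
    let mut hasher = NativePassthrough::default();
    precomputed.hash(&mut hasher);
    // hashing the hash is the identity: no endian flip in either direction
    assert_eq!(hasher.finish(), precomputed);
}
```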
Fixes #1562