Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

2nd (safe) rewrite of MutableDictionaryArray #1561

Merged
merged 5 commits into from
Sep 4, 2023

Conversation

aldanor
Copy link
Contributor

@aldanor aldanor commented Sep 3, 2023

As suggested by @ritchie46, using hashbrown's raw-entry API (flipped <K, usize> to <usize, K> though).

Need to be very careful to not use any default map API though since it will break map consistency (like .insert(), .collect(), etc).

Benchmarks are roughly the same as in #1555, not slower, not faster; miri should be happy though.

@codecov
Copy link

codecov bot commented Sep 3, 2023

Codecov Report

Patch coverage: 71.71% and project coverage change: -0.06% ⚠️

Comparison is base (b2017d7) 83.08% compared to head (1d8afd9) 83.02%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1561      +/-   ##
==========================================
- Coverage   83.08%   83.02%   -0.06%     
==========================================
  Files         389      391       +2     
  Lines       42640    42786     +146     
==========================================
+ Hits        35427    35525      +98     
- Misses       7213     7261      +48     
Files Changed Coverage Δ
src/array/dictionary/mod.rs 95.32% <ø> (ø)
src/array/mod.rs 79.72% <ø> (ø)
src/array/indexable.rs 40.84% <40.84%> (ø)
src/array/dictionary/mutable.rs 73.28% <83.33%> (-5.86%) ⬇️
src/array/dictionary/value_map.rs 91.56% <91.56%> (ø)
src/compute/cast/primitive_to.rs 93.04% <100.00%> (ø)

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

(Don't merge yet, I'll add a few very quick tests)

@ritchie46
Copy link
Collaborator

ritchie46 commented Sep 3, 2023

Nice! Thanks for the quick turnaround. We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.

But that's definitely not a blocker for this PR.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.

As in <K, K>? (it will fail otherwise anyway upon key conversion). Or, hold on, even <K, ()> (i.e. technically, a HashSet), because keys and values are equal anyways...

You would need to convert is back and forth to usize though to get the actual array index, not quite sure if cache coherence offsets potential extra branching (might quickly check though).

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

@ritchie46 ok, using <K, ()> seems to work, there's two minor quirks though:

  • in RawEntryMutBuilder::from_hash(), we cannot handle errors other than unwrap results since it's a closure in which we don't control return type, but I think there we can use unsafe { key.as_usize() } since we've already converted that key from usize previously, so it implies it can be unsafely converted back?
  • DictionaryKey has to derive from Hash just so that some HashMap bounds are satisfied (which is a bit dumb because we'll never actually hash it, but it doesn't matter that much since those are just ints). Tbh we might as well just derive NativeType: Hash since those are just primitives... (or just the DictionaryKey).

Benches seem to improve a bit further by ~10%: (now 1.7x from original for utf8, 3.5x for u64)

dict_utf8               time:   [141.11 µs 141.44 µs 141.82 µs]
                        change: [-11.502% -11.251% -10.984%] (p = 0.00 < 0.05)
                        Performance has improved.

dict_u64                time:   [35.081 µs 35.169 µs 35.259 µs]
                        change: [-8.4965% -8.1730% -7.8369%] (p = 0.00 < 0.05)
                        Performance has improved.

Can clean it up a bit and push it too here if it's of interest.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

(Pushed the <K, ()> commit as well to see if that makes sense)

I think it only makes sense because we fully manage MutableDictionArray internally, i.e. you can't directly load those internal indices from anywhere else.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

Since #1559 has been merged, I'll have to rebase this on main again, ok then... 🤦

@aldanor
Copy link
Contributor Author

aldanor commented Sep 3, 2023

Ok, I've rebased on revert of the revert, should be ok now.

Tests were added.

Unit-value hashmap added = waiting on feedback from @ritchie46 (if I missed something and there's some gaps we can always revert that part if needed).

Otherwise, nothing else left to do in this PR I think unless there's any particular comments, so should be good to go. (clippy error in CI is some unrelated spurious network error)

@ritchie46
Copy link
Collaborator

Since #1559 has been merged, I'll have to rebase this on main again, ok then... 🤦

@sundy-li is trigger happy. 🙈 😜

@ritchie46
Copy link
Collaborator

Thanks for doing the rewrite @aldanor. Good to go.

@ritchie46 ritchie46 merged commit 87ab844 into jorgecarleitao:main Sep 4, 2023
25 checks passed
@aldanor aldanor deleted the feature/safer-mutable-dict branch September 4, 2023 18:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants