-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ones
using trailing_zeros
#18
Conversation
I ported the benchmarks to criterion (see #19), hoping to get less noisy results. I still don't trust them, but here they are:
The actual noise is at least 2%. The last benchmark is newly introduced and uses (uniformly) random bits. For the "all ones" benchmark, the slice-based implementation is best:
(Note that the change in performance is bogus, the code was not changed.) |
} | ||
self.bitset = self.remaining_blocks[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of immediately setting bitset to remaining_blocks[0] and updating remaining blocks, why don't you search for the first none zero block and only then update remaining_blocks? (just to save some assignments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(the PR has been opened for almost 3 years, so I guess nothing needs be done until the maintainer shows interest, but I figured I'd add the comment in case something comes of it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is faster, because it might mess up vectorization. Could be interesting to benchmark.
Looks good to me. Sorry for the long wait, I'm not sure if there was a reason for not including this |
I rebased it on master. |
return None; | ||
} | ||
while self.bitset == 0 { | ||
if self.remaining_blocks.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can accomplish all of is_empty, [0] and [1..]
in one method call - self.remaining_blocks.split_first()
for a presumed minor win in terms of bounds checking overhead 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or for that matter, reading split_first's implmentation with slice patterns - but I think that's a much more recent feature if let [first, tail @ ..] = self { Some((first, tail)) } else { None }
but it's good to read and use elsewhere if not here.
Looks good to me. What you think @jrraymond? If you say go I can maybe look at releasing unless you want to do it |
lgtm as well. I don't mind releasing this morning. I dumped the asm because I was curious to see if trailing zeroes does produce better asm and it does: <old::Ones as core::iter::traits::iterator::Iterator>::next:
mov ecx, dword ptr [rdi + 32]
mov rsi, qword ptr [rdi]
xor eax, eax
.LBB8_1:
mov rdx, rsi
mov esi, ecx
shr ecx
test sil, 1
jne .LBB8_5
lea rsi, [rdx + 1]
test ecx, ecx
jne .LBB8_1
mov rsi, qword ptr [rdi + 24]
test rsi, rsi
je .LBB8_6
mov rdx, qword ptr [rdi + 16]
add rsi, -1
mov ecx, dword ptr [rdx]
add rdx, 4
mov qword ptr [rdi + 16], rdx
mov qword ptr [rdi + 24], rsi
mov rsi, qword ptr [rdi + 8]
add rsi, 1
mov qword ptr [rdi + 8], rsi
shl rsi, 5
jmp .LBB8_1
.LBB8_5:
mov dword ptr [rdi + 32], ecx
lea rax, [rdx + 1]
mov qword ptr [rdi], rax
mov eax, 1
.LBB8_6:
ret
<new::Ones as core::iter::traits::iterator::Iterator>::next:
mov r8d, dword ptr [rdi + 24]
test r8d, r8d
je .LBB7_3
mov rdx, qword ptr [rdi]
jmp .LBB7_2
.LBB7_3:
mov rsi, qword ptr [rdi + 16]
add rsi, -1
xor eax, eax
.LBB7_4:
cmp rsi, -1
je .LBB7_5
mov rdx, qword ptr [rdi]
mov rcx, qword ptr [rdi + 8]
mov r8d, dword ptr [rcx]
mov dword ptr [rdi + 24], r8d
add rcx, 4
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi + 16], rsi
add rdx, 1
mov qword ptr [rdi], rdx
add rsi, -1
test r8d, r8d
je .LBB7_4
.LBB7_2:
bsf eax, r8d ; bit scan forward \o/
lea esi, [r8 - 1]
and esi, r8d
mov dword ptr [rdi + 24], esi
shl rdx, 5
or rdx, rax
mov eax, 1
ret
.LBB7_5:
ret |
The
iter_ones_all_zeros
benchmark improved a lot, the results foriter_ones_all_ones
are too noisy on my machine to tell. Will update when I have better numbers.Fixes #17.