-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify Dragonbox #2713
Simplify Dragonbox #2713
Conversation
It usually generates slower code than manual handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
include/fmt/format-inl.h
Outdated
#if FMT_HAS_BUILTIN(__builtin_addcll) | ||
unsigned long long carry; | ||
low_ = __builtin_addcll(low_, n, 0, &carry); | ||
high_ = __builtin_addcll(high_, 0, carry, &carry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can
high_ = __builtin_addcll(high_, 0, carry, &carry);
be replaced with
high_ += carry;
or is there any benefit to using a builtin?
And similar comment in other conditionally compiled branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some experiment: https://godbolt.org/z/Yzf1T9ed8
Seems like they generate the same code.
include/fmt/format-inl.h
Outdated
constexpr uint64_t high() const noexcept { return high_; } | ||
constexpr uint64_t low() const noexcept { return low_; } | ||
|
||
uint128_wrapper& operator+=(uint64_t n) & noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&
is probably not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
include/fmt/format-inl.h
Outdated
uint128_wrapper& operator+=(uint64_t n) FMT_NOEXCEPT { | ||
# if defined(_MSC_VER) && defined(_M_X64) | ||
unsigned char carry = _addcarry_u64(0, low_, n, &low_); | ||
constexpr uint128_wrapper(uint64_t high, uint64_t low) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and elsewhere please keep FMT_NOEXCEPT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, did the change.
Hmm. Probably I have to do some experiment, but I guess the cache recovery can be made even simpler, because the error it introduces might not introduce actual error. Please hold this for a while if you are satisfied otherwise! To be more precise, what I'm talking about is this: As you already know, the gigantic table that Dragonbox relies on is consisting of approximations to 5^k's (or equivalently 10^k's) with the leading 1 aligned at the MSB. The way the table is compressed down to a much smaller one is that we just store 5^k's for certain k's, and in the runtime it tries to recover a possibly missing entry it needs by multiplying appropriate 5^a to an existing entry and then realign it so that the leading 1 is still at the MSB. Since the table entries are already approximations, this will inevitably introduce some amount of error, so the resulting value is slightly off compared to what's in the full table, so what I did is to store those errors into a separate table so that we can add those errors back to correctly recover the missing entry we are interested in. However, it is likely that this small amount of error actually doesn't matter as the computed value is probably still a sufficiently good approximation to the true value. So let me do some analysis to see if I can prove this claim. |
Sorry for the delay, I found a severe mistake in my proof of correctness of the new integer checks, and in order to fix that I had to learn some more refined mathematical tools and come up with some more sophisticated arguments. It was quite substantial, actually. But now I believe the proof is correct. And yeah, we indeed do not need to add the recovery error back, which is great :) |
Hold on, I need to verify several more cases. |
Okay, turned out to be not a problem. |
static constexpr struct { | ||
uint32_t magic_number; | ||
int bits_for_comparison; | ||
uint32_t threshold; | ||
int shift_amount; | ||
} infos[] = {{0xcccd, 16, 0x3333, 18}, {0xa429, 8, 0x0a, 20}}; | ||
int margin_bits; | ||
int divisibility_check_bits; | ||
} infos[] = {{0x199a, 8, 8}, {0xa3d71, 10, 16}}; | ||
constexpr auto info = infos[N - 1]; | ||
n *= info.magic_number; | ||
const uint32_t comparison_mask = (1u << info.bits_for_comparison) - 1; | ||
bool result = (n & comparison_mask) <= info.threshold; | ||
n >>= info.shift_amount; | ||
n >>= info.margin_bits; | ||
const uint32_t comparison_mask = (1u << info.divisibility_check_bits) - 1; | ||
bool result = (n & comparison_mask) == 0; | ||
n >>= info.divisibility_check_bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an explanation of this logic somewhere, particularly what magic numbers are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, I'm still working on rewriting the paper. But it's probably simple enough so that I can explain it here. Roughly speaking, those magic numbers are still binary digits of 1/5 and 1/25.
What we are trying to do here is to find integers m
and k
and l
such that
floor(n/d) = floor(nm / 2^(k+l))
whered=10
ord=100
,floor(nm/2^k) mod 2^l = 0
if and only ifn
is divisible byd
.
(Here, m
is magic_number
, k
is margin_bits
, and l
is divisibility_check_bits
.)
So if these are possible, then we first multiply n
with the magic number m
and shift to right by k
. The lower l
bits of the resulting number is zero if and only if n
is divisible by d
, and then we shift to right again by l
to wipe out those bits, then the result must be precisely floor(n/d)
.
The first item is nothing but what the classical paper by Granlund-Montgomery tells us, so let's take it as granted; we set m
to be the ceiling of 2^(k+l)/d
for large enough k+l
. (To be a bit more specific, k+l >= 16
for float
and k+l >= 26
for double
are more than enough.)
The rationale behind the second item is that, since m
is almost equal to 2^(k+l)/d
, so if n
is divisible by d
, then nm
should be almost a multiple of 2^(k+l)
. And 2^k
is how far nm
can deviate from truly being a multiple of 2^(k+l)
.
To make this idea precise, note that 1 is equivalent to
floor(n/d) <= nm/2^(k+l) < floor(n/d) + 1.
Now, let n = dq + r
with 0 <= r < d
, then this becomes
q <= nm/2^(k+l) < q + 1,
so multiplying 2^l
yields
q * 2^l <= nm/2^k < (q+1) * 2^l.
Hence, floor(nm/2^k) mod 2^l
is just floor(nm/2^k) - q * 2^l
.
Now, we want two things:
floor(nm/2^k) - q * 2^l = 0
ifr = 0
, andfloor(nm/2^k) - q * 2^l != 0
ifr != 0
.
The first one can be rewritten as
q * 2^l <= nm/2^k < q * 2^l + 1,
and since we already have the first inequality, it is equivalent to
nm < q * 2^(k+l) + 2^k.
Using n = dq
, we can rewrite it as
q(dm - 2^(k+l)) < 2^k,
and since m = ceil(2^(k+l)/d)
, we have 0 <= dm - 2^(k+l) < d
, so it is sufficient to have
qd = n < 2^k.
Since we know the range of n
(0~100
for float
and 0~1000
for double
), we can easily figure out that k >= 7
for float
and k >= 10
for double are sufficient to ensure that r = 0
implies floor(nm/2^k) mod 2^l = 0
.
On the other hand, the second one can be rewritten as
nm/2^k >= q * 2^l + 1,
or equivalently,
(dq + r) * m >= q * 2^(k+l) + 2^k,
or equivalently,
rm >= q(2^(k+l) - dm) + 2^k.
Since dm - 2^(k+l) > 0
, we can see that having
m >= 2^k
is sufficient to ensure that r != 0
implies floor(nm/2^k) mod 2^l != 0
. Or to be a bit more lenient, it is sufficient to have
2^(k+l)/d >= 2^k,
or equivalently
2^l >= d.
Hence, l >= 4
for float
and l >= 7
for double
are enough. (Of course, we set l = 8
for float
and l = 16
for double
to make mod 2^l
a trivial operation.)
I also did an exhaustive test of this here to make sure things are done correctly.
Ah, by the way, I got this idea from Schubfach. I don't know where the author of Schubfach got this, but I guess this kind of idea is probably already common and can be found here and there. Things like this and this are possibly related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the detailed explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My pleasure! I'll get back to you when I finish writing things up. The paper will contain several sharper results generalizing/unifying this and several other tricks I have been relying on.
Also, if time permits, I will work on an alternative idea about fixed-precision printing, based on a generalization of Dr. Lemire's paper I linked. I hope it ends up with something I can contribute again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this paper, a better method is introduced which allows us to save yet one more instruction:
const uint32_t magic_number = (1u << 16) / divisor + 1;
const uint32_t comparison_mask = (1u << 16) - 1;
n *= magic_number;
bool result = (n & comparison_mask) < magic_number;
n >>= 16;
(This method, the one I explained, and the classical divisibility test by Granlund-Montgomery using modular inverse, are in fact all just special cases of a more general result that will appear in the paper I'm writing. Basically, what this method is doing is not very different from what I explained. Indeed, floor(nm/2^k) mod 2^l = 0
is equivalent to nm mod 2^(k+l) < 2^k
, and the thing is that there is no terribly important reason to have a power of 2 on the right-hand side now. And it turns out that we can replace it by m
.)
I don't think this code will perform measurably better, but it's shorter and presumably better, so why not.
Merged, thanks! |
Main differences:
check_divisibility_and_divide_by_pow5()
and related code segments are simplified. I expect them to be now compiled into fewer instructions.In my machine, it seems that the performance is not very much affected, though I expected a little bit of improvement. (I ran Milo's benchmark several times, but the results were largely inconsistent.) Probably because bottlenecks are on other places. But the code size will be a bit smaller, because now we do not have the divisibility check tables.