-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
strconv: implement the Ryu algorithm to speed up float64->decimal conversion #15672
Comments
320 times slower WAT?1!1 |
This is a corner-case of a corner-case - extremely unlikely to matter in real-world apps. We have large-exponent values (unlikely), and for the BenchmarkAppendFloatLarge2-48 we have a slow-path: In genericFtoa, we end up calling bigFtoa which uses slow bignum arithmetic. In bigFtoa, the call to roundShortest fails to return quickly, and it proceeds doing the slow (but precise) binary-to-decimal mantissa conversion (ftoa.go:257ff). Probably not worth spending much time on fixing this. |
Looks like a nice DoS attack vector on any Go software that ever formats floats. Also, |
cc: @agl for input re: uses of float formatting for DoS attacks. I suspect this might not be unique to Go, and if so, what do other systems do? |
It's an interesting algorithmic complexity attack. (There have been similar issues in the past.) It's a bit of a sharp edge, but some values will probably always take a little longer to format. We could put effort into optimising this, but it probably makes the code complex in an already complex area. Perhaps it's worth a comment, but I'm not sure about more than that. |
How many of these corner case numbers exist? If their number is small, maybe a static lookup table could work.... |
@nightlyone Hard to say without very careful (and difficult!) mathematical analysis. |
Well, the fact that Dmitry found one (in a space of 2**64 elements) by fuzzing (I believe?) suggests that the number of slow values is probably not that small. |
Basically all floating point printing algorithms use a fast path for most numbers and fall back to a slow path. Go's fast path is the Grisu3 algorithm, which fails for 0.5% of inputs. The fallback is to some slow decimal code that I wrote a long time ago. In practice this is fine. Grisu3 was basically state of the art when it was added here, I believe by @remyoudompheng . Very recently a new paper came out that gets the fast path down to "all but 45 float64 values", which you can then handle with a table. See Andrysco, Jhala, and Lerner, If someone wants to implement that, great, but I don't think it's a particularly high priority. |
@rsc it seems that by the authors' own benchmarks, Errol is 2.4x slower than Grisu3. Does that seem like a reasonable tradeoff for strconv? (I ask because I'm thinking of working on this but I anticipate pushback if the change makes the 95.5% case more than twice as slow. Another alternative would be to use Errol as the Grisu3 fallback but that seems like a lot of code for printing floats.) |
FWIW I have a one-line change that makes the slow path about 35% faster on Dmitry's number:
It doesn't fix the problem but it's better than nothing I guess: |
@cespare Having good performance for the majority of cases can be critical for some users. In my applications I would not withstand a 2.4x slowdown (however using Go's code generation the ratio could turn out to be different). If I understand correctly Errol can remove the need for big number arithmetic entirely, so it could be a fallback to Grisu3 without so much code size inflation, and it would be quite useful because if 1 number out of 200 is 300 times slower to process, the impact is far from negligible. |
@remyoudompheng Errol does not eliminate the need for big number arithmetic entirely. For floating point numbers in the range [2^54, 2^131) it falls back to exact integer arithmetic. The algorithm is actually basically the same as previous conversion algorithms except outside that range it uses double-double (2x float64) to calculate. The double-double has enough precision except for a few enumerated cases that are looked up in a table. The compute_digits algorithm in the paper for the shortest output produces the largest correctly rounded decimal in the rounding range, so that would need to be modified to produce the closest correctly rounded decimal. Also, note that the paper does not contain any error analysis for float32. I think these issues might indicate that Errol is not ready for inclusion in the standard library. Figure 13(c) in the paper shows the fall back performance they are benchmarking Errol against and it shows ~ linear dependence on the floating point exponent. Benchmarking the Go fall back, which I think is using the same Dragon4 algorithm, shows ~ quadratic dependence on the exponent. So the current slow path should be able to speed up quite a bit. |
Ryū: fast float-to-string conversion, presentation, github Just came out and is simpler and faster than grisu3 (~3x in the paper's benchmarks). It requires only integer operations, although for float64 it uses 128-bit integers that should be made more efficient by #24813 It also requires a few small lookup tables:
Since it is both simpler and faster, I would recommend investigating the ryu algorithm as the replacement for grisu3 instead of errol. |
I have implemented Ryu in Go at https://github.com/cespare/ryu. On my machine it is significantly faster than strconv for all the inputs I've tried.
(That last one is @dvyukov's pathological case.) While some inputs are faster than others, there aren't any drastically slower fallback paths.
It's true that Ryu requires some lookup tables. However, in his C implementation @ulfjack (the Ryu author) has a size-optimized version that uses much smaller lookup tables in exchange for a little more CPU cost. I've implemented that as well and it's still faster than the current strconv implementation in all the cases I've looked at:
Based on these promising results and the comments by @bmkessler above, it seems like switching strconv to use Ryu is the best option available to us. I've taken the liberty of retitling the issue accordingly. I intend to work on this during the Go 1.13 cycle. |
I'm also in favour of switching. I also have some version on my side in the works, but I'm taking a different path by making it look like the old Grisu code. I'll try to show it somewhere. |
A lot of internal functions are exposed in export_test.go in order to test and bench various levels of function calls. Random tests have been run on a few billion values. Implements golang#15672. Change-Id: I028faa4f97c38f51709469f7314bfd7ec12f06dd
I have pushed my own things in the following branch https://github.com/remyoudompheng/go/tree/ryu A few notes:
|
I have also prepared a bunch of denormals very hard to parse when printed in their shortest form (they are also hard for AppendFloat). The last one is "kind of" pathological but since it has few digits, the existing code handles it fine. BenchmarkAtof/1.68514038588815e-309-4 50000 18384 ns/op BenchmarkAtof/9.11691642378e-312-4 50000 19854 ns/op BenchmarkAtof/1.62420278e-315-4 10000000 71.4 ns/op BenchmarkAppendFloatHard/341076211242912p-1074-4 50000 32519 ns/op BenchmarkAppendFloatHard/1845284427387p-1074-4 50000 32516 ns/op BenchmarkAppendFloatHard/328742302p-1074-4 20000000 103 ns/op The corresponding numbers are: 9.11691642378e-312 = 0x1ada385d67b.7fffffff5d9...p-1074 1.62420278e-315 = 0x1398359e.7fffe022p-1074 |
Hello, Benchmarks of the computational part (convert uint64 mantissa and power of 10 exponent to float64). benchmark old ns/op new ns/op delta BenchmarkAtof/33909e0-4 5.64 14.9 +164.18% BenchmarkAtof/3397784e-4-4 6.58 20.3 +208.51% BenchmarkAtof/509e73-4 36.9 18.0 -51.22% BenchmarkAtof/6226662346353213e-324-4 39.8 18.5 -53.52% BenchmarkAtof/6808957268280643e116-4 5684 18.0 -99.68% BenchmarkAtof/4334126125515466e-225-4 9829 18.0 -99.82% BenchmarkAtof/168514038588815e-323-4 18157 18.6 -99.90% BenchmarkAtof/911691642378e-323-4 19622 18.6 -99.91% BenchmarkAtof/162420278e-323-4 40.0 18.5 -53.75% BenchmarkAtof/22250738585072011e-324-4 40.0 18.6 -53.50% Benchmark of ParseFloat: in this benchmark the Grisu parsing is replaced by the Ryu routine (keeping the "float64 multiply" fast path). benchmark old ns/op new ns/op delta BenchmarkAtof/33909-4 24.1 24.0 -0.41% BenchmarkAtof/339.7784-4 29.1 29.1 +0.00% BenchmarkAtof/-5.09e75-4 63.4 46.2 -27.13% BenchmarkAtof/123456789123456789123456789-4 98.8 999 +911.13% BenchmarkAtof/622666234635.3213e-320-4 82.8 64.1 -22.58% BenchmarkAtof/33909#01-4 23.5 23.6 +0.43% BenchmarkAtof/339.778-4 26.5 26.4 -0.38% BenchmarkAtof/12.3456e32-4 66.7 66.2 -0.75% BenchmarkAtof/100000000000000016777215-4 844 786 -6.87% BenchmarkAtof/100000000000000016777216-4 693 643 -7.22% BenchmarkAtof/6808957268280643e116-4 5836 65.5 -98.88% BenchmarkAtof/4.334126125515466e-210-4 10045 72.3 -99.28% BenchmarkAtof/1.68514038588815e-309-4 18462 64.8 -99.65% BenchmarkAtof/9.11691642378e-312-4 19818 58.8 -99.70% BenchmarkAtof/1.62420278e-315-4 71.1 54.4 -23.49% BenchmarkAtof/2.2250738585072011e-308-4 84.2 67.1 -20.31% |
Nice. Great observation, @remyoudompheng. I haven't published my parsing code yet; really need to finish that. I believe it can be made fast regardless of input string length, i.e., scale linearly. |
A lot of internal functions are exposed in export_test.go in order to test and bench various levels of function calls. Random tests have been run on a few billion values. Implements golang#15672. Change-Id: I028faa4f97c38f51709469f7314bfd7ec12f06dd
As of commit remyoudompheng@ed351765ae the Atof implementation is now extended to work whenever the decimal digits form a number that fits in a uint64. It can also handle non ambiguous very long inputs that simply converting the rounded-up mantissa if it yields the same result. // Halfway is 500016682268521616.00000000000001e229 {"500016682268521616e229", "5.000166822685216e+246", nil}, // 18 digits necessary // Halfway is 1873795671212201760.9999999999999998e108 {"1873795671212201761e108", "1.873795671212202e+126", nil}, // 19 digits (61 bits) necessary // Halfway is 10027399025072458413.99999999999998e140 {"10027399025072458414e140", "1.002739902507246e+159", nil}, // 20 digits (64 bits) necessary @ulfjack I am interested in knowing whether the "TestRyuNoCarry" is exactly the same proof as in your paper or not. Final benchmarks: BenchmarkAtof/33909-4 23.2 23.0 -0.86% BenchmarkAtof/339.7784-4 29.0 28.7 -1.03% BenchmarkAtof/-5.09e75-4 63.8 45.2 -29.15% BenchmarkAtof/18446744073709551608-4 98.0 68.8 -29.80% BenchmarkAtof/123456789123456789123456789-4 90.7 82.1 -9.48% BenchmarkAtof/622666234635.3213e-320-4 80.0 59.1 -26.12% BenchmarkAtof/33909#01-4 22.9 22.2 -3.06% BenchmarkAtof/339.778-4 26.7 26.8 +0.37% BenchmarkAtof/12.3456e32-4 66.2 61.4 -7.25% BenchmarkAtof/2.3399415873862403e69-4 2836 59.0 -97.92% BenchmarkAtof/500016682268521616e229-4 19610 67.7 -99.65% BenchmarkAtof/1873795671212201761e108-4 6383 76.2 -98.81% BenchmarkAtof/10027399025072458414e140-4 7728 79.2 -98.98% BenchmarkAtof/100000000000000016777215-4 889 833 -6.30% BenchmarkAtof/100000000000000016777216-4 768 721 -6.12% BenchmarkAtof/6808957268280643e116-4 5828 57.7 -99.01% BenchmarkAtof/4.334126125515466e-210-4 9924 59.6 -99.40% BenchmarkAtof/1.68514038588815e-309-4 18428 58.8 -99.68% BenchmarkAtof/9.11691642378e-312-4 19888 55.0 -99.72% BenchmarkAtof/1.62420278e-315-4 73.7 54.3 -26.32% BenchmarkAtof/2.2250738585072011e-308-4 82.7 61.1 -26.12% |
@remyoudompheng I'm afraid I don't understand your question. The proof in the paper only covers binary to decimal conversion, not the other way round. I have since written down an extended proof that shows that the same concepts apply to all source and target bases with minor changes for certain base pairs. I believe that my implementation closely follows the proof. |
@ulfjack I meant that this unit test (https://github.com/remyoudompheng/go/blob/ed351765ae8a8307e4df08a24511ec058b7f7ccc/src/strconv/extfloat2_test.go#L478) aims at embedding a proof of what is paragraph 3.2.3 of your paper, to make the code self-contained. I believe it is essentially equivalent. |
@remyoudompheng and I discussed this and he's going to work on creating the CLs for his implementation in strconv. I may help review the changes. I'm changing the assignment accordingly. |
Status update:
Shortest formatting is not yet cleaned up. |
Shortest formatting is now included and I added a fix to avoid all long divisions on 32-bit platforms (especially arm where performance was awful). ARM now gets performance gains as well. |
The Ryū algorithm as described in a paper by Ulf Adams, "Ryū: Fast Float-to-String Conversion" (doi:10.1145/3192366.3192369) is better than Grisu3 because it handles all edge cases properly. In Grisu3, about 0.5% of float64 numbers fall back to the slow algorithm with can be 10-200 times slower. The core property used by the Ryū algorithm is that using sufficiently large precision for powers of 10 can eliminate all rounding edge cases. Such edge cases can be characterized by an equation of shape: m * P <= n * 2^k <= m * (P+1) where P is the fixed precision truncated version of the power of 10. Solving this equation can be done using Farey sequences to enumerate rationals n/m in the interval [P/2^k, (P+1)/2^k]. The original algorithm describes formatting to shortest decimal representation. This patch implements a variant of this algorithm for atof functions, using the properties: - 64-bit powers of 10 are enough to handle 31-bit decimal mantissas to parse float32 values - 128-bit powers of 10 are enough to handle 64-bit decimal mantissas to parse float64 values Since Grisu3 already uses 64-bit powers of ten, the difference in atof32 is hard to notice, but rather resides in much clearer logic. Powers of 10 are tabulated and will be reused for the ftoa implementation. AMD64 benchmarks: benchmark old ns/op new ns/op delta BenchmarkAtof64Decimal 38.6 38.5 -0.26% BenchmarkAtof64Float 49.9 49.6 -0.60% BenchmarkAtof64FloatExp 78.5 69.9 -10.96% BenchmarkAtof64FloatExact 125 141 +12.80% BenchmarkAtof64Big 148 161 +8.78% BenchmarkAtof64Hard 9946 120 -98.79% BenchmarkAtof64RandomBits 70.7 69.1 -2.26% BenchmarkAtof64RandomFloats 70.4 70.5 +0.14% BenchmarkAtof32Decimal 40.1 37.5 -6.48% BenchmarkAtof32Float 48.4 45.3 -6.40% BenchmarkAtof32FloatExp 87.1 74.1 -14.93% BenchmarkAtof32FloatHard 951 104 -89.06% BenchmarkAtof32Random 113 97.1 -14.07% ARM benchmarks: benchmark old ns/op new ns/op delta BenchmarkAtof64Decimal 670 659 -1.64% BenchmarkAtof64Float 2082 2050 -1.54% BenchmarkAtof64FloatExp 1137 1044 -8.18% BenchmarkAtof64FloatExact 1007 1623 +61.17% BenchmarkAtof64Big 1179 1361 +15.44% BenchmarkAtof64Hard 61099 1097 -98.20% BenchmarkAtof64RandomBits 646 634 -1.86% BenchmarkAtof64RandomFloats 639 627 -1.88% BenchmarkAtof32Decimal 823 824 +0.12% BenchmarkAtof32Float 2398 2364 -1.42% BenchmarkAtof32FloatExp 1294 1195 -7.65% BenchmarkAtof32FloatHard 6168 965 -84.35% BenchmarkAtof32Random 1175 1100 -6.38% Updates golang#15672 Change-Id: I297f2ffb038d7c4598e1365b61c13b30e9bdd7fc
This patch implements a simplified version of Ulf Adams, "Ryū: Fast Float-to-String Conversion" (doi:10.1145/3192366.3192369) for formatting floating-point numbers with a fixed number of decimal digits. It uses the same principles but does not need to handle the complex task of finding a shortest representation. This allows to handle a few more cases than Grisu3, notably formatting with up to 18 significant digits. AMD64 benchmarks benchmark old ns/op new ns/op delta BenchmarkAppendFloat/32Fixed8Hard 74.2 47.8 -35.58% BenchmarkAppendFloat/32Fixed9Hard 77.1 57.6 -25.29% BenchmarkAppendFloat/64Fixed1 62.1 48.9 -21.26% BenchmarkAppendFloat/64Fixed2 69.6 49.3 -29.17% BenchmarkAppendFloat/64Fixed3 63.4 50.6 -20.19% BenchmarkAppendFloat/64Fixed4 71.5 49.1 -31.33% BenchmarkAppendFloat/64Fixed12 95.7 71.5 -25.29% BenchmarkAppendFloat/64Fixed16 1608 63.1 -96.08% BenchmarkAppendFloat/64Fixed12Hard 1276 60.3 -95.27% BenchmarkAppendFloat/64Fixed17Hard 4128 68.6 -98.34% BenchmarkAppendFloat/64Fixed18Hard 4155 4146 -0.22% ARM benchmarks benchmark old ns/op new ns/op delta BenchmarkAppendFloat/32Fixed8Hard 1045 575 -44.98% BenchmarkAppendFloat/32Fixed9Hard 1178 996 -15.45% BenchmarkAppendFloat/64Fixed1 781 786 +0.64% BenchmarkAppendFloat/64Fixed2 806 694 -13.90% BenchmarkAppendFloat/64Fixed3 765 723 -5.49% BenchmarkAppendFloat/64Fixed4 815 648 -20.49% BenchmarkAppendFloat/64Fixed12 1292 1039 -19.58% BenchmarkAppendFloat/64Fixed16 20045 1103 -94.50% BenchmarkAppendFloat/64Fixed12Hard 16041 979 -93.90% BenchmarkAppendFloat/64Fixed17Hard 50489 1200 -97.62% BenchmarkAppendFloat/64Fixed18Hard 53500 53630 +0.24% Updates golang#15672 Change-Id: I160963e141dd48287ad8cf57bcc3c686277788e8
This patch implements the algorithm from Ulf Adams, "Ryū: Fast Float-to-String Conversion" (doi:10.1145/3192366.3192369) for formatting floating-point numbers with a fixed number of decimal digits. It is not a direct translation of the reference C implementation but still follows the original paper. In particular, it uses full 128-bit powers of 10, which allows for more precision in the other modes (fixed ftoa, atof). AMD64 benchmarks benchmark old ns/op new ns/op delta BenchmarkAppendFloat/Decimal-4 49.9 57.0 +14.23% BenchmarkAppendFloat/Float-4 121 89.0 -26.45% BenchmarkAppendFloat/Exp-4 89.4 96.4 +7.83% BenchmarkAppendFloat/NegExp-4 88.7 93.0 +4.85% BenchmarkAppendFloat/LongExp-4 142 108 -23.94% BenchmarkAppendFloat/Big-4 144 112 -22.22% BenchmarkAppendFloat/BinaryExp-4 43.0 43.1 +0.23% BenchmarkAppendFloat/32Integer-4 51.4 56.4 +9.73% BenchmarkAppendFloat/32ExactFraction-4 95.3 79.4 -16.68% BenchmarkAppendFloat/32Point-4 121 77.2 -36.20% BenchmarkAppendFloat/32Exp-4 87.3 103 +17.98% BenchmarkAppendFloat/32NegExp-4 87.1 85.2 -2.18% BenchmarkAppendFloat/32Shortest-4 106 76.2 -28.11% BenchmarkAppendFloat/Slowpath64-4 1016 95.3 -90.62% BenchmarkAppendFloat/SlowpathDenormal64-4 32013 86.2 -99.73% ARM benchmarks benchmark old ns/op new ns/op delta BenchmarkAppendFloat/Decimal-4 829 678 -18.21% BenchmarkAppendFloat/Float-4 1367 1259 -7.90% BenchmarkAppendFloat/Exp-4 1100 1338 +21.64% BenchmarkAppendFloat/NegExp-4 1097 1336 +21.79% BenchmarkAppendFloat/LongExp-4 1852 1367 -26.19% BenchmarkAppendFloat/Big-4 1885 1621 -14.01% BenchmarkAppendFloat/BinaryExp-4 1000 966 -3.40% BenchmarkAppendFloat/32Integer-4 892 737 -17.38% BenchmarkAppendFloat/32ExactFraction-4 1201 1134 -5.58% BenchmarkAppendFloat/32Point-4 1439 1085 -24.60% BenchmarkAppendFloat/32Exp-4 1130 1372 +21.42% BenchmarkAppendFloat/32NegExp-4 1128 1126 -0.18% BenchmarkAppendFloat/32Shortest-4 1368 1069 -21.86% BenchmarkAppendFloat/Slowpath64-4 28468 1333 -95.32% BenchmarkAppendFloat/SlowpathDenormal64-4 378975 1291 -99.66% Fixes golang#15672 Change-Id: Ib90dfa245f62490a6666671896013cf3f9a1fb22
Change https://golang.org/cl/170078 mentions this issue: |
Change https://golang.org/cl/170079 mentions this issue: |
Change https://golang.org/cl/170080 mentions this issue: |
Reading @rsc's comment on CL 170078, it looks like these changes were intended to be reviewed early-in-cycle for 1.15. @remyoudompheng is this still the case? |
This patch implements a simplified version of Ulf Adams, "Ryū: Fast Float-to-String Conversion" (doi:10.1145/3192366.3192369) for formatting floating-point numbers with a fixed number of decimal digits. It uses the same principles but does not need to handle the complex task of finding a shortest representation. This allows to handle a few more cases than Grisu3, notably formatting with up to 18 significant digits. name old time/op new time/op delta AppendFloat/32Fixed8Hard-4 72.0ns ± 2% 56.0ns ± 2% -22.28% (p=0.000 n=10+10) AppendFloat/32Fixed9Hard-4 74.8ns ± 0% 64.2ns ± 2% -14.16% (p=0.000 n=8+10) AppendFloat/64Fixed1-4 60.4ns ± 1% 54.2ns ± 1% -10.31% (p=0.000 n=10+9) AppendFloat/64Fixed2-4 66.3ns ± 1% 53.3ns ± 1% -19.54% (p=0.000 n=10+9) AppendFloat/64Fixed3-4 61.0ns ± 1% 55.0ns ± 2% -9.80% (p=0.000 n=9+10) AppendFloat/64Fixed4-4 66.9ns ± 0% 52.0ns ± 2% -22.20% (p=0.000 n=8+10) AppendFloat/64Fixed12-4 95.5ns ± 1% 76.2ns ± 3% -20.19% (p=0.000 n=10+9) AppendFloat/64Fixed16-4 1.62µs ± 0% 0.07µs ± 2% -95.69% (p=0.000 n=10+10) AppendFloat/64Fixed12Hard-4 1.27µs ± 1% 0.07µs ± 1% -94.83% (p=0.000 n=9+9) AppendFloat/64Fixed17Hard-4 3.68µs ± 1% 0.08µs ± 2% -97.86% (p=0.000 n=10+9) AppendFloat/64Fixed18Hard-4 3.67µs ± 0% 3.72µs ± 1% +1.44% (p=0.000 n=9+10) Updates #15672 Change-Id: I160963e141dd48287ad8cf57bcc3c686277788e8 Reviewed-on: https://go-review.googlesource.com/c/go/+/170079 Reviewed-by: Emmanuel Odeke <[email protected]> Trust: Emmanuel Odeke <[email protected]> Trust: Nigel Tao <[email protected]> Trust: Robert Griesemer <[email protected]> Run-TryBot: Emmanuel Odeke <[email protected]> TryBot-Result: Go Bot <[email protected]>
go version devel +15f2d0e Fri May 13 01:19:05 2016 +0000 linux/amd64
The text was updated successfully, but these errors were encountered: