Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new mp_todec_fast... #330

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from

Conversation

MasterDuke17
Copy link
Collaborator

that uses Barrett reduction to speed up stringifying large integers.

Some benchmarking, done with a lightly modified demo/timing.c:

CLK_PER_SEC == 3600313344
mp_torad of 2**1 =>  22224156/sec,       162 cycles, 0.0000 seconds
mp_todec of 2**1 =>   1492667/sec,      2412 cycles, 0.0000 seconds
mp_torad of 13**1 =>  15385954/sec,       234 cycles, 0.0000 seconds
mp_todec of 13**1 =>   1503890/sec,      2394 cycles, 0.0000 seconds

mp_torad of 2**2 =>  22224156/sec,       162 cycles, 0.0000 seconds
mp_todec of 2**2 =>   1503890/sec,      2394 cycles, 0.0000 seconds
mp_torad of 13**2 =>  11765729/sec,       306 cycles, 0.0000 seconds
mp_todec of 13**2 =>   1459981/sec,      2466 cycles, 0.0000 seconds

mp_torad of 2**4 =>  14286957/sec,       252 cycles, 0.0000 seconds
mp_todec of 2**4 =>   1459981/sec,      2466 cycles, 0.0000 seconds
mp_torad of 13**4 =>   7143478/sec,       504 cycles, 0.0000 seconds
mp_todec of 13**4 =>   1273996/sec,      2826 cycles, 0.0000 seconds

mp_torad of 2**8 =>  11112078/sec,       324 cycles, 0.0000 seconds
mp_todec of 2**8 =>   1438974/sec,      2502 cycles, 0.0000 seconds
mp_torad of 13**8 =>   4348204/sec,       828 cycles, 0.0000 seconds
mp_todec of 13**8 =>    639033/sec,      5634 cycles, 0.0000 seconds

mp_torad of 2**16 =>   7143478/sec,       504 cycles, 0.0000 seconds
mp_todec of 2**16 =>   1242344/sec,      2898 cycles, 0.0000 seconds
mp_torad of 13**16 =>   2325783/sec,      1548 cycles, 0.0000 seconds
mp_todec of 13**16 =>    381712/sec,      9432 cycles, 0.0000 seconds

mp_torad of 2**32 =>   3704026/sec,       972 cycles, 0.0000 seconds
mp_todec of 2**32 =>    617337/sec,      5832 cycles, 0.0000 seconds
mp_torad of 13**32 =>   1000087/sec,      3600 cycles, 0.0000 seconds
mp_todec of 13**32 =>    216469/sec,     16632 cycles, 0.0000 seconds

mp_torad of 2**64 =>   2040993/sec,      1764 cycles, 0.0000 seconds
mp_todec of 2**64 =>    347856/sec,     10350 cycles, 0.0000 seconds
mp_torad of 13**64 =>    376680/sec,      9558 cycles, 0.0000 seconds
mp_todec of 13**64 =>    115952/sec,     31050 cycles, 0.0000 seconds

mp_torad of 2**128 =>    869640/sec,      4140 cycles, 0.0000 seconds
mp_todec of 2**128 =>    197255/sec,     18252 cycles, 0.0000 seconds
mp_torad of 13**128 =>    122484/sec,     29394 cycles, 0.0000 seconds
mp_todec of 13**128 =>     59778/sec,     60228 cycles, 0.0000 seconds

mp_torad of 2**256 =>    336729/sec,     10692 cycles, 0.0000 seconds
mp_todec of 2**256 =>    103851/sec,     34668 cycles, 0.0000 seconds
mp_torad of 13**256 =>     35017/sec,    102816 cycles, 0.0000 seconds
mp_todec of 13**256 =>     29042/sec,    123966 cycles, 0.0000 seconds

mp_torad of 2**512 =>    107305/sec,     33552 cycles, 0.0000 seconds
mp_todec of 2**512 =>     53494/sec,     67302 cycles, 0.0000 seconds
mp_torad of 13**512 =>      9245/sec,    389430 cycles, 0.0001 seconds
mp_todec of 13**512 =>     14733/sec,    244368 cycles, 0.0001 seconds

mp_torad of 2**1024 =>     30425/sec,    118332 cycles, 0.0000 seconds
mp_todec of 2**1024 =>     52900/sec,     68058 cycles, 0.0000 seconds
mp_torad of 13**1024 =>      4827/sec,    745865 cycles, 0.0002 seconds
mp_todec of 13**1024 =>     13844/sec,    260046 cycles, 0.0001 seconds

mp_torad of 2**2048 =>     15603/sec,    230742 cycles, 0.0001 seconds
mp_todec of 2**2048 =>     26176/sec,    137538 cycles, 0.0000 seconds
mp_torad of 13**2048 =>      1269/sec,   2835144 cycles, 0.0008 seconds
mp_todec of 13**2048 =>      6411/sec,    561582 cycles, 0.0002 seconds

mp_torad of 2**4096 =>      4148/sec,    867797 cycles, 0.0002 seconds
mp_todec of 2**4096 =>     12749/sec,    282384 cycles, 0.0001 seconds
mp_torad of 13**4096 =>       323/sec,  11136150 cycles, 0.0031 seconds
mp_todec of 13**4096 =>      2852/sec,   1262358 cycles, 0.0004 seconds

mp_torad of 2**8192 =>      1081/sec,   3328344 cycles, 0.0009 seconds
mp_todec of 2**8192 =>      5991/sec,    600948 cycles, 0.0002 seconds
mp_torad of 13**8192 =>        81/sec,  44357363 cycles, 0.0123 seconds
mp_todec of 13**8192 =>      1198/sec,   3004920 cycles, 0.0008 seconds

mp_torad of 2**16384 =>       274/sec,  13116996 cycles, 0.0036 seconds
mp_todec of 2**16384 =>      2677/sec,   1344762 cycles, 0.0004 seconds
mp_torad of 13**16384 =>        20/sec, 176360310 cycles, 0.0490 seconds
mp_todec of 13**16384 =>       474/sec,   7579854 cycles, 0.0021 seconds

mp_torad of 2**32768 =>        69/sec,  51832746 cycles, 0.0144 seconds
mp_todec of 2**32768 =>      1146/sec,   3139524 cycles, 0.0009 seconds
mp_torad of 13**32768 =>         5/sec, 702303642 cycles, 0.1951 seconds
mp_todec of 13**32768 =>       176/sec,  20395908 cycles, 0.0057 seconds

mp_torad of 2**65536 =>        17/sec, 205633044 cycles, 0.0571 seconds
mp_todec of 2**65536 =>       456/sec,   7885224 cycles, 0.0022 seconds
mp_torad of 13**65536 =>         1/sec, 2803178970 cycles, 0.7786 seconds
mp_todec of 13**65536 =>        57/sec,  62507790 cycles, 0.0174 seconds

mp_torad of 2**131072 =>         4/sec, 820253700 cycles, 0.2278 seconds
mp_todec of 2**131072 =>       165/sec,  21787452 cycles, 0.0061 seconds
mp_torad of 13**131072 =>         0/sec, 11200695156 cycles, 3.1110 seconds
mp_todec of 13**131072 =>        17/sec, 200245194 cycles, 0.0556 seconds

mp_torad of 2**262144 =>         1/sec, 3273670044 cycles, 0.9093 seconds
mp_todec of 2**262144 =>        53/sec,  67087368 cycles, 0.0186 seconds
mp_torad of 13**262144 =>         0/sec, 44720842806 cycles, 12.4214 seconds
mp_todec of 13**262144 =>         5/sec, 661838760 cycles, 0.1838 seconds

mp_torad of 2**524288 =>         0/sec, 13101782489 cycles, 3.6391 seconds
mp_todec of 2**524288 =>        16/sec, 217065492 cycles, 0.0603 seconds
mp_torad of 13**524288 =>         0/sec, 177629371344 cycles, 49.3372 seconds
mp_todec of 13**524288 =>         1/sec, 2334986874 cycles, 0.6486 seconds

@MasterDuke17
Copy link
Collaborator Author

See #328

@MasterDuke17 MasterDuke17 force-pushed the barrett_todecimal branch 2 times, most recently from e3ba30e to f0f5bb7 Compare August 23, 2019 23:06
@minad
Copy link
Member

minad commented Aug 24, 2019

👍

Regarding the API I think it would be best if we replace the mp_todecimal macros by specialised functions. These could either branch to a fast variant if available or branch to the generic toradix. The mp_todecimal_fast function from this PR should then be internal and named s_mp_todecimal_fast

@MasterDuke17
Copy link
Collaborator Author

MasterDuke17 commented Aug 24, 2019 via email

@MasterDuke17
Copy link
Collaborator Author

These could either branch to a fast variant if available or branch to the generic toradix.

What do you mean, "if available"? Should I change this PR to also change the mp_todecimal macro into a function that chooses mp_toradix under a certain size and chooses my new function above?

One difficulty is that mp_toradix assume it's given a buffer of the necessary size, but my function grows the buffer because mp_radix_size is so slow. However, @czurnieden said they have a new mp_radix_size that's O(1) in the works, do we need to wait until that's merged?

@minad
Copy link
Member

minad commented Aug 24, 2019

If available - via the macro configuration, see #262

@minad
Copy link
Member

minad commented Aug 24, 2019

I would prefer if we use a radix size function and don't grow the buffer.

@czurnieden
Copy link
Contributor

Why is it called todecimal?

Is there a from_decimal available?

But after a quick first glimpse: looks good.
It needs some formatting and adaption to the style here and a test and documentation but otherwise…

However, @czurnieden said they have a new mp_radix_size that's O(1) in the works, do we need to wait until that's merged?

It is in a larger PR but independent of the rest of that PR. Needs to get updated to the new state of LibTomMath (more or less just the small_int_2_big_int parts, nothing significant) but no other hurdles as far as I can see.
As whined about in #328 I'm stuck up to the upper edge of my lower lip in work untill the end of August, so if I get a kick in the athumbs-up from the regulars I'll squeeze it in this WE otherwise use mp_ilogb in the meantime.

@MasterDuke17
Copy link
Collaborator Author

@minad

I would prefer if we use a radix size function and don't grow the buffer.

I can switch to assuming the passed in buffer is the correct size, but it'll take me a little bit of time to re-work.

@czurnieden

It needs some formatting and adaption to the style here

I tried to follow the style and it passes make astyle, what is wrong?

a test and documentation

Yep I can work on that, but I wanted confirmation I was on the right path first, so thanks.

@MasterDuke17
Copy link
Collaborator Author

A style/correctness question. Should I move the 3 arrays of mp_ints that are passed around outside the functions so they don't have to be passed in every call? So

mp_int nL[20], shiftL[20], mL[20];

mp_err mp_todecimal_fast_rec(mp_int *number, int precalc_array_index, int left, char **result) {
...
}

instead of the current

mp_err mp_todecimal_fast_rec(mp_int *number, mp_int *nL, mp_int *shiftL, mp_int *mL, int precalc_array_index, int left, char **result) {
...
}

@MasterDuke17
Copy link
Collaborator Author

MasterDuke17 commented Aug 26, 2019

Timings with the new version that expects to be passed a pre-allocated buffer (i.e., similar API to mp_toradix). It got a bit faster, now the crossover point with mp_toradix is closer to 2^1k than 2^2k.

CLK_PER_SEC == 3600292536
mp_torad of 2**1 =>  22224028/sec,       162 cycles, 0.0000 seconds
mp_todec of 2**1 =>   1818329/sec,      1980 cycles, 0.0000 seconds
mp_torad of 13**1 =>  14286875/sec,       252 cycles, 0.0000 seconds
mp_todec of 13**1 =>   1785859/sec,      2016 cycles, 0.0000 seconds

mp_torad of 2**2 =>  16668021/sec,       216 cycles, 0.0000 seconds
mp_todec of 2**2 =>   1852002/sec,      1944 cycles, 0.0000 seconds
mp_torad of 13**2 =>  11112014/sec,       324 cycles, 0.0000 seconds
mp_todec of 13**2 =>   1801948/sec,      1998 cycles, 0.0000 seconds

mp_torad of 2**4 =>  14286875/sec,       252 cycles, 0.0000 seconds
mp_todec of 2**4 =>   1801948/sec,      1998 cycles, 0.0000 seconds
mp_torad of 13**4 =>   7143437/sec,       504 cycles, 0.0000 seconds
mp_todec of 13**4 =>   1613034/sec,      2232 cycles, 0.0000 seconds

mp_torad of 2**8 =>  11112014/sec,       324 cycles, 0.0000 seconds
mp_todec of 2**8 =>   1801948/sec,      1998 cycles, 0.0000 seconds
mp_torad of 13**8 =>   4444805/sec,       810 cycles, 0.0000 seconds
mp_todec of 13**8 =>    766345/sec,      4698 cycles, 0.0000 seconds

mp_torad of 2**16 =>   6452137/sec,       558 cycles, 0.0000 seconds
mp_todec of 2**16 =>   1600130/sec,      2250 cycles, 0.0000 seconds
mp_torad of 13**16 =>   2299037/sec,      1566 cycles, 0.0000 seconds
mp_todec of 13**16 =>    451503/sec,      7974 cycles, 0.0000 seconds

mp_torad of 2**32 =>   4000325/sec,       900 cycles, 0.0000 seconds
mp_todec of 2**32 =>    749124/sec,      4806 cycles, 0.0000 seconds
mp_torad of 13**32 =>   1015310/sec,      3546 cycles, 0.0000 seconds
mp_todec of 13**32 =>    256102/sec,     14058 cycles, 0.0000 seconds

mp_torad of 2**64 =>   2062023/sec,      1746 cycles, 0.0000 seconds
mp_todec of 2**64 =>    406537/sec,      8856 cycles, 0.0000 seconds
mp_torad of 13**64 =>    363665/sec,      9900 cycles, 0.0000 seconds
mp_todec of 13**64 =>    141154/sec,     25506 cycles, 0.0000 seconds

mp_torad of 2**128 =>    921733/sec,      3906 cycles, 0.0000 seconds
mp_todec of 2**128 =>    230965/sec,     15588 cycles, 0.0000 seconds
mp_torad of 13**128 =>    120419/sec,     29898 cycles, 0.0000 seconds
mp_todec of 13**128 =>     74744/sec,     48168 cycles, 0.0000 seconds

mp_torad of 2**256 =>    332805/sec,     10818 cycles, 0.0000 seconds
mp_todec of 2**256 =>    126512/sec,     28458 cycles, 0.0000 seconds
mp_torad of 13**256 =>     34955/sec,    102996 cycles, 0.0000 seconds
mp_todec of 13**256 =>     38875/sec,     92610 cycles, 0.0000 seconds

mp_torad of 2**512 =>    105940/sec,     33984 cycles, 0.0000 seconds
mp_todec of 2**512 =>     67459/sec,     53370 cycles, 0.0000 seconds
mp_torad of 13**512 =>      9089/sec,    396090 cycles, 0.0001 seconds
mp_todec of 13**512 =>     19411/sec,    185472 cycles, 0.0001 seconds

mp_torad of 2**1024 =>     30141/sec,    119448 cycles, 0.0000 seconds
mp_todec of 2**1024 =>     46150/sec,     78012 cycles, 0.0000 seconds
mp_torad of 13**1024 =>      4917/sec,    732096 cycles, 0.0002 seconds
mp_todec of 13**1024 =>     19351/sec,    186048 cycles, 0.0001 seconds

mp_torad of 2**2048 =>     15745/sec,    228654 cycles, 0.0001 seconds
mp_todec of 2**2048 =>     36241/sec,     99342 cycles, 0.0000 seconds
mp_torad of 13**2048 =>      1274/sec,   2824362 cycles, 0.0008 seconds
mp_todec of 13**2048 =>      8878/sec,    405504 cycles, 0.0001 seconds

mp_torad of 2**4096 =>      4170/sec,    863334 cycles, 0.0002 seconds
mp_todec of 2**4096 =>     17672/sec,    203724 cycles, 0.0001 seconds
mp_torad of 13**4096 =>       324/sec,  11106486 cycles, 0.0031 seconds
mp_todec of 13**4096 =>      3837/sec,    938070 cycles, 0.0003 seconds

mp_torad of 2**8192 =>      1085/sec,   3315510 cycles, 0.0009 seconds
mp_todec of 2**8192 =>      8256/sec,    436068 cycles, 0.0001 seconds
mp_torad of 13**8192 =>        82/sec,  43743563 cycles, 0.0122 seconds
mp_todec of 13**8192 =>      1577/sec,   2282616 cycles, 0.0006 seconds

mp_torad of 2**16384 =>       276/sec,  13006458 cycles, 0.0036 seconds
mp_todec of 2**16384 =>      3589/sec,   1003068 cycles, 0.0003 seconds
mp_torad of 13**16384 =>        20/sec, 174191328 cycles, 0.0484 seconds
mp_todec of 13**16384 =>       643/sec,   5597280 cycles, 0.0016 seconds

mp_torad of 2**32768 =>        70/sec,  51329502 cycles, 0.0143 seconds
mp_todec of 2**32768 =>      1555/sec,   2314188 cycles, 0.0006 seconds
mp_torad of 13**32768 =>         5/sec, 695193174 cycles, 0.1931 seconds
mp_todec of 13**32768 =>       254/sec,  14172930 cycles, 0.0039 seconds

mp_torad of 2**65536 =>        17/sec, 203966838 cycles, 0.0567 seconds
mp_todec of 2**65536 =>       637/sec,   5649137 cycles, 0.0016 seconds
mp_torad of 13**65536 =>         1/sec, 2778189192 cycles, 0.7717 seconds
mp_todec of 13**65536 =>        97/sec,  36809424 cycles, 0.0102 seconds

mp_torad of 2**131072 =>         4/sec, 813089286 cycles, 0.2258 seconds
mp_todec of 2**131072 =>       250/sec,  14385618 cycles, 0.0040 seconds
mp_torad of 13**131072 =>         0/sec, 11102668398 cycles, 3.0838 seconds
mp_todec of 13**131072 =>        35/sec, 100849896 cycles, 0.0280 seconds

mp_torad of 2**262144 =>         1/sec, 3254062661 cycles, 0.9038 seconds
mp_todec of 2**262144 =>        96/sec,  37234620 cycles, 0.0103 seconds
mp_torad of 13**262144 =>         0/sec, 44394883092 cycles, 12.3309 seconds
mp_todec of 13**262144 =>        13/sec, 272151000 cycles, 0.0756 seconds

mp_torad of 2**524288 =>         0/sec, 12977733078 cycles, 3.6046 seconds
mp_todec of 2**524288 =>        35/sec, 101212992 cycles, 0.0281 seconds
mp_torad of 13**524288 =>         0/sec, 177622493220 cycles, 49.3356 seconds
mp_todec of 13**524288 =>         4/sec, 737604720 cycles, 0.2049 seconds

@MasterDuke17
Copy link
Collaborator Author

@minad #262 hasn't been merged. Are you saying wait until it has? Or rebase this branch onto feature-detection2 and use its MP_HAS?

@sjaeckel
Copy link
Member

Thanks for this PR!

Sorry for blocking you guys, I will have more time from next week on

Yes, please rebase, squash together into logical commits and use MP_HAS

@sjaeckel
Copy link
Member

... squash already happened ... 👍

@MasterDuke17 MasterDuke17 force-pushed the barrett_todecimal branch 6 times, most recently from fc55954 to b2ca59c Compare August 31, 2019 15:53
@MasterDuke17 MasterDuke17 mentioned this pull request Aug 31, 2019
@MasterDuke17 MasterDuke17 force-pushed the barrett_todecimal branch 5 times, most recently from 69d57ff to f84b69c Compare August 31, 2019 16:31
@minad
Copy link
Member

minad commented Sep 1, 2019

Close in favor of #331

@minad minad closed this Sep 1, 2019
@sjaeckel
Copy link
Member

sjaeckel commented Sep 2, 2019

Re-opened as #331 has its base to a non-existant branch now

@sjaeckel
Copy link
Member

sjaeckel commented Sep 9, 2019

My proposal would be to work on a generalized solution supporting all possible radices. Then we can just replace the existing mp_to_radix function

IIUC that's not an option as this implementation is only faster for big numbers (@MasterDuke17 mentioned something like 500 digits on an older machine and 1500 digits on a newer one, I don't know whether it's decimal- or ltm-digits)

@MasterDuke17
Copy link
Collaborator Author

My proposal would be to work on a generalized solution supporting all possible radices. Then we can just replace the existing mp_to_radix function

IIUC that's not an option as this implementation is only faster for big numbers (@MasterDuke17 mentioned something like 500 digits on an older machine and 1500 digits on a newer one, I don't know whether it's decimal- or ltm-digits)

I wasn't very clear, those are numbers I was raising two to the power of (e.g., 2^512 as you can see in some of my benchmarks in earlier commits). On my faster machine it was somewhere around mp_int.used > 10 (3ee4c5e#diff-32d269f5bde171a5a4d1b7709b3b0d4eR14).

@czurnieden has some ways to speed up this implementation (and re-writing to not be recursive may help, also trading memory for speed and keeping around the arrays of calculated values), but the constant factor may always put it slower than the current mp_to_radix implementation for small values. I don't have enough intuition to know if the constant factor can be made low enough that the loss in speed is compensated for by the decrease in complexity to switch completely to this.

@minad
Copy link
Member

minad commented Sep 9, 2019

@sjaeckel You are right. We should do something similar like karatsuba. However I don't think it makes sense to merge a specialized version for base 10 if we can get a generalized version which works as well.

@sjaeckel
Copy link
Member

@czurnieden has some ways to speed up this implementation

Cool then let's do that!

re-writing to not be recursive may help

That's also a good idea

also trading memory for speed and keeping around the arrays of calculated values

You mean something like a "registry by radix" with the pre-calculated values of nl[], shiftL[], mL[]? sounds very nice, especially since it's most likely only radices 10 and 16, probably sometimes 2 and for mtest 64!

We should do something similar like karatsuba.

👍

However I don't think it makes sense to merge a specialized version for base 10 if we can get a generalized version which works as well.

Fine by me!

I thought we could merge this version soon, so we have at least a faster mp_to_radix(..., 10) for big inputs in the next release... but if you think we should wait it's also okay.

@MasterDuke17
Copy link
Collaborator Author

@czurnieden has some ways to speed up this implementation

Cool then let's do that!

I believe @czurnieden's suggestions are straightforward, but will require me to understand more of the actual math than I do now, so it may take a while.

re-writing to not be recursive may help

That's also a good idea

I'll work on this, but same re may take a while.

also trading memory for speed and keeping around the arrays of calculated values

You mean something like a "registry by radix" with the pre-calculated values of nl[], shiftL[], mL[]? sounds very nice, especially since it's most likely only radices 10 and 16, probably sometimes 2 and for mtest 64!

Is there anything similar in the codebase I could take a look at for inspiration?

However I don't think it makes sense to merge a specialized version for base 10 if we can get a generalized version which works as well.

Fine by me!

I thought we could merge this version soon, so we have at least a faster mp_to_radix(..., 10) for big inputs in the next release... but if you think we should wait it's also okay.

If a fully generalized version would cause a delay until the release after next I'd advocate for including the base-10 version. But that's only so it be more likely to get into MoarVM faster (since my completely unmeasured opinion is that 99% of the uses of mp_to_radix are for base-10).

@minad
Copy link
Member

minad commented Sep 11, 2019

@MasterDuke17 I don't think we will add this to the next release. It needs more time to mature. In particular since there are already many changes and lots of deprecations which need to be released. But I am interested in getting this in soon, I would use this faster to_radix myself. However my current impression of this library is that it is not moving particularly fast, but rather taking it's time to figure out how things should be.

With regards to moarvm, it is not yet up to date with the latest release. And it would also probably help if moar gets ported to the tommath develop branch. This would help both projects I guess (advancing moar and feedback for tommath).

@czurnieden
Copy link
Contributor

@MasterDuke17
As it seems that you are stuck here, I digged a bit deeper into my tapes and found the old prototype/proof-of-concept code of the D&C algorithm I use. It needs a fast division, but you have one you can plug-in. It also works without fast division but the effect is, of course, much smaller.

It got adapted it to the current LTM so it can run (tests, including timing are in demo/test.c. Without fast division and with 180k dec. dig. it is 0.something seconds in contrast to 19 seconds for bigint2string and 0.something seconds in contrast to 3 seconds for string2bigint here), but it is still the prototype, so it needs a good cleanup, a bit of shuffling around to make the cache threadsafe and your Barrett division plugged in. It is also tunable to some extent and has a lot of room for optimizations.

I've put it in my fork; the two functions are mp_get_str and mp_set_str respectively. Both are wrappers around mp_read_radix and mp_to_radix and hence are easily made optional (7.3k for bn_mp_get_str.o and 4k for bn_mp_set_str.o on my machine, they are not that big ) but harder to build into mp_read_radix and mp_to_radix themselfes.

It is a quite straightforward textbook implementation of the D&C algorithm described by many and should be easy to follow.

@MasterDuke17
Copy link
Collaborator Author

@czurnieden Ooo, those do look helpful indeed, thanks! I'll have a go at plugging in my Barrett division.

@czurnieden
Copy link
Contributor

@MasterDuke17

Couldn't sleep, so did the main cleanup in bn_mp_set_str.c (string2bigint) and made it stand-alone (with option, see comments there). It is twice as large as bn_mp_read_radix.o but O(log(n) * M(n)) (In the current LTM M(n) = n^1.46 with Toom-Cook 3-way) instead of O(n^2).

But a fast string2bigint isn't very urgent, a fast bigint2string on the other side is. Any progress made with it? Can I help?

@MasterDuke17
Copy link
Collaborator Author

MasterDuke17 commented Sep 17, 2019 via email

@czurnieden
Copy link
Contributor

@MasterDuke17

Did most of the chores in bn_mp_get_str.c. It should be ready now to fill the cache with the approximate inverses and use multiplication and Barrett's correction instead of mp_div.
A hint to get the k in (2^k)/n: in a/(base^n) a is more than 2 times larger in magnitude than base^n. If you take that large multiplier (that is k = floor(log_2(a))) the error is unity at most, only one correction is necessary.

For further reading try Brent et al.: "Modern Computer Arithmetic" section 2.4.1. and Kalles, pardon, Karl Hasselström's master thesis "Fast Division of Large Integers" (shows correction if the error is bigger than unity allowing for smaller k), and the wiki-page, of course, for the error computation.

Another hint: Barrett division is not as complicated as it seems, so if you think you need to do something complicated you are probably on the wrong path (optimizing k is a bit hard, but you don't need it now, only later and only if it is way too slow without).

Example for a/n (all divisions here are truncating) resulting in q the quotient and r the remainder (assuming I didn't do anything wrong here ;-) )

Input a, n with log(a) >= 2*log(n) (roughly, the exact ranges and proportions are described in the papers linked above)

k = floor(log_2(a)) # which is too large, but good enough for now
m = (2^k)/n         # the approximate reciprocal
q = (a * m) >> k    # compute approximate quotient
r = a - q * n       # compute approximate remainder
if (r >= n)         # check if off-by-one
   q = q + 1        # correct quotient (one too small with the value of k chosen) 
   r = r - n        # correct remainder (one n too big with the value of k chosen)
endif
return q,r

@minad minad added this to the v2.0.0 milestone Oct 3, 2019
@MasterDuke17
Copy link
Collaborator Author

@czurnieden I'm making progress. I now have a Frankenstein combination of my bn_s_mp_to_decimal_fast.c and your bn_mp_get_str.c that sort of works. However, I currently have two problems. The first is that my new combined code is adding some extra leading 0s. The second is that valgrind does not like your code (nor the combined version). It's complaining about Invalid write of size 1 here https://github.com/czurnieden/libtommath/blob/faster_number_conversion/bn_mp_get_str.c#L55 and Invalid read of size 1 here https://github.com/czurnieden/libtommath/blob/faster_number_conversion/bn_mp_get_str.c#L116 and Conditional jump or move depends on uninitialised value(s) here https://github.com/czurnieden/libtommath/blob/faster_number_conversion/bn_mp_get_str.c#L115. If you're interested, this is what I have so far: https://gist.github.com/MasterDuke17/cae7ba73306ad315372a570b041c9c52

@czurnieden
Copy link
Contributor

The first is that my new combined code is adding some extra leading 0s

Yeah, that part is a bit "how you doin'" and may need some extra massage.

The second is that valgrind does not like your code

It is more or less a "working sketch" to give you one way to tackle your problem as an example. It makes me wonder that valgrind didn't crash&burn completely ;-)

I uploaded a slightly more "ready to go" version (none of the functions have their if( (err = mp_foo()) != MP_OKAY) goto ERR; to make it more readable). It even survived valgrind ;-)

@MasterDuke17
Copy link
Collaborator Author

Hm, at this point I'm not sure what's missing from your new version (other than the error handling and such)? Is there actually something that should be ported in from my code?

@czurnieden
Copy link
Contributor

Is there actually something that should be ported in from my code?

You still have the base-10 optimized part the computing of the reciprocal) which should find it's way into it because the whole thing would work much better with fast division in general but as the most used bases are probably 16, 10, and 64, and maybe base 2, too. The power-of-two bases are quickly computed, only base 10 is more work.

There is even a discussion here to reduce the functionality in a later LTM version to just those power-of-two bases and base 10.

So I would say that you should concentrate on what you started with and implement a (preprocessor/ MP_HAS()) branch for base 10 (the power-of-two bases will get their optimization in mp_to/from_radix) such that people who love their odd bases can use the general version and the rest can get the handoptimized ones.
As noted above: the latter might be the default in version 2.0.

Soooo: no cheap excuses here, go on! :-)

@czurnieden
Copy link
Contributor

Please rebase from time to time.
It gets too hectic otherwise as I can tell from experience ;-)

@minad minad removed their request for review October 8, 2019 20:35
@czurnieden
Copy link
Contributor

@MasterDuke17 took the liberty to take the algorithm apart and wrote down what I found.
(I don't think I used more Latex packages than exist in a recent TeTex bundle but if it doesn't build drop me a note)
Translating from code to Latex might have introduced some errors, please check.

It seems as if all of the odd radices get dropped in version 2.0.0 (who actully needs more than the base 2, 10, 16 and for saving space 64?) so I think you can drop the rest and concentrate on refining and optimizing what you started with: base 10. I think Lars Hellström made some suggestions in his post archived at Sourceforge like dropping recursion and adding some additional caching.

Thank you for your patience!

@czurnieden
Copy link
Contributor

@sjaeckel

if I could choose I'd really love to see an explanation on how the algorithm works

It took me a bit, but I finally got it.
I think ;-)
Put it a LaTeX file here.

@czurnieden
Copy link
Contributor

czurnieden commented Oct 31, 2019

@MasterDuke17

I extended my description and hope it is complete now (sans the iterative version(s)). If not feel free to ask me, I know it is still a bit of a mess.

Please rebase.

I wanted to give Github's web editor a try and do it for you but it seems as if I'm probably not the guy for web editors ;-)

You can take the develop branch but take care that your stuff doesn't get lost (demo/test.c is sometimes a bit fuddly to rebase). You don't need to manually siff through the makefiles, just delete all of the content of the OBJECTS variable (but leave the equal sign), delete tommath_class.h completely and run ./helper.pl -u to regenerate all of the makefiles and tommath_class.h. You need to run a make astyle afterwards (for the regenerated tommath_class.h).

@MasterDuke17
Copy link
Collaborator Author

@MasterDuke17

I extended my description and hope it is complete now (sans the iterative version(s)). If not feel free to ask me, I know it is still a bit of a mess.

Cool. If I understand it correctly we would need n==24 to be able to handle the largest possible number? I've been a bit distracted, but I hope to have some time in the next couple of days to do some more work.

@czurnieden
Copy link
Contributor

If I understand it correctly we would need n==24 to be able to handle the largest possible number?

The names used in that text should be n for the base (n = 10^3 for base 10), s for the shift (s = 20 for base 10) and k for the exponent (the number of squarings or the index into the lists if you like). If that is mixed up somewhere it is a bug in my text and I would appreciate it if you would drop me a note if that is the case.
So I interpret the number in your question as k == 24. Is that correct?

The exact type of the sizes is still in limbo (size_t or not, that seems to be the question here.) but I think I calculated k = 22 for the current type? Yes, I should clean it up a bit, it's quite a mess, admitted.

I wouldn't go with static arrays in the first place but use the heap instead. You can calculate the number of reciprocals in advance and check the result (and repair if necessary), so I would just malloc the right amount and keep the lib a bit smaller that way.

Oh, and although I added the necessary information to extend it to all of the other bases you can stay with base 10 for now.

Did I add the "bigger chunks" stuff? No, I didn't, I was lying when I said it is complete ;-)
OK, will add but not today.

@czurnieden
Copy link
Contributor

@MasterDuke17
Took the liberty to update my description. Added the necessary numerical corrections for the other bases and generalized the algorithm to use all these bases. You do not have to use the optimization for the large leafs for but keep in mind that even the small leafs need a different size of your small buffer sprintf_str now (between 4 and 9 bytes). Will be a bit fuddly, admitted, but should not be a giant problem.

@MasterDuke17
Copy link
Collaborator Author

@czurnieden Great, thanks! I've been sidetracked by getting MoarVM upgraded to libtommath 1.2.0. We just recently turned on -Wall and -Wextra when building MoarVM and that threw up a ton of warnings in our code that uses libtommath (e.g., we were almost never handling return values and that now warns). But I should have that done soon and next thing on the list is working on this.

@minad
Copy link
Member

minad commented Dec 7, 2019

@MasterDuke17 Good to hear that you are doing the update! The last time I looked, the MoarVM codebase had some math functions which are now available in upstream ltm, e.g., floating point conversion and bitwise operations. If something else is missing here let us know!

@sjaeckel
Copy link
Member

Looks like @MasterDuke17 won't finish this, as MoarVM has decided to go for gmp instead of ltm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants