
System.Numerics.BigInteger: Add/Subtract performance can be improved when size of arguments are different #83457

Closed

speshuric opened this issue Mar 15, 2023 · 15 comments

@speshuric
Contributor

speshuric commented Mar 15, 2023

Description

System.Numerics.BigInteger add and subtract operations for non-trivial cases are implemented in the Add/Subtract static methods of the internal class BigIntegerCalculator. The current implementation can be improved by specially handling the case where carry == 0 once the position being processed is beyond the length of the right (shorter) argument but still within the length of the left (longer) argument.

The behavior is reproducible in most environments. This is not a regression but a proposed new optimization.

The main idea can be demonstrated with this part of the Add method:

int i = 0;
long carry = 0L;
// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...

for ( ; i < right.Length; i++)
{
    long digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
for ( ; i < left.Length; i++)
{
    // "target loop"
    long digit = left[i] + carry;
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
Unsafe.Add(ref resultPtr, i) = (uint)carry;

In the second loop (marked // "target loop"), once carry becomes 0 it can never become 1 again: with carry == 0, digit = left[i] + 0 is at most uint.MaxValue, so digit >> 32 stays 0. From that point on, the tail of the loop merely copies the remaining values of the left argument into the result span.

Analysis

BigIntegerCalculator currently contains six static methods for add and subtract:

  • public static void Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits) - used when the right argument's length is 1
  • public static void Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) - default algorithm for addition
  • private static void AddSelf(Span<uint> left, ReadOnlySpan<uint> right) - checks the carry and breaks the loop
  • public static void Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits) - used when the right argument's length is 1
  • public static void Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) - default algorithm for subtraction
  • private static void SubtractSelf(Span<uint> left, ReadOnlySpan<uint> right) - checks the carry and breaks the loop

AddSelf and SubtractSelf are used internally by the SquMul part of BigIntegerCalculator; they cannot be optimized this way and are not considered below. Add and Subtract can be optimized almost identically, so only the Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) case is used in the description below.

The following statements hold for the "target loop":

  1. carry can only be 0 or 1.
  2. If the incoming carry is 1, the outgoing carry is 1 if and only if left[i] == uint.MaxValue; in that case digit == 2^32, so result[i] == 0.
  3. If the incoming carry is 0, then result[i] = left[i] for every subsequent i. This assignment can be optimized by removing the arithmetic entirely and using platform-optimized data-copying methods.

So it can be rewritten as follows:

int i = 0;
ulong carry = 0UL;

// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...

for ( ; i < right.Length; i++) // this loop was not modified
{
    ulong digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
for ( ; carry != 0 && i < left.Length; i++) // carry != 0 is checked
{
    ulong digit = Unsafe.Add(ref leftPtr, i) + carry;
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
if (i < left.Length)
{
    // only move data from left argument to result
    do 
    {
        Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i);
        i++;
    } while (i < left.Length);

    // Note: if the remaining part of left is large, the dedicated CopyTo method is better:
    //left.Slice(i).CopyTo(bits.Slice(i));

    i = left.Length;
    carry = 0;
}
Unsafe.Add(ref resultPtr, i) = unchecked((uint)carry);

Methods of data movement

In this variant, the second loop checks when carry becomes 0, at which point the special case (a plain copy) is triggered. Two possible ways to copy the data can be considered:

  1. A loop with Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++;
  2. A copy with the platform-dependent Span.CopyTo(): left.Slice(i).CopyTo(bits.Slice(i));. This internally calls the highly optimized Buffer.Memmove(), but Span.CopyTo() does some additional checks and can therefore be slower on short slices.

On my PC (Ryzen 5700G CPU), in my draft benchmarks the second approach is faster when approximately left.Length - right.Length >= 16; a size-dependent dispatch between the two is sketched below.
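
A minimal sketch of such a size-dependent tail copy, written as a helper that could be extracted from the rewritten Add above; the CopyTail name and the threshold value are illustrative assumptions, not tuned constants:

// Illustrative sketch only: copies the untouched tail of `left` into `bits`,
// choosing the copy strategy by the remaining length. The threshold value is
// an assumption for illustration and would have to be tuned by benchmarks.
private const int CopyToThreshold = 16;

private static void CopyTail(ReadOnlySpan<uint> left, Span<uint> bits, int start)
{
    if (left.Length - start >= CopyToThreshold)
    {
        // Long tail: a single call into the platform-optimized memmove path.
        left.Slice(start).CopyTo(bits.Slice(start));
    }
    else
    {
        // Short tail: a plain element-wise loop avoids CopyTo's fixed overhead.
        for (int i = start; i < left.Length; i++)
        {
            bits[i] = left[i];
        }
    }
}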

Best and worst cases

The best case for the new version is argument data for which carry becomes 0 immediately.
The worst case is when the carry is always 1, i.e. all left[i] == uint.MaxValue.
The difference should be larger when left.Length - right.Length is large, and there should be no difference when the lengths are equal.

Draft benchmarks

I have written some draft microbenchmarks to measure the difference. These benchmarks test three versions of the Add method:

  • AddOld - the code in the current runtime
  • AddNew - with the Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++; loop
  • AddNewMemmove - with left.Slice(i).CopyTo(bits.Slice(i)); instead of the loop

Two cases are tested:

  • bad - when all left[i] and right[i] are uint.MaxValue
  • good - when all left[i] and right[i] are 1

Tests cover left.Length from 2 to 256 (doubling at each step, i.e. 2, 4, 8, 16, 32, 64, 128, 256) and right.Length from 2 to left.Length (also doubling at each step).

The draft benchmarks used the [InvocationCount(10000000)] attribute, so there can be some inaccuracy at small lengths. A sketch of the harness shape follows.
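
For reference, a minimal sketch of what such a harness could look like with BenchmarkDotNet; the data fill shown is the "bad" case, and the AddOld/AddNew bodies are stand-ins for the copied implementations, so everything beyond the description above is an assumption:

using System;
using BenchmarkDotNet.Attributes;

[InvocationCount(10000000)]
public class AddBench
{
    // Sketch note: the real runs only used rightSize <= leftSize pairs.
    [Params(2, 4, 8, 16, 32, 64, 128, 256)]
    public int leftSize;

    [Params(2, 4, 8, 16, 32, 64, 128, 256)]
    public int rightSize;

    private uint[] _left = Array.Empty<uint>();
    private uint[] _right = Array.Empty<uint>();
    private uint[] _bits = Array.Empty<uint>();

    [GlobalSetup]
    public void Setup()
    {
        _left = new uint[leftSize];
        _right = new uint[rightSize];
        _bits = new uint[leftSize + 1];
        Array.Fill(_left, uint.MaxValue);  // "bad" case: every digit carries;
        Array.Fill(_right, uint.MaxValue); // the "good" case would fill with 1u
    }

    [Benchmark(Baseline = true)]
    public void BenchAddOld() => AddOld(_left, _right, _bits);

    [Benchmark]
    public void BenchAddNew() => AddNew(_left, _right, _bits);

    // Stand-ins for copies of the current and modified Add implementations.
    private static void AddOld(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) { /* current code */ }
    private static void AddNew(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) { /* modified code */ }
}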

Valuable draft benchmark points

  • In general the new versions are noticeably faster.
  • In the worst cases for the new versions, the slowest new version shows a maximum regression of about 0.5-1 ns per invocation. It should be noted, however, that a whole BigInteger add operation involves extra memory copies and much more logic than the internal Add method. This needs more analysis, but at first glance the regressions appear insignificant.
  • The maximum speedup of the AddNew version is observed (as expected) for good arrays with right.Length of 2 uints and left.Length of 256 uints: the old version's mean is 122.771 ns, AddNew's mean is 65.197 ns.
  • The AddNewMemmove version is even faster on the same test (good, 256, 2): its mean is 12.797 ns, almost 10x faster than the old version!
  • AddNewMemmove is faster than AddNew when left.Length - right.Length >= 16.

Important note: the Add method accounts for only part of the whole operation's time. End-to-end results will show a comparable absolute time difference, but the relative difference will be much smaller.

To do's

  • Initial description, implement one method and make draft benchmarks
  • Choose new implementation ("ref to ref" or "Slice.CopyTo(Slice)")
  • Implement all methods
  • Full end-to-end benchmarks
  • Pull request

Open questions

  • Which implementation is preferable: "ref to ref" or "Slice.CopyTo(Slice)" or size-dependent?

Data

Draft benchmark results are below. I will publish the benchmark code soon.

Draft benchmarking data

Method Case leftSize rightSize Mean Error StdDev Median
BenchAddOld bad 2 2 2.585 ns 0.0269 ns 0.0225 ns 2.577 ns
BenchAddNew bad 2 2 2.903 ns 0.0597 ns 0.0558 ns 2.895 ns
BenchAddNewMemmove bad 2 2 3.204 ns 0.0942 ns 0.1548 ns 3.175 ns
BenchAddOld bad 4 2 3.520 ns 0.0629 ns 0.0588 ns 3.485 ns
BenchAddNew bad 4 2 3.603 ns 0.0857 ns 0.0760 ns 3.613 ns
BenchAddNewMemmove bad 4 2 3.996 ns 0.0945 ns 0.0789 ns 3.954 ns
BenchAddOld bad 4 4 3.331 ns 0.0771 ns 0.0721 ns 3.312 ns
BenchAddNew bad 4 4 3.560 ns 0.0967 ns 0.0950 ns 3.512 ns
BenchAddNewMemmove bad 4 4 4.239 ns 0.0714 ns 0.0668 ns 4.199 ns
BenchAddOld bad 8 2 5.434 ns 0.1023 ns 0.0957 ns 5.422 ns
BenchAddNew bad 8 2 5.008 ns 0.1292 ns 0.1327 ns 4.914 ns
BenchAddNewMemmove bad 8 2 5.871 ns 0.0992 ns 0.0879 ns 5.888 ns
BenchAddOld bad 8 4 5.394 ns 0.0892 ns 0.0834 ns 5.395 ns
BenchAddNew bad 8 4 5.240 ns 0.1256 ns 0.1113 ns 5.224 ns
BenchAddNewMemmove bad 8 4 6.094 ns 0.1163 ns 0.1087 ns 6.069 ns
BenchAddOld bad 8 8 5.094 ns 0.0092 ns 0.0072 ns 5.093 ns
BenchAddNew bad 8 8 5.579 ns 0.0370 ns 0.0346 ns 5.585 ns
BenchAddNewMemmove bad 8 8 5.996 ns 0.0217 ns 0.0181 ns 5.995 ns
BenchAddOld bad 16 2 8.732 ns 0.0075 ns 0.0062 ns 8.732 ns
BenchAddNew bad 16 2 8.359 ns 0.0644 ns 0.0571 ns 8.338 ns
BenchAddNewMemmove bad 16 2 8.802 ns 0.0741 ns 0.0619 ns 8.782 ns
BenchAddOld bad 16 4 8.774 ns 0.0564 ns 0.0527 ns 8.751 ns
BenchAddNew bad 16 4 8.539 ns 0.0758 ns 0.0672 ns 8.535 ns
BenchAddNewMemmove bad 16 4 9.685 ns 0.0628 ns 0.0587 ns 9.651 ns
BenchAddOld bad 16 8 8.776 ns 0.0288 ns 0.0256 ns 8.768 ns
BenchAddNew bad 16 8 8.455 ns 0.0524 ns 0.0437 ns 8.452 ns
BenchAddNewMemmove bad 16 8 9.643 ns 0.0213 ns 0.0166 ns 9.640 ns
BenchAddOld bad 16 16 9.011 ns 0.0209 ns 0.0163 ns 9.015 ns
BenchAddNew bad 16 16 9.230 ns 0.0147 ns 0.0115 ns 9.232 ns
BenchAddNewMemmove bad 16 16 9.495 ns 0.0252 ns 0.0210 ns 9.488 ns
BenchAddOld bad 32 2 15.952 ns 0.0244 ns 0.0228 ns 15.953 ns
BenchAddNew bad 32 2 16.422 ns 0.1061 ns 0.0941 ns 16.413 ns
BenchAddNewMemmove bad 32 2 16.917 ns 0.0140 ns 0.0124 ns 16.918 ns
BenchAddOld bad 32 4 16.022 ns 0.0115 ns 0.0090 ns 16.023 ns
BenchAddNew bad 32 4 16.499 ns 0.0428 ns 0.0401 ns 16.490 ns
BenchAddNewMemmove bad 32 4 16.875 ns 0.0601 ns 0.0533 ns 16.882 ns
BenchAddOld bad 32 8 16.153 ns 0.0617 ns 0.0547 ns 16.148 ns
BenchAddNew bad 32 8 16.063 ns 0.1146 ns 0.1016 ns 16.035 ns
BenchAddNewMemmove bad 32 8 16.960 ns 0.0484 ns 0.0404 ns 16.947 ns
BenchAddOld bad 32 16 16.665 ns 0.0260 ns 0.0243 ns 16.659 ns
BenchAddNew bad 32 16 16.928 ns 0.0542 ns 0.0452 ns 16.929 ns
BenchAddNewMemmove bad 32 16 17.373 ns 0.0623 ns 0.0583 ns 17.376 ns
BenchAddOld bad 32 32 17.253 ns 0.0250 ns 0.0221 ns 17.260 ns
BenchAddNew bad 32 32 17.515 ns 0.0730 ns 0.0570 ns 17.513 ns
BenchAddNewMemmove bad 32 32 18.008 ns 0.0164 ns 0.0153 ns 18.007 ns
BenchAddOld bad 64 2 30.824 ns 0.0548 ns 0.0458 ns 30.817 ns
BenchAddNew bad 64 2 30.854 ns 0.0239 ns 0.0187 ns 30.858 ns
BenchAddNewMemmove bad 64 2 31.479 ns 0.0456 ns 0.0404 ns 31.481 ns
BenchAddOld bad 64 4 30.796 ns 0.0182 ns 0.0161 ns 30.793 ns
BenchAddNew bad 64 4 29.935 ns 0.0929 ns 0.0869 ns 29.923 ns
BenchAddNewMemmove bad 64 4 30.630 ns 0.1631 ns 0.1526 ns 30.653 ns
BenchAddOld bad 64 8 30.775 ns 0.0837 ns 0.0783 ns 30.784 ns
BenchAddNew bad 64 8 31.110 ns 0.0434 ns 0.0385 ns 31.101 ns
BenchAddNewMemmove bad 64 8 31.391 ns 0.0681 ns 0.0603 ns 31.402 ns
BenchAddOld bad 64 16 31.845 ns 0.0911 ns 0.0852 ns 31.856 ns
BenchAddNew bad 64 16 32.334 ns 0.0686 ns 0.0641 ns 32.316 ns
BenchAddNewMemmove bad 64 16 32.012 ns 0.0781 ns 0.0730 ns 32.003 ns
BenchAddOld bad 64 32 34.869 ns 0.0747 ns 0.0698 ns 34.873 ns
BenchAddNew bad 64 32 35.400 ns 0.1425 ns 0.1333 ns 35.426 ns
BenchAddNewMemmove bad 64 32 34.899 ns 0.0755 ns 0.0707 ns 34.914 ns
BenchAddOld bad 64 64 37.207 ns 0.1676 ns 0.1568 ns 37.125 ns
BenchAddNew bad 64 64 37.827 ns 0.0352 ns 0.0312 ns 37.828 ns
BenchAddNewMemmove bad 64 64 38.326 ns 0.0473 ns 0.0442 ns 38.324 ns
BenchAddOld bad 128 2 64.390 ns 0.0895 ns 0.0747 ns 64.372 ns
BenchAddNew bad 128 2 64.430 ns 0.1261 ns 0.1117 ns 64.400 ns
BenchAddNewMemmove bad 128 2 61.757 ns 0.0785 ns 0.0696 ns 61.758 ns
BenchAddOld bad 128 4 64.489 ns 0.0966 ns 0.0856 ns 64.508 ns
BenchAddNew bad 128 4 61.862 ns 0.1050 ns 0.0982 ns 61.818 ns
BenchAddNewMemmove bad 128 4 64.728 ns 0.0337 ns 0.0299 ns 64.733 ns
BenchAddOld bad 128 8 59.975 ns 0.0802 ns 0.0711 ns 59.958 ns
BenchAddNew bad 128 8 60.545 ns 0.0964 ns 0.0902 ns 60.567 ns
BenchAddNewMemmove bad 128 8 60.686 ns 0.0538 ns 0.0477 ns 60.691 ns
BenchAddOld bad 128 16 60.303 ns 0.1191 ns 0.1056 ns 60.333 ns
BenchAddNew bad 128 16 61.590 ns 0.7334 ns 0.6860 ns 61.163 ns
BenchAddNewMemmove bad 128 16 61.623 ns 0.0646 ns 0.0573 ns 61.618 ns
BenchAddOld bad 128 32 63.400 ns 0.2310 ns 0.2048 ns 63.358 ns
BenchAddNew bad 128 32 64.186 ns 0.2440 ns 0.2282 ns 64.057 ns
BenchAddNewMemmove bad 128 32 64.191 ns 0.0389 ns 0.0304 ns 64.195 ns
BenchAddOld bad 128 64 69.698 ns 0.1385 ns 0.1228 ns 69.638 ns
BenchAddNew bad 128 64 70.742 ns 0.2145 ns 0.2006 ns 70.772 ns
BenchAddNewMemmove bad 128 64 70.199 ns 0.2155 ns 0.1800 ns 70.201 ns
BenchAddOld bad 128 128 81.541 ns 0.0562 ns 0.0526 ns 81.522 ns
BenchAddNew bad 128 128 81.757 ns 0.0175 ns 0.0163 ns 81.761 ns
BenchAddNewMemmove bad 128 128 82.336 ns 0.0408 ns 0.0362 ns 82.331 ns
BenchAddOld bad 256 2 122.853 ns 0.2910 ns 0.2579 ns 122.735 ns
BenchAddNew bad 256 2 122.601 ns 0.0401 ns 0.0335 ns 122.603 ns
BenchAddNewMemmove bad 256 2 118.404 ns 0.4241 ns 0.3967 ns 118.360 ns
BenchAddOld bad 256 4 122.572 ns 0.0952 ns 0.0844 ns 122.575 ns
BenchAddNew bad 256 4 123.188 ns 0.0382 ns 0.0358 ns 123.190 ns
BenchAddNewMemmove bad 256 4 118.899 ns 0.3367 ns 0.3150 ns 119.049 ns
BenchAddOld bad 256 8 122.766 ns 0.0753 ns 0.0629 ns 122.782 ns
BenchAddNew bad 256 8 118.796 ns 0.4198 ns 0.3927 ns 118.989 ns
BenchAddNewMemmove bad 256 8 121.238 ns 0.2532 ns 0.2369 ns 121.260 ns
BenchAddOld bad 256 16 123.677 ns 0.2106 ns 0.1970 ns 123.688 ns
BenchAddNew bad 256 16 125.204 ns 0.0515 ns 0.0482 ns 125.195 ns
BenchAddNewMemmove bad 256 16 125.459 ns 0.0575 ns 0.0537 ns 125.466 ns
BenchAddOld bad 256 32 123.549 ns 0.2009 ns 0.1879 ns 123.557 ns
BenchAddNew bad 256 32 127.413 ns 0.2725 ns 0.2549 ns 127.377 ns
BenchAddNewMemmove bad 256 32 128.176 ns 0.1296 ns 0.1149 ns 128.191 ns
BenchAddOld bad 256 64 129.709 ns 0.1311 ns 0.1226 ns 129.692 ns
BenchAddNew bad 256 64 132.365 ns 0.0854 ns 0.0757 ns 132.396 ns
BenchAddNewMemmove bad 256 64 133.685 ns 0.2328 ns 0.2177 ns 133.636 ns
BenchAddOld bad 256 128 142.283 ns 0.0672 ns 0.0595 ns 142.284 ns
BenchAddNew bad 256 128 145.236 ns 0.2122 ns 0.1985 ns 145.283 ns
BenchAddNewMemmove bad 256 128 145.719 ns 0.2226 ns 0.2082 ns 145.670 ns
BenchAddOld bad 256 256 165.340 ns 0.3852 ns 0.3603 ns 165.212 ns
BenchAddNew bad 256 256 165.619 ns 0.0412 ns 0.0365 ns 165.631 ns
BenchAddNewMemmove bad 256 256 165.315 ns 0.0680 ns 0.0568 ns 165.320 ns
BenchAddOld good 2 2 2.463 ns 0.0765 ns 0.0880 ns 2.472 ns
BenchAddNew good 2 2 2.693 ns 0.0574 ns 0.0509 ns 2.685 ns
BenchAddNewMemmove good 2 2 3.429 ns 0.0930 ns 0.0870 ns 3.445 ns
BenchAddOld good 4 2 3.343 ns 0.0698 ns 0.0618 ns 3.342 ns
BenchAddNew good 4 2 3.247 ns 0.0871 ns 0.1593 ns 3.213 ns
BenchAddNewMemmove good 4 2 5.360 ns 0.0828 ns 0.0734 ns 5.321 ns
BenchAddOld good 4 4 3.344 ns 0.0888 ns 0.0830 ns 3.366 ns
BenchAddNew good 4 4 3.596 ns 0.0934 ns 0.0873 ns 3.618 ns
BenchAddNewMemmove good 4 4 4.481 ns 0.0619 ns 0.0579 ns 4.493 ns
BenchAddOld good 8 2 5.399 ns 0.0864 ns 0.0809 ns 5.401 ns
BenchAddNew good 8 2 4.347 ns 0.0988 ns 0.0876 ns 4.331 ns
BenchAddNewMemmove good 8 2 5.410 ns 0.0615 ns 0.0545 ns 5.429 ns
BenchAddOld good 8 4 5.218 ns 0.1080 ns 0.1010 ns 5.201 ns
BenchAddNew good 8 4 4.667 ns 0.0568 ns 0.0503 ns 4.640 ns
BenchAddNewMemmove good 8 4 6.658 ns 0.0953 ns 0.0845 ns 6.659 ns
BenchAddOld good 8 8 5.094 ns 0.0144 ns 0.0112 ns 5.089 ns
BenchAddNew good 8 8 5.715 ns 0.0724 ns 0.0677 ns 5.700 ns
BenchAddNewMemmove good 8 8 6.075 ns 0.0402 ns 0.0336 ns 6.067 ns
BenchAddOld good 16 2 8.978 ns 0.0386 ns 0.0361 ns 8.976 ns
BenchAddNew good 16 2 5.883 ns 0.0141 ns 0.0125 ns 5.883 ns
BenchAddNewMemmove good 16 2 5.717 ns 0.1207 ns 0.1129 ns 5.753 ns
BenchAddOld good 16 4 8.796 ns 0.0898 ns 0.0796 ns 8.763 ns
BenchAddNew good 16 4 6.440 ns 0.0853 ns 0.0798 ns 6.394 ns
BenchAddNewMemmove good 16 4 6.898 ns 0.1508 ns 0.1410 ns 6.891 ns
BenchAddOld good 16 8 8.804 ns 0.0696 ns 0.0543 ns 8.797 ns
BenchAddNew good 16 8 7.118 ns 0.0780 ns 0.0729 ns 7.095 ns
BenchAddNewMemmove good 16 8 8.060 ns 0.0123 ns 0.0103 ns 8.057 ns
BenchAddOld good 16 16 8.765 ns 0.0132 ns 0.0117 ns 8.766 ns
BenchAddNew good 16 16 9.316 ns 0.0177 ns 0.0148 ns 9.311 ns
BenchAddNewMemmove good 16 16 9.472 ns 0.0136 ns 0.0120 ns 9.476 ns
BenchAddOld good 32 2 15.976 ns 0.0149 ns 0.0132 ns 15.976 ns
BenchAddNew good 32 2 10.003 ns 0.0572 ns 0.0477 ns 9.994 ns
BenchAddNewMemmove good 32 2 6.473 ns 0.1306 ns 0.1222 ns 6.482 ns
BenchAddOld good 32 4 16.029 ns 0.0226 ns 0.0211 ns 16.021 ns
BenchAddNew good 32 4 10.277 ns 0.0319 ns 0.0283 ns 10.268 ns
BenchAddNewMemmove good 32 4 7.188 ns 0.0847 ns 0.0792 ns 7.169 ns
BenchAddOld good 32 8 16.301 ns 0.3195 ns 0.3803 ns 16.152 ns
BenchAddNew good 32 8 12.358 ns 0.2750 ns 0.4670 ns 12.136 ns
BenchAddNewMemmove good 32 8 8.976 ns 0.0233 ns 0.0182 ns 8.983 ns
BenchAddOld good 32 16 16.970 ns 0.0710 ns 0.0630 ns 16.966 ns
BenchAddNew good 32 16 13.146 ns 0.1049 ns 0.0819 ns 13.162 ns
BenchAddNewMemmove good 32 16 12.155 ns 0.0423 ns 0.0353 ns 12.152 ns
BenchAddOld good 32 32 17.361 ns 0.0498 ns 0.0442 ns 17.359 ns
BenchAddNew good 32 32 17.444 ns 0.0156 ns 0.0145 ns 17.443 ns
BenchAddNewMemmove good 32 32 17.859 ns 0.0275 ns 0.0229 ns 17.859 ns
BenchAddOld good 64 2 30.810 ns 0.0510 ns 0.0452 ns 30.802 ns
BenchAddNew good 64 2 17.049 ns 0.0184 ns 0.0164 ns 17.042 ns
BenchAddNewMemmove good 64 2 7.111 ns 0.1680 ns 0.1571 ns 7.145 ns
BenchAddOld good 64 4 30.790 ns 0.0456 ns 0.0404 ns 30.803 ns
BenchAddNew good 64 4 17.728 ns 0.0312 ns 0.0276 ns 17.732 ns
BenchAddNewMemmove good 64 4 8.134 ns 0.0892 ns 0.0834 ns 8.144 ns
BenchAddOld good 64 8 30.834 ns 0.0719 ns 0.0673 ns 30.846 ns
BenchAddNew good 64 8 19.482 ns 0.0262 ns 0.0205 ns 19.482 ns
BenchAddNewMemmove good 64 8 9.674 ns 0.0432 ns 0.0361 ns 9.662 ns
BenchAddOld good 64 16 31.804 ns 0.0847 ns 0.0792 ns 31.775 ns
BenchAddNew good 64 16 20.717 ns 0.0970 ns 0.0907 ns 20.698 ns
BenchAddNewMemmove good 64 16 12.849 ns 0.0284 ns 0.0222 ns 12.846 ns
BenchAddOld good 64 32 34.382 ns 0.1602 ns 0.1498 ns 34.429 ns
BenchAddNew good 64 32 27.406 ns 0.0531 ns 0.0443 ns 27.407 ns
BenchAddNewMemmove good 64 32 20.663 ns 0.0403 ns 0.0377 ns 20.662 ns
BenchAddOld good 64 64 37.511 ns 0.0232 ns 0.0217 ns 37.512 ns
BenchAddNew good 64 64 37.314 ns 0.0940 ns 0.0879 ns 37.339 ns
BenchAddNewMemmove good 64 64 38.143 ns 0.0108 ns 0.0090 ns 38.143 ns
BenchAddOld good 128 2 64.610 ns 0.0514 ns 0.0480 ns 64.607 ns
BenchAddNew good 128 2 35.626 ns 0.0914 ns 0.0810 ns 35.602 ns
BenchAddNewMemmove good 128 2 8.729 ns 0.1174 ns 0.1041 ns 8.726 ns
BenchAddOld good 128 4 64.728 ns 0.1634 ns 0.1448 ns 64.763 ns
BenchAddNew good 128 4 36.102 ns 0.0315 ns 0.0263 ns 36.106 ns
BenchAddNewMemmove good 128 4 9.951 ns 0.0793 ns 0.0662 ns 9.953 ns
BenchAddOld good 128 8 59.980 ns 0.0317 ns 0.0297 ns 59.987 ns
BenchAddNew good 128 8 33.452 ns 0.1252 ns 0.1171 ns 33.407 ns
BenchAddNewMemmove good 128 8 10.903 ns 0.1075 ns 0.0953 ns 10.847 ns
BenchAddOld good 128 16 60.148 ns 0.0566 ns 0.0473 ns 60.139 ns
BenchAddNew good 128 16 36.152 ns 0.0371 ns 0.0329 ns 36.146 ns
BenchAddNewMemmove good 128 16 14.780 ns 0.0253 ns 0.0197 ns 14.776 ns
BenchAddOld good 128 32 62.632 ns 0.1169 ns 0.1037 ns 62.623 ns
BenchAddNew good 128 32 41.202 ns 0.1629 ns 0.1444 ns 41.133 ns
BenchAddNewMemmove good 128 32 22.425 ns 0.0440 ns 0.0368 ns 22.414 ns
BenchAddOld good 128 64 69.786 ns 0.2435 ns 0.2278 ns 69.657 ns
BenchAddNew good 128 64 53.662 ns 0.0841 ns 0.0787 ns 53.662 ns
BenchAddNewMemmove good 128 64 41.252 ns 0.0296 ns 0.0262 ns 41.247 ns
BenchAddOld good 128 128 81.869 ns 0.0553 ns 0.0462 ns 81.866 ns
BenchAddNew good 128 128 82.525 ns 0.0267 ns 0.0250 ns 82.525 ns
BenchAddNewMemmove good 128 128 82.764 ns 0.0255 ns 0.0226 ns 82.761 ns
BenchAddOld good 256 2 122.771 ns 0.1330 ns 0.1179 ns 122.739 ns
BenchAddNew good 256 2 65.197 ns 0.1967 ns 0.1840 ns 65.169 ns
BenchAddNewMemmove good 256 2 12.797 ns 0.1026 ns 0.0960 ns 12.776 ns
BenchAddOld good 256 4 122.580 ns 0.1142 ns 0.0891 ns 122.547 ns
BenchAddNew good 256 4 67.334 ns 1.3588 ns 1.9487 ns 66.053 ns
BenchAddNewMemmove good 256 4 12.118 ns 0.1328 ns 0.1243 ns 12.057 ns
BenchAddOld good 256 8 122.775 ns 0.0513 ns 0.0455 ns 122.769 ns
BenchAddNew good 256 8 67.161 ns 0.1645 ns 0.1538 ns 67.102 ns
BenchAddNewMemmove good 256 8 15.188 ns 0.1121 ns 0.1049 ns 15.137 ns
BenchAddOld good 256 16 123.086 ns 0.2531 ns 0.2244 ns 123.022 ns
BenchAddNew good 256 16 69.501 ns 0.0621 ns 0.0551 ns 69.495 ns
BenchAddNewMemmove good 256 16 18.547 ns 0.0834 ns 0.0780 ns 18.528 ns
BenchAddOld good 256 32 124.622 ns 0.1848 ns 0.1729 ns 124.577 ns
BenchAddNew good 256 32 74.259 ns 0.0878 ns 0.0778 ns 74.234 ns
BenchAddNewMemmove good 256 32 26.275 ns 0.1607 ns 0.1425 ns 26.231 ns
BenchAddOld good 256 64 129.737 ns 0.1050 ns 0.0931 ns 129.750 ns
BenchAddNew good 256 64 86.767 ns 0.1327 ns 0.1242 ns 86.743 ns
BenchAddNewMemmove good 256 64 44.936 ns 0.1495 ns 0.1249 ns 44.929 ns
BenchAddOld good 256 128 144.031 ns 0.1669 ns 0.1561 ns 144.024 ns
BenchAddNew good 256 128 115.261 ns 0.0782 ns 0.0693 ns 115.239 ns
BenchAddNewMemmove good 256 128 89.670 ns 0.2114 ns 0.1977 ns 89.623 ns
BenchAddOld good 256 256 165.619 ns 0.1193 ns 0.1057 ns 165.604 ns
BenchAddNew good 256 256 165.271 ns 0.2068 ns 0.1934 ns 165.296 ns
BenchAddNewMemmove good 256 256 165.865 ns 0.4035 ns 0.3774 ns 165.766 ns

@speshuric added the tenet-performance label Mar 15, 2023
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label Mar 15, 2023
@ghost

ghost commented Mar 15, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.


@speshuric
Contributor Author

Just a status update.

  • It seems that "Slice.CopyTo(Slice)" is preferable. Separate benchmarks (of these methods only, not the whole BigInteger) show that the overhead of "Slice.CopyTo(Slice)" is probably negligible.
  • Implemented both Add methods.

@speshuric
Contributor Author

Status update

  • Add and Subtract methods are implemented.

@tannergooding removed the untriaged label Mar 23, 2023
@tannergooding
Member

tannergooding commented Mar 23, 2023

We have to be careful with this type of optimization: while it will speed up the case where the carry terminates quickly, it will also slow down the case where it does not, due to the additional condition and branch per iteration.

Perf numbers showing the improvement for many different scenarios, including "worst case" scenarios, will be important.

@speshuric
Contributor Author

I've created a PR. Below I'll describe my benchmarks and their results.

@speshuric
Contributor Author

speshuric commented Mar 27, 2023

I apologize in advance to the participants for posting the results step by step rather than all at once. There are a lot of results, and it takes time to recheck them (and my personal time for this issue is very limited).

I should note that the dotnet/performance tests are unsuitable for this issue: almost all of them are "N-bit, N-bit" tests, so their usefulness here is limited. I've executed them to check for general regressions. It seems that all results are the same, unrelated, or negligibly different, but I'll post them a bit later.

For this issue I made a separate benchmark. It is quite a simple benchmark, but some remarks are in order.

This benchmark checks add/subtract operations over a wide range of arguments. All arguments are non-negative. The first argument is always greater than or equal to the second, and all such possible pairs are checked, so the test set is large and sometimes excessive.

The following numbers are in the test set:

  • 0
  • 1
  • $2^n+m$ where n is taken from an int[] sizes array and m can be -1, 0 or 1. The default set of n's is: 16, 31, 32, 128, 256, 1024, 4096, 16384, 65536

The tests were executed on a local build of the runtime's main branch and on my locally edited branch with the new add/sub versions. For the moment I've tested only the x64 CoreCLR build on Linux. My AMD Ryzen 7 5700G took about 11 hours to complete these tests. The execution time could be optimized, but it was easier for me to run the tests and check later than to tune them. Raw results will be attached soon.

Main points:

  • CopyTo wins when the length difference is more than 256 bits (8 uints in the internal array).
  • As expected, the worst cases are the decrement of $2^n$ and the increment of $2^n-1$. In these cases almost all of the time is spent in the loop with the extra carry check.
  • As expected, the best cases are the increment of $2^n$, the decrement of $2^n-1$, and others where the carry becomes zero right after the length of the smaller argument has been processed.

Max effect of optimization:

| n | time/baseline |
|-------|-----------|
| 256 | 0.96-0.98 |
| 1024 | ~0.75 |
| 4096 | ~0.65 |
| 16384 | ~0.53 |
| 65536 | ~0.51 |

(This is a very rough estimate. Details will be shown later.)

Worst cases and overall results will be discussed in the next comment.

(to be continued...)

@speshuric
Contributor Author

speshuric commented Mar 27, 2023

I've tuned the benchmark's running time; it is now about 2 hours.
The left (larger) arguments have been extended; they are now:

  • 0
  • 1
  • 2^16
  • 2^n+m where n in {31, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536, 262144}, m in -1..1

The right (shorter) arguments are (each must be smaller than or equal to the left):

  • 0
  • 1
  • 2^16
  • 2^(n/2) where n is the size of the left argument
  • 2^n + m where n is the size of the left argument, m in -1..1

One of the benchmark results is here (warning: large results).

@speshuric
Contributor Author

speshuric commented Mar 28, 2023

Benchmarks results

I've analyzed the output of the benchmarks. Here are the facts, thoughts, and conclusions.

  1. Benchmark times are not stable from launch to launch. The typical spread is about 0.5-1.5 ns, and it does not seem to depend on the code. This is not a problem for large argument values, but it is significant for small ones: the typical time to add/sub a 128-bit BigInteger is 20 ns, a 64-bit one 15-17 ns, and a trivial one (up to 31 bits) 5-7 ns. To mitigate this instability I've launched the tests with different --launchCount option values.
  2. As expected, the time of trivial add/sub is not affected by this modification. It will be excluded from the next benchmarks.
  3. As expected, the time of add/sub on values of the same length is not affected by this modification. But I have to note the cases where the result is shorter than the arguments (N-N, N-(N-1) and so on): the times in the baseline and modified runtimes are similar, and such operations take about 1.5 times as long as operations where the length is not reduced.
  4. As expected, operations where the carry flag is zeroed quickly are much faster than in the previous version. The effect is substantial when the length difference is 256 bits or more. I'll publish the table and description below in a day or two.
  5. The add operation with a high number of carries and a non-trivial right argument ((2^N-1) + 2^32 and similar) is not regressed. This is rather unexpected, but may be caused by the collateral loop modifications.
  6. The add operation with a high number of carries and a trivial right argument ((2^N-1) + 1 and similar) is regressed. The average regression is 3-5%. I think this is acceptable, but I leave the decision to the team.
  7. The subtract operation with a high number of carries and a non-trivial right argument (2^N - 2^32 and similar) is not regressed. Almost the same situation as with add.
  8. The subtract operation with a high number of carries and a trivial right argument ((2^N) - 1 and similar) is regressed. Disappointingly, the regression is 5-8% on average and grows to 9-10% as the size of the left argument grows. This needs to be investigated.
  9. The benchmark of add/sub operations with a left argument size of 64-256 needs to be redone with a higher --launchCount option value.

Latest benchmark results: Benchmarks.BigintBenchmark-report.zip

@speshuric
Contributor Author

speshuric commented Mar 28, 2023

Benchmarks of add/sub 64-256 bit integer

I've restarted these benchmarks. There is still an unstable, high spread in some cases (it changes from launch to launch), but the overall picture is clear to me. The worst cases are 3-5% slower than in the reference build; in absolute terms that is about a 1-2 ns penalty.

Benchmarks.BigintBenchmark-report64-256.zip

What else I plan to do:

  • Investigate the 2^n - 1 subtraction case
  • Make a summary of the performance changes.

@tannergooding (or anybody else): What else could I do to help you decide whether to merge or decline these changes? What can you advise checking? Should I check other execution modes (32-bit, AOT, wasm, anything else)? How can I check the performance of an ARM build?

@speshuric
Contributor Author

Now, while another batch of benchmarks is running, I've taken a look at the N-(N-1) case. It seems it is caused by this:

// Try to conserve space as much as possible by checking for wasted leading span entries
// sometimes the span has leading zeros from bit manipulation operations & and ^
for (len = value.Length; len > 0 && value[len - 1] == 0; len--);

BTW, it can be replaced by LastIndexOfAnyExcept(), which is about 4 times faster for this case, but that needs isolated benchmarks, so I'll do it later. A sketch of the replacement is below.
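
A minimal sketch of that replacement, assuming value is a span of uint as in the loop above:

// Equivalent to the scan loop above: LastIndexOfAnyExcept returns the index
// of the last element that is not 0, or -1 if all elements are 0, so adding
// 1 yields the trimmed length (0 for an all-zero span).
int len = value.LastIndexOfAnyExcept(0u) + 1;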

speshuric added a commit to speshuric/runtime that referenced this issue Apr 11, 2023
In the BigIntegerCalculator methods Add and Subtract, if the sizes of the
arguments differ, then after processing the part covered by the right
(smaller) argument there was a loop that propagated the carry value. Once
the carry becomes zero, the rest of the larger argument can simply be
copied to the result.

With this commit the second loop is interrupted when the carry becomes
zero, and the fast Span.CopyTo(Span) is applied to the remaining part.

This optimization is applied only when the size of the larger argument is
greater than or equal to the const CopyToThreshold introduced in this
commit. This const is currently 8.

Also made minor related changes to the hot loops.

See dotnet#83457 for details.
@speshuric
Contributor Author

Summary report

Plan

  • Benchmark's design and results
  • What exactly changed and why
  • Conclusion

Benchmark's design and results

I performed 3 benchmark sets:

  1. Isolated: old (baseline) and new versions of the BigIntegerCalculator.Add and BigIntegerCalculator.Subtract methods. This helps debug and tune the new versions, but it is rather synthetic and does not show the full picture. The baseline and modified versions run in the same runtime/environment.
  2. Full: the BigInteger "+" and "-" operators. The baseline and modified runtimes are selected via the corerun option.
  3. Stock: the standard dotnet/performance benchmarks.

Isolated

This benchmark tests the add/sub methods on uint spans. The benchmarked methods are:

  • BenchSubOneEmpty(int leftSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchSubOneBaseline(int leftSize) - baseline variant of Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchSubOneNew(int leftSize) - new version of Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchSubMultiEmpty(int leftSize, int rightSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchSubMultiBaseline(int leftSize, int rightSize) - baseline variant of Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchSubMultiNew(int leftSize, int rightSize) - new version of Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchAddOneEmpty(int leftSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchAddOneBaseline(int leftSize) - baseline variant of Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchAddOneNew(int leftSize) - new version of Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchAddMultiEmpty(int leftSize, int rightSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchAddMultiBaseline(int leftSize, int rightSize) - baseline variant of Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchAddMultiNew(int leftSize, int rightSize) - new version of Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)

Every benchmark method is tested in a "good" and a "bad" case, where "good" means no-carry tests and "bad" means all-carry tests. The good and bad cases for add and subtract are different.

Bench*One* tests use the following set of span lengths: {1, 2, 4, 8, 16, 64, 128, 256, 1024, 4096, 16384, 65536}. The right argument is chosen appropriately for the case ("good" or "bad").
Bench*Multi* tests use the same set of left span lengths: {1, 2, 4, 8, 16, 64, 128, 256, 1024, 4096, 16384, 65536}. The right span length can be 1, 2, 4, n/2, n-1 or n, where n is the length of the left span.
*Sub* tests set the last span element to uint.MaxValue. A sketch of the Bench*Empty* pattern follows below.
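
As a sketch, a Bench*Empty* method could look like this; the field names and sizes are assumptions, only the pattern (a dummy noinline call) comes from the description above:

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class EmptyOverheadSketch
{
    private readonly uint[] _left = new uint[256];
    private readonly uint[] _bits = new uint[256];

    // Measures span creation plus a no-op call, so this time can be treated
    // as the fixed harness overhead shared by the real benchmarks.
    [Benchmark]
    public void BenchSubOneEmpty() => Dummy(_left, 1u, _bits);

    // NoInlining prevents the JIT from inlining and removing the call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Dummy(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
    {
    }
}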

All tests are executed on following environments:

  • "Main": AMD Ryzen 7 5700GE CPU, 128 GB RAM, OS: Linux, the latest CLR built locally, X64 RyuJIT runtime.
  • "Old": Intel Xeon W3690 CPU, 48 GB RAM, OS Windows 11, the latest CLR built locally, X64 RyuJIT runtime.
  • "Old VM": Intel Xeon W3690 CPU, 48 GB RAM, OS Linux in Virtualbox, the latest CLR built locally, X64 RyuJIT runtime.

The first environment is roughly twice as fast as the other two, but the overall relative performance results are similar, so only the results for "Main" are provided.

This command was used to execute the tests:

dotnet run -c Release -- -f "*" --coreRun $corerun_path --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 --launchCount 7 --allStats -m --join

Used options:

  • --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 turned out to be sufficient for stable results within one launch
  • --launchCount 7 - results are slightly volatile from launch to launch; this is smoothed out by the --launchCount option
  • --allStats helps to analyze volatile results
  • -m - memory stats; 0 bytes allocated, as expected
  • --join - joins all results into one report

Full results are attached.

Isolated results summary

Bench*One*:

BenchAddOneEmpty: 1.6 ns, BenchSubOneEmpty: 2.3 ns

| Method | leftSize | All-carry baseline mean, ns | All-carry new mean, ns | All-carry Ratio | All-carry RatioSD | No-carry baseline mean, ns | No-carry new mean, ns | No-carry Ratio | No-carry RatioSD | Comment |
|---|---|---|---|---|---|---|---|---|---|---|
| Add | 1 | 2.11 | 2.129 | 1.01 | 0.01 | 2.091 | 2.112 | 1.01 | 0.01 | |
| Sub | 1 | 2.942 | 3.161 | 1.07 | 0.04 | 2.954 | 3.118 | 1.06 | 0.05 | 0.2 ns regress |
| Add | 2 | 2.447 | 2.353 | 0.96 | 0.02 | 2.382 | 2.357 | 0.99 | 0.02 | |
| Sub | 2 | 3.241 | 3.388 | 1.05 | 0.03 | 3.221 | 3.426 | 1.06 | 0.03 | 0.2 ns regress |
| Add | 4 | 3.348 | 2.982 | 0.89 | 0.04 | 3.303 | 2.978 | 0.90 | 0.04 | ++ |
| Sub | 4 | 4.062 | 3.972 | 0.98 | 0.05 | 3.985 | 4.029 | 1.01 | 0.03 | |
| Add | 8 | 4.661 | 4.162 | 0.90 | 0.06 | 4.917 | 4.151 | 0.85 | 0.06 | ++ |
| Sub | 8 | 5.589 | 5.605 | 1.00 | 0.07 | 5.466 | 5.635 | 1.01 | 0.05 | |
| Add | 16 | 8.46 | 8.832 | 1.05 | 0.05 | 8.592 | 5.099 | 0.59 | 0.03 | All-carry: 0.4 ns regress |
| Sub | 16 | 8.878 | 9.407 | 1.06 | 0.02 | 8.824 | 6.12 | 0.69 | 0.01 | All-carry: 0.4 ns regress |
| Add | 64 | 29.638 | 29.372 | 0.99 | 0.05 | 30.856 | 8.994 | 0.29 | 0.03 | |
| Sub | 64 | 30.935 | 40.043 | 1.30 | 0.09 | 33.258 | 9.57 | 0.29 | 0.02 | All-carry: 30% regress, see highlights |
| Add | 128 | 61.288 | 62.448 | 1.02 | 0.01 | 61.096 | 12.841 | 0.21 | 0.01 | |
| Sub | 128 | 62.155 | 67.726 | 1.09 | 0.01 | 62.52 | 14.201 | 0.23 | 0 | |
| Add | 256 | 117.518 | 117.935 | 1.00 | 0 | 117.533 | 23.446 | 0.20 | 0.01 | |
| Sub | 256 | 118.833 | 123.06 | 1.04 | 0 | 118.701 | 24.195 | 0.20 | 0 | |
| Add | 1024 | 453.284 | 450.504 | 0.99 | 0 | 452.394 | 178.871 | 0.40 | 0 | |
| Sub | 1024 | 453.14 | 455.689 | 1.01 | 0 | 454.086 | 179.465 | 0.40 | 0 | |
| Add | 4096 | 1784.84 | 1789.858 | 1.00 | 0.01 | 1781.986 | 715.593 | 0.40 | 0 | |
| Sub | 4096 | 1781.051 | 1793.831 | 1.01 | 0 | 1781.391 | 717.848 | 0.40 | 0 | |
| Add | 16384 | 7187.529 | 7165.47 | 1.00 | 0 | 7185.807 | 2931.655 | 0.41 | 0 | |
| Sub | 16384 | 7155.643 | 7193.452 | 1.01 | 0 | 7179.188 | 2931.865 | 0.41 | 0 | |
| Add | 65536 | 28807.404 | 28803.928 | 1.00 | 0.01 | 28730.689 | 11684.663 | 0.41 | 0 | |
| Sub | 65536 | 28870.218 | 28809.959 | 1.00 | 0.01 | 28880.531 | 11690.097 | 0.40 | 0 | |

Highlights:

  • The only significant slowdown (10 ns, 30%) is subtraction from a 64-uint span in the "all-carry" case. This result is reproducible on my PC, but it seems to be a very specific case; I do not think it is a blocker.
  • In no-carry cases, when the length is in the range 64..256, the new variant is up to 5 times faster, but when the length is 1024+, the difference is only about 2.5 times. This is odd, but I have not investigated it.
  • Note that uint is 32 bits wide, so, for example, a 16-uint span corresponds to numbers around 2^512 (16 × 32 = 512 bits).

Bench*Multi*:

BenchAddMultiEmpty: 1.95 ns, BenchSubMultiEmpty: 1.74 ns

The result table is quite large, but the overall conclusion is the same: all-carry cases for small (1-8 uint) numbers are slower by less than 1 ns, and the average slowdown for larger numbers is typically up to 3-6%.
There are a few cases where the slowdown is more than 6% and more than 1 ns; I'd like to comment on them:

| Method | Case | leftSize | rightSize | Mean, ns | Ratio | RatioSD | Comment |
|---|---|---|---|---|---|---|---|
| Sub | no carry | 4 | 4 | 10.564 | 3.26 | 0.96 | A statistical artifact; maybe a background task started. Not reproduced in other tests. Min time is 3.6 ns. |
| Sub | no carry | 16 | 16 | 18.604 | 1.64 | 0.76 | A statistical artifact; maybe a background task started. Not reproduced in other tests. Min time is 10.7 ns. |
| Add | all-carry | 64 | 4 | 34.287 | 1.14 | 0.01 | The 64-uint size seems to be unlucky. |
| Sub | no carry | 16 | 15 | 10.367 | 1.13 | 0.05 | Yes, this case is 10-13% slower. |
| Sub | no carry | 64 | 64 | 40.048 | 1.08 | 0 | The 64-uint size seems to be unlucky. |
| Sub | no carry | 64 | 63 | 39.431 | 1.07 | 0.01 | The 64-uint size seems to be unlucky. |

In Bench*Multi* no-carry tests there is the same situation with length <= 256 versus length > 256: the first case is roughly 2 times faster than the second.

You may notice that some results are faster even without CopyTo; I explain this below.

Full

This benchmark tests the add/sub operators on the BigInteger class. There are only two benchmarked methods:

    [Benchmark]
    [ArgumentsSource(nameof(GetSizes))]
    public BigInteger Add(Entry left, Entry right)
    {
        return right.Value + left.Value;
    }

    [Benchmark]
    [ArgumentsSource(nameof(GetSizes))]
    public BigInteger Sub(Entry left, Entry right)
    {
        return left.Value - right.Value;
    }

The main part is the GetSizes() method:

  • Entry is a struct containing a BigInteger value and its parameters, with an overloaded ToString(). This struct makes it convenient to build values like 2^n+m and produces readable benchmark output; see the sketch after this list.
  • left.Value is always greater than or equal to right.Value.
  • left.Value is 2^n+m, where n is in {31, 32, 64, 128, 256, 512, 2048, 4096, 8192, 32768, 131072, 524288, 2097152} (these values, starting from the second, correspond to the span lengths from the previous test multiplied by 32) and m can be -1 or 0. Here I use the fast path for creating 2^n.
  • right.Value goes over {1, 2^16, 2^31, 2^(n/2), 2^n-1, 2^n}, where n is the same exponent as in left.Value.
  • GetSizes() generates the described pairs.
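
A minimal sketch of what Entry and GetSizes() could look like under the description above (member names and the exact filtering are my assumptions, not the actual benchmark code):

using System.Collections.Generic;
using System.Numerics;

public struct Entry
{
    public BigInteger Value;
    public string Name; // e.g. "2^512-1", printed in the benchmark output

    public override string ToString() => Name;
}

public IEnumerable<object[]> GetSizes()
{
    int[] exponents = { 31, 32, 64, 128, 256, 512, 2048, 4096, 8192, 32768, 131072, 524288, 2097152 };
    foreach (int n in exponents)
    {
        for (int m = -1; m <= 0; m++)
        {
            // BigInteger.One << n builds 2^n quickly (a single set bit).
            BigInteger left = (BigInteger.One << n) + m;
            foreach (BigInteger right in new[]
            {
                BigInteger.One,
                BigInteger.One << 16,
                BigInteger.One << 31,
                BigInteger.One << (n / 2),
                (BigInteger.One << n) - 1,
                BigInteger.One << n,
            })
            {
                if (right <= left) // keep the invariant left.Value >= right.Value
                {
                    yield return new object[]
                    {
                        new Entry { Value = left, Name = m == -1 ? $"2^{n}-1" : $"2^{n}" },
                        new Entry { Value = right, Name = right.ToString() },
                    };
                }
            }
        }
    }
}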

All tests are executed in the same environments as the Isolated benchmark. Again, the first environment is roughly twice as fast as the other two, but the overall relative performance results are similar, so only the results for "Main" are provided.

The following command was used to execute the tests:

dotnet run -c Release -- -f "*" -h Toolchain --coreRun $baseline_corerun $new_corerun --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 --launchCount 3 --allStats -m --join

Used options:

  • -h Toolchain - hides the long path to corerun in the results
  • --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 - turned out to be sufficient for stable results within one launch
  • --launchCount 3 - results are slightly volatile from launch to launch; this is smoothed out by the --launchCount option
  • --allStats - helps to analyze volatile results
  • -m - memory stats; 0 bytes allocated, as expected
  • --join - joins all results into one table

Full results are attached. There are 548 test cases - 274 baseline and 274 new version - which is quite a lot, so I do not include them in the text.
Benchmark highlights:

  • The best and worst cases are the same as in "Isolated", but the relative difference between the baseline and the new version is smaller because the full version does extra work.
  • Of the 274 cases, 64 are faster (0.45-0.95 of baseline), 198 are the same (0.96-1.04 of baseline), and 12 are slower (1.05-1.1); but 5 of the slower ones are within a 1 ns difference and 1 is a statistical artifact (min and median are much less than the mean).
  • Again, the worst cases are "all-carry" subtractions when n == 64*32 == 2048, with the same ~10 ns difference. In the full tests it is about 10%, not 30% as in the isolated ones.

Stock

This benchmark is fast but not representative of this modification. Still, I ran it to check for regressions.

Command:

dotnet run -c Release -f net8.0 -- --filter "*BigInteger*" -h Toolchain --launchCount 7 --coreRun $baseline_corerun $new_corerun --join

All new results are within the StdDev of the baseline.

Unit-tests

All BigInteger tests pass. I see no reason to create new unit tests; if ideas for new ones come up, I'll implement them.

What exactly changed and why

All changes are in the file BigIntegerCalculator.AddSub.cs.

Add const int CopyToThreshold = 8.

Like other similar constants, it becomes a private static int in debug builds for testing purposes. The value 8 was chosen based on the test results. When the size of the left argument is less than or equal to CopyToThreshold, the new approach does not apply.
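
A sketch of the declaration pattern I mean; the DEBUG shape mirrors other thresholds in BigIntegerCalculator (the exact form in the PR may differ):

#if DEBUG
// Mutable in debug builds so tests can lower the threshold and exercise both paths.
private static int CopyToThreshold = 8;
#else
private const int CopyToThreshold = 8;
#endif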

Add static void CopyTail() method

private static void CopyTail(ReadOnlySpan<uint> source, Span<uint> dest, int start)
{
    // Move the unchanged tail in one block copy instead of per-element arithmetic.
    source.Slice(start).CopyTo(dest.Slice(start));
}

In my tests the method was always inlined. It copies all elements from source to dest, starting at index start.

Use nint instead of int for loop counter

I found that Unsafe.Add(ref <ptr>, i) sign-extends an int index, which adds an extra movsxd instruction (on x64) in the hottest loops. It accounts for up to 5% of the time in such loops (this is why many tests are slightly faster than the baseline). The same change was made for the upper bounds of the loops. IMO this change is safe.
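
A hedged sketch of the pattern (assumed shape, not the verbatim diff): Unsafe.Add has a native-integer overload, so a nint index needs no per-iteration sign extension, and the loop bound is widened once, up front.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static void PropagateCarry(ReadOnlySpan<uint> left, Span<uint> bits, long carry)
{
    // bits is assumed to be at least as long as left.
    ref uint leftPtr = ref MemoryMarshal.GetReference(left);
    ref uint resultPtr = ref MemoryMarshal.GetReference(bits);

    nint length = left.Length; // upper bound widened once, outside the loop
    for (nint i = 0; i < length; i++)
    {
        long digit = Unsafe.Add(ref leftPtr, i) + carry;
        Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
        carry = digit >> 32;
    }
}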

Break the second loop when carry == 0

I used this pattern:

carry >>= 32;
if (carry == 0) break;

The JIT manages to reuse the zero flag set by the shift on the first line for the conditional jump, so there is no extra cmp.
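
Putting the pieces together, a hedged reconstruction of the tail loop after the change (assumed shape based on the description above, not the verbatim PR code):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static void AddTail(ReadOnlySpan<uint> left, Span<uint> bits, nint i, long carry)
{
    // i is where the shared-length loop stopped; bits.Length == left.Length + 1.
    ref uint leftPtr = ref MemoryMarshal.GetReference(left);
    ref uint resultPtr = ref MemoryMarshal.GetReference(bits);

    for (; i < left.Length; i++)
    {
        long digit = Unsafe.Add(ref leftPtr, i) + carry;
        Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
        carry = digit >> 32;
        if (carry == 0)
        {
            // result[j] == left[j] for every remaining j: switch to a block copy.
            CopyTail(left, bits, (int)i + 1);
            i = left.Length;
            break;
        }
    }
    Unsafe.Add(ref resultPtr, i) = (uint)carry; // writes 0 after a break
}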

"Right" span converted to ref same way as "left" and "result" spans.

This avoids an extra range check per iteration; see the illustrative loop below.
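
For illustration, the shared-prefix loop with all three buffers accessed through refs might look like this (assumed shape; leftPtr, resultPtr, and carry are as in the earlier sketches):

ref uint rightPtr = ref MemoryMarshal.GetReference(right);

for (nint i = 0; i < right.Length; i++)
{
    long digit = (Unsafe.Add(ref leftPtr, i) + carry) + Unsafe.Add(ref rightPtr, i);
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}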

Conclusion

The modified version is faster when the sizes of the arguments differ, the larger argument is more than 8 uints, and the carry flag becomes 0 quickly. This is a typical case for loops with increments and decrements, or for evaluating polynomials.
In most other cases performance is the same; in some cases where the carry stays non-zero it is slightly slower (1-5%). The worst case, when the argument size is 64 uints and the carry stays non-zero, is about 10% slower.
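
As an illustration of a workload that benefits, consider a counter-style loop over a large BigInteger (my example, not taken from the benchmark):

using System.Numerics;

// The left operand is thousands of bits wide, the right operand is one uint,
// and the carry dies out after the first word - exactly the shape the new
// code path accelerates by copying the untouched tail instead of re-adding it.
BigInteger acc = BigInteger.One << 4096;
for (int k = 0; k < 1_000_000; k++)
{
    acc += 1;
}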

benchmark-results.zip

@speshuric (Contributor Author):

@tannergooding please take a look

@speshuric (Contributor Author):

Just discovered an old issue on the same theme: #41495 (mentioning it here to keep them linked)

@tannergooding (Member):

This is on my backlog to review more in depth, but I might not be able to get to it until next week.

At a high level overview, the numbers look acceptable.

@ghost locked this issue as resolved and limited conversation to collaborators on Aug 18, 2023.