
System.Numerics.BigInteger: Add/Subtract performance can be improved when size of arguments are different #83457

Closed

speshuric opened this issue Mar 15, 2023 · 15 comments

@speshuric
Contributor

speshuric commented Mar 15, 2023

Description

System.Numerics.BigInteger add and subtract operations for non-trivial cases are implemented in the Add/Subtract static methods of the internal class BigIntegerCalculator. The current implementation can be improved by specially handling the case where carry == 0 once the position being processed is beyond the length of the right (shorter) argument but still within the length of the left (longer) argument.

The behavior is reproducible in most environments. This is not a regression but a proposed new optimization.

The main idea can be demonstrated with this part of the Add method:

int i = 0;
long carry = 0L;
// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...

for ( ; i < right.Length; i++)
{
    long digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
for ( ; i < left.Length; i++)
{
    // "target loop"
    long digit = left[i] + carry;
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
Unsafe.Add(ref resultPtr, i) = (uint)carry;

In the second loop (marked // "target loop"), once carry becomes 0 it can never become 1 again: with carry == 0, digit = left[i] + 0 is at most uint.MaxValue, so digit >> 32 stays 0. From that point on, the tail of the loop merely copies the remaining values of the left argument into the result span.

Analysis

BigIntegerCalculator currently contains six static methods for add and subtract:

  • public static void Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits) - used when the right argument's length is 1
  • public static void Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) - default algorithm for addition
  • private static void AddSelf(Span<uint> left, ReadOnlySpan<uint> right) - checks the carry and breaks the loop
  • public static void Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits) - used when the right argument's length is 1
  • public static void Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) - default algorithm for subtraction
  • private static void SubtractSelf(Span<uint> left, ReadOnlySpan<uint> right) - checks the carry and breaks the loop

AddSelf and SubtractSelf are used internally by the SquMul part of BigIntegerCalculator; they cannot be optimized this way and are not considered below. Add and Subtract can be optimized almost identically, so only the Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) case is used in the description below.

The following statements hold for the "target loop":

  1. carry can only be 0 or 1.
  2. If the incoming carry is 1, the outgoing carry is 1 if and only if left[i] == uint.MaxValue; in that case digit == 2^32, so result[i] == 0.
  3. If the incoming carry is 0, then result[i] = left[i] for every subsequent i. This assignment can be optimized by removing the arithmetic entirely and using platform-optimized data-copying methods.

So it can be rewritten as follows:

int i = 0;
ulong carry = 0UL;

// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...

for ( ; i < right.Length; i++) // this loop was not modified
{
    ulong digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
for ( ; carry != 0 && i < left.Length; i++) // carry != 0 is checked
{
    ulong digit = Unsafe.Add(ref leftPtr, i) + carry;
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}
if (i < left.Length)
{
    // only move data from left argument to result
    do 
    {
        Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i);
        i++;
    } while (i < left.Length);

    // Note: if the remaining part of left is large, the dedicated CopyTo method is better:
    //left.Slice(i).CopyTo(bits.Slice(i));

    i = left.Length;
    carry = 0;
}
Unsafe.Add(ref resultPtr, i) = unchecked((uint)carry);

Methods of data movement

In this variant, the second loop checks when carry becomes 0, at which point the special case (a plain copy) is triggered. Two possible ways to copy the data can be considered:

  1. A loop with Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++;
  2. A copy with the platform-dependent Span.CopyTo(): left.Slice(i).CopyTo(bits.Slice(i));. This internally calls the highly optimized Buffer.Memmove(), but Span.CopyTo() does some additional checks and can therefore be slower on short slices.

On my PC (Ryzen 5700G CPU), in my draft benchmarks the second approach is faster when approximately left.Length - right.Length >= 16; a size-dependent dispatch between the two is sketched below.
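
A minimal sketch of such a size-dependent tail copy, written as a helper that could be extracted from the rewritten Add above; the CopyTail name and the threshold value are illustrative assumptions, not tuned constants:

// Illustrative sketch only: copies the untouched tail of `left` into `bits`,
// choosing the copy strategy by the remaining length. The threshold value is
// an assumption for illustration and would have to be tuned by benchmarks.
private const int CopyToThreshold = 16;

private static void CopyTail(ReadOnlySpan<uint> left, Span<uint> bits, int start)
{
    if (left.Length - start >= CopyToThreshold)
    {
        // Long tail: a single call into the platform-optimized memmove path.
        left.Slice(start).CopyTo(bits.Slice(start));
    }
    else
    {
        // Short tail: a plain element-wise loop avoids CopyTo's fixed overhead.
        for (int i = start; i < left.Length; i++)
        {
            bits[i] = left[i];
        }
    }
}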

Best and worst cases

The best case for the new version is argument data for which carry becomes 0 immediately.
The worst case is when the carry is always 1, i.e. all left[i] == uint.MaxValue.
The difference should be larger when left.Length - right.Length is large, and there should be no difference when the lengths are equal.

Draft benchmarks

I have written some draft microbenchmarks to measure the difference. These benchmarks test three versions of the Add method:

  • AddOld - the code in the current runtime
  • AddNew - with the Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++; loop
  • AddNewMemmove - with left.Slice(i).CopyTo(bits.Slice(i)); instead of the loop

Two cases are tested:

  • bad - when all left[i] and right[i] are uint.MaxValue
  • good - when all left[i] and right[i] are 1

Tests cover left.Length from 2 to 256 (doubling at each step, i.e. 2, 4, 8, 16, 32, 64, 128, 256) and right.Length from 2 to left.Length (also doubling at each step).

The draft benchmarks used the [InvocationCount(10000000)] attribute, so there can be some inaccuracy at small lengths. A sketch of the harness shape follows.
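
For reference, a minimal sketch of what such a harness could look like with BenchmarkDotNet; the data fill shown is the "bad" case, and the AddOld/AddNew bodies are stand-ins for the copied implementations, so everything beyond the description above is an assumption:

using System;
using BenchmarkDotNet.Attributes;

[InvocationCount(10000000)]
public class AddBench
{
    // Sketch note: the real runs only used rightSize <= leftSize pairs.
    [Params(2, 4, 8, 16, 32, 64, 128, 256)]
    public int leftSize;

    [Params(2, 4, 8, 16, 32, 64, 128, 256)]
    public int rightSize;

    private uint[] _left = Array.Empty<uint>();
    private uint[] _right = Array.Empty<uint>();
    private uint[] _bits = Array.Empty<uint>();

    [GlobalSetup]
    public void Setup()
    {
        _left = new uint[leftSize];
        _right = new uint[rightSize];
        _bits = new uint[leftSize + 1];
        Array.Fill(_left, uint.MaxValue);  // "bad" case: every digit carries;
        Array.Fill(_right, uint.MaxValue); // the "good" case would fill with 1u
    }

    [Benchmark(Baseline = true)]
    public void BenchAddOld() => AddOld(_left, _right, _bits);

    [Benchmark]
    public void BenchAddNew() => AddNew(_left, _right, _bits);

    // Stand-ins for copies of the current and modified Add implementations.
    private static void AddOld(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) { /* current code */ }
    private static void AddNew(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits) { /* modified code */ }
}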

Valuable draft benchmark points

  • In general the new versions are noticeably faster.
  • In the worst cases for the new versions, the slowest new version shows a maximum regression of about 0.5-1 ns per invocation. It should be noted, however, that a whole BigInteger add operation involves extra memory copies and much more logic than the internal Add method. This needs more analysis, but at first glance the regressions appear insignificant.
  • The maximum speedup of the AddNew version is observed (as expected) for good arrays with right.Length of 2 uints and left.Length of 256 uints: the old version's mean is 122.771 ns, AddNew's mean is 65.197 ns.
  • The AddNewMemmove version is even faster on the same test (good, 256, 2): its mean is 12.797 ns, almost 10x faster than the old version!
  • AddNewMemmove is faster than AddNew when left.Length - right.Length >= 16.

Important note: the Add method accounts for only part of the whole operation's time. End-to-end results will show a comparable absolute time difference, but the relative difference will be much smaller.

To do's

  • Initial description, implement one method and make draft benchmarks
  • Choose new implementation ("ref to ref" or "Slice.CopyTo(Slice)")
  • Implement all methods
  • Full end-to-end benchmarks
  • Pull request

Open questions

  • Which implementation is preferable: "ref to ref" or "Slice.CopyTo(Slice)" or size-dependent?

Data

Draft benchmark results are below. I will publish the benchmark code soon.

Draft benchmarking data

Method Case leftSize rightSize Mean Error StdDev Median
BenchAddOld bad 2 2 2.585 ns 0.0269 ns 0.0225 ns 2.577 ns
BenchAddNew bad 2 2 2.903 ns 0.0597 ns 0.0558 ns 2.895 ns
BenchAddNewMemmove bad 2 2 3.204 ns 0.0942 ns 0.1548 ns 3.175 ns
BenchAddOld bad 4 2 3.520 ns 0.0629 ns 0.0588 ns 3.485 ns
BenchAddNew bad 4 2 3.603 ns 0.0857 ns 0.0760 ns 3.613 ns
BenchAddNewMemmove bad 4 2 3.996 ns 0.0945 ns 0.0789 ns 3.954 ns
BenchAddOld bad 4 4 3.331 ns 0.0771 ns 0.0721 ns 3.312 ns
BenchAddNew bad 4 4 3.560 ns 0.0967 ns 0.0950 ns 3.512 ns
BenchAddNewMemmove bad 4 4 4.239 ns 0.0714 ns 0.0668 ns 4.199 ns
BenchAddOld bad 8 2 5.434 ns 0.1023 ns 0.0957 ns 5.422 ns
BenchAddNew bad 8 2 5.008 ns 0.1292 ns 0.1327 ns 4.914 ns
BenchAddNewMemmove bad 8 2 5.871 ns 0.0992 ns 0.0879 ns 5.888 ns
BenchAddOld bad 8 4 5.394 ns 0.0892 ns 0.0834 ns 5.395 ns
BenchAddNew bad 8 4 5.240 ns 0.1256 ns 0.1113 ns 5.224 ns
BenchAddNewMemmove bad 8 4 6.094 ns 0.1163 ns 0.1087 ns 6.069 ns
BenchAddOld bad 8 8 5.094 ns 0.0092 ns 0.0072 ns 5.093 ns
BenchAddNew bad 8 8 5.579 ns 0.0370 ns 0.0346 ns 5.585 ns
BenchAddNewMemmove bad 8 8 5.996 ns 0.0217 ns 0.0181 ns 5.995 ns
BenchAddOld bad 16 2 8.732 ns 0.0075 ns 0.0062 ns 8.732 ns
BenchAddNew bad 16 2 8.359 ns 0.0644 ns 0.0571 ns 8.338 ns
BenchAddNewMemmove bad 16 2 8.802 ns 0.0741 ns 0.0619 ns 8.782 ns
BenchAddOld bad 16 4 8.774 ns 0.0564 ns 0.0527 ns 8.751 ns
BenchAddNew bad 16 4 8.539 ns 0.0758 ns 0.0672 ns 8.535 ns
BenchAddNewMemmove bad 16 4 9.685 ns 0.0628 ns 0.0587 ns 9.651 ns
BenchAddOld bad 16 8 8.776 ns 0.0288 ns 0.0256 ns 8.768 ns
BenchAddNew bad 16 8 8.455 ns 0.0524 ns 0.0437 ns 8.452 ns
BenchAddNewMemmove bad 16 8 9.643 ns 0.0213 ns 0.0166 ns 9.640 ns
BenchAddOld bad 16 16 9.011 ns 0.0209 ns 0.0163 ns 9.015 ns
BenchAddNew bad 16 16 9.230 ns 0.0147 ns 0.0115 ns 9.232 ns
BenchAddNewMemmove bad 16 16 9.495 ns 0.0252 ns 0.0210 ns 9.488 ns
BenchAddOld bad 32 2 15.952 ns 0.0244 ns 0.0228 ns 15.953 ns
BenchAddNew bad 32 2 16.422 ns 0.1061 ns 0.0941 ns 16.413 ns
BenchAddNewMemmove bad 32 2 16.917 ns 0.0140 ns 0.0124 ns 16.918 ns
BenchAddOld bad 32 4 16.022 ns 0.0115 ns 0.0090 ns 16.023 ns
BenchAddNew bad 32 4 16.499 ns 0.0428 ns 0.0401 ns 16.490 ns
BenchAddNewMemmove bad 32 4 16.875 ns 0.0601 ns 0.0533 ns 16.882 ns
BenchAddOld bad 32 8 16.153 ns 0.0617 ns 0.0547 ns 16.148 ns
BenchAddNew bad 32 8 16.063 ns 0.1146 ns 0.1016 ns 16.035 ns
BenchAddNewMemmove bad 32 8 16.960 ns 0.0484 ns 0.0404 ns 16.947 ns
BenchAddOld bad 32 16 16.665 ns 0.0260 ns 0.0243 ns 16.659 ns
BenchAddNew bad 32 16 16.928 ns 0.0542 ns 0.0452 ns 16.929 ns
BenchAddNewMemmove bad 32 16 17.373 ns 0.0623 ns 0.0583 ns 17.376 ns
BenchAddOld bad 32 32 17.253 ns 0.0250 ns 0.0221 ns 17.260 ns
BenchAddNew bad 32 32 17.515 ns 0.0730 ns 0.0570 ns 17.513 ns
BenchAddNewMemmove bad 32 32 18.008 ns 0.0164 ns 0.0153 ns 18.007 ns
BenchAddOld bad 64 2 30.824 ns 0.0548 ns 0.0458 ns 30.817 ns
BenchAddNew bad 64 2 30.854 ns 0.0239 ns 0.0187 ns 30.858 ns
BenchAddNewMemmove bad 64 2 31.479 ns 0.0456 ns 0.0404 ns 31.481 ns
BenchAddOld bad 64 4 30.796 ns 0.0182 ns 0.0161 ns 30.793 ns
BenchAddNew bad 64 4 29.935 ns 0.0929 ns 0.0869 ns 29.923 ns
BenchAddNewMemmove bad 64 4 30.630 ns 0.1631 ns 0.1526 ns 30.653 ns
BenchAddOld bad 64 8 30.775 ns 0.0837 ns 0.0783 ns 30.784 ns
BenchAddNew bad 64 8 31.110 ns 0.0434 ns 0.0385 ns 31.101 ns
BenchAddNewMemmove bad 64 8 31.391 ns 0.0681 ns 0.0603 ns 31.402 ns
BenchAddOld bad 64 16 31.845 ns 0.0911 ns 0.0852 ns 31.856 ns
BenchAddNew bad 64 16 32.334 ns 0.0686 ns 0.0641 ns 32.316 ns
BenchAddNewMemmove bad 64 16 32.012 ns 0.0781 ns 0.0730 ns 32.003 ns
BenchAddOld bad 64 32 34.869 ns 0.0747 ns 0.0698 ns 34.873 ns
BenchAddNew bad 64 32 35.400 ns 0.1425 ns 0.1333 ns 35.426 ns
BenchAddNewMemmove bad 64 32 34.899 ns 0.0755 ns 0.0707 ns 34.914 ns
BenchAddOld bad 64 64 37.207 ns 0.1676 ns 0.1568 ns 37.125 ns
BenchAddNew bad 64 64 37.827 ns 0.0352 ns 0.0312 ns 37.828 ns
BenchAddNewMemmove bad 64 64 38.326 ns 0.0473 ns 0.0442 ns 38.324 ns
BenchAddOld bad 128 2 64.390 ns 0.0895 ns 0.0747 ns 64.372 ns
BenchAddNew bad 128 2 64.430 ns 0.1261 ns 0.1117 ns 64.400 ns
BenchAddNewMemmove bad 128 2 61.757 ns 0.0785 ns 0.0696 ns 61.758 ns
BenchAddOld bad 128 4 64.489 ns 0.0966 ns 0.0856 ns 64.508 ns
BenchAddNew bad 128 4 61.862 ns 0.1050 ns 0.0982 ns 61.818 ns
BenchAddNewMemmove bad 128 4 64.728 ns 0.0337 ns 0.0299 ns 64.733 ns
BenchAddOld bad 128 8 59.975 ns 0.0802 ns 0.0711 ns 59.958 ns
BenchAddNew bad 128 8 60.545 ns 0.0964 ns 0.0902 ns 60.567 ns
BenchAddNewMemmove bad 128 8 60.686 ns 0.0538 ns 0.0477 ns 60.691 ns
BenchAddOld bad 128 16 60.303 ns 0.1191 ns 0.1056 ns 60.333 ns
BenchAddNew bad 128 16 61.590 ns 0.7334 ns 0.6860 ns 61.163 ns
BenchAddNewMemmove bad 128 16 61.623 ns 0.0646 ns 0.0573 ns 61.618 ns
BenchAddOld bad 128 32 63.400 ns 0.2310 ns 0.2048 ns 63.358 ns
BenchAddNew bad 128 32 64.186 ns 0.2440 ns 0.2282 ns 64.057 ns
BenchAddNewMemmove bad 128 32 64.191 ns 0.0389 ns 0.0304 ns 64.195 ns
BenchAddOld bad 128 64 69.698 ns 0.1385 ns 0.1228 ns 69.638 ns
BenchAddNew bad 128 64 70.742 ns 0.2145 ns 0.2006 ns 70.772 ns
BenchAddNewMemmove bad 128 64 70.199 ns 0.2155 ns 0.1800 ns 70.201 ns
BenchAddOld bad 128 128 81.541 ns 0.0562 ns 0.0526 ns 81.522 ns
BenchAddNew bad 128 128 81.757 ns 0.0175 ns 0.0163 ns 81.761 ns
BenchAddNewMemmove bad 128 128 82.336 ns 0.0408 ns 0.0362 ns 82.331 ns
BenchAddOld bad 256 2 122.853 ns 0.2910 ns 0.2579 ns 122.735 ns
BenchAddNew bad 256 2 122.601 ns 0.0401 ns 0.0335 ns 122.603 ns
BenchAddNewMemmove bad 256 2 118.404 ns 0.4241 ns 0.3967 ns 118.360 ns
BenchAddOld bad 256 4 122.572 ns 0.0952 ns 0.0844 ns 122.575 ns
BenchAddNew bad 256 4 123.188 ns 0.0382 ns 0.0358 ns 123.190 ns
BenchAddNewMemmove bad 256 4 118.899 ns 0.3367 ns 0.3150 ns 119.049 ns
BenchAddOld bad 256 8 122.766 ns 0.0753 ns 0.0629 ns 122.782 ns
BenchAddNew bad 256 8 118.796 ns 0.4198 ns 0.3927 ns 118.989 ns
BenchAddNewMemmove bad 256 8 121.238 ns 0.2532 ns 0.2369 ns 121.260 ns
BenchAddOld bad 256 16 123.677 ns 0.2106 ns 0.1970 ns 123.688 ns
BenchAddNew bad 256 16 125.204 ns 0.0515 ns 0.0482 ns 125.195 ns
BenchAddNewMemmove bad 256 16 125.459 ns 0.0575 ns 0.0537 ns 125.466 ns
BenchAddOld bad 256 32 123.549 ns 0.2009 ns 0.1879 ns 123.557 ns
BenchAddNew bad 256 32 127.413 ns 0.2725 ns 0.2549 ns 127.377 ns
BenchAddNewMemmove bad 256 32 128.176 ns 0.1296 ns 0.1149 ns 128.191 ns
BenchAddOld bad 256 64 129.709 ns 0.1311 ns 0.1226 ns 129.692 ns
BenchAddNew bad 256 64 132.365 ns 0.0854 ns 0.0757 ns 132.396 ns
BenchAddNewMemmove bad 256 64 133.685 ns 0.2328 ns 0.2177 ns 133.636 ns
BenchAddOld bad 256 128 142.283 ns 0.0672 ns 0.0595 ns 142.284 ns
BenchAddNew bad 256 128 145.236 ns 0.2122 ns 0.1985 ns 145.283 ns
BenchAddNewMemmove bad 256 128 145.719 ns 0.2226 ns 0.2082 ns 145.670 ns
BenchAddOld bad 256 256 165.340 ns 0.3852 ns 0.3603 ns 165.212 ns
BenchAddNew bad 256 256 165.619 ns 0.0412 ns 0.0365 ns 165.631 ns
BenchAddNewMemmove bad 256 256 165.315 ns 0.0680 ns 0.0568 ns 165.320 ns
BenchAddOld good 2 2 2.463 ns 0.0765 ns 0.0880 ns 2.472 ns
BenchAddNew good 2 2 2.693 ns 0.0574 ns 0.0509 ns 2.685 ns
BenchAddNewMemmove good 2 2 3.429 ns 0.0930 ns 0.0870 ns 3.445 ns
BenchAddOld good 4 2 3.343 ns 0.0698 ns 0.0618 ns 3.342 ns
BenchAddNew good 4 2 3.247 ns 0.0871 ns 0.1593 ns 3.213 ns
BenchAddNewMemmove good 4 2 5.360 ns 0.0828 ns 0.0734 ns 5.321 ns
BenchAddOld good 4 4 3.344 ns 0.0888 ns 0.0830 ns 3.366 ns
BenchAddNew good 4 4 3.596 ns 0.0934 ns 0.0873 ns 3.618 ns
BenchAddNewMemmove good 4 4 4.481 ns 0.0619 ns 0.0579 ns 4.493 ns
BenchAddOld good 8 2 5.399 ns 0.0864 ns 0.0809 ns 5.401 ns
BenchAddNew good 8 2 4.347 ns 0.0988 ns 0.0876 ns 4.331 ns
BenchAddNewMemmove good 8 2 5.410 ns 0.0615 ns 0.0545 ns 5.429 ns
BenchAddOld good 8 4 5.218 ns 0.1080 ns 0.1010 ns 5.201 ns
BenchAddNew good 8 4 4.667 ns 0.0568 ns 0.0503 ns 4.640 ns
BenchAddNewMemmove good 8 4 6.658 ns 0.0953 ns 0.0845 ns 6.659 ns
BenchAddOld good 8 8 5.094 ns 0.0144 ns 0.0112 ns 5.089 ns
BenchAddNew good 8 8 5.715 ns 0.0724 ns 0.0677 ns 5.700 ns
BenchAddNewMemmove good 8 8 6.075 ns 0.0402 ns 0.0336 ns 6.067 ns
BenchAddOld good 16 2 8.978 ns 0.0386 ns 0.0361 ns 8.976 ns
BenchAddNew good 16 2 5.883 ns 0.0141 ns 0.0125 ns 5.883 ns
BenchAddNewMemmove good 16 2 5.717 ns 0.1207 ns 0.1129 ns 5.753 ns
BenchAddOld good 16 4 8.796 ns 0.0898 ns 0.0796 ns 8.763 ns
BenchAddNew good 16 4 6.440 ns 0.0853 ns 0.0798 ns 6.394 ns
BenchAddNewMemmove good 16 4 6.898 ns 0.1508 ns 0.1410 ns 6.891 ns
BenchAddOld good 16 8 8.804 ns 0.0696 ns 0.0543 ns 8.797 ns
BenchAddNew good 16 8 7.118 ns 0.0780 ns 0.0729 ns 7.095 ns
BenchAddNewMemmove good 16 8 8.060 ns 0.0123 ns 0.0103 ns 8.057 ns
BenchAddOld good 16 16 8.765 ns 0.0132 ns 0.0117 ns 8.766 ns
BenchAddNew good 16 16 9.316 ns 0.0177 ns 0.0148 ns 9.311 ns
BenchAddNewMemmove good 16 16 9.472 ns 0.0136 ns 0.0120 ns 9.476 ns
BenchAddOld good 32 2 15.976 ns 0.0149 ns 0.0132 ns 15.976 ns
BenchAddNew good 32 2 10.003 ns 0.0572 ns 0.0477 ns 9.994 ns
BenchAddNewMemmove good 32 2 6.473 ns 0.1306 ns 0.1222 ns 6.482 ns
BenchAddOld good 32 4 16.029 ns 0.0226 ns 0.0211 ns 16.021 ns
BenchAddNew good 32 4 10.277 ns 0.0319 ns 0.0283 ns 10.268 ns
BenchAddNewMemmove good 32 4 7.188 ns 0.0847 ns 0.0792 ns 7.169 ns
BenchAddOld good 32 8 16.301 ns 0.3195 ns 0.3803 ns 16.152 ns
BenchAddNew good 32 8 12.358 ns 0.2750 ns 0.4670 ns 12.136 ns
BenchAddNewMemmove good 32 8 8.976 ns 0.0233 ns 0.0182 ns 8.983 ns
BenchAddOld good 32 16 16.970 ns 0.0710 ns 0.0630 ns 16.966 ns
BenchAddNew good 32 16 13.146 ns 0.1049 ns 0.0819 ns 13.162 ns
BenchAddNewMemmove good 32 16 12.155 ns 0.0423 ns 0.0353 ns 12.152 ns
BenchAddOld good 32 32 17.361 ns 0.0498 ns 0.0442 ns 17.359 ns
BenchAddNew good 32 32 17.444 ns 0.0156 ns 0.0145 ns 17.443 ns
BenchAddNewMemmove good 32 32 17.859 ns 0.0275 ns 0.0229 ns 17.859 ns
BenchAddOld good 64 2 30.810 ns 0.0510 ns 0.0452 ns 30.802 ns
BenchAddNew good 64 2 17.049 ns 0.0184 ns 0.0164 ns 17.042 ns
BenchAddNewMemmove good 64 2 7.111 ns 0.1680 ns 0.1571 ns 7.145 ns
BenchAddOld good 64 4 30.790 ns 0.0456 ns 0.0404 ns 30.803 ns
BenchAddNew good 64 4 17.728 ns 0.0312 ns 0.0276 ns 17.732 ns
BenchAddNewMemmove good 64 4 8.134 ns 0.0892 ns 0.0834 ns 8.144 ns
BenchAddOld good 64 8 30.834 ns 0.0719 ns 0.0673 ns 30.846 ns
BenchAddNew good 64 8 19.482 ns 0.0262 ns 0.0205 ns 19.482 ns
BenchAddNewMemmove good 64 8 9.674 ns 0.0432 ns 0.0361 ns 9.662 ns
BenchAddOld good 64 16 31.804 ns 0.0847 ns 0.0792 ns 31.775 ns
BenchAddNew good 64 16 20.717 ns 0.0970 ns 0.0907 ns 20.698 ns
BenchAddNewMemmove good 64 16 12.849 ns 0.0284 ns 0.0222 ns 12.846 ns
BenchAddOld good 64 32 34.382 ns 0.1602 ns 0.1498 ns 34.429 ns
BenchAddNew good 64 32 27.406 ns 0.0531 ns 0.0443 ns 27.407 ns
BenchAddNewMemmove good 64 32 20.663 ns 0.0403 ns 0.0377 ns 20.662 ns
BenchAddOld good 64 64 37.511 ns 0.0232 ns 0.0217 ns 37.512 ns
BenchAddNew good 64 64 37.314 ns 0.0940 ns 0.0879 ns 37.339 ns
BenchAddNewMemmove good 64 64 38.143 ns 0.0108 ns 0.0090 ns 38.143 ns
BenchAddOld good 128 2 64.610 ns 0.0514 ns 0.0480 ns 64.607 ns
BenchAddNew good 128 2 35.626 ns 0.0914 ns 0.0810 ns 35.602 ns
BenchAddNewMemmove good 128 2 8.729 ns 0.1174 ns 0.1041 ns 8.726 ns
BenchAddOld good 128 4 64.728 ns 0.1634 ns 0.1448 ns 64.763 ns
BenchAddNew good 128 4 36.102 ns 0.0315 ns 0.0263 ns 36.106 ns
BenchAddNewMemmove good 128 4 9.951 ns 0.0793 ns 0.0662 ns 9.953 ns
BenchAddOld good 128 8 59.980 ns 0.0317 ns 0.0297 ns 59.987 ns
BenchAddNew good 128 8 33.452 ns 0.1252 ns 0.1171 ns 33.407 ns
BenchAddNewMemmove good 128 8 10.903 ns 0.1075 ns 0.0953 ns 10.847 ns
BenchAddOld good 128 16 60.148 ns 0.0566 ns 0.0473 ns 60.139 ns
BenchAddNew good 128 16 36.152 ns 0.0371 ns 0.0329 ns 36.146 ns
BenchAddNewMemmove good 128 16 14.780 ns 0.0253 ns 0.0197 ns 14.776 ns
BenchAddOld good 128 32 62.632 ns 0.1169 ns 0.1037 ns 62.623 ns
BenchAddNew good 128 32 41.202 ns 0.1629 ns 0.1444 ns 41.133 ns
BenchAddNewMemmove good 128 32 22.425 ns 0.0440 ns 0.0368 ns 22.414 ns
BenchAddOld good 128 64 69.786 ns 0.2435 ns 0.2278 ns 69.657 ns
BenchAddNew good 128 64 53.662 ns 0.0841 ns 0.0787 ns 53.662 ns
BenchAddNewMemmove good 128 64 41.252 ns 0.0296 ns 0.0262 ns 41.247 ns
BenchAddOld good 128 128 81.869 ns 0.0553 ns 0.0462 ns 81.866 ns
BenchAddNew good 128 128 82.525 ns 0.0267 ns 0.0250 ns 82.525 ns
BenchAddNewMemmove good 128 128 82.764 ns 0.0255 ns 0.0226 ns 82.761 ns
BenchAddOld good 256 2 122.771 ns 0.1330 ns 0.1179 ns 122.739 ns
BenchAddNew good 256 2 65.197 ns 0.1967 ns 0.1840 ns 65.169 ns
BenchAddNewMemmove good 256 2 12.797 ns 0.1026 ns 0.0960 ns 12.776 ns
BenchAddOld good 256 4 122.580 ns 0.1142 ns 0.0891 ns 122.547 ns
BenchAddNew good 256 4 67.334 ns 1.3588 ns 1.9487 ns 66.053 ns
BenchAddNewMemmove good 256 4 12.118 ns 0.1328 ns 0.1243 ns 12.057 ns
BenchAddOld good 256 8 122.775 ns 0.0513 ns 0.0455 ns 122.769 ns
BenchAddNew good 256 8 67.161 ns 0.1645 ns 0.1538 ns 67.102 ns
BenchAddNewMemmove good 256 8 15.188 ns 0.1121 ns 0.1049 ns 15.137 ns
BenchAddOld good 256 16 123.086 ns 0.2531 ns 0.2244 ns 123.022 ns
BenchAddNew good 256 16 69.501 ns 0.0621 ns 0.0551 ns 69.495 ns
BenchAddNewMemmove good 256 16 18.547 ns 0.0834 ns 0.0780 ns 18.528 ns
BenchAddOld good 256 32 124.622 ns 0.1848 ns 0.1729 ns 124.577 ns
BenchAddNew good 256 32 74.259 ns 0.0878 ns 0.0778 ns 74.234 ns
BenchAddNewMemmove good 256 32 26.275 ns 0.1607 ns 0.1425 ns 26.231 ns
BenchAddOld good 256 64 129.737 ns 0.1050 ns 0.0931 ns 129.750 ns
BenchAddNew good 256 64 86.767 ns 0.1327 ns 0.1242 ns 86.743 ns
BenchAddNewMemmove good 256 64 44.936 ns 0.1495 ns 0.1249 ns 44.929 ns
BenchAddOld good 256 128 144.031 ns 0.1669 ns 0.1561 ns 144.024 ns
BenchAddNew good 256 128 115.261 ns 0.0782 ns 0.0693 ns 115.239 ns
BenchAddNewMemmove good 256 128 89.670 ns 0.2114 ns 0.1977 ns 89.623 ns
BenchAddOld good 256 256 165.619 ns 0.1193 ns 0.1057 ns 165.604 ns
BenchAddNew good 256 256 165.271 ns 0.2068 ns 0.1934 ns 165.296 ns
BenchAddNewMemmove good 256 256 165.865 ns 0.4035 ns 0.3774 ns 165.766 ns

@speshuric added the tenet-performance label Mar 15, 2023
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label Mar 15, 2023
@ghost

ghost commented Mar 15, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.


@speshuric
Contributor Author

Just a status update.

  • It seems that "Slice.CopyTo(Slice)" is preferable. Separate benchmarks (of these methods only, not the whole BigInteger) show that the overhead of "Slice.CopyTo(Slice)" is probably negligible.
  • Implemented both Add methods.

@speshuric
Contributor Author

Status update

  • Add and Subtract methods are implemented.

@tannergooding removed the untriaged label Mar 23, 2023
@tannergooding
Member

tannergooding commented Mar 23, 2023

We have to be careful with this type of optimization: while it will speed up the case where the carry terminates quickly, it will also slow down the case where it does not, due to the additional condition and branch per iteration.

Perf numbers showing the improvement for many different scenarios, including "worst case" scenarios, will be important.

@speshuric
Contributor Author

I've created a PR. Below I'll describe my benchmarks and their results.

@speshuric
Contributor Author

speshuric commented Mar 27, 2023

I apologize in advance to the participants for posting the results step by step rather than all at once. There are a lot of results, and it takes time to recheck them (and my personal time for this issue is very limited).

I should note that the dotnet/performance tests are unsuitable for this issue: almost all of them are "N-bit, N-bit" tests, so their usefulness here is limited. I've executed them to check for general regressions. It seems that all results are the same, unrelated, or negligibly different, but I'll post them a bit later.

For this issue I made a separate benchmark. It is quite a simple benchmark, but some remarks are in order.

This benchmark checks add/subtract operations over a wide range of arguments. All arguments are non-negative. The first argument is always greater than or equal to the second, and all such possible pairs are checked, so the test set is large and sometimes excessive.

The following numbers are in the test set:

  • 0
  • 1
  • $2^n+m$ where n is taken from an int[] sizes array and m can be -1, 0 or 1. The default set of n's is: 16, 31, 32, 128, 256, 1024, 4096, 16384, 65536

The tests were executed on a local build of the runtime's main branch and on my locally edited branch with the new add/sub versions. For the moment I've tested only the x64 CoreCLR build on Linux. My AMD Ryzen 7 5700G took about 11 hours to complete these tests. The execution time could be optimized, but it was easier for me to run the tests and check later than to tune them. Raw results will be attached soon.

Main points:

  • CopyTo wins when the length difference is more than 256 bits (8 uints in the internal array).
  • As expected, the worst cases are the decrement of $2^n$ and the increment of $2^n-1$. In these cases almost all of the time is spent in the loop with the extra carry check.
  • As expected, the best cases are the increment of $2^n$, the decrement of $2^n-1$, and others where the carry becomes zero right after the length of the smaller argument has been processed.

Max effect of optimization:

| n | time/baseline |
|-------|-----------|
| 256 | 0.96-0.98 |
| 1024 | ~0.75 |
| 4096 | ~0.65 |
| 16384 | ~0.53 |
| 65536 | ~0.51 |

(This is a very rough estimate. Details will be shown later.)

Worst cases and overall results will be discussed in the next comment.

(to be continued...)

@speshuric
Contributor Author

speshuric commented Mar 27, 2023

I've tuned the benchmark's running time; it is now about 2 hours.
The left (larger) arguments have been extended; they are now:

  • 0
  • 1
  • 2^16
  • 2^n+m where n in {31, 32, 64, 128, 256, 512, 1024, 4096, 16384, 65536, 262144}, m in -1..1

The right (shorter) arguments are (each must be smaller than or equal to the left):

  • 0
  • 1
  • 2^16
  • 2^(n/2) where n is the size of the left argument
  • 2^n + m where n is the size of the left argument, m in -1..1

One of the benchmark results is here (warning: large results).

@speshuric
Contributor Author

speshuric commented Mar 28, 2023

Benchmarks results

I've analyzed the output of the benchmarks. Here are the facts, thoughts, and conclusions.

  1. Benchmark times are not stable from launch to launch. The typical spread is about 0.5-1.5 ns, and it does not seem to depend on the code. This is not a problem for large argument values, but it is significant for small ones: the typical time to add/sub a 128-bit BigInteger is 20 ns, a 64-bit one 15-17 ns, and a trivial one (up to 31 bits) 5-7 ns. To mitigate this instability I've launched the tests with different --launchCount option values.
  2. As expected, the time of trivial add/sub is not affected by this modification. It will be excluded from the next benchmarks.
  3. As expected, the time of add/sub on values of the same length is not affected by this modification. But I have to note the cases where the result is shorter than the arguments (N-N, N-(N-1) and so on): the times in the baseline and modified runtimes are similar, and such operations take about 1.5 times as long as operations where the length is not reduced.
  4. As expected, operations where the carry flag is zeroed quickly are much faster than in the previous version. The effect is substantial when the length difference is 256 bits or more. I'll publish the table and description below in a day or two.
  5. The add operation with a high number of carries and a non-trivial right argument ((2^N-1) + 2^32 and similar) is not regressed. This is rather unexpected, but may be caused by the collateral loop modifications.
  6. The add operation with a high number of carries and a trivial right argument ((2^N-1) + 1 and similar) is regressed. The average regression is 3-5%. I think this is acceptable, but I leave the decision to the team.
  7. The subtract operation with a high number of carries and a non-trivial right argument (2^N - 2^32 and similar) is not regressed. Almost the same situation as with add.
  8. The subtract operation with a high number of carries and a trivial right argument ((2^N) - 1 and similar) is regressed. Disappointingly, the regression is 5-8% on average and grows to 9-10% as the size of the left argument grows. This needs to be investigated.
  9. The benchmark of add/sub operations with a left argument size of 64-256 needs to be redone with a higher --launchCount option value.

Latest benchmark results: Benchmarks.BigintBenchmark-report.zip

@speshuric
Contributor Author

speshuric commented Mar 28, 2023

Benchmarks of add/sub 64-256 bit integer

I've restarted these benchmarks. There is still an unstable, high spread in some cases (it changes from launch to launch), but the overall picture is clear to me. The worst cases are 3-5% slower than in the reference build; in absolute terms that is about a 1-2 ns penalty.

Benchmarks.BigintBenchmark-report64-256.zip

What else I plan to do:

  • Investigate the 2^n - 1 subtraction case
  • Make a summary of the performance changes.

@tannergooding (or anybody else): What else could I do to help you decide whether to merge or decline these changes? What can you advise checking? Should I check other execution modes (32-bit, AOT, wasm, anything else)? How can I check the performance of an ARM build?

@speshuric
Contributor Author

Now, while another batch of benchmarks is running, I've taken a look at the N-(N-1) case. It seems it is caused by this:

// Try to conserve space as much as possible by checking for wasted leading span entries
// sometimes the span has leading zeros from bit manipulation operations & and ^
for (len = value.Length; len > 0 && value[len - 1] == 0; len--);

BTW, it can be replaced by LastIndexOfAnyExcept(), which is about 4 times faster for this case, but that needs isolated benchmarks, so I'll do it later. A sketch of the replacement is below.
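
A minimal sketch of that replacement, assuming value is a span of uint as in the loop above:

// Equivalent to the scan loop above: LastIndexOfAnyExcept returns the index
// of the last element that is not 0, or -1 if all elements are 0, so adding
// 1 yields the trimmed length (0 for an all-zero span).
int len = value.LastIndexOfAnyExcept(0u) + 1;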

speshuric added a commit to speshuric/runtime that referenced this issue Apr 11, 2023
In the BigIntegerCalculator methods Add and Subtract, if the sizes of the
arguments differ, then after processing the part covered by the right
(smaller) argument there was a loop that propagated the carry value. Once
the carry becomes zero, the rest of the larger argument can simply be
copied to the result.

With this commit the second loop is interrupted when the carry becomes
zero, and the fast Span.CopyTo(Span) is applied to the remaining part.

This optimization is applied only when the size of the larger argument is
greater than or equal to the const CopyToThreshold introduced in this
commit. This const is currently 8.

Also made minor related changes to the hot loops.

See dotnet#83457 for details.
@speshuric
Contributor Author

Summary report

Plan

  • Benchmark's design and results
  • What exactly changed and why
  • Conclusion

Benchmark's design and results

I performed 3 benchmark sets:

  1. Isolated: old (baseline) and new versions of the BigIntegerCalculator.Add and BigIntegerCalculator.Subtract methods. This helps debug and tune the new versions, but it is rather synthetic and does not show the full picture. The baseline and modified versions run in the same runtime/environment.
  2. Full: the BigInteger "+" and "-" operators. The baseline and modified runtimes are selected via the corerun option.
  3. Stock: the standard dotnet/performance benchmarks.

Isolated

This benchmark tests the add/sub methods on uint spans. The benchmarked methods are:

  • BenchSubOneEmpty(int leftSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchSubOneBaseline(int leftSize) - baseline variant of Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchSubOneNew(int leftSize) - new version of Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchSubMultiEmpty(int leftSize, int rightSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchSubMultiBaseline(int leftSize, int rightSize) - baseline variant of Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchSubMultiNew(int leftSize, int rightSize) - new version of Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchAddOneEmpty(int leftSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchAddOneBaseline(int leftSize) - baseline variant of Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchAddOneNew(int leftSize) - new version of Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
  • BenchAddMultiEmpty(int leftSize, int rightSize) - simply creates the span(s) and runs dummy noinline methods. This helps measure the fixed overhead shared by the other benchmarks of the same type.
  • BenchAddMultiBaseline(int leftSize, int rightSize) - baseline variant of Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
  • BenchAddMultiNew(int leftSize, int rightSize) - new version of Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)

Every benchmark method is tested in a "good" and a "bad" case, where "good" means no-carry tests and "bad" means all-carry tests. The good and bad cases for add and subtract are different.

Bench*One* tests use the following set of span lengths: {1, 2, 4, 8, 16, 64, 128, 256, 1024, 4096, 16384, 65536}. The right argument is chosen appropriately for the case ("good" or "bad").
Bench*Multi* tests use the same set of left span lengths: {1, 2, 4, 8, 16, 64, 128, 256, 1024, 4096, 16384, 65536}. The right span length can be 1, 2, 4, n/2, n-1 or n, where n is the length of the left span.
*Sub* tests set the last span element to uint.MaxValue. A sketch of the Bench*Empty* pattern follows below.
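
As a sketch, a Bench*Empty* method could look like this; the field names and sizes are assumptions, only the pattern (a dummy noinline call) comes from the description above:

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class EmptyOverheadSketch
{
    private readonly uint[] _left = new uint[256];
    private readonly uint[] _bits = new uint[256];

    // Measures span creation plus a no-op call, so this time can be treated
    // as the fixed harness overhead shared by the real benchmarks.
    [Benchmark]
    public void BenchSubOneEmpty() => Dummy(_left, 1u, _bits);

    // NoInlining prevents the JIT from inlining and removing the call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Dummy(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
    {
    }
}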

All tests are executed on following environments:

  • "Main": AMD Ryzen 7 5700GE CPU, 128 GB RAM, OS: Linux, the latest CLR built locally, X64 RyuJIT runtime.
  • "Old": Intel Xeon W3690 CPU, 48 GB RAM, OS Windows 11, the latest CLR built locally, X64 RyuJIT runtime.
  • "Old VM": Intel Xeon W3690 CPU, 48 GB RAM, OS Linux in Virtualbox, the latest CLR built locally, X64 RyuJIT runtime.

The first environment is roughly twice as fast as the other two, but the overall relative performance results are similar, so only the results for "Main" are provided.

This command was used to execute the tests:

dotnet run -c Release -- -f "*" --coreRun $corerun_path --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 --launchCount 7 --allStats -m --join

Used options:

  • --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 turned out to be sufficient for stable results within one launch
  • --launchCount 7 - results are slightly volatile from launch to launch; this is smoothed out by the --launchCount option
  • --allStats helps to analyze volatile results
  • -m - memory stats; 0 bytes allocated, as expected
  • --join - joins all results into one report

Full results are attached.

Isolated results summary

Bench*One*:

BenchAddOneEmpty: 1.6 ns, BenchSubOneEmpty: 2.3 ns

| Method | leftSize | All-carry baseline mean, ns | All-carry new mean, ns | All-carry Ratio | All-carry RatioSD | No-carry baseline mean, ns | No-carry new mean, ns | No-carry Ratio | No-carry RatioSD | Comment |
|---|---|---|---|---|---|---|---|---|---|---|
| Add | 1 | 2.11 | 2.129 | 1.01 | 0.01 | 2.091 | 2.112 | 1.01 | 0.01 | |
| Sub | 1 | 2.942 | 3.161 | 1.07 | 0.04 | 2.954 | 3.118 | 1.06 | 0.05 | 0.2 ns regress |
| Add | 2 | 2.447 | 2.353 | 0.96 | 0.02 | 2.382 | 2.357 | 0.99 | 0.02 | |
| Sub | 2 | 3.241 | 3.388 | 1.05 | 0.03 | 3.221 | 3.426 | 1.06 | 0.03 | 0.2 ns regress |
| Add | 4 | 3.348 | 2.982 | 0.89 | 0.04 | 3.303 | 2.978 | 0.90 | 0.04 | ++ |
| Sub | 4 | 4.062 | 3.972 | 0.98 | 0.05 | 3.985 | 4.029 | 1.01 | 0.03 | |
| Add | 8 | 4.661 | 4.162 | 0.90 | 0.06 | 4.917 | 4.151 | 0.85 | 0.06 | ++ |
| Sub | 8 | 5.589 | 5.605 | 1.00 | 0.07 | 5.466 | 5.635 | 1.01 | 0.05 | |
| Add | 16 | 8.46 | 8.832 | 1.05 | 0.05 | 8.592 | 5.099 | 0.59 | 0.03 | All-carry: 0.4 ns regress |
| Sub | 16 | 8.878 | 9.407 | 1.06 | 0.02 | 8.824 | 6.12 | 0.69 | 0.01 | All-carry: 0.4 ns regress |
| Add | 64 | 29.638 | 29.372 | 0.99 | 0.05 | 30.856 | 8.994 | 0.29 | 0.03 | |
| Sub | 64 | 30.935 | 40.043 | 1.30 | 0.09 | 33.258 | 9.57 | 0.29 | 0.02 | All-carry: 30% regress, see highlights |
| Add | 128 | 61.288 | 62.448 | 1.02 | 0.01 | 61.096 | 12.841 | 0.21 | 0.01 | |
| Sub | 128 | 62.155 | 67.726 | 1.09 | 0.01 | 62.52 | 14.201 | 0.23 | 0 | |
| Add | 256 | 117.518 | 117.935 | 1.00 | 0 | 117.533 | 23.446 | 0.20 | 0.01 | |
| Sub | 256 | 118.833 | 123.06 | 1.04 | 0 | 118.701 | 24.195 | 0.20 | 0 | |
| Add | 1024 | 453.284 | 450.504 | 0.99 | 0 | 452.394 | 178.871 | 0.40 | 0 | |
| Sub | 1024 | 453.14 | 455.689 | 1.01 | 0 | 454.086 | 179.465 | 0.40 | 0 | |
| Add | 4096 | 1784.84 | 1789.858 | 1.00 | 0.01 | 1781.986 | 715.593 | 0.40 | 0 | |
| Sub | 4096 | 1781.051 | 1793.831 | 1.01 | 0 | 1781.391 | 717.848 | 0.40 | 0 | |
| Add | 16384 | 7187.529 | 7165.47 | 1.00 | 0 | 7185.807 | 2931.655 | 0.41 | 0 | |
| Sub | 16384 | 7155.643 | 7193.452 | 1.01 | 0 | 7179.188 | 2931.865 | 0.41 | 0 | |
| Add | 65536 | 28807.404 | 28803.928 | 1.00 | 0.01 | 28730.689 | 11684.663 | 0.41 | 0 | |
| Sub | 65536 | 28870.218 | 28809.959 | 1.00 | 0.01 | 28880.531 | 11690.097 | 0.40 | 0 | |

Highlights:

  • The only significant slowdown (10 ns, 30%) is subtraction from a 64-uint span in the "all-carry" case. This result is reproducible on my PC, but it seems to be a very specific case; I do not think it is a blocker.
  • In no-carry cases, when the length is in the range 64..256, the new variant is up to 5 times faster, but when the length is 1024+, the difference is only about 2.5 times. This is odd, but I have not investigated it.
  • Note that uint is 32 bits wide, so, for example, a 16-uint span corresponds to numbers around 2^512 (16 × 32 = 512 bits).

Bench*Multi*:

BenchAddMultiEmpty: 1.95 ns, BenchSubMultiEmpty: 1.74 ns

The result table is quite large, but the overall conclusion is the same: all-carry cases for small (1-8 uint) numbers are slower by less than 1 ns, and the average slowdown for larger numbers is typically up to 3-6%.
There are a few cases where the slowdown is more than 6% and more than 1 ns; I'd like to comment on them:

| Method | Case | leftSize | rightSize | Mean, ns | Ratio | RatioSD | Comment |
|---|---|---|---|---|---|---|---|
| Sub | no carry | 4 | 4 | 10.564 | 3.26 | 0.96 | A statistical artifact; maybe a background task started. Not reproduced in other tests. Min time is 3.6 ns. |
| Sub | no carry | 16 | 16 | 18.604 | 1.64 | 0.76 | A statistical artifact; maybe a background task started. Not reproduced in other tests. Min time is 10.7 ns. |
| Add | all-carry | 64 | 4 | 34.287 | 1.14 | 0.01 | The 64-uint size seems to be unlucky. |
| Sub | no carry | 16 | 15 | 10.367 | 1.13 | 0.05 | Yes, this case is 10-13% slower. |
| Sub | no carry | 64 | 64 | 40.048 | 1.08 | 0 | The 64-uint size seems to be unlucky. |
| Sub | no carry | 64 | 63 | 39.431 | 1.07 | 0.01 | The 64-uint size seems to be unlucky. |

In Bench*Multi* no-carry tests there is the same situation with length <= 256 versus length > 256: the first case is roughly 2 times faster than the second.

You may notice that some results are faster even without CopyTo; I explain this below.

Full

This benchmark tests the add/sub operators on the BigInteger class. There are only two benchmarked methods:

    [Benchmark]
    [ArgumentsSource(nameof(GetSizes))]
    public BigInteger Add(Entry left, Entry right)
    {
        return right.Value + left.Value;
    }

    [Benchmark]
    [ArgumentsSource(nameof(GetSizes))]
    public BigInteger Sub(Entry left, Entry right)
    {
        return left.Value - right.Value;
    }

The main part is the GetSizes() method:

  • Entry is a struct containing a BigInteger value and its parameters, with an overloaded ToString(). This struct makes it convenient to build values like 2^n+m and produces readable benchmark output; see the sketch after this list.
  • left.Value is always greater than or equal to right.Value.
  • left.Value is 2^n+m, where n is in {31, 32, 64, 128, 256, 512, 2048, 4096, 8192, 32768, 131072, 524288, 2097152} (these values, starting from the second, correspond to the span lengths from the previous test multiplied by 32) and m can be -1 or 0. Here I use the fast path for creating 2^n.
  • right.Value goes over {1, 2^16, 2^31, 2^(n/2), 2^n-1, 2^n}, where n is the same exponent as in left.Value.
  • GetSizes() generates the described pairs.
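
A minimal sketch of what Entry and GetSizes() could look like under the description above (member names and the exact filtering are my assumptions, not the actual benchmark code):

using System.Collections.Generic;
using System.Numerics;

public struct Entry
{
    public BigInteger Value;
    public string Name; // e.g. "2^512-1", printed in the benchmark output

    public override string ToString() => Name;
}

public IEnumerable<object[]> GetSizes()
{
    int[] exponents = { 31, 32, 64, 128, 256, 512, 2048, 4096, 8192, 32768, 131072, 524288, 2097152 };
    foreach (int n in exponents)
    {
        for (int m = -1; m <= 0; m++)
        {
            // BigInteger.One << n builds 2^n quickly (a single set bit).
            BigInteger left = (BigInteger.One << n) + m;
            foreach (BigInteger right in new[]
            {
                BigInteger.One,
                BigInteger.One << 16,
                BigInteger.One << 31,
                BigInteger.One << (n / 2),
                (BigInteger.One << n) - 1,
                BigInteger.One << n,
            })
            {
                if (right <= left) // keep the invariant left.Value >= right.Value
                {
                    yield return new object[]
                    {
                        new Entry { Value = left, Name = m == -1 ? $"2^{n}-1" : $"2^{n}" },
                        new Entry { Value = right, Name = right.ToString() },
                    };
                }
            }
        }
    }
}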

All tests are executed in the same environments as the Isolated benchmark. Again, the first environment is roughly twice as fast as the other two, but the overall relative performance results are similar, so only the results for "Main" are provided.

The following command was used to execute the tests:

dotnet run -c Release -- -f "*" -h Toolchain --coreRun $baseline_corerun $new_corerun --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 --launchCount 3 --allStats -m --join

Used options:

  • -h Toolchain - hides the long path to corerun in the results
  • --iterationTime 350 --minIterationCount 5 --minWarmupCount 4 --maxWarmupCount 8 - turned out to be sufficient for stable results within one launch
  • --launchCount 3 - results are slightly volatile from launch to launch; this is smoothed out by the --launchCount option
  • --allStats - helps to analyze volatile results
  • -m - memory stats; 0 bytes allocated, as expected
  • --join - joins all results into one table

Full results are attached. There are 548 test cases - 274 baseline and 274 new version - which is quite a lot, so I do not include them in the text.
Benchmark highlights:

  • The best and worst cases are the same as in "Isolated", but the relative difference between the baseline and the new version is smaller because the full version does extra work.
  • Of the 274 cases, 64 are faster (0.45-0.95 of baseline), 198 are the same (0.96-1.04 of baseline), and 12 are slower (1.05-1.1); but 5 of the slower ones are within a 1 ns difference and 1 is a statistical artifact (min and median are much less than the mean).
  • Again, the worst cases are "all-carry" subtractions when n == 64*32 == 2048, with the same ~10 ns difference. In the full tests it is about 10%, not 30% as in the isolated ones.

Stock

This benchmark is fast but not representative of this modification. Still, I ran it to check for regressions.

Command:

dotnet run -c Release -f net8.0 -- --filter "*BigInteger*" -h Toolchain --launchCount 7 --coreRun $baseline_corerun $new_corerun --join

All new results are within the StdDev of the baseline.

Unit-tests

All BigInteger tests pass. I see no reason to create new unit tests; if ideas for new ones come up, I'll implement them.

What exactly changed and why

All changes are in the file BigIntegerCalculator.AddSub.cs.

Add const int CopyToThreshold = 8.

Like other similar constants, it becomes a private static int in debug builds for testing purposes. The value 8 was chosen based on the test results. When the size of the left argument is less than or equal to CopyToThreshold, the new approach does not apply.
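
A sketch of the declaration pattern I mean; the DEBUG shape mirrors other thresholds in BigIntegerCalculator (the exact form in the PR may differ):

#if DEBUG
// Mutable in debug builds so tests can lower the threshold and exercise both paths.
private static int CopyToThreshold = 8;
#else
private const int CopyToThreshold = 8;
#endif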

Add static void CopyTail() method

private static void CopyTail(ReadOnlySpan<uint> source, Span<uint> dest, int start)
{
    // Move the unchanged tail in one block copy instead of per-element arithmetic.
    source.Slice(start).CopyTo(dest.Slice(start));
}

In my tests the method was always inlined. It copies all elements from source to dest, starting at index start.

Use nint instead of int for loop counter

I found that Unsafe.Add(ref <ptr>, i) sign-extends an int index, which adds an extra movsxd instruction (on x64) in the hottest loops. It accounts for up to 5% of the time in such loops (this is why many tests are slightly faster than the baseline). The same change was made for the upper bounds of the loops. IMO this change is safe.
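
A hedged sketch of the pattern (assumed shape, not the verbatim diff): Unsafe.Add has a native-integer overload, so a nint index needs no per-iteration sign extension, and the loop bound is widened once, up front.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static void PropagateCarry(ReadOnlySpan<uint> left, Span<uint> bits, long carry)
{
    // bits is assumed to be at least as long as left.
    ref uint leftPtr = ref MemoryMarshal.GetReference(left);
    ref uint resultPtr = ref MemoryMarshal.GetReference(bits);

    nint length = left.Length; // upper bound widened once, outside the loop
    for (nint i = 0; i < length; i++)
    {
        long digit = Unsafe.Add(ref leftPtr, i) + carry;
        Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
        carry = digit >> 32;
    }
}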

Break the second loop when carry == 0

I used this pattern:

carry >>= 32;
if (carry == 0) break;

The JIT manages to reuse the zero flag set by the shift on the first line for the conditional jump, so there is no extra cmp.
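
Putting the pieces together, a hedged reconstruction of the tail loop after the change (assumed shape based on the description above, not the verbatim PR code):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static void AddTail(ReadOnlySpan<uint> left, Span<uint> bits, nint i, long carry)
{
    // i is where the shared-length loop stopped; bits.Length == left.Length + 1.
    ref uint leftPtr = ref MemoryMarshal.GetReference(left);
    ref uint resultPtr = ref MemoryMarshal.GetReference(bits);

    for (; i < left.Length; i++)
    {
        long digit = Unsafe.Add(ref leftPtr, i) + carry;
        Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
        carry = digit >> 32;
        if (carry == 0)
        {
            // result[j] == left[j] for every remaining j: switch to a block copy.
            CopyTail(left, bits, (int)i + 1);
            i = left.Length;
            break;
        }
    }
    Unsafe.Add(ref resultPtr, i) = (uint)carry; // writes 0 after a break
}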

"Right" span converted to ref same way as "left" and "result" spans.

This avoids an extra range check per iteration; see the illustrative loop below.
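
For illustration, the shared-prefix loop with all three buffers accessed through refs might look like this (assumed shape; leftPtr, resultPtr, and carry are as in the earlier sketches):

ref uint rightPtr = ref MemoryMarshal.GetReference(right);

for (nint i = 0; i < right.Length; i++)
{
    long digit = (Unsafe.Add(ref leftPtr, i) + carry) + Unsafe.Add(ref rightPtr, i);
    Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
    carry = digit >> 32;
}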

Conclusion

The modified version is faster when the sizes of the arguments differ, the larger argument is more than 8 uints, and the carry flag becomes 0 quickly. This is a typical case for loops with increments and decrements, or for evaluating polynomials.
In most other cases performance is the same; in some cases where the carry stays non-zero it is slightly slower (1-5%). The worst case, when the argument size is 64 uints and the carry stays non-zero, is about 10% slower.
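
As an illustration of a workload that benefits, consider a counter-style loop over a large BigInteger (my example, not taken from the benchmark):

using System.Numerics;

// The left operand is thousands of bits wide, the right operand is one uint,
// and the carry dies out after the first word - exactly the shape the new
// code path accelerates by copying the untouched tail instead of re-adding it.
BigInteger acc = BigInteger.One << 4096;
for (int k = 0; k < 1_000_000; k++)
{
    acc += 1;
}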

benchmark-results.zip

@speshuric (Contributor Author):

@tannergooding please take a look

@speshuric (Contributor Author):

Just discovered an old issue on the same theme: #41495 (mentioning it here to keep them linked)

@tannergooding (Member):

This is on my backlog to review more in depth, but I might not be able to get to it until next week.

At a high level overview, the numbers look acceptable.

@ghost locked this issue as resolved and limited conversation to collaborators on Aug 18, 2023.