-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System.Numerics.BigInteger: Add/Subtract performance can be improved when size of arguments are different #83457
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsDescription
Reproducing the issue is possible in most environments. This is not a regression but a new optimization. Main idea can be demonstrated on this part of int i = 0;
long carry = 0L;
// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...
for ( ; i < right.Length; i++)
{
long digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
carry = digit >> 32;
}
for ( ; i < left.Length; i++)
{
// "target loop"
long digit = left[i] + carry;
Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
carry = digit >> 32;
}
Unsafe.Add(ref resultPtr, i) = (uint)carry; In the second loop (marked as Analysis
Following statements are correct for
So it can be rewritten as follows: int i = 0;
ulong carry = 0L;
// ...
ref uint leftPtr = ref MemoryMarshal.GetReference(left);
ref uint resultPtr = ref MemoryMarshal.GetReference(bits);
// ...
for ( ; i < right.Length; i++) // this loop was not modified
{
ulong digit = (Unsafe.Add(ref leftPtr, i) + carry) + right[i];
Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
carry = digit >> 32;
}
for ( ; carry != 0 && i < left.Length; i++) // carry != 0 is checked
{
ulong digit = Unsafe.Add(ref leftPtr, i) + carry;
Unsafe.Add(ref resultPtr, i) = unchecked((uint)digit);
carry = digit >> 32;
}
if (i < left.Length)
{
// only move data from left argument to result
do
{
Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i);
i++;
} while (i < left.Length);
// Note: if left part to copy is large then special method CopyTo is better
//left.Slice(i).CopyTo(bits.Slice(i));
i = left.Length;
carry = 0;
}
Unsafe.Add(ref resultPtr, i) = unchecked((uint)carry); Methods of data movementIn this variant, second loop checks when
On my PC (CPU Ryzen 5700G) in my draft benchmarks the second is faster when approximately Best and worst casesThe best case of argument data for the new version is such values of argument arrays when Draft benchmarksI have written some draft microbenchmarks to measure the difference. These benchmarks test 3 versions of
Two cases are tested:
Tests cover In draft benchmark attribute Valuable draft benchmark points
Important note: To do's
Open questions
DataDraft benchmark results are below. I will publish the code of benchmark soon. Data of draft benchmarikng (click to expand)
|
Just a status update.
|
Status update
|
Have to be careful with this type of optimization because while it will speed up the case where the carry terminates quickly, it will also slow down the case where the carry does not due to have an additional condition and branch per iteration. Perf numbers showing the improvement for many different scenarios, including "worst case" scenarios will be important. |
I've created PR. Below I'll describe my benchmarks and its results. |
I apologize in advance to the participants for posting the results step by step, and not all at once. There are a lot of results and it takes time to recheck (and my personal time for this issue is very limited). I should note that dotnet/performance tests are unsuitable for this issue. Almost all tests are "N-bit, N-bit", so usage of them is limited. I've executed them to check and prevent general regressions. It seems that all results are same or not related or difference is negligible, but I'll post them a bit later. For this issue I made separate benchmark. It is quite simple benchmark but there are some special remarks. This benchmark checks add/subtract operations for wide range of arguments. All arguments are non-negative. First argument always greater or equal the second and all such possible pairs are checked, so test set is large and sometimes is excessive. Following numbers are in testset:
The test are executed on the runtime's Main points:
Max effect of optimization:
(This is very rough estimate. Details will be shown later.) Worst cases and overall results will be discussed in next comment. (to be continued...) |
I've tuned benchmark's time. Now it is about 2 hrs.
Right (shorter) arguments are (all mus be smaller than left or equals):
One of the benchmark results here (warning: large results). |
Benchmarks resultsI've analyzed output of benchmarks. Here are facts, thoughts and conclusions.
Latest benchmark results: Benchmarks.BigintBenchmark-report.zip |
Benchmarks of add/sub 64-256 bit integerI've restarted this benchmarks. There is still instable high spread in some cases (which is changed from launch to launch), but whole picture is clear to me. The worst cases are slower then in reference build by 3-5%. In absolute timing it is about 1-2 ns penalty. Benchmarks.BigintBenchmark-report64-256.zip What else I plan to do:
@tannergooding (or anybody else): What else could I do to help you decide merge or decline this changes? What can you advice to check? Should I check other execution modes (32-bit, AOT, wasm, whatever else)? How to check performance of ARM build? |
Now while another bunch of benchmarks are running, I've taken a look at case // Try to conserve space as much as possible by checking for wasted leading span entries
// sometimes the span has leading zeros from bit manipulation operations & and ^
for (len = value.Length; len > 0 && value[len - 1] == 0; len--); BTW it can be replaced by |
In BigIntegerCalculator methods Add and Subtract, if sizes of arguments differ, after processing the part of size of the right (small) argument, there was loop of add/sub carry value. When the carry value once become zero, in fact the rest of the larger argument can be copied to the result. With this commit the second loop is interrupted when carry become zero and applies fast Span.CopyTo(Span) to the rest part. This optimization applied only when size of the greatest argument is more or equal to const CopyToThreshold introduced in this commit. This const is 8 now. Also made minor related changes to hot cycles. See dotnet#83457 for details.
Summary reportPlan
Benchmark's design and resultsI performed 3 benchmark sets:
IsolatedThis benchmark tests add/sub methods on uint spans. Benchmarked methods:
Every benchmark method tested in "good" and "bad" case. Where "good" case means no-carry tests and "bad" case means all-carry tests. Good and bad cases for add and subtract are different.
All tests are executed on following environments:
The first environment is roughly twice faster than other two, but overall relative performance results are similar, so only results for "Main" are provided. This command used to execute tests:
Used options:
Full results are attached. Isolated results summary
Highlights:
Result table is quite large, but overall conclusion is the same: all-carry cases for small (1-8 uints) numbers are slower within 1 ns, average slowdown for larger numbers typically is up to 3-6%.
In You may notice that some results are faster even without FullThis benchmark tests add/sub methods on BigInteger class. There are only 2 benchmarked methods: [Benchmark]
[ArgumentsSource(nameof(GetSizes))]
public BigInteger Add(Entry left, Entry right)
{
return right.Value + left.Value;
}
[Benchmark]
[ArgumentsSource(nameof(GetSizes))]
public BigInteger Sub(Entry left, Entry right)
{
return left.Value - right.Value;
} All the main part is the method
All tests are executed on the same environments as Isolated benchmark. Again, the first environment is roughly twice faster than other two, but overall relative performance results are similar, so only results for "main PC" are provided. This command used to execute tests:
Used options:
Full results are attached. There are 548 test cases - 274 baseline and 274 new version, quite a lot, so I do not put it in the text.
StockThis benchmark is fast but not representative for this modification. Despite this, I performed it to check for regression. Command:
All new results are within StdDev of baseline. Unit-testsAll BigInteger tests are passed. I see no reason to create new UT. If there will be ideas for new UT, I'll implement them. What exactly changed and whyAll changes are in the file Add const
|
@tannergooding please take a look |
Just discovered old issue to the same theme: #41495 (mention to keep linked) |
This is on my backlog to review more in depth, but I might not be able to get to it until next week. At a high level overview, the numbers look acceptable. |
Description
System.Numerics.BigInteger
add and subtract operations for non-trivial cases are implemented inAdd
/Subtract
static methods of internal classBigIntegerCalculator
. Current implementation can be improved by special handling the case ofcarry==0
when the current position being processed goes beyond the length of the right (short) argument, but does not exceed the length of the left (long) argument.Reproducing the issue is possible in most environments. This is not a regression but a new optimization.
Main idea can be demonstrated on this part of
Add
method:In the second loop (marked as
// "target loop"
) oncecarry
is set to 0 it can not be 1 anymore. So the tail of the loop is just the movement of argument values to result span.Analysis
BigIntegerCalculator
now contains 6 static metods for add and subtract:public static void Add(ReadOnlySpan<uint> left, uint right, Span<uint> bits);
- used when right argument length is 1public static void Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
- default algorithm for additionprivate static void AddSelf(Span<uint> left, ReadOnlySpan<uint> right)
- checkscarry
and breaks the looppublic static void Subtract(ReadOnlySpan<uint> left, uint right, Span<uint> bits)
- used when right argument length is 1public static void Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
- default algorithm for subtractionprivate static void SubtractSelf(Span<uint> left, ReadOnlySpan<uint> right)
- checkscarry
and breaks the loopAddSelf
andSubtractSelf
used in internals ofSquMul
part ofBigIntegerCalculator
, cannot be optimized this way and are not considered below.Add
andSubtract
can be optimized almost identically so only case ofAdd(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
will be used to describe.Following statements are correct for
"target loop"
carry
can be 1 or 0carry
is 1 then outcomecarry
can be 1 if and only ifleft[i] == uint.MaxValue
andresult[i]==0
.carry
is 0 thenresult[i] = left[i]
for every nexti
. This assignment can be optimized by remove any arithmetic and use of special platform-optimized methods of copying data.So it can be rewritten as follows:
Methods of data movement
In this variant, second loop checks when
carry
become 0 and then special case (plain movement) is triggered. Two possible ways to copy data can be considered:Unsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++;
Span.CopyTo()
:left.Slice(i).CopyTo(bits.Slice(i));
. This internally calls high optimizedBuffer.Memmove()
butSpan.CopyTo()
do some redundant checks and thus can be slower on short slices.On my PC (CPU Ryzen 5700G) in my draft benchmarks the second is faster when approximately
left.Length - right.Length >= 16
Best and worst cases
The best case of argument data for the new version is such values of argument arrays when
carry == 0
.The worst case is when
carry
is always 1, i.e. allleft[i] == uint.MaxValue
.There should be more difference when
left.Length - right.Length
is large and shouldn't difference when they are equals.Draft benchmarks
I have written some draft microbenchmarks to measure the difference. These benchmarks test 3 versions of
Add
method:AddOld
- code in the current runtimeAddNew
- withUnsafe.Add(ref resultPtr, i) = Unsafe.Add(ref leftPtr, i); i++;
loopAddNewMemmove
withleft.Slice(i).CopyTo(bits.Slice(i));
without loopTwo cases are tested:
bad
- when allleft[i]
andright[i]
areuint.MaxValue
good
- when allleft[i]
andright[i]
are 1Tests cover
left.Length
from 2 to 256 (doubles on each step, i.e. 2, 4, 8, 16, 32, 64, 128, 256) andright.Length
from 2 toleft.Length
(doubles each step too).In draft benchmark attribute
[InvocationCount(10000000)]
was used, hence there can be some inaccuracies on small lengths.Valuable draft benchmark points
Add
method. It needs some more analysis, but at first look it seems that regressions are insignificant.AddNew
version (as expected) is observed in case ofgood
arrays withright.Length
2 uints andleft.Length
256 uints. Old version mean is 122.771 ns,AddNew
mean is 65.197 ns.AddNewMemmove
version is in the same test (good
, 256, 2) is even faster and it's mean is 12.797 ns. Almost 10x faster than old version!AddNewMemmove
version faster thanAddNew
whenleft.Length - right.Length >= 16
Important note:
Add
method takes only part time of whole operation. End-to-end results will be comparable in absolute time difference but relative difference will differ much less.To do's
Open questions
Data
Draft benchmark results are below. I will publish the code of benchmark soon.
Data of draft benchmarikng (click to expand)
The text was updated successfully, but these errors were encountered: