
SqlBulkCopy - generics to avoid boxing #358

Closed
cmeyertons wants to merge 33 commits

Conversation

cmeyertons
Contributor

@cmeyertons cmeyertons commented Dec 18, 2019

This PR is in reference to #353

This is an attempt to reduce garbage created during a SqlBulkCopy operation for applications that are using SqlBulkCopy to stream batches of data out during a larger overall process.

The overall approach was to be more respectful of the IDataReader interface and use its typed methods where possible to eliminate unnecessary boxing. The unboxed value is passed around generically and converted back to its underlying type using a ValueTypeConverter.

I did observe some performance degradation when passing values around by value, which kicked off the idea of passing everything by ref -- this is more in line with how SqlBulkCopy previously behaved, minus the boxing.
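
As a rough illustration of the approach (a minimal sketch with hypothetical helper names, not the actual SqlBulkCopy internals): the read path dispatches to the typed IDataReader getters where the column type is known, and the value then flows downward as a generic T instead of a boxed object.

    using System;
    using System.Data;

    internal static class TypedReadSketch
    {
        // Hypothetical helper: copy one row using typed getters where possible
        // so value types never pass through object on the common paths.
        public static void CopyRow(IDataReader reader)
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                Type fieldType = reader.GetFieldType(i);

                if (fieldType == typeof(int))
                {
                    Write(reader.GetInt32(i));      // stays an int end to end, no box
                }
                else if (fieldType == typeof(decimal))
                {
                    Write(reader.GetDecimal(i));    // stays a decimal end to end, no box
                }
                else
                {
                    Write(reader.GetValue(i));      // fallback: boxed object, as before
                }
            }
        }

        // The generic value is handed down to the writer; a converter like the
        // ValueTypeConverter discussed below recovers the concrete type at the end.
        private static void Write<T>(T value)
        {
            // ... serialize value to the TDS stream ...
        }
    }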

While I do see less garbage created, I am seeing some disappointing performance results (abbreviated Benchmark.NET results):

| Method                | Mean     | Gen 0    | Gen 1 | Gen 2 | Allocated  |
|-----------------------|---------:|---------:|------:|------:|-----------:|
| NewBulkCopy_NewReader | 387.4 ms |        - |     - |     - | 12017576 B |
| OldBulkCopy_NewReader | 161.9 ms | 250.0000 |     - |     - | 26412872 B |

There is still a considerable amount of heap allocation occurring, primarily from SqlDecimal -- I have logged the following issue to address it: dotnet/runtime#1034

I think there may be something different about the way my dlls are compiled -- I'm seeing much slower times in places where I did not make any changes, such as TdsParser.WriteTokenLength:

My WriteTokenLength: 144ms
Old WriteTokenLength: 30ms

Any assistance in understanding the perf degradation is welcomed and appreciated!

@dnfclas

dnfclas commented Dec 18, 2019

CLA assistant check
All CLA requirements met.

@cheenamalhotra
Member

There seem to be build errors for the PR. Please take a look at the logs and try to build and run the tests locally before the pipelines trigger builds. You can use the steps from BUILDGUIDE to do the same.

@Wraith2
Contributor

Wraith2 commented Dec 19, 2019

Can you post your test program/query and I'll see if I can help with the WriteTokenLength speed issue?

@cmeyertons
Contributor Author

cmeyertons commented Dec 19, 2019

@Wraith2 I pushed the benchmark branch to my fork; it has a Benchmark project added to it. Currently it's pulling version 1.1 from NuGet to run the benchmark.

@Wraith2
Contributor

Wraith2 commented Dec 25, 2019

I've done some experimenting and have some suggestions. I've only checked all of this on netcore, not netfx, so you'll need to check whether the same behaviour applies on netfx.

ValueTypeConverter can be reduced to:

    public static class ValueTypeConverter
    {
        /// <summary>
        /// Converts the value to the TOut type
        /// </summary>
        /// <typeparam name="TIn"></typeparam>
        /// <typeparam name="TOut"></typeparam>
        public static TOut Convert<TIn,TOut>(TIn input) => (TOut)(object)input;
    }

And then made internal (so it doesn't change the public API surface). You can then add a linked reference to the file in the test harness so you can access it if you need to. Though really, I don't think there need to be specific tests for it.

  1. I suggest changing the type to be non-generic and making the method generic so that the number of instantiations doesn't eat space in the type list for the app. Consider that if you have a list of all the types in the application and then add 40 generic instantiations of ValueTypeConverter, you have to search longer than if you have a single ValueTypeConverter type with 40 method overloads on it.
  2. When you're not using lambda compilation you don't need the 'ref' everywhere, so all the signatures can have it removed, which makes things look a lot more natural. Because you're no longer having to pass references, you don't need to add many intermediate variables -- you can just instantiate directly at the call site.
  3. AOT compilation will "see" these calls and be able to produce the required instantiations because they're all statically declared in the code. This removes any problem with lambda compilation and code emission.

All of this relies on the netcore JIT being able to elide the pattern (T)(object)t down to t. If netfx doesn't do this then your approach of using lambda compilation will probably be the way to implement it on netfx, and that should be OK because we know netfx is desktop only. If this is the case, can you see if it's possible to do it using the same public API surface on ValueTypeConverter, using a local variable to ref inside the call somewhere? That way, if we ever get to the position where we can merge a lot of the netfx and netcore codebases, we only have to conditionally include the correct implementation and it'll all still fit together.
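
For illustration, here is a minimal write-side sketch of the shape being described (hypothetical method and type names, not the PR's actual code): a single non-generic converter type with a generic method, used at a call site where a JIT that elides (TOut)(object)input avoids any box for matching value types.

    internal static class ValueTypeConverter
    {
        // Single non-generic type with a generic method, as suggested above.
        public static TOut Convert<TIn, TOut>(TIn input) => (TOut)(object)input;
    }

    internal static class WriteSketch
    {
        // Hypothetical call site: T arrives from the typed read path and is
        // converted back to the concrete column type; on a JIT that elides
        // the box/unbox pair for matching value types, nothing is allocated.
        public static decimal WriteDecimalColumn<T>(T value)
        {
            // ... in the real code this value would be written to the output buffer ...
            return ValueTypeConverter.Convert<T, decimal>(value);
        }
    }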

@cmeyertons
Contributor Author

I've done some experimenting and have some suggestions. I've only checked all of this on netcore, not netfx, so you'll need to check whether the same behaviour applies on netfx.

ValueTypeConverter can be reduced to:

    public static class ValueTypeConverter
    {
        /// <summary>
        /// Converts the value to the TOut type
        /// </summary>
        /// <typeparam name="TIn"></typeparam>
        /// <typeparam name="TOut"></typeparam>
        public static TOut Convert<TIn,TOut>(TIn input) => (TOut)(object)input;
    }

And then made internal (so it doesn't change the public API surface). You can then add a linked reference to the file in the test harness so you can access it if you need to. Though really, I don't think there need to be specific tests for it.

I will implement your ValueTypeConverter suggestions -- I like that approach more. I will also make it internal.

I was using the unit tests to iterate more quickly on the ValueTypeConverter implementation -- it took some trial and error to get right, especially for the ref implementation. I would say they are still valuable to keep, but I definitely agree that adding the linked reference is much cleaner than making it public.

Side note, I was kind of surprised to see the Test project as a whole didn't have internals of the SqlClient project exposed to it.

  1. I suggest changing the type to be non-generic and making the method generic so that the number of instantiations doesn't eat space in the type list for the app. Consider that if you have a list of all the types in the application and then add 40 generic instantiations of ValueTypeConverter, you have to search longer than if you have a single ValueTypeConverter type with 40 method overloads on it.

I agree -- that is definitely an improvement and I will implement it.

  2. When you're not using lambda compilation you don't need the 'ref' everywhere, so all the signatures can have it removed, which makes things look a lot more natural. Because you're no longer having to pass references, you don't need to add many intermediate variables -- you can just instantiate directly at the call site.

I actually didn't need the ref implementation initially -- it's not a requirement of the lambda compilation. My first iteration omitted ref, so T value would pass by value; ValueTypeConverter just needed some slight tweaks to go to ref. Then I started trying to diagnose the performance issues. I realized the current implementation passes the boxed value around by reference, so I converted to the ref implementation to align with the previous behavior. I was hoping to see better perf using ref, but did not.

Side note: for a decimal type, is the conclusion correct that passing by value is slower than passing by stack reference (16 bytes vs. 8 bytes on x64)?
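
For concreteness, a small sketch of the two calling conventions in question (hypothetical method names; as noted in the reply below, which one wins is JIT- and ABI-dependent):

    internal static class DecimalPassingSketch
    {
        // Copies the full 16-byte decimal on every call.
        public static decimal AddOneByValue(decimal value) => value + 1m;

        // Passes an 8-byte read-only reference on x64 instead of copying the struct.
        public static decimal AddOneByRef(in decimal value) => value + 1m;
    }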

  3. AOT compilation will "see" these calls and be able to produce the required instantiations because they're all statically declared in the code. This removes any problem with lambda compilation and code emission.

All of this relies on the netcore JIT being able to elide the pattern (T)(object)t down to t. If netfx doesn't do this then your approach of using lambda compilation will probably be the way to implement it on netfx, and that should be OK because we know netfx is desktop only. If this is the case, can you see if it's possible to do it using the same public API surface on ValueTypeConverter, using a local variable to ref inside the call somewhere? That way, if we ever get to the position where we can merge a lot of the netfx and netcore codebases, we only have to conditionally include the correct implementation and it'll all still fit together.

I will definitely keep the public API surface aligned between netfx and netcore (I guess that means I have to copy over all of the T value changes to the netfx side?).

@Wraith2
Contributor

Wraith2 commented Dec 30, 2019

Side note: for a decimal type, is the conclusion correct that passing by value is slower than passing by stack reference (16 bytes vs. 8 bytes on x64)?

That's a behaviour which can be changed by the JIT and by the platform. Unix and Windows may have different behaviours due to the ABI of the generated native code. It is also a behaviour that may change in future iterations of the JIT, so it's a level I don't tend to push at too hard. So I don't know 😁.

@cmeyertons
Contributor Author

cmeyertons commented Dec 31, 2019

Expanded ValueMethod to include GetInt32, etc. -- considering switching it to a bit-wise enum to possibly improve perf.
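
A rough sketch of what a bit-wise ValueMethod could look like (member names here are illustrative, not the driver's actual enum):

    using System;

    [Flags]
    internal enum ValueMethod
    {
        None       = 0,
        GetValue   = 1 << 0,
        GetInt32   = 1 << 1,
        GetInt64   = 1 << 2,
        GetDecimal = 1 << 3,
        GetString  = 1 << 4,
        GetBoolean = 1 << 5,
        // With flags, a capability check becomes a single bitwise test:
        // (methods & ValueMethod.GetInt32) != 0
    }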

Still seeing some perf diffs; however, they're much smaller (about 20%) when doing a like-for-like comparison using the benchmark_current branch as the baseline and the benchmark branch containing the generic changes.

| Method    | Mean     | Error    | StdDev   | Rank | Gen 0     | Gen 1 | Gen 2 | Allocated |
|-----------|---------:|---------:|---------:|-----:|----------:|------:|------:|----------:|
| current_1 | 264.8 ms | 15.17 ms | 44.49 ms |    1 | 4000.0000 |     - |     - |  25.19 MB |
| current_2 | 273.2 ms | 21.70 ms | 63.97 ms |    1 | 4000.0000 |     - |     - |  25.19 MB |
| new_1     | 327.1 ms | 20.00 ms | 58.65 ms |    1 | 1000.0000 |     - |     - |  11.46 MB |
| new_2     | 319.9 ms | 20.81 ms | 60.36 ms |    1 | 1000.0000 |     - |     - |  11.46 MB |
| new_3     | 306.6 ms | 20.38 ms | 60.09 ms |    1 | 1000.0000 |     - |     - |  11.46 MB |

@cmeyertons
Contributor Author

@Wraith2 I completed your PR feedback on the netcore side.

Some notes:

I changed the name of ValueTypeConverter to GenericConverter to more accurately describe the behavior. I think the netfx side is still going to require an underlying GenericConverter<TIn, TOut> that leverages the static constructor to cache the compiled lambda expression -- I didn't see a good way of implementing that without some typed dictionary lookup (which would not perform adequately compared to the generic).
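
For reference, a minimal sketch of the kind of netfx fallback being described, assuming a GenericConverter<TIn, TOut> whose static initialization compiles and caches the conversion lambda once per type pair (names and shape are illustrative, not the PR's actual code):

    using System;
    using System.Linq.Expressions;

    internal static class GenericConverter<TIn, TOut>
    {
        // Compiled once per (TIn, TOut) pair during static initialization,
        // so the expression-tree cost is paid only on first use.
        private static readonly Func<TIn, TOut> s_convert = Build();

        private static Func<TIn, TOut> Build()
        {
            ParameterExpression input = Expression.Parameter(typeof(TIn), "input");
            Expression body = typeof(TIn) == typeof(TOut)
                ? (Expression)input                          // identity: no conversion node needed
                : Expression.Convert(input, typeof(TOut));   // otherwise emit a conversion
            return Expression.Lambda<Func<TIn, TOut>>(body, input).Compile();
        }

        public static TOut Convert(TIn input) => s_convert(input);
    }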

I'd like to resolve the perf issues above before porting to netfx, to avoid having to keep the code in sync in multiple places.

Some ideas I have:

  1. passing the decimal / SqlDecimal around by value is hurting perf
  2. less inlining w/ the generics?
  3. ValueMethod could be converted to a bit-wise enum.

I've been profiling with JetBrains dotTrace and haven't had success in determining why it's so much slower -- it's hard to pinpoint when comparing old vs. new because the control flow has changed to let the generic value flow downward from read to write.

@Wraith2
Contributor

Wraith2 commented Jan 2, 2020

Can you export the profiler traces and upload them? I have dotTrace, so I should be able to open them and take a look. You also need to take a look at the test failures and see if there's anything that can be fixed.

@cmeyertons
Contributor Author

Can you export the profiler traces and upload them? I have dotTrace, so I should be able to open them and take a look. You also need to take a look at the test failures and see if there's anything that can be fixed.

@Wraith2 I can't upload to GitHub, so I sent them to the email listed on your profile. I will take a look at the tests to get that cleaned up.

@cmeyertons
Contributor Author

I split the benchmark out into four different tables (decimal, string, int, bool) being copied into (instead of one table with all four columns), and now I'm seeing zero performance degradation:

| Method             | Mean     | Error    | StdDev    | Median   | Rank | Gen 0     | Gen 1 | Gen 2 | Allocated   |
|--------------------|---------:|---------:|----------:|---------:|-----:|----------:|------:|------:|------------:|
| Decimal_100000_old | 86.23 ms | 4.994 ms | 14.004 ms | 83.92 ms |    4 | 3000.0000 |     - |     - | 18760.16 KB |
| Decimal_100000_new | 73.38 ms | 2.352 ms |  6.749 ms | 70.83 ms |    3 | 1875.0000 |     - |     - | 11728.91 KB |
| String_100000_old  | 69.12 ms | 3.608 ms |  9.996 ms | 67.73 ms |    3 |         - |     - |     - |     8.72 KB |
| String_100000_new  | 63.69 ms | 4.466 ms | 12.742 ms | 59.59 ms |    2 |         - |     - |     - |     8.78 KB |
| Int_100000_old     | 54.92 ms | 1.976 ms |  5.702 ms | 53.51 ms |    1 |  333.3333 |     - |     - |  2352.45 KB |
| Int_100000_new     | 50.90 ms | 1.791 ms |  5.081 ms | 48.76 ms |    1 |         - |     - |     - |     8.05 KB |
| Bool_100000_old    | 58.14 ms | 2.311 ms |  6.518 ms | 55.70 ms |    2 |  333.3333 |     - |     - |   2352.5 KB |
| Bool_100000_new    | 59.42 ms | 3.138 ms |  8.747 ms | 56.05 ms |    2 |         - |     - |     - |     8.11 KB |

@Wraith2
Contributor

Wraith2 commented Jan 2, 2020

Weird. Incidentally, a lot of those SqlDecimal allocations will be removed in .NET 5 if dotnet/runtime#1155 gets merged.

Contributor Author

@cmeyertons cmeyertons left a comment

@Wraith2 @cheenamalhotra please see my comments in this review.

@cheenamalhotra does this logic need to be additionally ported into netfx?

@cmeyertons
Contributor Author

There was a problem with my benchmark (some randomness), and after de-randomizing the copied data, the changes in this PR appear to improve BulkCopy's overall performance (as I was hoping).

Member

@cheenamalhotra cheenamalhotra left a comment

General comments:

  1. The GenericConverter classes you added in the NetFx and NetCore projects are not the same.
  2. You did not make changes to the NetFx project -- any specific reason?
  3. Also, I did not see extensive test cases reflecting what we achieved with the PR.

I have yet to go deep down the logic path; I just wanted to highlight some top observations.

@cmeyertons
Contributor Author

cmeyertons commented Sep 14, 2020

Are the failing checks due to something more widespread in the SqlClient ecosystem?

It looks like a lot of CreateDatabase failures in the Enclave jobs, plus two functional test failures related to Always Encrypted.

@ErikEJ
Contributor

ErikEJ commented Sep 14, 2020

@cmeyertons Yes, I think these failures are not related to your PR.

@cmeyertons
Contributor Author

@cheenamalhotra @DavoudEshtehari @Wraith2 I believe this PR has all feedback addressed -- the only remaining item is the decision on whether or not we want to port these changes to netfx. I'm happy to do so; I just need to hear from the team whether we're good with the changes thus far.

@Wraith2
Contributor

Wraith2 commented Sep 17, 2020

Have you investigated how difficult/different the port to netfx would be? I think if it's low friction you'll almost certainly be asked to do so; there's a lot of line-of-business software out there using bulk copy which would be improved by better performance, and it would be a good selling point for driving library adoption over the system version.

@cheenamalhotra
Member

cheenamalhotra commented Sep 18, 2020

@cmeyertons

Please go ahead and update the NetFx source. I think the changes look good to take forward to completion 👍

Please don't mind the Enclave pipeline failures; we're dealing with some environment issues at the moment. We've disabled these pipelines for PR validation for now and will resume them when the test environments are ready again!

@cheenamalhotra cheenamalhotra added this to the 2.1.0-preview2 milestone Sep 18, 2020
@cmeyertons
Contributor Author

@cheenamalhotra @Wraith2 I performed the following:

  • ported the code into netfx -- it was pretty trivial. The only thing I had to watch out for was not copying over the optimizations made to Guid to use stackalloc (I left that alone to use .ToByteArray(); see the sketch after this list).
  • got the benchmark fully working (even on net461) as a unit test that measures the no-boxing effect.
  • after getting the benchmark fully working, I was able to confirm that (TOut)(object)value gets optimized out by the JIT even on net461, so I was able to align the netfx and netcore GenericConverter implementations.
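
For context on the first bullet, a hedged sketch of the Guid difference (illustrative names only; the real write path differs): netcore can serialize a Guid into a caller-provided buffer, while netfx keeps the allocating ToByteArray() call.

    using System;

    internal static class GuidWriteSketch
    {
    #if NETCOREAPP
        // netcore path: fill a caller-provided (possibly stackalloc'd) buffer,
        // avoiding a 16-byte array allocation per Guid.
        public static bool Write(Guid value, Span<byte> destination)
            => value.TryWriteBytes(destination);
    #else
        // netfx path: allocates a small array per call.
        public static byte[] Write(Guid value)
            => value.ToByteArray();
    #endif
    }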

@Wraith2
Contributor

Wraith2 commented Sep 24, 2020

Excellent work, thanks 👍

@cheenamalhotra
Member

That's great progress! Could you fix the failing tests in NetCore, as these failures are related to the changes here?
Also, could you provide us with benchmarks for the perf improvements you observed, so we can document them here for the reviewing team?

@cheenamalhotra cheenamalhotra removed this from the 2.1.0-preview2 milestone Oct 14, 2020
Base automatically changed from master to main March 15, 2021 17:54
@Wraith2
Contributor

Wraith2 commented Apr 13, 2021

/me pokes @cmeyertons

It'd be nice to get this into the next release

@cmeyertons
Contributor Author

@Wraith2 I can carve out some time today.

I was kind of stuck on fixing the test -- I remember it being related to the stream bulk copy functionality, and I couldn't easily step through it because the test had a large amount of data unrelated to the breakage.

I will dig in some more today and post my findings.

@DavoudEshtehari
Contributor

@cmeyertons Could you resolve conflicts, please?

@Wraith2
Contributor

Wraith2 commented Apr 22, 2021

It is quite irksome that PRs are left open for so long, and yet when you do get a chance to review them, quite often the only thing asked is to rebase them on the current version. If you gave us some warning we could do it ahead of time, but trying to keep all open PRs constantly rebased on master is a big time sink for contributors.
Is there a way we could be told ahead of time that a PR is going to be reviewed, so we can make sure it's current? /cc @David-Engel

@cmeyertons
Contributor Author

I’m currently working on patching my changes on top of main and will create a new PR - there were too many hairy conflicts I couldn’t easily resolve.

@cheenamalhotra
Member

Closing, as the new PR #1048 replaces it.

Labels
💡 Enhancement Issues that are feature requests for the drivers we maintain.
Development

Successfully merging this pull request may close these issues.

SqlBulkCopy - ReadWriteColumnValueAsync creates lot of garbage
6 participants