-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add packet multiplexer. #2608
base: main
Are you sure you want to change the base?
Add packet multiplexer. #2608
Conversation
3c6a66f
to
9155cfc
Compare
@ErikEJ is there any chance you you could pick up the ci artifact (or compile your own version) of this PR and run any EF type testing you do on it? The more realistic scenarios with the capacity for network latency I can get the better I'll feel about it. @panoskj you're knowledgeable about the async machinery. Could I ask you to have a look at the code changes and give a general opinion? I know there's still some cleanup to do so no need to nitpick style/naming. |
@Wraith2 Sure, but exactly what are you asking for? Test timing, failures? The EF Core tests run against a local engine by default. |
Test failures primarily, test timing would be interesting particularly async strings. Even locally partial packets can happen if your machine is capable of running both test and server fast enough. |
@Wraith2 Could you point me to the artifact location? |
probably.... If you find then click on the item which is just "CI-SqlClient (Build NuGet) ", then from there click on the link on the right hand pane "View more details on Azure Pipelines" it'll take you to the build log and there is an underlinked "artifacts" link. the Artifacts entry in the list on the next page is the zip with the nuget packages in. |
@David-Engel Not a problem, I can just download the whole artifact.zip |
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs
Outdated
Show resolved
Hide resolved
Meh
|
Odd that it wants a strong name. Is that net8? |
.NET 9 - but odd, yes - @roji : any ideas? |
CI pipelines are currently not generating StrongName as packages are not signed. For local testing you can use delay signed, but that will need a snk file. |
So how are we supposed to test things outside the CI? This seems like an unhelpful restriction. |
This is ready for review now. It fixes a long standing performance issue and a reliability issue which users may have been facing but would be almost impossible to trace. Can this have some review priority please @David-Engel ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @Wraith2!
Love the results. Noted a few issues.
I focused on the logic being translated correctly in the existing code wrt to the bool to enum changes to ensure there were no translation bugs there. Some more review focus on the new packet/multiplexer code and the tests would be good.
I really wish the CI wasn't so broken right now. 😞 Combination of internal factors there...
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.Multiplexer.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/FunctionalTests/MultiplexerTests.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/FunctionalTests/MultiplexerTests.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/Packet.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.Multiplexer.cs
Outdated
Show resolved
Hide resolved
@Wraith2 If you merge from main, CI should be able to run cleanly now. |
Rebased. |
It looks like there is a bug somewhere that is causing intermittent failures in MoveToContentAsyncTest. 6 of the CI jobs hung on that test, causing the job to timeout (example), and 1 failed on CancelAsyncConnections. I re-ran the failed jobs and they may pass, but there's definitely an issue there. |
There was a bug that I worked through and resolved on saturday. It was stopping the multipacket test completing on netfx. This PR requires careful review and the CI coverage alone isn't going to be sufficient. It'll need some big chunks of data to be run through it to validate that it is steady under sustained load. Both in the default/current retry mode and in the experimental continue mode. The combination of running the consuming process fast and having some latency on the network is needed to make the partial packet recombination kick in. |
The same two tests are still showing intermittent hangs (MoveToContentAsyncTest) and failures (CancelAsyncConnections) in the CI on managed SNI on Windows and Linux. |
I think the tests are bad. If I run them in debug mode I will always get an exception created (not thrown) related to an ERR_DONE status code from the server. If I run the tests through the test explorer they succeed. Every time. I think the error being created is somehow hidden or ignored in the normal running of the test and you can only see if if the timing is disturbed. That would explain why it only happens on some runs. |
Can I get some code review please. I'm sure everyone is busy but this isn't a PR that I want to spend the next 6 months keeping up to date with main. |
@Wraith2 we dont intend to make you wait for 6 months. This PR is valuable to us. I will post comments as I go over the PR. |
If you'd like a teams call to discuss it just let me know. I can probably still talk people through most of the changes effectively. As you can appreciate the longer it takes the harder it will be to remember. Hopefully if continue mode can be enabled it will help performance with json support in async mode. |
also @Wraith2 JSON is a good point. It relies on PLP data, which follows varchar(max) read patterns. I have some hidden gems for perf improvement in these areas. Will need more help with allocations. I will open some discussions and tag those problem areas. |
Hidden gems? I'm intrigued. I thought I'd got almost as much performance from the driver as I could in many places. |
@Wraith2 Do you have the test names handy for the tests that are hanging in the CI? |
MoveToContentAsyncTest and CancelAsyncConnections as David mentioned earlier in the discussion. I'm very suspicious of MoveToContentAsyncTest for reasons cited above. |
Thanks. I will read through the test code. Sorry completely missed the mention of the test names |
@cheenamalhotra @Wraith2 and I met over a teams call. Thanks for bring us to speed. Following are the decisions and next steps:
Next steps:
@Wraith2 has offered to break this PR down in the above-mentioned logical PRs. Also @Wraith2 the @Wraith2 @cheenamalhotra let me know if I have faithfully captured the outcome of our discussions. Else please comment with edits/additional comments. |
Part 1 posted in #2690 |
contributes to fixing #593
Background
Asynchronous reads in this library use a retry mechanism. A context object is created a snapshot of the current state object is taken and the process of reading the data begins. As each data packet is received the packet data is appended to a list of packets, the state object is reset to the snapshot and all of the packets in the list are read from the first to the last. There can be two outcomes of the processing and specified by the return values allowed from the parsing functions that are used to read the data, success or failure. If the process fails without causing an exception the snapshot state is restored and more data is requested and it goes around again.
The complexity of this method is factorial because every packet must be processed once a new packet is received. This makes the time required to fetch a any data which spans multiple packets is factorial with N depending on the number of packets required to contain the data. A 20Mib string using the default packet size of 4K occupies over 2500 packets. Large amounts of time and computation resources are expended copying the same data over and over gain.
Parsing Changes
This PR changes all parsing functions so that they no longer return
bool
and now return aTdsOperationStatus
enum value. This is modelled after the System.Buffers.OperationStatus but omits one value that we don't need.Every parsing function which calls another now checks for
Done
indicating success and if the value is notDone
the result of the call is returned.This means that an example callsite like this:
changes to this:
This accounts for the bulk of the changes in this PR. These changes allow safe detection of the third state at the end of the packet replay, the state where more data is needed.
Snapshot Changes
With the ability to detect when a snapshot can continue functionality has been added to the snapshot mechanism which allows a continue point to be saved. When a snapshot is active and the parsing functions run to the point where all packets have been processed end at a rsult of
NeedMoreData
the current state of the stateobject can now be saved as the continue point. When as new packet arrives and the replay is about to be started instead of going all the way back to the starting point we can now pick up from the continue point and avoid a lot of reprocessing.To allow us not to need to reprocess all the packets up to the continue point the existing plpBuffer saved in the snapshot is augmented to store the total amount of data that has been written into the temporary buffer so that new data can be appended to the buffer at the correct point.
Snapshot changes and support for them account for most of the non-parsing changes in the
TdsParserStateObject
files.Reliability
With all these changes performance of async reads improves to scale similarly to synchronus operations. Reliability is vastly reduced with random seeming errors occuring regularly during testing making it impossible to finish any lengthy benchmark run. Investigating this took several months and showed an underlying incorrect assumption in the networking handling code which was difficult to encounter in previous versions of the library because speed was limited by the replay mechanism.
The
TryProcessHeader
function makes sure that enough data is available to contain the packet header. Once that data is available it extracts the spid, status and packet length from the header. Crucially it does not check that the number of bytes received from the network matches the number of bytes that are in the packet. The code trusts that the buffer that we passed to the sni layer now contains the number of bytes reported by the packet header. This is not always true. This is observable in managed and native sni modes.In situations where 1) there is some latency 2) the code is able to run very fast (meaning not debug mode), 3) enough data is being trasferred that a rare event can happen, it is possible to observe events where a part of a packet is received which contains the header data but not the complete rest of the packet. In this situation the existing code can react in many ways depending on the exactl location of the end of the correct data in the packet. Most of them produce an error condition but it is possible that some zero data could be copied into the pllBuffer. On the next packet recepit the remainder of the partial packet will be recieved, it will not contain a correctly formed packet header and will cause an exception.
Partial Packet Changes
Code has been added which detects the receipt of a partial packet and stores the data. The partial packet is set aside and another network read is initiated. Once the second read has completed the newly received data has to be combined with the existing partial packet. It is possible that the newly received data will contain the remainder of the partial packet data that we already hold and if there is more data being sent it can contain the start of another packet. To deal with all the complexity involved in sorting out the packets i have added a packet multiplexing function which reconstructs correctly formed packets from the incoming data and partial packet that are held and then releases those packets for the library to deal with normally.
Once a partial packet has occurred it is not possible to get back into sync with full packets unless another partial packet read of exactly the remainder number of bytes occurs. This is very unlikely. Sync will only return once the existing response stream has completed.
the multiplexing function has full test coverage by being split into a separate file which is then included in the functional tests with a test harness allowing fake packets sequences to be constructed processed and the results validated.
Outcome
All of these changes result in adding the ability to use the use of an appcontext switch to turn on async with continue mode. The library defaults to async-retry mode which is the current method of operation. If we can validate that the changes are reliable and mergable in retry mode then we can turn on continue mode and validate that functionality.
The changes applied to net(core) and netfx builds. I originally stated that i did not want to do this because I thought that we should merge the codebases and implement the changes only once. Realistically the rate of merging of the codebase in the time since I made that decisions means that this bug would not be fixed for more years. So I've changed my mind and implemented everything again on the current codebase.
Performance changes are inline with previous expectations:
Feedback
If you think that there is a better way to do this then please implement it and open a PR allowing your method to be tested. In all the time that this issue has been open no-one else has come up with any other method of fixing it. If you don't like how this works do better please don't just complain and nitpick.