Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using SqlColumnEncryptionAzureKeyVaultProvider with an expired client secret, bizarre error messages and symptoms result #2683

Closed
jehhynes opened this issue Jul 17, 2024 · 19 comments · Fixed by #2825
Assignees
Labels
✔️ Repro Available Issues that are reproducible with repro provided.

Comments

@jehhynes
Copy link

jehhynes commented Jul 17, 2024

When using SqlColumnEncryptionAzureKeyVaultProvider with an expired client secret, the following error is returned:

SqlException: New request is not allowed to start because it should come with valid transaction descriptor.

This error is very confusing and has nothing to do with the actual problem. After seeing this error, we then see the following error on all queries even when column encryption is not enabled on the command:

IndexOutOfRangeException: SOME_COLUMN_NAME
    at Microsoft.Data.SqlClient.SqlDataReader.GetOrdinal(string name)

(Note that the query actually does select SOME_COLUMN_NAME)

Full stack trace

SqlException: New request is not allowed to start because it should come with valid transaction descriptor.
Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, bool breakConnection, Action<Action> wrapCloseInAction)
Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, bool breakConnection, Action<Action> wrapCloseInAction)
Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, bool callerHasConnectionLock, bool asyncClose)
Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, out bool dataReady)
Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, string resetOptionsString, bool isInternal, bool forDescribeParameterEncryption, bool shouldCacheForAlwaysEncrypted)
Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, bool returnStream, bool isAsync, int timeout, out Task task, bool asyncWrite, bool inRetry, SqlDataReader ds, bool describeParameterEncryptionRequest)
Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, bool returnStream, TaskCompletionSource<object> completion, int timeout, out Task task, out bool usedCache, bool asyncWrite, bool inRetry, string method)
Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
NHibernate.AdoNet.AbstractBatcher.DoExecuteReader(DbCommand cmd)
NHibernate.Loader.Loader.GetResultSet(DbCommand st, QueryParameters queryParameters, ISessionImplementor session, IResultTransformer forcedResultTransformer)
NHibernate.Loader.Loader.DoQuery(ISessionImplementor session, QueryParameters queryParameters, bool returnProxies, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)
NHibernate.Loader.Loader.DoQueryAndInitializeNonLazyCollections(ISessionImplementor session, QueryParameters queryParameters, bool returnProxies, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)
NHibernate.Loader.Loader.DoList(ISessionImplementor session, QueryParameters queryParameters, IResultTransformer forcedResultTransformer, QueryCacheResultBuilder queryCacheResultBuilder)

Expected behavior

I expect a clear error message to be returned which states that column encryption was not possible due to the expired client secret
I do not expect it to "pollute" the connection pool and make all requests to the same database fail with bizarre error messages.

Further technical details

Microsoft.Data.SqlClient version: 5.1.4
.NET target: .NET 8.0
SQL Server version: 2022
Operating system: Windows 11

Additional context

<PackageReference Include="Microsoft.Data.SqlClient.AlwaysEncrypted.AzureKeyVaultProvider" Version="3.0.0" />
NHibernate 5.5.2

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 17, 2024

Have you tried with 5.2.1 ?

@jehhynes
Copy link
Author

SqlClient 5.2.1 does the same thing.

@JRahnama JRahnama added the 🆕 Triage Needed For new issues, not triaged yet. label Jul 19, 2024
@kf-gonzalez2 kf-gonzalez2 removed the 🆕 Triage Needed For new issues, not triaged yet. label Jul 23, 2024
@mdaigle
Copy link
Contributor

mdaigle commented Aug 15, 2024

Hi @jehhynes thanks for filing this issue. I'm working on reproducing it. In the meantime if you already have a minimal repro you can share that would be much appreciated.

@jehhynes
Copy link
Author

Hi @mdaigle I could create a minimal repro but it's dependent on a specific state in an external system (i.e. Expired Azure client secret) which I would be unable to share.

@mdaigle
Copy link
Contributor

mdaigle commented Aug 15, 2024

Understood. Even with that limitation it will be helpful to see.

@jehhynes
Copy link
Author

I haven't been able to reproduce this with straight SqlClient usage. I'm wondering if the issue has something to do with the way that NHibernate is interacting with SqlClient.

@arellegue
Copy link
Contributor

@jehhynes Could you share the repro, please?

@jehhynes
Copy link
Author

jehhynes commented Aug 20, 2024

This repro's consistently in my application, but I have not been able to figure out how to reproduce an identical issue outside of my application. I have however reproduced a similar issue. It seems to be caused by a more complex combination of factors than I originally thought.

The error is occurring in an ASP.NET Core app, and I think it may have to do with multithreading. Upon further investigation, this error is initially received when rendering a view (using Html.PartialAsync):

Failed to decrypt column 'ssn59_479_'. Failed to decrypt a column encryption key using key store provider: 'AZURE_KEY_VAULT'. Verify the properties of the column encryption key and its column master key in your database. The last 10 bytes of the encrypted column encryption key are: 'C1-BA-F1-E9-5C-97-6D-39-D7-F6'. ClientSecretCredential authentication failed: A configuration issue is preventing authentication - check the error message from the server for details. You can modify the configuration in the application registration portal. See https://aka.ms/msal-net-invalid-client for details. Original exception: AADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app '6859ea32-a175-4d32-9511-6f0d40637e1a'. Trace ID: 9989e0a9-5487-4530-9934-286a2f597c00 Correlation ID: ca4e081b-04d1-47bc-b5d3-4514f6e2a3fa Timestamp: 2024-08-20 19:33:12Z

Then another database query occurs from an IResultFilter attribute (in OnResultExecuted), which triggers this error:

New request is not allowed to start because it should come with valid transaction descriptor.

It seems the first error (which is the expected error) is hidden (presumably due to AspNetCore's multithreaded behavior) and instead the second error is show (which is not expected).

Based on these theories, I attempted to create a minimal repro using AspNetCore (attached). Although I wasn't able to get the exact same exception, I did encounter some strange/unexpected behavior/errors so I think it's starting to touch on this issue and may provide you what you need at least to begin troubleshooting.

The repro project

  • There's a connection string hardcoded into NHibernateConfigurationBuilder that will need to be set
  • It expects the following table: COR_Entity (Id bigint, Ssn nvarchar(20))
  • The Ssn column should be encrypted
  • There should be at least one row in the table with a non-null Ssn.
  • AzureTenantId and AzureClientId should be specified in the ColumnEncryption class
  • DisplaySsn.cshtml triggers the initial (expected) error, which you can see if you comment out OnResultExecutedAttribute
  • In OnResultExecutedAttribute, I observed two unexpected errors, as described in comments

AlwaysEncryptedRepro.zip

Thanks!

@jehhynes
Copy link
Author

jehhynes commented Aug 21, 2024

Update:

I discovered that by adding a second attribute implementing IResultFilter.OnResultExecuted, we uncover an additional sequence of bizarre exceptions that you can observe by refreshing the page repeatedly:

image

image

image

image

The updated repro project:
AlwaysEncryptedRepro.zip

@jehhynes
Copy link
Author

Update 2

So I further discovered that by making the ISessionFactory static (instead of rebuilding it with every request) I get a different sequence of errors:

image

image

image

This sequence is similar to what I observed in my application - see my StackOverflow question here:
https://stackoverflow.com/questions/78759572/sqlexception-new-request-is-not-allowed-to-start-because-it-should-come-with-va

The updated repro project:

AlwaysEncryptedRepro.zip

@mdaigle
Copy link
Contributor

mdaigle commented Aug 22, 2024

Thanks for the repros @jehhynes. I've been working through them and have a couple avenues for investigation.

Just to double check, are you running each of these repros with pooling enabled?
And same question for the original issue in your application, is pooling enabled there?

If you can, can you tell me if the issue still repros when you disable pooling?

Thanks!

@jehhynes
Copy link
Author

If I disable pooling, the bizarre errors go away.

@arellegue arellegue added the ✔️ Repro Available Issues that are reproducible with repro provided. label Aug 22, 2024
@mdaigle mdaigle self-assigned this Aug 22, 2024
@mdaigle
Copy link
Contributor

mdaigle commented Aug 22, 2024

This is a bug that is leaving connections in an inconsistent state with unread bytes still hanging on the wire. When pooling is enabled, the inconsistent connections are recycled back through to your application and result in undefined error behavior for new connections with the same connection string. The unread bytes off the wire from the last command interfere with the results for the new command.

A fix was merged to main last month and it looks like it's slated for the 6.0 release: #2618

@David-Engel
Copy link
Contributor

Fixes for this have been ported to the 5.1.6 (#2818) and 5.2.2 (#2817) hotfixes that are coming out soon.

@dauinsight
Copy link
Contributor

dauinsight commented Aug 27, 2024

Hotfixes 5.1.6 and 5.22 have been released!

@jehhynes
Copy link
Author

@mdaigle @David-Engel
I installed 5.2.2 on the repro project and re-tested. It now gives the following exception after waiting 30 seconds and the command timing out:

SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

I don't think this is a correct fix - the error message I expect is the Failed to decrypt column message

@dauinsight dauinsight reopened this Aug 28, 2024
@DavoudEshtehari DavoudEshtehari added the 🆕 Triage Needed For new issues, not triaged yet. label Aug 28, 2024
@mdaigle
Copy link
Contributor

mdaigle commented Aug 28, 2024

What I'm seeing is that DrainData() is double counting the encrypted column.

Here's the sequence of events:

  1. column header read
  2. sharedState._nextColumnHeaderToRead++ (incremented from 0 to 1)
  3. column data read
  4. column data decryption (fails)
  5. data drained

So at the time we drain the data, we've already read all of the bytes for the encrypted column off of the wire or out of the buffer, but we've not incremented the _nextColumnDataToRead value from 0 to 1. This value is only incremented after we exit the scope of TdsParser.TryReadSqlValue(). DrainData() compares _nextColumnDataToRead to _nextColumnHeaderToRead to determine if we stopped part way through reading a column. If _nextColumnDataToRead < _nextColumnHeaderToRead, it assumes that the column data still needs to be read off the wire/out of the buffer. But we did already read out the column data. So when DrainData() tries to flush enough data for the length of the column, it will fail. Either this was the last column and there's no data available or we get into an inconsistent state where we read some of the next column but treat it as if it's this column.

If there's not enough data available, we wait on the socket for more data which will never arrive. This leads to a timeout, which is thrown in place of the column decryption error.

@mdaigle
Copy link
Contributor

mdaigle commented Aug 29, 2024

I have a draft PR here (#2825) that, in local testing, fixes the issue. Using this change with the repro produces the expected error.
I'm still looking into the best way to add tests for this to prevent regressions in the future.

@JRahnama JRahnama removed the 🆕 Triage Needed For new issues, not triaged yet. label Sep 3, 2024
@JRahnama JRahnama linked a pull request Sep 9, 2024 that will close this issue
@kf-gonzalez2
Copy link

Closing issue as PR has been merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
✔️ Repro Available Issues that are reproducible with repro provided.
Projects
Development

Successfully merging a pull request may close this issue.

9 participants