-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DynamoDb .NET threadpool starvation #1476
Comments
Sorry we haven't responded to this issue yet. Due to re:Invent we have been doing a bit of catchup. Thanks so much for going deep on this issue and providing so much detail. I agree with your conclusion and proposed solutions. Let me discuss with the rest of the team to figure out how we can this work done. |
I have been trying to remove the sync call to DescribeTable for the last couple days which is not going well. The problem is this high level abstraction was first written when when we only supported synchronous methods in the SDK and it was designed with that knowledge. The problem is so many of the synchronous methods can trigger a lazy load to DescribeTable. I'm having to create a lot of duplicate async methods but I can't remove the sync methods as that would be a breaking change. This is causing a lot of code duplication, which is extra complicated due to the AWSSDK.DynamoDBv2 still supporting .NET 3.5 which doesn't know about Tasks, and you would have to know to opt into the new methods that you didn't realize could trigger a call to DescribeTable for example I'm wondering if instead of trying to put the band aid here if we should bite the bullet and be a bit more radical. We have been debating for awhile about pulling the Document and DataModel abstractions in their own separate repo and NuGet package. In the process of doing that we could do some more breaking changes to get the behavior correct from the start. For example in the new version I would like to drop .NET 3.5 Framework and support async only to so we can avoid getting these types of threads. Once we have the new package I think we would mark the Amazon.DynamoDBv2.DocumentModel.Table and Amazon.DynamoDBv2.DataModel.Context in AWSSDK.DynamoDBv2 as obsolete and whenever we do a major version 4 of the SDK we would drop them from the package. Moving these abstractions out of the SDK repo should also make working on these abstractions much easier then the SDK repo which is very large and is hard to make large changes as it has to ship basically every day. To alleviate this issue would a helper method added to the Context to preload the cache for all of the types help? You could call it a the start of the application and then the sync DescribeTable call should not be called when the app is in full stress mode. I would really appreciate the community thoughts here. What does everybody prefer, continuing on the path of patching the existing version or start spinning up a project to create new version of these abstractions in a new repo. |
I was conflicted about which approach I would personally go with until I broke things down a bit. Regardless of whether we try to duplicate methods to allow async within the same assembly, or we pull the base models out and create a newer, modern dynamoDb package - the required .net 3.5 compatibility will require two implementations of similar code 'somewhere'. This complicates support for the dynamo package - but this is simply not possible to avoid in order to support async operation correctly, while also supporting non async. Having async and legacy methods in the same assembly creates technical debt for when the old non-async code is eventually made obsolete. The only benefit is that required changes to the library can have both changes made in the same project. The band-aid approach also bugs me in the same way you describe - it won't assist users in upgrading their implementation as their existing code will still continue to work, and they won't know to use the async versions. By contrast, a new async-aware package would force a major code update on how people are using dynamoDb in their solutions In the short term, double-checked locking will help to alleviate build-up on the locks, and a command to fill those caches in serial will fix the issue for us. |
Until an inbuilt workaround is in, for those who come across this thread and just want a quick win, use |
I've encountered this issue in my .NET lambdas. Occassionally (~2-4) times a month, the lambda locks on the first call to a DynamoDB table. In my case, I'm not using the DynamoDBContext. I'm using the DynamoDBClient. After adding detailed logging, we've found that the lock is invariably on the first call using the DynamoDBClient. |
I believe my team is also hitting this |
We've done some more testing and are definitely hitting it :( |
We are also hitting this issue and ThreadPool.SetMinThreads does not look as a stable solution... |
We have exactly the same issue and it took us a long time to figure out the exact cause, as our production applications randomly stopped working, and we never thought that this could be caused by the DynamoDB client. Fortunately, we found this issue and were baffled as to how this could even be the case. We also investigated with dotnet counters and could clearly see a deadlock situation and can easily reproduce the issue by simply setting up a .NET API project, use DynamoDB and generate a lot of requests against it with a load testing tool. After looking at the code and reading this and related issues, we understand that there is a lot of legacy code and the overall structure doesn't make sense especially when working fully async. For example, why do i need get table information via a synchronous network request for a simple model conversion in the .NET SDK. To our understanding all relvant informations can be retrieved e.g. from attributes. We now have two options: Either we write our own functions that work directly with the standard .NET HTTP client and use some utility functions to handle AWS authentication and make requests directly to DynamoDB (which also means we have to review, write & test parts of the conversion logic, as it contains a lot of get table calls that are locked and cause this problem), or we wait for an official solution, which of course would be the much better option. From a paying customer's point of view, we would greatly appreciate an official statement to see if it's worth waiting or abandoning the official SDK for DynamoDB. We also would suggest treating this as a high priority issue, as it is simple to reproduce and leads to very hard to detect application deadlocks and this is the last place I would start looking when I have an API with many external services. We completly understand the issues involved with legancy code, backwards compatibility and so on, but this is not a small service, used by a lot of people and enough time have passed since the issue was reported. In the meantime, we can work around the problem by calling the following when we start our application. This will populate all internal caches with the required information and a deadlock can be prevented. We are not 100% sure if this solves the issue completely and could't invest more time to figure out all problem areas, but so far it seems to have fixed our issues. Hopefully this helps someone.
However, such a dirty fix is not very robust in the long run, as it only requires the absence of a table name to be initialized, and then random deadlocks occur, which are very hard to find. (The reason why we call it multiple times is because the cache uses different cache keys, and we wanted to be sure to get all of them) |
We ran into this ourselves and it was deadlocking our servers on the daily, and our solution was similar to @vindor's. Effectively locally cache the table descriptions for the lifetime of the program on boot:
This solution has worked for us for about 8 months now with all our DynamoDB .NET deadlocks resolved to our knowledge. One thing that got us was that the documentation for DescribeTables basically being absolutely buried. We're apparently expected to only call this once and cache the response: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItemsDocumentClasses.html
At the very least, this info could be added to the comment header in the SDK to keep people from making the same mistake. It took us many developer hours to hunt down and solve this issue. (Apologies for not posting this earlier and saving some readers a lot of time, I think we lost this thread a bit once we got deep into the issue and never came back to update. But hey, at least someone else has run the solution for a long time and can say pretty definitively that it works as a workaround) |
Hello all. As @normj discussed in #1476 (comment) above, we've long thought of this as something we would address in a new major version and/or by extracting the DynamoDB high-level models out to a new, separate package. This would give us an opportunity to make breaking changes, including removing the sync-over-async codepaths. We've also been thinking about if we can remove the internal One idea we've been exploring could address the case where:
Then could the SDK use the information provided in the .NET attributes to fill its cache about the key and index structure and attribute naming to replace the We have a prototype of this in #3001 that is working for a happy path, but haven't yet tested all the edge cases and failure modes for when the table's structure is not fully described. We're looking for early feedback on:
|
We decided to opt for the document model over the object persistence model largely for flexibility. Being able to update our schema from release to release as we go on demand is extremely useful, and would be quite difficult to do using object persistence model--or at least, we'd need a full copy of the object definition in the code for every schema version to map upgrades, whereas with the document model we can define the upgrade step only as a small piece of code mapping old fields to new fields, etc. Any fix that only included the object persistence model would be of limited value to us. Others may have a different opinion however. |
Another update on our investigation to provide options to avoid the Document model// Instead of the current state, which relies on DescribeTable
Table employeeTable = Table.LoadTable(_client, "EmployeeTable");
// One can do this instead
Table employeeTable = new TableBuilder(_client, "EmployeeTable")
.AddHashKey("Name", DynamoDBEntryType.String)
.AddRangeKey("Age", DynamoDBEntryType.Numeric)
.AddGlobalSecondaryIndex("GlobalIndex", "Company", DynamoDBEntryType.String, "Score", DynamoDBEntryType.Numeric)
.AddLocalSecondaryIndex("LocalIndex", "Manager", DynamoDBEntryType.String)
.Build(); Object-persistence modelBecause the object-persistence model uses the document model under the hood, this same builder can be used to populate the cache inside the // One would initialize the Table via the new builder
var employeeTable = new TableBuilder(_client, "EmployeeTable")
.AddHashKey("Name", DynamoDBEntryType.String)
.AddRangeKey("Age", DynamoDBEntryType.Numeric)
.AddGlobalSecondaryIndex("GlobalIndex", "Company", DynamoDBEntryType.String,
"Score", DynamoDBEntryType.Numeric)
.AddLocalSecondaryIndex("LocalIndex", "Manager", DynamoDBEntryType.String)
.Build();
// Then register that with the DynamoDBContext object, which populates its cache
var context = new DynamoDBContext(_client);
context.RegisterTableDefinition(employeeTable);
// Now the user can use context like normal with a class like the following
var employee = context.Load<Employee>("Alex");
// We still need to know the mapping from class to table,
// which can alternatively be in app.config or AWSConfigsDynamoDB
[DynamoDBTable("EmployeeTable")
public class Employee
{
// But do not need the index-related attributes here as long as the attribute names match
public string Name { get; set; }
public int Age { get; set; }
...
} I have a prototype (that I have yet to push up) passing our current test suite, but we'll have to look at edge cases and failure modes for when there's a mismatch between the provided and actual table structure. @Talarian (thanks for the reply) and others - does this seem viable for your usage of the document model? It's not quite as flexible as the current behavior since the key and index structure that's defined in code would need to match the table it's connecting to, but if your keys/indices are fairly static this may be a path to avoid the issues with the internal |
We released AWSSDK.DynamoDBv2 version 3.7.201 yesterday which introduces a new There is a new option Object-persistence model
Document model
We're still working on the two options described above to eliminate the sync-over-async call to fill the cache if you are able to define the table key/index structure in advance. |
We've released new ways to initialize the high-level DynamoDB clients in AWSSDK.DynamoDBv2 3.7.203. You can now provide the table's key and index structure in code, which allows the SDK to skip the problematic It's important to describe your table’s key and indexes accurately via these new methods otherwise you may see exceptions when attempting to perform DynamoDB operations. Also the We published a blog post today that details how to use the new initialization patterns, but pasting here as well: Document Model and TableBuilderInstead of
Object Persistence Model and DisableFetchingTableMetadataIf you fully describe the table's key and index structure via the existing attributes, you can now set
Object Persistence Model and TableBuilderYou can also combine the above techniques, and register the
These do require code changes to opt in, but we hope it's worth it. Let us know if you see any issues with the new patterns. |
Hello @KereyWatters, Going forward you could use the improved Initialization patterns for DynamoDB. We'll close this as completed. Regards, |
Comments on closed issues are hard for our team to see. |
When describing tables, there are a few levels of locks which force single-threaded operation
aws-sdk-net/sdk/src/Core/Amazon.Runtime/Internal/Util/SdkCache.cs
Line 421 in b691e46
aws-sdk-net/sdk/src/Services/DynamoDBv2/Custom/DataModel/ContextInternal.cs
Line 125 in b691e46
aws-sdk-net/sdk/src/Services/DynamoDBv2/Custom/DataModel/InternalModel.cs
Line 608 in b691e46
And inside those locks we synchronously call an async method
aws-sdk-net/sdk/src/Core/Amazon.Runtime/Pipeline/HttpHandler/_mobile/HttpRequestMessageFactory.cs
Line 499 in 189abfb
This is a problem as it creates a new task which may not be able to be run by the ThreadPool for quite some time. Also, because the locks in these three places do not pre-check their condition before locking - they make the problem worse by consuming more threads than necessary.
Expected Behavior
We run 100 requests, and after the initial describe table call, the cache is loaded, and the program quickly completes the 100 requests
Current Behavior
Cold start - no caches are filled.
We run 100 dynamodb LoadAsync requests, creating 100 tasks.
The threadpool begins processing the tasks, starting with threads equal to the number of cores we have to work with, say, 4.
The first thread gets through the locks and begins a synchronous call of an async http method, creating our 101th task.
Deadlock. All 4 threads are either stuck at a lock, or waiting for the 101st task to continue
The threadpool adds one more thread to the pool each second to help handle load.
97 (101-4) seconds go by, and the 101st thread picks up the http response task, resolving the deadlock
All the threads enter the lock, use the cache and complete shortly after.
Possible Solution
To prevent build-up on the locks (and greatly improve performance) all three locks above should be using the double-checked locking pattern to remove unnecessary calls to Monitor.Enter.
https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C#
Table.DescribeTable should be made asynchronous, calling DDBClient.DescribeTableAsync. Locks should be replaced with SemaphoreSlims as they are awaitable, and async should be brought all the way up above the locking code, to prevent blocking within locked sections: https://stackoverflow.com/questions/7612602/why-cant-i-use-the-await-operator-within-the-body-of-a-lock-statement/7612714#7612714
Steps to Reproduce (for bugs)
The above snippet is overly simple and doesn't show what is happening with the threadpools like this reproduction solution does:
https://drive.google.com/open?id=1UXItp4aRlGLBMjbuYm94WU_2TUoT4dKf
Context
This issue caused a production incident because the ECS container locked up under initial load and couldn't reply to health-checks - causing the ELB to kill the container and repeat.
Your Environment
.NET Core Info
The text was updated successfully, but these errors were encountered: