Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DynamoDb .NET threadpool starvation #1476

Closed
KereyWatters opened this issue Dec 5, 2019 · 17 comments
Closed

DynamoDb .NET threadpool starvation #1476

KereyWatters opened this issue Dec 5, 2019 · 17 comments
Labels
bug This issue is a bug. dynamodb p2 This is a standard priority issue queued

Comments

@KereyWatters
Copy link

KereyWatters commented Dec 5, 2019

When describing tables, there are a few levels of locks which force single-threaded operation

And inside those locks we synchronously call an async method

return this.GetResponseAsync(System.Threading.CancellationToken.None).Result;

This is a problem as it creates a new task which may not be able to be run by the ThreadPool for quite some time. Also, because the locks in these three places do not pre-check their condition before locking - they make the problem worse by consuming more threads than necessary.

Expected Behavior

We run 100 requests, and after the initial describe table call, the cache is loaded, and the program quickly completes the 100 requests

Current Behavior

Cold start - no caches are filled.
We run 100 dynamodb LoadAsync requests, creating 100 tasks.
The threadpool begins processing the tasks, starting with threads equal to the number of cores we have to work with, say, 4.
The first thread gets through the locks and begins a synchronous call of an async http method, creating our 101th task.
Deadlock. All 4 threads are either stuck at a lock, or waiting for the 101st task to continue
The threadpool adds one more thread to the pool each second to help handle load.
97 (101-4) seconds go by, and the 101st thread picks up the http response task, resolving the deadlock
All the threads enter the lock, use the cache and complete shortly after.

Possible Solution

To prevent build-up on the locks (and greatly improve performance) all three locks above should be using the double-checked locking pattern to remove unnecessary calls to Monitor.Enter.
https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C#

Table.DescribeTable should be made asynchronous, calling DDBClient.DescribeTableAsync. Locks should be replaced with SemaphoreSlims as they are awaitable, and async should be brought all the way up above the locking code, to prevent blocking within locked sections: https://stackoverflow.com/questions/7612602/why-cant-i-use-the-await-operator-within-the-body-of-a-lock-statement/7612714#7612714

Steps to Reproduce (for bugs)

[DynamoDBTable("dynamo_table")]
class Entry
{
    [DynamoDBHashKey]
    public int id { get; set; }
}

static void Main(string[] args)
{
    var context = new DynamoDBContext(new AmazonDynamoDBClient());
    var tasks = new List<Task<Entry>>();
    for (var i = 0; i < 30; i++)
        tasks.Add(Task.Run(() => context.LoadAsync<Entry>(0)));
    
    Console.WriteLine("Waiting for tasks to complete");
    try
    {
        Task.WaitAll(tasks.ToArray());
    } catch (Exception){}
}

The above snippet is overly simple and doesn't show what is happening with the threadpools like this reproduction solution does:
https://drive.google.com/open?id=1UXItp4aRlGLBMjbuYm94WU_2TUoT4dKf

Context

This issue caused a production incident because the ECS container locked up under initial load and couldn't reply to health-checks - causing the ELB to kill the container and repeat.

Your Environment

  • AWSSDK.Core version used: 3.3.104.2
  • Operating System and version: Windows 10 Pro v10.0.19033
  • Visual Studio version: VS2019 v16.1.3
  • Targeted .NET platform: .Net core 2.2

.NET Core Info

  • .NET Core version used for development: 2.2
  • .NET Core version installed in the environment where application runs: 2.2.402
@normj
Copy link
Member

normj commented Dec 14, 2019

Sorry we haven't responded to this issue yet. Due to re:Invent we have been doing a bit of catchup. Thanks so much for going deep on this issue and providing so much detail. I agree with your conclusion and proposed solutions. Let me discuss with the rest of the team to figure out how we can this work done.

@klaytaybai klaytaybai added the bug This issue is a bug. label Dec 16, 2019
@normj
Copy link
Member

normj commented Dec 18, 2019

I have been trying to remove the sync call to DescribeTable for the last couple days which is not going well. The problem is this high level abstraction was first written when when we only supported synchronous methods in the SDK and it was designed with that knowledge. The problem is so many of the synchronous methods can trigger a lazy load to DescribeTable. I'm having to create a lot of duplicate async methods but I can't remove the sync methods as that would be a breaking change. This is causing a lot of code duplication, which is extra complicated due to the AWSSDK.DynamoDBv2 still supporting .NET 3.5 which doesn't know about Tasks, and you would have to know to opt into the new methods that you didn't realize could trigger a call to DescribeTable for example FromDocuments.

I'm wondering if instead of trying to put the band aid here if we should bite the bullet and be a bit more radical. We have been debating for awhile about pulling the Document and DataModel abstractions in their own separate repo and NuGet package. In the process of doing that we could do some more breaking changes to get the behavior correct from the start. For example in the new version I would like to drop .NET 3.5 Framework and support async only to so we can avoid getting these types of threads. Once we have the new package I think we would mark the Amazon.DynamoDBv2.DocumentModel.Table and Amazon.DynamoDBv2.DataModel.Context in AWSSDK.DynamoDBv2 as obsolete and whenever we do a major version 4 of the SDK we would drop them from the package.

Moving these abstractions out of the SDK repo should also make working on these abstractions much easier then the SDK repo which is very large and is hard to make large changes as it has to ship basically every day.

To alleviate this issue would a helper method added to the Context to preload the cache for all of the types help? You could call it a the start of the application and then the sync DescribeTable call should not be called when the app is in full stress mode.

I would really appreciate the community thoughts here. What does everybody prefer, continuing on the path of patching the existing version or start spinning up a project to create new version of these abstractions in a new repo.

@costleya

@KereyWatters
Copy link
Author

KereyWatters commented Jan 8, 2020

I was conflicted about which approach I would personally go with until I broke things down a bit.

Regardless of whether we try to duplicate methods to allow async within the same assembly, or we pull the base models out and create a newer, modern dynamoDb package - the required .net 3.5 compatibility will require two implementations of similar code 'somewhere'. This complicates support for the dynamo package - but this is simply not possible to avoid in order to support async operation correctly, while also supporting non async.

Having async and legacy methods in the same assembly creates technical debt for when the old non-async code is eventually made obsolete. The only benefit is that required changes to the library can have both changes made in the same project.

The band-aid approach also bugs me in the same way you describe - it won't assist users in upgrading their implementation as their existing code will still continue to work, and they won't know to use the async versions. By contrast, a new async-aware package would force a major code update on how people are using dynamoDb in their solutions

In the short term, double-checked locking will help to alleviate build-up on the locks, and a command to fill those caches in serial will fix the issue for us.

@KereyWatters
Copy link
Author

Until an inbuilt workaround is in, for those who come across this thread and just want a quick win, use
ThreadPool.SetMinThreads
For us we are working around this issue by starting with 32 worker threads which gets us through the initial few requests enough for the cache to fill. This is fine for our application but may not work for all applications, so, use due diligence.

@johniwasz
Copy link

I've encountered this issue in my .NET lambdas. Occassionally (~2-4) times a month, the lambda locks on the first call to a DynamoDB table. In my case, I'm not using the DynamoDBContext. I'm using the DynamoDBClient. After adding detailed logging, we've found that the lock is invariably on the first call using the DynamoDBClient.

@xero-timothy
Copy link

I believe my team is also hitting this

@xero-timothy
Copy link

We've done some more testing and are definitely hitting it :(

@Funbit
Copy link

Funbit commented Jan 26, 2023

We are also hitting this issue and ThreadPool.SetMinThreads does not look as a stable solution...
Is there any progress on this issue?

@vindor
Copy link

vindor commented May 23, 2023

We have exactly the same issue and it took us a long time to figure out the exact cause, as our production applications randomly stopped working, and we never thought that this could be caused by the DynamoDB client. Fortunately, we found this issue and were baffled as to how this could even be the case. We also investigated with dotnet counters and could clearly see a deadlock situation and can easily reproduce the issue by simply setting up a .NET API project, use DynamoDB and generate a lot of requests against it with a load testing tool.

After looking at the code and reading this and related issues, we understand that there is a lot of legacy code and the overall structure doesn't make sense especially when working fully async. For example, why do i need get table information via a synchronous network request for a simple model conversion in the .NET SDK. To our understanding all relvant informations can be retrieved e.g. from attributes.

We now have two options: Either we write our own functions that work directly with the standard .NET HTTP client and use some utility functions to handle AWS authentication and make requests directly to DynamoDB (which also means we have to review, write & test parts of the conversion logic, as it contains a lot of get table calls that are locked and cause this problem), or we wait for an official solution, which of course would be the much better option.

From a paying customer's point of view, we would greatly appreciate an official statement to see if it's worth waiting or abandoning the official SDK for DynamoDB. We also would suggest treating this as a high priority issue, as it is simple to reproduce and leads to very hard to detect application deadlocks and this is the last place I would start looking when I have an API with many external services. We completly understand the issues involved with legancy code, backwards compatibility and so on, but this is not a small service, used by a lot of people and enough time have passed since the issue was reported.

In the meantime, we can work around the problem by calling the following when we start our application. This will populate all internal caches with the required information and a deadlock can be prevented. We are not 100% sure if this solves the issue completely and could't invest more time to figure out all problem areas, but so far it seems to have fixed our issues. Hopefully this helps someone.

public static void InitializeDynamoDb<DbEntryTypeT>(this IDynamoDBContext dynamoDbContext, DynamoDBOperationConfig config)
        where DbEntryTypeT : IDynamoDBEntry
    {
        var type = typeof(DbEntryTypeT);
        DbEntryTypeT instance = (DbEntryTypeT)Activator.CreateInstance(type);
        _ = dynamoDbContext.ToDocument(instance, config);
        _ = dynamoDbContext.ToDocument(instance);
        _ = dynamoDbContext.GetTargetTable<DbEntryTypeT>(config);
    }

However, such a dirty fix is not very robust in the long run, as it only requires the absence of a table name to be initialized, and then random deadlocks occur, which are very hard to find. (The reason why we call it multiple times is because the cache uses different cache keys, and we wanted to be sure to get all of them)

@Talarian
Copy link

Talarian commented May 23, 2023

We ran into this ourselves and it was deadlocking our servers on the daily, and our solution was similar to @vindor's. Effectively locally cache the table descriptions for the lifetime of the program on boot:

// Get all tables from the database
var tables = await ExecuteDBOperation(
	ts => DynamoClient.ListTablesAsync( ts.Token ), 
	new DBOperationTrace( nameof( DynamoClient.ListTablesAsync ), "load-tables", "all", cmn, cfp ) );

// Initialize tables in the cache
foreach ( var tableName in tables.TableNames )
{
	var table = Table.LoadTable( DynamoClient, tableName );
	foreach ( var kvp in m_tableCache )
		if ( kvp.Value.TableName == tableName )
			kvp.Value.SetTable( table );
}

This solution has worked for us for about 8 months now with all our DynamoDB .NET deadlocks resolved to our knowledge.

One thing that got us was that the documentation for DescribeTables basically being absolutely buried. We're apparently expected to only call this once and cache the response: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItemsDocumentClasses.html

In general, you use the LoadTable method once at the beginning of your application because it makes a DescribeTable call that adds to the round trip to DynamoDB.

At the very least, this info could be added to the comment header in the SDK to keep people from making the same mistake. It took us many developer hours to hunt down and solve this issue.

(Apologies for not posting this earlier and saving some readers a lot of time, I think we lost this thread a bit once we got deep into the issue and never came back to update. But hey, at least someone else has run the solution for a long time and can say pretty definitively that it works as a workaround)

@ashovlin
Copy link
Member

ashovlin commented Jul 13, 2023

Hello all. As @normj discussed in #1476 (comment) above, we've long thought of this as something we would address in a new major version and/or by extracting the DynamoDB high-level models out to a new, separate package. This would give us an opportunity to make breaking changes, including removing the sync-over-async codepaths.

We've also been thinking about if we can remove the internal DescribeTable call altogether. The threadpool issue aside, this would reduce these programming model's latency for the first DynamoDB call, and potentially avoid hitting the requests per second quota on control plane API requests.

One idea we've been exploring could address the case where:

  1. You're using object persistence model, as opposed to the document model.
  2. You have fully described the table's key, index structure, and attribute mappings using the DynamoDB attributes.
  3. You opt into this new behavior via something like DynamoDBContextConfig.SkipDescribeTableLoad.

Then could the SDK use the information provided in the .NET attributes to fill its cache about the key and index structure and attribute naming to replace the DescribeTable call?

We have a prototype of this in #3001 that is working for a happy path, but haven't yet tested all the edge cases and failure modes for when the table's structure is not fully described.


We're looking for early feedback on:

  1. Are you hitting this issue from the object persistence model? Or the document model? (which doesn't have an existing mechanism to provide the table structure up front)
  2. If the object model, do you fully describe the key and index structure and attribute naming in your .NET classes via the DynamoDB attributes? Or are you relying on the SDK's inferring this via the internal DescribeTable call?
  3. If you don't apply all of the attributes today, is adding them a reasonable trade-off if it would avoid this issue?

@Talarian
Copy link

  1. Are you hitting this issue from the object persistence model? Or the document model? (which doesn't have an existing mechanism to provide the table structure up front)
    [Talarian] We're hitting this from the document model, so we're heavily reliant on the DescribeTable call currently as required by the SDK.

  2. If the object model, do you fully describe the key and index structure and attribute naming in your .NET classes via the DynamoDB attributes? Or are you relying on the SDK's inferring this via the internal DescribeTable call?
    [Talarian] N/A

  3. If you don't apply all of the attributes today, is adding them a reasonable trade-off if it would avoid this issue?
    [Talarian] N/A

We decided to opt for the document model over the object persistence model largely for flexibility. Being able to update our schema from release to release as we go on demand is extremely useful, and would be quite difficult to do using object persistence model--or at least, we'd need a full copy of the object definition in the code for every schema version to map upgrades, whereas with the document model we can define the upgrade step only as a small piece of code mapping old fields to new fields, etc. Any fix that only included the object persistence model would be of limited value to us. Others may have a different opinion however.

@ashovlin
Copy link
Member

Another update on our investigation to provide options to avoid the DescribeTable call entirely: we're also exploring a mechanism to let developers provide their table key and index structure in code. This can apply to both the document model as well as the object-persistence model.

Document model

// Instead of the current state, which relies on DescribeTable
Table employeeTable = Table.LoadTable(_client, "EmployeeTable");

// One can do this instead
Table employeeTable = new TableBuilder(_client, "EmployeeTable")
        .AddHashKey("Name", DynamoDBEntryType.String)
        .AddRangeKey("Age", DynamoDBEntryType.Numeric)
        .AddGlobalSecondaryIndex("GlobalIndex", "Company", DynamoDBEntryType.String, "Score", DynamoDBEntryType.Numeric)
        .AddLocalSecondaryIndex("LocalIndex", "Manager", DynamoDBEntryType.String)
        .Build();

Object-persistence model

Because the object-persistence model uses the document model under the hood, this same builder can be used to populate the cache inside the DynamoDBContext object

// One would initialize the Table via the new builder
var employeeTable = new TableBuilder(_client, "EmployeeTable")
        .AddHashKey("Name", DynamoDBEntryType.String)
        .AddRangeKey("Age", DynamoDBEntryType.Numeric)
        .AddGlobalSecondaryIndex("GlobalIndex", "Company", DynamoDBEntryType.String, 
            "Score", DynamoDBEntryType.Numeric)
        .AddLocalSecondaryIndex("LocalIndex", "Manager", DynamoDBEntryType.String)
        .Build();

// Then register that with the DynamoDBContext object, which populates its cache
var context = new DynamoDBContext(_client);
context.RegisterTableDefinition(employeeTable);

// Now the user can use context like normal with a class like the following
var employee = context.Load<Employee>("Alex");

// We still need to know the mapping from class to table,
// which can alternatively be in app.config or AWSConfigsDynamoDB
[DynamoDBTable("EmployeeTable")
public class Employee 
{
    // But do not need the index-related attributes here as long as the attribute names match
    public string Name { get; set; }

    public int Age { get; set; }

    ...
}

I have a prototype (that I have yet to push up) passing our current test suite, but we'll have to look at edge cases and failure modes for when there's a mismatch between the provided and actual table structure.


@Talarian (thanks for the reply) and others - does this seem viable for your usage of the document model? It's not quite as flexible as the current behavior since the key and index structure that's defined in code would need to match the table it's connecting to, but if your keys/indices are fairly static this may be a path to avoid the issues with the internal DescribeTable.

@ashovlin
Copy link
Member

We released AWSSDK.DynamoDBv2 version 3.7.201 yesterday which introduces a new MetadataCachingMode option. By default when the SDK caches the table definition it loads via the internal DescribeTable call, the cache key consists of a combination of the current credentials, region, endpoint, and the table name. This can lead to cache misses in cases where credentials are refreshed or a new role is assumed, even when you are accessing the same table.

There is a new option TableNameOnly, which will only use the table name as the cache key. If you are accessing an identically structured table (same name, keys, and index structure), enabling this may reduce cache misses so the SDK does not have to enter this locked, sync-over-async code path to fill the cache again. Be mindful of enabling this, since the SDK's table metadata cache will no longer keep separate entries for different credentials/accounts/regions.

Object-persistence model

var config = new DynamoDBContextConfig
{
    MetadataCachingMode = MetadataCachingMode.TableNameOnly
};

var context = new DynamoDBContext(_dynamoDBClient, config);

Document model

var tableConfig = new TableConfig("TableName")
{
    MetadataCachingMode = MetadataCachingMode.TableNameOnly
};

var table = Table.LoadTable(_dynamoDBClient, tableConfig);

We're still working on the two options described above to eliminate the sync-over-async call to fill the cache if you are able to define the table key/index structure in advance.

@ashovlin
Copy link
Member

We've released new ways to initialize the high-level DynamoDB clients in AWSSDK.DynamoDBv2 3.7.203. You can now provide the table's key and index structure in code, which allows the SDK to skip the problematic DescribeTable call. This can avoid the threadpool starvation, as well as improve cold-start latency since the SDK doesn't need to fill the cache before the first actual DynamoDB operation.

It's important to describe your table’s key and indexes accurately via these new methods otherwise you may see exceptions when attempting to perform DynamoDB operations. Also the Table objects created via the new initialization methods will not contain some information that would have been populated the DescribeTable call, though this does not impact the SDK’s ability to construct the low-level requests. Specifically properties inside its LocalSecondaryIndexDescription and GlobalSecondaryIndexDescription objects will be null.

We published a blog post today that details how to use the new initialization patterns, but pasting here as well:

Document Model and TableBuilder

Instead of Table.LoadTable or Table.TryLoadTable, you can use the new TableBuilder to create the Table object by specifying the hash/range key and any secondary indexes.

var table = new TableBuilder(client, "Reply")
    .AddHashKey("Id", DynamoDBEntryType.String)
    .AddRangeKey("ReplyDateTime", DynamoDBEntryType.String)
    .AddGlobalSecondaryIndex("PostedBy-Message-index", "Author", DynamoDBEntryType.String, "Message", DynamoDBEntryType.String)
    .Build();

Object Persistence Model and DisableFetchingTableMetadata

If you fully describe the table's key and index structure via the existing attributes, you can now set DisableFetchingTableMetadata to true on DynamoDBContextConfig.

[DynamoDBTable("Reply")]
public class Reply
{
    [DynamoDBHashKey]
    public string Id { get; set; }

    [DynamoDBRangeKey(StoreAsEpoch = false)]
    public DateTime ReplyDateTime { get; set; }

    [DynamoDBGlobalSecondaryIndexHashKey("PostedBy-Message-Index", AttributeName ="PostedBy")]
    public string Author { get; set; }

    [DynamoDBGlobalSecondaryIndexRangeKey("PostedBy-Message-Index")]
    public string Message { get; set; }
}
var client = new AmazonDynamoDBClient();
var config = new DynamoDBContextConfig
{
    DisableFetchingTableMetadata = true
};

var context = new DynamoDBContext(client, config);

// Constructs a query that will find all replies by a specific author,
// which relies on the global secondary index defined above
var query = context.QueryAsync<Reply>("Author1", new DynamoDBOperationConfig() { IndexName = "PostedBy-Message-index"});

Object Persistence Model and TableBuilder

You can also combine the above techniques, and register the Table object created by a TableBuilder with your DynamoDBContext instance. This may allow you to use the object persistence model even if you are unable to add the DynamoDB attributes to the .NET class, such as if the class is defined in a dependency.

var client = new AmazonDynamoDBClient();
var config = new DynamoDBContextConfig
{
    DisableFetchingTableMetadata = true
};

var context = new DynamoDBContext(client, config);

var table = new TableBuilder(client, "Reply")
    .AddHashKey("Id", DynamoDBEntryType.String)
    .AddRangeKey("ReplyDateTime", DynamoDBEntryType.String)
    .AddGlobalSecondaryIndex("PostedBy-Message-index", "Author", DynamoDBEntryType.String, "Message", DynamoDBEntryType.String)
    .Build();

// This registers the "Reply" table we constructed via the builder
context.RegisterTableDefinition(table);

// So operations like this will work,
// even if the Reply class was not annotated with this index
var query = context.QueryAsync<Reply>("Author1", new DynamoDBOperationConfig() { IndexName = "PostedBy-Message-index"});

These do require code changes to opt in, but we hope it's worth it. Let us know if you see any issues with the new patterns.

@bhoradc
Copy link

bhoradc commented Sep 13, 2024

Hello @KereyWatters,

Going forward you could use the improved Initialization patterns for DynamoDB. We'll close this as completed.

Regards,
Chaitanya

@bhoradc bhoradc closed this as completed Sep 13, 2024
Copy link

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. dynamodb p2 This is a standard priority issue queued
Projects
None yet
Development

Successfully merging a pull request may close this issue.