Use the new async state machine pooling feature on websockets #50921
Description
Looking at an allocation profile for SignalR, the top allocations are coming from state machines allocated by Kestrel on websocket reads. We should be able to remove these by using the new pooling introduced here #50116.
Regression?
No
Data
Analysis
This is the code path:
runtime/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs
Line 685 in d927fcb
cc @stephentoub @CarnaViire

Tagging subscribers to this area: @dotnet/ncl
The plan is to use the feature in a few places based on profiling to see that it not only reduces allocations but also improves throughput under load. This is certainly one of the places we'll experiment with. We haven't done that yet because the feature hasn't been implemented in the compiler yet. There's no action that can be reasonably taken until that happens.
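For context, a minimal sketch of what opting a method into the pooled builder would look like once the compiler support lands. The method names and the inner read here are hypothetical stand-ins, not the actual ManagedWebSocket code:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public class PooledReceiveSketch
{
    // Per-method opt-in to the pooled builder added by #50116 (requires the
    // C# support for async method builder overrides discussed below).
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    private async ValueTask<int> ReceiveAsyncPrivate(Memory<byte> buffer, CancellationToken cancellationToken)
    {
        // When this await completes asynchronously, the state machine box is
        // rented from a pool instead of being freshly allocated per read.
        return await ReadFromTransportAsync(buffer, cancellationToken).ConfigureAwait(false);
    }

    // Hypothetical stand-in for the underlying transport read.
    private ValueTask<int> ReadFromTransportAsync(Memory<byte> buffer, CancellationToken cancellationToken)
        => new ValueTask<int>(0);
}
```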
This is the one that always gets me. CPU time spent in the GC vs CPU time spent in our own pools calling rent/return.
Yea, I know, I'm just filing issues and storing profiles.
"gets you" meaning you forget that there's a cost to pooling, or "gets you" meaning you think we shouldn't be concerned about such costs and should always pool everything?
The latter, assuming it doesn't make things worse, because allocations are like paper cuts that add up (peanut butter). There are a couple of things we don't have:
- We don't know how to measure the CPU impact of all of these small pools, and that's problematic when trying to judge whether pooling is better than the GC if you don't see a throughput increase (or working set decrease).
- The other very unsatisfying thing IMO is not seeing a working set reduction when reducing allocations (gen0 ones). It feels like we're just delaying the time to when a GC happens but not saving on overall memory of the application.
They're all "allocations"; it's just that only some of them show up in GC allocation profile traces. We could allocate and free native memory; that wouldn't show up on a managed profile, but it could be a worse issue. We could allocate and free by renting and returning from some pool, and that similarly wouldn't show up on a profile. But these things all have costs. In a sense, the GC itself is a pool: it grabs some memory from the OS and then effectively rents out pieces of it... it's just that rather than those pieces being explicitly returned by code, the GC goes and returns them itself. There are many considerations that need to be factored in when pooling, and most pool implementations don't account for them, e.g. using objects in a manner less likely to cause cache problems.
Right. So if we can't come up with a scenario where we see a meaningful improvement, we're better off not adding the complexity of a custom pool, just to hide it from a managed allocation profile, with no other benefits we're able to measure, especially if that pool doesn't play nicely with all of the GC's knobs, which close to none do.
This is where we differ. You say "do it as long as it doesn't make things worse". I say "do it if it makes things better". cc: @Maoni0
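To make the rent/return cost being weighed here concrete, a deliberately minimal (hypothetical) pool sketch; every Rent/Return is a concurrent queue operation plus whatever reset the object needs, and that CPU time is what gets traded against the GC's cost of collecting short-lived gen0 objects:

```csharp
using System.Collections.Concurrent;

// Not a production pool: no trimming, no cache-locality awareness, unbounded
// growth. It exists only to show where the "CPU time spent in our own pools
// calling rent/return" goes.
public sealed class SimplePool<T> where T : class, new()
{
    private readonly ConcurrentQueue<T> _items = new();

    public T Rent() => _items.TryDequeue(out var item) ? item : new T();

    public void Return(T item) => _items.Enqueue(item);
}
```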
This is exactly why we need to be able to measure, because we measure scenarios and make decisions, when as a library author, you don't have any control over those scenarios. The peanut butter nature of gen0 allocations is what bugs me and why it "feels right" as a library author to err on the side of reuse as much as possible where it doesn't make the code crazy. Now I'm not advocating for global pools for all the things (I think they suck in the various ways you mention) but it's really the only control you have outside of telling the app developer to…
Complexity is in the eye of the beholder. I think we just need to be able to measure the cost of these custom pools better. It's the only knob we have as library authors.
This is where we disagree 😄, because I plan to pool where possible (and where it seems wasteful) and where it doesn't make things too complex (judgement call obviously). I see it as giving more headroom to the applications where libraries exist as a guest.
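One hedged way to start measuring "the only knob we have": a micro-benchmark comparing a fresh allocation against ArrayPool rent/return. It only captures the direct CPU cost (MemoryDiagnoser shows the allocation itself, not the later GC time spent collecting it), but it's a sketch of the comparison being asked for:

```csharp
using System.Buffers;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class PoolVsAlloc
{
    [Benchmark(Baseline = true)]
    public byte[] Allocate() => new byte[4096]; // gen0 allocation, collected later

    [Benchmark]
    public void RentReturn()
    {
        // No allocation in steady state, but Rent/Return is not free either.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        ArrayPool<byte>.Shared.Return(buffer);
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<PoolVsAlloc>();
}
```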
I wonder about this specifically re pooling of state machines. The state machines hold the async "stack". Seems like GC allocations would do better at locality than pooling. |
I'd also like to be able to measure some of this as well because I assume this is also very much like peanut butter. |
I guess this comes down to the fact that gen0 isn't better than pooling in lots of cases. We've seen evidence in Kestrel that pooling has cumulatively improved the overall throughput. I guess you could write a pool that is really bad and ends up making performance worse, or one that is too complex to maintain, but generally well-isolated pools or simple object reuse are better.
I see this; short of reducing allocs to 0 or adding the settings to cap the GC size, even if your allocations are only 1000 Tasks per second, eventually it will tend to 1 GB of heap even if only 5 MB is ever in use. I did think there was some heuristic the GC used to adjust its allocation budget so that collections match the memory size to the allocation rate? https://devblogs.microsoft.com/dotnet/garbage-collection-at-food-courts/
Do the startup long-surviving allocations throw it off, or maybe I'm misunderstanding the allocation budget? Perhaps it starts off "infinite" and reduces if allocations are high (large heap -> collected faster large heap), rather than starting off low and increasing if allocations are high (small heap -> large heap) /cc @Maoni0
Well, maybe not 1 GB anymore with the segment size changes; but larger than expected, as a slow allocation rate waits a very long time for a GC, by which time the heap has grown significantly even if nothing is ever particularly in use (on a machine with significant RAM and no GC setting tweaking). That said, I haven't experienced any performance issues from this large heap collection; it's more of a perception issue from looking at what the OS reports the app as using, so perhaps it's more of a feel-good factor? 😅
the initial gen0 budget is mostly based on LLC (last level cache) size. during startup when gen0 surv rate is very high the budget for gen0 will quickly get adjusted to be high. when the gen0 surv rate becomes much lower during the steady state the budget will be adjusted down to reflect that. |
Might be the extremely fast ramp-up that exhibits this then? Pushing up the predicted budget, followed by no GC for a very long time while it fills the budget? I understand I'm not familiar enough, so I'm essentially completely guessing 😅; but if the budgets are re-evaluated only at the next GC, perhaps a timeout "pseudo" GC when the budget hasn't been hit for a while could serve to adjust the budget down if the steady-state allocation flips to a massively lower allocation rate? e.g. from my various amortize issues (#30168, dotnet/aspnetcore#11940, etc.) the only allocations for our websocket server are the Read async boxes in the layering chain of reads; but the app will still end up using a lot of memory according to the OS (though the active managed heap is pretty minimal). @stephentoub's "Make "async ValueTask/ValueTask" methods amortized allocation-free" dotnet/coreclr#26310 (now removed 😢) reduced this to a single async box allocation per read for the steady state. I haven't rerun this test on the latest runtime, so perhaps I should also do that and find out what it stops at.
I could probably also do a |
There's a new counter in .NET 6 dotnet/diagnostics#2139 that might help track down where the rest of the memory is going. |
if the budget was adjusted very high, yes. it'll need to wait till the next GC to observe that your alloc behavior has changed. in workstation GC we do check more often (but then again WKS GC has a much smaller max budget). we should probably check this for Server GC as well but in general this hasn't been a problem (happy to discuss if it's a problem for you). |
of course you can also collect a trace to verify this theory - GC informational events will tell you the budget. |
It's more than a feel-good factor; it's also why you can't tell how much memory you need to run. This shows up when people try to set container limits.

I do think there's a balance to be struck here and that's basically what's being discussed. Easy allocations don't really need a benchmark to remove: we removed extra closures, string, and delegate allocations with small changes and new compiler features (like static lambdas). We also agree that pooling large objects is beneficial because they are actually expensive in the GC, so reusing them is worthwhile. So now we're trying to find the middle ground: how much do we contribute to gen0 as a framework/set of libraries? It's sometimes obvious to do this if there's a throughput benefit, but we rarely mention the memory benefit because it's somewhat out of the library's hands (you just run the GC more often, right?).

PS: I'm spending some time trying to better understand the overhead and impact of many object-specialized pools in the system.
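As an aside on the "easy allocations" mentioned above, a small sketch of the static-lambda pattern; the method and message here are made up for illustration:

```csharp
using System;
using System.Threading.Tasks;

public static class StaticLambdaSketch
{
    public static Task LogLaterAsync(string message)
    {
        // A capturing lambda allocates a closure plus a delegate on every call:
        //   return Task.Run(() => Console.WriteLine(message));
        //
        // Marking the lambda 'static' (C# 9) forbids capture; the state flows
        // through the overload that takes it, and the compiler can cache the
        // non-capturing delegate instance.
        return Task.Factory.StartNew(static state => Console.WriteLine((string)state!), message);
    }
}
```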
An example: in Kestrel/samples/PlaintextApp with the addition of TLS, there are essentially 3 allocations after start-up and initial establishment of 256 connections; the async boxes for steady state (while doing 72k req/s) come to 8.497 MB "After GC?", but Task Manager is reporting a working set of 90 MB that stays there (10 min run)
Trace: Plaintext.etl
So the
Pooling these 3 async boxes in this example might be a drop in the ocean for another app's allocations, as it will be something else there; however, that app will still have these allocations too. And it currently means the simplest Plaintext app, with only 256 connections, needs 90 MB to run, which seems high as the baseline for the most minimal ASP.NET Core app with TLS?
i.e. 2 GCs/sec seems nice, and perhaps there is a tweak so its budget doesn't go so high at the very start. However, mostly it is actually allocating 45 MB/sec, and I'd prefer to not allocate that 45 MB, since that's an option, rather than increase the GC rate (if that makes sense?)
@benaadams it was high because the survival rate was very high and soon after the budget got adjusted down, so that seems exactly the right thing to do. I'm curious to know why you wouldn't want it to go so high? you get fewer GCs that way when GC knows it's not going to be productive to do a collection.
I think we should seriously rename all these counters to add an "after GC" at the end of their names... because this does not include the gen0 you are about to allocate. would that help with clarifying why there's a discrepancy?
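A quick way to see the "after GC" distinction from code (just an illustrative sketch; counters and traces are the better tools):

```csharp
using System;

public static class HeapSnapshot
{
    public static void Print()
    {
        // GetGCMemoryInfo reports figures captured at the most recent GC, so
        // like the counters above it is effectively "heap size after GC":
        // gen0 objects allocated since that collection aren't included.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        Console.WriteLine($"Heap size after last GC: {info.HeapSizeBytes / (1024.0 * 1024.0):F1} MB");

        // GetTotalMemory(false) does include bytes allocated since the last GC,
        // which is one way to observe the discrepancy.
        Console.WriteLine($"Live + recent allocs:    {GC.GetTotalMemory(forceFullCollection: false) / (1024.0 * 1024.0):F1} MB");
    }
}
```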
I don't particularly have an issue with what the GC is doing; the app is allocating at a rate and the GC is running twice a second for that allocation rate; I'm not saying I want the GC to run more frequently 😅 It's the async scenario that bothers me. If these were changed to sync reads rather than async reads, then the steady-state allocation rate would become zero (and likely the working set would be smaller). Moving to async, the allocation rate moves to 45 MB/s purely for these 3 async boxes, and as an app developer, since they are embedded in the framework, they are allocations that nothing can be done about; but it's very clear what they are. To me it just seems that if there is a clear, demonstrable and straightforward opportunity to bring the allocation rate and working set down, assuming there aren't significant downsides, that would be a better direction to favour?
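A sketch of the sync-vs-async contrast being made, with generic Stream reads standing in for the WebSocket/TLS layers:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ReadLoops
{
    // Synchronous read: once the buffer exists, steady state allocates nothing.
    public static int ReadSync(Stream stream, byte[] buffer)
        => stream.Read(buffer, 0, buffer.Length);

    // Asynchronous read: whenever the await doesn't complete synchronously, a
    // "box" is needed to keep the state machine alive until the data arrives.
    // That per-read box is the allocation this issue is about pooling.
    public static async ValueTask<int> ReadAsync(Stream stream, Memory<byte> buffer)
        => await stream.ReadAsync(buffer);
}
```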
I have a different take on this. In the above case, the GC can keep up with the allocation rate, things are great, CPU time spent in the GC is low. The problem is, there's a rate at which the GC can't keep up and CPU time inevitably rises and then we look at how to spend time reducing allocations to reduce time spent in the GC. If the libraries you use allocate into gen0 then they end up setting the floor for your allocation rate. At this stage you have 2 options:
Does this sound correct or am I off base?
So for less total memory use, the GC can run more often (and there are constraint settings available for that), but that means allocations will prematurely age up, making a Gen2 collection more likely; or, as you say, it can use much more memory with less frequent GCs, by which point the lifetime of the objects has likely passed, making a Gen2 collection less likely. However, it's better to avoid allocations where we can: to give the GC more headroom, to deal with the allocations that we can't easily avoid, and to give user code a bigger goldilocks zone where they aren't worrying about how much they allocate.
This is the crux of the issue. For user code, for example, if it's Linq use then they can likely tweak it to not use Linq without much of an issue; and if it's something infrequent and not in the hot path then it doesn't matter too much. However, the above examples, which happen for every read of data, are both very much in the hot path, and the cost for user code to avoid them is extremely high (reimplement their own TLS layer as per Drawaes/Leto; reimplement WebSockets as per StackExchange/NetGain); essentially ditching the framework stack, which are major investments with extra security concerns. Moving away from the simple Plaintext app, this is our "In Space" WebSocket server showing allocations between two GCs on current .NET 6.0: 99.9% of the allocations are the main-path .NET/ASP.NET Core APIs and not our app's; as far as I'm aware there is nothing we can do, short of ditching the Framework, to reduce these allocations. It's essentially the baseline allocation rate we cannot go under.
Added change to fix the |
Triage: part of larger effort, needs measurable wins. |
Networking is a high-performance part of the stack that is used in many scenarios. I'd expect us to do this in WebSockets and SslStream as they are in the hot path of apps that use them. If it doesn't fit there, I'd have a hard time imagining where it does fit.
@stephentoub is the feature (the pooling) in yet for C# 10?
The compiler support for async method builder overrides is not in yet. |
SslStream next! 😁 |
Not in 6. |
Yea, I didn't expect any of it in 6, so I'm thankful for what we got so far.