
Single epoll thread per 28 cores #35800

Merged (15 commits) on May 7, 2020

Conversation

adamsitnik (Member, Author) commented May 4, 2020

Edit: this PR has evolved over time; please see the later comments for an accurate description.

@adamsitnik added the area-System.Net.Sockets, os-linux, and tenet-performance labels on May 4, 2020

tmds commented May 4, 2020

kestrel-linux-transport doesn't use ConcurrentDictionary; instead, a regular Dictionary with a lock is used. The lookup is performed up-front, which improves locality.

Previous benchmarks for ConcurrentDictionary vs Dictionary+lock showed only a small difference. Maybe we'll see a bigger difference for this scenario.

// the goal is to have a dedicated generic instantiation and using:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.Net.Sockets.SocketAsyncContextWrapper]::TryGetValueInternal(!0,int32,!1&)
// instead of:
// System.Collections.Concurrent.ConcurrentDictionary`2[System.IntPtr,System.__Canon]::TryGetValueInternal(!0,int32,!1&)
Member:
Curious that this would perform better. Why is the dedicated generic instantiation better?

Member:
Do we ever update the value for an existing key in the dictionary? If we do, this will make updates more expensive, as they'll be forced to allocate a new node in the CD, whereas with a reference type value, the existing node will be used.

as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that? Often I see a similar optimization applied as a workaround for removing array covariance checks, but that's on writes, which we shouldn't be doing here frequently.

Member:
Do we ever update the value for an existing key in the dictionary?

We don't. We use incremental keys for each AsyncContext. When we run out of keys, we start a new SocketEngine.

as for why a specific generic instantiation would do better, presumably it's because it's avoiding the generic dictionary lookup, or helping with inlining, or something like that?

I'm curious what it is.
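(For reference, a minimal sketch of the wrapper-struct approach being discussed. SocketAsyncContextWrapper is the name taken from the quoted instantiation above; the class, field, and method names below are illustrative, not the actual runtime code.)

using System;
using System.Collections.Concurrent;

internal sealed class SocketAsyncContext { /* stand-in for the runtime's internal per-socket context */ }

// Wrapping the reference-type context in a struct gives the dictionary a value-type TValue,
// so the JIT emits a dedicated instantiation of TryGetValueInternal instead of the shared
// System.__Canon one used for all reference-type values.
internal readonly struct SocketAsyncContextWrapper
{
    public SocketAsyncContextWrapper(SocketAsyncContext context) => Context = context;

    public SocketAsyncContext Context { get; }
}

internal sealed class SocketAsyncEngineSketch
{
    private readonly ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper> _handleToContext =
        new ConcurrentDictionary<IntPtr, SocketAsyncContextWrapper>();

    public void Register(IntPtr handle, SocketAsyncContext context) =>
        _handleToContext.TryAdd(handle, new SocketAsyncContextWrapper(context));

    public void HandleEvent(IntPtr handle)
    {
        if (_handleToContext.TryGetValue(handle, out SocketAsyncContextWrapper wrapper))
        {
            SocketAsyncContext context = wrapper.Context;
            // ... dispatch the epoll event to this context ...
        }
    }
}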

tmds commented May 4, 2020

kestrel-linux-transport doesn't use ConcurrentDictionary, instead a regular Dictionary with a lock is used. The lookup is performed up-front, which improves locality.

Like EPollAsyncEngine.EPollThread.cs#L85-L104
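(For reference, a rough sketch of the lock-plus-Dictionary pattern being referenced, with the lookups for a whole batch of events done up-front under one lock acquisition. All type and member names here are placeholders, not the kestrel-linux-transport code.)

using System;
using System.Collections.Generic;

internal sealed class EventLoopSketch
{
    // Hypothetical stand-ins for the native epoll event buffer and the per-socket context.
    private struct EpollEvent { public IntPtr Handle; public uint Events; }
    private sealed class SocketContext { public void HandleEvents(uint events) { /* ... */ } }

    private readonly Dictionary<IntPtr, SocketContext> _handleToContext = new Dictionary<IntPtr, SocketContext>();
    private readonly object _gate = new object();

    private void ProcessBatch(EpollEvent[] events, int count, SocketContext[] contexts)
    {
        // Perform all lookups for the batch up-front, under a single lock acquisition,
        // so the dictionary accesses stay short and cache-friendly.
        lock (_gate)
        {
            for (int i = 0; i < count; i++)
            {
                _handleToContext.TryGetValue(events[i].Handle, out contexts[i]);
            }
        }

        // The per-socket work happens outside of the lock.
        for (int i = 0; i < count; i++)
        {
            contexts[i]?.HandleEvents(events[i].Events);
        }
    }
}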

kouvel commented May 4, 2020

I got to the point that the biggest bottleneck is ConcurrentDictionary.TryGetValue and before I try any experiments with ConcurrentDictionary

From the graph it looks like the contention is coming from EventLoop(); if I'm reading it right, that would be contention on the ConcurrentQueue and not the ConcurrentDictionary. If it's contention on the ConcurrentQueue, it might be interesting to try a non-contending overload/implementation of ConcurrentQueue.TryDequeue that differentiates between empty, contention, and successful dequeue (without spin-waiting). In the contention case, EventLoop() could then exit prematurely to avoid contention, after scheduling another work item to replace it. My hope is that the new work item would not run too soon, and it may decrease the parallelization a bit to make faster progress on the queue. Just a thought though, may or may not work.

kouvel commented May 4, 2020

Previous benchmarks for ConcurrentDictionary vs Dictionary+lock showed only small difference. Maybe we'll see a bigger difference for this scenario.

👍 The lock seems to be rarely taken on other paths, so it could be taken here around the whole inner loop, with faster lookups.

kouvel commented May 4, 2020

From the graph it looks like the contention is coming from EventLoop(), if I'm reading it right that would be contention on the ConcurrentQueue

Ah I didn't read that right, nevermind

kouvel commented May 4, 2020

I am working on a PR that is going to set the number of epoll threads to 1.

While testing it with a big number of clients (20k) I've noticed a regression: for the JSON Platform benchmark the RPS dropped from 780k to 715k.

Was the 20K clients test also with 1 epoll thread, or would it use 20? I figure with 1 epoll thread ConcurrentDictionary accesses shouldn't be contending at all, so there must have been more epoll threads. With 20K connections does it perform better or worse with 1 epoll thread? Lock still may be better in both cases.

adamsitnik (Author):

if I'm reading it right that would be contention on the ConcurrentQueue and not the ConcurrentDictionary

15.64% of time is spent in ConcurrentDictionary.TryGetValue while only 1.41% in ConcurrentQueueSegment.TryEnqueue

[screenshot]

adamsitnik (Author):

Was the 20K clients test also with 1 epoll thread, or would it use 20?

For 20k connections from a single load machine:

Before your change it was 740k RPS with 14 epoll threads (Cores / 2)
After your change it was 780k RPS with 14 epoll threads (Cores / 2)

With your change and a single epoll thread it dropped to 715k RPS; with the micro-optimizations from this PR it's 740k again.

@adamsitnik changed the title from "Epoll thread event loop micro optimizations" to "Single epoll thread" on May 5, 2020
@adamsitnik added the NO-MERGE label (the PR is not ready for merge yet; see discussion for detailed reasons) on May 5, 2020
adamsitnik (Author):

Update: from the initial data that I have, it looks like switching from a ConcurrentDictionary to a regular Dictionary under a lock (thanks for the great hint, @tmds!) combined with the few micro-optimizations is enough to always have a single epoll thread.

I am now going to run the benchmarking program over a matrix of configurations and share the results. If there are no regressions, I am going to ask you for review again. For now, please don't merge it.


adamsitnik commented May 6, 2020

How to read the results

[screenshot]

  • "before #35330" means results before merging #35330
  • "#35330" means code after merging #35330
  • "xET yD" means code after merging #35330 plus the micro-optimizations from this PR, using x epoll threads and y dictionary; for y, C stands for ConcurrentDictionary and L for a generic Dictionary used under a lock. So "1ET CD" means a single epoll thread using a ConcurrentDictionary.

"Fortunes Batching" means the Fortunes Platform benchmark executed with a copy of Npgsql.dll provided by @roji that implements batching.

Colors: default MS Excel color scheme where red means the worst and green means the best result.

x64 12 Cores (the perf machine)

Let's start with something simple:

[screenshot]

As we can see, switching to a single epoll thread and using ConcurrentDictionary gives the best results - the 1ET CD column is the greenest one. No regressions, pure win.

There are two cases where having more epoll threads gives better results:

  • JsonPlatform using 512 connections. We could get 130k instead of 128k. The difference is so small that it's ignorable
  • PlaintextPlatform using 20_000 connections. The difference is small, but IMHO Plaintext is the most artificial benchmark (because of the pipelining and the super small response), and making the heuristic more complex to get a few extra % here is not worth it.

x64 28 Cores (Citrine, the TechEmpower machine)

TechEmpower hardware:

[screenshot]

Again, switching to a single epoll thread and using ConcurrentDictionary gives the best results - the 1ET CD column is the greenest one.

There are a few cases where having more epoll threads gives better results:

  • small and ignorable differences within the margin of error:
    • 300k vs 305k for Fortunes using 128 connections
    • 311k vs 318k for Fortunes using 512 connections
    • 9268k vs 9373k for Plaintext using 1024 connections
  • a regression from 742k to 723k for JsonPlatform with 20_000 connections. It's a 2.5% regression, so it's small, and the two other benchmarks (Plaintext and Fortunes) give the best results for this config, so I think it's acceptable.

Very good thing: the throughput of the JSON and Fortunes benchmarks rises as the number of clients increases (up to some point, of course). We did not have that before.
Another great thing: 417,499 for Fortunes with 1024 connections with the latest bits from @roji. It's top 10 of Fortunes ;)

x64 56 Cores (Mono machine)

[screenshot]

With 56 cores, having a single epoll thread is not enough. Having two gives us the optimal solution, improving all cases.

There are two cases where having more epoll threads gives better results, but both are small and ignorable differences within the margin of error:

  • 6954k vs 6950k for Plaintext using 256 connections
  • 6964k vs 6960k for Plaintext using 512 connections

There are two where having fewer epoll threads gives better results:

  • ignorable 660k vs 673k for JsonPlatform using 128 connections
  • 6523k vs 6964k for PlaintextPlatform using 128 connections. Having a single epoll thread could give us better results, but we still have an improvement compared to the 6011k baseline. We could reach it by setting MinHandles to 128 instead of 32, but I don't think it's worth it - it's rather unlikely that such a beefy machine is going to be used for handling such a small load.

Very nice thing: the gains are really big, even up to 2x for JSON with 512 connections.

The Fortunes benchmark is not included because, for some reason, this machine currently cannot access the database server.

ARM64 32 Cores

Here is where things get complicated:

[screenshot]

Having a single epoll thread, no matter which dictionary we use, gives us a lot of red (except the case with 20k connections).

There is no obvious relationship between the number of connections and the number of threads (like "the more connections, the more threads we need"). If we look at the numbers before our changes, this machine seems to be struggling to scale up as the number of connections grows (the JSON numbers are: 470 -> 455 -> 425 -> 350 -> 246).
This requires an independent investigation.

Using 4 epoll threads gives us more improvement than using two. There is only one regression: JSON using 128 connections. Again, I think that for this number of cores we should optimize for many connections, and I hope that this is acceptable.

adamsitnik (Author):

I've shared the numbers from my most recent experiment in a comment above. PTAL

Based on these numbers I came up with the following proposal for the heuristic that determines the number of epoll threads (a rough code sketch follows the list below):

  • we need one epoll thread for every 28 cores
  • we need to "round up" in a way that 29 cores get 2 epoll threads
  • we need to double that for ARM
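
(A rough sketch of that proposal, not the committed code; the rounding and the ARM check below are just my reading of the three bullet points above.)

using System;
using System.Runtime.InteropServices;

internal static class EpollThreadCountSketch
{
    public static int GetEpollThreadCount()
    {
        const int coresPerEpollThread = 28;

        // Ceiling division: 28 cores -> 1 thread, 29 cores -> 2 threads.
        int count = (Environment.ProcessorCount + coresPerEpollThread - 1) / coresPerEpollThread;

        Architecture arch = RuntimeInformation.ProcessArchitecture;
        if (arch == Architecture.Arm || arch == Architecture.Arm64)
        {
            count *= 2; // "double that for ARM"
        }

        return Math.Max(1, count);
    }
}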

The code that I've just pushed gives the following results:

ratio is (#35800/before #35330) - 1.0

| Machine | Connections | Benchmark | before #35330 | #35330 | #35800 | ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Citrine 28 cores | 128 | PlaintextPlatform | 7,274,914 | 7,389,508 | 7,753,695 | 6.58% |
| | | JsonPlatform | 728,738 | 753,185 | 824,774 | 13.18% |
| | | FortunesPlatform | 288,242 | 293,637 | 301,217 | 4.50% |
| | | Fortunes Batching | 169,087 | 167,237 | 175,886 | 4.02% |
| | 256 | PlaintextPlatform | 8,855,217 | 8,898,258 | 8,949,456 | 1.06% |
| | | JsonPlatform | 941,176 | 952,582 | 1,078,476 | 14.59% |
| | | FortunesPlatform | 291,339 | 301,213 | 334,134 | 14.69% |
| | | Fortunes Batching | 311,945 | 299,331 | 336,176 | 7.77% |
| | 512 | PlaintextPlatform | 8,785,644 | 9,139,882 | 9,251,310 | 5.30% |
| | | JsonPlatform | 919,425 | 956,259 | 1,124,823 | 22.34% |
| | | FortunesPlatform | 289,177 | 302,984 | 305,273 | 5.57% |
| | | Fortunes Batching | 358,163 | 349,256 | 411,341 | 14.85% |
| | 1,024 | PlaintextPlatform | 8,798,429 | 9,093,448 | 9,329,115 | 6.03% |
| | | JsonPlatform | 917,482 | 983,014 | 1,135,564 | 23.77% |
| | | FortunesPlatform | 261,790 | 273,522 | 298,583 | 14.05% |
| | | Fortunes Batching | 372,989 | 374,679 | 407,914 | 9.36% |
| | 20,000 | PlaintextPlatform | 6,711,039 | 6,707,423 | 7,251,077 | 8.05% |
| | | JsonPlatform | 742,247 | 754,620 | 732,218 | -1.35% |
| | | FortunesPlatform | 208,385 | 220,029 | 227,443 | 9.15% |
| | | Fortunes Batching | 289,530 | 301,026 | 347,141 | 19.90% |
| Perf 12 cores | 128 | PlaintextPlatform | 4,548,601 | 4,581,534 | 4,439,358 | -2.40% |
| | | JsonPlatform | 438,914 | 456,929 | 502,222 | 14.42% |
| | | FortunesPlatform | 120,766 | 127,799 | 136,628 | 13.13% |
| | 256 | PlaintextPlatform | 4,520,799 | 4,728,698 | 5,288,782 | 16.99% |
| | | JsonPlatform | 441,074 | 464,803 | 545,610 | 23.70% |
| | | FortunesPlatform | 123,775 | 132,081 | 138,271 | 11.71% |
| | 512 | PlaintextPlatform | 4,439,709 | 4,915,243 | 5,368,813 | 20.93% |
| | | JsonPlatform | 456,198 | 480,191 | 554,399 | 21.53% |
| | | FortunesPlatform | 121,289 | 130,383 | 129,009 | 6.36% |
| | 1,024 | PlaintextPlatform | 4,270,802 | 4,856,757 | 5,282,018 | 23.68% |
| | | JsonPlatform | 453,737 | 480,158 | 561,559 | 23.76% |
| | | FortunesPlatform | 108,143 | 118,506 | 123,213 | 13.94% |
| | 20,000 | PlaintextPlatform | 3,886,569 | 3,960,775 | 4,039,859 | 3.94% |
| | | JsonPlatform | 309,933 | 333,290 | 388,005 | 25.19% |
| | | FortunesPlatform | 94,303 | 105,309 | 110,534 | 17.21% |
| ARM 32 cores | 128 | PlaintextPlatform | 5,325,320 | 5,248,309 | 5,320,333 | -0.09% |
| | | JsonPlatform | 470,719 | 467,996 | 430,931 | -8.45% |
| | | FortunesPlatform | 70,159 | 79,601 | 87,110 | 24.16% |
| | 256 | PlaintextPlatform | 5,443,043 | 5,433,406 | 5,571,782 | 2.37% |
| | | JsonPlatform | 455,767 | 420,229 | 458,268 | 0.55% |
| | | FortunesPlatform | 73,379 | 76,414 | 86,376 | 17.71% |
| | 512 | PlaintextPlatform | 5,143,935 | 5,644,389 | 5,773,530 | 12.24% |
| | | JsonPlatform | 425,086 | 397,756 | 454,824 | 7.00% |
| | | FortunesPlatform | 80,027 | 79,361 | 84,626 | 5.75% |
| | 1,024 | PlaintextPlatform | 5,289,294 | 5,409,985 | 5,817,791 | 9.99% |
| | | JsonPlatform | 350,471 | 376,589 | 440,122 | 25.58% |
| | | FortunesPlatform | 59,300 | 53,292 | 60,414 | 1.88% |
| | 20,000 | PlaintextPlatform | 3,799,859 | 4,109,911 | 4,229,440 | 11.31% |
| | | JsonPlatform | 246,717 | 258,675 | 309,816 | 25.58% |
| | | FortunesPlatform | 44,415 | 36,242 | 40,573 | -8.65% |
| Mono 56 cores | 128 | PlaintextPlatform | 6,011,013 | 6,508,597 | 6,577,628 | 9.43% |
| | | JsonPlatform | 462,300 | 673,968 | 655,741 | 41.84% |
| | 256 | PlaintextPlatform | 6,896,236 | 6,906,699 | 6,930,573 | 0.50% |
| | | JsonPlatform | 600,973 | 980,908 | 1,052,635 | 75.16% |
| | 512 | PlaintextPlatform | 6,941,870 | 6,941,820 | 6,954,308 | 0.18% |
| | | JsonPlatform | 623,578 | 1,079,661 | 1,136,029 | 82.18% |
| | 1,024 | PlaintextPlatform | 6,960,810 | 6,962,596 | 6,957,445 | -0.05% |
| | | JsonPlatform | 741,710 | 1,138,508 | 1,166,838 | 57.32% |
| | 20,000 | PlaintextPlatform | 6,825,034 | 6,784,191 | 6,786,858 | -0.56% |
| | | JsonPlatform | 660,291 | 919,557 | 944,349 | 43.02% |

@adamsitnik removed the NO-MERGE label on May 6, 2020
@adamsitnik changed the title from "Single epoll thread" to "Single epoll thread per 28 cores" on May 6, 2020

// the data that stands behind this heuristic can be found at https://github.com/dotnet/runtime/pull/35800#issuecomment-624719500
// the goal is to have a single epoll thread per every 28 cores
const int coresPerSingleEpollThread = 28;
Member:

The benchmarks confirm an observation we had made from the perf traces. It may be interesting to put in the comment:

TechEmpower JSON platform benchmark (which has a low workload per request) shows the epoll thread is fully loaded on a 28-core machine. We add 1 epoll thread per 28 cores to avoid it being a bottleneck.

Member:

I can already hear all the complaints about this line ... me being the first concerned. I think we should not have any heuristic with such a value. What if tomorrow we decide to use different hardware to do benchmarks? We should have some heuristics that are good for the general cases, and allow customers to define custom values that might be better for them. In our case, in the TE repository we would then define an env var with the number of epoll threads we want. Same for ARM probably, which might depend on each vendor.

tmds (Member), May 6, 2020:

I can already hear all the complaints about this line ... me being the first concerned.

I guess you refer to the suggestion I made? I prefer to put it explicitly here than to have it implicit in the linked comment.

From the benchmarking we did, TE platform JSON benchmark represents the lowest threadpool workload per request. This means the epoll thread will sooner become the bottleneck than on the other benchmarks.

We should have some heuristics that are good for the general cases

This heuristic is good for the general case. It uses fewer epoll threads than the previous heuristic (which was a guess, with little benchmarking done) and achieves higher performance.

and allow customers to define custom values that might be better for them.

This is now possible; the count can be set explicitly using the env var.

Member:

The env var is good. But the number 28 within the code is my concern. Unless we say we have to pick one; but 28 just because of Citrine should not be the reason, IMO.

Member:

@adamsitnik, on the 56-core machine (2-socket 28-core) are the numbers with or without the env vars COMPlus_Thread_UseAllCpuGroups=1 and COMPlus_GCCpuGroup=1? Without those I suspect it would only be using one CPU group and behaving as a single-socket 28-core machine, with those it should try to use both sockets. The numbers seem to be similar to the 28-core machine and it's a bit odd that 2 epoll threads do better there, though there may be other things going on.

Member:

It might be common to set those env vars on multi-numa-node machines if the intention is to scale up. Might also be interesting to try the AMD machine with those env vars since it also has multiple numa nodes. Not suggesting for this change or anything but it might provide more insights on heuristics for number of epoll threads.

kouvel (Member), May 7, 2020:

Ahh nevermind, from a brief look it almost looks like all of the CPU group stuff is disabled on Linux and those env vars may not have any effect. Sorry for the distraction.

adamsitnik (Author):

It might be common to set those env vars on multi-numa-node machines if the intention is to scale up.

Thanks for pointing this out. I am going to run this config as well.

Member:

Please see my latest comment :) There might be more work to do there in the VM; I'm not up to date on what's happening there.

Member:

I agree with @sebastienros that this value is fairly arbitrary, based on the specific (and limited) hardware we've tested on. Does the heuristic hold up on machines with a similar number of cores but a different distribution across nodes? What about when hyperthreading is disabled? Did we try it with cloud VMs?

We see 1 epoll thread is enough to load a 28 core machine (Citrine) with a benchmark that has low threadpool workload vs epoll workload (TE JSON platform).

That's what is captured by coresPerSingleEpollThread = 28.

This heuristic also works well in the likely cases where:

  • ProcessorCount is lower than 28
  • the workload per request (relative to the epoll work) is higher than in the JSON benchmark

This heuristic isn't tuned for multi-node machines, or machines with far more than 28 procs.

tmds commented May 6, 2020

@stephentoub @adamsitnik I think we can remove the MinHandlesForAdditionalEngine logic. It was there to avoid creating too many epoll threads. Now we have a low number of epoll threads anyway.

Environment.ProcessorCount >= 6 ? Environment.ProcessorCount / 2 : 1;
#endif
#pragma warning restore CA1802
private static readonly int s_engineCount = GetEnginesCount();
Member:
Nit: should this be s_maxEngineCount? We won't always have this many, but we may grow to this many based on the number of concurrent sockets, right?

Member:

@stephentoub I've suggested to remove that logic as part of this PR (#35800 (comment)).

adamsitnik (Author):

@tmds You are most probably right. The only use case for keeping it is a machine with many cores and very few connections, which should be uncommon.

Would you prefer me to remove it now or would you like to do this in your upcoming PR that is going to enable the "inlining"?

Member:

@adamsitnik remove it here, it is unrelated to inlining.

adamsitnik (Author):

@tmds I am going to merge it as it is right now, as I would really love to see the updated numbers. I am going to send a PR with the MinHandles logic removal today or tomorrow.


return Math.Min(result, Environment.ProcessorCount / 2);
return Math.Max(1, (int)Math.Round(Environment.ProcessorCount / (double)coresPerEngine));
Member:

and "round" it up, in a way that 29 cores gets 2 epoll threads

So now anything below 44 cores on x64 will get 1 thread? Then 2 after 76 cores ...

adamsitnik (Author):

Yes, and this should be enough for the vast majority of real-life scenarios.
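
(For concreteness, a small runnable example of what the new formula returns for a few core counts, assuming coresPerEngine is still the 28 shown in the earlier hunk; that constant is an assumption here.)

using System;

internal static class EngineCountExamples
{
    private const int CoresPerEngine = 28; // assumption: same constant as in the hunk above

    public static void Main()
    {
        foreach (int cores in new[] { 12, 28, 56, 112 })
        {
            // Same expression as in the new return statement, with ProcessorCount substituted.
            int engines = Math.Max(1, (int)Math.Round(cores / (double)CoresPerEngine));
            Console.WriteLine($"{cores} cores -> {engines} epoll engine(s)"); // prints 1, 1, 2, 4
        }
    }
}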

adamsitnik (Author):

TechEmpower is super artificial (very small socket reads and writes and extremely high load), and even under such high load, one engine (producer) is capable of keeping up to 30 CPU cores busy (8 on ARM). This is possible thanks to the amazing work that @kouvel has done in #35330.

In a real-life scenario, nobody should ever need more than one epoll thread for the entire app. But we can't predict all possible usages, and I believe that these numbers (30 & 8) are safe because it would be super hard to generate more network load.

I've simplified the heuristic and added explanation.
