Performance measurements #16791
Comments
I tried to not do locking inside |
It's not file operations, as I suspected. Anyway, if we have 5-6% on char-by-char lowercasing and could get rid of it, it would make sense. |
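As an illustration of how per-character lowercasing cost can sometimes be avoided, here is a minimal sketch that compares two ASCII names case-insensitively in place, without first writing a lowercased copy. The function names ascii_lower and ascii_casecmp are hypothetical and are not OpenSSL APIs; whether this helps the property parser would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII-only lowercase of one byte; non-letters pass through unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/*
 * Case-insensitive comparison of two NUL-terminated ASCII strings.
 * Returns 0 when equal, otherwise the difference between the first
 * mismatching bytes after lowercasing.
 */
int ascii_casecmp(const char *a, const char *b)
{
    const unsigned char *x = (const unsigned char *)a;
    const unsigned char *y = (const unsigned char *)b;

    assert(a != NULL && b != NULL);
    while (*x != '\0' && ascii_lower(*x) == ascii_lower(*y)) {
        x++;
        y++;
    }
    return (int)ascii_lower(*x) - (int)ascii_lower(*y);
}
```

The comparison is locale-independent and touches each byte once, so no temporary buffer is needed.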
Property parsing will always have an overhead but I am surprised it's this significant. |
You could measure it yourself, I may be wrong |
Once I'm not on holiday :) |
Take your time :) |
OK. Next possible approach - 16% of time is used by |
No, looks like it's a wrong idea |
OK. My overall impression is that looking for a single bottleneck here is pointless and that the problem is at the design level: we use a universal schema, and the universality of the schema is the main problem. What can we do to speed up the key/cert loading process?
|
Yes to all these suggestions. The current parsing could be sped up by making a decent guess as to the file type. |
That's not so easy as the file can contain both PEM formatted private keys and certificates. |
BTW, will PEM/DER classification speed the process? |
I think so - perhaps even twice. |
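A rough sketch of the "decent guess" discussed above, assuming the input is already in memory: PEM blocks start with the text "-----BEGIN ", while DER-encoded keys and certificates start with an ASN.1 SEQUENCE tag (byte 0x30). The type encoding_guess and the function guess_encoding are hypothetical names for this illustration, not OpenSSL APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not an OpenSSL type. */
typedef enum { ENC_UNKNOWN, ENC_PEM, ENC_DER } encoding_guess;

/*
 * Guess the encoding from the first bytes of a buffer.
 * PEM: "-----BEGIN " after optional leading whitespace.
 * DER: first byte is the ASN.1 SEQUENCE tag 0x30.
 */
encoding_guess guess_encoding(const unsigned char *buf, size_t len)
{
    static const char pem_prefix[] = "-----BEGIN ";
    const size_t prefix_len = sizeof(pem_prefix) - 1;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    /* Skip leading ASCII whitespace, which PEM files may contain. */
    while (i < len && (buf[i] == ' ' || buf[i] == '\t'
                       || buf[i] == '\r' || buf[i] == '\n'))
        i++;

    if (len - i >= prefix_len
        && memcmp(buf + i, pem_prefix, prefix_len) == 0)
        return ENC_PEM;

    /* DER is binary, so only the very first byte is checked. */
    if (len > 0 && buf[0] == 0x30)
        return ENC_DER;

    return ENC_UNKNOWN;
}
```

This only picks which decoder to try first. As noted above, a PEM file can hold both private keys and certificates, so a PEM result still means each block label has to be examined, and an unknown result should fall back to the existing generic path.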
Finally got around to trying 3.0 in microsoft/msquic and perf dropped across the board in our automation. Handshakes per second dropped by 91%. |
The question is what is done as part of the handshake in your perf test. If that includes setting up a cert store or loading keys, that might be the most probable culprit for most of the performance degradation. |
Does the libssl code do any caching of things like ciphers, or do we do a lookup of them for each connection?
|
They are cached in the SSL_CTX |
Unsure of the scope of this issue, perhaps we can clarify it a bit. @t-j-h mentioned msquic finding a massive slowdown so I'm assuming that's the focus now. I've looked at the code using OpenSSL in that repo and the likely hotspots are probably also affected by my fixes for #17064. @nibanks Would you be able to give this another look based on master but also applying the patches in #17857 and #17862? If there are still issues, profiler output would be an enormous help. |
I'd be happy to run our performance tests again, but I need a branch out of the quictls fork to do so (for the QUIC functionality there). So if you can point me to any repo/branch that has everything I need, I will give it a try. |
@nibanks I've rebased the quictls patches found in quictls branch The catch is that some tests don't pass. Possibly these are false alarms and this blind rebase will work okay. Possibly not. If you could try your performance tests with this branch and let me know the results, it would be a great help. |
Linux MsQuic results are in:
As you can see, handshakes per second has completely tanked (compared to main, using 1.1.1n). The rest seem to be within the realm of noise. |
Thanks for running this again. Since you're focused on it, I'll focus on benchmarking handshakes per second and see what's going on here. |
Thanks @hlandau. I forgot to paste the Windows perf numbers. They show the same HPS drop, but also seem to show a (more than noise) drop in bulk throughput:
|
I think that the only interesting code paths that lead here are the EVP_XXX_do_all_provided() calls. We don't call these from libcrypto or libssl that I can see. The useless work done by the I've half a suspicion that we're mostly not using this as a sparse array -- each provider allocates its internal "nids" sequentially. |
Is the source code for the performance testing tool available? I'm not seeing any calls to the _do_all_provided functions in the MSquic repo. |
The performance data are a little confusing. Why would the kernel schannel be slower than the user space version? Why are Linux results so much slower than Windows for most of the tests? Notably, not for handshakes per second. |
Could the suboptimal throughput on Linux be caused by the UDP protocol implementation inefficiencies in the Linux kernel? |
https://microsoft.github.io/msquic/index.html#methodology
Scheduler complexities. User mode does a better job of parallelizing the threads involved.
@t8m is right and the primary differences are at the UDP layer bottleneck. For HPS, UDP is less of a bottleneck. |
@paulidale The tool is |
Thanks. I've got a few ideas about how to make MsQuic more friendly with OpenSSL 3.0. There are some problems with the current code. The CAPI version pre-fetches all the algorithm implementations, the OpenSSL versions don't. This wasn't a big problem with OpenSSL 1.1.1 but it is a major performance loss for OpenSSL 3.0. The fix is to pre-fetch all the required algorithms in the initialiser and just use these. There are also some issues with the certificates. The CAPI version preloads these into memory, the OpenSSL versions load them from PKCS#12 and PKCS#7 files each time. Loading from file is something OpenSSL 3.0 does rather poorly. |
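The pre-fetch idea can be sketched as "resolve once at initialisation, reuse per connection". In OpenSSL 3.0 the expensive step would be explicit fetch calls such as EVP_MD_fetch() and EVP_CIPHER_fetch(); to keep this sketch self-contained, a hypothetical lookup_algorithm() stands in for them, and the algorithm and prefetched_algorithms types are likewise assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a fetched algorithm handle. */
typedef struct {
    const char *name;
} algorithm;

/* Counts how often the slow path runs, so the effect is observable. */
static int lookup_calls = 0;

/* Hypothetical slow by-name lookup; stands in for an explicit fetch. */
static const algorithm *lookup_algorithm(const char *name)
{
    static const algorithm table[] = { { "SHA256" }, { "AES-128-GCM" } };
    size_t i;

    lookup_calls++;
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Handles resolved once at initialisation and shared by all connections. */
typedef struct {
    const algorithm *hash;
    const algorithm *aead;
} prefetched_algorithms;

/* Returns 1 on success, 0 if any algorithm is unavailable. */
int prefetch_init(prefetched_algorithms *p)
{
    assert(p != NULL);
    p->hash = lookup_algorithm("SHA256");
    p->aead = lookup_algorithm("AES-128-GCM");
    return p->hash != NULL && p->aead != NULL;
}

/* Per-connection setup only reads the cached handle; no lookup happens. */
const algorithm *connection_hash(const prefetched_algorithms *p)
{
    assert(p != NULL);
    return p->hash;
}
```

With this shape, the lookup cost is paid twice at start-up regardless of how many handshakes follow. In real code the fetched handles would also need to be freed at shutdown.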
I'm backtracking from Hugo's mention of ossl_sa_doall_arg() from his profiling. This is only called from the various do_all_provided functions and they are only called from the |
I've rerun the test with current master.
|
@t8m it currently fails to compile :( |
@beldmit yep, sorry I did not try to build it :D Now it builds for me. |
got a segfault:
|
@beldmit There was still one error in the PR. Should be OK now. |
Significantly better!!! |
The next suspicious place is |
I've got some ideas about speeding up |
See #18354 😀 The top two places are now |
It's a bit unclear to me what the status of this issue is currently. It seems to document performance issues, but I can't see what the target performance is, or if the goal was just 'better'. If the latter, we have ongoing performance measurements and improvement goals listed in https://github.com/orgs/openssl/projects/12/views/1 As such, is there more work to do here, or should this be closed? |
no response, marking as inactive, to be closed at the end of 3.4 dev barring further input |
I tried to profile the example from #16540.
We spend ~20% of time in various lock/unlock functions.
Also we spend ~20% of time in ossl_parse_property. It may partially overlap with lock/unlock but looks like a hotspot to me.
Also various ossl_tolower and ossl_ctype_check use 6-7%.