Performance measurements #16791

beldmit · 2021-10-08T19:57:45Z

I tried to profile the example from #16540

We spend ~20% of time in various lock/unlock functions.
Also we spend ~20% of time in ossl_parse_property. It may partially overlap with lock/unlock but looks like a hotspot to me.
Also various ossl_tolower and ossl_ctype_check use 6-7%.

The text was updated successfully, but these errors were encountered:

paulidale · 2021-10-08T23:22:35Z

I tried to not do locking inside ossl_parse_property. It didn't work out :(

beldmit · 2021-10-09T11:34:05Z

It's not file operations, as I suspected. Anyway, if we have 5-6% on char-by-char lowercasing and could get rid of it, it would make sense.

paulidale · 2021-10-09T13:27:13Z

ossl_tolower will be a real function call. I'd have thought that ossl_ctype_check would inline in ossl_tolower. It won't in other places.

Property parsing will always have an overhead but I am surprised it's this significant.

beldmit · 2021-10-09T13:33:39Z

You could measure it yourself, I may be wrong

paulidale · 2021-10-10T07:30:08Z

Once I'm not on holiday :)

beldmit · 2021-10-10T08:19:33Z

Take your time :)

beldmit · 2021-10-12T11:23:24Z

OK. Next possible approach - 16% of time is used by EVP_KEYMGMT_do_all_provided. Could we cache the fetching results produced in this function (and invalidate the cache when we load/unload the provider)? It will matter for the cases when we use long-term ctx with periodical key/cert parsing.

beldmit · 2021-10-12T16:46:50Z

No, looks like it's a wrong idea

beldmit · 2021-10-13T14:05:25Z

OK. I got an overall impression that it's totally useless to look for a bottleneck here and the problem is design-level. I've got an impression that we use a universal schema, and universality of the schema is the main problem.

What can we do to speed up the key/cert loading process?

Format hints (DER/PEM/3rdparty)
File type hints (key/cert/p12/req/crl/etc)
Algorithm hints
???

paulidale · 2021-10-25T08:20:01Z

Yes to all these suggestions. The current parsing could be sped up by making a decent guess as to the file type.

t8m · 2021-10-25T13:10:57Z

The current parsing could be sped up by making a decent guess as to the file type.

That's not so easy as the file can contain both PEM formatted private keys and certificates.

beldmit · 2021-10-25T13:13:11Z

BTW, will PEM/DER classification speed the process?

t8m · 2021-10-25T13:34:49Z

BTW, will PEM/DER classification speed the process?

I think so - perhaps even twice.

nibanks · 2021-11-17T19:43:45Z

Finally got around to trying 3.0 in microsoft/msquic and perf dropped across the board in our automation. Handshakes per seconds dropped by 91%.

t8m · 2021-11-18T11:14:31Z

Handshakes per seconds dropped by 91%.

The question is what is done as part of the handshake in your perf test. If that includes setting up a cert store or loading keys, that might be the most probable culprit for most of the performance degradation.

kroeckx · 2021-11-20T09:54:13Z

Does the libssl code do any caching of things like ciphers, or do we do a lookup of them for each connection?

mattcaswell · 2021-11-20T15:35:31Z

Does the libssl code do any caching of things like ciphers, or do we do a lookup of them for each connection?

They are cached in the SSL_CTX

hlandau · 2022-03-10T13:25:10Z

Unsure of the scope of this issue, perhaps we can clarify it a bit.

@t-j-h mentioned msquic finding a massive slowdown so I'm assuming that's the focus now. I've looked at the code using OpenSSL in that repo and the likely hotspots are probably also affected by my fixes to #17064.

@nibanks Would you be able to give this another look based on master but also applying the patches in #17857 and #17862? If there are still issues profiler output would be an enormous help.

nibanks · 2022-03-10T13:58:33Z

I'd be happy to run our performance tests again, but I need a branch out of the quictls fork to do so (for the QUIC functionality there). So if you can point me to any repo/branch that has everything I need, I will give it a try.

hlandau · 2022-03-15T08:16:52Z

@nibanks I've rebased the quictls patches found in quictls branch openssl-3.0.1+quic on top of master and applied possible performance fixes on top. You can find this branch here: https://github.com/hlandau/openssl/tree/openssl-master%2Bquic%2Bperf (openssl-master+quic+perf)

The catch is that some tests don't pass. Possibly these are false alarms and this blind rebase will work okay. Possibly not. If you could try your performance tests with this branch and let me know the results, it would be a great help.

nibanks · 2022-03-15T16:48:57Z

Linux MsQuic results are in:

Running Test ThroughputUp_linux_x64_openssl_Default
Run 1: 3842092 kbps
Run 2: 4865925 kbps
Run 3: 4826690 kbps
Run 4: 4808631 kbps
Run 5: 4815566 kbps
Median: 4815566 kbps (+29.88%)
Remote: 3707803.8 kbps

Running Test ThroughputUp_linux_x64_openssl_Encryption_Off
Run 1: 4514885 kbps
Run 2: 4462798 kbps
Run 3: 4481542 kbps
Run 4: 4469929 kbps
Run 5: 4452209 kbps
Median: 4469929 kbps (+7.66%)
Remote: 4151898.8 kbps

Running Test ThroughputUp_linux_x64_openssl_SendBuffering_On
Run 1: 2378973 kbps
Run 2: 4777578 kbps
Run 3: 3871649 kbps
Run 4: 1982995 kbps
Run 5: 1962396 kbps
Median: 2378973 kbps (-36.27%)
Remote: 3732678.2 kbps

Running Test ThroughputDown_linux_x64_openssl_Default
Run 1: 5386666 kbps
Run 2: 5338299 kbps
Run 3: 5397644 kbps
Run 4: 5422802 kbps
Run 5: 5478371 kbps
Median: 5397644 kbps (-3.53%)
Remote: 5595371.2 kbps

Running Test ThroughputDown_linux_x64_openssl_Encryption_Off
Run 1: 5863196 kbps
Run 2: 4771436 kbps
Run 3: 5726993 kbps
Run 4: 5064485 kbps
Run 5: 5809322 kbps
Median: 5726993 kbps (+4.71%)
Remote: 5469430.8 kbps

Running Test RPS_linux_x64_openssl_Default
Run 1: 539711 RPS, Minimum: 97, Maximum: 104421, Percentiles: 50th: 4807.000000, 90th: 17080.000000, 99th: 31300.000000, 99.9th: 52861.000000, 99.99th: 79838.000000, 99.999th: 89574.000000, 99.9999th: 101131.000000, Standard Error: 3.069512
Run 2: 553974 RPS, Minimum: 91, Maximum: 347885, Percentiles: 50th: 4315.000000, 90th: 12668.000000, 99th: 152289.000000, 99.9th: 224679.000000, 99.99th: 270892.000000, 99.999th: 304910.000000, 99.9999th: 324628.000000, Standard Error: 9.324490
Run 3: 537314 RPS, Minimum: 95, Maximum: 86029, Percentiles: 50th: 4784.000000, 90th: 17057.000000, 99th: 33439.000000, 99.9th: 51340.000000, 99.99th: 65976.000000, 99.999th: 76323.000000, 99.9999th: 83895.000000, Standard Error: 3.136500
Run 4: 539245 RPS, Minimum: 99, Maximum: 77559, Percentiles: 50th: 5752.000000, 90th: 17596.000000, 99th: 31658.000000, 99.9th: 45853.000000, 99.99th: 58550.000000, 99.999th: 72062.000000, 99.9999th: 76046.000000, Standard Error: 3.055176
Run 5: 540020 RPS, Minimum: 91, Maximum: 377582, Percentiles: 50th: 4221.000000, 90th: 12394.000000, 99th: 139377.000000, 99.9th: 222810.000000, 99.99th: 287493.000000, 99.999th: 335810.000000, 99.9999th: 363626.000000, Standard Error: 8.677806
Median: 539711 RPS (+20.65%)
Remote: 447351.6 RPS

Running Test RPS_linux_x64_openssl_ConnectionCount_40
Run 1: 113457 RPS, Minimum: 81, Maximum: 14568, Percentiles: 50th: 154.000000, 90th: 201.000000, 99th: 310.000000, 99.9th: 414.000000, 99.99th: 515.000000, 99.999th: 1931.000000, 99.9999th: 8729.000000, Standard Error: 0.038857
Run 2: 118053 RPS, Minimum: 82, Maximum: 8261, Percentiles: 50th: 152.000000, 90th: 188.000000, 99th: 261.000000, 99.9th: 326.000000, 99.99th: 459.000000, 99.999th: 1750.000000, 99.9999th: 3391.000000, Standard Error: 0.029159
Run 3: 41014 RPS, Minimum: 91, Maximum: 2865, Percentiles: 50th: 588.000000, 90th: 647.000000, 99th: 692.000000, 99.9th: 1145.000000, 99.99th: 1511.000000, 99.999th: 2623.000000, 99.9999th: 2865.000000, Standard Error: 0.334432
Run 4: 49062 RPS, Minimum: 86, Maximum: 7400, Percentiles: 50th: 568.000000, 90th: 619.000000, 99th: 663.000000, 99.9th: 861.000000, 99.99th: 1592.000000, 99.999th: 7006.000000, 99.9999th: 7400.000000, Standard Error: 0.340066
Run 5: 120710 RPS, Minimum: 83, Maximum: 12170, Percentiles: 50th: 150.000000, 90th: 179.000000, 99th: 230.000000, 99.9th: 287.000000, 99.99th: 422.000000, 99.999th: 1762.000000, 99.9999th: 12143.000000, Standard Error: 0.027000
Median: 113457 RPS (+33.84%)
Remote: 84770.2 RPS

Running Test RPS_linux_x64_openssl_ResponseSize_16384
Run 1: 214419 RPS, Minimum: 248, Maximum: 1201091, Percentiles: 50th: 19239.000000, 90th: 55382.000000, 99th: 418472.000000, 99.9th: 748240.000000, 99.99th: 979578.000000, 99.999th: 1096285.000000, 99.9999th: 1155963.000000, Standard Error: 46.297788
Run 2: 219736 RPS, Minimum: 191, Maximum: 641189, Percentiles: 50th: 23737.000000, 90th: 64601.000000, 99th: 191832.000000, 99.9th: 372919.000000, 99.99th: 535361.000000, 99.999th: 597173.000000, 99.9999th: 626192.000000, Standard Error: 25.005669
Run 3: 221352 RPS, Minimum: 226, Maximum: 1035258, Percentiles: 50th: 19308.000000, 90th: 53995.000000, 99th: 421316.000000, 99.9th: 661036.000000, 99.99th: 829205.000000, 99.999th: 947996.000000, 99.9999th: 1005580.000000, Standard Error: 41.980364
Run 4: 228267 RPS, Minimum: 190, Maximum: 663611, Percentiles: 50th: 24714.000000, 90th: 57724.000000, 99th: 183324.000000, 99.9th: 371716.000000, 99.99th: 544988.000000, 99.999th: 608855.000000, 99.9999th: 657094.000000, Standard Error: 22.776080
Run 5: 211166 RPS, Minimum: 201, Maximum: 1113881, Percentiles: 50th: 20807.000000, 90th: 56413.000000, 99th: 411633.000000, 99.9th: 663276.000000, 99.99th: 855654.000000, 99.999th: 1031836.000000, 99.9999th: 1110616.000000, Standard Error: 43.415396
Median: 219736 RPS (-27.59%)
Remote: 303458 RPS

Running Test RPS_linux_x64_openssl_ResponseSize_512
Run 1: 725935 RPS, Minimum: 67, Maximum: 201914, Percentiles: 50th: 2766.000000, 90th: 9670.000000, 99th: 75134.000000, 99.9th: 136007.000000, 99.99th: 165778.000000, 99.999th: 188739.000000, 99.9999th: 198668.000000, Standard Error: 4.468496
Run 2: 723964 RPS, Minimum: 70, Maximum: 268382, Percentiles: 50th: 2819.000000, 90th: 9499.000000, 99th: 74394.000000, 99.9th: 167234.000000, 99.99th: 199935.000000, 99.999th: 228250.000000, 99.9999th: 249506.000000, Standard Error: 5.049488
Run 3: 713852 RPS, Minimum: 71, Maximum: 177929, Percentiles: 50th: 2983.000000, 90th: 9556.000000, 99th: 85119.000000, 99.9th: 129854.000000, 99.99th: 156169.000000, 99.999th: 167951.000000, 99.9999th: 174609.000000, Standard Error: 4.637166
Run 4: 720172 RPS, Minimum: 66, Maximum: 100280, Percentiles: 50th: 2907.000000, 90th: 8777.000000, 99th: 20920.000000, 99.9th: 39873.000000, 99.99th: 61018.000000, 99.999th: 86600.000000, 99.9999th: 95354.000000, Standard Error: 1.592973
Run 5: 711167 RPS, Minimum: 65, Maximum: 72724, Percentiles: 50th: 3056.000000, 90th: 12520.000000, 99th: 29959.000000, 99.9th: 45390.000000, 99.99th: 56679.000000, 99.999th: 66030.000000, 99.9999th: 69472.000000, Standard Error: 2.257471
Median: 720172 RPS (+7.01%)
Remote: 673022 RPS

Running Test RPS_linux_x64_openssl_ResponseSize_0
Test Run Took 00:00:15.3082504
Run 1: 764907 RPS, Minimum: 59, Maximum: 217505, Percentiles: 50th: 2544.000000, 90th: 8947.000000, 99th: 45943.000000, 99.9th: 119158.000000, 99.99th: 169282.000000, 99.999th: 192823.000000, 99.9999th: 205599.000000, Standard Error: 3.168612
Run 2: 768485 RPS, Minimum: 68, Maximum: 180413, Percentiles: 50th: 2628.000000, 90th: 8107.000000, 99th: 46570.000000, 99.9th: 93628.000000, 99.99th: 137912.000000, 99.999th: 154694.000000, 99.9999th: 171268.000000, Standard Error: 2.885535
Run 3: 809245 RPS, Minimum: 68, Maximum: 212031, Percentiles: 50th: 2366.000000, 90th: 8491.000000, 99th: 57595.000000, 99.9th: 144965.000000, 99.99th: 175184.000000, 99.999th: 191395.000000, 99.9999th: 203663.000000, Standard Error: 4.019140
Run 4: 733329 RPS, Minimum: 72, Maximum: 59529, Percentiles: 50th: 3166.000000, 90th: 10865.000000, 99th: 23956.000000, 99.9th: 36182.000000, 99.99th: 45810.000000, 99.999th: 53651.000000, 99.9999th: 58329.000000, Standard Error: 1.832538
Run 5: 746098 RPS, Minimum: 60, Maximum: 236235, Percentiles: 50th: 2257.000000, 90th: 8812.000000, 99th: 34274.000000, 99.9th: 132696.000000, 99.99th: 176983.000000, 99.999th: 206747.000000, 99.9999th: 223703.000000, Standard Error: 3.362937
Median: 764907 RPS (+12.53%)
Remote: 679706.2 RPS

Running Test HPS_linux_x64_openssl_Default
Run 1: 469 HPS
Run 2: 477 HPS
Run 3: 480 HPS
Run 4: 480 HPS
Run 5: 472 HPS
Median: 477 HPS (-96.25%)
Remote: 12732.8 HPS

As you can see, handshakes per second has completely tanked (compared to main, using 1.1.1.1n). The rest seem to be within the realm of noise.

hlandau · 2022-03-16T16:16:23Z

Thanks for running this again. Since you're focused on it, I'll focus on benchmarking handshakes per second and see what's going on here.

nibanks · 2022-03-16T16:37:47Z

Thanks @hlandau. I forgot to paste the Windows perf numbers. They show the same HPS drop, but also seem to show a (more than noise) drop in bulk throughput:

Running Test ThroughputUp_Windows_x64_openssl_Default
Run 1: 5798084 kbps
Run 2: 5788188 kbps
Run 3: 5808107 kbps
Run 4: 5799065 kbps
Run 5: 5809729 kbps
Median: 5799065 kbps (-13.37%)
Remote: 6693896 kbps

Running Test ThroughputUp_Windows_x64_openssl_SendBuffering_On
Run 1: 5458144 kbps
Run 2: 5422203 kbps
Run 3: 5459794 kbps
Run 4: 5447238 kbps
Run 5: 5411963 kbps
Median: 5447238 kbps (-10.8%)
Remote: 6106504.8 kbps

Running Test ThroughputDown_Windows_x64_openssl_Default
Run 1: 6059683 kbps
Run 2: 6066493 kbps
Run 3: 6050032 kbps
Run 4: 6061678 kbps
Run 5: 6040359 kbps
Median: 6059683 kbps (-13.3%)
Remote: 6989342.8 kbps

Running Test RPS_Windows_x64_openssl_Default
Run 1: 702916 RPS, Minimum: 107, Maximum: 87670, Percentiles: 50th: 5293.000000, 90th: 29576.000000, 99th: 54690.000000, 99.9th: 68528.000000, 99.99th: 76136.000000, 99.999th: 81001.000000, 99.9999th: 84587.000000, Standard Error: 4.641954
Run 2: 738797 RPS, Minimum: 150, Maximum: 77289, Percentiles: 50th: 7381.000000, 90th: 21600.000000, 99th: 43992.000000, 99.9th: 51624.000000, 99.99th: 59004.000000, 99.999th: 70428.000000, 99.9999th: 74614.000000, Standard Error: 3.353766
Run 3: 674636 RPS, Minimum: 109, Maximum: 118990, Percentiles: 50th: 4836.000000, 90th: 30518.000000, 99th: 60506.000000, 99.9th: 71434.000000, 99.99th: 93582.000000, 99.999th: 112494.000000, 99.9999th: 117060.000000, Standard Error: 5.141409
Run 4: 678832 RPS, Minimum: 135, Maximum: 85750, Percentiles: 50th: 4863.000000, 90th: 30710.000000, 99th: 59825.000000, 99.9th: 70858.000000, 99.99th: 76907.000000, 99.999th: 81130.000000, 99.9999th: 84027.000000, Standard Error: 5.130757
Run 5: 685388 RPS, Minimum: 137, Maximum: 104925, Percentiles: 50th: 5551.000000, 90th: 28566.000000, 99th: 61157.000000, 99.9th: 86962.000000, 99.99th: 94950.000000, 99.999th: 99900.000000, 99.9999th: 103916.000000, Standard Error: 4.909168
Median: 685388 RPS (-20.74%)
Remote: 864723.8 RPS

Running Test RPS_Windows_x64_openssl_ConnectionCount_40
Run 1: 126410 RPS, Minimum: 78, Maximum: 22966, Percentiles: 50th: 142.000000, 90th: 187.000000, 99th: 327.000000, 99.9th: 462.000000, 99.99th: 1384.000000, 99.999th: 4090.000000, 99.9999th: 22618.000000, Standard Error: 0.054350
Run 2: 126218 RPS, Minimum: 75, Maximum: 3181, Percentiles: 50th: 142.000000, 90th: 193.000000, 99th: 334.000000, 99.9th: 462.000000, 99.99th: 586.000000, 99.999th: 1665.000000, 99.9999th: 2937.000000, Standard Error: 0.038540
Run 3: 116020 RPS, Minimum: 75, Maximum: 3209, Percentiles: 50th: 150.000000, 90th: 231.000000, 99th: 365.000000, 99.9th: 493.000000, 99.99th: 624.000000, 99.999th: 2087.000000, 99.9999th: 2901.000000, Standard Error: 0.050001
Run 4: 127878 RPS, Minimum: 76, Maximum: 3261, Percentiles: 50th: 140.000000, 90th: 188.000000, 99th: 327.000000, 99.9th: 451.000000, 99.99th: 565.000000, 99.999th: 1553.000000, 99.9999th: 2519.000000, Standard Error: 0.036321
Run 5: 126053 RPS, Minimum: 75, Maximum: 3103, Percentiles: 50th: 142.000000, 90th: 190.000000, 99th: 333.000000, 99.9th: 458.000000, 99.99th: 582.000000, 99.999th: 1798.000000, 99.9999th: 3064.000000, Standard Error: 0.037978
Median: 126218 RPS (-1.27%)
Remote: 127835.6 RPS

Running Test RPS_Windows_x64_openssl_ResponseSize_0
Run 1: 802405 RPS, Minimum: 83, Maximum: 446519, Percentiles: 50th: 1798.000000, 90th: 6828.000000, 99th: 73675.000000, 99.9th: 225062.000000, 99.99th: 341388.000000, 99.999th: 441596.000000, 99.9999th: 445934.000000, Standard Error: 5.789335
Run 2: 838865 RPS, Minimum: 99, Maximum: 573136, Percentiles: 50th: 1833.000000, 90th: 6163.000000, 99th: 55596.000000, 99.9th: 190140.000000, 99.99th: 293309.000000, 99.999th: 452717.000000, 99.9999th: 572797.000000, Standard Error: 4.489191
Run 3: 754607 RPS, Minimum: 101, Maximum: 491969, Percentiles: 50th: 2014.000000, 90th: 8935.000000, 99th: 90891.000000, 99.9th: 261636.000000, 99.99th: 403768.000000, 99.999th: 446391.000000, 99.9999th: 480124.000000, Standard Error: 7.037936
Run 4: 747882 RPS, Minimum: 94, Maximum: 498181, Percentiles: 50th: 2235.000000, 90th: 9995.000000, 99th: 85762.000000, 99.9th: 237168.000000, 99.99th: 356854.000000, 99.999th: 481589.000000, 99.9999th: 495017.000000, Standard Error: 6.594846
Run 5: 848877 RPS, Minimum: 100, Maximum: 401208, Percentiles: 50th: 1836.000000, 90th: 6161.000000, 99th: 60749.000000, 99.9th: 207568.000000, 99.99th: 304262.000000, 99.999th: 381185.000000, 99.9999th: 400570.000000, Standard Error: 4.818216
Median: 802405 RPS (-.79%)
Remote: 808812.6 RPS

Running Test RPS_Windows_x64_openssl_ResponseSize_16384
Run 1: 199163 RPS, Minimum: 203, Maximum: 289321, Percentiles: 50th: 4840.000000, 90th: 131384.000000, 99th: 230626.000000, 99.9th: 277283.000000, 99.99th: 283328.000000, 99.999th: 286247.000000, 99.9999th: 288470.000000, Standard Error: 41.672038
Run 2: 240169 RPS, Minimum: 216, Maximum: 249727, Percentiles: 50th: 27129.000000, 90th: 57260.000000, 99th: 90407.000000, 99.9th: 135101.000000, 99.99th: 192111.000000, 99.999th: 221562.000000, 99.9999th: 246971.000000, Standard Error: 12.666900
Run 3: 237956 RPS, Minimum: 263, Maximum: 236295, Percentiles: 50th: 26102.000000, 90th: 61419.000000, 99th: 107298.000000, 99.9th: 137453.000000, 99.99th: 162757.000000, 99.999th: 213933.000000, 99.9999th: 235254.000000, Standard Error: 14.637100
Run 4: 201606 RPS, Minimum: 259, Maximum: 391733, Percentiles: 50th: 5599.000000, 90th: 123610.000000, 99th: 289075.000000, 99.9th: 369805.000000, 99.99th: 381441.000000, 99.999th: 388321.000000, 99.9999th: 390906.000000, Standard Error: 43.795195
Run 5: 224036 RPS, Minimum: 271, Maximum: 459296, Percentiles: 50th: 16370.000000, 90th: 84444.000000, 99th: 235744.000000, 99.9th: 434752.000000, 99.99th: 446326.000000, 99.999th: 452878.000000, 99.9999th: 455388.000000, Standard Error: 30.484848
Median: 224036 RPS (-34.11%)
Remote: 340031 RPS

Running Test RPS_Windows_x64_openssl_ResponseSize_512
Run 1: 824774 RPS, Minimum: 95, Maximum: 424464, Percentiles: 50th: 1744.000000, 90th: 6767.
Run 2: 861357 RPS, Minimum: 99, Maximum: 443313, Percentiles: 50th: 1793.000000, 90th: 6886.000000, 99th: 81085.000000, 99.9th: 225820.000000, 99.99th: 312857.000000, 99.999th: 410391.000000, 99.9999th: 429913.000000, Standard Error: 5.657962
Run 3: 882379 RPS, Minimum: 91, Maximum: 462740, Percentiles: 50th: 1835.000000, 90th: 6953.000000, 99th: 74517.000000, 99.9th: 234536.000000, 99.99th: 339978.000000, 99.999th: 411907.000000, 99.9999th: 460107.000000, Standard Error: 5.502248
Run 4: 842740 RPS, Minimum: 106, Maximum: 546870, Percentiles: 50th: 1884.000000, 90th: 6748.000000, 99th: 66105.000000, 99.9th: 221690.000000, 99.99th: 392442.000000, 99.999th: 461417.000000, 99.9999th: 480632.000000, Standard Error: 5.348733
Run 5: 868326 RPS, Minimum: 96, Maximum: 546212, Percentiles: 50th: 1817.000000, 90th: 6081.000000, 99th: 66719.000000, 99.9th: 219037.000000, 99.99th: 351700.000000, 99.999th: 446988.000000, 99.9999th: 535555.000000, Standard Error: 5.176615
Median: 861357 RPS (+5.98%)
Remote: 812763.8 RPS

Running Test HPS_Windows_x64_openssl_Default
Run 1: 788 HPS
Run 2: 759 HPS
Run 3: 751 HPS
Run 4: 785 HPS
Run 5: 829 HPS
Median: 785 HPS (-87.31%)
Remote: 6187 HPS

paulidale · 2022-03-22T01:41:39Z

The other major bottleneck is ossl_sa_doall_arg (10.7% time).

I think that the only interesting code paths that lead here are the EVP_XXX_do_all_provided() calls. We don't call these from libcrypto or libssl that I can see.

The useless work done by the ossl_sa_doall_arg function can be reduced by setting OPENSSL_SA_BLOCK_BITS to a smaller value (the default is 12, which results in 4k blocks). -DOPENSSL_SA_BLOCK_BITS=4 on the configure line should be enough to test things. I guessed appropriate values for these way back when I implemented this -- so likely they can be tuned.

I've half a suspicion that we're mostly not using this as a sparse array -- each provider allocations it's internal "nids" sequentially.

paulidale · 2022-03-23T01:45:27Z

Is the source code for the performance testing tool available?

I'm not seeing any calls to the _do_all_provided functions in the MSquic repo.

paulidale · 2022-03-23T01:48:24Z

The performance data are a little confusing.

Why would the kernel schannel be slower than the user space version?

Why are Linux results so much slower than Windows for most of the tests? Notably, not for handshakes per second.

t8m · 2022-03-23T08:53:12Z

Could the suboptimal throughput on Linux be caused by the UDP protocol implementation inefficiencies in the Linux kernel?

nibanks · 2022-03-23T09:05:35Z

Is the source code for the performance testing tool available?

https://microsoft.github.io/msquic/index.html#methodology

Why would the kernel schannel be slower than the user space version?

Scheduler complexities. User mode does a better job of parallelizing the threads involved.

Why are Linux results so much slower than Windows for most of the tests? Notably, not for handshakes per second.

@t8m is right and the primary differences are at the UDP layer bottleneck. For HPS, UDP is not the bottleneck as much.

hlandau · 2022-03-24T06:42:05Z

@paulidale The tool is ./bin/Release/secnetperf in the build directory. I was able to build it with cmake -DQUIC_BUILD_PERF=ON. I don't see any string _do_all_provided in the msquic repo, where are you seeing it?

paulidale · 2022-03-24T07:24:12Z

Thanks. I've got a few ideas about how to make MsQuic more friendly with OpenSSL 3.0.

There are some problems with the current code. The CAPI version pre-fetches all the algorithm implementations, the OpenSSL versions don't. This wasn't a big problem with OpenSSL 1.1.1 but it is a major performance loss for OpenSSL 3.0. I.e. pre-fecth all the required algorithms in the initialiser and just use these.

There are also some issues with the certificates. The CAPI version preloads these into memory, the OpenSSL versions load them from PKCS#12 and PKCS#7 files each time. Loading from file is something OpenSSL 3.0 does rather poorly.

paulidale · 2022-03-25T01:07:58Z

I'm backtracking from Hugo's mentioned of ossl_sa_doall_arg() from his profiling. This is only called from the various do_all_provided functions and they are only called from the list command.

beldmit · 2022-05-18T11:20:03Z

I've rerun the test with current master.
Don't see any locking now.

openssl_namemap_name2num_n looks suspicious because of many CRYPTO_strdup/CRYPTO_free operations.
ossl_tolower takes ~5%, ossl_ctype_check takes ~2% and looks to be invoked only from ossl_tolower

t8m · 2022-05-18T14:48:21Z

@beldmit, could you try with #18341? At least it eliminates the strndup when the _n variant is not used.

beldmit · 2022-05-18T14:56:00Z

@t8m it currently fails to compile :(

t8m · 2022-05-18T15:13:25Z

@beldmit yep, sorry I did not try to build it :D Now it builds for me.

beldmit · 2022-05-18T15:25:21Z

got a segfault:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7cb283d in ossl_namemap_add_names (namemap=0x406200, number=number@entry=0, names=0x7ffff7eb64de "TLS1-PRF", separator=separator@entry=58 ':') at crypto/core_namemap.c:323
323	    for (p = names; *p != '\0'; p = (q == NULL ? p + l : q + 1)) {
(gdb) bt
#0  0x00007ffff7cb283d in ossl_namemap_add_names (namemap=0x406200, number=number@entry=0, names=0x7ffff7eb64de "TLS1-PRF", separator=separator@entry=58 ':') at crypto/core_namemap.c:323
#1  0x00007ffff7c8be10 in construct_evp_method (algodef=0x7ffff7fab580 <deflt_keymgmt+320>, prov=0x4212c0, data=0x7fffffffd2c0) at crypto/evp/evp_fetch.c:191
#2  0x00007ffff7cb2235 in ossl_method_construct_this (provider=0x4212c0, algo=0x7ffff7fab580 <deflt_keymgmt+320>, no_store=0, cbdata=0x7fffffffd210) at crypto/core_fetch.c:84
#3  0x00007ffff7cb2015 in algorithm_do_map (cbdata=0x7fffffffd1b0, no_store=0, cur_operation=10, map=0x7ffff7fab440 <deflt_keymgmt>, provider=0x4212c0) at crypto/core_algorithm.c:68
#4  algorithm_do_this (provider=0x4212c0, cbdata=cbdata@entry=0x7fffffffd1b0) at crypto/core_algorithm.c:110
#5  0x00007ffff7cc378b in ossl_provider_doall_activated (ctx=<optimized out>, cb=cb@entry=0x7ffff7cb1f40 <algorithm_do_this>, cbdata=cbdata@entry=0x7fffffffd1b0) at crypto/provider_core.c:1404
#6  0x00007ffff7cb215f in ossl_algorithm_do_all (libctx=0x0, operation_id=<optimized out>, provider=<optimized out>, pre=pre@entry=0x7ffff7cb2340 <ossl_method_construct_precondition>, 
    fn=fn@entry=0x7ffff7cb2210 <ossl_method_construct_this>, post=post@entry=0x7ffff7cb22c0 <ossl_method_construct_postcondition>, data=0x7fffffffd210) at crypto/core_algorithm.c:146
#7  0x00007ffff7cb2430 in ossl_method_construct (libctx=<optimized out>, operation_id=operation_id@entry=10, provider_rw=provider_rw@entry=0x7fffffffd288, force_store=force_store@entry=0, mcm=mcm@entry=0x7fffffffd290, 
    mcm_data=mcm_data@entry=0x7fffffffd2c0) at crypto/core_fetch.c:144
#8  0x00007ffff7c8d1f4 in inner_evp_generic_fetch (free_method=0x7ffff7c95130 <EVP_KEYMGMT_free>, up_ref_method=0x7ffff7c95120 <EVP_KEYMGMT_up_ref>, new_method=0x7ffff7c951b0 <keymgmt_from_algorithm>, properties=0x0, name=0x0, 
    operation_id=10, prov=<optimized out>, methdata=0x7fffffffd2c0) at crypto/evp/evp_fetch.c:288
#9  evp_generic_do_all (libctx=libctx@entry=0x0, operation_id=operation_id@entry=10, user_fn=user_fn@entry=0x7ffff7c5b070 <collect_keymgmt>, user_arg=user_arg@entry=0x7fffffffd390, 
    new_method=new_method@entry=0x7ffff7c951b0 <keymgmt_from_algorithm>, up_ref_method=up_ref_method@entry=0x7ffff7c95120 <EVP_KEYMGMT_up_ref>, free_method=0x7ffff7c95130 <EVP_KEYMGMT_free>) at crypto/evp/evp_fetch.c:571
#10 0x00007ffff7c95748 in EVP_KEYMGMT_do_all_provided (libctx=libctx@entry=0x0, fn=fn@entry=0x7ffff7c5b070 <collect_keymgmt>, arg=arg@entry=0x7fffffffd390) at crypto/evp/keymgmt_meth.c:279
#11 0x00007ffff7c5b40e in ossl_decoder_ctx_setup_for_pkey (ctx=ctx@entry=0x407b20, pkey=pkey@entry=0x406aa0, keytype=keytype@entry=0x7fffffffd480 "rsaEncryption", libctx=libctx@entry=0x0, propquery=propquery@entry=0x0)
    at crypto/encode_decode/decoder_pkey.c:436
#12 0x00007ffff7c5b5e4 in OSSL_DECODER_CTX_new_for_pkey (pkey=pkey@entry=0x406aa0, input_type=input_type@entry=0x7ffff7ed363b "DER", input_structure=input_structure@entry=0x7ffff7ebf13a "SubjectPublicKeyInfo", 
    keytype=keytype@entry=0x7fffffffd480 "rsaEncryption", selection=selection@entry=134, libctx=0x0, propquery=0x0) at crypto/encode_decode/decoder_pkey.c:499
#13 0x00007ffff7e1e899 in x509_pubkey_ex_d2i_ex (pval=<optimized out>, in=<optimized out>, len=<optimized out>, it=<optimized out>, tag=<optimized out>, aclass=<optimized out>, opt=0 '\000', ctx=0x7fffffffd8b0, libctx=0x0, propq=0x0)
    at crypto/x509/x_pubkey.c:207
#14 0x00007ffff7b838f6 in asn1_item_embed_d2i (pval=pval@entry=0x4076a0, in=in@entry=0x7fffffffd610, len=<optimized out>, len@entry=421, it=0x7ffff7f83060 <local_it>, tag=tag@entry=-1, aclass=aclass@entry=0, opt=0 '\000', 
    ctx=0x7fffffffd8b0, depth=3, libctx=0x0, propq=0x0) at crypto/asn1/tasn_dec.c:262
#15 0x00007ffff7b8444e in asn1_template_noexp_d2i (val=0x4076a0, in=0x7fffffffd6c8, len=<optimized out>, tt=0x7ffff7fa8ad0 <X509_CINF_seq_tt+240>, opt=opt@entry=0 '\000', ctx=ctx@entry=0x7fffffffd8b0, depth=2, libctx=0x0, propq=0x0)
    at crypto/asn1/tasn_dec.c:682
#16 0x00007ffff7b84797 in asn1_template_ex_d2i (val=val@entry=0x4076a0, in=in@entry=0x7fffffffd6c8, inlen=inlen@entry=421, tt=tt@entry=0x7ffff7fa8ad0 <X509_CINF_seq_tt+240>, opt=<optimized out>, ctx=ctx@entry=0x7fffffffd8b0, 
    depth=<optimized out>, libctx=<optimized out>, propq=<optimized out>) at crypto/asn1/tasn_dec.c:558
#17 0x00007ffff7b83ae7 in asn1_item_embed_d2i (pval=pval@entry=0x7fffffffd768, in=in@entry=0x7fffffffd770, len=<optimized out>, len@entry=805, it=0x7ffff7f83280 <local_it>, tag=<optimized out>, tag@entry=-1, aclass=<optimized out>, 
    aclass@entry=0, opt=0 '\000', ctx=0x7fffffffd8b0, depth=2, libctx=0x0, propq=0x0) at crypto/asn1/tasn_dec.c:422
#18 0x00007ffff7b8444e in asn1_template_noexp_d2i (val=0x7fffffffd768, in=0x7fffffffd828, len=<optimized out>, tt=0x7ffff7fa8960 <X509_seq_tt>, opt=opt@entry=0 '\000', ctx=ctx@entry=0x7fffffffd8b0, depth=1, libctx=0x0, propq=0x0)
    at crypto/asn1/tasn_dec.c:682
#19 0x00007ffff7b84797 in asn1_template_ex_d2i (val=val@entry=0x407650, in=in@entry=0x7fffffffd828, inlen=inlen@entry=805, tt=tt@entry=0x7ffff7fa8960 <X509_seq_tt>, opt=<optimized out>, ctx=ctx@entry=0x7fffffffd8b0, 
    depth=<optimized out>, libctx=<optimized out>, propq=<optimized out>) at crypto/asn1/tasn_dec.c:558
#20 0x00007ffff7b83ae7 in asn1_item_embed_d2i (pval=pval@entry=0x7fffffffd940, in=<optimized out>, len=<optimized out>, it=it@entry=0x7ffff7f83240 <local_it>, tag=<optimized out>, tag@entry=-1, aclass=<optimized out>, aclass@entry=0, 
    opt=0 '\000', ctx=0x7fffffffd8b0, depth=1, libctx=0x0, propq=0x0) at crypto/asn1/tasn_dec.c:422
#21 0x00007ffff7b84bdf in asn1_item_ex_d2i_intern (propq=<optimized out>, libctx=<optimized out>, ctx=0x7fffffffd8b0, opt=0 '\000', aclass=0, tag=-1, it=0x7ffff7f83240 <local_it>, len=<optimized out>, in=<optimized out>, 
    pval=<optimized out>) at crypto/asn1/tasn_dec.c:118
#22 ASN1_item_d2i_ex (pval=<optimized out>, in=<optimized out>, len=<optimized out>, it=0x7ffff7f83240 <local_it>, libctx=<optimized out>, propq=<optimized out>) at crypto/asn1/tasn_dec.c:144
#23 0x00007ffff7d89809 in PEM_ASN1_read_bio (d2i=0x7ffff7e208e0 <d2i_X509>, name=<optimized out>, bp=<optimized out>, x=<optimized out>, cb=<optimized out>, u=<optimized out>) at crypto/pem/pem_oth.c:31
#24 0x00000000004012cc in readx509 ()
#25 0x00000000004016f9 in main ()

t8m · 2022-05-18T15:42:51Z

@beldmit There was still one error in the PR. Should be OK now.

beldmit · 2022-05-18T16:06:11Z

Significantly better!!!

beldmit · 2022-05-18T16:11:45Z

The next suspicious place is ossl_lh_strcasehash but it is not so obvious.

paulidale · 2022-05-19T00:11:37Z

ossl_tolower takes ~5%, ossl_ctype_check takes ~2% and looks to be invoked only from ossl_tolower

I've got some ideas about speeding up ossl_tolower and ossl_toupper...

paulidale · 2022-05-20T00:31:23Z

The next suspicious place is ossl_lh_strcasehash but it is not so obvious.

See #18354 😀

The top two places are now OPENSSL_LH_retrieve and malloc.

nhorman · 2024-06-10T00:51:54Z

Its a bit unclear to me what the status of this issue is currently? It seems to document performance issues, But I can't see what the target performance is, or if the goal was just 'better' If the later, we have ongoing performance measurements and improvement goals listed in https://github.com/orgs/openssl/projects/12/views/1

As such, is there more work to do here, or should this be closed?

nhorman · 2024-06-30T19:32:04Z

no response, marking as inactive, to be closed at the end of 3.4 dev barring further input

beldmit added the issue: bug report The issue was opened to report a bug label Oct 8, 2021

t8m added triaged: bug The issue/pr is/fixes a bug triaged: feature The issue/pr requests/adds a feature branch: 3.0 Merge to openssl-3.0 branch branch: master Merge to master branch and removed issue: bug report The issue was opened to report a bug labels Oct 11, 2021

paulidale mentioned this issue Feb 2, 2022

3.0 Performance problems #17627

Open

mattcaswell mentioned this issue Mar 25, 2022

Performance issue with Openssl 3.0 in multi threaded application when using d2i_x509 #17950

Open

paulidale mentioned this issue May 18, 2022

Ignore locale creation failure #18337

Closed

t8m mentioned this issue May 18, 2022

ossl_namemap_name2_num: Avoid unnecessary OPENSSL_strndup(). #18341

Closed

beldmit mentioned this issue Jul 17, 2022

Severe efficiency degradation of credential loading in comparison to 1.1.1 #18814

Open

arhadthedev mentioned this issue Jul 20, 2022

OpenSSL 3.0 performance issue: SSLContext.set_default_verify_paths / load_verify_locations about 5x slower python/cpython#95031

Open

beldmit mentioned this issue Sep 27, 2022

Can a OpenSSL release that contains PR's that exist in master but not in the 3.0.X branch be created? #18846

Open

t8m added branch: 3.1 Merge to openssl-3.1 branch: 3.2 Merge to openssl-3.2 labels Oct 26, 2023

paulidale added the triaged: performance The issue/pr reports/fixes a performance concern label Nov 7, 2023

nhorman added the inactive label Jun 30, 2024

Performance measurements #16791

Performance measurements #16791

Comments

beldmit commented Oct 8, 2021

paulidale commented Oct 8, 2021

beldmit commented Oct 9, 2021

paulidale commented Oct 9, 2021

beldmit commented Oct 9, 2021

paulidale commented Oct 10, 2021

beldmit commented Oct 10, 2021

beldmit commented Oct 12, 2021

beldmit commented Oct 12, 2021

beldmit commented Oct 13, 2021

paulidale commented Oct 25, 2021

t8m commented Oct 25, 2021

beldmit commented Oct 25, 2021

t8m commented Oct 25, 2021

nibanks commented Nov 17, 2021

t8m commented Nov 18, 2021

kroeckx commented Nov 20, 2021 via email

mattcaswell commented Nov 20, 2021

hlandau commented Mar 10, 2022

nibanks commented Mar 10, 2022

hlandau commented Mar 15, 2022

nibanks commented Mar 15, 2022

hlandau commented Mar 16, 2022

nibanks commented Mar 16, 2022

paulidale commented Mar 22, 2022

paulidale commented Mar 23, 2022

paulidale commented Mar 23, 2022

t8m commented Mar 23, 2022

nibanks commented Mar 23, 2022

hlandau commented Mar 24, 2022

paulidale commented Mar 24, 2022

paulidale commented Mar 25, 2022

beldmit commented May 18, 2022

t8m commented May 18, 2022

beldmit commented May 18, 2022

t8m commented May 18, 2022

beldmit commented May 18, 2022

t8m commented May 18, 2022

beldmit commented May 18, 2022

beldmit commented May 18, 2022

paulidale commented May 19, 2022

paulidale commented May 20, 2022

nhorman commented Jun 10, 2024

nhorman commented Jun 30, 2024