Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zzachimed
Copy link

@zzachimed zzachimed commented Apr 9, 2023

Use Bmi2 instrunction to optmize compact protocol int64 code and decode.

In our unit test, it showed that code has 30% performance boost, and in decode it show 5% performance boost.

origin bench:

$ ./GbenchMarkForCompactProtocol
2023-04-10T00:56:33+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.71, 0.57, 0.24
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        22.1 ns         22.1 ns     31677633
BM_Int64_Encode_2        23.7 ns         23.7 ns     29457528
BM_Int64_Encode_3        26.1 ns         26.1 ns     26844781
BM_Int64_Encode_4        27.1 ns         27.1 ns     25705377
BM_Int64_Encode_5        30.0 ns         29.9 ns     23345095
BM_Int64_Encode_6        32.3 ns         32.3 ns     21658363
BM_Int64_Encode_7        34.5 ns         34.5 ns     20256118
BM_Int64_Encode_8        36.3 ns         36.3 ns     19234451
BM_Int64_Encode_9        37.8 ns         37.8 ns     18510508
BM_Int64_Encode_10       39.8 ns         39.8 ns     17531197
BM_Int64_Decode_1         485 ns          486 ns      1439281
BM_Int64_Decode_2         489 ns          491 ns      1424556
BM_Int64_Decode_3         493 ns          495 ns      1413917
BM_Int64_Decode_4         491 ns          493 ns      1421200
BM_Int64_Decode_5         495 ns          497 ns      1408950
BM_Int64_Decode_6         496 ns          498 ns      1406069
BM_Int64_Decode_7         504 ns          506 ns      1383636
BM_Int64_Decode_8         507 ns          509 ns      1374477
BM_Int64_Decode_9         505 ns          507 ns      1380481
BM_Int64_Decode_10        503 ns          505 ns      1386843

optimized bench:

$ ./GbenchMarkForCompactProtocol
2023-04-10T01:09:25+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.00, 0.24, 0.31
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        21.0 ns         21.0 ns     30414973
BM_Int64_Encode_2        22.3 ns         22.2 ns     28778678
BM_Int64_Encode_3        25.6 ns         25.6 ns     27278662
BM_Int64_Encode_4        29.1 ns         29.1 ns     23984143
BM_Int64_Encode_5        29.3 ns         29.2 ns     23856317
BM_Int64_Encode_6        29.6 ns         29.6 ns     23585080
BM_Int64_Encode_7        29.9 ns         29.8 ns     23423250
BM_Int64_Encode_8        29.7 ns         29.7 ns     23496002
BM_Int64_Encode_9        28.9 ns         28.9 ns     24128142
BM_Int64_Encode_10       28.9 ns         28.8 ns     24132967
BM_Int64_Decode_1         480 ns          480 ns      1430995
BM_Int64_Decode_2         481 ns          481 ns      1390746
BM_Int64_Decode_3         490 ns          492 ns      1395035
BM_Int64_Decode_4         495 ns          496 ns      1409327
BM_Int64_Decode_5         500 ns          501 ns      1398344
BM_Int64_Decode_6         495 ns          497 ns      1408096
BM_Int64_Decode_7         499 ns          501 ns      1398316
BM_Int64_Decode_8         496 ns          497 ns      1409067
BM_Int64_Decode_9         501 ns          503 ns      1392703
BM_Int64_Decode_10        500 ns          502 ns      1395121
  • Did you create an Apache Jira ticket? (Request account here, not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@Jens-G Jens-G added the c++ label Apr 9, 2023
…de,in our unit

test.It show that code has 30% peformance boost, and in decode it show 5% performance
boost.
origin bench:
$ ./GbenchMarkForCompactProtocol
2023-04-10T00:56:33+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.71, 0.57, 0.24
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        22.1 ns         22.1 ns     31677633
BM_Int64_Encode_2        23.7 ns         23.7 ns     29457528
BM_Int64_Encode_3        26.1 ns         26.1 ns     26844781
BM_Int64_Encode_4        27.1 ns         27.1 ns     25705377
BM_Int64_Encode_5        30.0 ns         29.9 ns     23345095
BM_Int64_Encode_6        32.3 ns         32.3 ns     21658363
BM_Int64_Encode_7        34.5 ns         34.5 ns     20256118
BM_Int64_Encode_8        36.3 ns         36.3 ns     19234451
BM_Int64_Encode_9        37.8 ns         37.8 ns     18510508
BM_Int64_Encode_10       39.8 ns         39.8 ns     17531197
BM_Int64_Decode_1         485 ns          486 ns      1439281
BM_Int64_Decode_2         489 ns          491 ns      1424556
BM_Int64_Decode_3         493 ns          495 ns      1413917
BM_Int64_Decode_4         491 ns          493 ns      1421200
BM_Int64_Decode_5         495 ns          497 ns      1408950
BM_Int64_Decode_6         496 ns          498 ns      1406069
BM_Int64_Decode_7         504 ns          506 ns      1383636
BM_Int64_Decode_8         507 ns          509 ns      1374477
BM_Int64_Decode_9         505 ns          507 ns      1380481
BM_Int64_Decode_10        503 ns          505 ns      1386843

optmizied bench:
$ ./GbenchMarkForCompactProtocol
2023-04-10T01:09:25+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.00, 0.24, 0.31
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        21.0 ns         21.0 ns     30414973
BM_Int64_Encode_2        22.3 ns         22.2 ns     28778678
BM_Int64_Encode_3        25.6 ns         25.6 ns     27278662
BM_Int64_Encode_4        29.1 ns         29.1 ns     23984143
BM_Int64_Encode_5        29.3 ns         29.2 ns     23856317
BM_Int64_Encode_6        29.6 ns         29.6 ns     23585080
BM_Int64_Encode_7        29.9 ns         29.8 ns     23423250
BM_Int64_Encode_8        29.7 ns         29.7 ns     23496002
BM_Int64_Encode_9        28.9 ns         28.9 ns     24128142
BM_Int64_Encode_10       28.9 ns         28.8 ns     24132967
BM_Int64_Decode_1         480 ns          480 ns      1430995
BM_Int64_Decode_2         481 ns          481 ns      1390746
BM_Int64_Decode_3         490 ns          492 ns      1395035
BM_Int64_Decode_4         495 ns          496 ns      1409327
BM_Int64_Decode_5         500 ns          501 ns      1398344
BM_Int64_Decode_6         495 ns          497 ns      1408096
BM_Int64_Decode_7         499 ns          501 ns      1398316
BM_Int64_Decode_8         496 ns          497 ns      1409067
BM_Int64_Decode_9         501 ns          503 ns      1392703
BM_Int64_Decode_10        500 ns          502 ns      1395121
@emmenlau
Copy link
Member

Whow, nice work! But not easy to validate. One option would be to keep this optional and disabled by default, so people can experiment with the new code?

@emmenlau
Copy link
Member

emmenlau commented Aug 7, 2024

This is very interesting. But how to proceed? @Jens-G I guess the tests should cover sufficient checks whether encoded/decoded data is identical with this change? Can we rely that the tests would typically fail if there are relevant differences with this encoding, compared to current thrift?

Otherwise it seems really challenging to build trust here...

@@ -124,6 +124,58 @@ if(UNIX)
endif()
endif()

if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you kindly change the indenting to 4 white spaces, instead of two?

@@ -124,6 +124,58 @@ if(UNIX)
endif()
endif()

if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)")
set(PREV_CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS})
set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -mbmi2 -mbmi -mlzcnt -msse3 -mavx512bw -mavx512vl")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions:

  • Not sure if these flags are portable to MSVC? On which platforms did you test?
  • Do you not enforce here already that the CPU supports all these flags? Then, further down, you test which CPU features are supported? Am I missing something?

Problems:

  • Not all users build for their current platforms. I.e. people may want to build on a modern CPU, but may plan to ship their binaries to broader platforms. This is for example the case with Linux distributors. It may be required to make these changes optional, particularly for the very advanced features. I.e. building on an AVX512 capable CPU should not necessarily make the binaries non-portable to non-AVX512-CPUs (which still exist quite a lot).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants