
[PROPOSAL] Add Performance Testing + SDK Span Performance Tests #1443

Merged

Conversation

NathanielRN
Contributor

@NathanielRN NathanielRN commented Dec 3, 2020

Description

Part of #335, this PR aims to include the pytest-benchmark library to get micro-benchmarking results for some key SDK operations.

Specifically, this follows the OpenTelemetry Specification's guidelines on Performance Benchmarking.
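
For context, a span-creation benchmark with pytest-benchmark can be as small as the sketch below. The test names match the table further down, and the output below suggests the PR's tests live under opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py; the tracer setup here is an assumption, not necessarily what this PR's tests do.

# Hedged sketch of a pytest-benchmark span benchmark; the real tests in this PR
# may configure the tracer differently.
from opentelemetry.sdk.trace import TracerProvider

tracer = TracerProvider().get_tracer("benchmark")


def test_simple_start_span(benchmark):
    def create_span():
        span = tracer.start_span("benchmarkedSpan")
        span.end()

    # The `benchmark` fixture runs create_span repeatedly and records the
    # min/max/mean/stddev/median/IQR/OPS statistics shown in the table below.
    benchmark(create_span)


def test_simple_start_as_current_span(benchmark):
    def create_span():
        with tracer.start_as_current_span("benchmarkedSpan"):
            pass

    benchmark(create_span)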

Regarding pytest-benchmark:

Pros:

  • Easy to use (integrates with pytest)
  • Lots of features (it was easy to mock and pass parameters to functions under test)
  • Nice ways to output (see the pytest-benchmark Command Line Options)
    • Table in the terminal
    • Histogram (since all the .json logs will stay in gh-pages, people can create these, or we can!)
    • Stats summary file
    • Detailed stats summary file

Cons:

Type of change

In this PR:

  • Benchmarks for every python version are added
  • The package tests will FAIL if, during comparison, performance regresses by more than 200% from the most recent recorded results (the high threshold is to avoid flaky test failures)
  • Benchmark results are printed to the workflow output (so we can see what happened and update the current results if needed)
  • Performance tests work locally too! (The first run fails because there are no benchmarks yet)
  • It is easy to add more benchmarks by:
    • Adding the tests themselves
    • Updating tox.ini to print the tests using the added tests/print_benchmarks.py script
    • Adding their most recent recorded results in the relevant package folder
  • All the history of the logs will be available in the gh-pages branch and at the link https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Locally

Running tox -e test-core-sdk produces the following table:

test-core-sdk run-test: commands[0] | pytest
================================================================================================================================================= test session starts =================================================================================================================================================
platform darwin -- Python 3.8.2, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /Users/enowell/git/opentelemetry-python/.tox/test-core-sdk/bin/python
cachedir: .tox/test-core-sdk/.pytest_cache
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/enowell/git/opentelemetry-python, configfile: pyproject.toml
plugins: benchmark-3.2.3
collected 249 items

...

-------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in us)                         Min                 Max               Mean             StdDev             Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_simple_start_span                27.3250 (1.0)      223.2550 (1.0)      31.0913 (1.0)       9.2509 (1.0)      28.6885 (1.0)      1.5250 (1.55)     218;1017       32.1634 (1.0)        5320           1
test_simple_start_as_current_span     36.0340 (1.32)     251.3840 (1.13)     40.2968 (1.30)     11.7254 (1.27)     37.5900 (1.31)     0.9810 (1.0)      744;2070       24.8159 (0.77)      14010           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================================================================================================================================== 249 passed, 1 warning in 8.83s ============================================================================================================================================
_______________________________________________________________________________________________________________________________________________________ summary _______________________________________________________________________________________________________________________________________________________
  test-core-sdk: commands succeeded
  congratulations :)

Example Repo

I made an example repo here to prove that the JSON merging works.

You can see the difference between PRs that have different performance benchmarks:

This combines all the tests under OpenTelemetry Python Benchmark as seen here. The fluctuations are on purpose because I made the test take longer.

[Screenshot: combined OpenTelemetry Python Benchmark graph from the example repo]

Does This PR Require a Contrib Repo Change?

  • No. (But it will be a good follow up)

Checklist:

  • Followed the style guidelines of this project
    - [ ] Changelogs have been updated (not changing functionality, only adding tests)
    - [ ] Unit tests have been added
    - [ ] Documentation has been updated

@NathanielRN NathanielRN requested review from a team, codeboten and hectorhdzg and removed request for a team December 3, 2020 09:09
Member

@aabmass aabmass left a comment


Awesome!

@srikanthccv
Member

@NathanielRN I came across https://pypi.org/project/scalene/; it might be helpful for CPU and memory profiling.

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 7 times, most recently from 725b632 to 92099a8 Compare December 8, 2020 09:07
@ocelotl
Contributor

ocelotl commented Dec 8, 2020

Description

Part of #335, this PR aims to include the pytest-benchmark library to get micro-benchmarking results for some key SDK operations.

Specifically, this follows the OpenTelemetry Specification's guidelines on Performance Benchmarking.

Regarding pytest-benchmark:

Pros:

  • Easy to use (integrates with pytest)

  • Lots of features (it was easy to mock and pass parameters to functions under test)

  • Nice ways to output (see the pytest-benchmark Command Line Options)

    • Table in Terminal
    • Histogram
    • Stats summary file
    • Detailed Stats summary file
  • You can compare benchmarking runs! Good for regression testing. (We can publish results on a per release basis? Can automate as part of eachdist.py)

Cons:

Type of change

In this PR:

  • The most up-to-date results are recorded in the <package>/tests/.benchmarks/ folder of every package (as measured on whatever machines GitHub Actions runs them on)

  • Benchmarks for every python version are added

  • The package tests will FAIL if, during comparison, performance regresses by more than 75% from the most recent recorded results (the high threshold is to avoid flaky test failures)

  • Benchmark results are printed to the workflow output (so we can see what happened and update the current results if needed)

  • Performance tests work locally too! (The first run fails because there are no benchmarks yet)

  • It is easy to add more benchmarks by:

    • Adding the tests themselves
    • Updating tox.ini to print the tests using the added tests/print_benchmarks.py script
    • Adding their most recent recorded results in the relevant package folder
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Running tox -e test-core-sdk produces the following table:

---------------------------------------------------------------------------------- benchmark: 4 tests ----------------------------------------------------------------------------------
Name (time in us)                                          Mean                Min                   Max             StdDev             Median            OPS (Kops/s)            Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_b3_propagator_start_span (NOW)                     33.2795 (1.0)      23.4010 (1.0)        126.2050 (1.0)       4.5293 (1.0)      33.8020 (1.0)           30.0485 (1.0)        4936
test_b3_propagator_start_as_current_span (NOW)          44.5598 (1.34)     31.3010 (1.34)       149.2060 (1.18)      5.3161 (1.17)     45.6020 (1.35)          22.4417 (0.75)       7611
test_b3_propagator_start_span (0001_old)                46.3537 (1.39)     37.1000 (1.59)     2,572.7180 (20.39)    46.9578 (10.37)    41.6010 (1.23)          21.5732 (0.72)       3767
test_b3_propagator_start_as_current_span (0001_old)     62.8846 (1.89)     48.5010 (2.07)     3,622.1260 (28.70)    58.6981 (12.96)    57.2010 (1.69)          15.9021 (0.53)       8765
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================= 236 passed, 1 warning in 10.19s ========================
py38-test-core-sdk run-test: commands[1] | /home/runner/work/opentelemetry-python/opentelemetry-python/tests/print_benchmarks.py /home/runner/work/opentelemetry-python/opentelemetry-python/opentelemetry-sdk
{
    "machine_info": {
        "node": "fv-az50-359",
        "processor": "x86_64",
        "machine": "x86_64",
        "python_compiler": "GCC 7.5.0",
        "python_implementation": "CPython",
        "python_implementation_version": "3.8.6",
        "python_version": "3.8.6",
        "python_build": [
            "default",
            "Oct 28 2020 13:08:18"
        ],
        "release": "5.4.0-1031-azure",
        "system": "Linux",
        "cpu": {
            "vendor_id": "unknown",
            "hardware": "unknown",
            "brand": "unknown"
        }
    },
    "commit_info": {
        "id": "ed502ab2ef5913bcc20a5cc3207bfafe74a0da41",
        "time": "2020-12-08T09:21:21+00:00",
        "author_time": "2020-12-08T09:21:21+00:00",
        "dirty": false,
        "project": "opentelemetry-python",
        "branch": "(detached head)"
    },
    "benchmarks": [
        {
            "group": null,
            "name": "test_b3_propagator_start_span",
            "fullname": "opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py::test_b3_propagator_start_span",
            "params": null,
            "param": null,
            "extra_info": {},
            "options": {
                "disable_gc": false,
                "timer": "perf_counter",
                "min_rounds": 5,
                "max_time": 1.0,
                "min_time": 5e-06,
                "warmup": false
            },
            "stats": {
                "min": 2.340099999287304e-05,
                "max": 0.00012620499998661217,
                "mean": 3.327952289266268e-05,
                "stddev": 4.529273113632252e-06,
                "rounds": 4936,
                "median": 3.3801999990146214e-05,
                "iqr": 2.8009999653022533e-06,
                "q1": 3.170100001170795e-05,
                "q3": 3.45019999770102e-05,
                "iqr_outliers": 521,
                "stddev_outliers": 912,
                "outliers": "912;521",
                "ld15iqr": 2.7500999976837193e-05,
                "hd15iqr": 3.880200000594414e-05,
                "ops": 30048.507703230192,
                "total": 0.16426772499818298,
                "iterations": 1
            }
        },
        {
            "group": null,
            "name": "test_b3_propagator_start_as_current_span",
            "fullname": "opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py::test_b3_propagator_start_as_current_span",
            "params": null,
            "param": null,
            "extra_info": {},
            "options": {
                "disable_gc": false,
                "timer": "perf_counter",
                "min_rounds": 5,
                "max_time": 1.0,
                "min_time": 5e-06,
                "warmup": false
            },
            "stats": {
                "min": 3.13009999786118e-05,
                "max": 0.00014920600000323248,
                "mean": 4.455981447907276e-05,
                "stddev": 5.31612969477487e-06,
                "rounds": 7611,
                "median": 4.560200000014447e-05,
                "iqr": 3.600999974651131e-06,
                "q1": 4.2601000018294144e-05,
                "q3": 4.6201999992945275e-05,
                "iqr_outliers": 1075,
                "stddev_outliers": 1483,
                "outliers": "1483;1075",
                "ld15iqr": 3.720099999782178e-05,
                "hd15iqr": 5.170199997905911e-05,
                "ops": 22441.745139438222,
                "total": 0.33914474800022276,
                "iterations": 1
            }
        }
    ],
    "datetime": "2020-12-08T09:23:18.999916",
    "version": "3.2.3"
}

Does This PR Require a Contrib Repo Change?

  • No. (But it will be a good follow up)

Checklist:

  • Followed the style guidelines of this project
    - [ ] Changelogs have been updated (not changing functionality, only adding tests)
    - [ ] Unit tests have been added
    - [ ] Documentation has been updated

Performance testing! This is outstanding 😎. I have one conceptual question (it may very well be related to pytest-benchmark rather than to your changes, @NathanielRN):

What does performance mean in:

  • The package tests will FAIL if, during comparison, performance regresses by more than 75% from the most recent recorded results (to avoid flaky test failures)
    ?

Is performance measured in seconds? For example, a performance test is run once and it takes 10 seconds, so its performance is 10s?

As far as I understand, a regression of less than 75% is accepted. So, if a test case takes 100s to run, it is accepted if it is run again and it takes 174s? Does that mean that it is acceptable that the performance can continue to degrade more and more as long as it is less than 75% of the previous run?

@NathanielRN
Contributor Author

@ocelotl That's a great question!

Does that mean that it is acceptable that the performance can continue to degrade more and more as long as it is less than 75% of the previous run?

No, it compares against some baseline performance. In this PR I committed 0001_old.json files with the performance results that I saw reported by the GitHub Actions workflow:

(e.g. opentelemetry-sdk/tests/.benchmarks/Linux-CPython-3.5-64bit/0001_old.json)

This becomes the expected performance for the package. On subsequent PRs, pytest-benchmark now always compares the results of 0001_old.json with a new 0002_now.json file generated using the changes in the PR. If the changes deviate by more than 75% from 0001_old.json, then the test fails. So in your example the PR would fail if it goes past 175 seconds 😄
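
A quick worked example of that min:75% threshold from the tox.ini command, with the numbers from this discussion (a small illustrative snippet, not something from the PR itself):

# The actual threshold check is done by pytest-benchmark, not by code like this.
baseline_min = 100.0       # seconds, taken from the saved 0001_old.json baseline
allowed_regression = 0.75  # --benchmark-compare-fail=min:75%
fail_above = baseline_min * (1 + allowed_regression)
print(fail_above)          # 175.0 -> a run slower than this fails the comparison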

Is performance measured in seconds? For example, a performance test is run once and it takes 10 seconds, so its performance is 10s?

According to the pytest-benchmark docs, tests are run for 1 second, as many times as the test can be run in that second:

--benchmark-max-time=SECONDS
Maximum run time per test - it will be repeated until this total time is reached. It may be exceeded if test function is very slow or –benchmark-min-rounds is large (it takes precedence). Default: ‘1.0’

And the units seem to be in seconds.

--benchmark-compare-fail=EXPR
Fail test if performance regresses according to given EXPR (eg: min:5% or mean:0.001 for number of seconds). Can be used multiple times.

So a "mean": 4.2917183875654695e-05 for my test_simple_start_span benchmark means that, across all the rounds it ran, the average runtime was about 42.9 µs (4.29e-05 seconds) for this test.
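
(A quick unit sanity check, using OPS = 1 / Mean from the legend above; this snippet is only illustrative:)

mean_seconds = 4.2917183875654695e-05  # "mean" from the saved benchmark stats
print(mean_seconds * 1e6)              # ~42.9 -> microseconds per call
print(1 / mean_seconds)                # ~23,300 -> operations per second (OPS)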

@NathanielRN
Contributor Author

@lonewolf3739 Thanks for the suggestion! Will be looking into this 😄

Member

@aabmass aabmass left a comment


I think we should only be updating these saved benchmark files from GitHub Actions CI runs somehow, or maybe storing them as artifacts. The results could vary a lot depending on the PR author's hardware. If they run on GitHub Actions, the results should be somewhat normalized, and this wouldn't prevent people from using Mac/Windows for development or force them to install all 5 versions of Python we support to regenerate the benchmarks.

That might be a bit tricky, what is the Java SIG doing for their JMH benchmarks?

@@ -0,0 +1,104 @@
{
Member


Should probably add a linguist-generated entry for these to keep PR noise down?

Contributor Author


That's really cool! I didn't know about this 😄

However, I think it's only big this time. The benefit of having it included in the diff is that when someone updates the performance later, only the numbers should change (which means we can see how the performance changes in the git diff).

Also if you add new performance tests, that will be clearly seen as only a few lines of code added to these files.

Let me know what you think?

Member


Oh I see, it always updates the same file. I was looking at the pytest-benchmark docs here with --benchmark-autosave. Is it worth preserving all previous runs? They show some cool box and whisker plots from previously saved runs. We can always iterate 😄

Contributor Author


It's a good idea! For now we will have a history of the mean throughput in benchmarks/data.js for all commits, but if someone wants to generate the cool whisker plots, all they would have to do is clone the repo and run tox -e test-core-sdk -- --benchmark-histogram=<path> to generate them.

But I think it would be cool to have just the latest .json file committed somewhere here on the repo later :)

tox.ini Outdated
test: pytest {posargs}
test: pytest --benchmark-save=now --benchmark-compare --benchmark-compare-fail=min:75% --benchmark-sort=mean --benchmark-columns=mean,min,max,stddev,median,ops,rounds {posargs}

core-sdk: {toxinidir}/tests/print_benchmarks.py {toxinidir}/opentelemetry-sdk
Member


I'm not sure I follow what this is for. Doesn't pytest already print this in a nice colored format when you run it?

Contributor Author


It prints the table, which is a nice visual format. However, we want to see the .json file produced, because pytest-benchmark can only run comparisons between .json files, not tables.

So the .json file is printed to the workflow logs so that the author can see the performance results of their changes according to the normalized standard of "running on GitHub Actions".

And if they improved performance, they would update the relevant 0001_old.json files for each Python version 😄 (This is the only manual part of the process 😕)

Contributor Author


Close, they would copy/paste the output JSON into the opentelemetry-sdk/tests/performance/.benchmarks/Linux-CPython-3.5-64bit/0001_old.json file for example.

This is only if they want to change the expected performance standard, which is something we could automate to happen every "release" for example.
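
For readers following along, below is a minimal sketch of what a script like tests/print_benchmarks.py could do, assuming it simply locates the newest saved benchmark JSON under a package's .benchmarks folder and prints it to the logs; the actual script added in this PR may be implemented differently.

#!/usr/bin/env python3
# Hypothetical sketch, not the actual tests/print_benchmarks.py from this PR.
import json
import pathlib
import sys


def main(package_dir: str) -> None:
    # Assumes results live under <package>/tests/.benchmarks/<machine-id>/*.json
    benchmarks_dir = pathlib.Path(package_dir) / "tests" / ".benchmarks"
    json_files = sorted(
        benchmarks_dir.rglob("*.json"), key=lambda path: path.stat().st_mtime
    )
    if not json_files:
        print(f"No benchmark results found under {benchmarks_dir}")
        return
    # Print the newest result so it shows up in the workflow logs and can be
    # copied into the committed 0001_old.json baseline if desired.
    with json_files[-1].open() as latest:
        print(json.dumps(json.load(latest), indent=4))


if __name__ == "__main__":
    main(sys.argv[1])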

@NathanielRN
Contributor Author

@aabmass

I think we should only be updating these saved benchmark files from Github action CI runs somehow, or maybe storing them as artifacts.

The intention is to only update from the GitHub Actions CI 🙂 Short of committing the actual results into the PR's files (I don't know how to do that and I don't think we should...), the next best solution I thought of was to actually output the results in the GitHub Actions CI logs. That's where the 0001_old.json files in this PR came from; I copied and pasted the results of the tox -e test-core-sdk output in the workflow.

The results could vary a lot depending on the PR author's hardware.

Authors will never have to commit results from their machine; they will use the GitHub Actions machine results.

We'll know it came from GitHub because "machine_info"."node" will look like fv-az12-670 or fv-az52-394 for example

Compare this to a run on my local machine which has machine_info.node: ###.ant.amazon.com.

If they run on Github Actions it should be somewhat normalized

Agreed!

and wouldn't prevent people from using Mac/Windows for development, or force them to install all 5 versions of python we support to regen the benchmarks.

Yes, also agreed :) I tested locally and you can still run tox -e test-core-sdk, no problem. And because the print_benchmarks.py script interprets your Python environment, it will still be able to compare against your previous runs and save your latest runs in a folder appropriate for your hardware.

That might be a bit tricky, what is the Java SIG doing for their JMH benchmarks?

The Java SIG seems to be adding the performance test results in the PR description: See PR 2186. And I think I heard they post them in some .md files in a table somewhere else but I'm not certain.

When considering their solution, I thought this approach, where we commit the results as .json files, is a bit better, because then pytest-benchmark can actually compare and fail/succeed on the differences, which is what we originally asked for in the SIG meeting! And then people can update them relatively easily with a copy & paste. When performance tests change, it will be easy to see this in a git diff of the 0001_old.json files 😄

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 7 times, most recently from 30bbf14 to 93885b3 Compare December 9, 2020 00:05
Member

@aabmass aabmass left a comment


@NathanielRN why aren't we seeing the run on this PR? Is it because there are no existing saved benchmarks, but it will work going forward?

Performance progression of benchmarks for packages distributed by OpenTelemetry Python can be viewed as a [graph of throughput vs commit history](https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html). From this page, you can download a JSON file with the performance results.

Running the `tox` tests also runs the performance tests if any are available. Benchmarking tests are done with `pytest-benchmark` and they output a table with results to the console.

Member


@NathanielRN this will show the perf change on your own machine between runs, right?

Do the JSON files need to be added to .gitignore for this PR?

@NathanielRN
Contributor Author

@aabmass

@NathanielRN why aren't we seeing the run on this PR? Is it because there is no existing saved benchmarks, but it will work going forward?

Yes, exactly, that is what I expect. On my test repo I did not see the comments until I had something on gh-pages.

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 2 times, most recently from de3277a to 86b642e Compare December 10, 2020 22:36
Contributor

@lzchen lzchen left a comment


Niiiiiiiiiiiiiiiiiiiiice.

As discussed with @NathanielRN, since readthedocs builds from master, this GitHub action will also write to the benchmarks folder on master for every build.

Contributor

@ocelotl ocelotl left a comment


Just a question regarding specification compliance 👍

Member

@aabmass aabmass left a comment


🚢

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 4 times, most recently from 577533b to 22d48fa Compare December 11, 2020 07:31
@lzchen lzchen merged commit 496ed8b into open-telemetry:master Dec 11, 2020
@NathanielRN NathanielRN deleted the add-sdk-span-performance-tests branch December 11, 2020 20:52