
[PROPOSAL] Add Performance Testing + SDK Span Performance Tests #1443

Merged

Conversation

NathanielRN
Contributor

@NathanielRN NathanielRN commented Dec 3, 2020

Description

Part of #335, this PR aims to include the pytest-benchmark library to get micro-benchmarking results for some key SDK operations.

Specifically, this follows the OpenTelemetry Specification's guidelines on Performance Benchmarking.
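
For context, a span-creation benchmark with pytest-benchmark can be as small as the sketch below. The test names match the table further down, and the output below suggests the PR's tests live under opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py; the tracer setup here is an assumption, not necessarily what this PR's tests do.

# Hedged sketch of a pytest-benchmark span benchmark; the real tests in this PR
# may configure the tracer differently.
from opentelemetry.sdk.trace import TracerProvider

tracer = TracerProvider().get_tracer("benchmark")


def test_simple_start_span(benchmark):
    def create_span():
        span = tracer.start_span("benchmarkedSpan")
        span.end()

    # The `benchmark` fixture runs create_span repeatedly and records the
    # min/max/mean/stddev/median/IQR/OPS statistics shown in the table below.
    benchmark(create_span)


def test_simple_start_as_current_span(benchmark):
    def create_span():
        with tracer.start_as_current_span("benchmarkedSpan"):
            pass

    benchmark(create_span)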

Regarding pytest-benchmark:

Pros:

  • Easy to use (integrates with pytest)
  • Lots of features (it was easy to mock and pass parameters to functions under test)
  • Nice ways to output (see the pytest-benchmark Command Line Options)
    • Table in the terminal
    • Histogram (since all the .json logs will stay in gh-pages, people can create these, or we can!)
    • Stats summary file
    • Detailed stats summary file

Cons:

Type of change

In this PR:

  • Benchmarks for every python version are added
  • The package tests will FAIL if, during comparison, performance regresses by more than 200% from the most recent recorded results (the high threshold is to avoid flaky test failures)
  • Benchmark results are printed to the workflow output (so we can see what happened and update the current results if needed)
  • Performance tests work locally too! (The first run fails because there are no benchmarks yet)
  • It is easy to add more benchmarks by:
    • Adding the tests themselves
    • Updating tox.ini to print the tests using the added tests/print_benchmarks.py script
    • Adding their most recent recorded results in the relevant package folder
  • All the history of the logs will be available in the gh-pages branch and at the link https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Locally

Running tox -e test-core-sdk produces the following table:

test-core-sdk run-test: commands[0] | pytest
================================================================================================================================================= test session starts =================================================================================================================================================
platform darwin -- Python 3.8.2, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /Users/enowell/git/opentelemetry-python/.tox/test-core-sdk/bin/python
cachedir: .tox/test-core-sdk/.pytest_cache
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/enowell/git/opentelemetry-python, configfile: pyproject.toml
plugins: benchmark-3.2.3
collected 249 items

...

-------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in us)                         Min                 Max               Mean             StdDev             Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_simple_start_span                27.3250 (1.0)      223.2550 (1.0)      31.0913 (1.0)       9.2509 (1.0)      28.6885 (1.0)      1.5250 (1.55)     218;1017       32.1634 (1.0)        5320           1
test_simple_start_as_current_span     36.0340 (1.32)     251.3840 (1.13)     40.2968 (1.30)     11.7254 (1.27)     37.5900 (1.31)     0.9810 (1.0)      744;2070       24.8159 (0.77)      14010           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================================================================================================================================== 249 passed, 1 warning in 8.83s ============================================================================================================================================
_______________________________________________________________________________________________________________________________________________________ summary _______________________________________________________________________________________________________________________________________________________
  test-core-sdk: commands succeeded
  congratulations :)

Example Repo

I made an example repo here to prove that the JSON merging works.

You can see the difference between PRs that have different performance benchmarks:

This combines all the tests under OpenTelemetry Python Benchmark as seen here. The fluctuations are on purpose because I made the test take longer.

[Screenshot: combined OpenTelemetry Python Benchmark graph from the example repo]

Does This PR Require a Contrib Repo Change?

  • No. (But it will be a good follow up)

Checklist:

  • Followed the style guidelines of this project
    - [ ] Changelogs have been updated (not changing functionality, only adding tests)
    - [ ] Unit tests have been added
    - [ ] Documentation has been updated

@NathanielRN NathanielRN requested review from a team, codeboten and hectorhdzg and removed request for a team December 3, 2020 09:09
Member

@aabmass aabmass left a comment


Awesome!

@srikanthccv
Member

@NathanielRN I came across https://pypi.org/project/scalene/; it might be helpful for CPU and memory profiling.

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 7 times, most recently from 725b632 to 92099a8 Compare December 8, 2020 09:07
@ocelotl
Contributor

ocelotl commented Dec 8, 2020

Description

Part of #335, this PR aims to include the pytest-benchmark library to get micro-benchmarking results for some key SDK operations.

Specifically, this follows the OpenTelemetry Specification's guidelines on Performance Benchmarking.

Regarding pytest-benchmark:

Pros:

  • Easy to use (integrates with pytest)

  • Lots of features (it was easy to mock and pass parameters to functions under test)

  • Nice ways to output (see the pytest-benchmark Command Line Options)

    • Table in Terminal
    • Histogram
    • Stats summary file
    • Detailed Stats summary file
  • You can compare benchmarking runs! Good for regression testing. (We can publish results on a per release basis? Can automate as part of eachdist.py)

Cons:

Type of change

In this PR:

  • The most up-to-date results are recorded in the <package>/tests/.benchmarks/ folder of every package (as measured on whatever machines GitHub Actions runs them on)

  • Benchmarks for every python version are added

  • The package tests will FAIL if, during comparison, performance regresses by more than 75% from the most recent recorded results (the high threshold is to avoid flaky test failures)

  • Benchmark results are printed to the workflow output (so we can see what happened and update the current results if needed)

  • Performance tests work locally too! (The first run fails because there are no benchmarks yet)

  • It is easy to add more benchmarks by:

    • Adding the tests themselves
    • Updating tox.ini to print the tests using the added tests/print_benchmarks.py script
    • Adding their most recent recorded results in the relevant package folder
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Running tox -e test-core-sdk produces the following table:

---------------------------------------------------------------------------------- benchmark: 4 tests ----------------------------------------------------------------------------------
Name (time in us)                                          Mean                Min                   Max             StdDev             Median            OPS (Kops/s)            Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_b3_propagator_start_span (NOW)                     33.2795 (1.0)      23.4010 (1.0)        126.2050 (1.0)       4.5293 (1.0)      33.8020 (1.0)           30.0485 (1.0)        4936
test_b3_propagator_start_as_current_span (NOW)          44.5598 (1.34)     31.3010 (1.34)       149.2060 (1.18)      5.3161 (1.17)     45.6020 (1.35)          22.4417 (0.75)       7611
test_b3_propagator_start_span (0001_old)                46.3537 (1.39)     37.1000 (1.59)     2,572.7180 (20.39)    46.9578 (10.37)    41.6010 (1.23)          21.5732 (0.72)       3767
test_b3_propagator_start_as_current_span (0001_old)     62.8846 (1.89)     48.5010 (2.07)     3,622.1260 (28.70)    58.6981 (12.96)    57.2010 (1.69)          15.9021 (0.53)       8765
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================= 236 passed, 1 warning in 10.19s ========================
py38-test-core-sdk run-test: commands[1] | /home/runner/work/opentelemetry-python/opentelemetry-python/tests/print_benchmarks.py /home/runner/work/opentelemetry-python/opentelemetry-python/opentelemetry-sdk
{
    "machine_info": {
        "node": "fv-az50-359",
        "processor": "x86_64",
        "machine": "x86_64",
        "python_compiler": "GCC 7.5.0",
        "python_implementation": "CPython",
        "python_implementation_version": "3.8.6",
        "python_version": "3.8.6",
        "python_build": [
            "default",
            "Oct 28 2020 13:08:18"
        ],
        "release": "5.4.0-1031-azure",
        "system": "Linux",
        "cpu": {
            "vendor_id": "unknown",
            "hardware": "unknown",
            "brand": "unknown"
        }
    },
    "commit_info": {
        "id": "ed502ab2ef5913bcc20a5cc3207bfafe74a0da41",
        "time": "2020-12-08T09:21:21+00:00",
        "author_time": "2020-12-08T09:21:21+00:00",
        "dirty": false,
        "project": "opentelemetry-python",
        "branch": "(detached head)"
    },
    "benchmarks": [
        {
            "group": null,
            "name": "test_b3_propagator_start_span",
            "fullname": "opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py::test_b3_propagator_start_span",
            "params": null,
            "param": null,
            "extra_info": {},
            "options": {
                "disable_gc": false,
                "timer": "perf_counter",
                "min_rounds": 5,
                "max_time": 1.0,
                "min_time": 5e-06,
                "warmup": false
            },
            "stats": {
                "min": 2.340099999287304e-05,
                "max": 0.00012620499998661217,
                "mean": 3.327952289266268e-05,
                "stddev": 4.529273113632252e-06,
                "rounds": 4936,
                "median": 3.3801999990146214e-05,
                "iqr": 2.8009999653022533e-06,
                "q1": 3.170100001170795e-05,
                "q3": 3.45019999770102e-05,
                "iqr_outliers": 521,
                "stddev_outliers": 912,
                "outliers": "912;521",
                "ld15iqr": 2.7500999976837193e-05,
                "hd15iqr": 3.880200000594414e-05,
                "ops": 30048.507703230192,
                "total": 0.16426772499818298,
                "iterations": 1
            }
        },
        {
            "group": null,
            "name": "test_b3_propagator_start_as_current_span",
            "fullname": "opentelemetry-sdk/tests/performance/benchmarks/trace/test_benchmark_trace.py::test_b3_propagator_start_as_current_span",
            "params": null,
            "param": null,
            "extra_info": {},
            "options": {
                "disable_gc": false,
                "timer": "perf_counter",
                "min_rounds": 5,
                "max_time": 1.0,
                "min_time": 5e-06,
                "warmup": false
            },
            "stats": {
                "min": 3.13009999786118e-05,
                "max": 0.00014920600000323248,
                "mean": 4.455981447907276e-05,
                "stddev": 5.31612969477487e-06,
                "rounds": 7611,
                "median": 4.560200000014447e-05,
                "iqr": 3.600999974651131e-06,
                "q1": 4.2601000018294144e-05,
                "q3": 4.6201999992945275e-05,
                "iqr_outliers": 1075,
                "stddev_outliers": 1483,
                "outliers": "1483;1075",
                "ld15iqr": 3.720099999782178e-05,
                "hd15iqr": 5.170199997905911e-05,
                "ops": 22441.745139438222,
                "total": 0.33914474800022276,
                "iterations": 1
            }
        }
    ],
    "datetime": "2020-12-08T09:23:18.999916",
    "version": "3.2.3"
}

Does This PR Require a Contrib Repo Change?

  • No. (But it will be a good follow up)

Checklist:

  • Followed the style guidelines of this project
    - [ ] Changelogs have been updated (not changing functionality, only adding tests)
    - [ ] Unit tests have been added
    - [ ] Documentation has been updated

Performance testing! This is outstanding 😎. I have one conceptual question (it may very well be related to pytest-benchmark rather than to your changes, @NathanielRN):

What does performance mean in:

  • The package tests will FAIL if, during comparison, performance regresses by more than 75% from the most recent recorded results (to avoid flaky test failures)
    ?

Is performance measured in seconds? For example, a performance test is run once and it takes 10 seconds, so its performance is 10s?

As far as I understand, a regression of less than 75% is accepted. So, if a test case takes 100s to run, it is accepted if it is run again and it takes 174s? Does that mean that it is acceptable that the performance can continue to degrade more and more as long as it is less than 75% of the previous run?

@NathanielRN
Contributor Author

@ocelotl That's a great question!

Does that mean that it is acceptable that the performance can continue to degrade more and more as long as it is less than 75% of the previous run?

No, it compares against some baseline performance. In this PR I committed 0001_old.json files with the performance results that I saw reported by the GitHub Actions workflow:

(e.g. opentelemetry-sdk/tests/.benchmarks/Linux-CPython-3.5-64bit/0001_old.json)

This becomes the expected performance for the package. On subsequent PRs, pytest-benchmark now always compares the results of 0001_old.json with a new 0002_now.json file generated using the changes in the PR. If the changes deviate by more than 75% from 0001_old.json, then the test fails. So in your example the PR would fail if it goes past 175 seconds 😄
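
A quick worked example of that min:75% threshold from the tox.ini command, with the numbers from this discussion (a small illustrative snippet, not something from the PR itself):

# The actual threshold check is done by pytest-benchmark, not by code like this.
baseline_min = 100.0       # seconds, taken from the saved 0001_old.json baseline
allowed_regression = 0.75  # --benchmark-compare-fail=min:75%
fail_above = baseline_min * (1 + allowed_regression)
print(fail_above)          # 175.0 -> a run slower than this fails the comparison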

Is performance measured in seconds? For example, a performance test is run once and it takes 10 seconds, so its performance is 10s?

According to the pytest-benchmark docs, tests are run for 1 second, as many times as the test can be run in that second:

--benchmark-max-time=SECONDS
Maximum run time per test - it will be repeated until this total time is reached. It may be exceeded if test function is very slow or –benchmark-min-rounds is large (it takes precedence). Default: ‘1.0’

And the units seem to be in seconds.

--benchmark-compare-fail=EXPR
Fail test if performance regresses according to given EXPR (eg: min:5% or mean:0.001 for number of seconds). Can be used multiple times.

So a "mean": 4.2917183875654695e-05 for my test_simple_start_span benchmark means that, across all the rounds it ran, the average runtime was about 42.9 µs (4.29e-05 seconds) for this test.
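
(A quick unit sanity check, using OPS = 1 / Mean from the legend above; this snippet is only illustrative:)

mean_seconds = 4.2917183875654695e-05  # "mean" from the saved benchmark stats
print(mean_seconds * 1e6)              # ~42.9 -> microseconds per call
print(1 / mean_seconds)                # ~23,300 -> operations per second (OPS)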

@NathanielRN
Contributor Author

@lonewolf3739 Thanks for the suggestion! Will be looking into this 😄

Member

@aabmass aabmass left a comment


I think we should only be updating these saved benchmark files from GitHub Actions CI runs somehow, or maybe storing them as artifacts. The results could vary a lot depending on the PR author's hardware. If they run on GitHub Actions, the results should be somewhat normalized, and this wouldn't prevent people from using Mac/Windows for development or force them to install all 5 versions of Python we support to regenerate the benchmarks.

That might be a bit tricky, what is the Java SIG doing for their JMH benchmarks?

@@ -0,0 +1,104 @@
{
Member


Should probably add a linguist-generated entry for these to keep PR noise down?

Contributor Author


That's really cool! I didn't know about this 😄

However, I think it's only big this time. The benefit of having it included in the diff is that when someone updates the performance later, only the numbers should change (which means we can see how the performance changes in the git diff).

Also if you add new performance tests, that will be clearly seen as only a few lines of code added to these files.

Let me know what you think?

Member


Oh I see, it always updates the same file. I was looking at the pytest-benchmark docs here with --benchmark-autosave. Is it worth preserving all previous runs? They show some cool box and whisker plots from previously saved runs. We can always iterate 😄

Contributor Author


It's a good idea! For now we will have a history of the mean throughput in benchmarks/data.js for all commits, but if someone wants to generate the cool whisker plots, all they would have to do is clone the repo and run tox -e test-core-sdk -- --benchmark-histogram=<path> to generate them.

But I think it would be cool to have just the latest .json file committed somewhere here on the repo later :)

tox.ini Outdated
test: pytest {posargs}
test: pytest --benchmark-save=now --benchmark-compare --benchmark-compare-fail=min:75% --benchmark-sort=mean --benchmark-columns=mean,min,max,stddev,median,ops,rounds {posargs}

core-sdk: {toxinidir}/tests/print_benchmarks.py {toxinidir}/opentelemetry-sdk
Member


I'm not sure I follow what this is for. Doesn't pytest already print this in a nice colored format when you run it?

Contributor Author


It prints the table, which is a nice visual format. However, we want to see the .json file produced, because pytest-benchmark can only run comparisons between .json files, not tables.

So the .json file is printed to the workflow logs so that the author can see the performance results of their changes according to the normalized standard of "running on GitHub Actions".

And if they improved performance, they would update the relevant 0001_old.json files for each Python version 😄 (This is the only manual part of the process 😕)

Contributor Author


Close, they would copy/paste the output JSON into the opentelemetry-sdk/tests/performance/.benchmarks/Linux-CPython-3.5-64bit/0001_old.json file for example.

This is only if they want to change the expected performance standard, which is something we could automate to happen every "release" for example.
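
For readers following along, below is a minimal sketch of what a script like tests/print_benchmarks.py could do, assuming it simply locates the newest saved benchmark JSON under a package's .benchmarks folder and prints it to the logs; the actual script added in this PR may be implemented differently.

#!/usr/bin/env python3
# Hypothetical sketch, not the actual tests/print_benchmarks.py from this PR.
import json
import pathlib
import sys


def main(package_dir: str) -> None:
    # Assumes results live under <package>/tests/.benchmarks/<machine-id>/*.json
    benchmarks_dir = pathlib.Path(package_dir) / "tests" / ".benchmarks"
    json_files = sorted(
        benchmarks_dir.rglob("*.json"), key=lambda path: path.stat().st_mtime
    )
    if not json_files:
        print(f"No benchmark results found under {benchmarks_dir}")
        return
    # Print the newest result so it shows up in the workflow logs and can be
    # copied into the committed 0001_old.json baseline if desired.
    with json_files[-1].open() as latest:
        print(json.dumps(json.load(latest), indent=4))


if __name__ == "__main__":
    main(sys.argv[1])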

@NathanielRN
Contributor Author

@aabmass

I think we should only be updating these saved benchmark files from Github action CI runs somehow, or maybe storing them as artifacts.

The intention is to only update from the GitHub Actions CI 🙂 Short of committing the actual results into the PR's files (I don't know how to do that and I don't think we should...), the next best solution I thought of was to actually output the results in the GitHub Actions CI logs. That's where the 0001_old.json files in this PR came from; I copied and pasted the results of the tox -e test-core-sdk output in the workflow.

The results could vary a lot depending on the PR author's hardware.

Authors will never have to commit results from their machine; they will use the GitHub Actions machine results.

We'll know it came from GitHub because "machine_info"."node" will look like fv-az12-670 or fv-az52-394 for example

Compare this to a run on my local machine which has machine_info.node: ###.ant.amazon.com.

If they run on Github Actions it should be somewhat normalized

Agreed!

and wouldn't prevent people from using Mac/Windows for development, or force them to install all 5 versions of python we support to regen the benchmarks.

Yes, also agreed :) I tested locally and you can still run tox -e test-core-sdk, no problem. And because the print_benchmarks.py script interprets your Python environment, it will still be able to compare against your previous runs and save your latest runs in a folder appropriate for your hardware.

That might be a bit tricky, what is the Java SIG doing for their JMH benchmarks?

The Java SIG seems to be adding the performance test results in the PR description: See PR 2186. And I think I heard they post them in some .md files in a table somewhere else but I'm not certain.

When considering their solution, I thought this approach, where we commit the results as .json files, is a bit better, because then pytest-benchmark can actually compare and fail/succeed on the differences, which is what we originally asked for in the SIG meeting! And then people can update them relatively easily with a copy & paste. When performance tests change, it will be easy to see this in a git diff of the 0001_old.json files 😄

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 7 times, most recently from 30bbf14 to 93885b3 Compare December 9, 2020 00:05
Member

@aabmass aabmass left a comment


@NathanielRN why aren't we seeing the run on this PR? Is it because there are no existing saved benchmarks, but it will work going forward?

Performance progression of benchmarks for packages distributed by OpenTelemetry Python can be viewed as a [graph of throughput vs commit history](https://open-telemetry.github.io/opentelemetry-python/benchmarks/index.html). From this page, you can download a JSON file with the performance results.

Running the `tox` tests also runs the performance tests if any are available. Benchmarking tests are done with `pytest-benchmark` and they output a table with results to the console.

Member


@NathanielRN this will show the perf change on your own machine between runs, right?

Do the JSON files need to be added to .gitignore for this PR?

@NathanielRN
Contributor Author

@aabmass

@NathanielRN why aren't we seeing the run on this PR? Is it because there is no existing saved benchmarks, but it will work going forward?

Yes, exactly, that is what I expect. On my test repo I did not see the comments until I had something on gh-pages.

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 2 times, most recently from de3277a to 86b642e Compare December 10, 2020 22:36
Contributor

@lzchen lzchen left a comment


Niiiiiiiiiiiiiiiiiiiiice.

As discussed with @NathanielRN, since readthedocs builds from master, this GitHub action will also write to the benchmarks folder on master for every build.

Contributor

@ocelotl ocelotl left a comment


Just a question regarding specification compliance 👍

Member

@aabmass aabmass left a comment


🚢

@NathanielRN NathanielRN force-pushed the add-sdk-span-performance-tests branch 4 times, most recently from 577533b to 22d48fa Compare December 11, 2020 07:31
@lzchen lzchen merged commit 496ed8b into open-telemetry:master Dec 11, 2020
@NathanielRN NathanielRN deleted the add-sdk-span-performance-tests branch December 11, 2020 20:52