Why is the benchmark reporting 10-times higher values? #791
Comments
+1, I think it's worth going to 2.0.0 to fix this.
cc @johnmccutchan for insight
This is legacy behavior that we should not change. The purpose of this harness is to replicate the exact same benchmark conditions that we use internally. All historical data has this same scaling in place. If you have a benchmark that runs under this harness and you speed it up (or slow it down), you can see the relative change against your baseline. tl;dr: this isn't a "code timer" but a benchmark runner that is designed to match our internal benchmarking infrastructure.
Internally we could stay on 1.0.4 though, right? It seems wrong to force this behavior on all users of this package. Or maybe we could hide it behind an option?
@jakemac53 If the external version changes, it becomes impossible for us to compare results to our internal numbers. I'll discuss what we want to do long term with this package at the next compiler team meeting.
OK, sounds good. My main concern is that this package currently advertises itself as the officially endorsed package for writing Dart benchmarks, but really it seems like it's just for internal use if we can't ever make changes that would throw off our historical measurements. Instead, it seems like users should lock themselves to a particular version and choose to upgrade when the benefits of the new features outweigh the cost of having to normalize their historical data.
Seems like a flag would work. If the examples and internal benchmarks set the same flag, the results would be comparable, without affecting other benchmarks. |
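If such a flag were added, it might look roughly like the sketch below. This assumes the 1.x harness behavior, where `measure()` times `exercise()` and `exercise()` calls `run()` 10 times; the `ScaledBenchmark` class and `legacyScale` parameter are hypothetical, not part of `benchmark_harness`:

```dart
import 'package:benchmark_harness/benchmark_harness.dart' as harness;

/// Hypothetical sketch only: no such flag exists in benchmark_harness.
/// Internal benchmarks would pass `legacyScale: true` to stay comparable
/// with historical data; other users would get per-run times by default.
class ScaledBenchmark extends harness.BenchmarkBase {
  ScaledBenchmark(String name, {this.legacyScale = false}) : super(name);

  final bool legacyScale;

  @override
  double measure() {
    // In harness 1.x, measure() times exercise(), which calls run() 10x.
    final perExercise = super.measure();
    return legacyScale ? perExercise : perExercise / 10;
  }
}
```

Internal benchmarks and the examples would opt in with `legacyScale: true`; everyone else would get per-run numbers by default.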
This merges two benchmark packages into one to make it easier to run benchmarks on Golem and locally. `api_benchmark` is not merged: it runs a web server and benchmarks performance in an actual browser. We should revisit those benchmarks to see whether they still make sense and whether it's a good idea to merge them into the new benchmark package in this PR.

Changes:

- Remove `protobuf_benchmarks` and `query_benchmark`. Move the benchmarks in both to `benchmarks`.
- Use `benchmark_harness` in all benchmarks. Previously only `protobuf_benchmarks` used `benchmark_harness`.
- Because `benchmark_harness` reports 10x the actual number (https://github.com/dart-lang/benchmark_harness/issues/30), to stay compatible with the old `query_benchmark` results (which are stored in Golem) and to report accurate results in general, the `BenchmarkBase` class overrides the relevant method(s) from `benchmark_harness.BenchmarkBase` to divide the result by 10 before reporting; see the sketch after this list.

  Outputs before:

  ```
  RunTimeRaw(protobuf_decode): 973 us
  RunTimeRaw(protobuf_decode_json): 3914 us
  RunTimeRaw(protobuf_encode): 3135 us
  RunTimeRaw(protobuf_encode_json): 4606 us
  RunTimeRaw(protobuf_decode): 30 us
  ```

  Outputs after:

  ```
  protobuf_query_decode(RunTimeRaw): 973.4907766990291 us.
  protobuf_query_decode_json(RunTimeRaw): 3925.9745098039216 us.
  protobuf_query_encode(RunTimeRaw): 3115.623076923077 us.
  protobuf_query_encode_json(RunTimeRaw): 4568.952272727272 us.
  protobuf_query_set_nested_value(RunTimeRaw): 18.9181422625804 us.
  ```

  All benchmarks (`query_benchmark` and others) use this new class.
- Directory structure:
  - `datasets/` has benchmark inputs (i.e. binary-encoded protos)
  - `protos/` has benchmark protos
  - `lib/` has code common to all benchmarks (file operations, the benchmark class)
  - `tool/compile_protos.sh` compiles all protos
  - `tool/run_protoc_plugin.sh` runs the protoc plugin from `../protoc_plugin`; used by `compile_protos.sh`
  - `tool/compile_benchmarks.dart` compiles benchmark programs to native and JS, in parallel
  - `bin/` contains the benchmarks
  - `protoc_version` is used by Golem to download the right version of protoc
- Each benchmark now has its own executable. This allows benchmarking changes in isolation: for example, it avoids changing encoding benchmark results when only decoding code was modified, which could otherwise happen because of global analyses or JIT effects.
- `protobuf_benchmarks/` inputs no longer use [`benchmarks.proto`][1]. Instead, the input files are now the `payload` fields of `BenchmarkDataset`s, and the input file name specifies the message type.

Fixes #613

Golem cl 4855277

[1]: https://github.com/google/protobuf.dart/blob/15d9eed9f44dc3bab0738d33921d2608f3b158b5/protobuf_benchmarks/benchmarks.proto
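A minimal sketch of the override described above, assuming `benchmark_harness` 1.x where `measure()` times `exercise()` and `exercise()` calls `run()` 10 times (the actual class in the benchmarks package may differ in detail):

```dart
import 'package:benchmark_harness/benchmark_harness.dart' as harness;

/// Sketch of the workaround described in this PR; the real class may
/// differ. Dividing by 10 makes the reported figure microseconds per
/// run() call rather than per exercise() call.
class BenchmarkBase extends harness.BenchmarkBase {
  BenchmarkBase(String name) : super(name);

  @override
  double measure() => super.measure() / 10;
}
```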
I understand that it's beneficial (for the sake of accuracy) to run the measured function 10 times in a loop. But why is this the value that is actually reported? Why not report the value divided by 10? It's really confusing!
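For context on where the factor of 10 comes from: in the 1.x harness the timer wraps `exercise()`, not `run()`, and the elapsed time is divided by the number of `exercise()` calls. A simplified, self-contained paraphrase (not the exact package source):

```dart
/// Simplified paraphrase of benchmark_harness 1.x timing, not the exact
/// source. The stopwatch counts exercise() calls, and each exercise()
/// invokes run() 10 times, so the reported figure covers 10 runs.
class LoopExample {
  void run() {
    // code under test goes here
  }

  void exercise() {
    for (var i = 0; i < 10; i++) {
      run();
    }
  }

  double measure() {
    const minimumMicros = 2 * 1000 * 1000; // sample for at least ~2s
    final watch = Stopwatch()..start();
    var iterations = 0;
    while (watch.elapsedMicroseconds < minimumMicros) {
      exercise();
      iterations++;
    }
    // Elapsed time per exercise(), i.e. 10x the per-run() time.
    return watch.elapsedMicroseconds / iterations;
  }
}
```

Reporting `measure() / 10`, or timing `run()` directly, would give the per-call figure this issue asks for.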