Refactor benchmark tools for statistical significance #7094

Merged
merged 9 commits into from
Jul 26, 2016

Conversation

AndreasMadsen
Member

@AndreasMadsen AndreasMadsen commented Jun 1, 2016

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

benchmark

Description of change

I have been rather confused about the benchmark suite, and I don't think it is as user-friendly as the rest of node core. This PR attempts to remove most of the confusion I faced when I started using it. Primarily it:

  • removes unused/undocumented files
  • allows partially setting the benchmark variables using process arguments.
  • refactors compare.js so that comparing node versions and getting statistical significance is easy.
  • refactors the plot.R tool (now called scatter) to show a scatter plot with confidence bars.
  • refactors the cli tools so that the cli API is more homogeneous.
  • documents all the tools.
  • removes the implicit process.exit(0) after bench.end().
  • uses process.send to avoid most parsing (the benchmark cli arguments haven't changed).
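
As a rough illustration of the `process.send` point in the last bullet, here is a minimal sketch; it is not the code in this PR. The helper name `reportResult`, its `proc` parameter, and the result shape (`name`, `conf`, `rate`) are assumptions made for the example. The only real API relied on is that `process.send` exists when the script was started through `child_process.fork` and is undefined otherwise.

```javascript
'use strict';

// Hypothetical helper: report one benchmark result.
// `proc` defaults to the real process object; it is a parameter only so the
// two code paths can be exercised without actually forking a child.
function reportResult(result, proc = process) {
  if (typeof proc.send === 'function') {
    // Forked child: an IPC channel exists, so send structured data and let
    // the parent skip any stdout parsing.
    proc.send(result);
    return 'ipc';
  }
  // Run directly from a shell: fall back to a human-readable line on stdout.
  const conf = Object.keys(result.conf)
    .map((key) => `${key}=${result.conf[key]}`)
    .join(' ');
  proc.stdout.write(`${result.name} ${conf}: ${result.rate}\n`);
  return 'stdout';
}
```

A runner that forks the benchmark would then listen with `child.on('message', ...)` and collect the objects directly, while a person running the file by hand still sees printed lines.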

The specifics are documented in the commit messages. Please also see the new README, as quite a lot has changed (be sure to check my spelling!).

Note that some benchmarks take a very long time to complete, e.g. timers/timers.js type=depth thousands=500 takes 11.25 min. Thus running it 30 times for statistical significance is unreasonable. I suspect the only reason it is set to so many iterations is to get a small variance, but with the new compare tool the variance can be estimated instead of being reduced. Thus we can reduce the number of iterations and still get the information we need. But I suggest we do that in another pull request, as it is a very different discussion.

Motivation (long story): I wanted to benchmark the effect of some async_wrap changes. I went to the benchmark/ directory and read the README. However I quickly discovered that it was primarily about running benchmarks a single time and how to write benchmarks. And most importantly it didn't explain how to compare two node versions. This is now documented in the new README.

I then had to search for the tools myself and discovered a large number of benchmark files which were not put into categorized directories. I assumed they were somehow extra significant, but in reality they just appear to be unused. These files are now removed.

After discovering the compare tool, which has the cli API

node benchmark/compare.js
            <node-binary1> <node-binary2> +
            [--html] [--red|-r] [--green|-g] +
            [-- <type> [testFilter]]

I was confused about what --red and --green meant and how node-binary1 and node-binary2 were compared: should I write ./node-old ./node-new or ./node-new ./node-old if I wanted a positive improvement factor to signify an improvement? The new compare API is:

usage: ./node benchmark/compare.js <type> ...
  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --var    variable=value     set benchmark variable (can be repeated)
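
To make the repeatable `variable=value` flag more concrete, here is a small sketch of how such pairs could narrow a benchmark's parameter grid. It is an illustration under assumptions, not the implementation in this PR: the function name `applyOverrides` and the shape of `defaults` (each variable maps to an array of candidate values) are hypothetical. The value type is taken from the declared defaults rather than guessed from the string.

```javascript
'use strict';

// Hypothetical sketch: narrow a parameter grid with `variable=value` pairs.
// `defaults` maps each variable name to an array of candidate values.
function applyOverrides(defaults, pairs) {
  // Start from a copy of the defaults so the input is not mutated.
  const config = {};
  for (const key of Object.keys(defaults)) {
    config[key] = defaults[key].slice();
  }

  const overridden = new Set();
  for (const pair of pairs) {
    const eq = pair.indexOf('=');
    if (eq === -1) {
      throw new Error(`expected variable=value, got "${pair}"`);
    }
    const key = pair.slice(0, eq);
    if (!Object.prototype.hasOwnProperty.call(defaults, key)) {
      throw new Error(`unknown variable "${key}"`);
    }
    const raw = pair.slice(eq + 1);
    // Use the type of the declared default, not the look of the string.
    const value = typeof defaults[key][0] === 'number' ? Number(raw) : raw;
    // The first override of a variable replaces its defaults; repeats append.
    if (!overridden.has(key)) {
      config[key] = [];
      overridden.add(key);
    }
    config[key].push(value);
  }
  return config;
}
```

With this sketch, passing only `type=depth` would pin `type` while leaving every other variable at its full default list, which is the "partial" part of the idea.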

Even after understanding common.js, it was still unclear whether the performance difference was statistically significant. I tried running the benchmark 5 times and 4 of the 5 runs showed an improvement, although I was expecting the performance to be the same or slower (spoiler: it wasn't significant). The compare.js script now runs the benchmarks many times (30 by default) and there is an R script to analyse the csv results.

At this point I wanted to do a rewrite of the benchmark tools (not the benchmarks themselves) and changed a few other things in the process as well. I'm a mathematician, so I care a lot about statistical significance :)

@AndreasMadsen AndreasMadsen added the benchmark Issues and PRs related to the benchmark subsystem. label Jun 1, 2016
@AndreasMadsen
Member Author

I'm not sure who to cc for this one.
/cc @Trott as you appear to have made some recent benchmark changes.

@Trott
Member

Trott commented Jun 5, 2016

@nodejs/benchmarking

@Trott
Member

Trott commented Jun 5, 2016

In theory, this sounds fantastic to me! In practice, there's so much about benchmarking that I'm ignorant about, I have to defer to others.

@jasnell
Member

jasnell commented Jun 6, 2016

Very nice. @mscdex @bnoordhuis ... any thoughts on this?

var s;
for (var i = 0; i < n; i++) {
s = '01234567890';
s[1] = 'a';
Contributor

Perhaps this line was added to prevent v8 from optimizing the for-loop away or something (since s wouldn't have been referenced)?

Member Author

Perhaps. With use strict it is definitely broken. Looking at the original commit ( 12a169e - 6 years ago) it seems like it was just a misunderstanding of how strings work. The commit appears to compare strings and buffers, which are not comparable in this case as strings are immutable.
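
To illustrate the immutability point, here is a minimal sketch; it is not the removed benchmark. The function names are made up for the example. In strict mode the indexed write throws a `TypeError`; in sloppy mode it is silently ignored, and in both cases the string is unchanged.

```javascript
'use strict';

// Strict mode: writing to an index of a string primitive throws.
function tryIndexedWrite() {
  const s = '01234567890';
  try {
    s[1] = 'a';
    return { threw: false, value: s };
  } catch (err) {
    return { threw: err instanceof TypeError, value: s };
  }
}

// Sloppy mode: functions built with the Function constructor do not inherit
// this file's strict mode, so the same write is silently ignored.
const sloppyIndexedWrite = new Function(
  "var s = '01234567890'; s[1] = 'a'; return s;"
);

console.log(tryIndexedWrite());
console.log(sloppyIndexedWrite());
```

Either way the loop body in the old benchmark never mutated anything, so it could not measure string mutation against buffer mutation.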

@mscdex
Contributor

mscdex commented Jun 6, 2016

Just briefly looking over it, it mostly seems to look ok except for a few nits.

I did spot a typo in the benchmark: add script for creating scatter plot commit message body.

//
// Parse arguments
//
const cli = CLI(`usage: ./node benchmark/compare.js <type> ...
Contributor

@mscdex mscdex Jun 6, 2016

There should probably be some explanation (in the help text) about what <type> should be exactly...

Member Author

are you referring to <type>?

Contributor

Yes, markdown cut that part out.

Contributor

iirc there was another one of these in another commit. just to watch out for it.

Member Author

They should all be fixed. Unless you are talking about the R scripts, but they only take -- arguments.

Member

btw, the first few times I tried running the new I found it very confusing that the type argument needed to appear before the arguments starting with -- (i.e. compare.js --new bla --old blah http did not work). I almost never use CLIs with that argument order, and just showing this usage text wasn’t exactly helpful, either.

You don’t need to change the behaviour, but maybe add a note here about that and for the other scripts where it applies?

Member Author

Can you elaborate on that note? I think the help text is already very specific; this is the message you get now:

usage: ./node benchmark/compare.js <type> ...
  Run each benchmark in the <type> directory many times using two diffrent
  node versions. More than one <type> directory can be specified. The output is
  formatted as csv, which can be processed using for example 'compare.R'.

  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --set    variable=value     set benchmark variable (can be repeated)

I chose this order because it could be implemented with less code.

I will try and change the argument order, this appears to cause a lot of confusion for many people, but I would love to understand why.

Member

I will try and change the argument order, this appears to cause a lot of confusion for many people, but I would love to understand why.

If I had to guess, I’d say it’s because that’s the order usually suggested in man pages and --help texts, and maybe because the positional arguments are the ones one is most likely to spend more time editing before hitting enter… idk, maybe there’s more to it.

Member Author

@AndreasMadsen AndreasMadsen Jul 23, 2016

Oh, I understand that the order is confusing (it is fixed now). However, this is the third comment I have received about a missing note, and unless I'm misunderstanding the comment, there is a note just one line below.

@AndreasMadsen
Member Author

@mscdex thanks. Updated as suggested.

@AndreasMadsen
Member Author

ping

```

## How to write a benchmark test
After generating the csv, a comparens table can be created using the `scatter.R`
Contributor

s/comparens/comparison ?

@mscdex
Contributor

mscdex commented Jun 11, 2016

/cc @nodejs/collaborators

@ChALkeR
Member

ChALkeR commented Jun 11, 2016

@mscdex What's the semver status of this? Major?

@mscdex
Contributor

mscdex commented Jun 11, 2016

@ChALkeR I don't know how benchmarks are covered when it comes to that kind of thing. I would guess they are treated like tests or docs since they are not a part of the runtime?

@mcollina
Member

I'll go for major, it makes things easier and less complicated.

One thing that is not clear from the document is how the statistical significance is achieved.

@AndreasMadsen
Member Author

AndreasMadsen commented Jun 14, 2016

@mscdex Thanks for the suggestions, I will update the documentation tomorrow.

@mcollina It runs each benchmark a given number of times (--runs) using the new and old node binaries that are provided to compare.js. Using the R script it then ...

... makes an independent/unpaired 2-group t-test, with the null hypothesis that the performance is the same for both versions. The significant field will show a star if the p-value is less than 0.05.
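
For readers who want to see the arithmetic behind an unpaired two-group t-test, here is a sketch in JavaScript. It assumes the unequal-variance (Welch) form, which is an assumption on top of the description above; the comment only says independent/unpaired. The function names are hypothetical, and this is not the R script from the PR. It computes the t statistic and the degrees of freedom only; turning those into a p-value needs the Student's t distribution, which R provides.

```javascript
'use strict';

// Arithmetic mean of a non-empty array of numbers.
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  const squares = xs.reduce((sum, x) => sum + (x - m) * (x - m), 0);
  return squares / (xs.length - 1);
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom for two
// independent samples, e.g. operation rates from the old and new binary.
function welchTTest(oldRates, newRates) {
  const vOld = sampleVariance(oldRates) / oldRates.length;
  const vNew = sampleVariance(newRates) / newRates.length;
  const t = (mean(newRates) - mean(oldRates)) / Math.sqrt(vOld + vNew);
  const df = ((vOld + vNew) * (vOld + vNew)) /
    ((vOld * vOld) / (oldRates.length - 1) +
     (vNew * vNew) / (newRates.length - 1));
  return { t, df };
}
```

A positive `t` here means the new binary had the higher mean rate. Whether that difference earns a star still depends on the p-value for that `t` and `df` being below 0.05.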

I think the compare documentation is fairly clear on this. But do tell me how I can improve it.

@AndreasMadsen
Member Author

Thanks for the review. Landed in ee2843b edbed3f 0f9bfaa f3463cf3061931b5c94ba9c753c1d75ee4d2b712 1f64ceba89a074f9e23196d019d56f00cdd4577a 01fbf656a3874d189cadeced08266a26ea526491 de9b44c0889d2264436277848762f1ebf868aa57 6e745d7a7586b12b894537192726bf2b999a456d 693e7be399e4c0964b5bbceaee6e8326c7c02a42

@addaleax
Member

Uh, you might want to back these commits out of master for now, the linter complains about benchmark/_cli.js

@AndreasMadsen
Member Author

As in force push?

@addaleax
Member

addaleax commented Jul 26, 2016

@AndreasMadsen I’d do that for now. Could you fix that, and maybe do a CI or linter run before re-landing? ;)

@addaleax addaleax reopened this Jul 26, 2016
This removes the need for parsing stdout from the benchmarks. If the
process wasn't executed by fork, it will just print like it used to.

This also fixes the parsing of CLI arguments, by inferring the type
from the options object instead of the value content.

Only two benchmarks had to be changed:

* http/http_server_for_chunky_client.js this previously used a spawn
now it uses a fork and relays the messages using common.sendResult.

* misc/v8-bench.js this utilized that v8/benchmark/run.js called
global.print and reformatted the input. It now interfaces directly
with the benchmark runner global.BenchmarkSuite.

PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Previously bench.end would call process.exit(0); however, this is rather
confusing, and indeed a few benchmarks had code that assumed otherwise.

This adds process.exit(0) to the benchmarks that need it.

PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
The data sampling is done in node and the data processing is done in R.
Only plyr was added as an R dependency and it is fairly standard.

PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Previously this was a tool in `plot.R`. It is now a more complete tool
which executes the benchmarks many times and creates a boxplot.

PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Strings were never mutable, so it is not clear what this benchmark
attempts to do. This did work at some point, but only because the
benchmark wasn't using strict mode.

PR-URL: #7094
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
@AndreasMadsen
Member Author

AndreasMadsen commented Jul 26, 2016

Thanks for the quick eye. I have force pushed and updated the PR. I wish I knew how it happened.

CI: https://ci.nodejs.org/job/node-test-pull-request/3422/

@addaleax
Member

Well, yeah, I’ve had the, ahem, pleasure of breaking master by not having run CI again before landing myself in the recent past. :)

Anyway, CI looked good before it went all 502 (FreeBSD failure is unrelated and only the Windows tests were remaining), I’d say you can land this. Thanks!

@AndreasMadsen AndreasMadsen merged commit d525e6c into nodejs:master Jul 26, 2016
@AndreasMadsen
Member Author

Landed in: ee2843b edbed3f 0f9bfaa f99471b 8bb59fd 855009a 0c0f34e 6edef1d d525e6c

@addaleax addaleax added the semver-major PRs that contain breaking changes and should be released in the next major version. label Jul 27, 2016
@addaleax
Member

Labelled this semver-major because that’s what has been suggested above, and #7890 shows that people obviously were using APIs of the old benchmarking scripts.

@AndreasMadsen
Member Author

Sounds good. This is obviously not backward compatible and it is quite easy to use the new tools on an old node version.

Also I don't really want to backport this ;)

Trott added a commit to Trott/io.js that referenced this pull request Aug 31, 2016
MylesBorins pushed a commit that referenced this pull request Sep 4, 2016
Enable `brace-style` in ESLint.

Ref: #7094 (comment)

PR-URL: #8348
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Myles Borins <[email protected]>
MylesBorins pushed a commit that referenced this pull request Sep 28, 2016
rvagg pushed a commit that referenced this pull request Oct 18, 2016
MylesBorins pushed a commit that referenced this pull request Oct 26, 2016
@gibfahn gibfahn mentioned this pull request Jun 15, 2017
3 tasks