
benchmark: refactor buffer benchmarks #26418

Closed

Conversation

BridgeAR
Member

@BridgeAR BridgeAR commented Mar 4, 2019

Currently the buffer benchmarks take significantly too long to
complete. This drastically reduces the overall runtime by removing
obsolete checked variations and reducing the iteration count.

It also improves the benchmarks by removing the deprecated
new Buffer(size) usage and some other small improvements.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added benchmark Issues and PRs related to the benchmark subsystem. buffer Issues and PRs related to the buffer subsystem. labels Mar 4, 2019
mscdex
mscdex previously requested changes Mar 4, 2019
Contributor

@mscdex mscdex left a comment


I don't think we should be removing some of the parameter values, especially ones that can trigger different code paths.

In general, I think instead of making these changes, it'd be better to allow for a list of benchmark file filters in CI so we can exclude specific benchmarks we know do not exercise code paths being tested.

@BridgeAR
Member Author

BridgeAR commented Mar 4, 2019

@mscdex I went through the results on a CI machine and checked whether the outcomes were mostly identical or not. Most of the cases are redundant and do not provide an actual gain. If anyone wants to execute specific parameters that are no longer provided, it's always possible to add those locally. If the difference is substantial, it's always possible to add them back to the code base.

The filters as you suggested them do not seem to solve the general problem that we have a lot of entries that are pretty much identical. E.g., a lot of buffer functions have optional length or encoding options without those actually changing much in any benchmark. I really tried hard to only remove ones that seemed well covered, so I kindly ask you to go through the removed entries and outline which entries you believe should still be kept, and why.

The problem does not only manifest itself on the CI. More importantly to me, it's an issue when running these locally.

I would like to switch from "benchmark most code branches of a specific function" to "cover most functions with their most common feature set".

@mscdex
Contributor

mscdex commented Mar 4, 2019

If anyone wants to execute specific parameters that are not provided it's also always possible to add those locally.

One problem with that is that not everyone making changes to runtime code will necessarily know to add/check those (missing) cases that may be affected. The people who have committed benchmarks are typically much more knowledgeable about that sort of thing, because they were making performance-related changes in the same PR when the benchmark (or its new parameters) was added.

@mscdex
Contributor

mscdex commented Mar 4, 2019

In addition to being able to specify multiple filters in CI, I think being able to somehow specify groups of configurations for a particular benchmark file would be beneficial, because some parameters do not need to be tried more than once, or sometimes they don't make sense when combined with other parameter values.

@BridgeAR
Member Author

BridgeAR commented Mar 4, 2019

@mscdex I absolutely agree that mostly the person who wrote the benchmark knows best what cases are affected. That's also the crux: we have a lot of benchmarks that only check edge cases, and cases that were important for specific code that changes over time and might not follow the original intent anymore. Relying on the existing ones as a generic measure is not ideal if we test each parameter. Instead, we could just ask for another benchmark to be added if we feel that a code change should be tested and is not yet covered by the existing benchmarks.

I believe that we gain the most if the benchmarks focus on the generic feature without distinguishing each parameter. We really have a lot of redundant cases here.

@BridgeAR
Member Author

BridgeAR commented Mar 4, 2019

@mscdex

specify groups of configurations for a particular benchmark file would be beneficial

That would indeed be good. I'll open a help wanted issue for it.

@BridgeAR
Member Author

@mscdex would you be so kind as to propose a way to reduce the overall runtime of the benchmarks? They are currently clearly very far from being "micro" benchmarks (the buffer benchmarks take more than a whole day!). We test a whole lot of cases that are extreme edge cases, only specific to a single code path that is rarely taken.

We could go ahead and have a "generic" folder containing the more generic benchmarks, and keep the benchmarks that test a whole lot of other cases separate. I personally do not believe that we really need those, though, as it should be pretty straightforward to add the extra cases when required (it just requires adding some parameters).

@BridgeAR
Member Author

@nodejs/benchmarking PTAL

@mscdex
Contributor

mscdex commented Mar 13, 2019

@BridgeAR I think having the grouped configurations will go a long way to helping reduce the time taken without having to remove useful default parameter values.

@BridgeAR
Member Author

@mscdex I don't think that it'll have enough influence. It would be great to have the buffer benchmarks run in a reasonable time again, independent of the grouping, especially as setting up the groups will take quite some work.

@BridgeAR
Member Author

BridgeAR commented May 1, 2019

@nodejs/tsc PTAL. I think this is quite an important PR: our current buffer benchmarks take significantly more than a day (I guess more than two days, but I stopped the benchmarks after ~1.5 days each time). These are far from being "micro" benchmarks, and IMO we test way too many edge cases.

If we only reduced the iterations for the current variations, we'd have to reduce them so significantly that the benchmark might no longer have high statistical significance. Hence the combination of reducing the iteration count and removing some variations (mostly just edge cases).
We can always add a variation locally to verify some edge cases and post the result in the PR, or add such variations temporarily in the PR and remove them again before landing (to run the benchmark on our CI).

I think it would be good to have a recommendation on what cases we should handle in our benchmarks (I suggest few or no edge cases) and on what the maximum runtime of our benchmarks should be (I suggest an average of around 1-2 seconds per variation, trying to limit the variations to the most common ones).

Therefore I would like the TSC to decide what should be done to reduce the overall runtime, and whether this PR is OK as it is or what alternatives should be used instead.

@BridgeAR BridgeAR added tsc-agenda Issues and PRs to discuss during the meetings of the TSC. review wanted PRs that need reviews. labels May 1, 2019
@Trott
Member

Trott commented May 7, 2019

In general, I think instead of making these changes, it'd be better to allow for a list of benchmark file filters in CI so we can exclude specific benchmarks we know do not exercise code paths being tested.

FWIW, filtering already sort of exists, but it is completely non-obvious. You essentially need to trick Jenkins into applying the necessary command-line parameters by sneaking them into seemingly unrelated boxes that happen to be inserted into the command line at the right place. You can sneak --set into one box and --filter into one of the other boxes, IIRC.

@Trott
Member

Trott commented May 7, 2019

I can definitely see both sides of this and haven't yet formed a solid opinion. I don't think our default benchmarks should take forever to run, but I also think great care and consideration should be taken in adjusting them to run faster. I'm not saying that care and consideration hasn't been taken here; it's more that I don't know the appropriate level of care and consideration when I see it, so I don't know.

@mcollina
Member

mcollina commented May 8, 2019

I do not think it's possible to come to a conclusion on this. To provide an informed opinion, we would have to backtrack why every single variation was added, as this worsened over time. That is a gargantuan task, considering that this PR changes 23 files.

Microbenchmarks are useful only if they can detect a regression. However, these take so long that they are not actually worth running, essentially negating their purpose. Therefore, I'm in favor of landing this change.

@mhdawson
Member

mhdawson commented May 8, 2019

I think we have similar problems in other areas of the benchmarks as well. Instead of removing variations, would it make any sense to start with a "quick" option that runs a subset of each category, with a target of one hour or less per category? That might be something people would be more likely to run. Once that was in place, we could then decide whether "quick" or "full" should be the default.

@BridgeAR
Member Author

BridgeAR commented May 8, 2019

@Trott

filtering already exists sort of but is completely not obvious

Even if the filters were more obvious, I don't think it's a good idea for the default run to take too long. What is important is being able to run arbitrary configurations without having to change any code, and that is indeed still possible with our CI (and that's great!).

I don't know the appropriate level of care and consideration

Some hints: our number of iterations is almost always too high. A benchmark should produce relevant results after significantly fewer iterations than we normally use. As a ballpark: how many iterations would a user run in a very tight loop? Increase that by a factor of ten and you should hopefully reach the CPU limit. For most benchmarks, each such variation should take a few hundred milliseconds, but not more.

I'll go through the changes to outline the main differences as I see them.
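
That ballpark can be sketched in plain Node.js (this is an illustrative stand-in, not code from this PR or from `benchmark/common.js`; the function under test, the budget, and the batch size are invented for the example): cap each variation by a time budget instead of a huge fixed iteration count.

```javascript
// Hedged sketch: run a benchmark variation until a small time budget is
// exhausted, then report iterations and ns/op. All numbers are illustrative.
const BUDGET_MS = 200; // "a few hundred milliseconds" per variation

function runVariation(fn) {
  let iterations = 0;
  const start = process.hrtime.bigint();
  let elapsedMs = 0;
  while (elapsedMs < BUDGET_MS) {
    for (let i = 0; i < 1000; i++) fn(); // batch calls to keep timer overhead low
    iterations += 1000;
    elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  }
  return { iterations, nsPerOp: (elapsedMs * 1e6) / iterations };
}

const buf = Buffer.from('hello world');
const result = runVariation(() => buf.toString('hex'));
console.log(result.iterations, 'iterations,', result.nsPerOp.toFixed(1), 'ns/op');
```

With a budget like this, total runtime scales with the number of variations rather than with a per-file iteration constant, which is the trade-off under discussion.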

@Trott
Member

Trott commented May 22, 2019

6 days ago in #26418 (comment), @sam-github suggested that a vote was premature and encouraged further work towards consensus. But no one has responded since.

We can encourage conversation (as @sam-github did) but we can't force it. Our options (other than "do nothing" or "close this PR") are to continue to encourage conversation or else to have a TSC vote.

@mscdex @BridgeAR Any thoughts on Sam's comments?

@sam-github If there is no further participation in the conversation, would you be inclined to resume the vote? Or, in your opinion, is people simply opting out of a discussion insufficient grounds to move to a TSC vote?

@sam-github
Contributor

No strong opinion either way.

@mscdex
Contributor

mscdex commented May 22, 2019

@Trott I've already described alternatives to this PR that I believe would be better.

As far as running benchmarks on a regular interval outside of PRs goes, that is something that could be considered outside of this current topic, as it's not a replacement for checking benchmarks before changes (which may appear to have performance implications) get merged in the first place.

Also, about default iterations: the reason they were/are so high is that you need to ensure you give V8 enough time to properly optimize/inline/deoptimize/etc. functions and code, because chances are that is what would be happening in the real world in a long-lived Node process.
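
The warm-up concern can be sketched like this (a hypothetical example, not the harness's actual code; the function and iteration counts are invented): separate an untimed warm-up phase, during which V8's optimizing tiers kick in, from the timed phase, which can then use far fewer iterations.

```javascript
// Illustrative sketch (not benchmark/common.js): warm a function up so V8 can
// optimize/inline it, then time a modest iteration count.
function toHex(buf) {
  return buf.toString('hex');
}

const buf = Buffer.from('hello world');

// Warm-up phase: untimed calls so the optimizing compiler can do its work.
for (let i = 0; i < 1e4; i++) toHex(buf);

// Timed phase: with the function already optimized, fewer iterations suffice.
const n = 1e5;
const start = process.hrtime.bigint();
for (let i = 0; i < n; i++) toHex(buf);
const elapsedNs = Number(process.hrtime.bigint() - start);
console.log((elapsedNs / n).toFixed(1), 'ns/op');
```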

@sam-github
Contributor

@mscdex

I don't think we should be removing some of the parameter values

I think at least going through the PR and commenting on the specific things you do and don't object to would be helpful. That would allow landing all the non-controversial changes, and it would make your specific objections easier to understand (for bystanders, now being asked to vote :-(). At this point, I don't know what specifically you don't want removed. Also, the suggestion of a whole new config system for benchmarks is interesting, but it seems pretty tangential to this: it would just move code to config, and if the default config had less coverage, it doesn't seem like it would change your objections to removing "some parameter values". As is, this PR is too large to understand what is or is not controversial, and since benchmarks can always be added back again in the future, it looks like if it comes to a vote it will land as-is.

@BridgeAR
Member Author

@sam-github

I prefer to work towards consensus rather than voting to ignore concerns.

I believe we always try to do that, and no one is going to argue against it! At least to me it also seemed like we did? The PR has been open for almost three months, and we have discussed a couple of possibilities in the meanwhile (sadly without finding an agreement).

How long the benchmarks should take depends on the purpose. If there are multiple purposes, we probably need different groups. I don't think it's bad to slim down the tests now before better grouping exists. Tests can always be added again if we need them.

Absolutely.

I've worked with benchmark suites that took weeks to run, and were only run before major releases, as well as benchmarks that were run weekly, and had to complete within the week.

There are definitely benchmark suites that test lots of cases and require a very long run time to finish. I just do not see that being the case for any of the microbenchmarks we currently have in our code base.
Mostly, each benchmark file tests one or a few specific APIs. But the argument matrix that we use is mostly about very specific code branches that differ only in one or two lines of code setting up the arguments to reach the actual heavy API call underneath. They often just hit a few other if statements that assign a default value (I have an example below). So at least to me, this case cannot be compared with such an in-depth benchmark. It could better be compared with running all our microbenchmarks in one go, and I would expect that to take significantly more time overall.

Being able to run an individual benchmark quickly and often allows us to detect regressions in PRs that touch this code. But that is currently not possible due to the extreme run time. I have also not found any indication that the high iteration count and the big argument matrix have brought us any benefit so far.

@mscdex

Also about default iterations, the reason they were/are so high is that you need to ensure you give V8 enough time to properly optimize/inline/deoptimize/etc. functions and code because chances are that is what would be happening in the real world in a long-lived node process.

The V8 team has its own benchmarks and tests that verify the behavior after significantly fewer iterations. The code should be optimized and inlined after a few iterations, since there is no other code in between. @nodejs/v8 @psmarshall I guess you could shed some light here?


A couple of cases that I removed would only hit the argument-defaulting code path, which looks like:

```js
function foo(input, encoding, callback) {
  // Shift arguments when the optional encoding is omitted.
  if (typeof encoding === 'function') {
    callback = encoding;
    encoding = 'utf8';
  }
  actualHeavyAPICall(input, encoding, callback);
}

foo('bar', function callback() {});
foo('bar', 'utf8', function callback() {});
```

I do not think we actually have to verify each argument variation all the time, since the main API is not changed by any of them. Instead, I would like to only test the main API with a few different input cases by default. On top of that, we are always able to run more cases by explicitly providing the arguments on the command line.
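
That command-line escape hatch can be sketched like this (an illustrative stand-in, not the actual parser in `benchmark/common.js`; the parameter names and defaults are invented): key=value arguments override a benchmark's default parameter matrix, so a removed variation can still be exercised ad hoc.

```javascript
// Hedged sketch: interpret key=value CLI arguments as overrides for a
// benchmark's default parameter matrix. Names and defaults are hypothetical.
const defaults = { encoding: ['utf8'], n: [1e5] };

function applyOverrides(config, argv) {
  const out = { ...config };
  for (const arg of argv) {
    const idx = arg.indexOf('=');
    if (idx === -1) continue;
    const key = arg.slice(0, idx);
    const value = arg.slice(idx + 1);
    // Numeric-looking values become numbers; everything else stays a string.
    if (key in out) out[key] = [Number.isNaN(Number(value)) ? value : Number(value)];
  }
  return out;
}

// e.g. `node bench.js encoding=latin1 n=1e6` re-enables a removed variation.
console.log(applyOverrides(defaults, process.argv.slice(2)));
```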

Right now we do not run our buffer benchmarks at all because they take so long. By reducing the iterations (which makes them behave more like a real-world application) and removing edge cases, I believe that we make them more valuable.

@BridgeAR
Member Author

@nodejs/tsc is there anything I should / can do to move this forward?


@mhdawson
Copy link
Member

I'd easily +1 if we had a default and a long option for the benchmarks, with the default being what is proposed in this PR and the long option being what it was before. I'm not going to block, but I can't convince myself to +1 either, so I'll abstain.

@mhdawson
Member

@Trott we agreed in the meeting that we should complete a vote. Do you want to manage completing the voting process?

@Trott
Member

Trott commented Jun 20, 2019

@Trott we agreed in the meeting that we should complete a vote. Do you want to manage completing the voting process?

With 10 TSC members abstaining, 5 members voting in favor, and 3 members not (yet) voting, this passes and can land if CI is green.

@Trott Trott dismissed mscdex’s stale review June 20, 2019 15:46

Escalated to TSC, PR approved.


@Trott Trott removed the tsc-agenda Issues and PRs to discuss during the meetings of the TSC. label Jun 20, 2019

Trott pushed a commit to Trott/io.js that referenced this pull request Jun 20, 2019
Currently the buffer benchmarks take significantly too long to
complete. This drastically reduces the overall runtime by removing
obsolete checked variations and reducing the iteration count.

It also improves the benchmarks by removing the deprecated
`new Buffer(size)` usage and some other small improvements.

PR-URL: nodejs#26418
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Peter Marshall <[email protected]>
Reviewed-By: Franziska Hinkelmann <[email protected]>
Reviewed-By: Anatoli Papirovski <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
@Trott
Member

Trott commented Jun 20, 2019

Landed in 7b80268

@Trott Trott closed this Jun 20, 2019
targos pushed a commit that referenced this pull request Jul 2, 2019
@targos targos mentioned this pull request Jul 2, 2019
@BridgeAR BridgeAR deleted the refactor-buffer-benchmarks branch January 20, 2020 12:03