Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: improve performance of test-crypto-timing-safe-equal-benchmarks #26237

Closed
wants to merge 2 commits into from

Conversation

Trott
Copy link
Member

@Trott Trott commented Feb 21, 2019

Resetting require.cache() to Object.create(null) each time rather than
deleting the specific key results in a 10x improvement in running time.

Fixes: #25984
Refs: #26229

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines

@Trott Trott added the crypto Issues and PRs related to the crypto subsystem. label Feb 21, 2019
@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Feb 21, 2019
@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

@bnoordhuis @not-an-aardvark @nodejs/crypto

@Trott Trott mentioned this pull request Feb 21, 2019
@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

(If this is a reasonable approach, this test can probably be moved out of pummel and start running again more regularly in CI.)

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

Pummel CI: https://ci.nodejs.org/job/node-test-commit-custom-suites/885/

Whoops, I used the wrong repo in the Parameters. Let's try again.

Pummel CI: https://ci.nodejs.org/job/node-test-commit-custom-suites/886/

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

Before this change:

$ time ./node test/pummel/test-crypto-timing-safe-equal-benchmarks.js 

real	0m56.422s
user	0m54.419s
sys	0m8.399s
$ 

With this change:

$ time ./node test/pummel/test-crypto-timing-safe-equal-benchmarks.js 

real	0m4.885s
user	0m4.689s
sys	0m1.418s
$ 

@BridgeAR
Copy link
Member

I am somewhat surprised that this change has a positive impact.

@bmeurer @hashseed does using delete on an object cause it to become so slow as shown here? Or do you spot what is causing the performance difference in this case?

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

(The speed increase is also seen on CI. Before this change, test was timing out taking more than 2 minutes. With this change, the test takes 11 seconds. I agree with @BridgeAR that this result is surprising.)

@bmeurer
Copy link
Member

bmeurer commented Feb 21, 2019

delete puts the object into dictionary mode. But that's also what you get with Object.create(null). However Object.create(null) can be inlined into optimized code, whereas delete always goes to C++.

@BridgeAR
Copy link
Member

@bmeurer is there a possibility to improve delete? The performance cliff is extreme in this case.

Copy link
Member

@joyeecheung joyeecheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a nit

test/pummel/test-crypto-timing-safe-equal-benchmarks.js Outdated Show resolved Hide resolved
@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

Post-fixup Lite CI for linting: https://ci.nodejs.org/job/node-test-pull-request-lite-pipeline/2688/

Post-fixup Pummel CI to actually run the test: https://ci.nodejs.org/job/node-test-commit-custom-suites/887/

@Trott Trott added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 21, 2019
@joyeecheung
Copy link
Member

joyeecheung commented Feb 21, 2019

BTW, this "fixes" the test by adding an optimization that's not really related to
#26229 (comment) - maybe we should just use vm.runInThisContext to run the script instead of requiring it and then resetting the cache? We don't actually have to load it as a CJS module for what is being tested here.

@hashseed
Copy link
Member

hashseed commented Feb 21, 2019

@bmeurer @hashseed does using delete on an object cause it to become so slow as shown here? Or do you spot what is causing the performance difference in this case?

The reason for the original bad performance was actually a memory leak.

Even though we delete the module keyed by the file name from the cache, it is added to children on the cache entry for the main test file.

This is the (partial) content of require.cache at the first call to runOneBenchmark:

[Object: null prototype] {
  '/usr/local/google/home/yangguo/node/test/pummel/test-crypto-timing-safe-equal-benchmarks.js':
   Module {
     id: '.',
     exports: {},
     parent: null,
     filename:
      '/usr/local/google/home/yangguo/node/test/pummel/test-crypto-timing-safe-equal-benchmarks.js',
     loaded: false,
     children: [ [Module], [Module], [Module] ],
...
}

This is after 10:

[Object: null prototype] {
  '/usr/local/google/home/yangguo/node/test/pummel/test-crypto-timing-safe-equal-benchmarks.js':
   Module {
     id: '.',
     exports: {},
     parent: null,
     filename:
      '/usr/local/google/home/yangguo/node/test/pummel/test-crypto-timing-safe-equal-benchmarks.js',
     loaded: false,
     children:
      [ [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module],
        [Module] ],
...

By resetting require.cache entirely, the module actually get garbage-collected. You can visualize the difference by running with --trace-gc.

@BridgeAR
Copy link
Member

@hashseed thanks for looking into it again.

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

I'd like to fast-track this to un-break the nightly CI job that runs pummel tests. Please 👍 here if you support fast-tracking, or leave a comment if you think that's unwarranted in this case.

@Trott Trott added the fast-track PRs that do not need to wait for 48 hours to land. label Feb 21, 2019
Copy link
Contributor

@not-an-aardvark not-an-aardvark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM, although it might be worth running the test a bunch of times to make sure it's not more flaky as a result of this change.

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

The change LGTM, although it might be worth running the test a bunch of times to make sure it's not more flaky as a result of this change.

Running the test as it appears in this PR 1000 times: https://ci.nodejs.org/job/node-stress-single-test/2160/

EDIT: Canceled after 521 runs, 458 success, 63 failure. Oddly, I can't get a failure on my local machine at all. Let's see what happens with master branch on CI stress test.

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

Because we're already seeing a small-but-larger-than-expected number of failures on the stress test for this PR, here's a stress test against master but with the timeout bumped to 10 minutes. The idea is to see how often the test currently fails if it's given enough time. We know it will probably fail every time with a timeout, but if the timeout were larger, would the test be as flaky as the version in this PR? Let's find out. https://ci.nodejs.org/job/node-stress-single-test/2161/

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

@not-an-aardvark Any chance at all the test isn't flaky and it really isn't timing-safe? (But, oddly, only on the CI machine but not on my local laptop. ¯\(ツ)/¯)

@Trott Trott removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 21, 2019
@not-an-aardvark
Copy link
Contributor

Any chance at all the test isn't flaky and it really isn't timing-safe? (But, oddly, only on the CI machine but not on my local laptop. ¯\(ツ)/¯)

I suppose it's possible, but it seems unlikely that this would have resulted from a V8/module-loading change given that the comparison function doesn't do much with V8/module-loading internals. Based on looking at the test failures, it seems like errors are happening in both directions (sometimes it takes longer when the two buffers are equal, and sometimes it takes longer when the two buffers are unequal) which doesn't seem like it would typically occur if it's not timing-safe.

I remember when I was originally trying to debug this, we were also encountering flakiness on some architectures but not others.

@hashseed
Copy link
Member

Is it possible that the timing difference this test is intended to measure has been overshadowed and therefore masked by the GC time caused by the memory leak?

@not-an-aardvark
Copy link
Contributor

If not, and the bug is in the test modifications in this PR, would Joyee's suggestion about using vm.runInThisContext() be a likely fix?

I'm not very familiar with vm.runInThisContext, but it seems like could be worth trying. My best guess is that the flakiness results in the engine sometimes optimizing a particular JS code path over another (e.g. the code path which creates intentionally unequal buffers, rather than the code path that creates intentionally equal buffers), and recompiling the function every time was helping to avoid that kind of noise.

If the original code had a memory leak, another possibility could be that some calls are getting interrupted by GC once in awhile and throwing off the timing.

@hashseed
Copy link
Member

Note that previously, the code was not being recompiled on every run because it hit V8's internal compilation cache, even though it was deleted from the require cache.

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

@hashseed @not-an-aardvark On the GC hypothesis, sorry to ask a potentially-ignorant or naive question, but is this a correct understanding of what you're saying?: The memory leak means that GC happens from time to time, which will be unpredictable, thus causing timing to appear safe (because it is unpredictable) when it may not actually be? So it would cause the test to pass when perhaps it shouldn't? Or am I misunderstanding?

@hashseed
Copy link
Member

I don't know what exactly the test asserts. Is it predictability or unpredictability? The memory leak will likely contribute to the latter.

@not-an-aardvark
Copy link
Contributor

The test fails if it can find a statistically significant difference between comparison times for equal buffers and comparison times for unequal buffers, using crypto.timingSafeEqual. Unpredictability in comparison times would add statistical noise and reduce the likelihood that it can find a statistically significant difference, making the test more likely to pass. (The test tries to remove high outliers from the timing sample, which could potentially avoid noise caused by GC pauses, but it's not clear how effective this measure is.)

The test also asserts that there is a statistically significant difference between timings for equal/unequal buffers when using the timing-unsafe Buffer#equals, so if there was an extremely large amount of unpredictability then the test would start failing again. This type of failure would result in a different error message.

The failures here seem to indicate that there is a timing anomaly being introduced somewhere, although it seems likely that the anomaly is an artifact of the test setup rather than timing unsafety in crypto.timingSafeEqual.

Resetting require.cache() to `Object.create(null)` each time rather than
deleting the specific key results in a 10x improvement in running time.

Fixes: nodejs#25984
Refs: nodejs#26229
@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

BTW, this "fixes" the test by adding an optimization that's not really related to
#26229 (comment) - maybe we should just use vm.runInThisContext to run the script instead of requiring it and then resetting the cache? We don't actually have to load it as a CJS module for what is being tested here.

@joyeecheung Any reason to avoid eval() in this situation rather than vm.runInThisContext() which will introduce marginally more complication?

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

Changed it up to eval(). 🤞

Stress test: https://ci.nodejs.org/job/node-stress-single-test/2162/

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

@Trott
Copy link
Member Author

Trott commented Feb 21, 2019

The eval version appears to run as quickly as the Object.create(null) version but does not at all appear to be flaky. (48 runs in a row on stress test so far with no failures. I'll let it get to at least 500 and probably 1000.) Re-reviews would be welcome at this point. Thanks for the patience, everyone.

@joyeecheung
Copy link
Member

My original suggestion about using vm.runInThisContext() was based on that the previous implementation of the CJS loader used that method to perform the compilation. If eval() hits the compilation cache in similar ways without introducing unpredictable optimization that interferes with the timing being tested (@hashseed ?) then LGTM

@Trott
Copy link
Member Author

Trott commented Feb 22, 2019

Stress test results: 1000 out of 1000 runs passed. 🎉

@Trott
Copy link
Member Author

Trott commented Feb 22, 2019

Because this is a fair bit different from what was originally opened in this PR, a couple re-reviews or new reviews would be welcome.

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Feb 22, 2019
@Trott Trott added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 22, 2019
@Trott
Copy link
Member Author

Trott commented Feb 22, 2019

And...now that we have green stress test and some re-reviews... would like to fast-track to unbreak CI, so I'm re-adding the fast-track label. Please 👍 here to fast-track.

@Trott Trott added the fast-track PRs that do not need to wait for 48 hours to land. label Feb 22, 2019
@Trott
Copy link
Member Author

Trott commented Feb 22, 2019

Landed in eb9e6c0

Thanks, everyone!

@Trott Trott closed this Feb 22, 2019
Trott added a commit to Trott/io.js that referenced this pull request Feb 22, 2019
Using `eval()` rather than `require()`'ing a fixture and deleting it
from the cache results in a roughtly 10x improvement in running time.

Fixes: nodejs#25984
Refs: nodejs#26229

PR-URL: nodejs#26237
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yang Guo <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Teddy Katz <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
addaleax pushed a commit that referenced this pull request Feb 25, 2019
Using `eval()` rather than `require()`'ing a fixture and deleting it
from the cache results in a roughtly 10x improvement in running time.

Fixes: #25984
Refs: #26229

PR-URL: #26237
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yang Guo <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Teddy Katz <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
@BridgeAR BridgeAR mentioned this pull request Feb 26, 2019
rvagg pushed a commit that referenced this pull request Feb 28, 2019
Using `eval()` rather than `require()`'ing a fixture and deleting it
from the cache results in a roughtly 10x improvement in running time.

Fixes: #25984
Refs: #26229

PR-URL: #26237
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yang Guo <[email protected]>
Reviewed-By: Ruben Bridgewater <[email protected]>
Reviewed-By: Teddy Katz <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
@Trott Trott deleted the fix-crypto-timing-test branch January 13, 2022 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. crypto Issues and PRs related to the crypto subsystem. fast-track PRs that do not need to wait for 48 hours to land. flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate flaky test-crypto-timing-safe-equal-benchmarks
10 participants