http/net: _unrefActive() is extremely expensive #8160
The linear scan of the timer list was my least favorite part of the implementation. This is definitely something I want to get fixed in 0.10 and 0.12, perhaps with something like a timer wheel.
@bnoordhuis @tjfontaine I'd like to give it a shot. I agree with your suggestions about how to approach the problem. Here's my plan:
Please let me know what you think. |
@misterdjules Sounds like a good approach to me. A quick note on the What should work, however, is making If true, it expired but if false, you reschedule it in
As a final note, I use |
I'm not entirely sure a timer wheel is really necessary; I think for short-term implementations you can actually just use a heap.
@tjfontaine Alright, then for 2), instead of implementing a timer wheel I'll use a heap. |
[Off topic] @bnoordhuis Can you tell how you came up with that list of cost centers? |
@chirag04 Certainly. That's the output of the V8 tick processor. The tick processor depends on the d8 binary, so you need to build V8 first. If you're using master, take note of #8088 (which was fixed only recently.) HTH.
Awesome! Thanks @bnoordhuis |
@bnoordhuis @tjfontaine I implemented approach 1) in misterdjules@802b4e3. It would be great if you had some time to look at it and let me know if it's in line with what we discussed. I also added a test to the test suite that specifically exercises this change. Below are the results I gathered so far. They seem encouraging, but I have no experience benchmarking changes in Node.js, and I know it's easy to see what you want to see in benchmarks. Any comments and suggestions would be very much appreciated, thank you! Node.js server program used for the benchmarks:
With linear scan in _unrefActive():
I just realized @bnoordhuis benchmark was based on v0.10. I will rebase my changes on it and post benchmarks based on v0.10 so that we can compare apples to apples. |
@bnoordhuis @tjfontaine It seems I was able to build and run the mac-tick-processor on v0.12. Is d8 known to build and run on the v0.10 branch? |
Also, just for your information, I backported the change from v0.12 to v0.10 here: misterdjules@7d78abd. |
Does
I've had that happen, too. Does the |
The numbers do indeed look encouraging. Is the difference stable over multiple runs? I/O-bound tests sometimes have a tendency to display high variance. Can you open a pull request with the changes? I would suggest targeting v0.10 first. One comment about the test: I noticed it uses a 1s timeout. That's on the long side for a test in test/simple. If it's not possible to shorten the timeout, then test/pummel is arguably a better place for it.
@bnoordhuis The difference seems to be stable over multiple runs. I will shortly post more results illustrating that. Before that I'm trying to see if the increase in complexity in I will open a pull request with these changes targeted at v0.10. The 1s timeout in the test is arbitrary, we can make it as low as possible. I'll make that change before creating the PR. Thank you for the review! |
@bnoordhuis I still get a segfault when running the x64.debug build of d8 on Linux. I managed to work around it by invoking d8 myself, bypassing the linux-tick-processor shell script. Thank you for your support Ben! |
Before this change, _unrefActive would keep the unrefList sorted when adding a new timer. Because _unrefActive is called extremely frequently, this linear scan (O(n) at worst) would make _unrefActive show up high in the list of contributors when profiling CPU usage.

This commit changes _unrefActive so that it doesn't try to keep the unrefList sorted. Insertion thus happens in constant time. However, when a timer expires, unrefTimeout has to go through the whole unrefList because it's not ordered anymore. The unrefList is usually not large enough for this to have a significant impact on performance because:

- Most of the time, the timers will be removed before unrefTimeout is called, because their users (mainly sockets) cancel them when an I/O operation takes place.
- If they're not, it means that some I/O took a long time to happen, and the initiator of subsequent I/O operations that would add more timers has to wait for them to complete.

With this change, _unrefActive does not show up as a significant contributor in CPU profiling reports anymore.

Fixes nodejs#8160.
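For readers who want a feel for the shape of that change, here is a minimal, self-contained sketch of the unordered-list idea. It is not the actual lib/timers.js code (the real implementation uses Node's internal linked-list helpers and a single backing timer); the plain-object list, the _linked flag and the armUnrefTimer() helper are illustrative assumptions.

```js
// Sketch only: constant-time insert, full O(n) scan when the timer fires.
var unrefList = { head: null };   // unordered singly-linked list of timer items
var unrefTimer = null;            // single backing timer (illustrative)

function armUnrefTimer(ms) {
  unrefTimer = setTimeout(unrefTimeout, ms);
  unrefTimer.unref();             // don't keep the event loop alive for this
}

// Called on every I/O event. O(1): no attempt to keep the list sorted.
function _unrefActive(item) {
  item._idleStart = Date.now();
  if (!item._linked) {
    item._next = unrefList.head;
    unrefList.head = item;
    item._linked = true;
  }
  if (unrefTimer === null) armUnrefTimer(item._idleTimeout);
}

// Runs only when the backing timer fires (the rare path). The list is
// unordered, so it has to be walked in full: fire whatever has expired and
// re-arm the timer for the soonest remaining expiry.
function unrefTimeout() {
  unrefTimer = null;
  var now = Date.now();
  var soonest = Infinity;
  var prev = null;
  var item = unrefList.head;
  while (item) {
    var next = item._next;
    var expiresAt = item._idleStart + item._idleTimeout;
    if (expiresAt <= now) {
      if (prev) prev._next = next; else unrefList.head = next;  // unlink
      item._linked = false;
      item._onTimeout();
    } else {
      if (expiresAt < soonest) soonest = expiresAt;
      prev = item;
    }
    item = next;
  }
  if (soonest !== Infinity) armUnrefTimer(soonest - now);
}
```

The design trade-off is exactly the one described in the commit message: every behaving connection pays O(1) per I/O event, and the O(n) walk only happens on the rare occasions when the unref timer actually fires.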
@bnoordhuis I just created the PR for v0.10. |
@bnoordhuis @tjfontaine I posted the results of some benchmarks in a gist. On Linux, I used the script that can be found in the Gist mentioned earlier; it uses V8's profiler and wrk to generate HTTP requests. The results also show that the change has a positive impact on performance. The improvements shown by the benchmark using V8's profiler seem to be more drastic than what we see with the other benchmark. Last but not least, I haven't found a way to highlight the potential negative performance impact of the linear search that now happens in unrefTimeout when timers expire.
I've added another Gist that compares the current implementation, the unordered list implementation and the heap implementation, all based on v0.12. The full profiling output of one run of the heap implementation is also available here. I had written the heap implementation before realizing that we wanted to benchmark on v0.10, and I don't want to spend time backporting it to v0.10 before we determine that we want to investigate it further. The results show that the heap implementation is better than the original one, but performs worse than the unordered list. I've used @tjfontaine's binary heap module to avoid reinventing the wheel right from the start and to see if the improvements in algorithmic complexity would yield interesting results. I chose a heap implementation that uses a binary tree rather than an array to avoid the potential negative performance impact of growing (and maybe even shrinking) the array each time the storage limit is reached when adding a timer. The code for the heap implementation is in a separate branch based on v0.12. There are a few hacks in there to make it work. I'd like to gather your comments before going further. Currently, my opinion is that:
I'm looking forward to reading your thoughts! |
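For reference, here is an illustrative array-backed min-heap keyed by expiry time. This is not @tjfontaine's module (which is pointer-based precisely to avoid growing and shrinking an array); it is only a sketch of where the logarithmic insert and remove-min costs come from, with an assumed expiresAt property on each timer item.

```js
// Minimal min-heap ordered by item.expiresAt (sketch, not production code).
function TimerHeap() { this.items = []; }

TimerHeap.prototype.peek = function() { return this.items[0]; };

TimerHeap.prototype.insert = function(item) {    // O(log n): sift up
  var a = this.items;
  a.push(item);
  for (var i = a.length - 1; i > 0;) {
    var parent = (i - 1) >> 1;
    if (a[parent].expiresAt <= a[i].expiresAt) break;
    var tmp = a[parent]; a[parent] = a[i]; a[i] = tmp;
    i = parent;
  }
};

TimerHeap.prototype.shift = function() {         // remove min: O(log n)
  var a = this.items;
  var min = a[0];
  var last = a.pop();
  if (a.length > 0) {
    a[0] = last;
    for (var i = 0;;) {                          // sift down
      var l = 2 * i + 1, r = l + 1, smallest = i;
      if (l < a.length && a[l].expiresAt < a[smallest].expiresAt) smallest = l;
      if (r < a.length && a[r].expiresAt < a[smallest].expiresAt) smallest = r;
      if (smallest === i) break;
      var tmp = a[smallest]; a[smallest] = a[i]; a[i] = tmp;
      i = smallest;
    }
  }
  return min;
};
```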
I actually found a way to highlight the unordered list's worst case. The benchmark is here. I wasn't able to highlight it before because I wasn't adding enough timers: 10K is not enough, but 100K shows interesting results. I will of course try to analyze where the inflection point is. The benchmark is very simple: it just adds 100K timers that each expire 1ms after the previous one. The goal is to have a very large number of timers still in the list when unrefTimeout runs. So, as expected, the cost moves to unrefTimeout's linear scan in this scenario. Now that I can highlight this worst case and show that the heap implementation performs much better in this case, I'm going to backport it to v0.10, re-run the benchmarks and post the results. I will also probably stop spamming this issue's comments, and instead create a Wiki page that I'll update as I get new interesting results.
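For what it's worth, here is a rough sketch of that kind of micro-benchmark. It uses the timers.enroll() and timers._unrefActive() internals as they exist in v0.10 and is only an approximation of the script in the gist, whose details may differ.

```js
// Hypothetical reconstruction of the worst-case micro-benchmark: enqueue a
// large number of unref'd timers whose expiries are spread 1 ms apart, so
// that many items are still pending whenever one of them fires.
var timers = require('timers');

var NTIMERS = 100000;
var fired = 0;
var start = Date.now();

for (var i = 0; i < NTIMERS; i++) {
  var item = {
    _onTimeout: function() {
      if (++fired === NTIMERS)
        console.log('all timers fired in', Date.now() - start, 'ms');
    }
  };
  timers.enroll(item, i + 1);   // each timer expires ~1 ms after the previous
  timers._unrefActive(item);    // put it on the unrefList
}

// Keep the process alive long enough for all the unref'd timers to fire.
setTimeout(function() {}, NTIMERS + 1000);
```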
I forgot to mention in my previous comment that the micro-benchmark I used to highlight this worst case is very synthetic; ideally we'd reproduce the same pattern in a macro benchmark. This is not necessarily easy, since it would require lots of timeouts to happen very frequently and at regular time intervals. Feel free to share your thoughts on how to reproduce this pattern in a macro benchmark, as I've never done that in the past.
I created a Wiki page that summarizes my investigations. To paraphrase my conclusion, I have two questions:
Re-assigning to myself. |
@trevnorris Please let me know if you have any question regarding this issue. The heap implementation is done, but we need to determine if we want to use @tjfontaine's heap module as our internal heap module, or if we want to use/create another one. |
@trevnorris After a discussion with @tjfontaine, we thought that we could land the unordered list implementation (constant time insert, O(n) removal when timeouts fire) in 0.10.34 and 0.12. Then we can work on a better heap internal module and land the heap implementation (O(n) insert and O(log2 n) removal when timeouts fire) for a future release. What do you think? |
@misterdjules Sounds like a reasonable investigation. When you have some code to demonstrate what you're saying ping me. |
@trevnorris The code for the unordered list implementation (based off of v0.10) is available in this branch: https://github.com/misterdjules/node/tree/fix-issue-8160-0-10. There's only one commit to look at: 2714b69. Some explanations on how I compared the unordered list implementation with the original one are available, along with the results, here: #8160 (comment). If you have trouble interpreting the results or running the benchmark script, please let me know. Someone on the Node.js Google group also tested the unordered list implementation, and the results seem to confirm my investigations. More details here: https://groups.google.com/forum/#!topic/nodejs/Uc-0BOCicyU/discussion. Thank you!
@misterdjules It seems as if it should be possible to keep a separate list just for sockets and such, and one for timers. Each could be optimized for their use case. Thoughts? |
The only real users of _unrefActive() are sockets. What we can solve for is the normal/predictable case, that is to say connections are behaving normally and data is flowing through the stream. In this case we are constantly updating the timeout and then ordering the list on insert. By shifting the linear scan to when the timeout fires, we're only incurring the pain when a connection has already hit the bad path, or rather the unintended path, where it is no longer sending data on a regular interval. This means that instead of incurring the on-CPU time for all well-behaved connections, we only pay the CPU time for connections that are probably going to be reaped.
@tjfontaine Sounds like a great solution. Estimated difficulty level? |
@trevnorris My understanding is that what @tjfontaine described in his latest comment is what's implemented in 2714b69. |
The current workaround landed in 934bfe2.
This probably affects master as well but I'm reporting this for v0.10. With any HTTP-heavy benchmark, exports._unrefActive() from lib/timers.js shows up high on the list of cost centers. Case in point:

NB: That's from an application that puts together a multi-kilobyte SOAP response. That's an expensive operation, but _unrefActive() still manages to dominate the list of most expensive functions.

There are two issues here, I think:

I'd like to start a discussion on how to best fix that.

For 1, stripping _unrefActive() of unnecessary cruft seems like a good first step, maybe followed by switching to a timer wheel.

For 2, I'm less sure; the easiest solution I can think of (that is still provably correct and performs well) is to replace the timers._unrefActive(this) calls with something like this.nevents += 1. Then, when the timer expires, check this.nevents > 0. That should greatly reduce the frequency of the (linear) scan over the timer list.
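To make that idea concrete, here is a hypothetical sketch of the counter-based approach. The touch() and armIdleTimer() helpers and the nevents property are illustrative names taken from or invented for this discussion, not actual lib/net.js or lib/timers.js code.

```js
// Sketch: instead of calling timers._unrefActive(socket) on every read/write,
// just bump a counter, and only consult it when the idle timer actually fires.
function touch(socket) {
  socket.nevents += 1;            // O(1), no timer-list traversal at all
}

function armIdleTimer(socket) {
  socket.nevents = 0;
  var t = setTimeout(function() { onIdleTimerExpired(socket); },
                     socket._idleTimeout);
  t.unref();                      // don't keep the event loop alive for this
}

function onIdleTimerExpired(socket) {
  if (socket.nevents > 0) {
    // There was activity since the timer was armed, so the socket is not
    // actually idle: re-arm it for another full period.
    armIdleTimer(socket);
  } else {
    // No events during the whole period: the timeout is genuine.
    socket._onTimeout();
  }
}
```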
Thoughts, @indutny and @tjfontaine and anyone else who wants to chime in?