http/net: _unrefActive() is extremely expensive #8160
The linear scan of the timer list was my least favorite part of the implementation. This is definitely something I want to get fixed in 0.10 and 0.12, perhaps with something like a timer wheel.
@bnoordhuis @tjfontaine I'd like to give it a shot. I agree with your suggestions about how to approach the problem. Here's my plan:
Please let me know what you think. |
@misterdjules Sounds like a good approach to me. A quick note on the What should work, however, is making If true, it expired but if false, you reschedule it in
As a final note, I use |
I'm not entirely sure a timer wheel is really necessary; I think for short-term implementations you can actually just use a heap.
@tjfontaine Alright, then for 2), instead of implementing a timer wheel I'll use a heap. |
[Off topic] @bnoordhuis Can you tell how you came up with that list of cost centers? |
@chirag04 Certainly. That's the output of the V8 tick processor. The tick processor depends on the d8 binary, so you need to build V8 first. If you're using master, take note of #8088 (which was fixed only recently.) HTH.
Awesome! Thanks @bnoordhuis |
@bnoordhuis @tjfontaine I implemented approach 1) in misterdjules@802b4e3. It would be great if you had some time to look at it and let me know if it's in line with what we discussed. I also added a test to the test suite that specifically exercises this change. Below are the results I gathered so far. They seem encouraging, but I have no experience benchmarking changes in Node.js, and I know it's easy to see what you want to see in benchmarks. Any comments and suggestions would be very much appreciated, thank you! Node.js server program used for the benchmarks:
With linear scan in _unrefActive():
I just realized @bnoordhuis benchmark was based on v0.10. I will rebase my changes on it and post benchmarks based on v0.10 so that we can compare apples to apples. |
@bnoordhuis @tjfontaine It seems I was able to build and run the mac-tick-processor on v0.12. Is d8 known to build and run on the v0.10 branch? |
Also, just for your information, I backported the change from v0.12 to v0.10 here: misterdjules@7d78abd. |
Does
I've had that happen, too. Does the |
The numbers do indeed look encouraging. Is the difference stable over multiple runs? I/O-bound tests sometimes have a tendency to display high variance. Can you open a pull request with the changes? I would suggest targeting v0.10 first. One comment about the test: I noticed it uses a 1s timeout. That's on the long side for a test in test/simple. If it's not possible to shorten the timeout, then test/pummel is arguably a better place for it.
@bnoordhuis The difference seems to be stable over multiple runs. I will shortly post more results illustrating that. Before that I'm trying to see if the increase in complexity in I will open a pull request with these changes targeted at v0.10. The 1s timeout in the test is arbitrary, we can make it as low as possible. I'll make that change before creating the PR. Thank you for the review! |
@bnoordhuis I still get a segfault when running the x64.debug build of d8 on Linux. I managed to work around it by invoking d8 myself, bypassing the linux-tick-processor shell script. Thank you for your support Ben! |
Before this change, _unrefActive would keep the unrefList sorted when adding a new timer. Because _unrefActive is called extremely frequently, this linear scan (O(n) at worst) would make _unrefActive show up high in the list of contributors when profiling CPU usage.

This commit changes _unrefActive so that it doesn't try to keep the unrefList sorted. Insertion thus happens in constant time. However, when a timer expires, unrefTimeout has to go through the whole unrefList because it's not ordered anymore. The unrefList is usually not large enough for this to have a significant impact on performance because:

- Most of the time, the timers will be removed before unrefTimeout is called, because their users (mainly sockets) cancel them when an I/O operation takes place.
- If they're not, it means that some I/O took a long time to happen, and the initiator of subsequent I/O operations that would add more timers has to wait for them to complete.

With this change, _unrefActive does not show up as a significant contributor in CPU profiling reports anymore.

Fixes nodejs#8160.
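For readers who want a feel for the shape of that change, here is a minimal, self-contained sketch of the unordered-list idea. It is not the actual lib/timers.js code (the real implementation uses Node's internal linked-list helpers and a single backing timer); the plain-object list, the _linked flag and the armUnrefTimer() helper are illustrative assumptions.

```js
// Sketch only: constant-time insert, full O(n) scan when the timer fires.
var unrefList = { head: null };   // unordered singly-linked list of timer items
var unrefTimer = null;            // single backing timer (illustrative)

function armUnrefTimer(ms) {
  unrefTimer = setTimeout(unrefTimeout, ms);
  unrefTimer.unref();             // don't keep the event loop alive for this
}

// Called on every I/O event. O(1): no attempt to keep the list sorted.
function _unrefActive(item) {
  item._idleStart = Date.now();
  if (!item._linked) {
    item._next = unrefList.head;
    unrefList.head = item;
    item._linked = true;
  }
  if (unrefTimer === null) armUnrefTimer(item._idleTimeout);
}

// Runs only when the backing timer fires (the rare path). The list is
// unordered, so it has to be walked in full: fire whatever has expired and
// re-arm the timer for the soonest remaining expiry.
function unrefTimeout() {
  unrefTimer = null;
  var now = Date.now();
  var soonest = Infinity;
  var prev = null;
  var item = unrefList.head;
  while (item) {
    var next = item._next;
    var expiresAt = item._idleStart + item._idleTimeout;
    if (expiresAt <= now) {
      if (prev) prev._next = next; else unrefList.head = next;  // unlink
      item._linked = false;
      item._onTimeout();
    } else {
      if (expiresAt < soonest) soonest = expiresAt;
      prev = item;
    }
    item = next;
  }
  if (soonest !== Infinity) armUnrefTimer(soonest - now);
}
```

The design trade-off is exactly the one described in the commit message: every behaving connection pays O(1) per I/O event, and the O(n) walk only happens on the rare occasions when the unref timer actually fires.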
@bnoordhuis I just created the PR for v0.10. |
@bnoordhuis @tjfontaine I posted the results of some benchmarks in a gist. On Linux, I used the script that can be found in the Gist mentioned earlier; it uses V8's profiler and wrk to generate HTTP requests. The results also show that the change has a positive impact on performance. The improvements shown by the benchmark using V8's profiler seem to be more drastic than what we see with the other benchmark. Last but not least, I haven't found a way to highlight the potential negative performance impact of the linear search that now happens in unrefTimeout when timers expire.
I've added another Gist that compares the current implementation, the unordered list implementation and the heap implementation, all based on v0.12. The full profiling output of one run of the heap implementation is also available here. I had written the heap implementation before realizing that we wanted to benchmark on v0.10, and I don't want to spend time backporting it to v0.10 before we determine that we want to investigate it further. The results show that the heap implementation is better than the original one, but performs worse than the unordered list. I've used @tjfontaine's binary heap module to avoid reinventing the wheel right from the start and to see if the improvements in algorithmic complexity would yield interesting results. I chose a heap implementation that uses a binary tree rather than an array to avoid the potential negative performance impact of growing (and maybe even shrinking) the array each time the storage limit is reached when adding a timer. The code for the heap implementation is in a separate branch based on v0.12. There are a few hacks in there to make it work. I'd like to gather your comments before going further. Currently, my opinion is that:
I'm looking forward to reading your thoughts! |
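For reference, here is an illustrative array-backed min-heap keyed by expiry time. This is not @tjfontaine's module (which is pointer-based precisely to avoid growing and shrinking an array); it is only a sketch of where the logarithmic insert and remove-min costs come from, with an assumed expiresAt property on each timer item.

```js
// Minimal min-heap ordered by item.expiresAt (sketch, not production code).
function TimerHeap() { this.items = []; }

TimerHeap.prototype.peek = function() { return this.items[0]; };

TimerHeap.prototype.insert = function(item) {    // O(log n): sift up
  var a = this.items;
  a.push(item);
  for (var i = a.length - 1; i > 0;) {
    var parent = (i - 1) >> 1;
    if (a[parent].expiresAt <= a[i].expiresAt) break;
    var tmp = a[parent]; a[parent] = a[i]; a[i] = tmp;
    i = parent;
  }
};

TimerHeap.prototype.shift = function() {         // remove min: O(log n)
  var a = this.items;
  var min = a[0];
  var last = a.pop();
  if (a.length > 0) {
    a[0] = last;
    for (var i = 0;;) {                          // sift down
      var l = 2 * i + 1, r = l + 1, smallest = i;
      if (l < a.length && a[l].expiresAt < a[smallest].expiresAt) smallest = l;
      if (r < a.length && a[r].expiresAt < a[smallest].expiresAt) smallest = r;
      if (smallest === i) break;
      var tmp = a[smallest]; a[smallest] = a[i]; a[i] = tmp;
      i = smallest;
    }
  }
  return min;
};
```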
I actually found a way to highlight the unordered list's worst case. The benchmark is here. I wasn't able to highlight it before because I wasn't adding enough timers: 10K is not enough, but 100K shows interesting results. I will of course try to analyze where the inflection point is. The benchmark is very simple: it just adds 100K timers that each expire 1ms after the previous one. The goal is to have a very large number of timers still in the list when unrefTimeout runs. So, as expected, the cost moves to unrefTimeout's linear scan in this scenario. Now that I can highlight this worst case and show that the heap implementation performs much better in this case, I'm going to backport it to v0.10, re-run the benchmarks and post the results. I will also probably stop spamming this issue's comments, and instead create a Wiki page that I'll update as I get new interesting results.
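For what it's worth, here is a rough sketch of that kind of micro-benchmark. It uses the timers.enroll() and timers._unrefActive() internals as they exist in v0.10 and is only an approximation of the script in the gist, whose details may differ.

```js
// Hypothetical reconstruction of the worst-case micro-benchmark: enqueue a
// large number of unref'd timers whose expiries are spread 1 ms apart, so
// that many items are still pending whenever one of them fires.
var timers = require('timers');

var NTIMERS = 100000;
var fired = 0;
var start = Date.now();

for (var i = 0; i < NTIMERS; i++) {
  var item = {
    _onTimeout: function() {
      if (++fired === NTIMERS)
        console.log('all timers fired in', Date.now() - start, 'ms');
    }
  };
  timers.enroll(item, i + 1);   // each timer expires ~1 ms after the previous
  timers._unrefActive(item);    // put it on the unrefList
}

// Keep the process alive long enough for all the unref'd timers to fire.
setTimeout(function() {}, NTIMERS + 1000);
```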
I forgot to mention in my previous comment that the micro-benchmark I used to highlight this worst case is very synthetic; ideally we'd reproduce the same pattern in a macro benchmark. This is not necessarily easy, since it would require lots of timeouts to happen very frequently and at regular time intervals. Feel free to share your thoughts on how to reproduce this pattern in a macro benchmark, as I've never done that in the past.
I created a Wiki page that summarizes my investigations. To paraphrase my conclusion, I have two questions:
Re-assigning to myself. |
@trevnorris Please let me know if you have any question regarding this issue. The heap implementation is done, but we need to determine if we want to use @tjfontaine's heap module as our internal heap module, or if we want to use/create another one. |
@trevnorris After a discussion with @tjfontaine, we thought that we could land the unordered list implementation (constant time insert, O(n) removal when timeouts fire) in 0.10.34 and 0.12. Then we can work on a better heap internal module and land the heap implementation (O(n) insert and O(log2 n) removal when timeouts fire) for a future release. What do you think? |
@misterdjules Sounds like a reasonable investigation. When you have some code to demonstrate what you're saying ping me. |
@trevnorris The code for the unordered list implementation (based off of v0.10) is available in this branch: https://github.com/misterdjules/node/tree/fix-issue-8160-0-10. There's only one commit to look at: 2714b69. Some explanations on how I compared the unordered list implementation with the original one are available, along with the results, here: #8160 (comment). If you have trouble interpreting the results or running the benchmark script, please let me know. Someone on the Node.js Google group also tested the unordered list implementation, and the results seem to confirm my investigations. More details here: https://groups.google.com/forum/#!topic/nodejs/Uc-0BOCicyU/discussion. Thank you!
@misterdjules It seems as if it should be possible to keep a separate list just for sockets and such, and one for timers. Each could be optimized for their use case. Thoughts? |
The only real users of _unrefActive() are sockets. What we can solve for is the normal/predictable case, that is to say connections are behaving normally and data is flowing through the stream. In this case we are constantly updating the timeout and then ordering the list on insert. By shifting the linear scan to when the timeout fires, we're only incurring the pain when a connection has already hit the bad path, or rather the unintended path, where it is no longer sending data on a regular interval. This means that instead of incurring the on-CPU time for all well-behaved connections, we only pay the CPU time for connections that are probably going to be reaped.
@tjfontaine Sounds like a great solution. Estimated difficulty level? |
@trevnorris My understanding is that what @tjfontaine described in his latest comment is what's implemented in 2714b69. |
The current workaround landed in 934bfe2.
This probably affects master as well but I'm reporting this for v0.10. With any HTTP-heavy benchmark, exports._unrefActive() from lib/timers.js shows up high on the list of cost centers. Case in point:

NB: That's from an application that puts together a multi-kilobyte SOAP response. That's an expensive operation, but _unrefActive() still manages to dominate the list of most expensive functions.

There are two issues here, I think:

I'd like to start a discussion on how to best fix that.

For 1, stripping _unrefActive() of unnecessary cruft seems like a good first step, maybe followed by switching to a timer wheel.

For 2, I'm less sure; the easiest solution I can think of (that is still provably correct and performs well) is to replace the timers._unrefActive(this) calls with something like this.nevents += 1. Then, when the timer expires, check this.nevents > 0. That should greatly reduce the frequency of the (linear) scan over the timer list.
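To make that idea concrete, here is a hypothetical sketch of the counter-based approach. The touch() and armIdleTimer() helpers and the nevents property are illustrative names taken from or invented for this discussion, not actual lib/net.js or lib/timers.js code.

```js
// Sketch: instead of calling timers._unrefActive(socket) on every read/write,
// just bump a counter, and only consult it when the idle timer actually fires.
function touch(socket) {
  socket.nevents += 1;            // O(1), no timer-list traversal at all
}

function armIdleTimer(socket) {
  socket.nevents = 0;
  var t = setTimeout(function() { onIdleTimerExpired(socket); },
                     socket._idleTimeout);
  t.unref();                      // don't keep the event loop alive for this
}

function onIdleTimerExpired(socket) {
  if (socket.nevents > 0) {
    // There was activity since the timer was armed, so the socket is not
    // actually idle: re-arm it for another full period.
    armIdleTimer(socket);
  } else {
    // No events during the whole period: the timeout is genuine.
    socket._onTimeout();
  }
}
```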
Thoughts, @indutny and @tjfontaine and anyone else who wants to chime in?