
Long-running application memory usage testing #2423

Closed
rvagg opened this issue Aug 18, 2015 · 39 comments

Labels
memory: Issues and PRs related to the memory management or memory footprint.

Comments

@rvagg (Member) commented Aug 18, 2015

I don't want to be presumptuous and drop a new repo into this org, so I'm starting it here: https://github.com/rvagg/node-memtest, but I will move it if it's considered helpful by others (perhaps there's a good critique that will demolish my approach).

I've been using it to test long-running memory usage across versions to generate graphs like this:

[screenshot: memtest graph of memory usage across versions]

It can either put data into InfluxDB for plotting by Grafana or it can spit out memory usage to stdout so you can watch it stream by.
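For illustration, the stdout mode amounts to something like this (a hypothetical sampler, not node-memtest's actual code):

var timer = setInterval(function () {
  var mem = process.memoryUsage();
  // one tab-separated sample per second: timestamp, rss, heapTotal, heapUsed
  console.log([Date.now(), mem.rss, mem.heapTotal, mem.heapUsed].join('\t'));
}, 1000);
timer.unref(); // don't let the sampler keep the process alive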

There are only two "tests" in there so far, just starters; I imagine the list of tests expanding over time as we come up with use-cases from the wild that demonstrate interesting behaviour we should monitor.

The list of Node & io.js versions is also dynamic; I've been including all of the "stable" ones, and in the above screenshot I also have a nightly of master in the mix to test whether the 3.0.0 leak is still there. It could be evolved to automatically pick up new nightlies, continually testing and producing data to give us confidence in our releases.

I've set up anonymous access to an instance I have running here if you want to have a poke at it: http://128.199.216.28:3000/dashboard/db/memory-tests

Is anyone interested in helping out with this?

ChALkeR added the memory label Aug 18, 2015
@kzc commented Aug 18, 2015

Tracking process vsize (virtual size) would also be useful. rss can potentially miss some leaks if the leaked memory is not being actively referenced (swapped out and not resident in memory).
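process.memoryUsage() doesn't report vsize, but on Linux it can be read from /proc; a sketch (assumes the standard /proc/self/statm layout and 4 KB pages):

var fs = require('fs');

function vsize() {
  // the first field of /proc/self/statm is total program size, in pages
  var pages = parseInt(fs.readFileSync('/proc/self/statm', 'utf8').split(' ')[0], 10);
  return pages * 4096; // assumes 4 KB pages
}

console.log('vsize:', vsize(), 'rss:', process.memoryUsage().rss);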

@EricMCornelius

👍 much needed validation, thanks for tackling this!

@ChALkeR (Member) commented Aug 24, 2015

I haven't looked at the details, but 2.5.0 seems more stable than 3.1/3.1.1, especially in the http-spew test.

@rvagg (Member, Author) commented Aug 24, 2015

@ChALkeR yeah, heavy Buffer use puts us at the mercy of how V8 treats ArrayBuffers, unfortunately; at least it doesn't grow, though. @trevnorris has talked about doing some deeper investigation into the nature of GC here and whether it's entirely on the ArrayBuffer side or whether there is something we can tweak on our end.

@ChALkeR (Member) commented Aug 24, 2015

From the first look it could be related to #1671, but I'm not sure yet.

@trevnorris (Contributor)

Here's the flame graph for creating Buffers in v2.5.0: https://cldup.com/VDEJ25y_oL.svg

Here it is for v3.1.0: https://cldup.com/90ZGBV_cfP.svg

The two things that stand out to me are std::map and the time spent in some sort of malloc. @bnoordhuis also noticed that it generates a large number of page faults. Not sure what to make of this, but will attempt to investigate. Either way this requires a V8 fix.

@kzc commented Aug 24, 2015

@trevnorris Memory and CPU are like apples and oranges. You can often get lower memory use in these synthetic IO benchmarks by using a smaller default Buffer size, but usually at the expense of CPU in increased garbage collection bookkeeping.

Which variable are you trying to minimize?

@trevnorris (Contributor)

@kzc Choosing 64KB buffers was not random. The most common case of I/O in node is probably over TCP, which lends itself to testing at the 64KB packet limit. That's why I chose that value for the flame graph.

Though at this point I'm much less concerned about memory usage than about the fact that Buffer instantiation time is over 2x slower than it was in v0.12, and not just in a small range of sizes but at all tested sizes (which stopped at 256KB).

@kzc commented Aug 25, 2015

A tight for loop around 1e7 unused new SlowBuffer()s doesn't measure Buffer instantiation time so much as it measures GC time, as seen in the stats from the flame graphs above:

v2.5.0:
    node::smalloc::Alloc                  - 78.32% of CPU
    v8::internal::Heap::CollectAllGarbage - 69.75% of CPU

v3.1.0:
    node::Buffer::Create                  - 58.24% of CPU
    v8::ArrayBuffer::New                  - 47.86% of CPU
    v8::internal::Heap::CollectAllGarbage - 36.00% of CPU
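For reference, the loop in question is essentially this (a hypothetical reconstruction, not the exact script behind the flame graphs):

var SlowBuffer = require('buffer').SlowBuffer;
var ITER = 1e7, SIZE = 65536;
var t = process.hrtime();
for (var i = 0; i < ITER; i++) {
  new SlowBuffer(SIZE); // allocated and immediately dropped, so GC dominates the profile
}
t = process.hrtime(t);
console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');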

You have to take the results of synthetic benchmarks with a grain of salt, as they don't approximate a real-life scenario, which allocates buffer memory on an as-needed basis and puts much less pressure on GC.

One would also have to measure IO throughput in addition to CPU to get a fair comparison of relative node version performance. Ultimately IO throughput is the most important metric, or perhaps throughput per unit of energy, depending on the situation.

@ChALkeR (Member) commented Aug 25, 2015

You have to take the results of synthetic benchmarks with a grain of salt, as they don't approximate a real-life scenario, which allocates buffer memory on an as-needed basis and puts much less pressure on GC.

I'm not quite sure what you mean here. Buffers do not trigger scavenges, and it actually takes a mark-sweep to clean them. The problem that I know about is that Buffer usage triggers many mark-sweeps, which is slow and not memory-efficient.

One would also have to measure IO throughput in addition to measuring CPU to get a fair comparison of relative node version performance. Ultimately IO throughput is the most important metric, or perhaps throughput per unit of energy depending on the situation.

Do you have a benchmark to share?

@trevnorris (Contributor)

Testing v2.5.0 against v3.2.0, I saw minor page faults jump from 27,300 to 5,955,571. So at a quick glance it would seem that something bad for performance is occurring.

@trevnorris (Contributor)

From the benchmark/ dir, here are a few results (throughput, higher is better):

                                            v3.2.0   v0.12.7

net/net-s2c.js len=16777216 type=buf dur=5: 16.1     21.5
net/net-c2s.js len=16777216 type=buf dur=5: 16.7     19.6

These two show the issue at its worst because throughput is most dependent on being able to allocate new buffers, write them, and clean them up once the write is complete. We have companies like F5 attempting to do millions of transactions, and when a single buffer allocation takes 0.5µs longer (whether from instantiation or necessary GC) it gravely affects their application.

@kzc commented Aug 25, 2015

I'm not quite sure what you mean here. Buffers do not trigger scavenges, and it actually takes a mark-sweep to clean them.

v8::internal::Heap::CollectAllGarbage is indirectly called by node::Buffer::Create.

The act of creating a SlowBuffer invokes GC.

Do you have a benchmark to share?

The ones provided in this ticket could be used. They just have to close the server and exit after a fixed number of transactions.

@ChALkeR (Member) commented Aug 25, 2015

@kzc, I still fail to see why it needs a v8::internal::Heap::CollectAllGarbage call, and how that is good for performance and/or memory.

@kzc commented Aug 25, 2015

I still fail to see why it needs a v8::internal::Heap::CollectAllGarbage call, and how that is good for performance and/or memory.

Because 1e7 new SlowBuffer()s are created in a tight loop in JS, it has no choice but to reclaim memory from a GC within the v8 allocator. It's not a matter of good or bad - just necessary.

@ChALkeR (Member) commented Aug 25, 2015

@kzc Scavenges are enough in all the supplied testcases, I believe. My question is why it is running full mark-sweeps.

@trevnorris (Contributor)

They just have to close the server and exit after a fixed number of transactions.

Wait, what? I've got to be misunderstanding what you're saying. node servers are doing high throughput for days/weeks/months at a time. So, what did you mean?

Also, to make sure it's clear: Buffers are created with the kInternalized flag, so V8 has full control of the memory.

@trevnorris (Contributor)

The act of creating a SlowBuffer invokes GC.

Why?

@trevnorris (Contributor)

reclaim memory from a GC within the v8 allocator

We're not using ArrayBuffer::Allocator, if that's what you're referring to. We malloc() manually, then hand it off using ArrayBuffer::New() with kInternalized, so that V8 can take full control of the memory's lifetime.

@kzc commented Aug 25, 2015

@trevnorris - I was referring to constructing a benchmark that exits after a certain number of HTTP requests. If you time it, you get an idea of how long node takes to process N HTTP requests/responses as fast as a client can handle them. Nothing fancy.
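Something like this, say (a hypothetical sketch of that kind of benchmark; the request count and payload size are arbitrary):

var http = require('http');
var N = 10000, done = 0;

var server = http.createServer(function (req, res) {
  res.end(new Buffer(65536)); // fixed 64KB response payload
});

server.listen(0, function () {
  var port = server.address().port;
  var t = process.hrtime();
  (function next() {
    http.get({ port: port, agent: false }, function (res) {
      res.resume(); // drain the response body
      res.on('end', function () {
        if (++done === N) {
          t = process.hrtime(t);
          console.log(N + ' requests in ' + (t[0] + t[1] / 1e9).toFixed(2) + 's');
          server.close();
        } else {
          next();
        }
      });
    });
  })();
});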

@trevnorris (Contributor)

Ah okay. Thanks for clarifying. :)

@kzc commented Aug 25, 2015

Why does creating a SlowBuffer invoke GC? It's clearly visible in both your flame graphs. It has to free up memory somehow in the tight loop; there are no pauses in the event loop to collect garbage otherwise.

@ChALkeR (Member) commented Aug 25, 2015

@kzc Why does it invoke a mark-sweep?

@kzc commented Aug 25, 2015

We're not using ArrayBuffer::Allocator, if that's what you're referring to. We malloc() manually, then hand it off using ArrayBuffer::New() with kInternalized.

malloc, free, v8 memory, GCing - it's all quite recursively intertwined. It's hard to measure one in isolation because each affects all the others.

Might it be possible to reduce GC by recycling the IO SlowBuffers with their already malloc()ed memory?
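As an illustration of the recycling idea, a free-list at the JS level would look roughly like this (hypothetical, and it assumes all callers use one fixed size; trevnorris explains below why node core can't do this transparently for I/O buffers):

var SlowBuffer = require('buffer').SlowBuffer;
var pool = [];

function acquire(size) {
  // reuse a previously released buffer instead of allocating a fresh one
  return pool.length > 0 ? pool.pop() : new SlowBuffer(size);
}

function release(buf) {
  if (pool.length < 64) pool.push(buf); // cap the pool so it can't grow unbounded
}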

@kzc commented Aug 25, 2015

Why does it invoke a mark-sweep?

Not exactly sure why, but would v8 have any other choice when allocating 64K * 1e7 in a tight JS loop?

@trevnorris (Contributor)

Here's the revised test script:

var SlowBuffer = require('buffer').SlowBuffer;
var ITER = 1e5, SIZE = 65536, cntr = 0;
var t = process.hrtime();

(function runner() {
  if (++cntr > ITER) return print();
  new SlowBuffer(SIZE); // allocate and immediately drop one buffer
  setImmediate(runner); // yield a full event-loop turn before the next allocation
}());

function print() {
  t = process.hrtime(t);
  console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
}

This allows a full turn of the event loop between each Buffer creation. Even so, allocating 64KB buffers runs at 3060 ns/op and 5300 ns/op in the two versions, respectively.

@ChALkeR (Member) commented Aug 25, 2015

@trevnorris What I'm talking about when I mention scavenges vs mark-sweeps is that if you insert an if (cntr % 177 === 0) gc(true); into your testcase, it starts running 15-30% faster.

If it doesn't, you might have to decrease 177 to something lower; the point is to run a scavenge before v8 decides to run a mark-sweep.
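Concretely, in the runner from the revised script above (run with node --expose-gc so gc() is available; 177 is just a tuning knob):

(function runner() {
  if (++cntr > ITER) return print();
  if (cntr % 177 === 0) gc(true); // scavenge before v8 opts for a full mark-sweep
  new SlowBuffer(SIZE);
  setImmediate(runner);
}());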

@ChALkeR (Member) commented Aug 25, 2015

@kzc Try inserting gc(true) as I mentioned above.

@ChALkeR (Member) commented Aug 25, 2015

20x «concurrency» and a warmup:

'use strict';

var limit = 1000000;
var concurrent = 20;
var scavenges = true;

var warmup = limit / 10;
var start, count = 0;
limit += warmup;

function call() {
    var buffer = new Buffer(64 * 1024); // the allocation under test
    if (scavenges && (count % 170 === 0)) {
        // gc(true) is a scavenge. Not sure if this is documented.
        // Requires running with --expose-gc.
        gc(true);
    }
    count++;
    if (count > limit) {
        console.log(process.hrtime(start));
        process.exit(0);
    } else if (count === warmup) {
        start = process.hrtime();
    }
    setImmediate(call);
}

for (var i = 0; i < concurrent; i++) {
    call();
}

This gets up to a 2x speedup with gc(true) enabled (scavenges set to true).

@trevnorris Yes, with SlowBuffer too. In fact, with SlowBuffer the difference is even more visible (2.5x).

@trevnorris (Contributor)

@ChALkeR The reason SlowBuffer is the more important metric is that it's how data packets coming in from I/O are handled. The two use exactly the same code path, except SlowBuffer must make a jump from JS to C++, which only adds 30-50ns max.

@kzc commented Aug 25, 2015

Even so, allocating 64KB buffers runs at 3060 ns/op and 5300 ns/op in the two versions, respectively.

I see. Fair enough. Might it be possible to reduce GC by recycling the old IO SlowBuffers in their entirety with their already malloc()ed memory?

@ChALkeR - 2x speedup with gc(true) - very cool.

@ChALkeR (Member) commented Aug 25, 2015

Btw, memory usage also drops about 2.5x with gc(true) (using SlowBuffers; test-quick-1 below is the gc(true) variant):

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 825429614 ]
{ rss: 64040960, heapTotal: 10307328, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 724898065 ]
{ rss: 60608512, heapTotal: 9275392, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-0.js 
[ 5, 711345311 ]
{ rss: 65585152, heapTotal: 9275392, heapUsed: 4564176 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 547167883 ]
{ rss: 24649728, heapTotal: 21658624, heapUsed: 3784272 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 562118909 ]
{ rss: 24330240, heapTotal: 21658624, heapUsed: 3784128 }

[chalker@pc arraybuffer]$ node --expose-gc test-quick-1.js 
[ 2, 649892998 ]
{ rss: 24715264, heapTotal: 21658624, heapUsed: 3775640 }

Measured at the end, but it should be fine, as RSS doesn't generally go down.

@trevnorris (Contributor)

@kzc

Might it be possible to reduce GC by recycling the old IO SlowBuffers in their entirety with their already malloc()ed memory?

No. The objects are not persistent weak handles, so we are not notified when they are cleaned up. Instead we hand off the memory to the ArrayBuffer using kInternalized so V8 can clean up the object at its leisure.

@ChALkeR (Member) commented Aug 25, 2015

My point is that the default GC strategy is highly inefficient, at least in these testcases: it spends at least half of the total testcase time doing things that aren't really needed here, and it can't even keep memory down as low as it could if it were using scavenges.

I am not yet sure how that affects real-world apps, it needs more testing.

The main issue for that is #1671.

@kzc commented Aug 26, 2015

@ChALkeR - your modification of @trevnorris' benchmark is fascinating. You're really on to something here.

1,100,001 new Buffer(64 * 1024) calls in all cases; timings with and without gc(true), and with concurrent set to different values:

concurrent = 20, gc(true) called:
  2.25user 0.01system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 29,892maxresident)k
  0inputs+0outputs (0major+231,176minor)pagefaults 0swaps

concurrent = 20, gc(true) NOT called:
  4.29user 0.29system 0:04.46elapsed 102%CPU (0avgtext+0avgdata 51,508maxresident)k
  0inputs+0outputs (0major+579,680minor)pagefaults 0swaps

concurrent = 1, gc(true) called:
  8.87user 0.02system 0:08.91elapsed 99%CPU (0avgtext+0avgdata 19,608maxresident)k
  0inputs+0outputs (0major+1,090,812minor)pagefaults 0swaps

concurrent = 1, gc(true) NOT called:
  9.83user 0.20system 0:09.91elapsed 101%CPU (0avgtext+0avgdata 83,232maxresident)k
  0inputs+0outputs (0major+646,562minor)pagefaults 0swaps

The gc(true) effect of lowering elapsed time, memory, and CPU I can understand, but the concurrent effect I never expected - there is no concurrency in this single-threaded node process, after all. Even so, it's dramatically faster with 20 async callbacks on the go at once. Are the new Buffer(64 * 1024)s going into different v8 memory zones or something? The concurrent = 20 setting better resembles a real world scenario with 20 socket connections.

@kzc commented Aug 26, 2015

The concurrent = 20 setting better resembles a real world scenario with 20 socket connections.

Let me take that statement back - new Buffer(65536) is not used, after all. An HTTP benchmark with 20 concurrent connections would better resemble a real world scenario.

@cjihrig (Contributor) commented Nov 10, 2015

@rvagg what is the status here? Do you still actively use this? Still potentially interested in bringing it into the org?

@rvagg (Member, Author) commented Nov 14, 2015

@cjihrig probably not; I don't want to burden the org with activity that there is little interest in maintaining. I'll keep this as a private project for now unless others want to join the fun.

rvagg closed this as completed Nov 14, 2015
@cjihrig (Contributor) commented Nov 14, 2015

OK, sounds good. For the record, if you change your mind, I'd be +1 to having it in the org. Seems like useful info to have.
