
RFC: seeking feedback on async_hooks performance in production #14794

Closed
refack opened this issue Aug 12, 2017 · 20 comments
Labels
async_hooks Issues and PRs related to the async hooks subsystem.

Comments

@refack
Contributor

refack commented Aug 12, 2017

  • Version: v8.x - master
  • Platform: all
  • Subsystem: async_hooks

[email protected] saw the introduction of async_hooks the latest and greatest incarnation of core's tracing API (#11883 - based on https://github.com/nodejs/node-eps/blob/master/006-asynchooks-api.md)
The @nodejs/async_hooks teams would love to hear feedback, especially regarding performance in production apps (CPU/memory usage), but also synthetics benchmarking results.
Any other feedback will also be welcome, as well as issues and PRs.

Thanks.

@refack refack added async_hooks Issues and PRs related to the async hooks subsystem. meta Issues and PRs related to the general management of the project. labels Aug 12, 2017
@addaleax addaleax removed the meta Issues and PRs related to the general management of the project. label Aug 12, 2017
@AndreasMadsen
Member

@nodejs/diagnostics

@jkrems
Contributor

jkrems commented Aug 13, 2017

I'm guessing the instructions for a non-synthetic benchmark would be:

  • Run an actual app on node 8 in prod.
  • Turn on async_hooks (require('async_hooks').enable() should be enough?).
  • Compare memory/CPU stats.

If so, I might get some data in the coming weeks (wanted to give 8.3 a spin anyhow in one of our services).

@AndreasMadsen
Member

@jkrems You actually need to add empty hooks functions, otherwise nothing changes.

@refack
Contributor Author

refack commented Aug 13, 2017

@jkrems That would be great!

@jkrems You actually need to add empty hooks functions, otherwise nothing changes.

It will be a fair exercise to suss out memory leaks.

@refack
Contributor Author

refack commented Aug 13, 2017

Here is the example trace snippet from nodejs.org/api/async_hooks.html (slightly simplified):

const async_hooks = require('async_hooks');
const fs = require('fs');

let indent = '';
const traceFile = 1; // fd 1 is stdout
async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    const eid = async_hooks.executionAsyncId();
    fs.writeSync(
      traceFile,
      `${indent}${type}(${asyncId}): trigger: ${triggerAsyncId} execution: ${eid}\n`);
  },
  before(asyncId) {
    fs.writeSync(traceFile, `${indent}before:  ${asyncId}\n`);
    indent += '  ';
  },
  after(asyncId) {
    indent = indent.slice(2);
    fs.writeSync(traceFile, `${indent}after:   ${asyncId}\n`);
  },
  destroy(asyncId) { fs.writeSync(traceFile, `${indent}destroy: ${asyncId}\n`); },
}).enable();

@tutman96

Been running it in production on several high-traffic components since 4.2.1 and have had no major memory leaks. No performance degradation compared to the old async-listeners implementation.

One thing we did notice was that the asynchronous hooks generated by promises didn't get destroyed until the promise was garbage collected. This caused a bit of runaway memory under very high (synthetic) traffic, but we haven't noticed any problems with real-world traffic.

@elierotenberg

Hello,

I believe https://github.com/angular/zone.js/issues/889 might interest you!

@vdeturckheim
Member

I'm under the impression Google started using it in production in their APM. googleapis/cloud-trace-nodejs@0e15b6c

@matthewloring

We are currently experimenting with async hooks, but it is still behind a flag while we evaluate it. We found it to be slightly slower than the continuation-local-storage based approach, but we still think it is the right way forward, especially once more modules start using the embedder API to record their own async events, removing the need for some of our monkeypatching. We can update this thread once we gather more info on how async hooks is working for us.

@vdeturckheim
Member

Awesome! Thanks @matthewloring. When running with CLS, is the monkeypatching of all async methods still applied? There might be a performance boost there.

@matthewloring

The monkeypatching done by our agent is the same under both approaches (since other modules have not yet started using the async hooks embedder API). Async hooks tracks more lifecycle events per asynchronous task than CLS does, especially for promises, and many of them are redundant for the context tracking done by our agent. I suspect this is part of the slowdown, but we need to do more performance analysis to verify.

@elierotenberg

elierotenberg commented Sep 27, 2017

Is it even possible to implement a robust cls/zone (or a lower-level abstraction that zone/cls could be built upon) as a native Node module?

I'm under the impression that some features are very hard or impossible to implement robustly using only JS (even using monkey-patching), mostly due to hacks in third-party libraries that also attempt to do low-level async stuff and implement some sort of userland scheduling, e.g. bluebird.

What are your use cases for this family of features?
I'm most interested in writing async servers in multi-process Node. I want to build a common generator-based userland coroutines/procs library that abstracts away the transport layer between coroutines and allows dynamic scheduling/monitoring/hot updates.
In this model a proc is a pausable/resumable state machine capable of sending/receiving messages to/from other procs via an inbox queue (heavily inspired by Erlang).

Example[1] :

function* counterServer({ counter }) {
  yield effect("saveCounter", counter);
  try {
    const message = yield receive(10000); // pause until a message is received or a timeout elapses
    if (message.type === "add") {
      // "loop" by recursively yielding itself (or another generator function)
      return yield* counterServer({ counter: counter + 1 });
    }
    if (message.type === "query") {
      // send a message back to the sender
      yield send(message.source, { type: "counter", counter });
      return yield* counterServer({ counter });
    }
  }
  catch (error) {
    if (error instanceof ReceiveTimeout) {
      // it's okay to time out, we can just keep receiving
      return yield* counterServer({ counter });
    }
    // other errors crash the process so that its monitor can restart it in a safe state
    console.error("Error thrown in counterServer");
    throw error;
  }
}

const loop = async (initialCounter) => {
  let savedCounter = initialCounter;
  try {
    await spawn(counterServer, { counter: savedCounter }, {
      // algebraic effect handler: produce a "global" side effect (bubbled and resumable)
      saveCounter: (currentCounter) => savedCounter = currentCounter,
    });
  }
  catch (error) {
    console.error(error);
    // restart in a safe state
    return await loop(savedCounter);
  }
};

loop(0);

For this feature to be exploitable, I need to monitor asynchronous side-effects, so that calculations can't escape their context (unless they really really want to, in which case the invariants would not be supported). If zone or cls were reliable at scale, I could build this upon them.

For this I think I only need to ensure that I can keep a readable context, and that I can run a function in an extended context.

Example[1] (could be expressed in terms of zones/cls) :

Context.current.fork(
  (context) => {
    // extend the context passed by the parent
    return { ...context, foo: "bar" };
  },
  (context) => {
    // do stuff in a new "hardly" escapable context (not strictly impossible to escape, but hard to do by mistake)
    ...
  }
).catch((error) => ...).then((result) => ...);

I think allowing "just" that would enable many useful paradigms, even with a more restricted feature set than Zones/cls.

Do you think this can be implemented using async_hooks only? I have a prototype, but I'm sure you have spent a lot of time thinking about this and I'd be glad to read your thoughts.

[1]: Syntax for illustrative purpose only, not really important at this stage

@vdeturckheim
Member

@elierotenberg there is a plan to re-implement domains over async_hooks, you might want to keep an eye on this.

@elierotenberg

I will. Thanks.

@NatalieWolfe

@vdeturckheim I work at NR (Martin actually just moved teams). We've had mostly positive results thus far. No real-world issues have come up and it seems the performance is generally better than our existing monkey-patch based instrumentation. Our customers who are testing it have responded positively without any complaints of performance degradation after turning it on.

In our worst-case scenario benchmark (a 300-link no-op promise chain), our async-hook instrumentation is about 5x faster than our older monkey-patch based instrumentation. A no-op async hook was about 1.5x faster than our async-hook instrumentation. No instrumentation at all was about 8x faster than the async-hook instrumentation. This test is not reflective of real-world performance, but does show it is better than what we had and there is room for improvement.

For memory-leak situations, we've run some servers under extremely heavy load (~6 million promises per minute, 6000 requests per minute) for several days and noticed no leaking issues. We have, however, encountered a strange behavior around the 6-hour mark where GC scavenge events suddenly jump and CPU jumps with it. This is a one-time situation and the memory usage doesn't change after that. We're currently working on narrowing down the cause of that.

@elierotenberg

Glad to learn that NR is working on a visualization/debug tool based on async_hooks. Async work monitoring/debugging seems very painful and hard to me, and I'm convinced we need better tools with the proper abstractions.

As a side note, I've continued my little research and am now working on implementing a (minimal) actor system for easier work distribution within a single Node process or spanning multiple Node processes (it should even be able to support having actors run in the browser and communicate with the cluster using HTTP/WebSockets). Think Erlang or Akka (but with far fewer features initially, of course).
I will probably use async_hooks in each Node process to monitor local async behavior and enforce/monitor the invariants underlying the actor model implementation.

@vdeturckheim
Member

@NatalieWolfe Thanks a lot! Looks pretty promising!

Out of curiosity, what applications are you testing the agent with? Did you build an internal app or do you rely on open source projects?

@benjamn
Contributor

benjamn commented Dec 22, 2017

Realized my comment might not be germane for this thread, so I moved it here: #14717 (comment)

daniloisr pushed a commit to daniloisr/miniprofiler-node that referenced this issue Oct 22, 2018
This commit fixes a bug where a request can end without finishing its
timings. See: MiniProfiler#4

Bug cause: the structure used to register Miniprofiler timing
providers (Postgres, HTTP, Redis) overrides the original method
(globally) and uses the `request` object to access the Miniprofiler
extensions that build the timings, and this doesn't work in a
scenario with simultaneous requests using an async provider.

Here is an example using
[`pg`](https://github.com/goenning/miniprofiler-pg/blob/master/index.js)
to try illustrating the failing scenario (check out the
`tests/concurrent-async-test.js` test to see it running).

request A start:
* `pg.Client.prototype.query` holds a `req` object of request A.
* It calls `.query` in a pg instance
  * A miniprofiler timing starts with the call to
    `req.miniprofiler.timeQuery(...)`
  * The original method is called (async).

request B start:
* `pg.Client.prototype.query` holds a `req` object of request B.
* It calls `.query` in a pg instance
  * Start timing with `req.miniprofiler.timeQuery(...)`
  * The original method is called (async).

request A resume:
* The result of `.query` is returned
* A new call to `.query` is made
  * This time `req` points to request B, which means that
    `req.miniprofiler.timeQuery(...)` will start a timing on request B.
  * The original method is called (async)

request B resume:
* The result of `.query` is returned.
* All data was fetched and the request is ready to finish, so internally
  Miniprofiler calls
  [`stopProfilling`](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L80).
  * This fails because there is a timing started (by request A) but not
    finished, so calculating the
    [diffs](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L409)
    fails because `stop` is undefined.

Solution
--------

Using Node.js "async_hooks" we can track the reference to the correct
extension for each request, so calls to `req.miniprofiler.timeQuery()`
will point to the correct Miniprofiler extension.

For some performance analysis, see:
nodejs/node#14794 (comment)

The goal of the current commit is to avoid introducing breaking changes,
so the miniprofiler reference is injected into the request using JS getters.

Another solution would be changing the API for providers: instead of
receiving a `req` reference, they could receive a function that returns
the reference to the correct miniprofiler instance. But this would break
the API for all existing providers.

References
----------

- https://medium.com/the-node-js-collection/async-hooks-in-node-js-illustrated-b7ce1344111f
- https://medium.com/@guysegev/async-hooks-a-whole-new-world-of-opportunities-a1a6daf1990a
- nodejs/node#14794 (comment)
@Trott
Member

Trott commented Oct 26, 2018

Closing as discussion seems to have run its course. Feel free to re-open if this is still something that needs more discussion.

@Trott Trott closed this as completed Oct 26, 2018
daniloisr pushed a commit to daniloisr/miniprofiler-node that referenced this issue Nov 20, 2018