
RFC: seeking feedback on async_hooks performance in production #14794

Closed
refack opened this issue Aug 12, 2017 · 20 comments
Labels
async_hooks Issues and PRs related to the async hooks subsystem.

Comments

@refack
Contributor

refack commented Aug 12, 2017

  • Version: v8.x - master
  • Platform: all
  • Subsystem: async_hooks

[email protected] saw the introduction of async_hooks the latest and greatest incarnation of core's tracing API (#11883 - based on https://github.com/nodejs/node-eps/blob/master/006-asynchooks-api.md)
The @nodejs/async_hooks teams would love to hear feedback, especially regarding performance in production apps (CPU/memory usage), but also synthetics benchmarking results.
Any other feedback will also be welcome, as well as issues and PRs.

Thanks.

@refack refack added async_hooks Issues and PRs related to the async hooks subsystem. meta Issues and PRs related to the general management of the project. labels Aug 12, 2017
@addaleax addaleax removed the meta Issues and PRs related to the general management of the project. label Aug 12, 2017
@AndreasMadsen
Member

@nodejs/diagnostics

@jkrems
Contributor

jkrems commented Aug 13, 2017

I'm guessing the instructions for a non-synthetic benchmark would be:

  • Run an actual app on node 8 in prod.
  • Turn on async_hooks (require('async_hooks').enable() should be enough?).
  • Compare memory/CPU stats.

If so, I might get some data in the coming weeks (wanted to give 8.3 a spin anyhow in one of our services).

@AndreasMadsen
Member

@jkrems You actually need to add empty hooks functions, otherwise nothing changes.

@refack
Contributor Author

refack commented Aug 13, 2017

@jkrems That would be great!

@jkrems You actually need to add empty hooks functions, otherwise nothing changes.

It will be a fair exercise to suss out memory leaks.

@refack
Contributor Author

refack commented Aug 13, 2017

Here is the example trace snippet from nodejs.org/api/async_hooks.html (slightly simplified):

const async_hooks = require('async_hooks');
const fs = require('fs');

let indent = '';
const traceFile = 1; // fd 1 is stdout
async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    const eid = async_hooks.executionAsyncId();
    fs.writeSync(
      traceFile,
      `${indent}${type}(${asyncId}): trigger: ${triggerAsyncId} execution: ${eid}\n`);
  },
  before(asyncId) {
    fs.writeSync(traceFile, `${indent}before:  ${asyncId}\n`);
    indent += '  ';
  },
  after(asyncId) {
    indent = indent.slice(2);
    fs.writeSync(traceFile, `${indent}after:   ${asyncId}\n`);
  },
  destroy(asyncId) { fs.writeSync(traceFile, `${indent}destroy: ${asyncId}\n`); },
}).enable();

@tutman96

Been running it in production on several high-traffic components since 4.2.1 and have had no major memory leaks. No performance degradation compared to the old async-listeners implementation.

One thing we did notice was that the asynchronous hooks generated by promises didn't get destroyed until the promise was garbage collected. This caused a bit of runaway memory under very high (synthetic) traffic, but we haven't noticed any problems with real-world traffic.

@elierotenberg

Hello,

I believe https://github.com/angular/zone.js/issues/889 might interest you!

@vdeturckheim
Member

I'm under the impression Google started using it in production in their APM. googleapis/cloud-trace-nodejs@0e15b6c

@matthewloring

We are currently experimenting with async hooks, but it is still behind a flag while we evaluate it. We found it to be slightly slower than the continuation-local-storage based approach, but we still think it is the right way forward, especially once more modules start using the embedder API to record their own async events, removing the need for some of our monkeypatching. We can update this thread once we gather more info on how async hooks is working for us.

@vdeturckheim
Member

Awesome! Thanks @matthewloring. When running with CLS, is the monkeypatching of all async methods still applied? There might be a performance boost there.

@matthewloring

The monkeypatching done by our agent is the same under both approaches (since other modules have not yet started using the async hooks embedder API). Async hooks tracks more lifecycle events per asynchronous task than CLS does, especially for promises, and many of them are redundant for the context tracking done by our agent. I suspect this is part of the slowdown, but we need to do more performance analysis to verify.

@elierotenberg

elierotenberg commented Sep 27, 2017

Is it even possible to implement a robust cls/zone (or a lower-level abstraction that zone/cls could be built upon) as a native Node module?

I'm under the impression that some features are very hard or impossible to implement robustly using only JS (even using monkey-patching), mostly due to hacks in third-party libraries that also attempt to do low-level async stuff and implement some sort of userland scheduling, e.g. bluebird.

What are your use cases for this family of features?
I'm most interested in writing async servers in multi-process Node. I want to build a common generator-based userland coroutines/procs library that abstracts away the transport layer between coroutines and allows dynamic scheduling/monitoring/hot updates.
In this model a proc is a pausable/resumable state machine capable of sending/receiving messages to/from other procs via an inbox queue (heavily inspired by Erlang).

Example[1] :

function* counterServer({ counter }) {
  yield effect("saveCounter", counter);
  try {
    const message = yield receive(10000); // pause until a message is received or a timeout elapses
    if (message.type === "add") {
      // "loop" by recursively yielding itself (or another generator function)
      return yield* counterServer({ counter: counter + 1 });
    }
    if (message.type === "query") {
      // send a message back to the sender
      yield send(message.source, { type: "counter", counter });
      return yield* counterServer({ counter });
    }
  }
  catch (error) {
    if (error instanceof ReceiveTimeout) {
      // it's okay to time out, we can just keep receiving
      return yield* counterServer({ counter });
    }
    // other errors crash the process so that its monitor can restart it in a safe state
    console.error("Error thrown in counterServer");
    throw error;
  }
}

const loop = async (initialCounter) => {
  let savedCounter = initialCounter;
  try {
    await spawn(counterServer, { counter: savedCounter }, {
      // algebraic effect handler: produce a "global" side effect (bubbled and resumable)
      saveCounter: (currentCounter) => savedCounter = currentCounter,
    });
  }
  catch (error) {
    console.error(error);
    // restart in a safe state
    return await loop(savedCounter);
  }
};

loop(0);

For this feature to be exploitable, I need to monitor asynchronous side-effects, so that calculations can't escape their context (unless they really really want to, in which case the invariants would not be supported). If zone or cls were reliable at scale, I could build this upon them.

For this I think I only need to ensure that I can keep a readable context, and that I can run a function in an extended context.

Example[1] (could be expressed in terms of zones/cls) :

Context.current.fork(
  (context) => {
    // extend the context passed by the parent
    return { ...context, foo: "bar" };
  },
  (context) => {
    // do stuff in a new "hardly" escapable context (not strictly impossible to escape, but hard to do by mistake)
    ...
  }
).catch((error) => ...).then((result) => ...);

I think allowing "just" that would enable many useful paradigms, even with a more restricted feature set than Zones/cls.

Do you think this can be implemented using async_hooks only? I have a prototype, but I'm sure you have spent a lot of time thinking about this and I'd be glad to read your thoughts.

[1]: Syntax for illustrative purpose only, not really important at this stage

@vdeturckheim
Member

@elierotenberg there is a plan to re-implement domains over async_hooks, you might want to keep an eye on this.

@elierotenberg

I will. Thanks.

@NatalieWolfe

@vdeturckheim I work at NR (Martin actually just moved teams). We've had mostly positive results thus far. No real-world issues have come up and it seems the performance is generally better than our existing monkey-patch based instrumentation. Our customers who are testing it have responded positively without any complaints of performance degradation after turning it on.

In our worst-case scenario benchmark (a 300-link no-op promise chain), our async-hook instrumentation is about 5x faster than our older monkey-patch based instrumentation. A no-op async hook was about 1.5x faster than our async-hook instrumentation. No instrumentation at all was about 8x faster than the async-hook instrumentation. This test is not reflective of real-world performance, but does show it is better than what we had and there is room for improvement.

For memory-leak situations, we've run some servers under extremely heavy load (~6 million promises per minute, 6000 requests per minute) for several days and noticed no leaking issues. We have, however, encountered a strange behavior around the 6-hour mark where GC scavenge events suddenly jump and CPU jumps with it. This is a one-time situation and the memory usage doesn't change after that. We're currently working on narrowing down the cause of that.

@elierotenberg

Glad to learn that NR is working on a visualization/debug tool based on async_hooks. Async work monitoring/debugging seems very painful and hard to me, and I'm convinced we need better tools with the proper abstractions.

As a side note, I've continued my little research and am now working on implementing a (minimal) actor system for easier work distribution within a single Node process or spanning multiple Node processes (it should even be able to support having actors run in the browser and communicate with the cluster using HTTP/WebSockets). Think Erlang or Akka (but with far fewer features initially, of course).
I will probably use async_hooks in each Node process to monitor local async behavior and enforce/monitor the invariants underlying the actor model implementation.

@vdeturckheim
Member

@NatalieWolfe Thanks a lot! Looks pretty promising!

Out of curiosity, what applications are you testing the agent with? Did you build an internal app or do you rely on open source projects?

@benjamn
Contributor

benjamn commented Dec 22, 2017

Realized my comment might not be germane for this thread, so I moved it here: #14717 (comment)

daniloisr pushed a commit to daniloisr/miniprofiler-node that referenced this issue Oct 22, 2018
This commit fixes a bug where a request can end without finishing its
timings. See: MiniProfiler#4

Bug cause: the structure used to register Miniprofiler timing
providers (Postgres, HTTP, Redis) overrides the original method
(globally) and uses the `request` object to access the Miniprofiler
extensions that build the timings, and this doesn't work in a
scenario with simultaneous requests using an async provider.

Here is an example using
[`pg`](https://github.com/goenning/miniprofiler-pg/blob/master/index.js)
to try illustrating the failing scenario (check out the
`tests/concurrent-async-test.js` test to see it running).

request A start:
* `pg.Client.prototype.query` holds a `req` object of request A.
* It calls `.query` in a pg instance
  * A miniprofiler timing starts with the call to
    `req.miniprofiler.timeQuery(...)`
  * The original method is called (async).

request B start:
* `pg.Client.prototype.query` holds a `req` object of request B.
* It calls `.query` in a pg instance
  * Start timing with `req.miniprofiler.timeQuery(...)`
  * The original method is called (async).

request A resume:
* The result of `.query` is returned
* A new call to `.query` is made
  * This time `req` points to request B, which means that
    `req.miniprofiler.timeQuery(...)` will start a timing on request B.
  * The original method is called (async)

request B resume:
* The result of `.query` is returned.
* All data was fetched and the request is ready to finish, so internally
  Miniprofiler calls
  [`stopProfilling`](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L80).
  * This fails because there is a timing started (by request A) but not
    finished, so calculating the
    [diffs](https://github.com/MiniProfiler/node/blob/1a98e40698b1126ac8de728a33406656361f8870/lib/miniprofiler.js#L409)
    fails because `stop` is undefined.

Solution
--------

Using Node.js "async_hooks" we can track the reference to the correct
extension for each request, so calls to `req.miniprofiler.timeQuery()`
will point to the correct Miniprofiler extension.

For some performance analysis, see:
nodejs/node#14794 (comment)

The goal of the current commit is to avoid introducing breaking changes,
so the miniprofiler reference is injected into the request using JS getters.

Another solution would be changing the API for providers: instead of
receiving a `req` reference, they could receive a function that returns
the reference to the correct miniprofiler instance. But this would break
the API for all existing providers.

References
----------

- https://medium.com/the-node-js-collection/async-hooks-in-node-js-illustrated-b7ce1344111f
- https://medium.com/@guysegev/async-hooks-a-whole-new-world-of-opportunities-a1a6daf1990a
- nodejs/node#14794 (comment)
@Trott
Member

Trott commented Oct 26, 2018

Closing as discussion seems to have run its course. Feel free to re-open if this is still something that needs more discussion.

@Trott Trott closed this as completed Oct 26, 2018
daniloisr pushed a commit to daniloisr/miniprofiler-node that referenced this issue Nov 20, 2018