
metrics make request 10x times slower #313

Closed
roytan883 opened this issue Jul 3, 2018 · 39 comments
Labels
Module: Core · Status: Need reproduce (issue waits for reproducing)

Comments

@roytan883
Contributor

The issue scenario is:
apiGateway.rest -> auth.authorize -> demo.welcome -> demo.hello
Every step is very simple, so only network cost should matter.

If I enable metrics: true in each service process, a request takes about 120ms in total. But if I disable metrics, it only takes about 10ms.

Does metrics internally use event broadcasting? If so, it should not impact normal performance so badly.

@roytan883
Contributor Author

Each step's duration looks fine, only a few ms. But why does the final apiGateway.rest record such a long duration:

INFO -> metrics traceStart:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  remoteCall: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  remoteCall: true,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  endTime: 1530602761273.025,
  duration: 2.024912,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  remoteCall: true,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  remoteCall: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  endTime: 1530602761315.0862,
  duration: 1.086225,
  remoteCall: false,
  fromCache: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  endTime: 1530602761315.9263,
  duration: 1.92624,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  endTime: 1530602761354.5188,
  duration: 119.518744,
  remoteCall: false,
  fromCache: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }

@icebob
Member

icebob commented Jul 3, 2018

Which versions?

Btw, please use issue templates when you open a new issue.

@icebob
Member

icebob commented Jul 3, 2018

I've just measured this demo with JMeter:
With metrics: false: ~10k req/sec
With metrics: true: ~8k req/sec

@roytan883
Contributor Author

I think I found the problem. If I start the process that listens to those metrics events (metrics.trace.span.start/finish), the whole system slows to ~120ms per request (the event handler is very simple; I even tried an empty function). If I stop the listener process, the whole system goes back to the normal ~10ms per request.

@roytan883
Contributor Author

Version: 0.12.8

@icebob
Member

icebob commented Jul 3, 2018

And is the listener on the same node, or on a remote node?

@roytan883
Contributor Author

A remote node; every service in this issue runs in a separate node process.

@icebob
Member

icebob commented Jul 3, 2018

Ok, I will investigate it.

@icebob added the Module: Core and Status: Need reproduce labels Jul 3, 2018
@roytan883
Contributor Author

By the way, the transporter is NATS.

@icebob
Member

icebob commented Jul 3, 2018

What I measured:
API gateway with metrics without listener: 8k req/sec
API gateway with metrics with a remote listener: 3k req/sec

Btw, it was just a simple "test.hello" action.

@roytan883
Contributor Author

"it was just a simple "test.hello" action. a remote listener: 3k req/sec"

Yes, that's the problem. My scenario is apiGateway.rest -> auth.authorize -> demo.welcome -> demo.hello; with more remote actions, it gets much slower.

Maybe you should also measure its latency?

@icebob
Member

icebob commented Jul 3, 2018

If you can, please make a repro repo.

@roytan883
Contributor Author

roytan883 commented Jul 3, 2018

Just make a request like this:
client -> apiGateway(test.hello) -> test.hello -> ctx.call test1.hello1 -> ctx.call test2.hello2
Place test/test1/test2 on separate nodes, and have another node listen to the metrics events.

Yes, I can create a repo. But do you need it? You already measured that a remote node listening to metrics degrades performance badly.

@icebob
Member

icebob commented Jul 3, 2018

It's reasonable that the metrics feature needs extra processing, which increases the response time. Unfortunately, in your case it's a huge difference, because every action call generates 2 extra events that need to be transferred to another node, which causes overhead. The bad news is that I can't fix it currently, because it is not a bug; it's a design issue.

In version 0.14 I will rewrite the whole metrics & tracing logic to be more efficient and pluggable.
Meanwhile, as a workaround, you can disable this built-in metrics feature and write a middleware which traces action calls and sends them to a central server (bypassing the Moleculer message bus).
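A minimal, framework-agnostic sketch of that workaround idea (the names tracingMiddleware and flushSpans are made up for illustration, not Moleculer API): wrap the action handler, buffer span data in memory, and ship it in batches over a separate channel instead of emitting two broker events per call.

```javascript
// In-memory span buffer -- spans never touch the transporter.
const buffer = [];

// Middleware-style wrapper: takes an action handler, returns a traced handler.
function tracingMiddleware(handler, actionName) {
  return async function tracedHandler(ctx) {
    const start = Date.now();
    try {
      return await handler(ctx);
    } finally {
      // Record locally only -- no broker.emit, so no extra network round trip
      buffer.push({ action: actionName, duration: Date.now() - start });
    }
  };
}

// Flush collected spans in one batch, e.g. via HTTP to a tracing backend.
// Returns the number of spans handed to the sender.
function flushSpans(send) {
  const batch = buffer.splice(0, buffer.length);
  if (batch.length) send(batch);
  return batch.length;
}
```

Calling flushSpans on a timer (say every few seconds) amortizes the transfer cost over many requests instead of paying it on every call.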

@roytan883
Contributor Author

Maybe it's not about the metrics themselves. I dug into the source code; the metrics just call broker.emit. So I made a test with plain event emits, not listening to metrics.trace.span.start/finish.

In listener node:

  events: {
    // "metrics.trace.span.start": this.traceStart,
    // "metrics.trace.span.finish": this.traceFinish,
    "event.test.t1": this.t1,
    "event.test.t2": this.t2,
  },

  async t1(info) {
    this.logger.info(`${this.name} t1: `, util.inspect(info));
  },
  async t2(info) {
    this.logger.info(`${this.name} t2: `, util.inspect(info));
  },

in auth:

async authorize(ctx) {
    this.broker.emit("event.test.t1", {
      a: "abc",
      b: 123,
      c: true
    })
    console.log("call authorize ctx.params = ", ctx.params)
    if (ctx.params.action.indexOf('$node') >= 0) { //disable call $node.*
      return false
    }
    return true
  }

in demo:

  async welcome(ctx) {
    this.broker.emit("event.test.t2", {
      a: "abc",
      b: 123,
      c: true
    })
    // await ctx.call('demo.hello')
    return this._sayWelcome(ctx.params.name);
  }

The result is:
1. Without sending event.test.t1 and event.test.t2: 8.396ms
2. Sending event.test.t1: 51.806ms
3. Sending event.test.t1 and event.test.t2: 81.931ms

@roytan883
Contributor Author

A simple event taking 30-40ms doesn't make sense; an event should not be much slower than an action.

@roytan883
Contributor Author

I guess the problem is not about how the metrics are generated.
It is about how to send events without impacting action performance.

@icebob
Member

icebob commented Jul 3, 2018

Hmm, it's weird. Please create a dev example file with 3 brokers & services & NATS, like here, and I will try to investigate it.

@roytan883
Contributor Author

roytan883 commented Jul 3, 2018

OK, I will create a dev example file.

@roytan883
Contributor Author

roytan883 commented Jul 3, 2018

OK, finished. The result is:

call demo1.hello1 at:  1530627615760
call demo2.hello2 at:  1530627615761
call demo3.hello3 at:  1530627615762
call demo1.hello1 use time =  6
call demo1.hello1WithEvent at:  1530627615765
listen t1 at:  1530627615765
call demo2.hello2WithEvent at:  1530627615804
listen t2 at:  1530627615805
call demo3.hello3WithEvent at:  1530627615844
listen t3 at:  1530627615844
call demo1.hello1WithEvent use time =  120

test code is:

const { ServiceBroker } = require("moleculer");

let natsTransporter = {
  type: "NATS",
  options: {
    url: "nats://127.0.0.1:4222",
  }
}

// Create broker Client
const brokerClient = new ServiceBroker({
  namespace: "streaming",
  nodeID: "client",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #1
const broker1 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo1",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #2
const broker2 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo2",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #3
const broker3 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo3",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create brokerListen
const brokerListen = new ServiceBroker({
  namespace: "streaming",
  nodeID: "listen",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

broker1.createService({
  name: "demo1",
  actions: {
    hello1(ctx) {
      console.log("call demo1.hello1 at: ", Date.now())
      return ctx.call('demo2.hello2')
    },
    hello1WithEvent(ctx) {
      broker1.emit('t1', { a: 'abc', b: 123, c: true })
      console.log("call demo1.hello1WithEvent at: ", Date.now())
      return ctx.call('demo2.hello2WithEvent')
    }
  }
});

broker2.createService({
  name: "demo2",
  actions: {
    hello2(ctx) {
      console.log("call demo2.hello2 at: ", Date.now())
      return ctx.call('demo3.hello3')
    },
    hello2WithEvent(ctx) {
      broker2.emit('t2', { a: 'abc', b: 123, c: true })
      console.log("call demo2.hello2WithEvent at: ", Date.now())
      return ctx.call('demo3.hello3WithEvent')
    }
  }
});

broker3.createService({
  name: "demo3",
  actions: {
    hello3(ctx) {
      console.log("call demo3.hello3 at: ", Date.now())
      return "hello from demo3"
    },
    hello3WithEvent(ctx) {
      broker3.emit('t3', { a: 'abc', b: 123, c: true })
      console.log("call demo3.hello3WithEvent at: ", Date.now())
      return "hello from hello3WithEvent"
    }
  }
});

brokerListen.createService({
  name: "listen",
  events: {
    t1: (ctx) => {
      console.log("listen t1 at: ", Date.now())
    },
    t2: (ctx) => {
      console.log("listen t2 at: ", Date.now())
    },
    t3: (ctx) => {
      console.log("listen t3 at: ", Date.now())
    },
  }
});

brokerClient.Promise.all([brokerClient.start(), broker1.start(), broker2.start(), broker3.start(), brokerListen.start()])
  .delay(2000)
  .then(async () => {

    let startTime = Date.now()
    await brokerClient.call('demo1.hello1')
    let endTime = Date.now()
    console.log("call demo1.hello1 use time = ", endTime - startTime)

    startTime = Date.now()
    await brokerClient.call('demo1.hello1WithEvent')
    endTime = Date.now()
    console.log("call demo1.hello1WithEvent use time = ", endTime - startTime)

    // startTime = Date.now()
    // await brokerClient.call('demo1.hello1WithEvent')
    // endTime = Date.now()
    // console.log("call demo1.hello1WithEvent again use time = ", endTime - startTime)

  });

@icebob
Member

icebob commented Jul 3, 2018

Thanks, I will investigate it tonight.

@icebob
Member

icebob commented Jul 3, 2018

It gets weirder. This is the result on my PC:

call demo1.hello1 at:  1530633313582
call demo2.hello2 at:  1530633313585
call demo3.hello3 at:  1530633313587
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530633313596
listen t1 at:  1530633313599
call demo2.hello2WithEvent at:  1530633313601
listen t2 at:  1530633313603
call demo3.hello3WithEvent at:  1530633313604
listen t3 at:  1530633313606
call demo1.hello1WithEvent use time =  15

@icebob
Member

icebob commented Jul 3, 2018

Node version, OS, architecture...etc?

@icebob
Member

icebob commented Jul 3, 2018

I could reproduce it on Ubuntu:

call demo1.hello1 at:  1530644013498
call demo2.hello2 at:  1530644013501
call demo3.hello3 at:  1530644013503
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530644013513
listen t1 at:  1530644013514
call demo2.hello2WithEvent at:  1530644013560
listen t2 at:  1530644013561
call demo3.hello3WithEvent at:  1530644013604
listen t3 at:  1530644013604
call demo1.hello1WithEvent use time =  139

@icebob
Member

icebob commented Jul 3, 2018

It's very mysterious!
With NATS: ~8 rps (~125ms)
With Redis: ~500 rps (~2ms)
With TCP: ~1700 rps (~0.5ms)
[chart: linux-transporter-emits]

And it happens only on Linux. On Windows it's ~1500-2000 rps with all transporters.
Do you have any idea? Maybe it's a NATS client problem, because it can send messages in parallel?

Dev test file: https://github.com/moleculerjs/moleculer/blob/master/dev/emit-latency.js

@roytan883
Contributor Author

Have you tried a different NATS client library version?

@icebob
Member

icebob commented Jul 4, 2018

I will, but currently I can't.

@icebob
Member

icebob commented Jul 4, 2018

I tried with 0.8 and 0.7... same result.

@AndreMaz
Member

AndreMaz commented Jul 4, 2018

Same thing here.

  • On Windows the perf is OK.
  • On Linux I tried different versions of nats (1, 0.8, 0.7) and moleculer (0.12.8, 0.12.6, 0.12.1), and the result is similar to this:
call demo1.hello1 at:  1530693198252
call demo2.hello2 at:  1530693198252
call demo3.hello3 at:  1530693198253
call demo1.hello1 use time =  8
call demo1.hello1WithEvent at:  1530693198255
listen t1 at:  1530693198255
call demo2.hello2WithEvent at:  1530693198303
listen t2 at:  1530693198303
call demo3.hello3WithEvent at:  1530693198348
listen t3 at:  1530693198349
call demo1.hello1WithEvent use time =  138

Info about my system:

Kubuntu 18.04 64bit
Kernel 4.15.0-24
CPU: Intel i7 6700
RAM 16GB

NodeJS: 8.11.3
Nats Server: 1.1.0

@icebob
Member

icebob commented Jul 4, 2018

Thanks @AndreMaz

@roytan883
Contributor Author

roytan883 commented Jul 6, 2018

After a long time of digging and testing, I found the reason. I don't know whether it's a NATS or a Node.js bug. The Node.js docs say noDelay defaults to true, but it doesn't behave that way.

socket.setNoDelay([noDelay])
Disables the Nagle algorithm. By default TCP connections use the Nagle algorithm, they buffer data before sending it off. Setting true for noDelay will immediately fire off data each time socket.write() is called. noDelay defaults to true.

Just modify this file: node_modules/nats/lib/nats.js

// Create the stream
    this.stream = net.createConnection(this.url.port, this.url.hostname);
    this.stream.setNoDelay(true); // must be called explicitly; somehow the 'true' default does not take effect

After manually adding setNoDelay(true), it works fine on Ubuntu:

call demo1.hello1 at:  1530870151475
call demo2.hello2 at:  1530870151477
call demo3.hello3 at:  1530870151478
call demo1.hello1 use time =  8
call demo1.hello1WithEvent startTime =  1530870151480
call demo1.hello1WithEvent at:  1530870151481
listen e1 at:  1530870151482
call demo2.hello2WithEvent at:  1530870151483
listen e2 at:  1530870151484
call demo3.hello3WithEvent at:  1530870151484
listen e3 at:  1530870151485
call demo1.hello1WithEvent endTime =  1530870151486
call demo1.hello1WithEvent use time =  6

test environment:

moleculer: 0.12.8
nodejs: 8.11.1
node-nats: 1.0.0
gnatsd: 1.2.0
system: ubuntu 16.04 server

@icebob
Member

icebob commented Jul 6, 2018

Wow, nice catch! That explains the difference between OSs: on Windows noDelay is effectively true, while on Linux it's not.

@icebob
Member

icebob commented Jul 6, 2018

And it's the reason why there was no problem with the Redis transporter:
https://github.com/luin/ioredis/blob/622975d9d454eca6a9c495d35f5638d68e2662f2/lib/redis.js#L311

@icebob
Member

icebob commented Jul 6, 2018

Btw, I can add a workaround: always call setNoDelay in the transporter code when the client connects.

icebob added a commit that referenced this issue Jul 6, 2018
@roytan883
Contributor Author

@icebob Maybe you should also listen for the connect and reconnect events from NATS, because after a reconnect the underlying stream (socket) object will be a new one.

    client.on("connect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);
    });

    client.on("reconnect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);
    });

@icebob
Member

icebob commented Jul 6, 2018

Yes, good idea!

@icebob closed this as completed Jul 8, 2018
@aricart

aricart commented Jul 17, 2018

npm already has a fix for this; you may want to just update and remove the patching.

@roytan883
Contributor Author

@aricart Thank you for fixing this

@aricart

aricart commented Jul 17, 2018

@roytan883 yw!!
