
Add core metrics service #58623

Merged · 13 commits into elastic:master · Mar 3, 2020

Conversation

@pgayvallet (Contributor) commented Feb 26, 2020

Summary

Fix #46563

  • Add a new metrics core service, and expose the associated API in core's public setup API.
  • the core OpsMetrics reproduces the structure generated in src/legacy/server/status/lib/metrics.js
  • the core collectors implementation no longer relies on oppsy (but is based on the oppsy implementation). Once all usages have been adapted to use this new core API, we should be able to remove oppsy from our dependencies.
  • the PR only exposes this new API and does not adapt any plugin/legacy code to use it. Using the new service/API is outside the scope of this PR (see Migrate status service and status page to New Platform #41983 for the status page/plugin, for example).

Checklist

For maintainers

Dev Docs

A new metrics API is available from core and allows retrieving various metrics regarding the http server, the process, and OS load/usage:

core.metrics.getOpsMetrics$().subscribe(metrics => {
  // do something with the metrics
})
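For illustration only, a slightly fuller sketch of a consumer; the field names come from the OpsMetrics interface discussed further down, and the logging is purely an example:

core.metrics.getOpsMetrics$().subscribe(metrics => {
  // `metrics` matches the OpsMetrics interface: process, OS, and http server stats
  console.log(metrics.process);                 // process related metrics
  console.log(metrics.os);                      // OS related metrics
  console.log(metrics.response_times);          // server response time stats
  console.log(metrics.requests);                // server requests stats
  console.log(metrics.concurrent_connections);  // current concurrent connections
})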

@pgayvallet pgayvallet added Feature:New Platform, Team:Core, v7.7.0, v8.0.0 labels Feb 26, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-platform (Team:Platform)

@pgayvallet pgayvallet added the release_note:plugin_api_changes label Feb 26, 2020
Comment on lines +40 to +44
it('collects event loop delay', async () => {
const metrics = await collector.collect();

expect(metrics.event_loop_delay).toBeGreaterThan(0);
});
Contributor Author:
Mocking getEventLoopDelay is a pain, so I only tested that we collect an actual value greater than 0 here.

Comment on lines +29 to +33
describe('ServerMetricsCollector', () => {
let server: HttpService;
let collector: ServerMetricsCollector;
let hapiServer: HapiServer;
let router: IRouter;
Contributor Author:
The ServerMetricsCollector collects data from the HAPI server. Unit testing it was a pain, as a lot of the HAPI server methods and behaviors would need to be properly mocked, so I wrote an integration test instead (I also think this is more appropriate than a unit test for this specific collector).

Comment on lines 108 to 123
it.skip('collect connection count', async () => {
const waitSubject = new Subject();

router.get({ path: '/', validate: false }, async (ctx, req, res) => {
await waitSubject.pipe(take(1)).toPromise();
return res.ok({ body: '' });
});
await server.start();

let metrics = await collector.collect();
expect(metrics.concurrent_connections).toEqual(0);

sendGet('/');
await delay(20);
metrics = await collector.collect();
expect(metrics.concurrent_connections).toEqual(1);
Contributor Author:
This is trying to test this part of the code:

    const connections = await new Promise<number>(resolve => {
      this.server.listener.getConnections((_, count) => {
        resolve(count);
      });
    });

I thought this.server.listener.getConnections returned the number of open/pending connections, so I tried to test it by keeping handlers waiting, but the test fails (metrics.concurrent_connections is always equal to 0).

The snippet is directly copied from

server.listener.getConnections((_, count) => {
  event.concurrent_connections = count;
  // captures (performs transforms on) the latest event data and stashes
  // the metrics for status/stats API payload
  metrics.capture(event).then(data => {
    kbnServer.metrics = data;
  });
});

So it's very unlikely it doesn't work (or at least doesn't do what it's supposed to do); however, I don't know how to properly integration test it (I could unit test it by mocking server.listener.getConnections, but mocking the whole hapi server just for that felt like overkill).

If someone has an idea...

Contributor:
sendGet doesn't send a request, you need to call end with a callback. https://visionmedia.github.io/superagent/

    supertest(hapiServer.listener)
      .get('/')
      .end(() => null);

Contributor Author:
It does when you await it. Using both end and awaiting it afterwards throws an error stating that.

Contributor Author:
Oh, but I'm not awaiting it here... Thanks.
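For future readers, a small sketch contrasting the two ways of firing the request with supertest discussed in this thread (the error mentioned above occurs when combining them):

    // Awaiting the request sends it and waits for the response:
    await sendGet('/');

    // Calling .end() with a callback also sends it, without waiting for the response:
    supertest(hapiServer.listener)
      .get('/')
      .end(() => null);

    // Doing both on the same request (awaiting it after .end()) throws.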

Comment on lines +51 to +53
return {
  getOpsMetrics$: () => metricsObservable,
};
Contributor Author:
I exposed this in the setup API, even if the observable doesn't emit until core's start phase. Tell me if we prefer moving it to the start API instead.

Contributor:
I think it makes sense if we implement lazy logic

Contributor:
Makes sense to me to maintain the pattern of "register things in setup"

Comment on lines +35 to +40
export class MetricsService
  implements CoreService<InternalMetricsServiceSetup, InternalMetricsServiceStart> {
  private readonly logger: Logger;
  private metricsCollector?: OpsMetricsCollector;
  private collectInterval?: NodeJS.Timeout;
  private metrics$ = new ReplaySubject<OpsMetrics>(1);
Contributor Author:
I created a metrics module/service instead of naming it ops. I thought that if at some point we want to expose other metrics, it would make more sense to have a global service for that.

Comment on lines +55 to +66
export interface OpsMetrics {
  /** Process related metrics */
  process: OpsProcessMetrics;
  /** OS related metrics */
  os: OpsOsMetrics;
  /** server response time stats */
  response_times: OpsServerMetrics['response_times'];
  /** server requests stats */
  requests: OpsServerMetrics['requests'];
  /** number of current concurrent connections to the server */
  concurrent_connections: OpsServerMetrics['concurrent_connections'];
}
Contributor Author:
Waiting for @chrisronline to reply to #46563 (comment) to know whether we can regroup the server metrics under a server property instead of exposing them all at the root level as was done in legacy.

Contributor:
Are these in snake_case format just to maintain compatibility with legacy? Seems like we should rename to camelCase

Contributor Author:
It is, see #46563. If we think this is not acceptable and that consumers should adapt, I could both rename everything to camelCase and create the server property I mentioned. I'm not sure how much the existing structure is allowed to change; maybe you can answer?

@pgayvallet pgayvallet marked this pull request as ready for review February 27, 2020 09:43
@pgayvallet pgayvallet requested a review from a team as a code owner February 27, 2020 09:43
export const opsConfig = {
path: 'ops',
schema: schema.object({
interval: schema.number({ defaultValue: 5000 }),
Contributor:
nit: what if we use more descriptive duration type? schema.duration({ defaultValue: '5s' }),
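A minimal sketch of that suggestion, assuming @kbn/config-schema's schema.duration is usable here:

    import { schema } from '@kbn/config-schema';

    export const opsConfig = {
      path: 'ops',
      schema: schema.object({
        // schema.duration accepts human-readable defaults such as '5s'
        interval: schema.duration({ defaultValue: '5s' }),
      }),
    };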


if (platform === 'linux') {
try {
const distro = (await promisify(getos)()) as LinuxOs;
Contributor:
nit: could be created once instead of recreating on every call
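A sketch of that nit, hoisting the promisified getos to module scope; getDistro is a hypothetical helper name used only for illustration:

    import getos from 'getos';
    import { promisify } from 'util';

    // promisified once at module load instead of on every call
    const getosAsync = promisify(getos);

    async function getDistro() {
      if (process.platform === 'linux') {
        // the cast to LinuxOs from the original snippet would still apply here
        return await getosAsync();
      }
    }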

*/
export interface OpsOsMetrics {
/** The os platform */
platform: string;
Contributor:
nit: platform: NodeJS.Platform
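I.e., a sketch of the suggested typing:

    export interface OpsOsMetrics {
      /** The os platform */
      platform: NodeJS.Platform;
      // ...rest of the interface unchanged
    }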

max: 0,
};

constructor(private readonly server: HapiServer) {
Contributor:
What if we prevent hapi server leakage and invert dependencies? The HTTP server could implement the collect interface instead.

Contributor Author:
We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future proof.

Contributor:
We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future proof.

It's up to you, but I'd prefer to have a single place to update if we decide to get rid of hapi one day.
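A rough sketch of the inversion being discussed; the interface and method names here are hypothetical, not the actual implementation:

    // The metrics service would only depend on this interface...
    export interface MetricsCollector<T> {
      collect(): Promise<T>;
    }

    // ...and the internal http contract could expose something like:
    // getServerMetricsCollector(): MetricsCollector<OpsServerMetrics>;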

src/core/server/metrics/metrics_service.ts (resolved discussion)
src/core/server/metrics/metrics_service.ts (resolved discussion)
@pgayvallet (Contributor Author)

retest

free_in_bytes: os.freemem(),
used_in_bytes: os.totalmem() - os.freemem(),
},
uptime_in_millis: os.uptime() * 1000,
Contributor:
Same as other comment, can we rename to camelCase?

@pgayvallet (Contributor Author)

Remaining points to decide on:

Don't have a strong opinion on any of them, but we need a consensus.

@mshustov (Contributor) left a comment:

not necessary

optional. #58623 (comment)

can be done as follow-up

@pgayvallet (Contributor Author)

Created #59113 to track the follow-up improvements

@kibanamachine (Contributor)

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pgayvallet pgayvallet merged commit 64ffae3 into elastic:master Mar 3, 2020
pgayvallet added a commit to pgayvallet/kibana that referenced this pull request Mar 3, 2020
* create base service and collectors

* wire the service into server, add mock

* add collector tests

* add main collector test

* export metric types from server

* add service and server tests

* updates generated doc

* improve doc

* nits and comments

* add disconnected requests test
pgayvallet added a commit that referenced this pull request Mar 3, 2020
dgieselaar pushed a commit to dgieselaar/kibana that referenced this pull request Mar 3, 2020
@chrisronline (Contributor)

How do I access this in NP? This doesn't have a metrics key

@pgayvallet (Contributor Author)

@chrisronline Wow. I guess I just forgot to add it to the public contract. Will address that asap.

@pgayvallet (Contributor Author)

Created #59294

@chrisronline (Contributor)

I'm comparing the data returned from this new API to what currently exists in the monitoring code base and I'm seeing an issue (there might be more, but this is the first one I dug into).

In the existing system, requesting the data "flushes" the rolling event system, which "resets" the existing state. So essentially, all max and average stats are limited to the period between when the first event is collected and when the entire buffer is flushed, which is ideal as each .monitoring-kibana-* document is collected every 10s (by default) and we want to ensure each document contains data representative of that time period.

However, this API seems to buffer the data for the entire duration of the process - resulting in different reported values from the existing system.

Perhaps we should add a way to flush the system, or potentially introduce instances of the metrics collector that can support flushing some intermediate state (so as to preserve the idea that the main metrics collector never resets its local state)? I don't know whether other plugins depend on this data or not.

@pgayvallet (Contributor Author)

@chrisronline

However, this API seems to buffer the data for the entire duration of the process - resulting in different reported values from the existing system.

Yeah, it seems I missed the fact that the HAPI network monitor actually resets its state/requests after every collection.

I don't know if there are other plugins that depend on these data or not.

The monitoring plugin and the OSS server monitoring (src/legacy/server/status/collectors/get_ops_stats_collector.js) seem to be the only consumers of this API.

Perhaps we should add a way to flush the system, or potentially introduce instances of the metrics collector that can support flushing some intermediate state

  • exposing an API to flush the global collector/metrics doesn't look acceptable in the current implementation, as a single consumer should not impact the other consumers

  • introducing an additional layer in between doesn't seem doable, as I don't really see how it would be able to compute the average request values for a given time interval with the data the current collector provides.

Leading to:

I think falling back to the oppsy behavior, by resetting the network collector after every collection (every getOpsMetrics$ observable emission), is the most reasonable thing to do, as this is what both consumers of the API expect anyway. It also seems reasonable to let the consumers handle potential data aggregation themselves (and none does atm).
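A rough sketch of what that could look like in the service's collection loop, using the internals shown earlier; the reset() call is hypothetical:

    this.collectInterval = setInterval(async () => {
      const metrics = await this.metricsCollector!.collect();
      // hypothetical reset() so each getOpsMetrics$ emission only covers the
      // elapsed interval, mirroring the oppsy behavior both consumers expect
      this.metricsCollector!.reset();
      this.metrics$.next(metrics);
    }, config.interval);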

WDYT?

@chrisronline (Contributor)

Sure, that works for me! Thanks!

@pgayvallet (Contributor Author)

@chrisronline created #59551

Labels
Feature:New Platform release_note:plugin_api_changes Contains a Plugin API changes section for the breaking plugin API changes section. Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.7.0 v8.0.0
Development

Successfully merging this pull request may close these issues.

Migrate ops metrics to New Platform
6 participants