Add Perf Counter #3664

micafan · 2018-12-29T10:11:00Z

What do these changes do?

With performance counters, we can easily view a service's running status and spot abnormal conditions. This PR adds support for perf counters.

I will submit the code in several stages. This is the first stage: it adds the base interface classes for the performance counter.

Introduction:
MetricsRegistryInterface is the base registry class. Registry is used to register and update metrics.
MetricsReporterInterface is the base reporter class. Reporter is used to report metrics to any monitoring service.
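
As a rough illustration of the registry/reporter split described above, the sketch below models a registry that accepts registrations and updates, and a reporter that ships a snapshot somewhere. Every name in it (`RegistryInterface`, `ReporterInterface`, `InMemoryRegistry`, `BufferingReporter`, `MetricSample`, `Snapshot`, `Report`) is hypothetical and chosen for illustration; the actual classes in this PR may look different.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; these are not the classes from this PR.
using Tags = std::map<std::string, std::string>;

struct MetricSample {
  std::string name;
  Tags tags;
  double value;
};

// Registry side: where application code registers and updates metrics.
class RegistryInterface {
 public:
  virtual ~RegistryInterface() = default;
  virtual void RegisterCounter(const std::string &name) = 0;
  virtual void Update(const std::string &name, double value, const Tags &tags) = 0;
  virtual std::vector<MetricSample> Snapshot() const = 0;
};

// Reporter side: ships a snapshot to some monitoring backend.
class ReporterInterface {
 public:
  virtual ~ReporterInterface() = default;
  virtual void Report(const std::vector<MetricSample> &samples) = 0;
};

// Simplest possible registry: counters accumulate per (name, tags) pair.
class InMemoryRegistry : public RegistryInterface {
 public:
  void RegisterCounter(const std::string &name) override { registered_.insert(name); }

  void Update(const std::string &name, double value, const Tags &tags) override {
    assert(registered_.count(name) == 1 && "metric must be registered first");
    counters_[std::make_pair(name, tags)] += value;
  }

  std::vector<MetricSample> Snapshot() const override {
    std::vector<MetricSample> out;
    for (const auto &entry : counters_) {
      out.push_back({entry.first.first, entry.first.second, entry.second});
    }
    return out;
  }

 private:
  std::set<std::string> registered_;
  std::map<std::pair<std::string, Tags>, double> counters_;
};

// Reporter that only buffers what it receives, standing in for a real backend.
class BufferingReporter : public ReporterInterface {
 public:
  void Report(const std::vector<MetricSample> &samples) override {
    reported.insert(reported.end(), samples.begin(), samples.end());
  }
  std::vector<MetricSample> reported;
};
```

A backend-specific reporter would replace `BufferingReporter` and serialize each `MetricSample` into whatever the monitoring service expects.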

Related issue number

ericl · 2018-12-29T10:15:36Z

What's the intended backend for this interface?

Why not use https://github.com/census-instrumentation/opencensus-cpp which will enable many backends to be targeted?

AmplabJenkins · 2018-12-29T13:16:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10498/

micafan · 2018-12-29T13:59:41Z

What's the intended backend for this interface?

These interface classes are a more general, abstract encapsulation.
First, they are extensible, so other components can plug in to meet their own customization needs.
Second, the interface is simple and easy to use, and it is less intrusive to the code.

Why not use https://github.com/census-instrumentation/opencensus-cpp which will enable many backends to be targeted?

We can implement OpenCensusRegistry on top of OpenCensus to expose a more concise interface. Using the OpenCensus stats interface directly is too cumbersome.
Also, OpenCensus's interface is not stable. The PerfCounter interface we provide is widely used by other Ray-related projects within the company, and changes to the interface can affect many projects.

OpenCensus use case (a bit complicated): https://github.com/census-instrumentation/opencensus-cpp/blob/master/opencensus/stats/examples/view_and_record_example.cc

MetricsRegistryInterface use cases:
Registry->RegisterCounter(metric_a);
Registry->Update(metric_a, value, tags);

And also we can implement PrometheusRegistry based on Prometheus. Prometheus is a widely adopted solution. That's why OpenCensus also supports Prometheus protocols.

ericl · 2018-12-29T14:18:53Z

I'm not sure I buy the simplicity argument. As I understand it a perf counter would look like this:
opencensus::stats::Record({{my_perf_ctr, 1}})
with explicit tags:
opencensus::stats::Record({{my_perf_ctr, 1}}, {{my_tag_key, "value"}})

I see the pros of using a stats library as follows:

we don't need to maintain a separate stats API inside Ray.
we get backend compatibility for free (no need to implement registries or exporters)
fully featured implementation

The cons seem to be:

need to add opencensus to the build system (similar to Arrow?)
census stats API is still unstable (though, presumably the export APIs are stabler)
you may need to implement a census exporter depending on your backend

Re: compatibility, I understand there are applications already using this API -- but shouldn't this be independent of Ray? Suppose we have Prometheus as a backend: then Ray can export to Prometheus through opencensus, the app can export directly if it wants, and there are no compatibility concerns.

My sense here is that long term it is better for Ray to "support stats export via opencensus" rather than "export via this custom C++ API". The latter may be easier to put into Ray, but since it is basically a public API it will need to be maintained long term.

AmplabJenkins · 2018-12-29T17:15:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10501/

AmplabJenkins · 2019-01-07T12:25:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10653/

micafan · 2019-01-07T12:37:09Z

I'm not sure I buy the simplicity argument. As I understand it a perf counter would look like this:
opencensus::stats::Record({{my_perf_ctr, 1}})
with explicit tags:
opencensus::stats::Record({{my_perf_ctr, 1}}, {{my_tag_key, "value"}})
@ericl

Actually, you need to register "my_perf_ctr" and "my_tag_key" before recording (calling opencensus::stats::Record)!
And when recording, you must fetch the objects registered above (use MeasureRegistry::GetMeasureByName if the measure is already registered).

What if we have more than one tag key?
my_perf_ctr = opencensus::stats::MeasureDouble::Register(...);
tag_key_1 = opencensus::tags::TagKey::Register();
tag_key_2 = opencensus::tags::TagKey::Register();
tag_key_3 = opencensus::tags::TagKey::Register();
....
opencensus::stats::Record({{my_perf_ctr, 1}}, {{tag_key_1, "value"}, {tag_key_2, "value"}, {tag_key_3, "value"} ... })

What if the tag key is not static but dynamic? Then you have to check whether the tag key is already registered.

The usage of PerfCounter is much simpler, no matter whether the tag key is static or dynamic, and even without registration. Like below:
// If there is no tag
metrics::PerfCounter::GetInstance()->UpdateCounter("my_perf_ctr", 2);
// If there are tags
metrics::PerfCounter::GetInstance()->UpdateCounter("my_perf_ctr", 2, {{"ip", "10.11.12.13"}, {"jobname", "on-line-learning"}, {"taskname", "update-model"}});
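
One way such a registration-free call could work is to create each (name, tags) series lazily on first update. The sketch below shows that idea only; `LazyPerfCounter` and `Read` are hypothetical names, and the internals are an assumption rather than the implementation in this PR.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for metrics::PerfCounter. Only the call shape of
// UpdateCounter follows the comment above; everything else is assumed.
using Tags = std::map<std::string, std::string>;

class LazyPerfCounter {
 public:
  static LazyPerfCounter *GetInstance() {
    static LazyPerfCounter instance;  // initialization is thread-safe since C++11
    return &instance;
  }

  // No prior registration: the (name, tags) series is created on first use,
  // so static and dynamic tag keys are handled the same way.
  void UpdateCounter(const std::string &name, double value, const Tags &tags = Tags{}) {
    std::lock_guard<std::mutex> lock(mutex_);
    counters_[std::make_pair(name, tags)] += value;
  }

  // Returns the accumulated value, or 0 if the series was never updated.
  double Read(const std::string &name, const Tags &tags = Tags{}) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counters_.find(std::make_pair(name, tags));
    return it == counters_.end() ? 0.0 : it->second;
  }

 private:
  LazyPerfCounter() = default;

  mutable std::mutex mutex_;
  std::map<std::pair<std::string, Tags>, double> counters_;
};
```

The trade-off is that every update pays for a map lookup keyed by strings, whereas pre-registered measures and tag keys avoid that cost at the price of the extra setup code.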

ericl · 2019-01-08T01:18:05Z

Right, I agree for dynamic tags the proposed interface can be used more simply without extra registration code. This is imo a niche case though.

The main question here is whether it's easier to integrate census, vs writing our own stats library (which is already several hundred lines of code in this pr, and likely to grow with more features). Do you have a sense of the tradeoffs here?

I do want to see perf counters in Ray, but I am concerned about adding a bunch of custom code that implements the same thing but in a less general way and with limited backend support.

jovany-wang · 2019-01-08T03:36:32Z

@ericl
If I understood what you said correctly, you'd like to implement this with a simple encapsulation for census like glog?

Maybe we can simplify some interfaces to strike a balance, e.g., use a map instead of our custom tags, and use the census client directly instead of the registry and reporter.

ericl · 2019-01-08T03:51:31Z

@jovany-wang so there are two main interfaces of concern here right?
"registry": this is the interface between Ray <-> stats library
"reporter": this is the interface between the stats library <-> backends (e.g., Prometheus)

The proposal raised in this PR is to implement both interfaces in Ray. The alternative I'm suggesting is to use census lib instead, which I believe already implements a reporter/exporter. For the registry side, we can ideally use the census client API directly from Ray, or with a thin wrapper as a compromise.

Btw, I think any stats API should be private to Ray -- backend integrations should be against census exporter (https://opencensus.io/exporters/#language-vs-available-exporters-matrix), not against Ray directly. I think this is different from glog because glog isn't designed to integrate with downstream systems. One possible issue is whether C++ metrics exporter integration is possible (I only see docs for how to implement a C++ trace exporter: https://opencensus.io/exporters/custom-exporter/cpp/tracing/)

Update: it looks like there are a couple C++ stats exporter examples, one of which is here: https://github.com/census-instrumentation/opencensus-cpp/blob/master/opencensus/exporters/stats/prometheus/internal/prometheus_exporter.cc

micafan · 2019-01-08T10:51:30Z

The main question here is whether it's easier to integrate census, vs writing our own stats library (which is already several hundred lines of code in this pr, and likely to grow with more features). Do you have a sense of the tradeoffs here?

Of course, this proposal isn't intended to write our own stats library. The goal is to keep the external interface simple and versatile; the internal implementation relies on third-party stats libraries such as Prometheus-Cpp or OpenCensus-Cpp.

@jovany-wang so there are two main interfaces of concern here right?
"registry": this is the interface between Ray <-> stats library
"reporter": this is the interface between the stats library <-> backends (e.g., Prometheus)
The proposal raised in this PR is to implement both interfaces in Ray. The alternative I'm suggesting is to use census lib instead, which I believe already implements a reporter/exporter. For the registry side, we can ideally use the census client API directly from Ray, or with a thin wrapper as a compromise.

Another reason we can't use census directly is that the backend is a ray-independent service (rather than an arrow-like, ray-initiated service). You can't assume that the backend supports census or prometheus protocols.

For example, Open-falcon (http://open-falcon.org), an open source monitoring system (4000+ GitHub stars), is used by many companies and does not support the above protocols. When these companies use Ray, they certainly don't want to change their monitoring system to fit Ray.

That's why flink supports a lot of monitoring systems (https://github.com/apache/flink/tree/c3b013b9d0c2e5f4941ddeff18084d090c066440/flink-metrics), such as datadog, ganglia, prometheus, statsd, and so on. So does spark.

Therefore, census can be used as a plugin for the perf counter.

We also realize that the plugin interface layer (registry and reporter) is flexible but too complex, and we are considering simplifying it: combine the reporter with the registry, leaving only the reporter function?

ericl · 2019-01-08T13:23:03Z

The point is, census already *does* support a large number of backends, i.e., datadog, Prometheus, zipkin, x-ray, stackdriver: https://opencensus.io/exporters/supported-exporters/ Rather than implement all of this individually on Ray, why not use a dedicated stats library that will provide these integrations for us? Suppose you need to add openfalcon support: I don't see why adding an openfalcon<>opencensus integration is harder than openfalcon<>ray. That way, we keep the Ray metrics code lightweight with a single *narrow waist* integration. There is no need to individually support a bunch of libraries the way Spark etc. are forced to. That said,

We also realize that the plugin interface layer (registry and reporter) is flexible but too complex and is being considered for simplification: Combine the report to the registry, leaving only the reporter function?

Having a lightweight reporter interface as an additional integration point is reasonable, though we shouldn't merge an interface only (i.e., initial support should have a concrete implementation such as opencensus, and we can have a compile-time flag for other backends for advanced developers, similar to glog support). It would of course be preferable to only support the census client interface though.


ericl · 2019-01-08T13:29:59Z

Another note is that metrics can be reported not only from c++ code but eventually java and python. For example, the python code already supports a form of tracing for the timeline visualization. So it's preferable to rely on library integrations, otherwise backend support will eventually need to be added in each language (or with expensive IPC calls).


micafan · 2019-01-09T08:13:44Z

The point is, census already does support a large number of backends, i.e., datadog, Prometheus, zipkin, x-ray, stackdriver: https://opencensus.io/exporters/supported-exporters/ Rather than implement all of this individually on Ray, why not use a dedicated stats library that will provide these integrations for us? Suppose you need to add openfalcon support: I don't see why adding an openfalcon<>opencensus integration is harder than openfalcon<>ray.

My concern is compatibility. This often depends on the design of the backend.
For instance, I am not sure whether the OpenCensus protocol can be converted to a protocol supported by the Open-Falcon backend.

Having a lightweight reporter interface as an additional integration point is reasonable, though we shouldn't merge an interface only (i.e., initial support should have a concrete implementation such as opencensus, and we can have a compile-time flag for other backends for advanced developers, similar to glog support). It would of course be preferable to only support the census client interface though.

Initial support does have a concrete implementation, based on Prometheus. The backend-related settings are specified by a startup configuration item (instead of a compile-time flag). The startup configuration also includes the gateway address and so on.

I will send out the initial implementation and discuss it further.

AmplabJenkins · 2019-01-09T11:52:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10705/

ericl · 2019-01-09T12:25:36Z

For instance, I am not sure if the OpenCensus protocol can be converted to a protocol supported by open falcon backend.

Can you investigate this? The reporter/exporter example I linked was not that many lines of code to integrate with Prometheus. If it turns out it does convert easily it would save quite a bit of effort.


ericl · 2019-01-09T19:23:45Z

Looks like the main API is this: https://github.com/census-instrumentation/opencensus-cpp/blob/e9a943b244f419eaa122495b96e5c95aa7299cbe/opencensus/stats/stats_exporter.h

  // Registers a new handler. Every few seconds, each registered handler will be
  // called with the present data for each registered view. This should only be
  // called by push exporters' Register() methods.
  static void RegisterPushHandler(std::unique_ptr<Handler> handler);

  // Retrieves current data for all registered views, for implementing pull
  // exporters.
  static std::vector<std::pair<ViewDescriptor, ViewData>> GetViewData();

micafan · 2019-01-10T03:58:50Z

For instance, I am not sure if the OpenCensus protocol can be converted to a protocol supported by open falcon backend. Can you investigate this?

Ok.

The reporter/exporter example I linked was not that many lines of code to integrate with Prometheus. If it turns out it does convert easily it would save quite a bit of effort.

First, the Prometheus reporter here is generic, not customized for census. And:

The census exporter cannot set the reporting period as required, which is really a per-deployment need.
The export interval is defined as a constant here:
https://github.com/census-instrumentation/opencensus-cpp/blob/e9a943b244f419eaa122495b96e5c95aa7299cbe/opencensus/stats/internal/stats_exporter_impl.h

PerfCounter:
https://github.com/ray-project/ray/pull/3664/files#diff-fb977c41c9e61e84ba09103e2c4fd1fc

And there is still work to do, as the OpenCensus comment here notes (URL to export to, user name and password for login, and job name):
// StatsExporter::Handler is the interface for push exporters that export
// recorded data for registered views. The exporter should provide a static
// Register() method that takes any arguments needed by the exporter (e.g. a
// URL to export to) and calls StatsExporter::RegisterHandler itself.
class Handler {
 public:
  virtual ~Handler() = default;
  virtual void ExportViewData(
      const std::vector<std::pair<ViewDescriptor, ViewData>>& data) = 0;
};

PrometheusPushReporter mainly does the two jobs above.


micafan · 2019-01-24T03:37:45Z

@ericl
OpenCensus can support open-falcon protocols in a flexible way.
In fact, Open-Falcon requires that seven fields must be specified (see http://book.open-falcon.org/zh/usage/data-push.html): `metric, endpoint, timestamp, value, step, counterType, tags`. Endpoint is not defined in OpenCensus. When backend is open-falcon, you need to use endpoint as a required tag field.
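
To make the mapping concrete, here is a minimal sketch of turning one tagged sample into an item with those seven fields, lifting the endpoint out of the tag set. `FalconItem`, `ToFalconItem`, and the `endpoint_key` parameter are hypothetical names; the `k1=v1,k2=v2` tag string and the `GAUGE`/`COUNTER` counter types follow my reading of the Open-Falcon push documentation and should be checked against it.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// One data point carrying the seven fields listed above.
struct FalconItem {
  std::string metric;
  std::string endpoint;
  std::int64_t timestamp;    // Unix time in seconds
  double value;
  int step;                  // reporting interval in seconds
  std::string counter_type;  // assumed to be "GAUGE" or "COUNTER"
  std::string tags;          // assumed format: "k1=v1,k2=v2"
};

using Tags = std::map<std::string, std::string>;

// Builds a FalconItem from a tagged sample. The tag named `endpoint_key` is
// moved out of the tag set to fill `endpoint`; the remaining tags are joined.
// If the tag is absent, `endpoint` stays empty and the caller must fill it.
FalconItem ToFalconItem(const std::string &metric, double value, Tags tags,
                        std::int64_t timestamp, int step,
                        const std::string &counter_type,
                        const std::string &endpoint_key = "endpoint") {
  FalconItem item{metric, "", timestamp, value, step, counter_type, ""};
  auto endpoint_it = tags.find(endpoint_key);
  if (endpoint_it != tags.end()) {
    item.endpoint = endpoint_it->second;
    tags.erase(endpoint_it);
  }
  std::ostringstream joined;
  for (auto it = tags.begin(); it != tags.end(); ++it) {
    if (it != tags.begin()) joined << ',';
    joined << it->first << '=' << it->second;
  }
  item.tags = joined.str();
  return item;
}
```

Serializing the item to the JSON body of a push request is left out here, since it depends on the JSON library in use.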

OpenCensus can be used as the only implementation of the Registry, not one of the extended implementations. At the same time, the Reporter is retained as an extensible interface, and provides a default implementation, while allowing users to extend support for their own backend. After all, some backends are not necessarily accepted by OpenCensus, and users may not need this part to be open source. Can we agree?

Retaining the Registry interface can be much simpler and easier to understand than using OpenCensus directly (OpenCensus is much more complex than Prometheus in both interface and implementation). It also spares developers from spending too much time on the details of OpenCensus.

ericl · 2019-01-24T08:35:25Z

@micafan it looks like you are suggesting the following, if I understand correctly:

Ray -> Reporter (new) -> OpenCensusRegistry (new) -> OpenCensus -> Backends  (OSS)
               \
                 > CustomRegistry -> CustomBackend  (proprietary)

This seems ok as a compromise to me, if you're also intending to implement OpenCensusRegistry ;)
However, doing that won't be very useful to you if you're using the custom backend. Basically, what I'm trying to push for here is greater unification between the OSS and non-OSS deployments of Ray.

What I was actually suggesting earlier was:

Ray -> OpenCensus -> Backends   (OSS)
                 \
                   > CustomExporter -> CustomBackend   (proprietary)

As you can see, the number of components for integrating custom backends is about the same, but the additions to Ray are lighter weight. Not to mention, any custom exporter code written will be reusable for other projects and not just for Ray.

The concerns raised above for this design were:

The census exporter cannot set the reporting period according to the requirements

You can read metrics data by calling opencensus::stats::StatsExporter::GetViewData(). The interface is thread-safe and can be called at any time to query view data: https://github.com/census-instrumentation/opencensus-cpp/blob/e9a943b244f419eaa122495b96e5c95aa7299cbe/opencensus/stats/stats_exporter.h#L56
So, in your application/Ray main, you can create a timer that exports at any period you want by querying StatsExporter::GetViewData(). Does that make sense?
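
A minimal sketch of that pull-on-a-timer idea, with the stats library replaced by a plain callback so the example stays self-contained. `PullExporter`, `Tick`, and `ViewSnapshot` are hypothetical names; in a real integration the `pull` callback would wrap the view-data query and `push` would send to the backend.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the (descriptor, data) pairs a stats library would return.
using ViewSnapshot = std::vector<std::pair<std::string, double>>;
using Clock = std::chrono::steady_clock;

class PullExporter {
 public:
  PullExporter(Clock::duration period, std::function<ViewSnapshot()> pull,
               std::function<void(const ViewSnapshot &)> push)
      : period_(period), pull_(std::move(pull)), push_(std::move(push)) {}

  // Meant to be called from the application's own timer or event loop.
  // Exports on the first call, and afterwards only when at least one full
  // period has passed since the last export. Returns true if it exported.
  bool Tick(Clock::time_point now) {
    if (has_exported_ && now - last_export_ < period_) return false;
    push_(pull_());
    last_export_ = now;
    has_exported_ = true;
    return true;
  }

 private:
  Clock::duration period_;
  std::function<ViewSnapshot()> pull_;
  std::function<void(const ViewSnapshot &)> push_;
  Clock::time_point last_export_{};
  bool has_exported_ = false;
};
```

Because the period is a constructor argument, each deployment can pick its own reporting interval without touching the stats library.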

When backend is open-falcon, you need to use endpoint as a required tag field.

This is an issue in any case right? You can also add the endpoint in the custom exporter code.

Retaining the Registry interface can be much simpler and easier to understand than using OpenCensus directly

I don't think this is necessarily the case: OpenCensus is going to be better documented and tested than anything we put into Ray as a one-off. I would much rather merge a PR adding OC integration instead. Happy to help with a proof of concept here; it sounds like you need an example of converting a std::vector<std::pair<ViewDescriptor, ViewData>> into an OpenFalcon push request?

micafan · 2019-01-24T12:33:04Z

@micafan it looks like you are suggesting the following, if I understand correctly:
Ray -> Reporter (new) -> OpenCensusRegistry (new) -> OpenCensus -> Backends  (OSS)
               \
                 > CustomRegistry -> CustomBackend  (proprietary)

No, what I suggest is:

Ray -> PerfCounter -> OpenCensusRegistry -> OpenCensus 
                \ 
                 ReporterInterface  -> PrometheusReporter -> OpenCensus -> Backend
                                                  \   or         
                                                   -> CustomReporter() -> CustomBackend

This seems ok as a compromise to me, if you're also intending to implement OpenCensusRegistry ;)

In fact, I implemented a version of OpenCensusRegistry three months ago, and I plan to continue (this time I want to simplify the overall structure of the perf counter). Check it here:
https://github.com/ray-project/ray/tree/e1817fe1fb99e09b6335dcd393e4977ef883fdc5/src/ray/metrics/registry

However, doing that won't be very useful to you if you're using the custom backend. Basically, what I'm trying to push for here is greater unification between the OSS and non-OSS deployments of Ray.

What I was actually suggesting earlier was:
Ray -> OpenCensus -> Backends   (OSS)
                 \
                   > CustomExporter -> CustomBackend   (proprietary)

I prefer this way if you insist:

Ray -> PerfCounter -> OpenCensus -> Backends   (OSS)
                        \
                          > CustomExporter -> CustomBackend   (proprietary)

As you can see, the number of components for integrating custom backends is about the same, but the additions to Ray are lighter weight. Not to mention, any custom exporter code written will be reusable for other projects and not just for Ray.

The concerns raised above for this design were:

The census exporter cannot set the reporting period according to the requirements

You can read metrics data by calling opencensus::stats::StatsExporter::GetViewData(). The interface is thread-safe and can be called at any time to query view data: https://github.com/census-instrumentation/opencensus-cpp/blob/e9a943b244f419eaa122495b96e5c95aa7299cbe/opencensus/stats/stats_exporter.h#L56
So, in your application/Ray main, you can create a timer that exports at any period you want by querying StatsExporter::GetViewData(). Does that make sense?

I was just explaining why I didn't use opencensus::stats::StatsExporter directly. In fact, I already used it this way (line 104): https://github.com/ray-project/ray/blob/e1817fe1fb99e09b6335dcd393e4977ef883fdc5/src/ray/metrics/registry/open_census_metrics_registry.cc

When backend is open-falcon, you need to use endpoint as a required tag field.

This is an issue in any case right? You can also add the endpoint in the custom exporter code.

I agree. As I said before, OpenCensus's protocol can be converted to Open-Falcon's protocol. Implementing an OpenFalconExporter can solve this problem.

Retaining the Registry interface can be much simpler and easier to understand than using OpenCensus directly

What I mean is that OpenCensus's API is a little complicated. I highly recommend using it through a wrapper instead of directly.

ericl · 2019-01-25T01:05:30Z

Thanks for clarifying.

So this plan seems OK with me. I like it since it minimizes the number of new interfaces we are adding:

Ray -> PerfCounter -> OpenCensus -> Backends   (OSS)
                        \
                          > CustomExporter -> CustomBackend   (proprietary)

The only new interface is PerfCounter. To make sure, PerfCounter here will be a simple class with perhaps inc(amount, tags) right? So something like:

class PerfCounter(name) { def inc(amount, tags) }
PerfCounter metric1 = new PerfCounter("foo");
PerfCounter metric2 = new PerfCounter("foo2");

metric1.inc(1, {});
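
In C++ the pseudocode above might look roughly like the sketch below. It is only an illustration of the shape: `SetSink` and the `Sink` callback are hypothetical, standing in for whatever actually records the value (for example a call into the stats library), and tags are modeled as a plain string map.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

using Tags = std::map<std::string, std::string>;

// Thin named counter: each instance owns one metric name and forwards every
// increment to a process-wide sink. The sink is the single place where a
// stats library (or a custom exporter hook) would be plugged in.
class PerfCounter {
 public:
  using Sink = std::function<void(const std::string &name, double amount, const Tags &tags)>;

  // Installs the recording backend. Hypothetical hook, not part of any library.
  static void SetSink(Sink sink) { GetSink() = std::move(sink); }

  explicit PerfCounter(std::string name) : name_(std::move(name)) {}

  // Records `amount` for this metric; a no-op until a sink is installed.
  void Inc(double amount, const Tags &tags = Tags{}) const {
    const Sink &sink = GetSink();
    if (sink) sink(name_, amount, tags);
  }

 private:
  static Sink &GetSink() {
    static Sink sink;
    return sink;
  }

  std::string name_;
};
```

Usage mirrors the snippet above: construct `PerfCounter metric1("foo");` once and call `metric1.Inc(1, {});` wherever the event happens.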

I think for more complex metrics (such as histograms, etc.) Ray should call OpenCensus directly to avoid the growth of a bunch of wrapper code:

     ------------ (more complex metrics)
    /                \
Ray -> PerfCounter -> OpenCensus -> Backends   (OSS)
                        \
                          > CustomExporter -> CustomBackend   (proprietary)

What I mean is that OpenCensus's API is a little complicated. I highly recommend using it through a wrapper instead of directly.

I think we can agree to disagree here. To me it seems fairly clear.

Define your metrics and views.
Call recordUsage.

I would avoid wrappers whenever possible, since it is adding more (unnecessary) layers of indirection.

micafan · 2019-01-25T03:48:42Z

Thanks for clarifying.

So this plan seems OK with me. I like it since it minimizes the number of new interfaces we are adding:
Ray -> PerfCounter -> OpenCensus -> Backends   (OSS)
                        \
                          > CustomExporter -> CustomBackend   (proprietary)
The only new interface is PerfCounter. To make sure, PerfCounter here will be a simple class with perhaps inc(amount, tags) right? So something like:
class PerfCounter(name) { def inc(amount, tags) }
PerfCounter metric1 = new PerfCounter("foo");
PerfCounter metric2 = new PerfCounter("foo2");

metric1.inc(1, {});

PerfCounter's interface (defined here):

class PerfCounter final {
 public:
  static PerfCounter *GetInstance();

  /// Update counter metric.
  ///
  /// \param metrics_name The name of the metric that we want to update.
  /// \param value The value that we want to update to.
  /// \param tags The tags that we want to attach to.
  void UpdateCounter(const std::string &metrics_name, double value,
                     const Tags &tags = Tags{});

  /// Update gauge metric.
  ///
  /// \param metrics_name The name of the metric that we want to update.
  /// \param value The value that we want to update to.
  /// \param tags The tags that we want to attach to.
  void UpdateGauge(const std::string &metrics_name, double value,
                   const Tags &tags = Tags{});

  /// Update histogram metric.
  /// The reasonable range of fluctuation is [min_value, max_value].
  /// Values outside the range are still counted, but with lower accuracy.
  ///
  /// \param metrics_name The name of the metric that we want to update.
  /// \param value The value that we want to update to.
  /// \param min_value The minimum value that can be specified.
  /// \param max_value The maximum value that can be specified.
  /// \param tags The tags that we want to attach to.
  void UpdateHistogram(const std::string &metrics_name, double value, double min_value,
                       double max_value, const Tags &tags = Tags{});
  ...

I think for more complex metrics (such as histograms, etc.) Ray should call OpenCensus directly to avoid the growth of a bunch of wrapper code:
     ------------ (more complex metrics)
    /                \
Ray -> PerfCounter -> OpenCensus -> Backends   (OSS)
                        \
                          > CustomExporter -> CustomBackend   (proprietary)

Using PerfCounter::UpdateHistogram(...) seems OK. Can you take the time to review PerfCounter.h and PerfCounter.cc?

Another issue with OpenCensus is:

 ViewDescriptor
  // Sets the name of the ViewDescriptor. Names must be unique within the
  // library; it is recommended that it be in the format "<domain>/<path>",
  // where "<path>" uniquely specifies the measure, aggregation, and columns
  // (e.g. "example.com/Foo/FooUsage-sum-key1-key2").
  ViewDescriptor& set_name(absl::string_view name);

If a dimension is added, the View name changes accordingly. Any front-end monitoring view configured against the old name becomes invalid and has to be reconfigured.

Prometheus treats dimension information only as an attribute field, not as part of the name, which is friendlier.
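
As a rough illustration of this naming concern, the sketch below keeps the metric name as the only identifier and stores dimensions as label sets underneath it, so adding a dimension creates a new series under the same name instead of a new name. It is a self-contained toy, not Prometheus or OpenCensus code; `SeriesStore` and `Labels` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical types for illustration only; not part of OpenCensus or Prometheus.
using Labels = std::map<std::string, std::string>;

class SeriesStore {
 public:
  // Adds `value` to the series identified by (name, labels).
  void Add(const std::string &name, const Labels &labels, double value) {
    assert(!name.empty());
    series_[name][labels] += value;
  }

  // Number of distinct label sets recorded under `name`.
  std::size_t SeriesCount(const std::string &name) const {
    auto it = series_.find(name);
    return it == series_.end() ? 0 : it->second.size();
  }

  // Sum over all label sets recorded under `name`.
  double Total(const std::string &name) const {
    double total = 0;
    auto it = series_.find(name);
    if (it == series_.end()) return total;
    for (const auto &entry : it->second) total += entry.second;
    return total;
  }

 private:
  // Outer key: metric name. Inner key: label set attached to that name.
  std::map<std::string, std::map<Labels, double>> series_;
};
```

With this layout, recording a.count once with {host} and once with {host, job} yields two series under the single name a.count, so a dashboard keyed on the name keeps working.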

ericl · 2019-01-25T04:20:32Z

If I understand correctly, you just want to intercept the stats at the point of recording, right? Why not just:

class PerfCounter final {
 public:
  PerfCounter(opencensus::stats::Measure &measure);
  void Record(double value, opencensus::tags::TagMap tags = NULL);

  // and then some static hooks to intercept Record() calls for custom exporter
  static void RegisterCallback(std::function<void(Measure, double, TagMap)> callback);
}

You can define histo aggregation etc in your custom exporter, which does not need to be open sourced. For OSS, we will use census histogram aggregation defined in the measure.
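
A minimal sketch of this interception idea, using only the standard library: `Measure` and `TagMap` below are simplified stand-ins for the OpenCensus types so the example compiles on its own, and the forwarding to OpenCensus itself is left out. It is not thread-safe and only shows how a static callback hook can observe every Record() call.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for opencensus::stats::Measure and opencensus::tags::TagMap so the
// sketch compiles without the library; the real types differ.
struct Measure {
  std::string name;
};
using TagMap = std::map<std::string, std::string>;

class PerfCounter final {
 public:
  using Callback = std::function<void(const Measure &, double, const TagMap &)>;

  explicit PerfCounter(Measure measure) : measure_(std::move(measure)) {
    assert(!measure_.name.empty());
  }

  // Forwards every recorded value to the registered callbacks. In the proposal
  // this is also where the value would be handed to OpenCensus (omitted here).
  void Record(double value, const TagMap &tags = TagMap{}) const {
    for (const auto &callback : Callbacks()) callback(measure_, value, tags);
  }

  // Static hook through which a custom exporter observes Record() calls.
  static void RegisterCallback(Callback callback) {
    Callbacks().push_back(std::move(callback));
  }

 private:
  // Function-local static avoids static initialization order problems.
  static std::vector<Callback> &Callbacks() {
    static std::vector<Callback> callbacks;
    return callbacks;
  }

  Measure measure_;
};
```

A custom exporter would register one callback at startup and do its own aggregation there; call sites only ever see Record().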

If a dimension is added, the View name changes accordingly. Any front-end monitoring view configured against the old name becomes invalid and has to be reconfigured.

Can't you call set_name() with a custom name? The comment is just a suggestion to include the dimension info, but not required.

micafan · 2019-01-25T04:40:50Z

If I understand correctly, you just want to intercept the stats at the point of recording, right? Why not just:
class PerfCounter final {
 public:
  PerfCounter(opencensus::stats::Measure &measure);
  void Record(double value, opencensus::tags::TagMap tags = NULL);
}

For the user, providing the metric name information is sufficient. Is it better to let users focus on the information they need to count, rather than how to count them?
Why use OpenCensus::Measure? It is also responsible for creating Measure for this purpose. Similarly, OpenCensus::tags::TagMap requires registration and maintenance. It doesn't make sense to repeat these steps each time you create a metric.

You can define histo aggregation etc in your custom exporter, which does not need to be open sourced. For OSS, we will use census histogram aggregation defined in the measure.

If a dimension is added, the View name changes accordingly. Any front-end monitoring view configured against the old name becomes invalid and has to be reconfigured.

Can't you call set_name() with a custom name? The comment is just a suggestion to include the dimension info, but not required.

How do we ensure the view name stays unique if tags need to be added dynamically at runtime? For example:
View 1 of a.count has 3 tag keys, while view 2 has 4 tag keys.

ericl · 2019-01-25T04:47:45Z

For the user, providing the metric name information is sufficient. Is it better to let users focus on the information they need to count, rather than how to count them?

You only need to pass the full info to declare a metric. The call to Record() only takes the value and tags, not the "how" info.

// Declare a metric:
PerfCounter num_redis_lookups = PerfCounter(...);

// use it
num_redis_lookups.Record(1);

// register custom exporter
PerfCounter.registerCallback(void (const measure &Measure, double value, TagMap& tags) {
   // send wherever you want
});

Why use OpenCensus::Measure? It is also responsible for creating Measure for this purpose. Similarly, OpenCensus::tags::TagMap requires registration and maintenance. It doesn't make sense to repeat these steps each time you create a metric.

I would like to keep the open source metrics from diverging from custom implementations. Hence the requirement to declare them in OpenCensus language. By the way, you don't need to declare a tag map. Just your metrics and views.

Also, one thing to note is that a big part of maintaining metrics systems is keeping them up to date. It's a good thing to force a metric to be documented by declaring it up front. That way, we can have a file with all the available metrics. It will be more work to add a metric, but it will be easier for other developers to maintain.

How do we ensure the view name stays unique if tags need to be added dynamically at runtime? For example: View 1 of a.count has 3 tag keys, while view 2 has 4 tag keys.

Why would you need to do this? Just declare a view1 and view2 up front with all the tags you want. These could be parsed from a config file as well.
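
One way to picture up-front declaration is a registry that refuses anything not declared. The sketch below is hypothetical (`DeclaredMetrics` is not an OpenCensus or Ray API); it assumes each metric is declared once with a description and the full set of tag keys, which is what would let the declarations double as documentation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical registry for illustration; not an OpenCensus or Ray API.
class DeclaredMetrics {
 public:
  // Declares a metric together with the complete set of tag keys it accepts.
  void Declare(const std::string &name, const std::string &description,
               const std::set<std::string> &tag_keys) {
    assert(!name.empty());
    if (!metrics_.emplace(name, Spec{description, tag_keys, 0.0}).second) {
      throw std::invalid_argument("metric declared twice: " + name);
    }
  }

  // Records a value; undeclared metrics and undeclared tag keys are rejected.
  void Record(const std::string &name, double value,
              const std::map<std::string, std::string> &tags = {}) {
    auto it = metrics_.find(name);
    if (it == metrics_.end()) {
      throw std::invalid_argument("undeclared metric: " + name);
    }
    for (const auto &tag : tags) {
      if (it->second.tag_keys.count(tag.first) == 0) {
        throw std::invalid_argument("undeclared tag key: " + tag.first);
      }
    }
    it->second.total += value;
  }

  // Returns the accumulated value; throws std::out_of_range if undeclared.
  double Total(const std::string &name) const { return metrics_.at(name).total; }

 private:
  struct Spec {
    std::string description;
    std::set<std::string> tag_keys;
    double total;
  };
  std::map<std::string, Spec> metrics_;
};
```

The Declare() calls could equally be generated from a config file, as suggested above; the point is only that every name and tag key exists in one place before it is used.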

micafan · 2019-01-25T07:50:51Z

For the user, providing the metric name information is sufficient. Is it better to let users focus on the information they need to count, rather than how to count them?

You only need to pass the full info to declare a metric. The call to Record() only takes the value and tags, not the "how" info.

But this is more convenient and simpler to use. It takes only one line of code, and the line always looks the same:

UpdateCounter("actor_task_count", 10, {{"host", "11.12.10.10"}...});

This is very useful:

Adding a metric takes much less work.
Updating a metric from several classes is much simpler.
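
For comparison, a minimal sketch of the string-keyed style described above. It assumes `Tags` is a string-to-string map and that a counter accumulates the values passed to it; neither is confirmed by the snippet in this thread, and `CounterValue` is a hypothetical accessor added so the result can be inspected.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Assumption: Tags is a string-to-string map; the thread does not show its definition.
using Tags = std::map<std::string, std::string>;

class PerfCounter final {
 public:
  static PerfCounter *GetInstance() {
    static PerfCounter instance;
    return &instance;
  }

  // Adds `value` to the counter identified by (metrics_name, tags),
  // creating the counter on first use. No prior declaration is needed.
  void UpdateCounter(const std::string &metrics_name, double value,
                     const Tags &tags = Tags{}) {
    assert(!metrics_name.empty());
    std::lock_guard<std::mutex> lock(mutex_);
    counters_[std::make_pair(metrics_name, tags)] += value;
  }

  // Hypothetical accessor: current value, or 0 if the counter was never updated.
  double CounterValue(const std::string &metrics_name,
                      const Tags &tags = Tags{}) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = counters_.find(std::make_pair(metrics_name, tags));
    return it == counters_.end() ? 0.0 : it->second;
  }

 private:
  PerfCounter() = default;

  mutable std::mutex mutex_;
  std::map<std::pair<std::string, Tags>, double> counters_;
};
```

Any class can then call PerfCounter::GetInstance()->UpdateCounter(...) without holding a metric object, which is the convenience being argued for; the cost is that a misspelled name silently creates a new counter.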

// Declare a metric:
PerfCounter num_redis_lookups = PerfCounter(...);
// use it
num_redis_lookups.Record(1);
// register custom exporter
PerfCounter.registerCallback(void (const measure &Measure, double value, TagMap& tags) {
// send wherever you want
});

With this approach, you need to "declare a metric" first and hold "a PerfCounter object", so that you can call "Record" wherever you want, right?

Why use OpenCensus::Measure? It is also responsible for creating Measure for this purpose. Similarly, OpenCensus::tags::TagMap requires registration and maintenance. It doesn't make sense to repeat these steps each time you create a metric.

I would like to keep the open source metrics from diverging from custom implementations. Hence the requirement to declare them in OpenCensus language. By the way, you don't need to declare a tag map. Just your metrics and views.

Also, one thing to note is that a big part of maintaining metrics systems is keeping them up to date. It's a good thing to force a metric to be documented by declaring it up front. That way, we can have a file with all the available metrics. It will be more work to add a metric, but it will be easier for other developers to maintain.

How do we ensure the view name stays unique if tags need to be added dynamically at runtime? For example: View 1 of a.count has 3 tag keys, while view 2 has 4 tag keys.

Why would you need to do this? Just declare a view1 and view2 up front with all the tags you want. These could be parsed from a config file as well.

In the long run, this may happen, which is why Prometheus supports this feature.

ericl · 2019-01-25T07:58:02Z

That's right, you need to declare a metric up front.

micafan · 2019-01-25T07:58:55Z

// register custom exporter
PerfCounter.registerCallback(void (const measure &Measure, double value, TagMap& tags) {
   // send wherever you want
});

Too complicated.
I think using metrics should be as simple as using logs.

micafan · 2019-01-25T08:00:07Z

That's right, you need to declare a metric up front.

Too complicated.
I think using metrics should be as simple as using logs.

ericl · 2019-01-25T08:30:20Z

I'm sorry, but I don't think we should be merging this much code just to simplify adding a counter. It's just not that important that "adding a metric should be like logging".

I've seen a lot of projects go with the "dynamic metrics" option, it is unquestionably hard to maintain. Spark is one example, you can hardly rely on SQL metrics to work properly, if ever! On the other hand projects that require explicit declaration (i.e., documentation) of metrics tend to do better in having new developers understand what is going on.

I think the proposed alternative meets all your requirements and adds much less code to Ray.

micafan · 2019-01-25T08:52:01Z

I'm sorry, but I don't think we should be merging this much code just to simplify adding a counter. It's just not that important that "adding a metric should be like logging".

I've seen a lot of projects go with the "dynamic metrics" option, it is unquestionably hard to maintain. Spark is one example, you can hardly rely on SQL metrics to work properly, if ever! On the other hand projects that require explicit declaration (i.e., documentation) of metrics tend to do better in having new developers understand what is going on.

I think the proposed alternative meets all your requirements and adds much less code to Ray.

The solution I advocate does not add a lot of code, about a few hundred lines, but it is very practical and cost-effective.

Both options are feasible. It's just that with the solution you propose, the amount of code multiplies as the number of metrics grows.

micafan · 2019-01-25T09:36:42Z

I'm sorry, but I don't think we should be merging this much code just to simplify adding a counter. It's just not that important that "adding a metric should be like logging".

I've seen a lot of projects go with the "dynamic metrics" option, it is unquestionably hard to maintain. Spark is one example, you can hardly rely on SQL metrics to work properly, if ever! On the other hand projects that require explicit declaration (i.e., documentation) of metrics tend to do better in having new developers understand what is going on.

I think the proposed alternative meets all your requirements and adds much less code to Ray.

PerfCounter could also provide a Register interface alongside the Update interface, which would support registering all metrics in the same file.
But I don't think this is really necessary. We can easily list all metrics with commands like grep and uniq chained through pipes. And even if you know the metric name and declaration, sometimes you still have to look at the code to confirm.

AmplabJenkins · 2019-03-11T07:10:46Z

Can one of the admins verify this patch?

AmplabJenkins · 2019-03-11T21:49:45Z

Can one of the admins verify this patch?

raulchen · 2019-03-21T03:38:06Z

replaced by #4246

minmin.fmm added 3 commits January 7, 2019 17:18

Add Perf Counter Classes

f6f8ee9

add User interface PerfCounter

b133560

fix lint error

f9de58a

micafan force-pushed the add_perf_counter_classes branch from 13823df to f9de58a Compare January 7, 2019 10:16

jovany-wang requested a review from raulchen January 8, 2019 11:22

add Prometheus implementation

a0950e3

micafan changed the title ~~Add Perf Counter Classes~~ Add Perf Counter Jan 10, 2019

ericl mentioned this pull request Jan 25, 2019

Integrate metrics and tracing library #3858

Closed

5 tasks

guoyuhong mentioned this pull request Jan 28, 2019

Separate different libs for ray backend code. #3879

Closed

7 tasks

jovany-wang mentioned this pull request Mar 21, 2019

Integrate metrics #4246

Merged

raulchen closed this Mar 21, 2019
