-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Perf Counter #3664
Add Perf Counter #3664
Conversation
What's the intended backend for this interface? Why not use https://github.com/census-instrumentation/opencensus-cpp which will enable many backends to be targeted? |
Test FAILed. |
These interface classes are more general and abstract encapsulation.
We can implement OpenCensusRegistry based on OpenCensus to make it a more concise interface. Direct use of OpenCensus stats interface is too cumbersome. OpenCensus use cases(a bit complicated): https://github.com/census-instrumentation/opencensus-cpp/blob/master/opencensus/stats/examples/view_and_record_example.cc MetricsRegistryInterface use cases: And also we can implement PrometheusRegistry based on Prometheus. Prometheus is a widely adopted solution. That's why OpenCensus also supports Prometheus protocols. |
I'm not sure I buy the simplicity argument. As I understand it a perf counter would look like this: I see the pros of using a stats library as follows:
The cons seem to be:
Re: compatibility, I understand there are applications already using this API -- but shouldn't this being independent of Ray? Suppose we have Prometheus as a backend, then, Ray can export to prometheus through opencensus, and the app can export directly if it wants, and there are no compatibility concerns. My sense here is that long term it is better for Ray to "support stats export via opencensus" rather than "export via this custom C++ API". The latter may be easier to put into Ray, but since it is basically a public API it will need to be maintained long term. |
Test FAILed. |
13823df
to
f9de58a
Compare
Test PASSed. |
Actually, you needs to register "my_perf_ctr" and register "my_tag_key" before do record(call opencensus::stats::Record)! What if we have more than one tag key? What if the tag key is not static but dynamic? Then you has to check if the tag key already register or not. The usage of PerfCounter is much simpler. No matter tag key is static or dynamic. Even without register. Like blow: |
Right, I agree for dynamic tags the proposed interface can be used more simply without extra registration code. This is imo a niche case though. The main question here is whether it's easier to integrate census, vs writing our own stats library (which is already several hundred lines of code in this pr, and likely to grow with more features). Do you have a sense of the tradeoffs here? I do want to see perf counters in Ray, but I am concerned about adding a bunch of custom code that implements the same thing but in a less general way and with limited backend support. |
@ericl Maybe we can simplify some interfaces to make an balance, like use map instead of our custom tags, use census client directly instead of registry and reporter. |
@jovany-wang so there are two main interfaces of concern here right? The proposal raised in this PR is to implement both interfaces in Ray. The alternative I'm suggesting is to use census lib instead, which I believe already implements a reporter/exporter. For the registry side, we can ideally use the census client API directly from Ray, or with a thin wrapper as a compromise. Btw, I think any stats API should be private to Ray -- backend integrations should be against census exporter (https://opencensus.io/exporters/#language-vs-available-exporters-matrix), not against Ray directly. I think this is different from glog because glog isn't designed to integrate with downstream systems. One possible issue is whether C++ metrics exporter integration is possible (I only see docs for how to implement a C++ trace exporter: https://opencensus.io/exporters/custom-exporter/cpp/tracing/) Update: it looks like there are a couple C++ stats exporter examples, one of which is here: https://github.com/census-instrumentation/opencensus-cpp/blob/master/opencensus/exporters/stats/prometheus/internal/prometheus_exporter.cc |
Of course, this proposal isn't intent to writing our own stats library. The goal is to keep the external interface simple and versatile; the internal implementation relies on third-party statistical libraries such as Prometheus-Cpp or OpenCensus-Cpp.
Another reason we can't use census directly is that the backend is a ray-independent service (rather than an arrow-like, ray-initiated service). You can't assume that the backend supports census or prometheus protocols. For example, Open-falcon (http://open-falcon.org), as an open source monitoring system (github 4000+ star), is used by many companies and does not support the above protocols. When these companies use ray, they certainly don't want to change the monitoring system they use to fit ray. That's why flink supports a lot of monitoring systems (https://github.com/apache/flink/tree/c3b013b9d0c2e5f4941ddeff18084d090c066440/flink-metrics), such as datadog, ganglia, prometheus, statsd, and so on. So does spark. Therefore, census can be used as a plugin for the perf counter. We also realize that the plugin interface layer(regsitry and reporter) is flexible but too complex and is being considered for simplification:Combine the report to the registry, leaving only the reporter function? |
The point is, census already *does* support a large number of backends,
i.e., datadog, Prometheus, zipkin, x-ray, stackdriver:
https://opencensus.io/exporters/supported-exporters/
Rather than implement all of this individually on Ray, why not use a
dedicated stats library that will provide these integrations for us?
Suppose you need to add openfalcon support: I don't see why adding an
openfalcon<>opencensus integration is harder than openfalcon<>ray.
That way, we keep the Ray metrics code lightweight with a single *narrow
waist* integration. No need to have to individually support a bunch of
libraries like Spark, etc are forced too. That said,
We also realize that the plugin interface layer(regsitry and reporter) is
flexible but too complex and is being considered for simplification:Combine
the report to the registry, leaving only the reporter function?
Having a lightweight reporter interface as an additional integration point
is reasonable, though we shouldn't merge an interface only (I.e., initial
support should have a concrete implemention such as opencensus, and we can
have compile time flag for other backends for advanced developers, similar
to glog support). It would of course be preferable to only support the
census client interface though.
…On Tue, Jan 8, 2019, 2:51 AM micafan ***@***.***> wrote:
The main question here is whether it's easier to integrate census, vs
writing our own stats library (which is already several hundred lines of
code in this pr, and likely to grow with more features). Do you have a
sense of the tradeoffs here?
Of course, this proposal isn't intent to writing our own stats library.
The goal is to keep the external interface simple and versatile; the
internal implementation relies on third-party statistical libraries such as
Prometheus-Cpp or OpenCensus-Cpp.
@jovany-wang <https://github.com/jovany-wang> so there are two main
interfaces of concern here right?
"registry": this is the interface between Ray <-> stats library
"reporter": this is the interface between the stats library <-> backends
(e.g., Prometheus)
The proposal raised in this PR is to implement both interfaces in Ray. The
alternative I'm suggesting is to use census lib instead, which I believe
already implements a reporter/exporter. For the registry side, we can
ideally use the census client API directly from Ray, or with a thin wrapper
as a compromise.
Another reason we can't use census directly is that the backend is a
ray-independent service (rather than an arrow-like, ray-initiated service).
You can't assume that the backend supports census or prometheus protocols.
For example, Open-falcon (http://open-falcon.org), as an open source
monitoring system (github 4000+ star), is used by many companies and does
not support the above protocols. When these companies use ray, they
certainly don't want to change the monitoring system they use to fit ray.
That's why flink supports a lot of monitoring systems (
https://github.com/apache/flink/tree/c3b013b9d0c2e5f4941ddeff18084d090c066440/flink-metrics),
such as datadog, ganglia, prometheus, statsd, and so on. So does spark.
Therefore, census can be used as a plugin for the perf counter.
We also realize that the plugin interface layer(regsitry and reporter) is
flexible but too complex and is being considered for simplification:Combine
the report to the registry, leaving only the reporter function?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3664 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAA6SgPMaKOE6o59nJ50mt8uCRAoONIoks5vBHg7gaJpZM4Zkgro>
.
|
Another note is that metrics can be reported not only from c++ code but
eventually java and python. For example, the python code already supports a
form of tracing for the timeline visualization. So it's preferable to rely
on library integrations, otherwise backend support will eventually need to
be added in each language (or with expensive IPC calls).
…On Tue, Jan 8, 2019, 5:22 AM Eric Liang ***@***.***> wrote:
The point is, census already *does* support a large number of backends,
i.e., datadog, Prometheus, zipkin, x-ray, stackdriver:
https://opencensus.io/exporters/supported-exporters/
Rather than implement all of this individually on Ray, why not use a
dedicated stats library that will provide these integrations for us?
Suppose you need to add openfalcon support: I don't see why adding an
openfalcon<>opencensus integration is harder than openfalcon<>ray.
That way, we keep the Ray metrics code lightweight with a single *narrow
waist* integration. No need to have to individually support a bunch of
libraries like Spark, etc are forced too. That said,
> We also realize that the plugin interface layer(regsitry and reporter)
is flexible but too complex and is being considered for
simplification:Combine the report to the registry, leaving only the
reporter function?
Having a lightweight reporter interface as an additional integration point
is reasonable, though we shouldn't merge an interface only (I.e., initial
support should have a concrete implemention such as opencensus, and we can
have compile time flag for other backends for advanced developers, similar
to glog support). It would of course be preferable to only support the
census client interface though.
On Tue, Jan 8, 2019, 2:51 AM micafan ***@***.***> wrote:
> The main question here is whether it's easier to integrate census, vs
> writing our own stats library (which is already several hundred lines of
> code in this pr, and likely to grow with more features). Do you have a
> sense of the tradeoffs here?
>
> Of course, this proposal isn't intent to writing our own stats library.
> The goal is to keep the external interface simple and versatile; the
> internal implementation relies on third-party statistical libraries such as
> Prometheus-Cpp or OpenCensus-Cpp.
>
> @jovany-wang <https://github.com/jovany-wang> so there are two main
> interfaces of concern here right?
> "registry": this is the interface between Ray <-> stats library
> "reporter": this is the interface between the stats library <-> backends
> (e.g., Prometheus)
> The proposal raised in this PR is to implement both interfaces in Ray.
> The alternative I'm suggesting is to use census lib instead, which I
> believe already implements a reporter/exporter. For the registry side, we
> can ideally use the census client API directly from Ray, or with a thin
> wrapper as a compromise.
>
> Another reason we can't use census directly is that the backend is a
> ray-independent service (rather than an arrow-like, ray-initiated service).
> You can't assume that the backend supports census or prometheus protocols.
>
> For example, Open-falcon (http://open-falcon.org), as an open source
> monitoring system (github 4000+ star), is used by many companies and does
> not support the above protocols. When these companies use ray, they
> certainly don't want to change the monitoring system they use to fit ray.
>
> That's why flink supports a lot of monitoring systems (
> https://github.com/apache/flink/tree/c3b013b9d0c2e5f4941ddeff18084d090c066440/flink-metrics),
> such as datadog, ganglia, prometheus, statsd, and so on. So does spark.
>
> Therefore, census can be used as a plugin for the perf counter.
>
> We also realize that the plugin interface layer(regsitry and reporter) is
> flexible but too complex and is being considered for simplification:Combine
> the report to the registry, leaving only the reporter function?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#3664 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAA6SgPMaKOE6o59nJ50mt8uCRAoONIoks5vBHg7gaJpZM4Zkgro>
> .
>
|
My concern is compatibility issues. This often depends on the design of the backend.
Initial support does have a concrete implemention, implemented by Prometheus. The backend related settings are specified by the startup configuration item(instead of compile time flag). The startup configuration item also includes the gateway address and so on. I will send out the initial implementation and discuss it further. |
Test PASSed. |
For instance, i am not sure if the OpenCensus protocol can be converted
to a protocol supported by open falcon backend.
Can you investigate this? The reporter/exporter example I linked was not
that many lines of code to integrate with Prometheus. If it turns out it
does convert easily it would save quite a bunch of efforts.
…On Wed, Jan 9, 2019, 3:52 AM UCB AMPLab ***@***.***> wrote:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10705/
Test PASSed.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3664 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAA6SuoOiqc-zXjwKvzngS60lMW_pm5yks5vBdgDgaJpZM4Zkgro>
.
|
Looks like the main API is this: https://github.com/census-instrumentation/opencensus-cpp/blob/e9a943b244f419eaa122495b96e5c95aa7299cbe/opencensus/stats/stats_exporter.h
|
Ok.
First, by then prometheus reporter is generic, not customizable for census. And:
PerfCounter:
PrometheusPushReporter is mainly doing the above two work.
|
@ericl OpenCensus can be used as the only implementation of the Registry, not one of the extended implementations. At the same time, the Reporter is retained as an extensible interface, and provides a default implementation, while allowing users to extend support for their own backend. After all, some backends are not necessarily accepted by OpenCensus, and users may not need this part to be open source. Can we agree? Retaining the Registry interface can be much simpler and easier to understand than using OpenCensus directly (the OpenCensus is much more complex than the Prometheus interface and implementation). Also avoid developers spending too much time understanding the details of OpenCensus. |
@micafan it looks like you are suggesting the following, if I understand correctly:
This seems ok as a compromise to me, if you're also intending to implement OpenCensusRegistry ;) What I was actually suggesting earlier was:
As you can see, the amount of components for integrating custom backends is about the same, but the additions to Ray are lighter weight. Not to mention, any custom exporter code written will reusable for other projects and not just for Ray. The concerns raised above for this design were:
You can read metrics data by calling
This is an issue in any case right? You can also add the endpoint in the custom exporter code.
I don't think this is necessarily the case, OpenCensus is going to be better documented and tested than anything we put into Ray as a one-off. I would much rather merge a PR adding OC integration instead. Happy to help with a proof of concept here, it sounds like you need an example of converting a |
No, what i suggest is:
In fact, I have implemented a version of OpenCensusRegistry three month ago, and I plan to continue(This time i want make the overall structure of perf counter more simplified). Check it here:
I prefer this way if you insist:
Just explain why didn't use opencensus::stats::StatsExporter directly. In fact, I already used it this way(line 104):https://github.com/ray-project/ray/blob/e1817fe1fb99e09b6335dcd393e4977ef883fdc5/src/ray/metrics/registry/open_census_metrics_registry.cc
I agree. As I said before, OpenCensus' protocal can be converted to OpenFalcon's protocal. Implementing OpenFalconExporter can solve this problem.
What i mean is OpenCensue's api is a little complicated. I highly recommend using it after packaging instead of directly. |
Thanks for clarifying. So this plan seems OK with me. I like it since it minimizes the number of new interfaces we are adding:
The only new interface is
I think for more complex metrics (such as histograms, etc.) Ray should call OpenCensus directly to avoid the growth of a bunch of wrapper code:
I think we can agree to disagree here. For me it is seems fairly clear.
I would avoid wrappers whenever possible, since it is adding more (unnecessary) layers of indirection. |
PerfCounter's interface(Defines here):
Use PerfCounter::UpdateHistogram(...) seems ok. Can you take the time to review PerfCounter.h and PerfCounter.cc? Another issue with OpenCensus is:
If the dimension is added, the View name will change accordingly. The front-end monitoring view configured according to the name is invalid and needs to be reconfigured. Prometheus only uses dimension information as an attribute field, not as part of the name, which is more friendly. |
If I understand you just want to intercept the stats at the point of recording right? Why not just:
You can define histo aggregation etc in your custom exporter, which does not need to be open sourced. For OSS, we will use census histogram aggregation defined in the measure.
Can't you call set_name() with a custom name? The comment is just a suggestion to include the dimension info, but not required. |
For the user, providing the metric name information is sufficient. Is it better to let users focus on the information they need to count, rather than how to count them?
How to ensure the uniqueness of the view name, if there is a need to dynamically add tags during the running process? Such as: |
You only need to pass the full info to declare a metric. The call to Record() only takes the value and tags, not the "how" info.
I would like to avoid the open source metrics from diverging from custom implementations. Hence, the requirement to declare it in OpenCensus language. By the way, you don't need to declare a tag map. Just your metrics and views. Also, one thing to note that a big portion of maintaining metricking systems is keeping them up to date. It's a good thing to force a metric to be documented by declaring it up-front. That way, we can have a file with all the available metrics. It will be more work to add a metric, but it will be easier for other developers to maintain.
Why would you need to do this? Just declare a view1 and view2 up front with all the tags you want. These could be parsed from a config file as well. |
But it's more convenient and simple to use. Only one line of code, and always the similar line:
This is very useful:
In this way, you need to "Declare a metric" first, and hold "a PerfCounter object", so you can call "Record" where ever you want, right?
In the long run, this may happen, which is why Prometheus supports this feature. |
That's right, you need to declare a metric up front. |
Too complicated. |
Too complicated. |
I'm sorry, but I don't think we should be merging this much code just to simplify adding a counter. It's just not that important that "adding a metric should be like logging". I've seen a lot of projects go with the "dynamic metrics" option, it is unquestionably hard to maintain. Spark is one example, you can hardly rely on SQL metrics to work properly, if ever! On the other hand projects that require explicit declaration (i.e., documentation) of metrics tend to do better in having new developers understand what is going on. I think the proposed alternative meets all your requirements and adds much less code to Ray. |
The solution I advocate does not add a lot of code, about a few hundred lines, but it is very practical and cost-effective. Both options are feasible. It's just the solution you provide, the number of metrics increases, and the code multiplies. |
PerfCounter can also include the Register interface and the Update interface. Support for registering all metrics in the same file. |
Can one of the admins verify this patch? |
1 similar comment
Can one of the admins verify this patch? |
replaced by #4246 |
What do these changes do?
With performance counter, we can easy to view service running status and abnormal conditions. This PR is to support perf counter.
I will submit code in several stages. And now is the first stage, submit the base interface class of performance counter.
Introduction:
MetricsRegistryInterface is the base registry class. Registry is used to register and update metrics.
MetricsReporterInterface is the base reporter class. Reporter is used to reporter metrics to any monitor service.
Related issue number