
no data on Functions explorer dashboard with Rust app #66

Open
lcmgh opened this issue Jun 28, 2023 · 23 comments

@lcmgh

lcmgh commented Jun 28, 2023

First of all, thanks for making life easier :) I just gave it a first try and annotated a fn in my Rust app with autometrics, but no data is shown in the dashboard. Also, all quantile metrics are set to 0 even though the fn got some calls.

Version: autometrics = { version = "0.5.0", features = ["metrics"] }

  1. Data from my app's Prometheus endpoint
# TYPE function_calls_duration summary
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.5"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.9"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.95"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.99"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.999"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="1"} 0
function_calls_duration_sum{function="batch_insert",module="myapp::mymod::dao"} 0.01188134
function_calls_duration_count{function="batch_insert",module="myapp::mymod::dao"} 3
  2. My data is scraped
  3. The dashboard shows no data

[Screenshot: Functions explorer dashboard panels showing no data]

@gagbo
Member

gagbo commented Jun 28, 2023

Hello,

Thanks for the detailed bug report. I don't think I can take a deeper look at it before the end of the week, but here is a preliminary check I can do:

  • It's weird that the function_calls_duration metric is a summary instead of a histogram (not your fault; since we set the exporter up with buckets here, it might actually be a bug/issue in the metrics-exporter-prometheus lib we use for you here). To be clear, that's why the dashboard is failing: the queries are written for a histogram metric, not a summary one (see the sketch below).
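For context, here is a rough sketch of why a summary breaks the latency panels (illustrative only; the real dashboard query is more involved): the percentile panels need the per-bucket function_calls_duration_bucket series, which a summary never exports.

histogram_quantile(
  0.99,
  sum by (le, function, module) (
    rate(function_calls_duration_bucket[5m])
  )
)
# A summary exposes precomputed quantiles instead of _bucket series,
# so a query of this shape matches nothing and the panel stays empty.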

Is there any specific reason you chose the metrics feature?

Can you try to use the prometheus feature instead and tell us if you have the same issue? If you don't have the issue anymore, I'll create an issue from here to autometrics-rs as it's a bug with the metrics feature. If you still have the issue, can you show the data sent by the Prometheus endpoint again?

@lcmgh
Author

lcmgh commented Jun 28, 2023

Hi @gagbo! Thanks for checking back. I used the metrics feature because my app already has a Prometheus exporter endpoint.

use metrics_exporter_prometheus::PrometheusBuilder;
let prom = PrometheusBuilder::new();
prom
    .install()
    .expect("failed to install recorder/exporter");

I wasn't sure whether that feature exposes the Prometheus endpoint on a port of its own, and I didn't want any interference there.

Switching to prometheus makes the autometrics metrics no longer appear on my existing exporter endpoint.

After switching back to metrics, the output is similar to before:

function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.5"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.9"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.95"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.99"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.999"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="1"} 0
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.012257653
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3

@gagbo
Member

gagbo commented Jun 28, 2023

That's weird too (that using the prometheus feature alone doesn't work with the existing metrics setup). Are you using the default registry of the Prometheus client library, or a separate custom one?

For the record, the prometheus feature doesn't try to expose a scraping endpoint; the prometheus-exporter feature does. It's currently documented under "For projects already using custom Prometheus metrics" in the README; I'm not sure yet what we can do to make that clearer.

To get back on topic, can you try changing your Prometheus exporter initialization:

use metrics_exporter_prometheus::PrometheusBuilder;
let prom = PrometheusBuilder::new();
prom
+    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");

and see if it solves at least [1] the "function_calls_duration as summary" issue?

Footnotes

  1. I think this might be incompatible with your existing metrics (because this apparently forces all histograms to be exported as histograms instead of summaries), but if that's the case I'd rather handle it in a separate issue.

@lcmgh
Author

lcmgh commented Jun 28, 2023

I have applied your changes and the metrics endpoint now gives:

function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 2
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 3
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.013500508000000001
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3

@gagbo
Member

gagbo commented Jun 28, 2023

Yeah, that's better; at least the duration-related metrics should work now. I still have two questions:

  • Do you also see a function_calls_count metric being exposed?
  • Is the function-specific dashboard showing data, at least in the latency graph?

@lcmgh
Author

lcmgh commented Jun 29, 2023

  • There is indeed a function_calls_count metric (I forgot to include it in my previous posts):
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="batch_insert",module="myapp::mod::dao",caller="",result="ok"} 1

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration histogram
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 1
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.002298264
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 1
  • The dashboard still shows no data in all three panels.

This is with the code from #66 (comment) applied.

@gagbo
Member

gagbo commented Jun 29, 2023

I see that the output of the scraping endpoint looks more like what I'm expecting, so this part is solved.

I'm trying to understand what happened to the dashboard, but I think I'll have to test this and reproduce to understand the issue.

  • The fact that the dashboard proposes the batch_insert function name as input shows that at least the part extracting valid function labels is working fine (so the dashboard is properly connected to the Prometheus data source and can see the autometrics metrics).
  • The fact that none of the graphs show data makes me think that the issue might be with our build_info metric and how we use it in the query that generates the graphs.

Also, the sampling interval used to draw the graphs is dynamic and changes with the timespan shown. Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter, like "Last 5 minutes"? I hope that will give enough points (depending on your scrape_interval) to show a curve.

@lcmgh
Author

lcmgh commented Jun 30, 2023

Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter like "Last 5 minutes"?

Oh, actually I now see some lines in the Latency (95th and 99th Percentile) panel, but still no data in the other ones. The others even show an error, but only when going back 7 days:

execution: found duplicate series for the match group {instance="xxxxx:10250", job="kubelet"} on the right hand-side of the operation: [{__name__="up", endpoint="https-metrics", instance="xxxx:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="kube-system", node="xxxxxxxxxxx", service="prometheus-operator-kubelet"}, {__name__="up", endpoint="https-metrics", instance="xxxx", job="kubelet", metrics_path="/metrics/probes", namespace="kube-system", node="xxxxx", service="prometheus-operator-kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side

The fact that none of the graphs show data makes me think that the issue might be with our build_info metric and how we use it in the query that generates the graphs.

build_info{commit="",version="1.1.7",branch=""} 1

@gagbo
Member

gagbo commented Jul 4, 2023

The issue might be specifically triggered by your setup then; it will be hard for me to reproduce it.

Can you try to:

  • open the function-level dashboard,
  • edit the query for the success rate (or call rate) graph
  • edit all the on (instance, job) clauses and replace them with on (instance, job, metrics_path) (see the sketch below)

and see if that works? I don't understand how you can have (instance, job) pairs that are not unique, but it looks like that's what the error message is saying.
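For reference, a rough sketch of the shape of the join those queries do (the exact query in the dashboard is more involved, so treat this as illustrative only); the edit is just the extra label in the on() clause:

sum by (function, module, version) (
  rate(function_calls_count[$__rate_interval])
  # was: * on (instance, job) group_left(version) ...
  * on (instance, job, metrics_path) group_left(version)
  last_over_time(build_info[$__rate_interval])
)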

@lcmgh
Author

lcmgh commented Jul 4, 2023

and see if that works?

Yes, this made it work! The legend says {function="batch_insert", module="myapp::mod::dao", version="1.1.7"}.

@gagbo
Member

gagbo commented Jul 4, 2023

Thanks for bearing with me! We always assumed that (instance, job) would be enough to uniquely identify scrape jobs, but it looks like that assumption was wrong :) Now we just need to make sure that the metrics_path label is a built-in default label, present in all versions of Prometheus that matter to our use cases. Thanks again :)

@lcmgh
Author

lcmgh commented Jul 4, 2023

It's me who needs to be thankful! I'll just adapt my dashboard and my teammates will be ready to enjoy the new dashboard :) Awesome!

@lcmgh
Author

lcmgh commented Jul 4, 2023

Huh, I noticed another thing: the duration of my fn is always shown as 0. I'd expect it to be > 0 when there are data points (hovering over the yellow line also shows 0):

[Screenshot: duration graph where the curve stays at 0]

@gagbo
Member

gagbo commented Jul 4, 2023

Oh right, that's because your function calls probably always last less than the smallest bucket, which is 0.005s (on average you have about 0.0045s according to this comment).

So if you want to have more granularity there, you should change the buckets in your exporter config:

use metrics_exporter_prometheus::PrometheusBuilder;
let prom = PrometheusBuilder::new();
prom
-    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .set_buckets(&[0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5])
    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");

There's no hard requirement on the values in the buckets, just:

  • adding buckets adds data to your Prometheus instance (that's why I removed the top values here, but you can put them back if you want)
  • if you ever want to add latency SLOs later, the latency threshold must be one of the bucket values (so if you want to create an SLO that states "99% of calls in this group must last less than 0.02s", then 0.02 must appear in the buckets; see the sketch below)
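To make the second point concrete, here is a hypothetical SLO check (illustrative only, not how autometrics generates its SLO queries); it can only measure a threshold that exists as a bucket boundary because it reads that bucket directly:

# Fraction of batch_insert calls faster than 0.02s over the last 5 minutes.
# This only works if 0.02 is one of the configured buckets (le="0.02").
  sum(rate(function_calls_duration_bucket{function="batch_insert", le="0.02"}[5m]))
/
  sum(rate(function_calls_duration_count{function="batch_insert"}[5m]))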

@lcmgh
Author

lcmgh commented Jul 4, 2023

Thanks for making this clear to me. In production I will have longer runtimes, so this should be fine :)

@lcmgh
Author

lcmgh commented Jul 18, 2023

I stumbled over another issue. Metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it to work at all.

@gagbo
Member

gagbo commented Jul 24, 2023

I stumbled over another issue. Metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it to work at all.

What is the scrape_interval of your project? Can you try to change the query in the dashboard to replace the last_over_time(build_info[$__rate_interval]) bits with last_over_time(build_info[1h]) (1h, or really anything longer than your scrape_interval) and tell me if that works?
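In other words, only the range selector on build_info changes (shown here in isolation; your copy of the dashboard may wrap it differently):

# before: the range shrinks along with the dashboard's time range
last_over_time(build_info[$__rate_interval])
# after: a fixed range, comfortably longer than the scrape_interval
last_over_time(build_info[1h])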

I'm running into a similar issue just now (I haven't had time to look into this before, sorry), and it might be the solution. It also looks like a commit tried to solve that kind of issue recently, but I don't know whether you tried running that version of the dashboard or whether that was indeed the problem.

@lcmgh
Author

lcmgh commented Jul 24, 2023

My scrape interval is 1m :).

Using https://github.com/autometrics-dev/autometrics-shared/blob/62f4dfd0a52e23874b3925039c17c879ca7aea49/dashboards/Autometrics%20Function%20Explorer.json directly makes my dashboard show no data at all.

@gagbo
Member

gagbo commented Jul 25, 2023

That's too bad, especially the part about getting no data at all; I'd guess there's something we didn't think of when making the queries.

For the record, I just learnt that using $__rate_interval as we do in the dashboard is perfectly fine, as long as you configure the scrape interval of your Prometheus data source in Grafana to match the 1m you have:

[Screenshot: settings screen for a Prometheus data source in Grafana, showing the scrape interval field]

Source of the screenshot with the explanation

Other than that, I wasn't able to reproduce the issue locally, so I'll watch our staging environment to see if the issue shows up there, so I can debug it hands-on.

EDIT: the only way I was able to reproduce it was to have too short a timespan for the ranges in the queries. So I think that configuring the scrape interval in your Grafana "Prometheus data source" settings is going to fix most of the issues.

Gory details of my current root cause hypothesis below:

$__rate_interval is the maximum of 4 * scrape_interval and interval (the automatic interval given the time range of the graph). When you use "Last 7 days" as the range of the graph, interval became 5 minutes in my Grafana, which is comfortably more than what a 1-minute scrape interval needs. When you go lower than that, interval decreases and 4 * scrape_interval is supposed to be the anchor value. With the default of 15s, the default "anchor" is 1 minute, which means all the rate computations the dashboard does only have one point in time to guess a rate from; therefore, the dashboard can't compute rates and shows "No data". (And we never hit that issue internally because we set the interval programmatically.)
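To put numbers on it (a sketch using the formula described above and the intervals from this thread):

# Per the description above: $__rate_interval = max(interval, 4 * scrape_interval)
#
# Grafana data source scrape interval left at the default 15s, Prometheus scraping every 60s:
#   max(interval, 4 * 15s) = 60s on short time ranges
#   -> a 60s window over samples spaced 60s apart holds ~1 point, rate() has no slope to compute: "No data"
# Grafana data source scrape interval set to 60s:
#   max(interval, 4 * 60s) = 240s
#   -> every window spans ~4 samples and the panels fill in
rate(function_calls_count{function="batch_insert"}[$__rate_interval])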

All this to say: if your Prometheus data source already has 1 minute as the scrape interval in the Grafana settings, I'm stumped again.

@lcmgh
Author

lcmgh commented Jul 28, 2023

I stand corrected: the issue with no data being shown seems to be fixed with the updated query mentioned earlier in this thread.

Still, for one function I have a data point with count zero. I'm not sure if this is because I forgot to set the Prometheus buckets. The actual function call duration is around 12s, so I thought the default settings would be fine.

[Screenshot: dashboard panel showing a single data point with count zero]

Metrics endpoint output:

# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="receive_and_send_rows",module="myservice_sl_server::server",caller=""} 1

# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1

# TYPE health gauge
health{target="Total"} 1

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration summary
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.5"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.9"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.95"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.99"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.999"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="1"} 0
function_calls_duration_sum{function="receive_and_send_rows",module="myservice_sl_server::server"} 13.730152187
function_calls_duration_count{function="receive_and_send_rows",module="myservice_sl_server::server"} 1

Prometheus exporter setup:

PrometheusBuilder::new()
    .install()
    .expect("failed to install recorder/exporter");

@lcmgh
Author

lcmgh commented Jul 28, 2023

When going with the following, no data is shown, but the function appears in my dropdown.

PrometheusBuilder::new()
    .set_buckets(&[
        0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
    ])
    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="read_and_stream_rows",module="myservice_sl_server::db",caller=""} 1

# TYPE health gauge
health{target="Total"} 1

# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration histogram
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.005"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.01"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.025"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.05"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.075"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.25"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.75"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="2.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="7.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="10"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="+Inf"} 1
function_calls_duration_sum{function="read_and_stream_rows",module="myservice_sl_server::db"} 15.281207034
function_calls_duration_count{function="read_and_stream_rows",module="myservice_sl_server::db"} 1

@gagbo
Member

gagbo commented Jul 28, 2023

Still, for one function I have a data point with count zero. I'm not sure if this is because I forgot to set the Prometheus buckets.

No, don't worry about this. To help with metric discovery, the library initializes the counters of autometricized functions at startup in debug builds. That allows the dashboards to show data even if nothing has happened yet.

When going with the following, no data is shown but the function appears in my dropdown.

Did you check the settings for the Prometheus data source in Grafana? I still think that changing the scrape interval there to 1m (or 60s, if it only takes seconds) like in the screenshot above would help with the "No data" issue.

@lcmgh
Author

lcmgh commented Oct 26, 2023

Our Prometheus uses a scrape interval of 60s.
