
Istio's Prow should use the new upstream monitoring and alerting stack. #1774

Closed
cjwagner opened this issue Sep 6, 2019 · 3 comments · Fixed by #2563

cjwagner (Member) commented Sep 6, 2019

https://github.com/kubernetes/test-infra/tree/master/prow/cluster/monitoring#monitoring
We'll get lots of nice preconfigured graphs and alerts for free, e.g. https://monitoring.prow.k8s.io
I think we can pretty easily move over anything from https://velodrome.istio.io that is still used too, so that we can turn down velodrome and just have a single monitoring dashboard.

cc @clarketm @fejta @geeknoid @howardjohn

@istio-policy-bot added the lifecycle/needs-triage label Oct 30, 2019
@clarketm added the help wanted label Nov 26, 2019
@clarketm added the prow label Dec 7, 2019
scottilee (Contributor) commented Mar 21, 2020

@cjwagner I'd like to help with this. According to Travis this is mostly a copy-and-paste job to see if it works more or less out of the box, but I had a few questions:

  • It seems like the first step would be to create a "monitor" directory in https://github.com/istio/test-infra/tree/master/prow/cluster and copy everything from https://github.com/kubernetes/test-infra/tree/master/prow/cluster/monitoring into it; let me know if that's wrong.
  • Per the README, update grafana_secret.yaml -- what should I set the password to, and how does Istio handle secrets?
  • Per the README, update `` -- probably set this to the Istio #test-failures channel? And who can I ask about the api_url?
  • Per the README, I should update the annotations in grafana_expose.yaml, since it doesn't look like Istio uses cert-manager?
  • Update the OWNERS file to howardjohn, clarketm, cjwagner? Or should this be renamed to CODEOWNERS?
  • I skimmed through the YAML and there didn't seem to be anything else Kubernetes test-infra specific; let me know if I missed anything though.
  • I could also add more jsonnet for dashboards in regard to Lacking Prow cluster metrics #1638, but I'm not sure if that's better to do after everything is set up.
  • Lastly, how would I go about applying this? I think I would be setting up an Istio Prow job to deploy this, but let me know if that's wrong and whether I need some kind of additional access.

I can also move over the Istio-specific stuff from https://github.com/kubernetes/test-infra/blob/master/velodrome/config.yaml into a new config.yaml in https://github.com/istio/test-infra/tree/master/prow/cluster/velodrome, but I'm not sure what the next steps would be after that.

cjwagner (Member, Author) commented:

> @cjwagner I'd like to help with this. According to Travis this is mostly a copy-and-paste job to see if it works more or less out of the box, but I had a few questions:

Sure, that would be appropriate.

> • Per the README, update grafana_secret.yaml -- what should I set the password to, and how does Istio handle secrets?

You will need to generate a new password. I'm not certain how Istio handles secrets, but if it is the same as the k8s project then you'll need to share the credentials with oncall (or perhaps have them generate the password in this case) and they can load it into the cluster and back it up in a secret store.
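For reference, a minimal sketch of what the updated grafana_secret.yaml could look like, assuming the stack reads the Grafana admin password from a Kubernetes Secret (the resource name, namespace, and key below are illustrative, not necessarily the upstream values):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana              # illustrative; match whatever name the upstream manifests expect
  namespace: prow-monitoring # illustrative namespace
type: Opaque
stringData:
  # Freshly generated password, e.g. via `openssl rand -base64 32`;
  # hand it to oncall so it can be backed up in the secret store.
  admin-password: REPLACE_WITH_GENERATED_PASSWORD
```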

> • Per the README, update `` -- probably set this to the Istio #test-failures channel? And who can I ask about the api_url?

(Assuming you are talking about {{ api_url }}.) I don't know much about that Slack channel, but it sounds appropriate; you could alternatively make a separate channel for these alerts. You'll need to coordinate with a Slack admin to get the api_url (which is a secret value). I don't know who that would be.
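For illustration, the Slack receiver in the Alertmanager config would look roughly like this; this is standard Alertmanager slack_configs syntax, but the receiver name and channel are assumptions:

```yaml
receivers:
- name: slack-test-failures        # hypothetical receiver name
  slack_configs:
  - channel: '#test-failures'      # or a dedicated alerts channel
    # api_url is the secret incoming-webhook URL obtained from a Slack admin;
    # it is the value that replaces the {{ api_url }} placeholder.
    api_url: https://hooks.slack.com/services/REPLACE/ME/TOO
```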

> • Per the README, I should update the annotations in grafana_expose.yaml, since it doesn't look like Istio uses cert-manager?

Yes, you should update the ingress to work with your own domain instead of monitoring.prow.k8s.io and configure TLS termination with whatever tool you prefer. (In this case you should probably use GKE managed certs, like Travis has set up for the Prow and Velodrome instances.)
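A sketch of what that could look like with GKE managed certificates; the domain, resource names, and service port are illustrative, and the API versions are those current at the time:

```yaml
apiVersion: networking.gke.io/v1beta1
kind: ManagedCertificate
metadata:
  name: grafana-cert
spec:
  domains:
  - monitoring.prow.istio.io       # hypothetical domain
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana-expose
  annotations:
    # Ties the GKE-managed certificate above to this ingress
    networking.gke.io/managed-certificates: grafana-cert
spec:
  backend:
    serviceName: grafana           # illustrative service name/port
    servicePort: 3000
```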

> • Update the OWNERS file to howardjohn, clarketm, cjwagner? Or should this be renamed to CODEOWNERS?

Istio uses CODEOWNERS instead of OWNERS, so switch to CODEOWNERS. Please don't include me as an owner of this; I don't intend to maintain this instance.
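That would be a one-line CODEOWNERS entry along these lines (the owner list is a placeholder to be settled in review, per the above):

```
prow/cluster/monitoring/ @howardjohn @clarketm
```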

> • I skimmed through the YAML and there didn't seem to be anything else Kubernetes test-infra specific; let me know if I missed anything though.

There is some k/t-i specific configuration. In particular, some of the alerts are only appropriate for, or are customized for, k/t-i; e.g. we explicitly ignore Tide errors from kubeflow since they have been misusing Tide for a long time and the errors are accepted: https://github.com/kubernetes/test-infra/blob/af68b02cf15c086db2b33997784b93a096a7c208/prow/cluster/monitoring/mixins/prometheus/tide_alerts.libsonnet#L36
Since the alerts aren't tuned for your instance, I'd recommend starting without the alerts configured, then checking whether they would have fired on the last week of data before enabling them. That way you can tweak the alert thresholds for your instance and avoid alerting spam that could make people start ignoring the new alerts before they become reliable.
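For example, once you have a week of data you might temporarily relax a threshold while tuning. In plain Prometheus rule syntax (the alert name, metric, and numbers below are made up for illustration, not the upstream values):

```yaml
groups:
- name: tide-alerts
  rules:
  - alert: TideSyncErrors                        # hypothetical alert
    # Loosened threshold while validating against historical data;
    # tighten once the alert has proven reliable for this instance.
    expr: rate(sync_errors_total[10m]) > 0.05
    for: 30m
    annotations:
      summary: Tide is hitting sync errors above the tuned threshold.
```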

> • I could also add more jsonnet for dashboards in regard to Lacking Prow cluster metrics #1638, but I'm not sure if that's better to do after everything is set up.

Yes, if the metrics and dashboards currently included in the monitoring stack are insufficient for your needs, it would probably be best to work on adding new dashboards as a follow-up. Please consider whether any metrics/dashboards would be generally useful to Prow (#1638 sounds like a yes) and, if so, contribute the changes upstream to k/t-i so that all Prow instances benefit.

> • Lastly, how would I go about applying this? I think I would be setting up an Istio Prow job to deploy this, but let me know if that's wrong and whether I need some kind of additional access.

This should be deployed as part of the deploy-prow job via a new make target:

```yaml
- name: deploy-prow
  type: postsubmit
  regex: '^prow/cluster/(?:gcsweb/|build/|private/|[^/]+\.yaml$)'
  cluster: test-infra-trusted
  max_concurrency: 1
  command:
  - make
  - -C
  - prow
  - deploy
  - deploy-gcsweb
  - deploy-velodrome
  - deploy-build
  - deploy-private
  requirements: [deploy]
  node_selector:
    prod: prow
```
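Concretely, that would mean widening the job's regex to cover the new directory and appending a new target to the command list, roughly like this (the deploy-monitoring target name is hypothetical and would need a matching Makefile rule):

```yaml
- name: deploy-prow
  type: postsubmit
  # 'monitoring/' added so changes to the new directory trigger the job
  regex: '^prow/cluster/(?:gcsweb/|build/|private/|monitoring/|[^/]+\.yaml$)'
  cluster: test-infra-trusted
  max_concurrency: 1
  command:
  - make
  - -C
  - prow
  - deploy
  - deploy-gcsweb
  - deploy-velodrome
  - deploy-build
  - deploy-private
  - deploy-monitoring   # hypothetical new target that applies the monitoring manifests
  requirements: [deploy]
  node_selector:
    prod: prow
```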

> I can also move over the Istio-specific stuff from https://github.com/kubernetes/test-infra/blob/master/velodrome/config.yaml into a new config.yaml in https://github.com/istio/test-infra/tree/master/prow/cluster/velodrome, but I'm not sure what the next steps would be after that.

I would recommend not touching velodrome at all unless you are migrating things off of it and onto the new monitoring stack.

Hope this helps!

@istio-policy-bot removed the lifecycle/needs-triage label Mar 23, 2020
clarketm (Member) commented Mar 26, 2020

@scottilee thanks for taking this on!

@clarketm removed the help wanted label Mar 26, 2020
@scottilee mentioned this issue Apr 2, 2020