
Istio's Prow should use the new upstream monitoring and alerting stack. #1774

Closed
cjwagner opened this issue Sep 6, 2019 · 3 comments · Fixed by #2563

cjwagner (Member) commented Sep 6, 2019

https://github.com/kubernetes/test-infra/tree/master/prow/cluster/monitoring#monitoring
We'll get lots of nice preconfigured graphs and alerts for free, e.g. https://monitoring.prow.k8s.io
I think we can pretty easily move over anything from https://velodrome.istio.io that is still used too, so that we can turn down velodrome and just have a single monitoring dashboard.

cc @clarketm @fejta @geeknoid @howardjohn

@istio-policy-bot added the lifecycle/needs-triage label Oct 30, 2019
@clarketm added the help wanted label Nov 26, 2019
@clarketm added the prow label Dec 7, 2019
scottilee (Contributor) commented Mar 21, 2020

@cjwagner I'd like to help with this. According to Travis this is mostly a copy-and-paste job to see if it works more or less out of the box, but I had a few questions:

  • It seems like the first step would be to create a "monitor" directory in https://github.com/istio/test-infra/tree/master/prow/cluster and copy everything from https://github.com/kubernetes/test-infra/tree/master/prow/cluster/monitoring into it; let me know if that's wrong.
  • Per the README, update grafana_secret.yaml -- what should I set the password to, and how does Istio handle secrets?
  • Per the README, update `` -- probably set this to the Istio #test-failures channel? And who can I ask about the api_url?
  • Per the README, I should update the annotations in grafana_expose.yaml, since it doesn't look like Istio uses cert-manager?
  • Update the OWNERS file to howardjohn, clarketm, cjwagner? Or should this be renamed to CODEOWNERS?
  • I skimmed through the YAML and there didn't seem to be anything else Kubernetes test-infra specific; let me know if I missed anything though.
  • I could also add more jsonnet for dashboards in regard to Lacking Prow cluster metrics #1638, but I'm not sure if that's better to do after everything is set up.
  • Lastly, how would I go about applying this? I think I would be setting up an Istio Prow job to deploy this, but let me know if that's wrong and whether I need some kind of additional access.

I can also move over the Istio-specific stuff from https://github.com/kubernetes/test-infra/blob/master/velodrome/config.yaml into a new config.yaml in https://github.com/istio/test-infra/tree/master/prow/cluster/velodrome, but I'm not sure what the next steps would be after that.

cjwagner (Member, Author) commented:

> @cjwagner I'd like to help with this. According to Travis this is mostly a copy-and-paste job to see if it works more or less out of the box, but I had a few questions:

Sure, that would be appropriate.

> • Per the README, update grafana_secret.yaml -- what should I set the password to, and how does Istio handle secrets?

You will need to generate a new password. I'm not certain how Istio handles secrets, but if it is the same as the k8s project then you'll need to share the credentials with oncall (or perhaps have them generate the password in this case) and they can load it into the cluster and back it up in a secret store.
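For reference, a minimal sketch of what the updated grafana_secret.yaml could look like, assuming the stack reads the Grafana admin password from a Kubernetes Secret (the resource name, namespace, and key below are illustrative, not necessarily the upstream values):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana              # illustrative; match whatever name the upstream manifests expect
  namespace: prow-monitoring # illustrative namespace
type: Opaque
stringData:
  # Freshly generated password, e.g. via `openssl rand -base64 32`;
  # hand it to oncall so it can be backed up in the secret store.
  admin-password: REPLACE_WITH_GENERATED_PASSWORD
```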

> • Per the README, update `` -- probably set this to the Istio #test-failures channel? And who can I ask about the api_url?

(Assuming you are talking about {{ api_url }}.) I don't know much about that Slack channel, but it sounds appropriate; you could alternatively make a separate channel for these alerts. You'll need to coordinate with a Slack admin to get the api_url (which is a secret value). I don't know who that would be.
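For illustration, the Slack receiver in the Alertmanager config would look roughly like this; this is standard Alertmanager slack_configs syntax, but the receiver name and channel are assumptions:

```yaml
receivers:
- name: slack-test-failures        # hypothetical receiver name
  slack_configs:
  - channel: '#test-failures'      # or a dedicated alerts channel
    # api_url is the secret incoming-webhook URL obtained from a Slack admin;
    # it is the value that replaces the {{ api_url }} placeholder.
    api_url: https://hooks.slack.com/services/REPLACE/ME/TOO
```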

> • Per the README, I should update the annotations in grafana_expose.yaml, since it doesn't look like Istio uses cert-manager?

Yes, you should update the ingress to work with your own domain instead of monitoring.prow.k8s.io and configure TLS termination with whatever tool you prefer. (In this case you should probably use GKE managed certs, like Travis has set up for the Prow and Velodrome instances.)
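A sketch of what that could look like with GKE managed certificates; the domain, resource names, and service port are illustrative, and the API versions are those current at the time:

```yaml
apiVersion: networking.gke.io/v1beta1
kind: ManagedCertificate
metadata:
  name: grafana-cert
spec:
  domains:
  - monitoring.prow.istio.io       # hypothetical domain
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana-expose
  annotations:
    # Ties the GKE-managed certificate above to this ingress
    networking.gke.io/managed-certificates: grafana-cert
spec:
  backend:
    serviceName: grafana           # illustrative service name/port
    servicePort: 3000
```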

> • Update the OWNERS file to howardjohn, clarketm, cjwagner? Or should this be renamed to CODEOWNERS?

Istio uses CODEOWNERS instead of OWNERS, so switch to CODEOWNERS. Please don't include me as an owner of this; I don't intend to maintain this instance.
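That would be a one-line CODEOWNERS entry along these lines (the owner list is a placeholder to be settled in review, per the above):

```
prow/cluster/monitoring/ @howardjohn @clarketm
```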

> • I skimmed through the YAML and there didn't seem to be anything else Kubernetes test-infra specific; let me know if I missed anything though.

There is some k/t-i specific configuration. In particular, some of the alerts are only appropriate for, or are customized for, k/t-i; e.g. we explicitly ignore Tide errors from kubeflow since they have been misusing Tide for a long time and the errors are accepted: https://github.com/kubernetes/test-infra/blob/af68b02cf15c086db2b33997784b93a096a7c208/prow/cluster/monitoring/mixins/prometheus/tide_alerts.libsonnet#L36
Since the alerts aren't tuned for your instance, I'd recommend starting without the alerts configured, then checking whether they would have fired on the last week of data before enabling them. That way you can tweak the alert thresholds for your instance and avoid alerting spam that could make people start ignoring the new alerts before they become reliable.
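For example, once you have a week of data you might temporarily relax a threshold while tuning. In plain Prometheus rule syntax (the alert name, metric, and numbers below are made up for illustration, not the upstream values):

```yaml
groups:
- name: tide-alerts
  rules:
  - alert: TideSyncErrors                        # hypothetical alert
    # Loosened threshold while validating against historical data;
    # tighten once the alert has proven reliable for this instance.
    expr: rate(sync_errors_total[10m]) > 0.05
    for: 30m
    annotations:
      summary: Tide is hitting sync errors above the tuned threshold.
```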

> • I could also add more jsonnet for dashboards in regard to Lacking Prow cluster metrics #1638, but I'm not sure if that's better to do after everything is set up.

Yes, if the metrics and dashboards currently included in the monitoring stack are insufficient for your needs, it would probably be best to work on adding new dashboards as a follow-up. Please consider whether any metrics/dashboards would be generally useful to Prow (#1638 sounds like a yes) and, if so, contribute the changes upstream to k/t-i so that all Prow instances benefit.

> • Lastly, how would I go about applying this? I think I would be setting up an Istio Prow job to deploy this, but let me know if that's wrong and whether I need some kind of additional access.

This should be deployed as part of the deploy-prow job via a new make target:

```yaml
- name: deploy-prow
  type: postsubmit
  regex: '^prow/cluster/(?:gcsweb/|build/|private/|[^/]+\.yaml$)'
  cluster: test-infra-trusted
  max_concurrency: 1
  command:
  - make
  - -C
  - prow
  - deploy
  - deploy-gcsweb
  - deploy-velodrome
  - deploy-build
  - deploy-private
  requirements: [deploy]
  node_selector:
    prod: prow
```
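Concretely, that would mean widening the job's regex to cover the new directory and appending a new target to the command list, roughly like this (the deploy-monitoring target name is hypothetical and would need a matching Makefile rule):

```yaml
- name: deploy-prow
  type: postsubmit
  # 'monitoring/' added so changes to the new directory trigger the job
  regex: '^prow/cluster/(?:gcsweb/|build/|private/|monitoring/|[^/]+\.yaml$)'
  cluster: test-infra-trusted
  max_concurrency: 1
  command:
  - make
  - -C
  - prow
  - deploy
  - deploy-gcsweb
  - deploy-velodrome
  - deploy-build
  - deploy-private
  - deploy-monitoring   # hypothetical new target that applies the monitoring manifests
  requirements: [deploy]
  node_selector:
    prod: prow
```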

> I can also move over the Istio-specific stuff from https://github.com/kubernetes/test-infra/blob/master/velodrome/config.yaml into a new config.yaml in https://github.com/istio/test-infra/tree/master/prow/cluster/velodrome, but I'm not sure what the next steps would be after that.

I would recommend not touching velodrome at all unless you are migrating things off of it and onto the new monitoring stack.

Hope this helps!

@istio-policy-bot removed the lifecycle/needs-triage label Mar 23, 2020
clarketm (Member) commented Mar 26, 2020

@scottilee thanks for taking this on!

@clarketm removed the help wanted label Mar 26, 2020
@scottilee mentioned this issue Apr 2, 2020