cip-auditor: alerts are noisy #2364

Open
spiffxp opened this issue Jul 19, 2021 · 13 comments
Assignees
Labels
- area/artifacts: Issues or PRs related to the hosting of release artifacts for subprojects
- area/release-eng: Issues or PRs related to the Release Engineering subproject
- lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
- priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
- sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
- sig/release: Categorizes an issue or PR as relevant to SIG Release.

Comments

@spiffxp
Member

spiffxp commented Jul 19, 2021

Tracking issue for the fact that CIP auditor alerts are noisy.

The alerting was manually created by thockin a while ago (could not find an issue to link at a glance, maybe someone else knows). Then it was manually disabled because the alerts were perceived as noisy.

The alerting is managed via click-ops; ideally it could be done with gcloud: #1624

IAM policies to allow viewing of incidents and alerts without granting admin access to k8s-artifacts prod would be helpful. What's the appropriate group to grant this access to?

/cc @listx
FYI @tylerferrara
I'm making this a tracking issue because I'm about to make a manual IAM change and want to document it somewhere.

@spiffxp
Member Author

spiffxp commented Jul 19, 2021

/priority important-longterm
/wg k8s-infra
/sig release
/area artifacts
/area release-eng
/area infra/monitoring
/milestone v1.22

k8s-ci-robot added the priority/important-longterm, wg/k8s-infra, and sig/release labels Jul 19, 2021
k8s-ci-robot added this to the v1.22 milestone Jul 19, 2021
k8s-ci-robot added the area/artifacts, area/release-eng, and area/infra/monitoring labels Jul 19, 2021
@spiffxp
Member Author

spiffxp commented Jul 19, 2021

FWIW, here is the policy generating the alert in question:

# $ gcloud alpha monitoring policies list --project=kubernetes-public
---
combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_MAX
      perSeriesAligner: ALIGN_RATE
    comparison: COMPARISON_GT
    duration: 0s
    filter: metric.type="logging.googleapis.com/user/cip-auditor-alert"
    trigger:
      count: 1
  displayName: logging/user/cip-auditor-alert
  name: projects/kubernetes-public/alertPolicies/12089518252001301876/conditions/12089518252001300531
creationRecord:
  mutateTime: '2020-03-31T00:27:12.391234443Z'
  mutatedBy: thockin@REDACTED
displayName: Image promoter alert
documentation:
  content: The image promoter has logged something that we consider an alert.
  mimeType: text/markdown
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T01:45:19.410085124Z'
  mutatedBy: thockin@REDACTED
name: projects/kubernetes-public/alertPolicies/12089518252001301876
notificationChannels:
- projects/kubernetes-public/notificationChannels/7630148271419930225
- projects/kubernetes-public/notificationChannels/17367851054639804370
- projects/kubernetes-public/notificationChannels/2533614711603005061
- projects/kubernetes-public/notificationChannels/7846745591716920888
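
One possible reading of why this policy is noisy: with `duration: 0s`, `trigger.count: 1`, and `COMPARISON_GT`, a single matching log entry in any 60-second window may be enough to open an incident. The sketch below is a simplified model of that evaluation, not Cloud Monitoring's implementation. It assumes that the absence of `thresholdValue` in the listing means a threshold of 0, and the function names and sample data are hypothetical.

```python
from collections import Counter

ALIGNMENT_PERIOD_S = 60  # alignmentPeriod: 60s
THRESHOLD = 0.0          # assumption: thresholdValue omitted from the listing means 0


def align_rate(timestamps, period_s=ALIGNMENT_PERIOD_S):
    """Bucket log-entry timestamps (seconds) into windows; return per-second rates."""
    counts = Counter(int(ts // period_s) for ts in timestamps)
    return {window: n / period_s for window, n in counts.items()}


def firing_windows(series, threshold=THRESHOLD):
    """Return the sorted window indexes in which the modeled condition is met.

    series maps a series name to a list of log-entry timestamps in seconds.
    Each series is rate-aligned, the maximum across series is taken per window
    (REDUCE_MAX), and a window counts when that value exceeds the threshold
    (COMPARISON_GT). With duration 0s and trigger count 1, one window suffices.
    """
    reduced = {}
    for timestamps in series.values():
        for window, rate in align_rate(timestamps).items():
            reduced[window] = max(rate, reduced.get(window, 0.0))
    return sorted(w for w, rate in reduced.items() if rate > threshold)


if __name__ == "__main__":
    # Hypothetical data: one alert log line at t=75s and nothing else.
    print(firing_windows({"auditor": [75]}))
```

Under these assumptions, raising the threshold or requiring a non-zero duration would be the knobs that make a single stray log line stop paging.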

@tylerferrara
Contributor

/assign

@spiffxp
Member Author

spiffxp commented Jul 19, 2021

Members of k8s-infra-gcp-auditors should be able to do what I just did above:

- monitoring.alertPolicies.get
- monitoring.alertPolicies.list

@spiffxp
Member Author

spiffxp commented Jul 19, 2021

This is the current IAM policy for the project in question: https://github.com/kubernetes/k8s.io/blob/main/audit/projects/kubernetes-public/iam.json

I could be convinced that k8s-infra-cluster-admins should get something like roles/monitoring.editor or higher on this project, as the nearest conveniently existing group that has editor/admin type roles on this project.

I could also be convinced there should be a group dedicated to CIP-related infrastructure that gets more granular IAM permissions.

@spiffxp
Member Author

spiffxp commented Jul 19, 2021

@tylerferrara is unable to view incidents for lack of roles/monitoring.viewer, ref: https://cloud.google.com/monitoring/alerts/troubleshooting-alerts#no-permission

The only existing group that has this is gke-security-groups:

"members": [
  "group:[email protected]",
  "serviceAccount:[email protected]",
  "serviceAccount:k8s-infra-monitoring-viewer@kubernetes-public.iam.gserviceaccount.com"
],
"role": "roles/monitoring.viewer"

That group is populated entirely by app-specific RBAC groups:

k8s.io/groups/groups.yaml

Lines 172 to 194 in 15f7d7c

# Every RBAC group should be added here.
- email-id: [email protected]
  name: gke-security-groups
  description: |-
    Security Groups for GKE clusters
  settings:
    ReconcileMembers: "true"
    WhoCanViewMembership: "ALL_MEMBERS_CAN_VIEW" # needed for RBAC
  members:
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]
    - [email protected]

Since I'm about to go AFK, rather than untangle the right group to put them in, I'm going to add @tylerferrara manually:

$ gcloud projects add-iam-policy-binding kubernetes-public --role=roles/monitoring.viewer --member=user:[email protected]
Updated IAM policy for project [kubernetes-public].
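
For anyone wanting to check which members hold a role such as roles/monitoring.viewer, a minimal sketch follows. It assumes the policy document uses the standard `bindings` layout that `gcloud projects get-iam-policy --format=json` returns, which is consistent with the `members`/`role` snippet quoted earlier in this comment. The helper name and the placeholder members are hypothetical.

```python
import json


def members_with_role(policy, role):
    """Return the sorted, de-duplicated members bound to `role` in an IAM policy dict."""
    members = set()
    for binding in policy.get("bindings", []):
        if binding.get("role") == role:
            members.update(binding.get("members", []))
    return sorted(members)


if __name__ == "__main__":
    # Hypothetical policy document; the member names are placeholders.
    policy = json.loads("""
    {
      "bindings": [
        {"role": "roles/monitoring.viewer",
         "members": ["group:example-group@example.com",
                     "user:example-user@example.com"]},
        {"role": "roles/viewer",
         "members": ["group:other-group@example.com"]}
      ]
    }
    """)
    print(members_with_role(policy, "roles/monitoring.viewer"))
```

This only reports direct bindings; it does not expand group membership, so a user who gets the role through a nested group would not appear.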

@tylerferrara
Contributor

Incidents are now viewable!

With respect to the auditor logs, tracking down these incident triggers requires looking at stack traces. I'm still unable to view anything from the "Traces" service in GCP for the k8s-artifacts-prod project.

@spiffxp
Member Author

spiffxp commented Jul 19, 2021

#2365 added the necessary cloudtrace read-only permissions to audit.viewer, which is granted org-wide to k8s-infra-gcp-auditors@, of which @tylerferrara is already a member.

@tylerferrara
Contributor

PR #2366 has stopped the audit crashes, which were causing the alerts. However, this issue should remain open until the investigation tracked in CIP issue #353 has been resolved.

@spiffxp
Member Author

spiffxp commented Aug 4, 2021

/milestone v1.23

k8s-ci-robot modified the milestones: v1.22, v1.23 Aug 4, 2021
@spiffxp
Member Author

spiffxp commented Sep 2, 2021

/remove-priority important-longterm
/priority important-soon

k8s-ci-robot added the priority/important-soon label and removed the priority/important-longterm label Sep 2, 2021
k8s-ci-robot added the sig/k8s-infra label and removed the wg/k8s-infra label Sep 29, 2021
@spiffxp
Member Author

spiffxp commented Oct 13, 2021

@kubernetes/release-engineering: I am not sure where we stand on this anymore. Has the fix been:

  • identified?
  • implemented?
  • deployed?

@ameukam
Member

ameukam commented Dec 6, 2021

/lifecycle frozen
/milestone clear

k8s-ci-robot added the lifecycle/frozen label Dec 6, 2021
k8s-ci-robot removed this from the v1.23 milestone Dec 6, 2021
Projects
Status: Blocked
Development

No branches or pull requests

5 participants