monitor boskos cleanup timing #13

Open · ixdy opened this issue May 29, 2020 · 18 comments

Labels: kind/feature (Categorizes issue or PR as related to a new feature.) · lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.) · sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)

ixdy (Contributor) commented May 29, 2020

Originally filed as kubernetes/test-infra#14715 by @BenTheElder

What would you like to be added: export and graph metrics for boskos cleanup timing (see the sketch below).

Why is this needed: so we can determine whether cleanup time is increasing and whether we need to scale up the janitors or fix boskos (xref #14697).

Possibly this should also move to the new monitoring stack? cc @cjwagner @detiber

/area boskos
/assign @krzyzacy
cc @fejta @mm4tt
/kind feature
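
For concreteness, a minimal sketch of the kind of metric being asked for, using the Prometheus Go client; the metric name, label set, and buckets here are illustrative assumptions, not what was eventually merged:

```go
// Hypothetical instrumentation sketch; not the actual boskos code.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanupDuration tracks how long a janitor takes to clean a dirty resource.
var cleanupDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "boskos_janitor_cleanup_duration_seconds", // assumed name
		Help:    "Time taken to clean a dirty resource and return it to free.",
		Buckets: prometheus.ExponentialBuckets(10, 2, 10), // 10s up to ~85min
	},
	[]string{"resource_type"},
)

func init() {
	prometheus.MustRegister(cleanupDuration)
}

// TimeCleanup wraps one cleanup pass and records its duration, so a dashboard
// can graph whether cleanup time is trending upward.
func TimeCleanup(resourceType string, clean func() error) error {
	start := time.Now()
	err := clean()
	cleanupDuration.WithLabelValues(resourceType).Observe(time.Since(start).Seconds())
	return err
}
```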

k8s-ci-robot added the kind/feature label May 29, 2020
k8s-ci-robot (Contributor) commented:

@ixdy: The label(s) area/boskos cannot be applied, because the repository doesn't have them


krzyzacy (Contributor) commented:

/unassign
/help-wanted

detiber (Member) commented Jun 19, 2020

/help

k8s-ci-robot added the help wanted label Jun 19, 2020
fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 17, 2020
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 17, 2020
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:

@fejta-bot: Closing this issue.


ixdy (Contributor, author) commented Nov 16, 2020

/reopen
/remove-lifecycle stale

k8s-ci-robot (Contributor) commented:

@ixdy: Reopened this issue.


k8s-ci-robot reopened this Nov 16, 2020
detiber (Member) commented Nov 17, 2020

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Nov 17, 2020
cpanato (Member) commented Jan 19, 2021

Is this about adding a Prometheus metric for this operation? @detiber

detiber (Member) commented Jan 19, 2021

@cpanato I believe that to be the case, yes. That said, I haven't dug into how the existing metrics are exposed for boskos. The dashboards sit at monitoring.prow.k8s.io, though.

cpanato (Member) commented Jan 20, 2021

Hello @ixdy, should the metric in question be added in this part of the code (https://github.com/kubernetes-sigs/boskos/tree/master/cmd/cleaner), or does it belong somewhere else?
Maybe the first question is: is this still needed?

ixdy (Contributor, author) commented Feb 16, 2021

Sorry for the delay in response. To clarify, this would be metrics added to the janitor(s), not the (unfortunately named) cleaner component.

The basic gist is just adding some Prometheus metrics to the janitors, yes. The primary challenge is that in some deployments (such as k8s.io prow) Boskos and the janitors run in a completely separate build cluster from the prow monitoring stack, which makes collecting these metrics harder, since they aren't directly accessible.

In the case of k8s.io prow, to collect metrics from the core boskos service, we expose the boskos metrics port on an external IP and then explicitly collect from that address. Since the janitors run as a separate container, we'd need to either expose additional IPs for each janitor (non-ideal) or set up some sort of collector for all of the boskos metrics (core and janitor) and then expose that to the prow monitoring stack. Alternatively, we could collect/push these metrics to the monitoring stack. [Note: I'm probably using the wrong Prometheus terminology here.] (A sketch of the basic /metrics exposure side follows this comment.)

Figuring all of this out is the harder aspect of this issue. If this sounds interesting to you, please take it on!
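
For what it's worth, the "expose a metrics port" half described above is the straightforward part; a minimal sketch for a long-running janitor using the standard promhttp handler (the port is an arbitrary placeholder):

```go
// Hypothetical sketch of serving metrics from a long-running janitor.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry on /metrics so either an in-cluster
	// collector or an externally exposed IP (as done for core boskos)
	// can scrape this janitor.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil)) // placeholder port
}
```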

cpanato (Member) commented Feb 23, 2021

Thanks @ixdy, and my turn to say sorry for the delay 😄

There are two different things we need to do: the first is to add the metric in the janitor, and the second is the infrastructure part.

For the second I have a couple of questions:

  • Is the janitor a cron process, or is it always up and running?
    If it is a cron, we will need to use the Prometheus Pushgateway to send the metrics there, and the monitoring cluster can then scrape from it (see the sketch after this comment).

  • Is the cluster that runs boskos the same one that runs the janitor?
    If so, to avoid exposing multiple load balancers, we can deploy Prometheus in that cluster to collect the metrics and expose it to be scraped by the main monitoring system, so we have just one load-balancer entry point (Prometheus federation).


I will work on the first part (adding the metrics) while we discuss the second, if that sounds good to you.

thanks!
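
If the janitors do turn out to be cron-style, the Pushgateway flow from the first bullet above could look roughly like this using the Go client's push package; the Pushgateway URL, job name, and metric name are placeholders:

```go
// Hypothetical sketch of a one-shot janitor pushing its metrics on exit.
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	duration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "boskos_janitor_last_run_duration_seconds", // assumed name
		Help: "Duration of the last janitor run.",
	})

	start := time.Now()
	// ... the actual cleanup work would run here ...
	duration.Set(time.Since(start).Seconds())

	// Push once before exiting; the monitoring Prometheus then scrapes
	// the Pushgateway instead of the short-lived pod.
	if err := push.New("http://pushgateway:9091", "boskos-janitor"). // placeholders
		Collector(duration).
		Push(); err != nil {
		log.Printf("could not push metrics: %v", err)
	}
}
```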

cpanato (Member) commented Feb 23, 2021

/assign
/remove-help

@k8s-ci-robot k8s-ci-robot removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Feb 23, 2021
ixdy (Contributor, author) commented Feb 24, 2021

  1. Is the janitor a cron process or is it always up and running?

It depends. There are three (or four) different janitor entry points right now:

  • a. cmd/aws-janitor: a one-shot command that cleans up an AWS account, optionally specifying a region.
  • b. cmd/aws-janitor-boskos: a long-lived process which queries Boskos (using its API) for AWS regions in a dirty state, cleans up each region using the same library as (a), and then returns the region to the free state in Boskos.
  • c. cmd/janitor/gcp_janitor.py: a one-shot Python script which cleans up the provided GCP project(s). It should probably be rewritten in Go eventually.
  • d. cmd/janitor: a resource-agnostic janitor that queries Boskos (using its API) for resources of a specified type in a dirty state, passes them to a specified janitor command to clean up, and returns them to Boskos in the free state (assuming the janitor command exited successfully). Defaults to calling the gcp_janitor.py script, but can potentially call any other one-shot janitor (e.g. the AWS janitor from (a)).

The one-shot janitors could be run as CronJobs, with or without Boskos (e.g. to manage AWS environments, GCP projects, etc. that are not managed by Boskos). The Boskos-specific janitors tend to run as long-running pods.

(So one follow-up question you might have: which janitor? The ones most relevant to this issue are probably cmd/aws-janitor-boskos and cmd/janitor, though hopefully you can generalize things enough to reduce the amount of duplicated code; a rough sketch of the cmd/janitor instrumentation point follows this comment.)

  2. Is the cluster that runs Boskos the same one that runs the janitor?

In general, yes, the janitors run in the same cluster as Boskos. This is because the necessary credentials/service accounts needed to interact with AWS accounts/GCP projects likely already exist in those clusters, as they are used by the test jobs.
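
As referenced above, a rough sketch of where timing could hook into a cmd/janitor-style flow; the runJanitor helper, the --project flag, and the metric wiring are invented for illustration, not the actual boskos code:

```go
// Hypothetical sketch; not the actual cmd/janitor code.
package janitor

import (
	"fmt"
	"os/exec"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// runJanitor shells out to a one-shot janitor (e.g. gcp_janitor.py) for a
// single dirty resource, records the cleanup duration, and returns the error
// so the caller can decide whether to mark the resource free in Boskos.
func runJanitor(janitorPath, resource string, obs prometheus.Observer) error {
	start := time.Now()
	out, err := exec.Command(janitorPath, "--project="+resource).CombinedOutput() // flag is illustrative
	obs.Observe(time.Since(start).Seconds())
	if err != nil {
		return fmt.Errorf("janitor %s failed on %s: %v (output: %s)", janitorPath, resource, err, out)
	}
	return nil
}
```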

cpanato (Member) commented Feb 24, 2021

thanks for the clarification @ixdy

k8s-ci-robot added a commit that referenced this issue Mar 3, 2021
aws-janitor-boskos: add clean time and process time metrics
k8s-ci-robot added a commit that referenced this issue Mar 12, 2021
aws-janitor: add job duration metric
spiffxp added the sig/testing label Aug 17, 2021