Questions on migrating kubespray CI to test-infra #31351
Comments
/sig testing |
@BenTheElder @ameukam @upodroid for resource usage |
/sig k8s-infra |
So far really large testing is basically only done for scale testing the core Kubernetes project. These are all relative terms though. SIG K8s Infra owns the actual resource policy, which is not well defined yet, but I can speak to it a little as a lead in both SIGs. Can you be more specific about what you're intending to run? We just went through measures this year to reduce spend, and we're resuming the process of moving lingering CI/resources out of google.com projects into kubernetes.io, on GCP in particular.
No, this is not supported, please don't put lots of expensive testing in presubmit.
Not supported on prow.k8s.io, sorry.
No, please do not try to use kubevirt on our clusters (this is why we created KIND), you'll need to spin up remote machines. |
kubespray has been a bit of a black sheep, where the project over the years has drifted away from the commons: it has its own CI and Zoom account, not using the community ways... that's not necessarily bad, but rather peculiar.
could you explain what might be the reasons for such a migration? |
Here is a typical PR run: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/pipelines/1091836466
Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.
ACK. Just one question, does that depend on the service cluster (where prow itself runs) or the build cluster ?
Ok. But is bringing our own cluster as a Prow "build cluster" a possibility, or not at all ?
Sorry for that, I listed them in the issue on kubespray side. Basically:
|
Right, if you find an issue you can revert. We have testgrid.k8s.io to aid in that. But we consider for example 5,000 node scale tests as an extreme example. We find bugs surfaced in those tests and yet they do not gate all PR merges because this is unreasonably expensive.
The project doesn't have anyone maintaining support for this. We support prow decorated jobs.
K8S Infra is only using community managed resources going forward because we've been bitten repeatedly with issues depending on third party controlled accounts etc. We do not support "bring your own", if anyone wants to help fund the project with assets they can talk to the CNCF about setting up something like https://www.cncf.io/google-cloud-recommits-3m-to-kubernetes/ which SIG K8s Infra administers and SIG Testing uses to run CI.
This looks like an expansive test matrix as large as we'd typically do in periodic testing only, not on every PR. It's difficult to understand what sort of expense we're talking about here though, just seeing the gitlab pipeline names. Generally when sig subprojects have started using our CI in the past they've had relatively minimal needs, some cheap unit tests and so on. We have not had a new distro / deployment tool onboard in a long time since maybe cluster API so there's not a lot of precedent here. |
@VannTen Can you please share a bit of the background on the gitlab infra itself? Who paid for it? who set it up? For some of us this is fresh unforeseen news sadly! |
Yeah, I think it is. Ok. So in our cases, for example, that would translate to the test matrix moving to periodics, and keeping one/some default configuration tests in presubmits ?
The jobs in deploy-part2 typically run for around 40 minutes and use 1 to 3 VMs each (using kubevirt) (see here + the job itself). Given there are around 20-25 configurations, that adds up.
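As a rough back-of-the-envelope check on "that adds up", the figures quoted above can be turned into VM-minutes per pipeline run. This is only a sketch: it assumes every job runs the full 40 minutes, and it ignores the other pipeline stages and any retries.

```python
# Rough estimate of deploy-part2 cost per PR pipeline, using only the figures
# quoted in the thread. Assumption: every job runs the full ~40 minutes.
JOB_MINUTES = 40


def vm_minutes(configs: int, vms_per_job: int, job_minutes: int = JOB_MINUTES) -> int:
    """Total VM-minutes consumed by one pipeline run."""
    return configs * vms_per_job * job_minutes


low = vm_minutes(configs=20, vms_per_job=1)   # smallest case: 20 jobs, 1 VM each
high = vm_minutes(configs=25, vms_per_job=3)  # largest case: 25 jobs, 3 VMs each

print(f"{low} to {high} VM-minutes per run")
print(f"{low / 60:.1f} to {high / 60:.1f} VM-hours per run")
```

Under those assumptions a single PR pipeline lands somewhere between about 13 and 50 VM-hours, which is why the number of configurations run per push matters so much.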
I'll share what I can, I don't have all the information or history. The github <-> gitlab-ci integration was done by @ant31 if I'm correct, and uses https://github.com/failfast-ci/failfast-api The infrastructure was provided by Packet (which is now Equinix Metal) and I think it's on CNCF cloud credits (if some of the people mentioned have more info, feel free to correct or clarify 👍 ) |
Hi all, The background is that kubespray started with Kubernetes 1.0, so there was little around to help the community. CNCF allocated us a few bare-metal nodes (and still does) to run our pipelines. We deploy and maintain those nodes ourselves. Why gitlab-ci? In 2016-2017, gitlab-ci was a good alternative that combined low maintenance (we only need to deploy the runner) with most of our requirements (complex pipelines, with manual jobs and stages). We filled in the missing GitHub integration and features with https://github.com/failfast-ci/failfast-api We create empty VMs to mimic end-user environments:
Moving to prow would remove the need to maintain the bare-metal nodes and the failfast-ci project, and would bring a few other benefits, but we must be able to configure an equivalent pipeline. |
same for kOps?
the test matrix with CNI, distro is redundantly complex (see the kOps case above). i wouldn't want us to say -1 on kubespray if they want to move to prow, but if i could, i'd happily take 50% of the test bandwidth of kOps and give it to kubespray. |
The bulk of the CI that we run that requires testing on a real virtual machine involves creating VMs on AWS/GCP. We have tooling that handles that for us and you would need to adopt it. kops is a good example of what you'll need to do to adopt the Kubernetes CI. Here are a couple of examples: |
On Mon, Dec 04, 2023 at 02:13:02PM -0800, Benjamin Elder wrote:
> Postsubmits and periodics do not guard merging in the main branch, is that correct ? Understood though.
Right, if you find an issue you can revert. We have testgrid.k8s.io to aid in that.
If you're _frequently_ reverting because of changes not caught in presubmit, consider presubmit.
Another question about that: what's the typical frequency for periodics ? Daily, weekly ?
Do other projects have some strategy in place to avoid breakage in their main branch ? Having a separate "dev" branch for instance, only
merged into the main branch at the same frequency as the periodics run ?
|
Daily for the latest release, weekly for older supported releases or rare scenarios. For kubespray in particular, I would test 2 or 3 scenarios for a proper e2e test in presubmits (which run on every push to a PR) and then run the e2e test matrix once a day, or twice at most. |
This is overlooking the "not on presubmit" aspect of my comment. I'm well aware of the kops test matrix, that's exactly what I was thinking of. That matrix is actually designed to minimally identify which aspect is broken and the tooling for it is in this repo.
I don't think that's a reasonable dichotomy. kops has been using these resources in good faith as a long time participant in upstream test tooling, infra, etc. Also, we're (SIG Testing + SIG K8s Infra) planning to use kops to replace kube-up because we desperately need to eliminate kube-up.sh and we need to be flexible in AWS+GCP spend, so we certainly don't want to reduce test coverage. (There is a KEP in flight) |
Reasonably frequent on the main branch (multiple times per day), much less frequent on stable release branches with frequency decreasing for older releases (and none for out of support releases). |
i agree with the comments from earlier that presubmit should be minimal and fail-fast.
it's not, but it's anecdotally hinting at fairness and non-bias. kubespray should not be denied bandwidth just because they are late to the party.
jobs such as https://testgrid.k8s.io/sig-cluster-lifecycle-kops#kops-grid-cilium-deb10-k27 would not be contributing much to the kube-up replacement picture. such jobs are effectively testing a user deployment scenario. they just guarantee to maintainers and users that a certain deployment scenario works, not that kOps itself works. i don't want to speak behind the intent of these jobs, though. |
i don't think we have a way to measure how much $$ is generated per SIG, but my wild guess is that SIG CL is a major contributor to our budget reduction due to how many subprojects and e2e test jobs we have... i would not be surprised if at some point we have to do some sort of evaluation and ask maintainers to limit how much they test. |
The problem is more so that we need to determine if we have bandwidth to spin things up (we probably don't at the moment -- AWS spend is hitting the budget cap, but we're going to optimize costs) and we've already had to cut down on spend like scale testing this year unfortunately due to lack of options. We shouldn't do more cutting of existing usage until we have a policy in place. (Though we can run equivalently with less cost, e.g. committed use discounts.) We need to have a framework in place before we start kicking things off, and we haven't done that yet (because we've been too busy reacting to the ongoing issues). As-is kubespray has running CI without us cutting any other CI off, so we don't have to choose between projects yet.
This is a tricky topic, we have a lot of jobs that aren't really "benefitting" a single SIG.
To that point, the cloud provider testing is specific to a particular vendor ... it's not going to be that simple to dismiss categories of testing. We have similar compat testing with cri-o and containerd. Ideally the project should select testing that benefits broadly but we do have to run with actual implementations eventually. |
So, I think we can run kubespray CI on prow, but it remains an open question how best to enable the test environments you need and how much we can afford. I don't think that's kubevirt: we use managed k8s clusters because we have limited bandwidth to maintain these things, and nested virt isn't enabled. We can start with something small like unit tests so they can get familiar with prow and we don't need to worry too much about the resources needed for that. For e2e testing: when other projects spin up external assets they do so by renting resources through https://github.com/kubernetes-sigs/boskos, typically through integration in https://github.com/kubernetes-sigs/kubetest2, to ensure that they will be automatically cleaned up if the CI job is abruptly terminated or otherwise fails to clean up after itself. This aspect is pretty important, I'd ask that we make sure boskos is used if / when e2e tests are set up. CAPI, kops, Kubernetes etc use this. |
aside re: freeing up resources for CI etc ... we dug into our expenditures in the bi-weekly k8s infra call yesterday and the main outcome is going on here kubernetes/k8s.io#6165 I think we can easily run things like build/unit test/lint on prow already but it will take more work to set up a suitable environment for the e2e tests. We haven't used packet/equinix from prow before but that might be an option for running essentially the same e2e environment. What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this? We probably need to discuss options more between k8s infra and sig testing calls. |
Yes, it would work I think. I don't know prow well enough to know what would need to change, if anything.
if there's an equivalent of gitlab-runner for prow (prow-runner?) deployed on that cluster, then it would use the resources that kubespray already has without adding load/expenses on k8s-infra. As a nice-to-have, maybe steps 1, 2 and 5 could be handled by prow so it's easily reproduced by all projects (to create kubevirt VMs). In any case it's not a blocker. |
What if we ran a build cluster on equinix w/ kubevirt? Would the kubespray team be up for maintaining this?
I don't know about the other kubespray contributors, but I could participate. I have some dedicated time for upstream work + the down-time,
and my main occupation is maintaining clusters anyway.
|
In that case, would the same constraints (mainly, moving stuff to periodics) apply ? |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale
|
I don't think we have transparent budget info or credentials for equinix in SIG K8s infra currently so it's hard to say, AFAIK that's similarly ~CNCF, like your current gitlab instance, rather than Kubernetes owned/managed. cc @dims who has the only Kubernetes related equinix infra I've previously seen (cs.k8s.io, a single machine AFAIK). |
(We are still planning the migration of prow control plane to k8s infra this year, amongst other things, I'm personally a bit over-extended WRT k8s infra but I'm not the only lead, I know Arnaud is out for a while currently) |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale We've ~all but completed the CI migration, we need to rotate the log bucket and we have the same issue for release binaries behind fastly. Both are in progress, the preliminary work is in place but there are a lot of lingering references. We should be starting to get a stable idea of where our usage is at. We now have spend reporting for GCP, AWS, Fastly, Digital Ocean and Azure, with budgets known except DO. I think we still have a very small presence on equinix currently, just cs.k8s.io, which is one VM in @dims's hands, so it's not really actively tracked yet. Things have settled quite a bit and we should really revisit this. We create and dispose of a LOT of VMs on GCP and AWS every day using projects/accounts rented from https://github.com/kubernetes-sigs/boskos I would still recommend exploring a gradual transition with lighter and simpler workloads first, making sure the merge robot and so on are working, and we can continue to explore e2e testing in parallel. For the most part we're looking at projects to file in https://github.com/kubernetes/k8s.io for bespoke infra needs, for basic CI jobs #sig-testing can help, slack is a good bit more active than this issue tracker at the moment. |
Thanks for the update and info 👍
To give one from the kubespray side:
There has been a lot of work on our CI, and other improvements are planned, but it currently looks like we will use **more** features of our current setup (gitlab-ci features + the fact we're running the tests and the provisioned Kubevirt VMs in the same cluster), which I think will make a migration to prow less desirable.
For now, I think we can freeze this, if that's ok with you, and see where we are once our CI reworkings have settled a bit.
/lifecycle frozen
|
Can you tell us more about this? We would like to know more about this and see how prow fits in this. We recently helped etcd project adopt prow and we enabled the use cases they needed for a successful migration. |
Sure.
Compared to last time, the following has changed:
- We now have working /retest /retest-failed and /ok-to-test (thanks to @ant31's work on failfast). There are some hiccups though 🤔
- the pipelines have been redesigned to use gitlab-ci `needs` rather than stages (aka, it's a DAG of jobs)
- we have three "levels" of testing using labels ('ci-short|extended|full') -> still needs some stuff to configure, notably the label plugin.
- some cleanups of obsolete stuff
(The last three have made the pipelines much faster)
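To illustrate why a DAG of jobs helps with failing early, here is a minimal sketch of dependency-ordered scheduling that skips every job downstream of a failure. It is not how gitlab-ci implements `needs`; the job names and the `plan` helper are hypothetical and only show the idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job graph: each job maps to the set of jobs it needs.
NEEDS = {
    "lint": set(),
    "unit": set(),
    "deploy-default": {"lint", "unit"},
    "deploy-matrix": {"deploy-default"},
    "upgrade": {"deploy-default"},
}


def plan(needs, failed):
    """Split jobs into (run, skipped), visiting them in dependency order.

    A job is skipped when any job it needs has failed or was itself skipped.
    Jobs listed in `failed` still appear in `run`: they ran, then failed.
    """
    run, skipped = [], []
    blocked = set(failed)
    for job in TopologicalSorter(needs).static_order():
        if needs.get(job, set()) & blocked:
            blocked.add(job)  # propagate the block to this job's dependents
            skipped.append(job)
        else:
            run.append(job)
    return run, skipped


run, skipped = plan(NEEDS, failed={"deploy-default"})
print("run:", run)
print("skipped:", skipped)
```

With `deploy-default` failing, only the two jobs that need it are skipped, so the expensive matrix never starts. With stages, by contrast, a job waits for the whole previous stage rather than only for the jobs it actually depends on.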
In the works:
- reworking the CI resources cleanup. It's currently racy, which makes PR pipelines flaky. -> kubernetes-sigs/kubespray#11530 (TL;DR: use ownerRefs on kubevirt VMs so they're deleted when the pod running the job is)
- distributed cache. We have had a lot of flakes recently because we use vagrant for some jobs and we're being rate-limited by the vagrant box hosting, probably because we're downloading the same boxes again and again. This should solve that and speed up some jobs as well.
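For readers unfamiliar with the ownerRefs idea mentioned above, here is a minimal sketch of what attaching an owner reference to a VM manifest looks like. It is not taken from the linked PR: the `with_owner_pod` helper, the object names, the UID, and the choice of `VirtualMachine` as the kind are all assumptions for illustration. The `ownerReferences` fields themselves are standard Kubernetes object metadata.

```python
import copy


def with_owner_pod(vm_manifest: dict, pod_name: str, pod_uid: str) -> dict:
    """Return a copy of vm_manifest that lists the given pod as its owner.

    Kubernetes garbage collection deletes a dependent object once its owner
    is gone. The owner must live in the same namespace as the dependent.
    """
    owned = copy.deepcopy(vm_manifest)
    refs = owned.setdefault("metadata", {}).setdefault("ownerReferences", [])
    refs.append(
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": pod_name,
            "uid": pod_uid,  # must be the UID of the live pod, not just its name
        }
    )
    return owned


# Hypothetical manifest and pod identity, for illustration only.
vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "ci-vm-0", "namespace": "ci"},
}
owned_vm = with_owner_pod(
    vm, pod_name="runner-abc", pod_uid="00000000-0000-0000-0000-000000000000"
)
print(owned_vm["metadata"]["ownerReferences"])
```

The appeal of this approach is that cleanup no longer depends on the job reaching its own teardown step: if the job pod is deleted for any reason, the cluster removes the VMs it owns.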
Regarding switching to prow, if we don't consider the work of switching itself, there are pros and cons:
Pros:
- conditional running (==if some files changed). Currently we lose the information in the GitHub-> gitlab integration
- merge pool
Cons:
- no DAG support (AFAIK)
- more generally, .gitlab-ci.yml has more features than prowjobs, I believe. We don't use all of them, but I do think we would need to re-think some stuff to fit in prow.
I'm probably forgetting some stuff, but that is what comes to mind at the moment.
|
Hi,
We're currently evaluating migrating the CI of the kubespray project from gitlab-ci to test-infra, and I have some questions about what we can and cannot do with prow and test-infra, so we can decide whether it can work:
Currently, we're handling jobs in gitlab-ci stages, because some take a lot of time and we try to fail early.
I understand prow does not have a job dependency concept, so I have two possible strategies in mind:
-> any strategy on how to avoid costly jobs we know won't matter because one failed ?
Regarding tekton pipelines:
/test <job-name>
stuff. I suppose it's not possible to restart individual parts of the pipeline as if they were Prowjobs ?
Some of our jobs currently provision kubevirt VMs (https://github.com/kubernetes-sigs/kubespray/blob/master/tests/cloud_playbooks/roles/packet-ci/templates/vm.yml.j2) to test that kubespray runs on them. Is there something in prow/test-infra which can do that for us ? (Didn't find anything but well, it does not hurt to ask.)
Regarding compute resources:
That's a lot of different questions in different directions, but I'm trying to figure things out, so sorry if this is a bit unclear.
Related issue on kubespray : kubernetes-sigs/kubespray#10682
Cc @floryut @ant31 from kubespray
Thanks