jobs/garbage-collection: add containers #1029
Force-pushed from d0a47ba to 94a7d95.
I guess the next step after this is actually defining GC policies for containers.
Just ran this as a temporary test job in the pipeline:
Force-pushed from 4cae8b1 to ff5e453.
Force-pushed from ff5e453 to a72aea8.
Add container tags to the garbage collection job. These can run in parallel without issues.
Pruning dev streams after 2 weeks and stable streams after 2 months, as decided in coreos/fedora-coreos-tracker#1367 (comment).
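To make those windows concrete, here is a minimal sketch of how the decided policy could be expressed and checked. The stream classification, the list of dev streams, and the helper names below are assumptions for illustration, not the actual pipeline code.

```groovy
// Sketch only: the agreed retention windows (2 weeks for dev streams,
// 2 months for stable streams), applied to a container tag's build time.
import java.time.Duration
import java.time.Instant

Duration retentionFor(String stream) {
    // illustrative classification; the real list of dev streams may differ
    def devStreams = ['next-devel', 'testing-devel', 'rawhide', 'branched']
    return devStreams.contains(stream) ? Duration.ofDays(14) : Duration.ofDays(60)
}

boolean tagIsPrunable(String stream, Instant tagCreated) {
    def cutoff = Instant.now().minus(retentionFor(stream))
    return tagCreated.isBefore(cutoff)
}

// example: a three-week-old tag on a dev stream is past its window
assert tagIsPrunable('testing-devel', Instant.now().minus(Duration.ofDays(21)))
```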
Force-pushed from a72aea8 to 270fcc9.
I'm wondering if we should incorporate the container-prune and cloud-prune code together and also update the builds.json when we've pruned container images. This is more heavyweight, but it brings the GC into alignment (i.e. operating on each build). The OSTree pruning that we do today, which I set up a while back, will probably go away eventually, so I don't think we need to worry about making that process align.
```groovy
lock(resource: "builds-json-${params.STREAM}") {
    stage('Upload Builds JSON') {
        // containers tags and cloud artifacts can be GCed in parallel
        def parallelruns = [:]
```
I'm with @jlebon here. I think we should run things serially, since the job is 100% not time critical. If things fail in either stage of this pipeline, I'm hoping it will be very clear what went wrong, when it went wrong, and what did and did not happen.
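For illustration, a serial arrangement in the same scripted-pipeline style as the snippet above might look roughly like the sketch below. The stage names, the cosa subcommands, and their flags are placeholders, not the exact code in this PR; the only point is that each pruning step becomes its own stage, so a failure stops the job at an obvious place.

```groovy
// Sketch of the serial alternative to the parallelruns map.
lock(resource: "builds-json-${params.STREAM}") {
    stage('Prune Cloud Artifacts') {
        // placeholder invocation; the real command and flags may differ
        sh "cosa cloud-prune --stream ${params.STREAM}"
    }
    stage('Prune Container Tags') {
        // runs only if the cloud pruning above succeeded
        sh "cosa container-prune --stream ${params.STREAM}"
    }
    stage('Upload Builds JSON') {
        // persist the updated build metadata only after both steps succeed
        echo "upload the updated builds.json here"
    }
}
```

The only trade-off versus the parallel version is wall-clock time, which, as noted above, is not critical for this job.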
We should probably match the policy used for our OSTree repo for now.
For our production streams we don't prune at all (maybe we should, but this would affect our extended upgrade testing):
For our non-production streams we prune at 90 days:
> For our production streams we don't prune at all (maybe we should, but this would affect our extended upgrade testing):

Ahh, I see you have special handling in the code for barrier releases.
Some backstory on this in coreos/coreos-assembler#3798 (comment). I think for merging to make sense, we would want to flip things around so that the source of truth for pruning container images is also the build metadata, rather than the container repos. Otherwise, they're quite disparate currently. Though actually, at a higher level what we probably want is both. E.g. taking AMIs as an example:
This could be applied to any remote resource, including container registries and other clouds. It ensures that we also eventually GC things we've pushed but which, because of pipeline failures, never got registered in any build metadata. This is the difference between what …
Yes, I agree. We'd basically consult the meta.json and builds.json and prune based on that, rather than using skopeo to gather all the tags in the repo.
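To make the two directions concrete, here is a toy sketch of the distinction between pruning from build metadata and enumerating the registry. Everything in it is invented for illustration: the tag values, both collections, and the idea that a skopeo-based listing is what would fill the registry side.

```groovy
// Invented data: what the registry reports vs. what builds.json/meta.json
// reference. Neither collection comes from a real API call here.
def tagsInRegistry        = ['41.20240101.1.0', '41.20240108.2.0', 'stable'] as Set
def tagsKnownToBuildsJson = ['41.20240108.2.0', 'stable'] as Set

// Metadata-driven pruning works from tagsKnownToBuildsJson (the direction
// described above); "universe" pruning additionally catches tags that were
// pushed but never registered because the pipeline failed partway through.
def orphaned = tagsInRegistry - tagsKnownToBuildsJson
assert orphaned == ['41.20240101.1.0'] as Set
```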
To do it that way we'd have to build a list of AMIs from all streams each time we ran, which would be pretty heavyweight. And we'd also have to consider that we aren't the only ones that create and want to retain AMIs in the account.
Yeah. It's good to periodically try to clean up things that fell through the cracks, but I do think it would make things more complicated.
Quoting @jlebon's message from the original PR to add some context:
I don't really understand why I didn't question this, as …
So @dustymabe recommends that we follow the same policies that are used for all the other images, but different rules were decided in a community meeting (albeit a long time ago): coreos/fedora-coreos-tracker#1367 (comment)
I ended up implementing it anyway, so I think this point is no longer relevant.
Alright, I will go and look at cosa and see what I can come up with.
:) - maybe hold off on making any changes for now. I think we are still discussing what the ideal outcome is. And of course there is "ideal" and there is "already works". Sometimes "already works" wins too.
I think we already do that in the cloud-prune code: https://github.com/coreos/coreos-assembler/blob/main/src/cmd-cloud-prune#L123C31-L123C37
No. If the build doesn't need to be pruned, we break. We don't need to calculate every AMI that exists before we can prune any. I think there are two cases we are talking about here: …
Right now …
For GC, I don't think we should be worried too much about how much time it takes given that it's not time critical; we could also easily lower the cadence if it's a problem?
Yes, definitely. This is relevant for Fedora where we do this. I think it's to everyone's benefit to have resources properly tagged anyway per team/group (which I think we've started doing for FCOS at least recently-ish?). And that's helpful too for querying since most APIs have ways to limit/filter to make it less expensive. Anyway, to be clear I'm not suggesting we pivot to this right away. I think both approaches are "kinda right" and that the ideal IMO is "both" (which I guess means... yeah I think eventually we should probably merge because we'd do this for other resources too). But it's not worth blocking getting GC set up for it.
I also think any restructuring should wait until we look at RHCOS GC requirements too, which are going to be very different from FCOS. (E.g. any AMI that was once listed as a bootimage in the installer cannot be pruned; this is similar in a way to how we cross-check with barrier releases. Also, listing tags in the prod OCP repo is expensive because we use a monorepo approach, so starting from build metadata might be better. But actually... it's very possible that GC in that repo is being taken care of at a higher OCP level anyway and we shouldn't touch it at all.)
It's not necessarily time for me, but waste. If we change it to the "calculate the universe" method, then we literally have to grab every meta.json, for every architecture, for every build in the build history before we can do any pruning. That's pretty wasteful IMO. Now, there are some "cloud resources" that make this much easier (for example, in the … case). Either way, I think what I'm saying is I like the more methodical approach, where we don't have to calculate the universe first.
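As a rough illustration of that methodical direction, the early-exit idea is that once the walk reaches a build still inside the retention window, every newer build is too, so nothing else needs to be fetched. The build list, timestamps, and ordering below are made up; this is not the actual cmd-cloud-prune code.

```groovy
import java.time.Duration
import java.time.Instant

def retention = Duration.ofDays(60)
def cutoff = Instant.now().minus(retention)

// toy data standing in for build metadata, newest first (assumed ordering)
def builds = [
    [id: '41.20250110.2.0', created: Instant.now().minus(Duration.ofDays(5))],
    [id: '41.20241001.2.0', created: Instant.now().minus(Duration.ofDays(120))],
    [id: '40.20240601.2.0', created: Instant.now().minus(Duration.ofDays(240))],
]

for (build in builds.reverse()) {   // walk oldest to newest
    if (build.created.isAfter(cutoff)) {
        // reached a build still inside the window; all newer builds are too,
        // so stop here instead of enumerating the whole universe
        break
    }
    println "would prune remote resources for ${build.id}"
}
```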
Yes, we do tag now, but we don't have that for RHCOS, so we'd have to support one strategy for one and another for the other.
Ehh, "both" is just more work IMO. The only problem with not blocking is that we do do accounting to keep track of what has been pruned, so it would be nice to have that be correct, but I suspect we could just make the code do the right thing if the remote resource doesn't exist, and it wouldn't be that big of a deal.
Yeah, that does make me think: for FCOS we should probably consider barrier releases. Currently we don't prune any production streams, but if we had the ability to skip barrier releases in the pruning, then I think my stance on that would change quickly. For RHCOS and FCOS we could have an exception mechanism built into the GC to ignore specific versions (barriers for FCOS, bootimages in the installer for RHCOS).
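A hypothetical shape for that exception mechanism could be a per-distro deny-list consulted before anything belonging to a build is pruned. Nothing below exists in the pipeline today; the version strings are examples only, not an authoritative list of barriers or installer bootimages.

```groovy
// Illustrative deny-list: versions that GC must never touch, regardless of age.
def neverPrune = [
    fcos : ['36.20220505.3.2', '38.20230609.3.0'] as Set,   // e.g. barrier releases
    rhcos: ['414.92.202310210434-0'] as Set,                 // e.g. installer bootimage
]

def prunable = { String distro, String version ->
    !(neverPrune[distro]?.contains(version) ?: false)
}

assert !prunable('fcos', '36.20220505.3.2')   // protected
assert prunable('fcos', '39.20231101.3.0')    // eligible for GC
```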
True, though we can easily cache this.
Right, I think how we approached this is the right call for a first pass (both for the cloud case and the container case). I'm saying that ideally we extend this in the future.
We could/should add it. But anyway, for RHCOS AMIs specifically, I'm pretty sure we already do some ad-hoc pruning currently, and AFAIK that operation doesn't go through build metadata but through OCP releases. I think there it basically does "(the universe) - (AMIs in OCP releases)" (@mike-nguyen would know more).
Right yeah, all the pruning code on the cloud-prune side should already handle the ENOENT case. So if we ever pivot container-prune, we'd want to retain that too.
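A tiny sketch of that tolerate-already-gone behavior is below; the deleteTag closure is a stub standing in for whatever actually removes a tag, not a real cosa or skopeo call. The idea is simply to treat not-found as already pruned so re-runs stay idempotent.

```groovy
// Stub that pretends the tag is already gone, so we can show the handling.
def deleteTag = { String repo, String tag ->
    throw new FileNotFoundException("${repo}:${tag}")
}

def repo = 'quay.io/fedora/fedora-coreos'   // example repo
def tag  = '38.20230101.1.0'                // example tag

try {
    deleteTag(repo, tag)
} catch (FileNotFoundException e) {
    // the resource was already removed (the ENOENT case): count it as pruned
    // rather than failing the job
    println "tag ${tag} already gone from ${repo}, nothing to do"
}
```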
Hmm, in my mind the goal is to eventually roll out the cloud pruning to production streams, but indeed, looking at coreos/fedora-coreos-tracker#99, we never agreed on any time window for those. Are you saying that even then, we'd want to retain the bootimages from barrier releases? At the risk of stating the obvious, it's only really needed for OSTree commits (hard/awkward) and container images (easy).