GCS: Create disaster recovery script #334
Conversation
We create an independent copying program that will copy binary artifacts without overwriting (so that we won't overwrite our backups).
/assign @thockin
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing an approval comment.
README ?
How is this expected to be run, and with what permissions?
dest_blobs[blob.name] = blob

# Copy blobs from src to dest, if the blobs don't exist in dest
# If the blob does exist in dest, but the md5 does not match,
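The surrounding script isn't shown here, but the compare-and-copy behavior under review might be sketched as follows. This is a minimal sketch, not the PR's actual code: `Blob` is a stand-in for a GCS blob exposing just the `name` and `md5_hash` fields the comparison needs, and `plan_copy` only decides what to do, leaving the actual copying (and error reporting) to the caller.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    # Stand-in for a GCS blob: only the fields the comparison needs.
    name: str
    md5_hash: str


def plan_copy(src_blobs, dest_blobs):
    """Decide what to copy from src to dest: returns (to_copy, mismatched).

    Blobs missing from dest get copied; blobs already in dest with a
    different md5 are reported as mismatches and never overwritten,
    so an existing backup can't be clobbered.
    """
    to_copy, mismatched = [], []
    for name, blob in src_blobs.items():
        dest = dest_blobs.get(name)
        if dest is None:
            to_copy.append(name)
        elif dest.md5_hash != blob.md5_hash:
            mismatched.append(name)
        # md5 matches: already backed up, nothing to do
    return to_copy, mismatched
```

A caller would copy everything in `to_copy` and exit non-zero if `mismatched` is non-empty, which matches the "report the mismatch nicely" requirement raised later in the thread.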
Shouldn't the receiving bucket have sufficient retention that even a bug in this script can't overwrite anything?
I think we need both retention and no-replace. Retention prevents us replacing the file, comparison stops us trying to replace the file (which I believe fails even if the contents are the same), but we also need the script to report the mismatch nicely (at the very least, non-zero exit code).
But ... perhaps what you're saying is that if we have retention enabled, we can do a `gsutil rsync` and that will be good enough? That is plausible.
If I set a retention policy of 5 years, I cannot remove or overwrite the file until that expires (I just tried it manually with GCS).
My question then is: can we / should we rely on this in the backup-bucket to ensure that backups are additive?
Further: we don't seem to set retention on the *prod buckets. Perhaps we should do that, also? Can we think of any reason we wouldn't want to guarantee that community-owned artifacts are available for an extended period of time? We can't do this with GCR tags, but we can set it on the GCS buckets for non-GCR artifacts.
MAYBE we can do it on the bucket for GCR (retaining the image by SHA, if not the tags). ISTR there was a problem (we used to have it and removed it, forget why). Will test.
FURTHER YET: should we lock these retentions? This is risky in case we ever actually legit need to remove an artifact (e.g. a cred leak or something).
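For reference, setting (and irreversibly locking) a retention policy as discussed above might look like this with the `google-cloud-storage` client library. This is a hedged sketch, not part of the PR: the bucket name and the helper names are mine, and the GCS calls assume ambient credentials, so only the pure helper is exercised here.

```python
def retention_seconds(years: int) -> int:
    # GCS retention periods are specified in seconds.
    return years * 365 * 24 * 60 * 60


def set_and_lock_retention(bucket_name: str, years: int) -> None:
    """Set a retention policy on a bucket and lock it.

    Locking is permanent: a locked policy can be extended but never
    shortened or removed, which is exactly the "legit need to remove
    an artifact" risk raised above.
    """
    # Lazy import so the pure helper works without the GCS client installed.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    bucket.retention_period = retention_seconds(years)
    bucket.patch()
    bucket.lock_retention_policy()
```

Whether to actually call `lock_retention_policy()` is the open question in this comment, since it forecloses ever deleting an artifact (e.g. after a cred leak).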
And yes, at that point a simple additive rsync may be sufficient. With a different cred controlling the two buckets, it sounds pretty reasonable, yeah?
Yes, obviously slightly disappointed in myself for not seeing this before I wrote the script, but I believe you're right. We should set large retention, and just use a `gsutil rsync -c`. (The `-c` is optional when they are both gs:// URLs.)
We do need to set `-C` so that one changed file doesn't stop the others. But then I think that's it (as you pointed out). We do need to watch for errors in the output and make sure that we don't just turn off the job instead of fixing it.
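The invocation being discussed might look like the following. The bucket names are placeholders of my own, and the wrapper only builds the argv so that each flag's purpose is explicit; running it (and alerting on a non-zero exit, as the comment above asks) would be the caller's job.

```python
def backup_rsync_cmd(src_bucket: str, dest_bucket: str) -> list[str]:
    """Build the gsutil rsync command for an additive backup.

    -c  compare checksums rather than timestamps (redundant for
        gs:// to gs:// copies, but harmless and explicit)
    -C  continue past per-file errors instead of aborting, so one
        changed file doesn't stop the others
    -r  recurse into "directories"
    Deliberately no -d: never delete from the backup bucket.
    """
    return [
        "gsutil", "rsync", "-c", "-C", "-r",
        f"gs://{src_bucket}",
        f"gs://{dest_bucket}",
    ]
```

With a locked retention policy on the destination, any attempted overwrite fails and surfaces as a per-file error in the rsync output, which is what the job's monitoring would need to catch.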
As for the other topics, yes I think we should set retention on the prod buckets, and we should set retention on the backup buckets. I actually think we should lock the retention on the backup buckets (so we know they are always there).
As for setting retention on the prod buckets, I'm not sure. Presumably we expect these to be mirrored over time, so it's not clear we actually could delete an artifact. Someone changing an artifact could be pretty damaging until we corrected it (which we would do automatically, but not immediately). We should say that you should check the hashes of all artifacts (this lets us use mirrors), but likely this bucket will be the canonical source of those hashes for the immediate future.
Because of that, I think we should lock it. But we should try swapping out the backing bucket to make sure we can do it without "too much" disruption if we have to.
With a locked retention policy, disaster in this case means somebody uploading an additional file: either the next version of k8s, maybe some credential-stealing JS (but we shouldn't authn to artifacts.k8s.io), or some content designed to cause embarrassment. We would want to remove that, but we wouldn't be compromising the integrity of the artifacts themselves. I'd imagine the biggest risk would be uploading a new binary, then sending out a CVE notification with a link to it to various mailing lists. I think some form of offline signing is the only realistic way of protecting against that, but there are plenty of existing solutions there, e.g. in OS packages.
The only concern would be if an artifact was uploaded that had such a security flaw that it should never be used. I don't care if we maintain a backup of that artifact, but deleting it from the prod bucket seems like something we should maintain an option for (especially if we have backups)
/cc
/assign @listx
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.