
GCS: Create disaster recovery script #334

Closed
wants to merge 1 commit

Conversation

justinsb (Member)

We create an independent copying program that will copy binary
artifacts without overwriting (so that we won't overwrite our
backups).
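
For context, here is a minimal sketch of this kind of no-overwrite copy, assuming the google-cloud-storage Python client. The bucket names, function name, and error handling are illustrative placeholders, not the code proposed in this PR:

```python
# Illustrative sketch only -- not the script proposed in this PR.
# Assumes a reasonably recent google-cloud-storage client; bucket names are placeholders.
import sys
from google.cloud import storage


def backup_bucket(src_name: str, dest_name: str) -> int:
    client = storage.Client()
    src = client.bucket(src_name)
    dest = client.bucket(dest_name)

    dest_blobs = {b.name: b for b in client.list_blobs(dest)}

    mismatches = 0
    for blob in client.list_blobs(src):
        existing = dest_blobs.get(blob.name)
        if existing is None:
            # if_generation_match=0 makes GCS reject the copy if the
            # destination object already exists, so backups are never overwritten.
            src.copy_blob(blob, dest, if_generation_match=0)
        elif existing.md5_hash != blob.md5_hash:
            # Already backed up but the contents differ: report it, never replace it.
            print(f"MISMATCH: {blob.name}", file=sys.stderr)
            mismatches += 1
    return 1 if mismatches else 0


if __name__ == "__main__":
    sys.exit(backup_bucket("example-prod-bucket", "example-backup-bucket"))
```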

@k8s-ci-robot added the cncf-cla: yes and size/M labels on Aug 14, 2019.
@justinsb (Member Author)

/assign @thockin

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign spiffxp
You can assign the PR to them by writing /assign @spiffxp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@thockin (Member) left a comment


README?

How is this expected to be run, and with what permissions?

dest_blobs[blob.name] = blob

# Copy blobs from src to dest, if the blobs don't exist in dest
# If the blob does exist in dest, but the md5 does not match,
Member


shouldn't the receiving bucket have sufficient retention that even a bug in this script can't overwrite?

@justinsb (Member Author) Aug 15, 2019


I think we need both retention and no-replace. Retention prevents us from replacing the file; the comparison stops us from even trying to replace it (which I believe fails even if the contents are the same); but we also need the script to report the mismatch nicely (at the very least, with a non-zero exit code).

But ... perhaps what you're saying is that if we have retention enabled, we can do a gsutil rsync and that will be good enough? That is plausible....
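
As an aside on the "fails even if the contents are the same" point: with the same client library, a copy guarded by a precondition is rejected server-side with HTTP 412 when the destination object already exists, which a script can surface as a non-zero exit code. A hedged sketch (bucket and object names are placeholders):

```python
# Sketch: a copy guarded by if_generation_match=0 fails with 412 Precondition Failed
# when the destination object already exists, even if the bytes are identical.
import sys

from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

client = storage.Client()
src = client.bucket("example-prod-bucket")
dest = client.bucket("example-backup-bucket")
blob = src.blob("release/v9.9.9/example-binary.tar.gz")  # placeholder object name

try:
    src.copy_blob(blob, dest, if_generation_match=0)
except PreconditionFailed:
    # The backup already holds this object; refuse to touch it and fail loudly.
    print(f"refusing to overwrite {blob.name}", file=sys.stderr)
    sys.exit(1)
```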

Member


If I set a retention policy of 5 years, I cannot remove or overwrite the file until that expires (I just tried it manually with GCS).

My question then is: can we / should we rely on this in the backup-bucket to ensure that backups are additive?

Further: we don't seem to set retention on the *prod buckets. Perhaps we should do that, also? Can we think of any reason we wouldn't want to guarantee that community-owned artifacts are available for an extended period of time? We can't do this with GCR tags, but we can set it on the GCS buckets for non-GCR artifacts.

MAYBE we can do it on the bucket for GCR (retaining the image by SHA, if not the tags). ISTR there was a problem (we used to have it and removed it, forget why). Will test.

FURTHER YET: should we lock these retentions? This is risky in case we ever actually legit need to remove an artifact (e.g. a cred leak or something).
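
For reference, a small sketch of setting (and optionally locking) a retention policy with the same Python client; the bucket name and the 5-year period are placeholders rather than an agreed policy, and locking is shown commented out because it is irreversible:

```python
# Sketch: set a ~5-year retention policy on a bucket, then (optionally) lock it.
# Bucket name and period are placeholders, not an agreed policy.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-backup-bucket")

# retention_period is expressed in seconds.
bucket.retention_period = 5 * 365 * 24 * 60 * 60
bucket.patch()

# Locking is permanent: the policy can no longer be reduced or removed, and
# objects cannot be deleted or overwritten until their retention period expires.
# bucket.lock_retention_policy()
```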

Member


And yes, at that point a simple additive rsync may be sufficient. With a different cred controlling the two buckets, it sounds pretty reasonable, yeah?

@justinsb (Member Author) Aug 21, 2019


Yes, I'm obviously slightly disappointed in myself for not seeing this before I wrote the script, but I believe you're right. We should set a long retention period and just use gsutil rsync -c. (The -c is optional when both are gs:// URLs.)

We do need to set -C so that one changed file doesn't stop the others. But then I think that's it (as you pointed out). We do need to watch for errors in the output and make sure that we don't just turn off the job instead of fixing it.

As for the other topics, yes I think we should set retention on the prod buckets, and we should set retention on the backup buckets. I actually think we should lock the retention on the backup buckets (so we know they are always there).

As for setting retention on the prod buckets, I'm not sure. Presumably we expect these to be mirrored over time, so it's not clear we actually could delete an artifact. Someone changing an artifact could be pretty damaging until we corrected it (which we would do automatically, but not immediately). We should document that consumers should check the hashes of all artifacts (this lets us use mirrors), but this bucket will likely be the canonical source of those hashes for the immediate future.

Because of that, I think we should lock it. But we should try swapping out the backing bucket to make sure we can do it without "too much" disruption if we have to.

With a locked retention policy, disaster in this case means somebody uploading an additional file: either the next version of k8s, maybe some credential-stealing JS (but we shouldn't authn to artifacts.k8s.io), or some content designed to cause embarrassment. We would want to remove that, but we wouldn't be compromising the integrity of the artifacts themselves. I'd imagine the biggest risk would be uploading a new binary and then sending out a CVE notification with a link to it to various mailing lists. I think some form of offline signing is the only realistic way of protecting against that, but there are plenty of existing solutions there, e.g. in OS packages.

Member


The only concern would be if an artifact was uploaded that had such a security flaw that it should never be used. I don't care if we maintain a backup of that artifact, but deleting it from the prod bucket seems like an option we should keep (especially if we have backups).

@listx (Contributor) commented Aug 16, 2019

/cc

@dims (Member) commented Aug 20, 2019

/assign @listx
/uncc

@k8s-ci-robot removed the request for review from dims on August 20, 2019 at 22:18.
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 19, 2019.
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 19, 2019.
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
