
GCS: Create disaster recovery script #334

Closed
wants to merge 1 commit

Conversation

justinsb (Member)

We create an independent copying program that will copy binary
artifacts without overwriting (so that we won't overwrite our
backups).
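
For context, here is a minimal sketch of this kind of no-overwrite copy, assuming the google-cloud-storage Python client. The bucket names, function name, and error handling are illustrative placeholders, not the code proposed in this PR:

```python
# Illustrative sketch only -- not the script proposed in this PR.
# Assumes a reasonably recent google-cloud-storage client; bucket names are placeholders.
import sys
from google.cloud import storage


def backup_bucket(src_name: str, dest_name: str) -> int:
    client = storage.Client()
    src = client.bucket(src_name)
    dest = client.bucket(dest_name)

    dest_blobs = {b.name: b for b in client.list_blobs(dest)}

    mismatches = 0
    for blob in client.list_blobs(src):
        existing = dest_blobs.get(blob.name)
        if existing is None:
            # if_generation_match=0 makes GCS reject the copy if the
            # destination object already exists, so backups are never overwritten.
            src.copy_blob(blob, dest, if_generation_match=0)
        elif existing.md5_hash != blob.md5_hash:
            # Already backed up but the contents differ: report it, never replace it.
            print(f"MISMATCH: {blob.name}", file=sys.stderr)
            mismatches += 1
    return 1 if mismatches else 0


if __name__ == "__main__":
    sys.exit(backup_bucket("example-prod-bucket", "example-backup-bucket"))
```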

@k8s-ci-robot added the cncf-cla: yes and size/M labels on Aug 14, 2019.
@justinsb (Member Author)

/assign @thockin

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign spiffxp
You can assign the PR to them by writing /assign @spiffxp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@thockin (Member) left a comment


README?

How is this expected to be run, and with what permissions?

dest_blobs[blob.name] = blob

# Copy blobs from src to dest, if the blobs don't exist in dest
# If the blob does exist in dest, but the md5 does not match,
Member


shouldn't the receiving bucket have sufficient retention that even a bug in this script can't overwrite?

@justinsb (Member Author) Aug 15, 2019


I think we need both retention and no-replace. Retention prevents us from replacing the file; the comparison stops us from even trying to replace it (which I believe fails even if the contents are the same); but we also need the script to report the mismatch nicely (at the very least, with a non-zero exit code).

But ... perhaps what you're saying is that if we have retention enabled, we can do a gsutil rsync and that will be good enough? That is plausible....
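
As an aside on the "fails even if the contents are the same" point: with the same client library, a copy guarded by a precondition is rejected server-side with HTTP 412 when the destination object already exists, which a script can surface as a non-zero exit code. A hedged sketch (bucket and object names are placeholders):

```python
# Sketch: a copy guarded by if_generation_match=0 fails with 412 Precondition Failed
# when the destination object already exists, even if the bytes are identical.
import sys

from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

client = storage.Client()
src = client.bucket("example-prod-bucket")
dest = client.bucket("example-backup-bucket")
blob = src.blob("release/v9.9.9/example-binary.tar.gz")  # placeholder object name

try:
    src.copy_blob(blob, dest, if_generation_match=0)
except PreconditionFailed:
    # The backup already holds this object; refuse to touch it and fail loudly.
    print(f"refusing to overwrite {blob.name}", file=sys.stderr)
    sys.exit(1)
```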

Member


If I set a retention policy of 5 years, I cannot remove or overwrite the file until that expires (I just tried it manually with GCS).

My question then is: can we / should we rely on this in the backup-bucket to ensure that backups are additive?

Further: we don't seem to set retention on the *prod buckets. Perhaps we should do that, also? Can we think of any reason we wouldn't want to guarantee that community-owned artifacts are available for an extended period of time? We can't do this with GCR tags, but we can set it on the GCS buckets for non-GCR artifacts.

MAYBE we can do it on the bucket for GCR (retaining the image by SHA, if not the tags). ISTR there was a problem (we used to have it and removed it, forget why). Will test.

FURTHER YET: should we lock these retentions? This is risky in case we ever actually legit need to remove an artifact (e.g. a cred leak or something).
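
For reference, a small sketch of setting (and optionally locking) a retention policy with the same Python client; the bucket name and the 5-year period are placeholders rather than an agreed policy, and locking is shown commented out because it is irreversible:

```python
# Sketch: set a ~5-year retention policy on a bucket, then (optionally) lock it.
# Bucket name and period are placeholders, not an agreed policy.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-backup-bucket")

# retention_period is expressed in seconds.
bucket.retention_period = 5 * 365 * 24 * 60 * 60
bucket.patch()

# Locking is permanent: the policy can no longer be reduced or removed, and
# objects cannot be deleted or overwritten until their retention period expires.
# bucket.lock_retention_policy()
```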

Member


And yes, at that point a simple additive rsync may be sufficient. With a different cred controlling the two buckets, it sounds pretty reasonable, yeah?

@justinsb (Member Author) Aug 21, 2019


Yes, I'm obviously slightly disappointed in myself for not seeing this before I wrote the script, but I believe you're right. We should set a long retention period and just use gsutil rsync -c. (The -c is optional when both are gs:// URLs.)

We do need to set -C so that one changed file doesn't stop the others. But then I think that's it (as you pointed out). We do need to watch for errors in the output and make sure that we don't just turn off the job instead of fixing it.

As for the other topics, yes I think we should set retention on the prod buckets, and we should set retention on the backup buckets. I actually think we should lock the retention on the backup buckets (so we know they are always there).

As for setting retention on the prod buckets, I'm not sure. Presumably we expect these to be mirrored over time, so it's not clear we actually could delete an artifact. Someone changing an artifact could be pretty damaging until we corrected it (which we would do automatically, but not immediately). We should document that consumers should check the hashes of all artifacts (this lets us use mirrors), but this bucket will likely be the canonical source of those hashes for the immediate future.

Because of that, I think we should lock it. But we should try swapping out the backing bucket to make sure we can do it without "too much" disruption if we have to.

With a locked retention policy, disaster in this case means somebody uploading an additional file: either the next version of k8s, maybe some credential-stealing JS (but we shouldn't authn to artifacts.k8s.io), or some content designed to cause embarrassment. We would want to remove that, but we wouldn't be compromising the integrity of the artifacts themselves. I'd imagine the biggest risk would be uploading a new binary and then sending out a CVE notification with a link to it to various mailing lists. I think some form of offline signing is the only realistic way of protecting against that, but there are plenty of existing solutions there, e.g. in OS packages.

Member


The only concern would be if an artifact was uploaded that had such a security flaw that it should never be used. I don't care if we maintain a backup of that artifact, but deleting it from the prod bucket seems like an option we should keep (especially if we have backups).

@listx (Contributor) commented Aug 16, 2019

/cc

@dims (Member) commented Aug 20, 2019

/assign @listx
/uncc

@k8s-ci-robot removed the request for review from dims on August 20, 2019 at 22:18.
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 19, 2019.
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 19, 2019.
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
