
Restic podvolumerestore failure not causing restore to fail #2121

Closed
dymurray opened this issue Dec 11, 2019 · 6 comments · Fixed by #2201
Labels: Bug, Needs investigation, Restic (Relates to the restic integration)

@dymurray (Contributor)

What steps did you take and what happened:
A restore was created that attempted to write 1 GB of data from restic to a 1 MB volume. Restic errored out when it ran out of space to write data to, and the podvolumerestore failed. However, the associated Velero restore ended in a Completed state even though the restic restore had failed.

[dymurray@pups ~]$ velero describe restore 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb -n openshift-migration --details
Name:         050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
Namespace:    openshift-migration
Labels:       app.kubernetes.io/part-of=openshift-migration
              migmigration=0525ece0-1c18-11ea-8753-02bddcc5316e
              migration-stage-restore=0525ece0-1c18-11ea-8753-02bddcc5316e
Annotations:  openshift.io/migrate-copy-phase=final
              openshift.io/migration-registry=172.30.63.96:5000
              openshift.io/migration-registry-dir=/bz-reproduce-registry-e3380ad4-1c17-11ea-8753-02bddcc5316e

Backup:  050d6c20-1c18-11ea-9119-372282f2cae3-jxvcq

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  true

Phase:  Completed

Validation errors:  <none>

Warnings:  <none>
Errors:    <none>

Restic Restores:
  Failed:
    ocp-24659-mysql/mysql-1-8msj8-stage: mysql-data

What did you expect to happen:
I expected the Velero restore to fail.
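For clarity, here is roughly the behavior I expected, as a minimal Go sketch. This is not Velero's actual controller code; the phase names are assumptions based on what `velero describe restore` prints:

package main

import "fmt"

// Assumed phase names, mirroring what `velero describe restore` prints.
const (
	phaseCompleted       = "Completed"
	phasePartiallyFailed = "PartiallyFailed"
)

// decidePhase folds pod volume restore errors into the final phase:
// any failed volume should downgrade the restore from Completed.
func decidePhase(podVolumeErrs []error) string {
	if len(podVolumeErrs) > 0 {
		return phasePartiallyFailed
	}
	return phaseCompleted
}

func main() {
	errs := []error{fmt.Errorf("mysql-data: no space left on device")}
	fmt.Println(decidePhase(errs)) // PartiallyFailed, not Completed
}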

Output of the failed podvolumerestore:

[dymurray@pups ~]$ oc get podvolumerestores -o yaml 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb-v8qld 
apiVersion: velero.io/v1
kind: PodVolumeRestore
metadata:
  creationTimestamp: 2019-12-11T13:14:21Z
  generateName: 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb-
  generation: 3
  labels:
    velero.io/pod-uid: 241be6be-1c18-11ea-9b29-06135bb47b3a
    velero.io/restore-name: 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
    velero.io/restore-uid: 239e691b-1c18-11ea-8bf6-0aae074dc220
  name: 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb-v8qld
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Restore
    name: 050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
    uid: 239e691b-1c18-11ea-8bf6-0aae074dc220
  resourceVersion: "663274"
  selfLink: /apis/velero.io/v1/namespaces/openshift-migration/podvolumerestores/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb-v8qld
  uid: 2549751e-1c18-11ea-9b29-06135bb47b3a
spec:
  backupStorageLocation: migstorage-default-b4g7n
  pod:
    kind: Pod
    name: mysql-1-8msj8-stage
    namespace: ocp-24659-mysql
    uid: 241be6be-1c18-11ea-9b29-06135bb47b3a
  repoIdentifier: s3:http://ae7b5152f1b6511ea97df06135bb47b3-1758046785.us-east-2.elb.amazonaws.com/mig-storage/velero/restic/ocp-24659-mysql
  snapshotID: 571a36a8
  volume: mysql-data
status:
  completionTimestamp: 2019-12-11T13:14:45Z
  message: 'error restoring volume: error creating .velero directory for done file:
    mkdir /host_pods/241be6be-1c18-11ea-9b29-06135bb47b3a/volumes/kubernetes.io~csi/pvc-2401a0d1-1c18-11ea-9b29-06135bb47b3a/mount/.velero:
    no space left on device'
  phase: Failed
  startTimestamp: 2019-12-11T13:14:42Z
@skriss (Contributor) commented Dec 11, 2019

@dymurray do you have the log for the restore as well?

skriss added the Bug and Restic (Relates to the restic integration) labels on Dec 11, 2019
@dymurray (Contributor, Author)

@skriss Yup, here is the relevant portion of the restore logs. Note that the controller logs "restore completed" immediately after the pod volume restore error:

time="2019-12-11T13:14:19Z" level=info msg="Adding PVC ocp-24659-mysql/mysql as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pvc_from_pod_action.go:58" pluginName=velero restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Skipping persistentvolumeclaims/ocp-24659-mysql/mysql because it's already been restored." logSource="pkg/restore/restore.go:859" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:933" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                         
time="2019-12-11T13:14:19Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:933" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                         
time="2019-12-11T13:14:19Z" level=info msg="Executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:68" pluginName=velero restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Restic backups for pod found" cmd=/velero logSource="pkg/restore/restic_restore_action.go:94" pluginName=velero pod=ocp-24659-mysql/mysql-1-8msj8-stage restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=warning msg="Plugin config contains unparseable image name" cmd=/velero image="image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-velero-restic-restore-helper-rhel8@sha256:35390b9210d520c8a3944662f06d7e711911ac96d6937d103826629cca37bef0" logSource="pkg/restore/restic_restore_action.go:171" pluginName=velero pod=ocp-24659-mysql/mysql-1-8msj8-stage restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Using image \"gcr.io/heptio-images/velero-restic-restore-helper:latest\"" cmd=/velero logSource="pkg/restore/restic_restore_action.go:105" pluginName=velero pod=ocp-24659-mysql/mysql-1-8msj8-stage restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=error msg="Using default resource values, couldn't parse resource requirements: couldn't parse CPU request \"\": quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'." cmd=/velero logSource="pkg/restore/restic_restore_action.go:112" pluginName=velero pod=ocp-24659-mysql/mysql-1-8msj8-stage restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Done executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:142" pluginName=velero restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:933" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                         
time="2019-12-11T13:14:19Z" level=info msg="[common-restore] Entering common restore plugin" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/vendor/github.com/fusor/openshift-velero-plugin/velero-plugins/common/restore.go:22" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                                                        
time="2019-12-11T13:14:19Z" level=info msg="[common-restore] common restore plugin for mysql-1-8msj8-stage" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/vendor/github.com/fusor/openshift-velero-plugin/velero-plugins/common/restore.go:29" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                                         
time="2019-12-11T13:14:19Z" level=info msg="[GetRegistryInfo] value from imagestream" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/vendor/github.com/fusor/openshift-velero-plugin/velero-plugins/common/shared.go:29" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                                                                
time="2019-12-11T13:14:19Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:933" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                         
time="2019-12-11T13:14:19Z" level=info msg="[pod-restore] Entering Pod restore plugin" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/velero-plugins/migpod/restore.go:29" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="[pod-restore] pod: mysql-1-8msj8-stage" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/velero-plugins/migpod/restore.go:34" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:19Z" level=info msg="[util] container image ref registry.access.redhat.com/rhel7" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/vendor/github.com/fusor/openshift-velero-plugin/velero-plugins/common/util.go:92" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                                               
time="2019-12-11T13:14:19Z" level=info msg="[util] container image ref gcr.io/heptio-images/velero-restic-restore-helper:latest" cmd=/plugins/velero-plugins logSource="/go/src/github.com/fusor/openshift-migration-plugin/vendor/github.com/fusor/openshift-velero-plugin/velero-plugins/common/util.go:92" pluginName=velero-plugins restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb                       
time="2019-12-11T13:14:19Z" level=info msg="Attempting to restore Pod: mysql-1-8msj8-stage" logSource="pkg/restore/restore.go:1031" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb        
time="2019-12-11T13:14:45Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error creating .velero directory for done
file: mkdir /host_pods/241be6be-1c18-11ea-9b29-06135bb47b3a/volumes/kubernetes.io~csi/pvc-2401a0d1-1c18-11ea-9b29-06135bb47b3a/mount/.velero: no space left on device" logSource="pkg/restore/restore.go:1126" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb
time="2019-12-11T13:14:45Z" level=info msg="restore completed" logSource="pkg/controller/restore_controller.go:465" restore=openshift-migration/050d6c20-1c18-11ea-9119-372282f2cae3-9c4pb 

@skriss (Contributor) commented Jan 15, 2020

@dymurray Just coming back to this: I think there is potentially a race condition in some of the code that collects errors from the restic restore operations and propagates them back to the main Velero restore. I'll look at fixing that.
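To illustrate the kind of race I have in mind, here is a simplified Go sketch; this is not the actual Velero code, and every function and variable name in it is made up:

package main

import (
	"fmt"
	"sync"
)

// Stand-in for the real restic restore of a single volume; it always
// fails here so the error-propagation paths are exercised.
func restoreOneVolume(vol string) error {
	return fmt.Errorf("restoring %s: no space left on device", vol)
}

// Racy pattern: goroutines append to a shared slice with no
// synchronization, so when many volumes restore concurrently some
// errors can be dropped and the caller may see none at all.
func restoreVolumesRacy(volumes []string) []error {
	var errs []error
	var wg sync.WaitGroup
	for _, v := range volumes {
		wg.Add(1)
		go func(vol string) {
			defer wg.Done()
			if err := restoreOneVolume(vol); err != nil {
				errs = append(errs, err) // data race on errs
			}
		}(v)
	}
	wg.Wait()
	return errs
}

// Fixed pattern: a mutex guards the shared slice so every failure is
// recorded and propagated back to the main restore.
func restoreVolumesSafe(volumes []string) []error {
	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)
	for _, v := range volumes {
		wg.Add(1)
		go func(vol string) {
			defer wg.Done()
			if err := restoreOneVolume(vol); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(v)
	}
	wg.Wait()
	return errs
}

func main() {
	fmt.Println(len(restoreVolumesRacy([]string{"a", "b", "c"}))) // may undercount
	fmt.Println(len(restoreVolumesSafe([]string{"a", "b", "c"}))) // always 3
}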

@dymurray (Contributor, Author)

@skriss Ah, interesting, thank you very much. I had been going down the path of trying to reproduce this by running out of space on the restic volume, but I think you may be on to something. I will help review your PR.

@dymurray (Contributor, Author)

Just want to note that I cannot confirm this PR solved my issue above. I will do some testing and reopen if I find the issue is still valid.

@skriss (Contributor) commented Jan 22, 2020

bummer -- will keep an eye out for more info
