pangeo-hubs: recover prometheus-metrics #2688
I think maybe we got the same disk provisioned back to us, because prometheus-server fails to start up once again. Either we got the previous disk coupled to the new PV, or we got the new disk so populated with data over a short period of time that it fails to start again.
Hopefully this is the old disk, and we can recover this data? I think prometheus data, especially for an in-use cluster, should be considered user data and preserved. We should be extremely careful about deleting it, and avoid doing so to the extent possible.

@consideRatio it's definitely not the same disk. I don't know if the startup problems are related to the disk, but my intuition is that they are not.
I mounted this disk onto a pod with the following YAML:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  volumes:
    - name: volume
      gcePersistentDisk:
        pdName: gke-pangeo-hubs-cluste-pvc-55aae527-9546-4cbc-8efc-df47046272ab
        fsType: ext4
  containers:
    - name: shell
      image: ubuntu/nginx:latest
      volumeMounts:
        - mountPath: /data
          name: volume
```

Unfortunately, this is a leftover PVC from a long time ago (Aug 2021), and I think we have lost all metrics :(
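(For reference, a sketch of how the mounted disk could then be inspected, assuming `kubectl` access; the pod name `test` comes from the manifest above, and the checks are assumptions about what a live prometheus data directory would look like:)

```bash
# Apply the pod manifest above (saved locally as test-pod.yaml) and wait for it.
kubectl apply -f test-pod.yaml
kubectl wait --for=condition=Ready pod/test --timeout=120s

# Timestamps and sizes tell us whether this is the disk we hoped for:
# a recently used prometheus disk would show freshly written TSDB block
# directories under /data.
kubectl exec test -- ls -l --time-style=long-iso /data
kubectl exec test -- du -sh /data
```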
Unfortunately this is not quite right: the reclaim policy only determines whether the PV is deleted when the PVC is deleted. If the PV itself is deleted, then the underlying disk is still deleted. https://kubernetes.io/docs/concepts/storage/persistent-volumes/#retain
@consideRatio are you sure the old PV had Retain set in that case? The only unattached PV that was created for prometheus (as seen from description…
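(A quick way to answer that question, assuming `kubectl` access, is to list every PV with its reclaim policy and binding; a sketch:)

```bash
# Show each PV's reclaim policy, phase (Bound/Released/Available), and claim.
kubectl get pv -o custom-columns=\
NAME:.metadata.name,\
POLICY:.spec.persistentVolumeReclaimPolicy,\
PHASE:.status.phase,\
CLAIM:.spec.claimRef.name
```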
@yuvipanda hmmm hmmm, I'm almost 100% sure that the PV I deleted had "Retain" on it, which I believe implies that the storage asset, a GCP Persistent Disk in this case, should still be around for a new PV resource to be associated with.

So I figure we should look for the GCP PDs via console.cloud.google.com, and if we find one, create a k8s PV to associate with the GCP PD manually by mimicking how things look for other PDs, and then finally let a chart create a PVC which then may get bound to the PV we manually created. I've done something like this before, but I'm not confident on the details. Do you see a GCP PD that could be the one previously associated with the k8s PV I deleted?
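(A sketch of what that recovery could look like; the project ID, disk name, size, and storage class below are placeholders rather than the real values:)

```bash
# 1. Look for unattached disks that might be the old prometheus disk.
#    "-users:*" matches disks not currently attached to any instance.
gcloud compute disks list \
  --project=<pangeo-hubs-project-id> \
  --filter="-users:*" \
  --format="table(name,sizeGb,zone,creationTimestamp)"

# 2. If a candidate turns up, hand-create a PV pointing at it, mimicking
#    the spec of the dynamically provisioned PVs in the cluster.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server-recovered   # illustrative name
spec:
  capacity:
    storage: 100Gi                    # must match the disk's actual size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard          # must match what the chart's PVC requests
  gcePersistentDisk:
    pdName: <old-disk-name>           # the disk found in step 1
    fsType: ext4
EOF
```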
Also, the current PV for prometheus has its `reclaimPolicy` set to `Delete`, not `Retain`, across all our clusters.
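(The Kubernetes docs linked above describe how to flip that in place; the PV name here is a placeholder:)

```bash
# Change an existing PV's reclaim policy from Delete to Retain,
# so the underlying disk survives deletion of the claim.
kubectl patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```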
Nope, I do not. I only see #2688 (comment), explored in #2688 (comment).
Opened #2717 to make sure that all prometheus PVs are marked as 'retain'.
#2718 addresses making all current PVs 'retain'.
So I presume we have lost the data, meaning it is unrecoverable by now and we should close this one without investing more resources into it. Further thoughts?
This is unfortunately now a lost cause :(
As part of trying to get prometheus-server to start successfully, with OOMKilled issues and an inability to request even more memory without changing the k8s node type, I removed a `PV` and a `PVC` resource in the pangeo-hubs k8s cluster. Doing so, we got a new persistent disk provisioned that became associated with the new PV.

The deleted PV was configured with `Retain` as its reclaim policy, which means that deleting the PV doesn't delete the actual disk associated with the PV.

As suggested by @yuvipanda in #2366 (comment), we want to remedy this deletion. We want to have a PVC and PV where the PV references the previous disk, not the new one. To do this, though, we would need to know the ID of the previous disk etc., but I lack access to pangeo-hubs and can't do it.