
pangeo-hubs: recover prometheus-metrics #2688

Closed · consideRatio opened this issue Jun 21, 2023 · 14 comments
Labels: Task (Actions that don't involve changing our code or docs.)

@consideRatio (Member) commented Jun 21, 2023

While trying to get prometheus-server to start successfully (it was hitting OOMKilled issues, and we couldn't request more memory without changing the k8s node type), I removed a PV and a PVC resource in the pangeo-hubs k8s cluster. As a result, a new persistent disk was provisioned and associated with the new PV.

The PV deleted was configured with Retain as a policy, which means that deleting the PV doesn't delete the actual disk associated with the PV.

As suggested by @yuvipanda in #2366 (comment), we want to remedy this deletion: we want a PVC and a PV where the PV references the previous disk, not the new one. To do this, though, we would need to know the ID of the previous disk etc., but I lack access to pangeo-hubs and can't do it.
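For reference, a minimal sketch of a manually created PV pointing at a pre-existing GCP Persistent Disk; the PV name, capacity, storage class, and disk name below are placeholders, not the real values:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server-recovered    # placeholder name
spec:
  capacity:
    storage: 100Gi                      # placeholder; must match the old disk's size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard            # placeholder; must match the PVC's storage class
  gcePersistentDisk:
    pdName: <previous-disk-name>        # the GCP PD holding the old prometheus data
    fsType: ext4

A PVC could then bind to this PV explicitly via spec.volumeName, or the PV's spec.claimRef could be pointed at the chart-created PVC.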

@consideRatio added the Task label (Actions that don't involve changing our code or docs) on Jun 21, 2023
@consideRatio (Member, Author):

I think we may have gotten the same disk provisioned back to us, because prometheus-server fails to start up once again. Either the previous disk got coupled to the new PV, or the new disk got populated with so much data in a short period of time that it fails to start again.

@consideRatio changed the title from "Recover prometheus-metrics in pangeo-hubs" to "pangeo-hubs: recover prometheus-metrics" on Jun 22, 2023
@yuvipanda removed their assignment on Jun 23, 2023
@yuvipanda (Member):

Hopefully this is the old disk and we can recover the data? I think prometheus data, especially for an in-use cluster, should be considered user data and preserved. We should be extremely careful about deleting it, and avoid doing so to the extent possible.

[screenshot]

@consideRatio it's definitely not the same disk. I don't know if the startup problems are related to the disk, but my intuition is that they are not.

@yuvipanda (Member):

I mounted this disk onto a pod with the following yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  volumes:
    - name: volume
      gcePersistentDisk:
        pdName: gke-pangeo-hubs-cluste-pvc-55aae527-9546-4cbc-8efc-df47046272ab
        fsType: ext4
  containers:
    - name: shell
      image: ubuntu/nginx:latest
      volumeMounts:
        - mountPath: /data
          name: volume
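To apply this and inspect the mounted disk, something like the following should work (assuming the manifest is saved as test-pod.yaml and the kubectl context points at the pangeo-hubs cluster):

kubectl apply -f test-pod.yaml
kubectl exec -it test -- ls -la /data
kubectl delete pod test   # clean up when done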

Unfortunately, this is a leftover disk from a PVC from a long time ago (Aug 2021), and I think we have lost all the metrics :(

root@test:/data# ls -la
total 36
drwxrwsr-x 5 root   nogroup  4096 Aug 23  2021 .
drwxr-xr-x 1 root   root     4096 Jun 24 00:51 ..
drwxr-sr-x 2 nobody nogroup  4096 Aug 23  2021 chunks_head
-rw-r--r-- 1 nobody nogroup     0 Aug 23  2021 lock
drwxrws--- 2 root   nogroup 16384 Aug 23  2021 lost+found
-rw-r--r-- 1 nobody nogroup 20001 Aug 23  2021 queries.active
drwxr-sr-x 2 nobody nogroup  4096 Aug 23  2021 wal

@yuvipanda (Member):

> The PV deleted was configured with Retain as a policy, which means that deleting the PV doesn't delete the actual disk associated with the PV.

Unfortunately this is not quite right: it only determines whether the PV is deleted when the PVC is deleted. If the PV itself is deleted, then the underlying disk is still deleted. https://kubernetes.io/docs/concepts/storage/persistent-volumes/#retain
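For reference, the reclaim policy of an existing PV can be checked and changed with kubectl (the PV name here is a placeholder):

kubectl get pv <pv-name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'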

@consideRatio (Member, Author) commented Jun 24, 2023

Are you sure? Looking at the linked Retain docs, my understanding is that the GCP persistent disk created for the PV is retained when the PV is deleted.

I've manually reused a GCP PD previously used by another PV, and I figure Retain was the trick to this.

[screenshot]

@yuvipanda (Member) commented Jun 26, 2023

@consideRatio are you sure the old PV had Retain set in that case? The only unattached PV that was created for prometheus (as seen from the description {"kubernetes.io/created-for/pv/name":"pvc-648a9d92-a57b-487f-becb-353b4971106e","kubernetes.io/created-for/pvc/name":"data-support-nfs-server-provisioner-0","kubernetes.io/created-for/pvc/namespace":"support"}) in the project is the one I looked at, and the mtime there is all too old.

@consideRatio (Member, Author):

@yuvipanda hmmm, I'm almost 100% sure that the PV I deleted had "retain" on it, which I believe implies that the storage asset (a GCP Persistent Disk in this case) should still be around for a new PV resource to be associated with.

My understanding is that:

  1. (long time ago) Our support chart, via its dependency on the prometheus chart, created a PVC resource for prometheus-server, where the PVC had some policy about "retain"
  2. (long time ago) When k8s observed the PVC, some k8s controller handling PVC resources and StorageClass resources perhaps created a PV resource, getting the "retain" policy from the PVC
  3. (long time ago) Some GCP-associated controller in k8s saw a PV using a GCP-associated StorageClass and created a GCP Persistent Disk to associate with the PV
  4. I deleted the PVC and PV, and some GCP-aware controller opted not to delete the associated GCP PD because of "retain"

So I figure we should look for the GCP PDs via console.cloud.google.com and, if we find one, manually create a k8s PV to associate with the GCP PD by mimicking how things look for other PDs, and then finally let a chart create a PVC which may get bound to the PV we created manually.

I've done something like this before, but I'm not confident on the details.
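For reference, unattached disks in the project could be listed with gcloud, which should surface the old prometheus disk if it still exists (the project ID below is a placeholder):

gcloud compute disks list --project <gcp-project-id> --filter="-users:*"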

Do you see a GCP PD that could be the one previously associated with the k8s PV I deleted?

@yuvipanda (Member):

[screenshot]

These are all the disks in the project.

@yuvipanda (Member):

Also, across all our clusters, the current PV for prometheus has reclaimPolicy set to Delete, not Retain.

@yuvipanda (Member):

> Do you see a GCP PD that could be the one previously associated with the k8s PV I deleted?

Nope, I do not. I only see #2688 (comment), explored in #2688 (comment)

@yuvipanda (Member):

Opened #2717 to make sure that all prometheus PVs are marked as 'retain'

@yuvipanda (Member):

#2718 addresses making all current PVs 'retain'.

@damianavila (Contributor):

> Nope, I do not.

So I presume we have lost the data, meaning it is unrecoverable by now, and we should close this one without investing more resources into it. Further thoughts?

@yuvipanda removed their assignment on Jul 18, 2023
@yuvipanda (Member):

This is unfortunately now a lost cause :(
