
pangeo-hubs: recover prometheus-metrics #2688

Closed · consideRatio opened this issue Jun 21, 2023 · 14 comments
Labels: Task (Actions that don't involve changing our code or docs.)

@consideRatio (Member) commented Jun 21, 2023

While trying to get prometheus-server to start successfully (it was hitting OOMKilled issues, and we couldn't request more memory without changing the k8s node type), I removed a PV and a PVC resource in the pangeo-hubs k8s cluster. As a result, a new persistent disk was provisioned and associated with the new PV.

The PV deleted was configured with Retain as a policy, which means that deleting the PV doesn't delete the actual disk associated with the PV.

As suggested by @yuvipanda in #2366 (comment), we want to remedy this deletion: we want a PVC and a PV where the PV references the previous disk, not the new one. To do this, though, we would need to know the ID of the previous disk etc., but I lack access to pangeo-hubs and can't do it.
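For reference, a minimal sketch of a manually created PV pointing at a pre-existing GCP Persistent Disk; the PV name, capacity, storage class, and disk name below are placeholders, not the real values:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server-recovered    # placeholder name
spec:
  capacity:
    storage: 100Gi                      # placeholder; must match the old disk's size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard            # placeholder; must match the PVC's storage class
  gcePersistentDisk:
    pdName: <previous-disk-name>        # the GCP PD holding the old prometheus data
    fsType: ext4

A PVC could then bind to this PV explicitly via spec.volumeName, or the PV's spec.claimRef could be pointed at the chart-created PVC.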

@consideRatio added the Task label (Actions that don't involve changing our code or docs) on Jun 21, 2023
@consideRatio (Member, Author):

I think we may have gotten the same disk provisioned back to us, because prometheus-server fails to start up once again. Either the previous disk got coupled to the new PV, or the new disk got populated with so much data in a short period of time that it fails to start again.

@consideRatio changed the title from "Recover prometheus-metrics in pangeo-hubs" to "pangeo-hubs: recover prometheus-metrics" on Jun 22, 2023
@yuvipanda removed their assignment on Jun 23, 2023
@yuvipanda (Member):

Hopefully this is the old disk and we can recover the data? I think prometheus data, especially for an in-use cluster, should be considered user data and preserved. We should be extremely careful about deleting it, and avoid doing so to the extent possible.

[screenshot]

@consideRatio it's definitely not the same disk. I don't know if the startup problems are related to the disk, but my intuition is that they are not.

@yuvipanda (Member):

I mounted this disk onto a pod with the following yaml:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  volumes:
    - name: volume
      gcePersistentDisk:
        pdName: gke-pangeo-hubs-cluste-pvc-55aae527-9546-4cbc-8efc-df47046272ab
        fsType: ext4
  containers:
    - name: shell
      image: ubuntu/nginx:latest
      volumeMounts:
        - mountPath: /data
          name: volume
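To apply this and inspect the mounted disk, something like the following should work (assuming the manifest is saved as test-pod.yaml and the kubectl context points at the pangeo-hubs cluster):

kubectl apply -f test-pod.yaml
kubectl exec -it test -- ls -la /data
kubectl delete pod test   # clean up when done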

Unfortunately, this is a leftover disk from a PVC from a long time ago (Aug 2021), and I think we have lost all the metrics :(

root@test:/data# ls -la
total 36
drwxrwsr-x 5 root   nogroup  4096 Aug 23  2021 .
drwxr-xr-x 1 root   root     4096 Jun 24 00:51 ..
drwxr-sr-x 2 nobody nogroup  4096 Aug 23  2021 chunks_head
-rw-r--r-- 1 nobody nogroup     0 Aug 23  2021 lock
drwxrws--- 2 root   nogroup 16384 Aug 23  2021 lost+found
-rw-r--r-- 1 nobody nogroup 20001 Aug 23  2021 queries.active
drwxr-sr-x 2 nobody nogroup  4096 Aug 23  2021 wal

@yuvipanda (Member):

> The PV deleted was configured with Retain as a policy, which means that deleting the PV doesn't delete the actual disk associated with the PV.

Unfortunately this is not quite right: it only determines whether the PV is deleted when the PVC is deleted. If the PV itself is deleted, then the underlying disk is still deleted. https://kubernetes.io/docs/concepts/storage/persistent-volumes/#retain
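For reference, the reclaim policy of an existing PV can be checked and changed with kubectl (the PV name here is a placeholder):

kubectl get pv <pv-name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'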

@consideRatio (Member, Author) commented Jun 24, 2023

Are you sure? Looking at the linked Retain docs, my understanding is that the GCP persistent disk created for the PV is retained when the PV is deleted.

I've manually reused a GCP PD previously used by another PV, and I figure Retain was the trick to this.

[screenshot]

@yuvipanda (Member) commented Jun 26, 2023

@consideRatio are you sure the old PV had Retain set in that case? The only unattached PV that was created for prometheus (as seen from the description {"kubernetes.io/created-for/pv/name":"pvc-648a9d92-a57b-487f-becb-353b4971106e","kubernetes.io/created-for/pvc/name":"data-support-nfs-server-provisioner-0","kubernetes.io/created-for/pvc/namespace":"support"}) in the project is the one I looked at, and the mtime there is all too old.

@consideRatio (Member, Author):

@yuvipanda hmmm, I'm almost 100% sure that the PV I deleted had "retain" on it, which I believe implies that the storage asset (a GCP Persistent Disk in this case) should still be around for a new PV resource to be associated with.

My understanding is that:

  1. (long time ago) Our support chart, via its dependency on the prometheus chart, created a PVC resource for prometheus-server, where the PVC had some policy about "retain"
  2. (long time ago) When k8s observed the PVC, some k8s controller handling PVC resources and StorageClass resources perhaps created a PV resource, getting the "retain" policy from the PVC
  3. (long time ago) Some GCP-associated controller in k8s saw a PV using a GCP-associated StorageClass and created a GCP Persistent Disk to associate with the PV
  4. I deleted the PVC and PV, and some GCP-aware controller opted not to delete the associated GCP PD because of "retain"

So I figure we should look for the GCP PDs via console.cloud.google.com and, if we find one, manually create a k8s PV to associate with the GCP PD by mimicking how things look for other PDs, and then finally let a chart create a PVC which may get bound to the PV we created manually.

I've done something like this before, but I'm not confident on the details.
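For reference, unattached disks in the project could be listed with gcloud, which should surface the old prometheus disk if it still exists (the project ID below is a placeholder):

gcloud compute disks list --project <gcp-project-id> --filter="-users:*"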

Do you see a GCP PD that could be the one previously associated with the k8s PV I deleted?

@yuvipanda (Member):

[screenshot]

These are all the disks in the project.

@yuvipanda (Member):

Also, across all our clusters, the current PV for prometheus has reclaimPolicy set to Delete, not Retain.

@yuvipanda (Member):

> Do you see a GCP PD that could be the one previously associated with the k8s PV I deleted?

Nope, I do not. I only see #2688 (comment), explored in #2688 (comment)

@yuvipanda (Member):

Opened #2717 to make sure that all prometheus PVs are marked as 'retain'

@yuvipanda (Member):

#2718 addresses making all current PVs 'retain'.

@damianavila (Contributor):

> Nope, I do not.

So I presume we have lost the data, meaning it is unrecoverable by now, and we should close this one without investing more resources into it. Further thoughts?

@yuvipanda removed their assignment on Jul 18, 2023
@yuvipanda (Member):

This is unfortunately now a lost cause :(
