Using secrets in env results in pod pending forever #263

Closed
matbun opened this issue Jul 18, 2024 · 3 comments
Labels: bug (Something isn't working), v0.3.x

matbun commented Jul 18, 2024

Short Description of the issue

When using secrets, the pod stays in the Pending state forever. After removing them, the pod is executed correctly.

Environment

interTwin environment on Vega.

Steps to reproduce

Create secrets:

kubectl create secret generic mlflow-server --from-literal=username=XXX --from-literal=password=XXX
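
A quick sanity check that the secret actually exists in the pod's namespace (assuming the default namespace here):

kubectl get secret mlflow-server -o yaml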

The pod I am using:

apiVersion: v1
kind: Pod
metadata:
  name: 3dgan-train
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --ntasks-per-node=1 --nodes=1 --time=00:55:00"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph/hpc/data/st2301-itwin-users/egarciagarcia:/exp_data"
    # slurm-job.vk.io/pre-exec: "singularity pull /ceph/hpc/data/st2301-itwin-users/itwinai_v9.5.sif docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.4"
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - " cd /usr/src/app && itwinai exec-pipeline --print-config \
          --config $CERN_CODE_ROOT/config.yaml \
          --pipe-key training_pipeline \
          -o dataset_location=$CERN_DATA_ROOT \
          -o pipeline.init_args.steps.training_step.init_args.exp_root=$TMP_DATA_ROOT \
          -o logs_dir=$TMP_DATA_ROOT/ml_logs \
          -o distributed_strategy=$STRATEGY \
          -o devices=$DEVICES \
          -o hw_accelerators=$ACCELERATOR \
          -o checkpoints_path=$TMP_DATA_ROOT/checkpoints \
          -o max_samples=$MAX_DATA_SAMPLES \
          -o batch_size=$BATCH_SIZE \
          -o max_dataset_size=$NUM_WORKERS_DL "
    command:
    - /bin/sh
    - -c
    env:
    - name: CERN_DATA_ROOT
      value: "/exp_data"
    - name: CERN_CODE_ROOT
      value: "/usr/src/app"
    - name: TMP_DATA_ROOT
      value: "/exp_data"
    - name: MAX_DATA_SAMPLES
      value: "1000"
    - name: BATCH_SIZE
      value: "512"
    - name: NUM_WORKERS_DL
      value: "4"
    - name: ACCELERATOR
      value: "gpu"
    - name: STRATEGY
      value: "auto"
    - name: DEVICES
      value: "auto"

    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: username
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: password

    image: /ceph/hpc/data/st2301-itwin-users/itwinai_v9.5.sif
    imagePullPolicy: Always
    name: 3dgan-container
    resources:
      limits:
        cpu: "48"
        memory: 150Gi
      requests:
        cpu: "4"
        memory: 20Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeSelector:
    kubernetes.io/hostname: vega-new-vk
  tolerations:
  - key: virtual-node.interlink/no-schedule
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

Logs, stacktrace, or other symptoms

NAME          READY   STATUS    RESTARTS   AGE
3dgan-train   0/1     Pending   0          12m
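
To see why the pod is stuck in Pending, the scheduling events are usually the place to look (output omitted; the pod name is taken from the spec above):

kubectl describe pod 3dgan-train
kubectl get events --field-selector involvedObject.name=3dgan-train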
dciangot changed the title Using secrets results in pod pending forever → Using secrets in env results in pod pending forever on Jul 18, 2024
dciangot added the bug (Something isn't working) and v0.3.x labels on Jul 18, 2024
dciangot (Collaborator) commented:
@Surax98 any idea on how to tackle this?

dciangot added this to the 0.3.0 milestone on Jul 18, 2024
dciangot added the v0.4.x label and removed the bug (Something isn't working) and v0.3.x labels on Jul 18, 2024
dciangot modified the milestones: 0.3.x, 0.4.x on Jul 18, 2024
dciangot added the v0.3.x and bug (Something isn't working) labels and removed the v0.4.x label on Jul 18, 2024
dciangot modified the milestones: 0.4.x, 0.3.x on Jul 18, 2024
Surax98 (Collaborator) commented Aug 1, 2024

@matbun the investigation took a while, since I had to dig into the Virtual Kubelet repository: the issue seems related to their code rather than to the interLink provider itself. Let me explain a bit what's going on.
Let's take this snippet from your pod as an example:

    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: username
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: password

When Secrets or ConfigMaps are used to set ENVs, a specific code path in the Virtual Kubelet package is executed, which retrieves the specified resource. During this phase the resource lookup fails, for a reason still under investigation, probably a permission issue (bad cluster role?). Using ConfigMaps and Secrets as Volumes works as expected, so for the moment you can work around the issue by mounting them as Volumes. By using a ClientSet within the interLink provider I can retrieve everything without problems, which seems a bit odd until I get a much clearer understanding of the issue.
For reference, you can see the executed Virtual Kubelet code at this link; this is the function used to populate ENVs for the container, both for ConfigMaps and Secrets, even though the link points to the Secret case only. Digging into the code takes time, so feel free to help if you want!
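
As a reference for the workaround, here is a minimal sketch of mounting the same secret as a volume instead of ENVs (the mount path and volume name are just examples):

spec:
  containers:
  - name: 3dgan-container
    volumeMounts:
    - name: mlflow-creds
      mountPath: /etc/mlflow-creds
      readOnly: true
  volumes:
  - name: mlflow-creds
    secret:
      secretName: mlflow-server

Each key of the secret then shows up as a file under the mount path (e.g. /etc/mlflow-creds/username and /etc/mlflow-creds/password), which the application can read instead of the ENVs.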

dciangot added a commit that referenced this issue Aug 6, 2024
Starting vk secret informers at bootstrap #263
dciangot (Collaborator) commented Aug 6, 2024

Merged and fixed

dciangot closed this as completed Aug 6, 2024