ci(aks): Katib UAT fail on AKS #893

Closed
orfeas-k opened this issue May 9, 2024 · 6 comments · Fixed by canonical/katib-operators#185
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

orfeas-k commented May 9, 2024

Bug Description

This UAT fails for bundle latest/edge in our CI with `AssertionError: Katib Experiment was not successful.`. Note that the same UAT succeeds for bundle 1.8/stable. Unfortunately, we cannot get detailed logs, since that's a known limitation of how our UATs run.

Example runs: first-k8s-1.28, second-k8s-1.28, k8s-1.26

To Reproduce

Rerun the CI from #892, or just from main for "latest/edge".

Environment

AKS 1.26 and 1.28
Juju 3.1
CKF latest/edge

Model     Controller      Cloud/Region    Version  SLA          Timestamp
kubeflow  aks-controller  aks/westeurope  3.1.8    unsupported  09:26:07Z

App                        Version                  Status  Scale  Charm                    Channel       Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        latest/edge   308  10.0.16.94    no       
argo-controller                                     active      1  argo-controller          latest/edge   468  10.0.100.236  no       
dex-auth                                            active      1  dex-auth                 latest/edge   458  10.0.254.87   no       
envoy                                               active      1  envoy                    latest/edge   183  10.0.245.125  no       
istio-ingressgateway                                active      1  istio-gateway            latest/edge   900  10.0.44.117   no       
istio-pilot                                         active      1  istio-pilot              latest/edge   872  10.0.21.240   no       
jupyter-controller                                  active      1  jupyter-controller       latest/edge   936  10.0.131.139  no       
jupyter-ui                                          active      1  jupyter-ui               latest/edge   856  10.0.90.58    no       
katib-controller                                    active      1  katib-controller         latest/edge   526  10.0.152.253  no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.152.2    no       
katib-db-manager                                    active      1  katib-db-manager         latest/edge   490  10.0.236.4    no       
katib-ui                                            active      1  katib-ui                 latest/edge   501  10.0.92.22    no       
kfp-api                                             active      1  kfp-api                  latest/edge  1244  10.0.176.102  no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.3.211    no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      latest/edge   298  10.0.201.207  no       
kfp-persistence                                     active      1  kfp-persistence          latest/edge  1251  10.0.6.212    no       
kfp-profile-controller                              active      1  kfp-profile-controller   latest/edge  1209  10.0.253.135  no       
kfp-schedwf                                         active      1  kfp-schedwf              latest/edge  1263  10.0.6.119    no       
kfp-ui                                              active      1  kfp-ui                   latest/edge  1246  10.0.221.196  no       
kfp-viewer                                          active      1  kfp-viewer               latest/edge  1276  10.0.137.58   no       
kfp-viz                                             active      1  kfp-viz                  latest/edge  1197  10.0.127.237  no       
knative-eventing                                    active      1  knative-eventing         latest/edge   393  10.0.110.159  no       
knative-operator                                    active      1  knative-operator         latest/edge   368  10.0.145.205  no       
knative-serving                                     active      1  knative-serving          latest/edge   394  10.0.147.68   no       
kserve-controller                                   active      1  kserve-controller        latest/edge   538  10.0.154.87   no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       latest/edge   517  10.0.52.98    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        latest/edge   379  10.0.164.223  no       
kubeflow-roles                                      active      1  kubeflow-roles           latest/edge   207  10.0.205.101  no       
kubeflow-volumes                                    active      1  kubeflow-volumes         latest/edge   279  10.0.83.113   no       
metacontroller-operator                             active      1  metacontroller-operator  latest/edge   280  10.0.153.8    no       
minio                      res:oci-image@1755999    active      1  minio                    latest/edge   306  10.0.52.197   no       
mlmd                                                active      1  mlmd                     latest/edge   174  10.0.188.218  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          latest/edge   371  10.0.125.250  no       
pvcviewer-operator                                  active      1  pvcviewer-operator       latest/edge    74  10.0.97.108   no       
seldon-controller-manager                           active      1  seldon-core              latest/edge   691  10.0.87.195   no       
tensorboard-controller                              active      1  tensorboard-controller   latest/edge   281  10.0.30.201   no       
tensorboards-web-app                                active      1  tensorboards-web-app     latest/edge   269  10.0.24.183   no       
training-operator                                   active      1  training-operator        latest/edge   378  10.0.16.237   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.0.7                  
argo-controller/0*            active    idle   10.244.1.9                  
dex-auth/0*                   active    idle   10.244.0.8                  
envoy/0*                      active    idle   10.244.1.11                 
istio-ingressgateway/0*       active    idle   10.244.1.10                 
istio-pilot/0*                active    idle   10.244.0.9                  
jupyter-controller/0*         active    idle   10.244.1.12                 
jupyter-ui/0*                 active    idle   10.244.1.14                 
katib-controller/0*           active    idle   10.244.1.15                 
katib-db-manager/0*           active    idle   10.244.1.16                 
katib-db/0*                   active    idle   10.244.0.12                 Primary
katib-ui/0*                   active    idle   10.244.1.17                 
kfp-api/0*                    active    idle   10.244.1.18                 
kfp-db/0*                     active    idle   10.244.1.19                 Primary
kfp-metadata-writer/0*        active    idle   10.244.0.13                 
kfp-persistence/0*            active    idle   10.244.0.15                 
kfp-profile-controller/0*     active    idle   10.244.0.16                 
kfp-schedwf/0*                active    idle   10.244.0.18                 
kfp-ui/0*                     active    idle   10.244.1.20                 
kfp-viewer/0*                 active    idle   10.244.0.19                 
kfp-viz/0*                    active    idle   10.244.0.20                 
knative-eventing/0*           active    idle   10.244.0.14                 
knative-operator/0*           active    idle   10.244.0.22                 
knative-serving/0*            active    idle   10.244.0.17                 
kserve-controller/0*          active    idle   10.244.1.25                 
kubeflow-dashboard/0*         active    idle   10.244.1.23                 
kubeflow-profiles/0*          active    idle   10.244.0.24                 
kubeflow-roles/0*             active    idle   10.244.1.21                 
kubeflow-volumes/0*           active    idle   10.244.0.21                 
metacontroller-operator/0*    active    idle   10.244.1.22                 
minio/0*                      active    idle   10.244.1.24  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.1.28                 
oidc-gatekeeper/0*            active    idle   10.244.0.23                 
pvcviewer-operator/0*         active    idle   10.244.0.26                 
seldon-controller-manager/0*  active    idle   10.244.1.26                 
tensorboard-controller/0*     active    idle   10.244.1.27                 
tensorboards-web-app/0*       active    idle   10.244.0.25                 
training-operator/0*          active    idle   10.244.1.29

Relevant Log Output

------------------------------ Captured log call -------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running katib-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[8], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."
AssertionError: Katib Experiment was not successful.
_____________________ test_notebook[training-integration] ______________________

test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/training/training-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running ***os.path.basename(test_notebook)***...")
            output_notebook, _ = ep.preprocess(notebook, ***"metadata": ***"path": "./"***)
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with ***e.ename***: ***e.evalue***")
    
        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Job pytorch-dist-mnist-gloo was not successful.

/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/test_notebooks.py:59: Failed

Additional Context

No response

@orfeas-k orfeas-k added the bug Something isn't working label May 9, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5649.

This message was autogenerated

@orfeas-k
Contributor Author

orfeas-k commented May 30, 2024

Debugging

Deployed a 1.28 AKS cluster with Juju 3.5, using the same configuration as our CI, and ran the Katib UAT. I could not find any really useful logs that indicate the issue:

Experiment logs
╰─$ kl -n test-kubeflow cmaes-example-cmaes-589674bc6f-phff2 --all-containers -f
I0529 12:05:25.927601       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 12:05:56.169166       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654"  name:"momentum" value:"0.7043612817216396" ]
I0529 12:05:56.169233       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099"  name:"momentum" value:"0.6980954001565728" ]
Trial logs
╰─$ kl -n test-kubeflow cmaes-example-hdnn4bl5-jmfrx --all-containers -f
I0529 12:05:57.870325      14 main.go:396] Trial Name: cmaes-example-hdnn4bl5
100.0%12:06:00.184060      14 main.go:139] 
100.0%12:06:00.689822      14 main.go:139] 
100.0%12:06:00.885008      14 main.go:139] 
100.0%12:06:01.057909      14 main.go:139] 
I0529 12:06:01.185685      14 main.go:139] 2024-05-29T12:06:01Z INFO     Train Epoch: 1 [0/60000 (0%)]	loss=2.2980
I0529 12:06:01.606906      14 main.go:139] 2024-05-29T12:06:01Z INFO     Train Epoch: 1 [640/60000 (1%)]	loss=2.2849
I0529 12:06:01.932049      14 main.go:139] 2024-05-29T12:06:01Z INFO     Train Epoch: 1 [1280/60000 (2%)]	loss=2.0193
... # more training
I0529 12:06:30.235188      14 main.go:139] 2024-05-29T12:06:30Z INFO     Train Epoch: 1 [59520/60000 (99%)]	loss=0.3112
I0529 12:06:32.830804      14 main.go:139] 2024-05-29T12:06:32Z INFO     {metricName: accuracy, metricValue: 0.8480};{metricName: loss, metricValue: 0.4272}
I0529 12:06:32.830912      14 main.go:139] 
I0529 12:06:32.830931      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 12:06:32.830936      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 12:06:32.830974      14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.830990      14 main.go:139] 
I0529 12:06:32.831016      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 12:06:32.831025      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 12:06:32.831032      14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831036      14 main.go:139] 
I0529 12:06:32.831044      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831047      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831054      14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831058      14 main.go:139] 
I0529 12:06:32.831065      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831069      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831077      14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831081      14 main.go:139] 
F0529 12:06:54.922852      14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"

The only log that points anywhere is the last line, where the metrics collector fails to report logs because it cannot dial katib-db-manager (10.0.35.173 is the IP of its k8s Service).

For reference, `kubectl describe pod` output for the trial pod:
╰─$ kdp -n admin cmaes-example-wc2gxp47-jhmlf 
Name:             cmaes-example-wc2gxp47-jhmlf
Namespace:        admin
Priority:         0
Service Account:  default
Node:             aks-nodepool1-16255669-vmss000001/10.224.0.5
Start Time:       Wed, 29 May 2024 15:25:24 +0300
Labels:           batch.kubernetes.io/controller-uid=5e9504f4-1f09-4648-a8ec-eca78386719b
                  batch.kubernetes.io/job-name=cmaes-example-wc2gxp47
                  controller-uid=5e9504f4-1f09-4648-a8ec-eca78386719b
                  job-name=cmaes-example-wc2gxp47
                  katib.kubeflow.org/experiment=cmaes-example
                  katib.kubeflow.org/trial=cmaes-example-wc2gxp47
Annotations:      sidecar.istio.io/inject: false
Status:           Running
IP:               10.244.1.76
IPs:
  IP:           10.244.1.76
Controlled By:  Job/cmaes-example-wc2gxp47
Containers:
  training-container:
    Container ID:  containerd://be81b0cc38fe14d3be4278caa65cc6764829a20214f5729ce43c665fa40e3eaa
    Image:         docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0
    Image ID:      docker.io/kubeflowkatib/pytorch-mnist-cpu@sha256:b95678ce7c02cc8ece15fd2e8ed57241333ce8d79342ee68789f650b3671463d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      python3 /opt/pytorch-mnist/mnist.py --epochs=1 --batch-size=64 --lr=0.04511033252270099 --momentum=0.6980954001565728 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
    State:          Running
      Started:      Wed, 29 May 2024 15:25:25 +0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-brzwq (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://d77bbbc1853b584fb0a751990538044bdb90155b8cde73a592d0b6b39013648b
    Image:         docker.io/kubeflowkatib/file-metrics-collector:v0.17.0-rc.0
    Image ID:      docker.io/kubeflowkatib/file-metrics-collector@sha256:2582758cc1cefd55d71c737b2c276a44a544b3747949bc15084731118e750dd7
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      cmaes-example-wc2gxp47
      -m
      loss;Train-accuracy
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:65535
      -path
      /var/log/katib/metrics.log
      -format
      TEXT
    State:          Running
      Started:      Wed, 29 May 2024 15:25:25 +0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-brzwq (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-brzwq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  8s    default-scheduler  Successfully assigned admin/cmaes-example-wc2gxp47-jhmlf to aks-nodepool1-16255669-vmss000001
  Normal  Pulled     8s    kubelet            Container image "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0" already present on machine
  Normal  Created    8s    kubelet            Created container training-container
  Normal  Started    7s    kubelet            Started container training-container
  Normal  Pulled     7s    kubelet            Container image "docker.io/kubeflowkatib/file-metrics-collector:v0.17.0-rc.0" already present on machine
  Normal  Created    7s    kubelet            Created container metrics-logger-and-collector
  Normal  Started    7s    kubelet            Started container metrics-logger-and-collector
Logs when running the Katib UAT (successfully) on MicroK8s 1.26, Juju 3.5

Trial logs

ubuntu@ip-172-31-17-107:~$ kubectl logs -n admin cmaes-example-jt5xqvdw-d7htv   --all-containers -f
I0529 13:42:57.209827      14 main.go:396] Trial Name: cmaes-example-jt5xqvdw
100.0%13:43:00.815799      14 main.go:139] 
100.6%13:43:01.555076      14 main.go:139] 
100.0%13:43:01.815681      14 main.go:139] 
119.3%13:43:02.041549      14 main.go:139] 
I0529 13:43:02.618775      14 main.go:139] 2024-05-29T13:43:02Z INFO     Train Epoch: 1 [0/60000 (0%)]	loss=2.2980
I0529 13:43:03.835410      14 main.go:139] 2024-05-29T13:43:03Z INFO     Train Epoch: 1 [640/60000 (1%)]	loss=2.2872
I0529 13:43:04.438907      14 main.go:139] 2024-05-29T13:43:04Z INFO     Train Epoch: 1 [1280/60000 (2%)]	loss=2.0531
... # training
[58880/60000 (98%)]	loss=0.4249
I0529 13:43:45.536539      14 main.go:139] 2024-05-29T13:43:45Z INFO     Train Epoch: 1 [59520/60000 (99%)]	loss=0.3053
I0529 13:43:48.549894      14 main.go:139] 2024-05-29T13:43:48Z INFO     {metricName: accuracy, metricValue: 0.8436};{metricName: loss, metricValue: 0.4374}
I0529 13:43:48.549924      14 main.go:139] 
I0529 13:43:48.553166      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 13:43:48.553188      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 13:43:48.553204      14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553214      14 main.go:139] 
I0529 13:43:48.553232      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 13:43:48.553244      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 13:43:48.553300      14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553319      14 main.go:139] 
I0529 13:43:48.553354      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 13:43:48.553366      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 13:43:48.553375      14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553380      14 main.go:139] 
I0529 13:43:48.553389      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 13:43:48.553408      14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 13:43:48.553443      14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553464      14 main.go:139] 
I0529 13:43:50.329199      14 main.go:455] Metrics reported. :
metric_logs:<time_stamp:"2024-05-29T13:43:02Z" metric:<name:"loss" value:"2.2980" > > metric_logs:<time_stamp:"2024-05-29T13:43:03Z" metric:<name:"loss" value:"2.2872" > > metric_logs:<time_stamp:"2024-05-29T13:43:04Z" metric:<name:"loss" value:"2.0531" > > metric_logs:<time_stamp:"2024-05-29T13:43:05Z" metric:<name:"loss" value:"1.4292" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.4897" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.3420" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.0955" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.8254" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.7940" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.8615" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.8247" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.6430" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.8674" > > metric_logs:<time_stamp:"2024-05-29T13:43:09Z" metric:<name:"loss" value:"0.5489" > > metric_logs:<time_stamp:"2024-05-29T13:43:09Z" metric:<name:"loss" value:"0.5338" > > metric_logs:<time_stamp:"2024-05-29T13:43:10Z" metric:<name:"loss" value:"0.8515" > > metric_logs:<time_stamp:"2024-05-29T13:43:10Z" metric:<name:"loss" value:"0.7667" > > metric_logs:<time_stamp:"2024-05-29T13:43:11Z" metric:<name:"loss" value:"0.4940" > > metric_logs:<time_stamp:"2024-05-29T13:43:11Z" metric:<name:"loss" value:"0.6502" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.7553" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.4051" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.7946" > > metric_logs:<time_stamp:"2024-05-29T13:43:13Z" metric:<name:"loss" value:"0.5976" > > metric_logs:<time_stamp:"2024-05-29T13:43:13Z" metric:<name:"loss" value:"0.4301" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.5241" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.6351" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.6309" > > metric_logs:<time_stamp:"2024-05-29T13:43:15Z" metric:<name:"loss" value:"0.4322" > > metric_logs:<time_stamp:"2024-05-29T13:43:16Z" metric:<name:"loss" value:"0.6990" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.7487" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.5877" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.4116" > > metric_logs:<time_stamp:"2024-05-29T13:43:18Z" metric:<name:"loss" value:"0.7075" > > metric_logs:<time_stamp:"2024-05-29T13:43:18Z" metric:<name:"loss" value:"0.5755" > > metric_logs:<time_stamp:"2024-05-29T13:43:19Z" metric:<name:"loss" value:"0.5320" > > metric_logs:<time_stamp:"2024-05-29T13:43:20Z" metric:<name:"loss" value:"0.5022" > > metric_logs:<time_stamp:"2024-05-29T13:43:21Z" metric:<name:"loss" value:"0.5585" > > metric_logs:<time_stamp:"2024-05-29T13:43:21Z" metric:<name:"loss" value:"0.6701" > > metric_logs:<time_stamp:"2024-05-29T13:43:22Z" metric:<name:"loss" value:"0.5671" > > metric_logs:<time_stamp:"2024-05-29T13:43:22Z" metric:<name:"loss" value:"0.5596" > > metric_logs:<time_stamp:"2024-05-29T13:43:23Z" metric:<name:"loss" value:"0.4663" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:24Z" metric:<name:"loss" value:"0.4616" > > metric_logs:<time_stamp:"2024-05-29T13:43:24Z" metric:<name:"loss" value:"0.5287" > > metric_logs:<time_stamp:"2024-05-29T13:43:25Z" metric:<name:"loss" value:"0.5045" > > metric_logs:<time_stamp:"2024-05-29T13:43:25Z" metric:<name:"loss" value:"0.5485" > > metric_logs:<time_stamp:"2024-05-29T13:43:26Z" metric:<name:"loss" value:"0.5563" > > metric_logs:<time_stamp:"2024-05-29T13:43:26Z" metric:<name:"loss" value:"0.5256" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.5970" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.3728" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.5403" > > metric_logs:<time_stamp:"2024-05-29T13:43:29Z" metric:<name:"loss" value:"0.6719" > > metric_logs:<time_stamp:"2024-05-29T13:43:29Z" metric:<name:"loss" value:"0.5648" > > metric_logs:<time_stamp:"2024-05-29T13:43:30Z" metric:<name:"loss" value:"0.5025" > > metric_logs:<time_stamp:"2024-05-29T13:43:30Z" metric:<name:"loss" value:"0.5215" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.5761" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.2862" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.2975" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.3770" > > metric_logs:<time_stamp:"2024-05-29T13:43:32Z" metric:<name:"loss" value:"0.4379" > > metric_logs:<time_stamp:"2024-05-29T13:43:32Z" metric:<name:"loss" value:"0.2941" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.4630" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.2732" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.3157" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.5066" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.6121" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.5001" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.4862" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.4491" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.5721" > > metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.4700" > > metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.5482" > > metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.3291" > > metric_logs:<time_stamp:"2024-05-29T13:43:37Z" metric:<name:"loss" value:"0.4413" > > metric_logs:<time_stamp:"2024-05-29T13:43:37Z" metric:<name:"loss" value:"0.4190" > > metric_logs:<time_stamp:"2024-05-29T13:43:38Z" metric:<name:"loss" value:"0.2713" > > metric_logs:<time_stamp:"2024-05-29T13:43:38Z" metric:<name:"loss" value:"0.4338" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.4222" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.4818" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.5069" > > metric_logs:<time_stamp:"2024-05-29T13:43:40Z" metric:<name:"loss" value:"0.3661" > > metric_logs:<time_stamp:"2024-05-29T13:43:40Z" metric:<name:"loss" value:"0.4026" > > metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.3992" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.3087" > > metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.4326" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.4379" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.3818" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.5389" > > metric_logs:<time_stamp:"2024-05-29T13:43:43Z" metric:<name:"loss" value:"0.3744" > > metric_logs:<time_stamp:"2024-05-29T13:43:43Z" metric:<name:"loss" value:"0.4355" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.4508" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.4387" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.6276" > > metric_logs:<time_stamp:"2024-05-29T13:43:45Z" metric:<name:"loss" value:"0.4249" > > metric_logs:<time_stamp:"2024-05-29T13:43:45Z" metric:<name:"loss" value:"0.3053" > > 

Experiment logs

ubuntu@ip-172-31-17-107:~$ kubectl logs -n admin cmaes-example-cmaes-79dd5db648-jk5bc --all-containers -f
I0529 13:42:35.285566       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 13:42:55.787753       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654"  name:"momentum" value:"0.7043612817216396" ]
I0529 13:42:55.787845       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099"  name:"momentum" value:"0.6980954001565728" ]
I0529 13:43:51.732787       1 service.go:117] Update trial mapping : trialName=cmaes-example-jt5xqvdw -> trialID=0
I0529 13:43:51.732806       1 service.go:117] Update trial mapping : trialName=cmaes-example-kndkpvzh -> trialID=1
I0529 13:43:51.732811       1 service.go:147] Detect changes of Trial (trialName=cmaes-example-kndkpvzh, trialID=1) : State Complete, Evaluation 0.269100
I0529 13:43:51.732859       1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr" value:"0.02556132716757138"  name:"momentum" value:"0.701003503816815" ]
MySQL-related logs
╰─$ juju run katib-db/0 get-cluster-status                                                      
Running operation 1 with 1 task
  - task 2 on unit-katib-db-0

Waiting for task 2...
status:
  clustername: cluster-7fea0e9f40002bd889df6711a8a00d04
  clusterrole: primary
  defaultreplicaset:
    name: default
    primary: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
    ssl: required
    status: ok_no_tolerance
    statustext: cluster is not tolerant to any failures.
    topology:
      katib-db-0:
        address: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
        memberrole: primary
        mode: r/w
        role: ha
        status: online
        version: 8.0.36
    topologymode: single-primary
  domainname: cluster-set-7fea0e9f40002bd889df6711a8a00d04
  groupinformationsourcemember: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
success: "True"

@orfeas-k
Contributor Author

orfeas-k commented May 30, 2024

Update

Debugging

We ran the UAT on MicroK8s 1.28 (Juju 3.5.0) and it failed with the same error message.

  • When editing the katib-db-manager k8s Service to expose port 65535 (without any juju config change), the trials could report metrics.
  • However, after just killing the katib-controller pod to clear its logs, and without changing any configuration, the UAT started succeeding from that point on. It is not certain that killing the pod played a role, but this suggests there is a race happening that results in the UAT failing.

Misconfigured port (Conclusion?)

It turns out that the training container indeed succeeds, while the metrics-collector component fails to report metrics and thus the trial fails. The reason for this failure is indeed the last line of the logs:

F0529 12:06:54.922852      14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"

In this line, we see that it is trying to hit port 65535 instead of the default 6789. This was also confirmed by comparing the trial pod YAML, specifically the args passed to the metrics-collector container, between a case where the trial was successful and one where it wasn't:

- args:
  - -t
  - cmaes-example-fqgkd7hd
  - -m
  - loss;Train-accuracy
  - -o-type
  - minimize
  - -s-db
-    - katib-db-manager.kubeflow:6789
+    - katib-db-manager.kubeflow:65535
  - -path
  - /var/log/katib/metrics.log
  - -format
  - TEXT

Upstream code and port setting

From the above, it looks like, when the trial is created, those args are applied to the metrics-collector container of the pod.

Who is applying those args

Looking at the upstream code:

  1. Those args are applied by the mutating webhook.
  2. The katib-db-manager address is then set by the GetDBManagerAddr function.
  3. That function takes the port from the ENV variable KATIB_DB_MANAGER_SERVICE_PORT, and defaults to 6789 when the variable is not set.

Now, what does not make sense is that the KATIB_DB_MANAGER_SERVICE_PORT env variable is not set anywhere in the upstream Katib manifests or in the katib-operators repository.
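
For illustration, here is a minimal Python sketch of the defaulting behaviour described above. The upstream implementation lives in the Katib webhook and is written in Go; the function name and hard-coded default below only illustrate that behaviour and are not the upstream code itself:

import os

# Default katib-db-manager port, per the behaviour described above (assumption: 6789).
DEFAULT_DB_MANAGER_PORT = "6789"

def get_db_manager_addr(namespace: str = "kubeflow") -> str:
    """Build the katib-db-manager address the way the webhook is described to:
    prefer the Kubernetes-injected env var, otherwise fall back to the default."""
    # KATIB_DB_MANAGER_SERVICE_PORT is injected by Kubernetes into containers
    # created *after* the katib-db-manager Service exists, with whatever port the
    # Service had at that moment -- possibly Juju's 65535 placeholder.
    port = os.environ.get("KATIB_DB_MANAGER_SERVICE_PORT", DEFAULT_DB_MANAGER_PORT)
    return f"katib-db-manager.{namespace}:{port}"

print(get_db_manager_addr())  # e.g. "katib-db-manager.kubeflow:6789"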

@orfeas-k
Contributor Author

Port configuration ENV variable

AKS

SSHing into the charms, it turns out that multiple Katib-related ENV variables are set:

╰─$ juju ssh katib-db-manager/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
# exit
╰─$ juju ssh katib-controller/0
#  printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142

which explains why the trials' metrics-collector container is misconfigured.

Regarding the previous comment: I killed the katib-manager-0 pod and, when it was spun up again, those variables were different, which explains why the UAT succeeded yesterday after killing the pod:

# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=6789
KATIB_DB_MANAGER_SERVICE_PORT_API=6789
KATIB_DB_MANAGER_SERVICE_HOST=10.152.183.120

@orfeas-k
Contributor Author

orfeas-k commented May 31, 2024

Summary - recap

Problem

Juju and envvars

Copying from Kubernetes_service_patch (KSP) library:

When sidecar charms are deployed, Juju creates a service named after the application in the namespace. This service by default contains a "placeholder" port, which is 65535/TCP.

This ties in with the Kubernetes-native behavior of providing container environment variables for every service present on the cluster:

A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services.

Thus, when a charm with a pebble service called test-charm is deployed, the following ENV variables are expected to be found in other charms' containers that are created after test-charm:

TEST_CHARM_SERVICE_PORT=<port>
TEST_CHARM_SERVICE_HOST=<test-charm-service-ip>

Note that those ENV variables are not updated after the container has been created.
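
To see that staleness in practice, one can compare the value frozen into a container's environment with the port on the live Service object. A minimal sketch, assuming it runs inside a pod in the kubeflow namespace with the official kubernetes Python client installed and RBAC that allows reading Services:

import os
from kubernetes import client, config

# Load in-cluster credentials (assumes this runs inside a pod whose service
# account can read Services in the kubeflow namespace).
config.load_incluster_config()

# Value injected into this container's environment when it was created.
env_port = os.environ.get("KATIB_DB_MANAGER_SERVICE_PORT", "<not set>")

# Port currently on the live Service object (reflects any later patching).
svc = client.CoreV1Api().read_namespaced_service("katib-db-manager", "kubeflow")
live_port = svc.spec.ports[0].port

print(f"env var snapshot: {env_port}, live Service port: {live_port}")
# If the container was created before the Service was patched, the env var can
# still show Juju's 65535 placeholder while the live port is already 6789.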

Katib uses KATIB_DB_MANAGER_SERVICE_PORT by default

Copying and rephrasing from two comments above:

  1. When the trial is created, the mutating webhook applies some args to the metrics-collector container of the pod.
  2. Part of those args is the katib-db-manager address, which is set by the GetDBManagerAddr function.
  3. That function uses the ENV variable KATIB_DB_MANAGER_SERVICE_PORT to set the address's port, and defaults to 6789 when the variable is not present.

Result

We can now see how the two behaviours above combine. The result is a race where:

  1. The katib-db-manager service is created but not yet patched (so the service still has the placeholder port 65535).
  2. The katib-controller container is created with the ENV var KATIB_DB_MANAGER_SERVICE_PORT=65535.
  3. The katib-db-manager service is patched, but this information is not propagated to containers that were already created (including the katib-controller containers).
  4. The trial ends up with an endpoint that points to port 65535, which results in:
F0529 12:06:54.922852      14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"

The weird thing is that this is intermittent: sometimes the ENV variable has the proper value and sometimes it does not.

Proposed solution

Until we figure out more about this behaviour and maybe tackle it in a better way, we'll explicitly set KATIB_DB_MANAGER_SERVICE_PORT. This information will be broadcast to katib-controller through a new relation between that charm and katib-db-manager. For this, the k8s_service_info library will be used. A rough sketch of the idea is shown below.
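
For reference, a sketch of what broadcasting the port over such a relation could look like on the katib-db-manager side, written against plain ops relation data rather than the actual k8s_service_info library API (the relation name and databag keys below are assumptions for illustration):

import ops

class KatibDBManagerCharm(ops.CharmBase):
    """Sketch: publish the service name and port so the requirer side
    (katib-controller) can set KATIB_DB_MANAGER_SERVICE_PORT explicitly
    instead of relying on the Kubernetes-injected env var."""

    def __init__(self, *args):
        super().__init__(*args)
        # "k8s-service-info" is an assumed relation name for illustration.
        self.framework.observe(
            self.on["k8s-service-info"].relation_created, self._publish_service_info
        )

    def _publish_service_info(self, event: ops.RelationCreatedEvent) -> None:
        # Only the leader may write to the application databag.
        if not self.unit.is_leader():
            return
        event.relation.data[self.app]["svc-name"] = self.app.name
        event.relation.data[self.app]["svc-port"] = "6789"

if __name__ == "__main__":
    ops.main(KatibDBManagerCharm)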

@misohu
Member

misohu commented Jul 16, 2024

Is this still failing on 1.8? We recently pushed this PR to the main branch of the UATs, canonical/charmed-kubeflow-uats@c894a90, which we have successfully run on AKS and EKS for 1.8 a couple of times (e.g. here).
