ci(aks): Katib UAT fail on AKS #893
Thank you for reporting us your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5649.
Debugging
Deployed a 1.28 AKS cluster with Juju 3.5, using the same configuration as our CI, and ran the Katib UAT. I could not find any logs that clearly indicate the issue.
Experiment logs
╰─$ kl -n test-kubeflow cmaes-example-cmaes-589674bc6f-phff2 --all-containers -f
I0529 12:05:25.927601 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 12:05:56.169166 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396" ]
I0529 12:05:56.169233 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728" ]
Trial logs
╰─$ kl -n test-kubeflow cmaes-example-hdnn4bl5-jmfrx --all-containers -f
I0529 12:05:57.870325 14 main.go:396] Trial Name: cmaes-example-hdnn4bl5
100.0%12:06:00.184060 14 main.go:139]
100.0%12:06:00.689822 14 main.go:139]
100.0%12:06:00.885008 14 main.go:139]
100.0%12:06:01.057909 14 main.go:139]
I0529 12:06:01.185685 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [0/60000 (0%)] loss=2.2980
I0529 12:06:01.606906 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [640/60000 (1%)] loss=2.2849
I0529 12:06:01.932049 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [1280/60000 (2%)] loss=2.0193
... # more training
I0529 12:06:30.235188 14 main.go:139] 2024-05-29T12:06:30Z INFO Train Epoch: 1 [59520/60000 (99%)] loss=0.3112
I0529 12:06:32.830804 14 main.go:139] 2024-05-29T12:06:32Z INFO {metricName: accuracy, metricValue: 0.8480};{metricName: loss, metricValue: 0.4272}
I0529 12:06:32.830912 14 main.go:139]
I0529 12:06:32.830931 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 12:06:32.830936 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 12:06:32.830974 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.830990 14 main.go:139]
I0529 12:06:32.831016 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 12:06:32.831025 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 12:06:32.831032 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831036 14 main.go:139]
I0529 12:06:32.831044 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831047 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831054 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831058 14 main.go:139]
I0529 12:06:32.831065 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831069 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831077 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831081 14 main.go:139]
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
The only log that points anywhere is the last line: the metrics collector cannot report logs because it fails to dial katib-db-manager (10.0.35.173 is the IP of its k8s service).
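To confirm the dial failure independently of the metrics collector, a plain TCP reachability check run from a pod inside the cluster can help. This is only a minimal sketch, assuming Python is available in that pod; `can_connect` is a hypothetical helper name, not part of Katib.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False
```

For example, calling `can_connect("katib-db-manager.kubeflow", 65535)` from a pod would be expected to return `False` if the timeout in the log above is reproducible.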
For reference, `kubectl describe pod` for the trial:
╰─$ kdp -n admin cmaes-example-wc2gxp47-jhmlf
Name: cmaes-example-wc2gxp47-jhmlf
Namespace: admin
Priority: 0
Service Account: default
Node: aks-nodepool1-16255669-vmss000001/10.224.0.5
Start Time: Wed, 29 May 2024 15:25:24 +0300
Labels: batch.kubernetes.io/controller-uid=5e9504f4-1f09-4648-a8ec-eca78386719b
batch.kubernetes.io/job-name=cmaes-example-wc2gxp47
controller-uid=5e9504f4-1f09-4648-a8ec-eca78386719b
job-name=cmaes-example-wc2gxp47
katib.kubeflow.org/experiment=cmaes-example
katib.kubeflow.org/trial=cmaes-example-wc2gxp47
Annotations: sidecar.istio.io/inject: false
Status: Running
IP: 10.244.1.76
IPs:
IP: 10.244.1.76
Controlled By: Job/cmaes-example-wc2gxp47
Containers:
training-container:
Container ID: containerd://be81b0cc38fe14d3be4278caa65cc6764829a20214f5729ce43c665fa40e3eaa
Image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0
Image ID: docker.io/kubeflowkatib/pytorch-mnist-cpu@sha256:b95678ce7c02cc8ece15fd2e8ed57241333ce8d79342ee68789f650b3671463d
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
python3 /opt/pytorch-mnist/mnist.py --epochs=1 --batch-size=64 --lr=0.04511033252270099 --momentum=0.6980954001565728 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
State: Running
Started: Wed, 29 May 2024 15:25:25 +0300
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-brzwq (ro)
metrics-logger-and-collector:
Container ID: containerd://d77bbbc1853b584fb0a751990538044bdb90155b8cde73a592d0b6b39013648b
Image: docker.io/kubeflowkatib/file-metrics-collector:v0.17.0-rc.0
Image ID: docker.io/kubeflowkatib/file-metrics-collector@sha256:2582758cc1cefd55d71c737b2c276a44a544b3747949bc15084731118e750dd7
Port: <none>
Host Port: <none>
Args:
-t
cmaes-example-wc2gxp47
-m
loss;Train-accuracy
-o-type
minimize
-s-db
katib-db-manager.kubeflow:65535
-path
/var/log/katib/metrics.log
-format
TEXT
State: Running
Started: Wed, 29 May 2024 15:25:25 +0300
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-brzwq (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-brzwq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8s default-scheduler Successfully assigned admin/cmaes-example-wc2gxp47-jhmlf to aks-nodepool1-16255669-vmss000001
Normal Pulled 8s kubelet Container image "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0" already present on machine
Normal Created 8s kubelet Created container training-container
Normal Started 7s kubelet Started container training-container
Normal Pulled 7s kubelet Container image "docker.io/kubeflowkatib/file-metrics-collector:v0.17.0-rc.0" already present on machine
Normal Created 7s kubelet Created container metrics-logger-and-collector
Normal Started 7s kubelet Started container metrics-logger-and-collector
Logs when running Katib UAT (successfully) on MicroK8s 1.26, Juju 3.5
Trial logs
ubuntu@ip-172-31-17-107:~$ kubectl logs -n admin cmaes-example-jt5xqvdw-d7htv --all-containers -f
I0529 13:42:57.209827 14 main.go:396] Trial Name: cmaes-example-jt5xqvdw
100.0%13:43:00.815799 14 main.go:139]
100.6%13:43:01.555076 14 main.go:139]
100.0%13:43:01.815681 14 main.go:139]
119.3%13:43:02.041549 14 main.go:139]
I0529 13:43:02.618775 14 main.go:139] 2024-05-29T13:43:02Z INFO Train Epoch: 1 [0/60000 (0%)] loss=2.2980
I0529 13:43:03.835410 14 main.go:139] 2024-05-29T13:43:03Z INFO Train Epoch: 1 [640/60000 (1%)] loss=2.2872
I0529 13:43:04.438907 14 main.go:139] 2024-05-29T13:43:04Z INFO Train Epoch: 1 [1280/60000 (2%)] loss=2.0531
... # training
[58880/60000 (98%)] loss=0.4249
I0529 13:43:45.536539 14 main.go:139] 2024-05-29T13:43:45Z INFO Train Epoch: 1 [59520/60000 (99%)] loss=0.3053
I0529 13:43:48.549894 14 main.go:139] 2024-05-29T13:43:48Z INFO {metricName: accuracy, metricValue: 0.8436};{metricName: loss, metricValue: 0.4374}
I0529 13:43:48.549924 14 main.go:139]
I0529 13:43:48.553166 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 13:43:48.553188 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 13:43:48.553204 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553214 14 main.go:139]
I0529 13:43:48.553232 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 13:43:48.553244 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 13:43:48.553300 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553319 14 main.go:139]
I0529 13:43:48.553354 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 13:43:48.553366 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 13:43:48.553375 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553380 14 main.go:139]
I0529 13:43:48.553389 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 13:43:48.553408 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 13:43:48.553443 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 13:43:48.553464 14 main.go:139]
I0529 13:43:50.329199 14 main.go:455] Metrics reported. :
metric_logs:<time_stamp:"2024-05-29T13:43:02Z" metric:<name:"loss" value:"2.2980" > > metric_logs:<time_stamp:"2024-05-29T13:43:03Z" metric:<name:"loss" value:"2.2872" > > metric_logs:<time_stamp:"2024-05-29T13:43:04Z" metric:<name:"loss" value:"2.0531" > > metric_logs:<time_stamp:"2024-05-29T13:43:05Z" metric:<name:"loss" value:"1.4292" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.4897" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.3420" > > metric_logs:<time_stamp:"2024-05-29T13:43:06Z" metric:<name:"loss" value:"1.0955" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.8254" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.7940" > > metric_logs:<time_stamp:"2024-05-29T13:43:07Z" metric:<name:"loss" value:"0.8615" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.8247" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.6430" > > metric_logs:<time_stamp:"2024-05-29T13:43:08Z" metric:<name:"loss" value:"0.8674" > > metric_logs:<time_stamp:"2024-05-29T13:43:09Z" metric:<name:"loss" value:"0.5489" > > metric_logs:<time_stamp:"2024-05-29T13:43:09Z" metric:<name:"loss" value:"0.5338" > > metric_logs:<time_stamp:"2024-05-29T13:43:10Z" metric:<name:"loss" value:"0.8515" > > metric_logs:<time_stamp:"2024-05-29T13:43:10Z" metric:<name:"loss" value:"0.7667" > > metric_logs:<time_stamp:"2024-05-29T13:43:11Z" metric:<name:"loss" value:"0.4940" > > metric_logs:<time_stamp:"2024-05-29T13:43:11Z" metric:<name:"loss" value:"0.6502" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.7553" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.4051" > > metric_logs:<time_stamp:"2024-05-29T13:43:12Z" metric:<name:"loss" value:"0.7946" > > metric_logs:<time_stamp:"2024-05-29T13:43:13Z" metric:<name:"loss" value:"0.5976" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:13Z" metric:<name:"loss" value:"0.4301" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.5241" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.6351" > > metric_logs:<time_stamp:"2024-05-29T13:43:14Z" metric:<name:"loss" value:"0.6309" > > metric_logs:<time_stamp:"2024-05-29T13:43:15Z" metric:<name:"loss" value:"0.4322" > > metric_logs:<time_stamp:"2024-05-29T13:43:16Z" metric:<name:"loss" value:"0.6990" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.7487" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.5877" > > metric_logs:<time_stamp:"2024-05-29T13:43:17Z" metric:<name:"loss" value:"0.4116" > > metric_logs:<time_stamp:"2024-05-29T13:43:18Z" metric:<name:"loss" value:"0.7075" > > metric_logs:<time_stamp:"2024-05-29T13:43:18Z" metric:<name:"loss" value:"0.5755" > > metric_logs:<time_stamp:"2024-05-29T13:43:19Z" metric:<name:"loss" value:"0.5320" > > metric_logs:<time_stamp:"2024-05-29T13:43:20Z" metric:<name:"loss" value:"0.5022" > > metric_logs:<time_stamp:"2024-05-29T13:43:21Z" metric:<name:"loss" value:"0.5585" > > metric_logs:<time_stamp:"2024-05-29T13:43:21Z" metric:<name:"loss" value:"0.6701" > > metric_logs:<time_stamp:"2024-05-29T13:43:22Z" metric:<name:"loss" value:"0.5671" > > metric_logs:<time_stamp:"2024-05-29T13:43:22Z" metric:<name:"loss" value:"0.5596" > > metric_logs:<time_stamp:"2024-05-29T13:43:23Z" metric:<name:"loss" value:"0.4663" > > metric_logs:<time_stamp:"2024-05-29T13:43:24Z" metric:<name:"loss" value:"0.4616" > > metric_logs:<time_stamp:"2024-05-29T13:43:24Z" metric:<name:"loss" value:"0.5287" > > metric_logs:<time_stamp:"2024-05-29T13:43:25Z" metric:<name:"loss" value:"0.5045" > > metric_logs:<time_stamp:"2024-05-29T13:43:25Z" metric:<name:"loss" value:"0.5485" > > metric_logs:<time_stamp:"2024-05-29T13:43:26Z" metric:<name:"loss" value:"0.5563" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:26Z" metric:<name:"loss" value:"0.5256" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.5970" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.3728" > > metric_logs:<time_stamp:"2024-05-29T13:43:28Z" metric:<name:"loss" value:"0.5403" > > metric_logs:<time_stamp:"2024-05-29T13:43:29Z" metric:<name:"loss" value:"0.6719" > > metric_logs:<time_stamp:"2024-05-29T13:43:29Z" metric:<name:"loss" value:"0.5648" > > metric_logs:<time_stamp:"2024-05-29T13:43:30Z" metric:<name:"loss" value:"0.5025" > > metric_logs:<time_stamp:"2024-05-29T13:43:30Z" metric:<name:"loss" value:"0.5215" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.5761" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.2862" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.2975" > > metric_logs:<time_stamp:"2024-05-29T13:43:31Z" metric:<name:"loss" value:"0.3770" > > metric_logs:<time_stamp:"2024-05-29T13:43:32Z" metric:<name:"loss" value:"0.4379" > > metric_logs:<time_stamp:"2024-05-29T13:43:32Z" metric:<name:"loss" value:"0.2941" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.4630" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.2732" > > metric_logs:<time_stamp:"2024-05-29T13:43:33Z" metric:<name:"loss" value:"0.3157" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.5066" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.6121" > > metric_logs:<time_stamp:"2024-05-29T13:43:34Z" metric:<name:"loss" value:"0.5001" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.4862" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.4491" > > metric_logs:<time_stamp:"2024-05-29T13:43:35Z" metric:<name:"loss" value:"0.5721" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.4700" > > metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.5482" > > metric_logs:<time_stamp:"2024-05-29T13:43:36Z" metric:<name:"loss" value:"0.3291" > > metric_logs:<time_stamp:"2024-05-29T13:43:37Z" metric:<name:"loss" value:"0.4413" > > metric_logs:<time_stamp:"2024-05-29T13:43:37Z" metric:<name:"loss" value:"0.4190" > > metric_logs:<time_stamp:"2024-05-29T13:43:38Z" metric:<name:"loss" value:"0.2713" > > metric_logs:<time_stamp:"2024-05-29T13:43:38Z" metric:<name:"loss" value:"0.4338" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.4222" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.4818" > > metric_logs:<time_stamp:"2024-05-29T13:43:39Z" metric:<name:"loss" value:"0.5069" > > metric_logs:<time_stamp:"2024-05-29T13:43:40Z" metric:<name:"loss" value:"0.3661" > > metric_logs:<time_stamp:"2024-05-29T13:43:40Z" metric:<name:"loss" value:"0.4026" > > metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.3992" > > metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.3087" > > metric_logs:<time_stamp:"2024-05-29T13:43:41Z" metric:<name:"loss" value:"0.4326" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.4379" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.3818" > > metric_logs:<time_stamp:"2024-05-29T13:43:42Z" metric:<name:"loss" value:"0.5389" > > metric_logs:<time_stamp:"2024-05-29T13:43:43Z" metric:<name:"loss" value:"0.3744" > > metric_logs:<time_stamp:"2024-05-29T13:43:43Z" metric:<name:"loss" value:"0.4355" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.4508" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.4387" > > metric_logs:<time_stamp:"2024-05-29T13:43:44Z" metric:<name:"loss" value:"0.6276" > > 
metric_logs:<time_stamp:"2024-05-29T13:43:45Z" metric:<name:"loss" value:"0.4249" > > metric_logs:<time_stamp:"2024-05-29T13:43:45Z" metric:<name:"loss" value:"0.3053" > >
Experiment logs
ubuntu@ip-172-31-17-107:~$ kubectl logs -n admin cmaes-example-cmaes-79dd5db648-jk5bc --all-containers -f
I0529 13:42:35.285566 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 13:42:55.787753 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396" ]
I0529 13:42:55.787845 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728" ]
I0529 13:43:51.732787 1 service.go:117] Update trial mapping : trialName=cmaes-example-jt5xqvdw -> trialID=0
I0529 13:43:51.732806 1 service.go:117] Update trial mapping : trialName=cmaes-example-kndkpvzh -> trialID=1
I0529 13:43:51.732811 1 service.go:147] Detect changes of Trial (trialName=cmaes-example-kndkpvzh, trialID=1) : State Complete, Evaluation 0.269100
I0529 13:43:51.732859 1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr" value:"0.02556132716757138" name:"momentum" value:"0.701003503816815" ]
MySQL-related logs
╰─$ juju run katib-db/0 get-cluster-status
Running operation 1 with 1 task
- task 2 on unit-katib-db-0
Waiting for task 2...
status:
clustername: cluster-7fea0e9f40002bd889df6711a8a00d04
clusterrole: primary
defaultreplicaset:
name: default
primary: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
ssl: required
status: ok_no_tolerance
statustext: cluster is not tolerant to any failures.
topology:
katib-db-0:
address: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
memberrole: primary
mode: r/w
role: ha
status: online
version: 8.0.36
topologymode: single-primary
domainname: cluster-set-7fea0e9f40002bd889df6711a8a00d04
groupinformationsourcemember: katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local:3306
success: "True"
Update
Debugging
We ran the UAT on MicroK8s 1.28 (Juju 3.5.0) and it failed with the same error message.
Misconfigured port (Conclusion?)
It turns out that the training container indeed succeeds, while the metrics-collector component fails to report metrics, and thus the trial fails. The cause is indeed the last line of the trial logs, the `Failed to Report logs` error.
In this line, we see that it is trying to hit port 65535 on katib-db-manager, which matches the metrics collector args:
- args:
- -t
- cmaes-example-fqgkd7hd
- -m
- loss;Train-accuracy
- -o-type
- minimize
- -s-db
- - katib-db-manager.kubeflow:6789
+ - katib-db-manager.kubeflow:65535
- -path
- /var/log/katib/metrics.log
- -format
- TEXT
Upstream code and port setting
From the above, it looks like when the trial is created, those
Who is applying those args
Looking at the upstream code:
Now, what does not make sense is that nowhere in the upstream Katib manifests or in the katib-operator repository is a
Port configuration ENV variable
AKS
SSHing into the charms, it turns out that there are multiple Katib-related ENV variables set:
╰─$ juju ssh katib-db-manager/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
# exit
╰─$ juju ssh katib-controller/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
which explains why the trials' metrics-collector container is misconfigured.
Regarding the previous comment, I killed the
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=6789
KATIB_DB_MANAGER_SERVICE_PORT_API=6789
KATIB_DB_MANAGER_SERVICE_HOST=10.152.183.120
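The comparison between the two `printenv` outputs can be automated with a small check. This is a rough sketch only: `parse_env` and `unexpected_port` are hypothetical helper names, and treating 6789 as the expected port is an assumption based on the second output above.

```python
# Assumption: 6789 (seen in the second printenv output) is the expected port.
EXPECTED_PORT = 6789
PORT_VAR = "KATIB_DB_MANAGER_SERVICE_PORT"


def parse_env(text):
    """Parse printenv-style KEY=VALUE lines into a dict, skipping anything else."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key and not key.startswith("#"):
            env[key] = value
    return env


def unexpected_port(env, expected=EXPECTED_PORT):
    """Return the injected port if it differs from `expected`, otherwise None."""
    raw = env.get(PORT_VAR)
    if raw is None:
        return None
    port = int(raw)
    return port if port != expected else None


# Output copied from the AKS katib-controller unit above.
AKS_ENV = """\
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
"""

print(unexpected_port(parse_env(AKS_ENV)))  # prints 65535
```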
Summary - recap
Problem
Juju and envvars
Copying from the Kubernetes_service_patch (KSP) library:
This ties in with the Kubernetes-native behavior of injecting container environment variables for every service present on the cluster:
Thus, when a charm with a pebble service called
Note that those ENV variables are not updated after the container has been created. Katib uses by default
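To illustrate where names like `KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER` come from, here is a small sketch of the Kubernetes naming convention for service environment variables: the service name is upper-cased with dashes turned into underscores, and each named port gets its own `_SERVICE_PORT_<NAME>` variable. `service_env_names` is a hypothetical helper, and the port names `placeholder` and `api` are inferred from the `printenv` outputs above rather than taken from the charm code. Only the `*_SERVICE_*` variables are covered here.

```python
def _to_env(name):
    """Upper-case a Kubernetes name and replace dashes with underscores."""
    return name.upper().replace("-", "_")


def service_env_names(service, port_names=()):
    """List the *_SERVICE_* variable names injected for a service and its named ports."""
    prefix = _to_env(service)
    names = [f"{prefix}_SERVICE_HOST", f"{prefix}_SERVICE_PORT"]
    names += [f"{prefix}_SERVICE_PORT_{_to_env(p)}" for p in port_names]
    return names


# Port name inferred from the AKS output above (an assumption).
for name in service_env_names("katib-db-manager", ["placeholder"]):
    print(name)
```

Because these variables are fixed when the container is created, a service that is later patched to a different port would not be reflected in containers that are already running.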
Follow up PR to #185 Ref canonical/bundle-kubeflow#893
* fix: Explicitly set `KATIB_DB_MANAGER_SERVICE_PORT` in katib-controller
* katib-db-manager: Remove port config option
Closes canonical/bundle-kubeflow#893, #108. Also addresses part of #184.
Is this still failing on 1.8? We recently pushed this PR to the main branch of the UATs, canonical/charmed-kubeflow-uats@c894a90, which we have successfully run on AKS and EKS for 1.8 a couple of times (e.g. here).
Bug Description
This UAT fails for bundle `latest/edge` in our CI with `AssertionError: Katib Experiment was not successful.` Note that this is successful for bundle `1.8/stable`. Unfortunately, we cannot have detailed logs since that's a known limitation of how our UATs run.
Example runs: first-k8s-1.28, second-k8s-1.28, k8s-1.26
To Reproduce
Rerun the CI from #892 or just from `main` for "latest/edge".
Environment
AKS 1.26 and 1.28
Juju 3.1
CKF `latest/edge`
Relevant Log Output
Additional Context
No response