
Add support for Python 3.8, 3.9 #61

Merged
3 commits merged into master on May 16, 2023

Conversation

Contributor

@thtrunck thtrunck commented Apr 28, 2023

[sc-132053]
Basic checks on Python 2.7, 3.6, 3.7, 3.8, and 3.9:

  • create nodepool preset
  • create cluster with simplest settings (I only added Service CIDR/DNS IP and Load balancer SKU)
  • check monitoring
  • run kubectl command (run nginx --image=nginx --restart=Never)
  • delete all pods
  • test network connectivity
  • inspect node pools
  • resize cluster
  • create containerized exec
  • push images
  • test containerized exec
  • attach cluster
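
For reference, the Service CIDR/DNS IP pair used in these checks can be sanity-checked with the stdlib `ipaddress` module (AKS requires the DNS service IP to sit inside the service CIDR; the helper name below is ours, not part of the plugin):

```python
import ipaddress

def dns_ip_in_service_cidr(service_cidr: str, dns_ip: str) -> bool:
    """Return True if the DNS service IP falls inside the service CIDR."""
    return ipaddress.ip_address(dns_ip) in ipaddress.ip_network(service_cidr)

# The values used throughout this test run:
print(dns_ip_in_service_cidr("10.1.0.0/16", "10.1.0.10"))  # True
```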

On 3.9 I also tested

  • creating the cluster in a different resourcegroup/region
  • Using AKS managed identity with and without adding/removing permissions on Vnet/ACR
  • Using a different custom identity for the control plane/nodepool
  • Using service principal as the cluster identity type

I haven't tested the legacy option (and I don't plan to).
It would be nice to test GPU, but I did my setup in FranceCentral and I'm not able to provision GPU nodes there.

@thtrunck thtrunck self-assigned this Apr 28, 2023
@shortcut-integration

This pull request has been linked to Shortcut Story #132053: [aks-clusters][Platform] necessary update.

@thtrunck thtrunck added this to the DSS 12.0.0 milestone Apr 28, 2023
@thtrunck thtrunck marked this pull request as draft April 28, 2023 10:59
ipaddress==1.0.23
msrest==0.6.21
Contributor Author


Some info about this removal, since it was explicitly added in sc-93743:
Some exceptions were moved from msrest into azure-mgmt-core, and we ran into that issue because we were pinning azure-mgmt-core.

Azure/azure-sdk-for-python#24765 (comment)

@thtrunck thtrunck marked this pull request as ready for review May 9, 2023 09:41
@thtrunck thtrunck requested a review from amandineslx May 11, 2023 09:29
Contributor

@amandineslx amandineslx left a comment


OK for me globally.
Tested on DSS 12.0 (kit from daily build, revision fd444867f6d5b88f698073369de5327457ed38ba) with the plugin version packaged from this PR.

TL;DR

  • globally OK
  • ⚠️ We should probably open cards for these two points about the Resize cluster macro:
    • default node pool name is not the correct one (None instead of nodepool0) ; probably not related to this PR
    • node pools can't be deleted with the macro even providing the node pool name ; probably not related to this PR
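
Purely as an illustration of the first point (every name here is hypothetical, not taken from the plugin code), a fallback of this shape in the Resize cluster macro would avoid presenting None as the default pool name:

```python
def resolve_node_pool_name(requested, existing_pools):
    """Pick an explicit pool name, falling back to the first existing
    pool (e.g. "nodepool0") rather than defaulting to None."""
    if requested:
        return requested
    if existing_pools:
        return existing_pools[0]
    raise ValueError("cluster has no node pools")

print(resolve_node_pool_name(None, ["nodepool0", "gpupool"]))  # nodepool0
```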

Tested

  • check where GPU machines are available (type-wise and quota-wise)
az vm list-skus --all --output table --size Standard_NC --location westus
  • compare with Dataiku quotas
  • create a new resource group in westus region
  • perform FM setup and deploy FM
  • create an elastic fleet and deploy the design node in DSS 11.4.2
  • create a container registry in the same region as DSS (asouilleuxregistry2.azurecr.io)
  • Administration > Settings > Containerized execution
    • create a new Containerized execution config
      • Image registry URL=asouilleuxregistry2.azurecr.io ; Image pre-push hook=Enable push to ACR ; Custom limits=nvidia.com/gpu=1
  • package plugin from PR and upgrade the AKS plugin with this new version
  • rebuild containerized execution base images
# download spark and hadoop zips from https://downloads.dataiku.com/preview/dss/12.0.0-dev21/
# upload them to the machine
rsync -Pav dataiku-dss-*.tar.gz $DSS_TARGET:~/
ssh -i $SSH_KEY $DSS_TARGET
sudo cp dataiku-dss-hadoop-standalone-libs-generic-hadoop3-12.0.0-dev21.tar.gz /opt/dataiku-dss-11.4.3/
sudo cp dataiku-dss-spark-standalone-12.0.0-dev21-3.3.1-generic-hadoop3.tar.gz /opt/dataiku-dss-11.4.3/
# install hadoop and spark support
sudo su dataiku
cd /data/dataiku/dss_data
./bin/dssadmin install-hadoop-integration -standalone generic-hadoop3 -standaloneArchive /opt/dataiku-dss-11.4.3/dataiku-dss-hadoop-standalone-libs-generic-hadoop3-12.0.0-dev21.tar.gz
./bin/dssadmin install-spark-integration -standaloneArchive /opt/dataiku-dss-11.4.3/dataiku-dss-spark-standalone-12.0.0-dev21-3.3.1-generic-hadoop3.tar.gz
# restart DSS instance
./bin/dss restart
# build container exec images
./bin/dssadmin build-base-image --type container-exec --without-r --with-py39 --with-cuda --cuda-version 11.2
# restart DSS instance
./bin/dss restart
  • Plugins > ADD PLUGIN
    • update the plugin with version built from the PR
  • Plugins > Installed > AKS clusters > Code environment > CHANGE
    • create a new code env py39 with Python 3.9
  • Administration > Clusters > CREATE AKS CLUSTER
    • create cluster asouilleux-cluster with GPUs ✅
      • Node pools
        • Machine type=Standard_NC6s_v3 ; disk size=0 ; Default number of nodes=1 ; Enable nodes autoscaling=ticked ; Min number of nodes=1 ; Max number of nodes=2 ; Availability zones=unticked ; GPU=ticked
      • Advanced options
        • Service CIDR=10.1.0.0/16 ; DNS IP=10.1.0.10 ; Load balancer SKU=Standard
  • Administration > Settings > Containerized execution
    • Default settings > Default cluster
      • change value to asouilleux-cluster
    • Resources for Kubernetes containers > Custom limits
      • add nvidia.com/gpu=1
    • PUSH BASE IMAGES
  • Administration > Code Envs > NEW PYTHON ENV
    • create a new code env with Python 3.9
    • Packages to install
      • add tensorflow==2.11.0
    • Containerized execution
      • Build for > Selected container configurations > azure
      • SAVE AND UPDATE
  • create project sc132053
    • create a Python Notebook
      • Code env=py39
      • Containerized exec=azure
    • open Python notebook and check the kernel starts ✅
    • execute the following ✅
import tensorflow as tf
tf.config.list_physical_devices('GPU')
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
  • stop the cluster asouilleux-cluster
  • start a cluster asouilleux-cluster2 with minimal options
    • Node pools
      • Machine type=Standard_B8ms ; disk size=0 ; Default number of nodes=1 ; Enable nodes autoscaling=ticked ; Min number of nodes=1 ; Max number of nodes=2 ; Availability zones=unticked
    • Advanced options
      • Service CIDR=10.1.0.0/16 ; DNS IP=10.1.0.10 ; Load balancer SKU=Standard
  • Administration > Settings > Containerized execution
    • Resources for Kubernetes containers > Custom limits
      • remove nvidia.com/gpu=1
    • Default settings > Default cluster
      • change value to asouilleux-cluster2
  • Administration > Clusters > asouilleux-cluster > Actions
    • Run kubectl command
    • Delete finished pods
    • Delete all pods
    • Delete finished jobs
    • Inspect node pools
    • Resize cluster
      • ⚠️ default node pool name is not the correct one (None instead of nodepool0) ; probably not related to this PR ; not reproducible
      • ⚠️ node pools can't be deleted with the macro even providing the node pool name ; probably not related to this PR
      • resizing the node pool only works ✅
    • Test network connectivity
  • stop cluster ✅
  • change cluster settings
    • Identity assumed by cluster components
      • Identity type=Managed identities ; Control plane user identity=[DSS identity] ; Kubelet user identity=[DSS identity]
  • start cluster and test cluster connectivity with macro ✅
  • stop cluster ✅
  • change cluster settings
    • Identity assumed by cluster components
      • Identity type=Managed identities ; AKS managed identity=ticked ; Assign permissions for Vnet=ticked ; AKS managed Kubelet identity=ticked ; Assign permissions for ACR=asouilleuxregistry2
  • in Azure portal
    • give ownership on asouilleuxregistry2 to asouilleux-dss-id
    • give ownership on asouilleux-fm-westus-vn to asouilleux-dss-id
  • start cluster and test cluster connectivity with macro ✅
  • stop cluster ✅
  • create a service principal in Azure
az ad sp create-for-rbac --name asouilleuxClusterServicePrincipal
  • change cluster settings
    • Identity assumed by cluster components
      • Identity type=Service principal ; Application (client) ID=[appId] ; Password=[password]
  • start cluster and test cluster connectivity with macro ✅
  • stop cluster ✅

@amandineslx amandineslx self-requested a review May 15, 2023 15:15
Contributor

@amandineslx amandineslx left a comment


OK for me, see comment above

@thtrunck thtrunck merged commit 4f2dd51 into master May 16, 2023
@thtrunck thtrunck deleted the perso/thtrunck/sc-132053-update-aks-python branch May 16, 2023 08:10