cloudwise
is Surreal's cloud infrastructure provisioner based on Terraform. Surreal's website and github.
It prepares a kubernetes cluster using terraform. It generates .tf.json
files that are also recognized by
Symphony.
- Cloud wise runs in python 3
- Do
git clone [email protected]:SurrealAI/cloudwise.git && cd cloudwise
- Run
pip install -e .
in this directory. - Install
terraform
following instructions here - Install
kubectl
following instructions here
- (Optional, Recommended) Create and work in a clean directory as running terraform would generate relevant files.
> mkdir surreal
> cd surreal
- You first need to setup credentials for
terraform
to access google cloud. See guide here. Choose one of the two methods:- Run the following command
gcloud auth application-default login
or
- Go to the api key management page [https://console.cloud.google.com/apis/credentials/serviceaccountkey](https://console.cloud.google.com/apis/credentials/serviceaccountkey) and select **Create new service account**. You would need to give the service account sufficient permissions to do things properly. **Project editor** would suffice but is also more than enough. You can then generate and download the key, (*json* format is fine). Put the path to the `.json` file into the commandline argument when prompted.
- Follow the instructions in the commandline tool.
> cloudwise-gke
It will provide instructions and generate a <cluster_name>.tf.json
file which terraform recognizes. If you have generated a .json
credential file, you should provide it when prompted.
terraform init && terraform plan
describes changes to be made.terraform apply
makes the changes to your cloud project.- After cluster creation, obtain credentials for kubectl.
> gcloud container clusters get-credentials <cluster_name>
- If you have GPUs in your cluster, create the daemon set to install drivers, see documentation.
> kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
- The generated
<cluster_name>.tf.json
is also recognized by Symphony's scheduling mechanism andSurreal
. So you may want to link to it - If you want to remove everything, run
terraform destroy
Stay tuned
Stay tuned
-
Terraform install fails.
- If you are seeing error:
... API has not been used in project...
: duringterraform apply
, go to the Kubernetes Engine tab and/or Compute Engine tab on your google cloud console to enable their APIs.
- If you are seeing error:
-
GPU nodes are not scaling up.
- Check if the driver installation daemon set is running (see documentation).