Fix autoprovisioning with spot nodes #187

Open
wants to merge 2 commits into base: main

@avrittrohwer (Collaborator) commented Sep 18, 2024

Fixes / Features

  • Fixes workload rendering when using spot. Without this change, xpk workload create fails with errors like:

    [XPK] Waiting for `Creating Workload`, for 0 seconds
    error: error parsing /tmp/tmp242uhnfs: error converting YAML to JSON: yaml: line 33: could not find expected ':'
    [XPK] Task: `Creating Workload` terminated with code `1`
    
  • Adds the pod tolerations required when using node auto-provisioning with spot nodes. Without these tolerations, the cluster autoscaler will not create new spot node pools.
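As a rough illustration of the second fix, the pod-spec fields involved might look like the sketch below. It assumes the label and taint key that GKE documents for Spot nodes, `cloud.google.com/gke-spot`. The helper names `spot_scheduling_fields` and `render_fragment` are hypothetical and are not taken from xpk's code; the actual change in this PR may structure things differently.

```python
import json

# Label/taint key that GKE documents for Spot nodes (assumption: the PR
# relies on this same key).
GKE_SPOT_KEY = "cloud.google.com/gke-spot"


def spot_scheduling_fields(use_spot: bool) -> dict:
    """Return pod-spec fields that target Spot nodes.

    Hypothetical helper for illustration only; not part of xpk.
    """
    if not use_spot:
        # On-demand workloads get neither the selector nor the toleration.
        return {}
    return {
        # Ask the scheduler (and node auto-provisioning) for Spot nodes.
        "nodeSelector": {GKE_SPOT_KEY: "true"},
        # Tolerate the taint placed on auto-provisioned Spot node pools.
        "tolerations": [
            {
                "key": GKE_SPOT_KEY,
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    }


def render_fragment(use_spot: bool) -> str:
    # Serialize as JSON, which YAML parsers also accept as flow syntax.
    return json.dumps(spot_scheduling_fields(use_spot), indent=2)


if __name__ == "__main__":
    print(render_fragment(True))
```

Building the fields as a data structure and serializing them, rather than splicing text into a template, is one way to avoid indentation-sensitive rendering mistakes. Whether that matches the cause of the parse error quoted above is not stated in this PR.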

Testing / Documentation

  • [ y ] Tests pass
  • [ y, not needed ] Appropriate changes to documentation are included in the PR

Node auto-provisioning with spot

  1. Created an xpk cluster with the --spot and autoprovisioning flags.
  2. Created a workload with a different topology than the cluster default.
  3. Observed a nodepool being created with the new workload topology using spot TPU nodes.

Node auto-provisioning without spot

  1. Created an xpk cluster with the --spot and autoprovisioning flags.
  2. Created a workload with a different topology than the cluster default and the --on-demand flag.
  3. Validated that the generated YAML does not specify the spot node-selector and tolerations.
  4. Observed a nodepool being created with the new workload topology using on-demand TPU nodes.
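The validation in step 3 was presumably done by inspecting the generated YAML by hand. A scripted version of the same check could look like the sketch below. It assumes the manifest's pod spec has already been parsed into a Python dict and that spot scheduling is expressed through GKE's documented `cloud.google.com/gke-spot` key. The function name `references_spot` is hypothetical.

```python
# Label/taint key that GKE documents for Spot nodes (assumed here).
GKE_SPOT_KEY = "cloud.google.com/gke-spot"


def references_spot(pod_spec: dict) -> bool:
    """Return True if a parsed pod spec selects or tolerates Spot nodes.

    Hypothetical helper for illustration only; not part of xpk.
    """
    # Missing or null fields are treated as empty.
    selector = pod_spec.get("nodeSelector") or {}
    tolerations = pod_spec.get("tolerations") or []
    has_selector = GKE_SPOT_KEY in selector
    has_toleration = any(t.get("key") == GKE_SPOT_KEY for t in tolerations)
    return has_selector or has_toleration


if __name__ == "__main__":
    on_demand_spec = {"containers": [{"name": "main", "image": "example"}]}
    # An on-demand workload should not reference Spot at all.
    assert not references_spot(on_demand_spec)
```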

Not auto-provisioning with spot

  1. Created an xpk cluster with the --spot flag.
  2. Validated that the nodepool was created with spot nodes.
  3. Created a workload and validated it ran.

@avrittrohwer (Collaborator, Author)

zone: 'us-central2'>] finished with error: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: us-central2-b

@avrittrohwer marked this pull request as ready for review September 18, 2024 22:58
src/xpk/core/nap.py Outdated