Add Support for AWS Launch Template Configuration #2668
Conversation
@viniciusdc How can we help get this PR prioritized for review & merge? We need this capability to make progress on our air-gapped deployment investigation.
rm unnecessary parameters & update template & set ami_type as private var
@joneszc can you try out these changes and check if the generated MIME user_data file works as expected? I will do a deployment this evening as well.
How to test

These are valid configs that should work:

```yaml
test:
  instance: m5.xlarge
  min_nodes: 0
  max_nodes: 1
  gpu: false
  single_subnet: false
  permissions_boundary:
  launch_template:
    node_prebootstrap_command: |
      #!/bin/bash
      echo "Hello, Nebari!"
test2:
  instance: m5.xlarge
  min_nodes: 0
  max_nodes: 1
  gpu: false
  single_subnet: false
  permissions_boundary:
  ami_type: CUSTOM
  launch_template:
    ami_id: ami-0c3f3b5f2f3f3f3f3
    node_prebootstrap_command: |
      #!/bin/bash
      echo "Hello, Nebari!"
```

`launch_template` can also be defined at the provider level and will affect all node_groups, though for `ami_id` to be passed, `ami_type` needs to be set to `CUSTOM`.
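For illustration, a provider-level placement might look like the sketch below; the surrounding `amazon_web_services`/`node_groups` keys follow the usual Nebari config layout, but the exact placement here is an assumption, not taken verbatim from this PR:

```yaml
amazon_web_services:
  # assumed provider-level placement: applies to all node groups
  launch_template:
    node_prebootstrap_command: |
      #!/bin/bash
      echo "Hello, Nebari!"
  node_groups:
    test:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 1
```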
Hello @viniciusdc

Also, the options to set
@joneszc @tylergraff I am testing this right now:
You are right; that's what I had in mind when first including that validator: mainly to avoid the case where the user misplaces something in the YAML. But I agree with you that the current code does more harm to extensibility than good. I have a fix for that already and will commit shortly.
I agree that was an oversight on my part; I will address it shortly. The only thing I am curious about is this last part here:
I am not sure that will be possible when launching an instance using a launch_template while the
While testing, I found a minor bug with some validations of an internal field:

```
/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:239 in deploy

   236 │ def deploy(
   237 │     self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
   238 │ ):
 ❱ 239 │     self.check_immutable_fields()
   240 │
   241 │     with super().deploy(stage_outputs, disable_prompt):
   242 │         env_mapping = {}

/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:275 in check_immutable_fields

   272 │     bottom_level_schema = self.config
   273 │     if len(keys) > 1:
   274 │         print(keys)
 ❱ 275 │         bottom_level_schema = functools.reduce(
   276 │             lambda m, k: getattr(m, k), keys[:-1], self.config
   277 │         )
   278 │     extra_field_schema = schema.ExtraFieldSchema(

/Nebari/nebari/src/_nebari/stages/terraform_state/__init__.py:276 in <lambda>

   273 │     if len(keys) > 1:
   274 │         print(keys)
   275 │         bottom_level_schema = functools.reduce(
 ❱ 276 │             lambda m, k: getattr(m, k), keys[:-1], self.config
   277 │         )
   278 │     extra_field_schema = schema.ExtraFieldSchema(
   279 │         **bottom_level_schema.model_fields[keys[-1]].json_schema_extra or {}

AttributeError: 'dict' object has no attribute 'test'
```
Hi @tylergraff @joneszc, while reviewing and testing these proposed changes (using

and also this one:

This is not a big problem, as we now have ways to control this so that it never happens, but I just wanted to signal it, since we will be raising this in our docs.

Also, when using launch_templates, upgrades to Kubernetes versions and AMIs are no longer performed automatically from the UI, so nebari deployments will be required to upgrade the available Kubernetes versions.
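Since those upgrades won't be triggered from the UI anymore, the expectation is that users bump the version in their config and redeploy. A minimal sketch, assuming the existing `kubernetes_version` field under `amazon_web_services`:

```yaml
amazon_web_services:
  kubernetes_version: "1.29"  # bump this manually, then re-run `nebari deploy`
```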
Looking great!

```
[terraform]: # module.kubernetes.aws_eks_node_group.main[3] must be replaced
[terraform]: -/+ resource "aws_eks_node_group" "main" {
[terraform]: ~ ami_type = "AL2_x86_64" -> (known after apply)
[terraform]: ~ arn = "arn:aws:eks:us-east-1:******:nodegroup/vini-template-dev/test/----" -> (known after apply)
[terraform]: ~ capacity_type = "ON_DEMAND" -> (known after apply)
[terraform]: ~ disk_size = 0 -> 50 # forces replacement
[terraform]: ~ id = "vini-template-dev:test" -> (known after apply)
[terraform]: ~ instance_types = [ # forces replacement
[terraform]: + "t2.micro",
[terraform]: ]
[terraform]: + node_group_name_prefix = (known after apply)
[terraform]: ~ release_version = "1.29.6-20240910" -> (known after apply)
[terraform]: ~ resources = [
[terraform]: - {
[terraform]: - autoscaling_groups = [
[terraform]: - {
[terraform]: - name = "eks-test-----"
[terraform]: },
[terraform]: ]
[terraform]: - remote_access_security_group_id = ""
[terraform]: },
[terraform]: ] -> (known after apply)
[terraform]: ~ status = "ACTIVE" -> (known after apply)
[terraform]: tags = {
[terraform]: "Environment" = "dev"
[terraform]: "Owner" = "terraform"
[terraform]: "Project" = "vini-template"
[terraform]: "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test"
[terraform]: "propagate_at_launch" = "true"
[terraform]: }
[terraform]: ~ version = "1.29" -> (known after apply)
[terraform]: # (6 unchanged attributes hidden)
[terraform]:
[terraform]: - launch_template {
[terraform]: - id = "lt-02c4517df4f8e2bbc" -> null
[terraform]: - name = "test" -> null
[terraform]: - version = "1" -> null
[terraform]: }
[terraform]:
[terraform]: - update_config {
[terraform]: - max_unavailable = 1 -> null
[terraform]: - max_unavailable_percentage = 0 -> null
[terraform]: }
[terraform]:
[terraform]: # (1 unchanged block hidden)
[terraform]: }
[terraform]:
[terraform]: # module.kubernetes.aws_eks_node_group.main[4] will be created
[terraform]: + resource "aws_eks_node_group" "main" {
[terraform]: + ami_type = (known after apply)
[terraform]: + arn = (known after apply)
[terraform]: + capacity_type = (known after apply)
[terraform]: + cluster_name = "vini-template-dev"
[terraform]: + disk_size = 50
[terraform]: + id = (known after apply)
[terraform]: + instance_types = [
[terraform]: + "t2.micro",
[terraform]: ]
[terraform]: + labels = {
[terraform]: + "dedicated" = "test1"
[terraform]: }
[terraform]: + node_group_name = "test1"
[terraform]: + node_group_name_prefix = (known after apply)
[terraform]: + node_role_arn = "arn:aws:iam::******:role/vini-template-dev-eks-node-group-role"
[terraform]: + release_version = (known after apply)
[terraform]: + resources = (known after apply)
[terraform]: + status = (known after apply)
[terraform]: + subnet_ids = [
[terraform]: + "subnet-******",
[terraform]: + "subnet-******",
[terraform]: ]
[terraform]: + tags = {
[terraform]: + "Environment" = "dev"
[terraform]: + "Owner" = "terraform"
[terraform]: + "Project" = "vini-template"
[terraform]: + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test1"
[terraform]: + "propagate_at_launch" = "true"
[terraform]: }
[terraform]: + tags_all = {
[terraform]: + "Environment" = "dev"
[terraform]: + "Owner" = "terraform"
[terraform]: + "Project" = "vini-template"
[terraform]: + "k8s.io/cluster-autoscaler/node-template/label/dedicated" = "test1"
[terraform]: + "propagate_at_launch" = "true"
[terraform]: }
[terraform]: + version = (known after apply)
[terraform]:
[terraform]: + scaling_config {
[terraform]: + desired_size = 0
[terraform]: + max_size = 1
[terraform]: + min_size = 0
[terraform]: }
[terraform]: }
[terraform]:
[terraform]: # module.kubernetes.aws_launch_template.main["test"] will be destroyed
[terraform]: # (because key ["test"] is not in for_each map)
[terraform]: - resource "aws_launch_template" "main" {
[terraform]: - arn = "arn:aws:ec2:us-east-******:launch-template/lt-02c4517df4f8e2bbc" -> null
[terraform]: - default_version = 1 -> null
[terraform]: - disable_api_stop = false -> null
[terraform]: - disable_api_termination = false -> null
[terraform]: - id = "lt-02c4517df4f8e2bbc" -> null
[terraform]: - instance_type = "t2.micro" -> null
[terraform]: - latest_version = 1 -> null
[terraform]: - name = "test" -> null
[terraform]: - security_group_names = [] -> null
[terraform]: - tags = {} -> null
[terraform]: - tags_all = {} -> null
[terraform]: - user_data = "TUlNRS1W***FsbCAteSBodG9wCgoKCgotLT09TVlCT1VOREFSWT09LS0K" -> null
[terraform]: - vpc_security_group_ids = [
[terraform]: - "sg-******",
[terraform]: ] -> null
[terraform]:
[terraform]: - block_device_mappings {
[terraform]: - device_name = "/dev/xvda" -> null
[terraform]:
[terraform]: - ebs {
[terraform]: - iops = 0 -> null
[terraform]: - throughput = 0 -> null
[terraform]: - volume_size = 50 -> null
[terraform]: - volume_type = "gp2" -> null
[terraform]: }
[terraform]: }
[terraform]:
[terraform]: - metadata_options {
[terraform]: - http_endpoint = "enabled" -> null
[terraform]: - http_put_response_hop_limit = 0 -> null
[terraform]: - http_tokens = "required" -> null
[terraform]: - instance_metadata_tags = "enabled" -> null
[terraform]: }
[terraform]: }
[terraform]:
```
Running with the example below:

```yaml
test:
  instance: t2.micro
  min_nodes: 1
  max_nodes: 1
  gpu: false
  single_subnet: false
  launch_template:
    pre_bootstrap_command: "#!/bin/bash\n# This script is executed before the node is bootstrapped\n# You can use this script to install additional packages or configure the node\n# For example, to install the `htop` package, you can run:\n# sudo apt-get update\n# sudo apt-get install -y htop"
  permissions_boundary:
```

`ami_id` is not required; the terraform resource handles that accordingly when applying the node_group. I haven't tested a custom `ami_id` yet (though I could attest it switches to `CUSTOM`), as I don't have a custom image.
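As a side note, the same command should be expressible as a YAML block scalar, which is easier to read; a sketch (assuming block scalars are accepted for this field, as they are for `node_prebootstrap_command` in the earlier examples):

```yaml
launch_template:
  pre_bootstrap_command: |
    #!/bin/bash
    # This script is executed before the node is bootstrapped
    # For example, install the htop package:
    sudo apt-get update
    sudo apt-get install -y htop
```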
Hello @viniciusdc

It appears you changed the format of the env variable statements set in the bootstrap.sh command.

As you can see above, the env variables don't render on the node when enclosed in `{{ ' ' }}`.

Additionally, when invoking a custom

Finally, the following conditional, checking whether or not

The bootstrap.sh command fails to render and the node fails to join the cluster. I was able to continue testing by, instead, using the following conditional with
Hi @joneszc, thank you so much for the testing. Indeed, while reworking that template I ended up incorrectly assigning the values there; thanks for catching that. I fixed it this afternoon while testing the CUSTOM logic issue you reported above. I had also noticed the blank user_data sections and fixed those. Will finish testing and commit the new changes.
I am addressing an issue with the ami_id and launch_template.ami_profile_name that might be related to the faulty logic you saw with CUSTOM. After you continued testing with the no-render parameter, were you able to get your instance running with the custom code?
Hello @viniciusdc
For more details, here's the docs PR: nebari-dev/nebari-docs#525
It's worth noting that the AWS provider seems to have an issue with the update of

I will address this inconsistency with a follow-up PR at another time.
Reference Issues or PRs
This is an extended version of the fantastic work suggested and developed by @joneszc in his original PR #2621.
The significant difference from his PR is moving the template and AMI-type constructs out into their own pydantic schema (which also allows users to customize the template further). This also reduces the conditional logic in the terraform HCL.
What does this implement/fix?
- Adds new custom field `launch_template` to the aws provider config schema.
- Adds new input variable `ami_type` to handle the GPU vs. normal AMI selection. This previously happened in the terraform code only, but with the requirement for this type to be set to CUSTOM when using a user-made AMI, I moved that logic to Python. (This is not exposed to the user.)

Note: Both these configurations depend on a new way of spinning up the node_groups, using launch_templates instead of the usual handling by the aws provider resources and scaling group. This only affects resources that have the launch_template field populated, but it does lead to the node_group being recreated, along with all associated instances within it.
`tests/tests_unit/test_cli_validate.py` was also updated to better reflect the original exception in case of schema errors during assertion.

Testing
How to test this PR?
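A minimal sketch, condensed from the "How to test" comment in the thread above (instance type and values are illustrative): add a node group such as the following to the aws provider config, deploy, and verify the generated MIME user_data runs the pre-bootstrap command on the node.

```yaml
test:
  instance: m5.xlarge
  min_nodes: 0
  max_nodes: 1
  gpu: false
  single_subnet: false
  launch_template:
    node_prebootstrap_command: |
      #!/bin/bash
      echo "Hello, Nebari!"
```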
Any other comments?