
skip guest accelerators if count is 0. #866

Merged — 5 commits, merged Jan 23, 2018

Conversation

jacobstr (Contributor)

Instances in instance groups on Google will fail to provision despite requesting 0 GPUs. This came up for me when trying to provision a similar instance group in all available regions, but only asking for GPUs in those that support them by parameterizing the count and setting it to 0.

This might violate some Terraform principles. For example, testing locally with this change, Terraform did not recognize that my infrastructure needed to be re-deployed. Additionally, there may be valid reasons for creating an instance template with 0 GPUs that can later be tuned upwards.

I'm putting this out there as an RFC to (hopefully) demonstrate what I mean, but I have not yet run the acceptance tests locally.

jacobstr (Contributor, Author) commented Dec 15, 2017

Some more flavor: we're deploying worker pools to join a Kubernetes cluster. Each pool is deployed with a worker module that creates a MIG in each compute zone for a given region. The goal was to deploy a flavor of these worker pools that supports GPUs, e.g.

  • Regular worker pool: "us-east1-b", "us-east1-c", "us-east1-d"
  • GPU worker pool: "us-east1-b"

The regular worker pool and the GPU worker pool are provisioned using the same module. The worker module was intended to take a list of `gcp_zones` and a `gpu_count` to toggle GPU support.

Reading various sources, the count attribute is sometimes exploited for this kind of conditional resource creation.

That said, this is not a resource itself but a block within a resource. I don't see count available for other configuration blocks, though I could probably contrive a similar story for them.
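For illustration, the module usage being described might look roughly like this (the module path and variable names beyond `gcp_zones` and `gpu_count` are assumptions, not the actual module):

```hcl
# Hypothetical usage of the shared worker module: both pools use the same
# module, and GPU support is toggled purely by gpu_count.
module "workers" {
  source    = "./modules/worker"
  gcp_zones = ["us-east1-b", "us-east1-c", "us-east1-d"]
  gpu_count = 0
}

module "gpu_workers" {
  source    = "./modules/worker"
  gcp_zones = ["us-east1-b"]
  gpu_count = 1
}
```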

A sample error message from the cloud console when an instance in a zone with 0 GPUs attempts to spin up:

Instance 'koobz-wrk-xxx-1jkx' creation failed: The resource 'projects/derp/zones/us-east1-d/acceleratorTypes/nvidia-tesla-p100' was not found (when acting as '[email protected]')

@rosbo rosbo requested review from danawillow and removed request for danawillow December 20, 2017 19:33
@rosbo rosbo self-assigned this Dec 20, 2017
rosbo (Contributor) commented Dec 20, 2017

Hi Jacob,

Your use case is valid and your solution is sensible. Do you mind adding a test for the google_compute_instance too?

jacobstr (Contributor, Author) commented Jan 5, 2018

After updating the guestAccelerator test in a manner similar to the instance template test, I see the following errors (with redactions). The error is produced in this block of code.

This might be one of those Terraform-isms I suspected I might be violating, where the continue hack isn't good enough. It looks like it sees that:

{guest_accelerators: []} != {guest_accelerators: [{count: 0, type: "nvidia-tesla-k80"}]}

--- FAIL: TestAccComputeInstance_guestAcceleratorSkip (41.35s)
	testing.go:434: Step 0 error: After applying this step, the plan was not empty:

		DIFF:

		DESTROY/CREATE: google_compute_instance.foobar
		  boot_disk.#:                            "1" => "1"
		  boot_disk.0.auto_delete:                "true" => "true"
		  boot_disk.0.device_name:                "persistent-disk-0" => "<computed>"
		  boot_disk.0.disk_encryption_key_sha256: "" => "<computed>"
		  boot_disk.0.initialize_params.#:        "1" => "1"
		  boot_disk.0.initialize_params.0.image:  "debian-8-jessie-v20160803" => "debian-8-jessie-v20160803"
		  can_ip_forward:                         "false" => "false"
		  cpu_platform:                           "Intel Haswell" => "<computed>"
		  create_timeout:                         "4" => "4"
		  guest_accelerator.#:                    "0" => "1" (forces new resource)
		  guest_accelerator.0.count:              "" => "0" (forces new resource)
		  guest_accelerator.0.type:               "" => "nvidia-tesla-k80" (forces new resource)
		  instance_id:                            "xxx" => "<computed>"
		  label_fingerprint:                      "xxx" => "<computed>"
		  machine_type:                           "n1-standard-1" => "n1-standard-1"
		  metadata_fingerprint:                   "xxx" => "<computed>"
		  name:                                   "terraform-test-zihxsacz7q" => "terraform-test-zihxsacz7q"
		  network_interface.#:                    "1" => "1"
		  network_interface.0.address:            "10.142.0.3" => "<computed>"
		  network_interface.0.name:               "nic0" => "<computed>"
		  network_interface.0.network:            "xxx" => "default"
		  network_interface.0.network_ip:         "10.142.0.3" => "<computed>"
		  network_interface.0.subnetwork_project: "xxx" => "<computed>"
		  project:                                "xxx" => "<computed>"
		  scheduling.#:                           "1" => "1"
		  scheduling.0.automatic_restart:         "true" => "true"
		  scheduling.0.on_host_maintenance:       "TERMINATE" => "TERMINATE"
		  scheduling.0.preemptible:               "false" => "false"
		  self_link:                              "xxx" => "<computed>"
		  tags_fingerprint:                       "xxx=" => "<computed>"
		  zone:                                   "us-east1-d" => "us-east1-d"

		STATE:

		google_compute_instance.foobar:
		  ID = terraform-test-zihxsacz7q
		  attached_disk.# = 0
		  boot_disk.# = 1
		  boot_disk.0.auto_delete = true
		  boot_disk.0.device_name = persistent-disk-0
		  boot_disk.0.disk_encryption_key_raw =
		  boot_disk.0.disk_encryption_key_sha256 =
		  boot_disk.0.initialize_params.# = 1
		  boot_disk.0.initialize_params.0.image = debian-8-jessie-v20160803
		  boot_disk.0.initialize_params.0.size = 0
		  boot_disk.0.initialize_params.0.type =
		  boot_disk.0.source = xxx
		  can_ip_forward = false
		  cpu_platform = Intel Haswell
		  create_timeout = 4
		  guest_accelerator.# = 0
		  instance_id = xxx
		  label_fingerprint = xxx
		  machine_type = n1-standard-1
		  metadata.% = 0
		  metadata_fingerprint = xxx
		  min_cpu_platform =
		  name = terraform-test-zihxsacz7q
		  network_interface.# = 1
		  network_interface.0.access_config.# = 0
		  network_interface.0.address = 10.142.0.3
		  network_interface.0.alias_ip_range.# = 0
		  network_interface.0.name = nic0
		  network_interface.0.network = xxx
		  network_interface.0.network_ip = 10.142.0.3
		  network_interface.0.subnetwork = xxx
		  network_interface.0.subnetwork_project = xxx
		  project = xxx
		  scheduling.# = 1
		  scheduling.0.automatic_restart = true
		  scheduling.0.on_host_maintenance = TERMINATE
		  scheduling.0.preemptible = false
		  scratch_disk.# = 0
		  self_link = xxx
		  service_account.# = 0
		  tags_fingerprint = xxx
		  zone = us-east1-d

@@ -1198,6 +1198,9 @@ func expandInstanceGuestAccelerators(d TerraformResourceData, config *Config) ([
guestAccelerators := make([]*computeBeta.AcceleratorConfig, len(accels))
Review comment (Contributor):

The issue is that you create empty entries here. Even if you use `continue` below, the empty entries are still added to the list.

Instead, change this line to: `guestAccelerators := make([]*computeBeta.AcceleratorConfig, 0, len(accels))`

And change the line below starting with `guestAccelerators[i] = ...` to `guestAccelerators = append(guestAccelerators, ...)`
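The pre-allocation issue being pointed out can be sketched in plain Go (the types and function here are simplified stand-ins, not the provider's actual `computeBeta` types): `make([]T, len(xs))` pre-fills the slice with zero values, so skipping an entry with `continue` still leaves a zero-value hole behind, whereas allocating capacity only and appending keeps just the entries you want.

```go
package main

import "fmt"

// acceleratorConfig is an illustrative stand-in for the provider's
// accelerator config type.
type acceleratorConfig struct {
	Type  string
	Count int64
}

// expandAccelerators drops entries with Count == 0. Because the slice is
// created with length 0 and capacity len(raw), skipped entries are simply
// never appended -- no zero-value placeholders remain.
func expandAccelerators(raw []acceleratorConfig) []acceleratorConfig {
	accels := make([]acceleratorConfig, 0, len(raw))
	for _, a := range raw {
		if a.Count == 0 {
			continue // skip without leaving an empty entry
		}
		accels = append(accels, a)
	}
	return accels
}

func main() {
	in := []acceleratorConfig{
		{Type: "nvidia-tesla-k80", Count: 0},
		{Type: "nvidia-tesla-p100", Count: 2},
	}
	fmt.Println(len(expandAccelerators(in))) // prints 1
}
```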

Review reply (Contributor, Author):

I just attempted this and the test still fails. My theory now is that the resourceComputeInstanceRead at the end of resourceComputeInstanceCreate is what is persisted to Terraform's state.

When the plan is refreshed it sees {guest_accelerators: []}, but the current context is requesting {guest_accelerators: [{count: 0, type: "nvidia-tesla-k80"}]}.

The right way to do this might be to drop/modify the guest_accelerator on the schema.ResourceData instance as it's being read, or immediately afterwards when the count is 0. I'll have to poke around for an appropriate lifecycle hook (e.g. afterSchemaResourceDataRead) where this could be implemented.

I'm still puzzled why similar behavior wasn't observed with the instance template.

rosbo (Contributor) commented Jan 11, 2018:

The current state depends on whether the -refresh flag is true or false.

When you see a diff like:

guest_accelerator.#:       "0" => "1" (forces new resource)
guest_accelerator.0.count: "" => "0" (forces new resource)
guest_accelerator.0.type:  "" => "nvidia-tesla-k80" (forces new resource)

The left hand side is the current state. By default, when you run terraform plan or terraform apply, the flag -refresh=true. This means it calls the Read function to refresh the current state. If you set -refresh=false, then, the current state will be equal to whatever is stored in your state file.

The right hand side (after =>) is always equal to what you have in your Terraform config file (.tf file).

In your case, the config has one guest_accelerator entry with count = 0 and type = nvidia-tesla-k80. However, the current state is empty causing a diff.
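The special case being described can be sketched as a small predicate in plain Go (the function name and types are illustrative assumptions, not the actual CustomizeDiff API): suppress the diff only when the refreshed state has no guest_accelerator entries and the config asks for exactly one entry with count = 0.

```go
package main

import "fmt"

// guestAccelerator is an illustrative stand-in for the schema's
// guest_accelerator block.
type guestAccelerator struct {
	Type  string
	Count int64
}

// emptyVsZeroCount reports whether the "empty state vs single zero-count
// config entry" case applies, i.e. the diff should be treated as no change.
func emptyVsZeroCount(oldState, newConfig []guestAccelerator) bool {
	return len(oldState) == 0 &&
		len(newConfig) == 1 &&
		newConfig[0].Count == 0
}

func main() {
	state := []guestAccelerator{}
	config := []guestAccelerator{{Type: "nvidia-tesla-k80", Count: 0}}
	fmt.Println(emptyVsZeroCount(state, config)) // prints true
}
```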

You can use the new customdiff feature to suppress the diff in that case. I added this new helper to our codebase yesterday and the PR hasn't been merged yet: #945.

Let me know if you need help with customdiff or if you want me to takeover from here.

Thanks

Reply (Contributor, Author):

Thanks @rosbo. Taking a stab with CustomizeDiffFunc.

@jacobstr jacobstr force-pushed the master branch 3 times, most recently from 264cfbd to 9080ac0 Compare January 12, 2018 00:11
jacobstr (Contributor, Author):

Took a stab at it with 9080ac0. There's an error I'm currently swallowing in that commit:

Clear only operates on computed keys - guest_accelerator is not one

Clear seemed like the obvious function to use to ignore a diff. But indeed, the docs state the limitation reported in the error message.

jacobstr (Contributor, Author):

@rosbo I amended the previous commit by adding `Computed: true` to the schema, which allowed the CustomizeDiff to do its job. It's unclear to me what effect changing it to a computed field will have.

@@ -551,6 +553,9 @@ func resourceComputeInstance() *schema.Resource {
Deprecated: "Use timeouts block instead.",
},
},
CustomizeDiff: customdiff.All(
suppressEmptyGuestAcceleratorDiff,
rosbo (Contributor) commented:

Use https://godoc.org/github.com/hashicorp/terraform/helper/customdiff#IfValueChange here so we can chain other CustomizeDiff functions in the future.

rosbo (Contributor) left a review comment:

Getting closer to merging. One small suggestion and please rebase the branch and we should be good to go.

Thanks for your great work!

jacobstr and others added 5 commits January 22, 2018
jacobstr (Contributor, Author) commented Jan 22, 2018

So I wrapped the suppressEmptyGuestAcceleratorDiff method in a customdiff.If and apply the custom diff if there's any change to guest_accelerator. That condition is quite non-specific, but duplicating the logic from suppressEmptyGuestAcceleratorDiff felt repetitive.

I also wanted to point out that it's wrapped in customdiff.All, and the suppressEmptyGuestAcceleratorDiff method only affects the portion of the diff related to the guest_accelerator key, i.e. I believe it would still compose well without the conditional check. The exception might be if there's another diff customizer for the guest_accelerator key.

rosbo (Contributor) commented Jan 23, 2018

All tests are passing on the CI server. Merging this change. Thank you for your contribution @jacobstr!

@rosbo rosbo merged commit 939ba6d into hashicorp:master Jan 23, 2018
modular-magician added a commit to modular-magician/terraform-provider-google that referenced this pull request Sep 27, 2019
ghost commented Mar 29, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 29, 2020