
Terraform fails to notice when a Nomad job has changed #1

Open

hashibot opened this issue Jun 13, 2017 · 40 comments

@hashibot

This issue was originally opened by @blalor as hashicorp/terraform#14038. It was migrated here as part of the provider split. The original body of the issue is below.


Terraform Version

Terraform v0.9.3

Affected Resource(s)

  • nomad_job

Terraform Configuration Files

provider "nomad" {
    address = "http://localhost:4646"
}

resource "nomad_job" "example" {
    jobspec = "${file("${path.module}/example.nomad")}"
}

Debug Output

Log for 2nd terraform apply: apply.log.txt

Console output:

$ TF_LOG=TRACE TF_LOG_PATH=apply.log.txt terraform apply
nomad_job.example: Refreshing state... (ID: example)

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Expected Behavior

Terraform should have noticed that the job had changed in the cluster and updated it.

Actual Behavior

Nuttin', honey.

Steps to Reproduce

  1. nomad agent -dev

  2. terraform apply

  3. nomad status example shows one allocation for task group grp

  4. change task group count:

     curl -sfS localhost:4646/v1/job/example | \
         jq '.TaskGroups[0].Count = 2 | {"Job": .}' | \
         curl -is -X PUT -d@- localhost:4646/v1/job/example
    
  5. nomad status example shows two allocations for example.grp

  6. terraform plan shows no change

  7. terraform apply makes no change

  8. nomad status example still shows two allocations

@hashibot hashibot added the bug label Jun 13, 2017
@ketzacoatl

ketzacoatl commented Feb 20, 2018

@paddycarver / @grubernaut / @radeksimko, is this a verified issue/bug? I'm seeing a slightly different issue, but I'm not sure if it's the same as this one. If a job has been stopped, or is not running (but the spec has not changed), then the TF Nomad provider should update Nomad. It might be best to always send the job to Nomad and let Nomad work out any differences in the spec. This issue makes the provider unusable for the most basic use cases.

@ketzacoatl

The underlying issue here also produces the following behavior:

  • run a job with the terraform provider
  • nomad stop that job manually
  • re-apply the Terraform plan; the job is still stopped

@paddycarver
Contributor

Hi @ketzacoatl! I can't say for sure whether this is a verified bug, nor can I explain the behaviour. I'll try to look into this soon and come back with a bit more information, and a fix if necessary. Apologies for the delays on this.

@paddycarver paddycarver self-assigned this Feb 22, 2018
@ketzacoatl

Hi @paddycarver, thanks for taking a look! Were you able to confirm the behavior we see in practice?

@shantanugadgil
Contributor

My observation is the same as @ketzacoatl's.

It would just be more operator-friendly if users could interact via Terraform rather than nomad status.

My versions are:
Terraform 0.11.5
Nomad (built from master) (0.8.0-dev)

@ketzacoatl

ketzacoatl commented Apr 26, 2018

Any updates on this @paddycarver?

Also, a question: Does the Terraform provider always submit the job to nomad, or does it decide whether or not it should?

@kerscher

kerscher commented May 3, 2018

@paddycarver I took a quick look inside the codebase, and it seems the culprit is:

  • resourceJobRead is not implemented
  • there is no metadata to discern changes other than job ID within the schema
  • deregister-then-register only occurs when the job ID itself changes

#15 by @apparentlymart adds more metadata, but modify_index wouldn’t pick up changes from the provider. I think “Solution 1” below is a nice way of dealing with it.

Solution 1: add computed status metadata to the schema, plus the other items from #15

  • Upon read, check with job.Stopped() whether the job should be re-registered
  • If the job is stopped, signal through the metadata that recreation is necessary (see the sketch after the pros and cons below)

Pros

  • Smarter information for plan
  • Will always restart a stopped job according to configuration

Cons

  • plan now has to read more information first, possibly slowing it down
  • More changes to the codebase
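
A minimal sketch of what Solution 1's read step could look like, assuming the provider's meta value is a Nomad API client and the schema has a computed "status" attribute (both are assumptions here; the real provider wires things differently):

package nomad

import (
    "github.com/hashicorp/nomad/api"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceJobRead(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*api.Client)

    // A real implementation would distinguish "job not found" from
    // transient API errors before clearing the ID.
    job, _, err := client.Jobs().Info(d.Id(), nil)
    if err != nil {
        d.SetId("")
        return nil
    }

    // Surface the job's state as computed metadata so plan can flag a
    // stopped job for re-registration.
    status := "running"
    if job.Status != nil {
        status = *job.Status
    }
    if job.Stopped() {
        status = "stopped"
    }
    return d.Set("status", status)
}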

Solution 2: Always de-register and re-register

  • Refactor resourceJobRegister so that it always deregisters and re-registers, regardless of whether the configuration changed (see the sketch after the pros and cons below).
  • Always mark resource job specs as tainted

Pros

  • Easy to add
  • Never leads to stale jobs not being picked up
  • Retains insanely fast plan with Nomad jobs

Cons

  • Users probably do not expect that to happen if they did not change their configurations
  • Ugly plans with unnecessary information
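
A matching sketch of Solution 2 under the same assumptions, parsing the jobspec with the nomad/jobspec package; this illustrates the idea, not the provider's actual register logic:

package nomad

import (
    "strings"

    "github.com/hashicorp/nomad/api"
    "github.com/hashicorp/nomad/jobspec"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceJobRegister(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*api.Client)

    job, err := jobspec.Parse(strings.NewReader(d.Get("jobspec").(string)))
    if err != nil {
        return err
    }

    // Deregister first, ignoring the result, so that a job which is
    // already gone (or was stopped out-of-band) is always re-created.
    client.Jobs().Deregister(*job.ID, false, nil)

    if _, _, err := client.Jobs().Register(job, nil); err != nil {
        return err
    }
    d.SetId(*job.ID)
    return nil
}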

@ketzacoatl

ketzacoatl commented Jun 28, 2018

@paddycarver, I'd love to help resolve this problem, as IMO it prevents serious use of Terraform to manage Nomad. Do you have any guidance or recommendations on how you would like to address the problem?

cc @katbyte / @radeksimko

@ketzacoatl

@mitchellh I'd be very grateful for your feedback/guidance here.

@ketzacoatl

@cgbaker, with renewed development efforts here, what are your thoughts on this issue?

@cgbaker
Contributor

cgbaker commented Dec 6, 2018

Hi @ketzacoatl, I wholeheartedly agree that this is something that must be addressed. This shortcoming makes the Nomad provider just about useless for Day 2 operations.

As noted in #15 , there are a few different options for addressing this. The solution partially hinges on whether we want to try to find a general solution that doesn't require modifying and re-releasing the provider every time Nomad releases a new version. On the other hand, with the Nomad product team taking ownership of this TF provider, we can potentially address such a workflow a little better than before. And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Thank you for your patience and persistence on this issue.

@ketzacoatl

ketzacoatl commented Dec 11, 2018

And it may be prudent to find a temporary solution to this issue while we work on a generic/unversioned provider.

That would be great, with regard to being able to continue using the provider while future improvements are worked out.

@cgbaker
Contributor

cgbaker commented Dec 20, 2018

Nomad 0.8.x API support is available in v1.3.0 of the Nomad provider, released today. I will look at finding a longer-term solution for this in upcoming versions; having said that, even if we continue versioning the Nomad provider, we pledge to be much more responsive in updating it going forward.

@ketzacoatl

The Nomad team is committed to addressing this; I will post an update here soon giving an idea as to the timetable, but my intention is to either resolve this issue as part of the upcoming 1.3.0 version of the provider (targeting Nomad 0.8.x) or the 1.4.0 version (targeting Nomad 0.9.x).

Hello, just checking back up on this - have there been any improvements related to this?

@cgbaker
Contributor

cgbaker commented Apr 4, 2019 via email

@ketzacoatl

Fair enough, thanks for the update!

@cgbaker
Contributor

cgbaker commented Jun 5, 2019

Update: this was not resolved in the 1.4.0 release of the provider, but we're still tracking this.

@ketzacoatl

I am stoked to see this on the 1.5 milestone, rock on!

@paddycarver paddycarver removed their assignment Oct 7, 2019
@liemle3893

It has been a year since the last comment. Any update on this?

@ketzacoatl

@cgbaker WRT the roadmap, are there refactors or internal changes that block fixing this properly?

@cgbaker
Contributor

cgbaker commented Jul 14, 2020

We were waiting for the 2.0 update of the plugin SDK to see if it helped deal with this. We're actively looking at it now; we want this issue dealt with in the next few months as part of the Nomad 1.0 milestone.

liemle3893 pushed a commit to liemle3893/terraform-provider-nomad that referenced this issue Jul 15, 2020
Resolved hashicorp#1. Terraform fails to notice when a Nomad job has changed
To have this, you need to maximize the use of Job Meta (e.g., Docker
image).
This patch only notices when the number of running instances of a group
in a job has changed.
Tested with Nomad 0.11.3.
Tested with `service` jobs.
Batch jobs are assumed to be safe to run multiple times, so there would be no
problem.
@eliburke

eliburke commented Jul 8, 2022

Is there any plan to release v1.5 soon? Is there more work to finish out this feature? #149 hasn't been touched in almost 2 years.

@lgfa29
Contributor

lgfa29 commented Jul 15, 2022

Hi @eliburke 👋

No plans to get this fixed unfortunately. The work on #149 was a brave attempt to try and map all possible jobspec fields into a Terraform resource schema, but that was not a sustainable approach. As Nomad evolves it becomes almost impossible to manually keep up.

We think that a better approach would be to leverage the Nomad OpenAPI project that is able to auto-generate a Nomad job spec, and the Terraform Provider Framework which is a new and more flexible way to create providers.

But this will take a significant effort that is not in our roadmap yet.

@shantanugadgil
Contributor

Suggestion: as a quick workaround for this problem, what we have been doing for quite some time is to render out the entire Nomad job file and use it via the terraform-nomad provider (in completely rendered form).

What this does is: when you suspect things and want to check what is up with the job, you can try nomad job plan myjob.nomad instead of the usual terraform plan.

In my opinion this provides an easy escape-hatch mechanism.

@tristanmorgan
Member

Can the Submission field in the API now be used?
enable support for setting original job source

@lgfa29
Contributor

lgfa29 commented Jul 31, 2023

Hi @tristanmorgan,

Unfortunately that's not enough. The key issue here is describing the Nomad job specification as a Terraform resource schema and detecting changes on individual fields. The resource already has the raw jobspec stored, which would be the same as the value returned by the new job submission endpoint.

@jorgemarey
Contributor

Just made a comment on #238 (comment) about this.

lgfa29 added a commit that referenced this issue Dec 19, 2023
If available, use the job submission source to detect changes to
`jobspec`. This can mitigate drift detection problems such as #1.
@lgfa29
Contributor

lgfa29 commented Dec 19, 2023

Reading @jorgemarey comment again, I noticed this part:

We detect changes by having a shasum of the rendered job and seeing if it changes with the value stored for the previous resource version.

While not quite related to this issue, it made me realize that we can use the job submission data (if available) to detect changes to jobspec, which should help mitigate drift problems like this when the submission data is available.
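
For illustration, a minimal, self-contained version of that checksum idea (SHA-256 over the raw rendered jobspec is an assumption; the provider's actual change detection works differently):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// jobspecChecksum returns a stable fingerprint of a rendered jobspec.
func jobspecChecksum(spec string) string {
    sum := sha256.Sum256([]byte(spec))
    return hex.EncodeToString(sum[:])
}

func main() {
    before := jobspecChecksum(`job "example" { group "grp" { count = 1 } }`)
    after := jobspecChecksum(`job "example" { group "grp" { count = 2 } }`)
    // A mismatch means the rendered job changed and should be re-registered.
    fmt.Println("changed:", before != after)
}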

lgfa29 added a commit that referenced this issue Dec 19, 2023
The default behaviour of the Terraform SDK is to copy the plan result
into state which could result in partial state updates, where Terraform
state is updated, but the actual resource state is not, in case of an
error during the apply.

This is normally not an issue because resources are expected to undo
these changes on state refresh. Any partial update is reconciled with
the actual resource state.

But due to #1, the `nomad_job` resource is not able to properly
reconcile on refresh, causing the partial update to prevent further
applies unless the configuration is also changed.

This commit uses the `d.Partial()` method to signal to Terraform that
any state changes should be rolled back in case of an error.
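
A rough sketch of the pattern this commit message describes, assuming an SDK v2 resource; registerJob is a hypothetical stand-in for the provider's real registration logic:

package nomad

import (
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// registerJob is a hypothetical stand-in for parsing the jobspec and
// submitting it to Nomad.
func registerJob(d *schema.ResourceData, meta interface{}) error {
    return nil
}

func resourceJobUpdate(d *schema.ResourceData, meta interface{}) error {
    // With partial mode enabled, returning an error tells the SDK not to
    // persist the planned attribute changes into state.
    d.Partial(true)

    if err := registerJob(d, meta); err != nil {
        return err
    }

    // Success: allow the full updated state to be saved.
    d.Partial(false)
    return nil
}
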
lgfa29 added a commit that referenced this issue Dec 19, 2023
If available, use the job submission source to detect changes to
`jobspec`. This can mitigate drift detection problems such as #1.

Read HCL2 variables from job submission even if the `nomad_job` resource
does not specify an `hcl2` block.
@ag-TJNII

Is this improved at all with v2 of the provider? I need to take over control of previously hand-generated Nomad job templates with Terraform. Usually in this workflow I create a resource in the Terraform code, import the existing resource, and reconcile the differences. However, the fact that this provider doesn't detect differences completely breaks that workflow.

I briefly tried upgrading to the v2 provider but still got an unexpectedly large diff; I didn't determine how much of that was this bug and how much was v1-to-v2 incompatibilities. I dropped back to v1 to avoid burning time upgrading to a new version that might have the same major bug.

@tgross
Member

tgross commented Jul 12, 2024

Hey folks! @gulducat and I did a quick re-assessment of this bug and here's what the current situation is:

  • Because of the new job submission data, terraform plan will successfully detect changes that are made via nomad job run and the Nomad UI.
  • This will not detect any changes made via nomad job scale or via the Nomad API directly.
  • This will not detect that a job has been stopped via nomad job stop.
  • We think we can reduce the scope of this bug by adding state and count fields to the resource schema (see the sketch after this list), and that isn't a huge lift.
  • Resolving the problem completely will unfortunately still require having a complete model of the Job object in the Terraform state. There isn't really a good way around this yet and it's going to be a big lift to fix.
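
A speculative sketch of those computed state and count fields; the attribute names and shapes here are assumptions, not the provider's actual design:

package nomad

import (
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// jobStateAttributes sketches computed attributes that would let refresh
// observe out-of-band stops and scaling.
func jobStateAttributes() map[string]*schema.Schema {
    return map[string]*schema.Schema{
        // Job status as reported by Nomad, e.g. "running" or "dead".
        "status": {
            Type:     schema.TypeString,
            Computed: true,
        },
        // One entry per task group, tracking the live count.
        "task_groups": {
            Type:     schema.TypeList,
            Computed: true,
            Elem: &schema.Resource{
                Schema: map[string]*schema.Schema{
                    "name":  {Type: schema.TypeString, Computed: true},
                    "count": {Type: schema.TypeInt, Computed: true},
                },
            },
        },
    }
}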

After some internal discussion I'm marking this for further roadmapping. We'll update again once we know more.

@tgross tgross added the hcc/jira Admin - internal label Jul 12, 2024
@tgross tgross removed this from the 1.5.0 milestone Jul 12, 2024
@ketzacoatl

@tgross awesome update, thank you for all that info. Even with a big lift in the future, smaller/incremental improvements are very welcome!
