Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: rounding large int values from state #655

Closed
slevenick opened this issue Dec 8, 2020 · 21 comments · Fixed by hashicorp/terraform-provider-google#8088 or hashicorp/terraform-provider-google-beta#2817
Labels
bug Something isn't working

Comments

@slevenick
Copy link

Terraform Version

Terraform v0.14.0

Terraform Configuration Files

Nothing here is particularly important, only that google_compute_url_map stores a large value in a TypeInt. This value is Computed and is not changing between calls of terraform plan

resource "google_compute_url_map" "default" {
  name        = "url-map"
  description = "Description"

  default_service = google_compute_backend_bucket.static.id
  default_route_action {
    url_rewrite {
      host_rewrite = "REDACTED"
    }
  }
}

resource "google_compute_backend_bucket" "static" {
  name        = "static-asset-backend-bucket"
  bucket_name = google_storage_bucket.static.name
  enable_cdn  = true
}

resource "google_storage_bucket" "static" {
  name     = "static-bucket${random_integer.random.result}"
  location = "US"
}

resource "random_integer" "random" {
  min = 10000
  max = 99999
}

Debug Output

Key line:

2020/12/08 14:05:12 [WARN] Provider "registry.terraform.io/hashicorp/google" produced an unexpected new value for google_compute_url_map.default during refresh.
      - .map_id: was cty.NumberIntVal(7.227701560655104e+18), but now cty.NumberIntVal(7.227701560655103598e+18)

This turns into an error at apply-time

Crash Output

Expected Behavior

TypeInt value is read from state without rounding

Actual Behavior

TypeInt value is read from state with rounding

Value in state:
"map_id": 7227701560655103598,

Value returned from d.Get("map_id"):
7227701560655104000

Steps to Reproduce

  1. terraform apply
  2. terraform plan

Shows the error in debug logs. Changing the description of the URL map causing an update and then running terraform apply causes Terraform to error with:

Error: Provider produced inconsistent final plan

When expanding the plan for google_compute_url_map.default to include new
values learned so far during apply, provider
"registry.terraform.io/hashicorp/google" produced an invalid new value for
.map_id: was cty.NumberIntVal(7.227701560655103598e+18), but now
cty.NumberIntVal(7.227701560655104e+18).

Additional Context

I have tracked this down to a change in state between two runs of terraform apply. The value is set correctly in state and is written to the state file correctly, but seems to round when the value is read by the next invocation of terraform plan/apply

This behavior was introduced in 0.14.0, as these resources did not have a problem in the past.

References

@apparentlymart
Copy link
Contributor

Hi @slevenick! Thanks for reporting this.

Internally within Terraform itself, numbers are big.Float with a precision that would be more than enough for these values, but it does seem like these would be out of the range of a float64 because it requires a significand of more than 52 bits.

For that reason, my initial guess here would be that something in the Terraform SDK is round-tripping these values through a float64. Most of the SDK logic uses int64 for TypeInt, so this is still pretty surprising, but one place I know that the SDK does end up dealing with the raw JSON data from the state is in the schema upgrade codepath, because the provider itself is responsible for interpreting the stored values against the old schema -- Terraform Core only has access to the current schema.

A little digging led me to a direct json.Unmarshal to interface{} in the upgrade codepath, so this would be my initial suspect for the cause: without any special settings, Go's JSON decoder will decode all numbers into float64 values.

I've not proven that yet, but if that is the cause then this is something we'll need to tackle in the SDK repository rather than in this repository. I think to be robust there the SDK logic would need to decode into a json.Number instead -- which is really just a string containing the raw decimal representation from the input -- and then conditionally parse that either to an int64 or a float64 depending on whether the schema specifies schema.TypeInt or schema.TypeFloat, so that the supported ranges will match what the SDK normally supports for those schema types.

@apparentlymart
Copy link
Contributor

A further thing I intended to mention in the previous comment: if this is indeed an SDK bug then the Terraform Core version shouldn't make any difference here. Instead, upgrading the provider would be the most likely trigger, because the provider might've introduced a new schema version for google_compute_url_map and thus the UpgradeResourceState operation would end up visiting the codepath I indicated in the previous comment.

@apparentlymart
Copy link
Contributor

Looking in the implementation of google_compute_url_map I see that it doesn't seem to have any upgrade rules defined, so I guess a good question to answer when we have time to dig into this more is: does the codepath I indicated run even if there isn't an upgrade to do? If not, that would be a good way to eliminate this theory from consideration.

@paddycarver
Copy link
Contributor

Wild guess here, but is it possible that 0.14 is running UpgradeResourceState more frequently than 0.13 did? I've heard some reports that I've not been able to substantiate that providers that do not implement UpgradeResourceState and don't have any state upgrades in place run into problems with plugin-go, and I suppose one (again, reiterating that I'm guessing) plausible explanation is that 0.14 calls UpgradeResourceState even when the state does not need upgrading sometimes, which I think would explain how this showed up in 0.14 but not 0.13.

terraform-plugin-go uses json.Number and big.Float everywhere, I believe. terraform-plugin-sdk is a mess of different number types and that has bitten us before, so it would not surprise me that we're using a float64 where it's not optimal to be. The engineering lift to fix it wholesale is, unfortunately, pretty heavy because of compatibility concerns, but we may be able to get away with it.

We can transfer this issue to the SDK repo, we can leave it here until we ascertain why it suddenly showed up in 0.14, or we can transfer to the SDK and open an issue here to get to the bottom of what changed in 0.14 to prompt it, I'm fine with any of these approaches.

@jbardin
Copy link
Member

jbardin commented Dec 9, 2020

UpgradeResourceState is always called as part of preparing the state value for use. This must be done because the legacy providers still use the call to fixup state without bumping the schema version. It makes sense that the upgrade path using the json.Unmarshal defaults would evaluate the number as a double.

@paultyng
Copy link
Contributor

paultyng commented Dec 9, 2020

Going to move this to the SDK, I believe I've replicated this in terraform-provider-random on 0.12, 0.13, and 0.14. Still could be something in core but will run down at least all the SDK avenues here first.

@paultyng paultyng transferred this issue from hashicorp/terraform Dec 9, 2020
@paddycarver paddycarver added the bug Something isn't working label Dec 9, 2020
@paultyng
Copy link
Contributor

paultyng commented Dec 9, 2020

@slevenick as a workaround for now, if you have any numbers that go over 64 bits (or approach 64 bits) I'd recommend strings in the meantime.

@cagataygurturk
Copy link

I don't know if it is the same issue but with 0.14 I am getting these errors for a GKE cluster:

2020/12/11 11:28:32 [WARN] Provider "registry.terraform.io/hashicorp/google-beta" produced an unexpected new value for module.gke_cluster_europe_west3.google_container_cluster.primary during refresh.
      - .master_version: was cty.StringVal("1.17.13-gke.1401"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_pool[0].node_count: was cty.NumberIntVal(0), but now cty.NumberIntVal(1)
      - .node_pool[0].version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_pool[1].version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")

and plan/apply fails. 13.5 works fine.

@slevenick
Copy link
Author

slevenick commented Dec 11, 2020

I don't know if it is the same issue but with 0.14 I am getting these errors for a GKE cluster:

2020/12/11 11:28:32 [WARN] Provider "registry.terraform.io/hashicorp/google-beta" produced an unexpected new value for module.gke_cluster_europe_west3.google_container_cluster.primary during refresh.
      - .master_version: was cty.StringVal("1.17.13-gke.1401"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_pool[0].node_count: was cty.NumberIntVal(0), but now cty.NumberIntVal(1)
      - .node_pool[0].version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")
      - .node_pool[1].version: was cty.StringVal("1.17.13-gke.1400"), but now cty.StringVal("1.17.13-gke.2001")

and plan/apply fails. 13.5 works fine.

That would be surprising, and a different but potentially related issue. Those values are strings and shouldn't have the rounding problem that we are seeing with large ints.

This looks like the GKE version has changed, but I'm not sure why that would be different between 0.14 and 0.13x. When you plan/apply on 0.13 does it show any changes?

Can you include debug logs for the failing plan? And potentially values in state/config for the master_version and other failing fields

@paultyng
Copy link
Contributor

paultyng commented Dec 11, 2020

@cagataygurturk if you are using ignore_changes, there was a regression in 0.14.0 that has been fixed in more recent 0.14 releases.

@Sytten
Copy link

Sytten commented Dec 14, 2020

@paultyng @slevenick What is the plan to fix this? Can this be reverted to 0.13 behaviour?

@paddycarver
Copy link
Contributor

I believe our current understanding is this is the 0.13 behavior, though I have no compelling answer as to why it suddenly became an issue.

I'm hoping to get a fix out in the next few days, if possible. Providers would then need to update their SDK versions and ship new releases.

@paultyng
Copy link
Contributor

I was able to verify this in 0.12.29 (the oldest I tested):

PS C:\Users\paul\Sandbox\bigints> terraform version
Terraform v0.12.29
+ provider.random v3.0.0

Your version of Terraform is out of date! The latest version
is 0.14.2. You can update by downloading from https://www.terraform.io/downloads.html
PS C:\Users\paul\Sandbox\bigints> cat main.tf
resource "random_integer" "bigint" {
    min = 7227701560655103597
    max = 7227701560655103598
}
PS C:\Users\paul\Sandbox\bigints> terraform apply

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # random_integer.bigint will be created
  + resource "random_integer" "bigint" {
      + id     = (known after apply)
      + max    = 7227701560655103598
      + min    = 7227701560655103597
      + result = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

random_integer.bigint: Creating...
random_integer.bigint: Creation complete after 0s [id=7227701560655103598]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
PS C:\Users\paul\Sandbox\bigints> terraform apply
random_integer.bigint: Refreshing state... [id=7227701560655103598]

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # random_integer.bigint must be replaced
-/+ resource "random_integer" "bigint" {
      ~ id     = "7227701560655103598" -> (known after apply)
      ~ max    = 7227701560655104000 -> 7227701560655103598 # forces replacement
      ~ min    = 7227701560655104000 -> 7227701560655103597 # forces replacement
      ~ result = 7227701560655104000 -> (known after apply)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

random_integer.bigint: Destroying... [id=7227701560655103598]
random_integer.bigint: Destruction complete after 0s
random_integer.bigint: Creating...
random_integer.bigint: Creation complete after 0s [id=7227701560655103598]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.
PS C:\Users\paul\Sandbox\bigints>

paddycarver added a commit to hashicorp/terraform-provider-corner that referenced this issue Dec 16, 2020
See hashicorp/terraform-plugin-sdk#655. Essentially, test that integers
that do not fit cleanly in float64s still retain their precision. These
tests currently fail.
@paddycarver
Copy link
Contributor

OK so I've done a lot of digging into this.

knowthingsnow

So fundamentally, the cause of the issue here was the liberal use of JSON encoding and decoding that the SDK does. Why we do this is a longer story and requires more context, but fundamentally JSON is used as sort of a babel-representation of different systems and so when converting between, e.g., 0.11 and 0.12's types, sometimes we just encode to JSON and decode back out.

Unfortunately, it is not always the case that our decoding uses the UseNumber() method, more frequently just reaching for the default json.Unmarshal... which decodes numbers into float64 when unmarshaling into an interface{}, which we do quite a bit of the time.

This is, usually, fine. However, for sufficiently large numbers (I believe the ceiling is 17 significant figures, but could be wrong about that) float64 can't actually represent them with precision, so it... silently rounds them. Whoops.

This is all basically what Martin said a week ago.

The fix here is straightforward technically, but messy when people get involved. We can just switch to using a JSON decoder with UseNumber() called and this specific problem goes away.

There are a couple of compatibility concerns, though. First, the use of ResourceData.Get and then casting the result to a specific type is an unavoidable pattern that every provider utilises. This means us changing the underlying type breaks compatibility, because any code using the old type will panic. Oops. Now, given that you can use ResourceData.Get and cast to an integer or a float depending on whether you're using schema.TypeInt or schema.TypeFloat gives me hope that we could potentially smooth this over a bit, again as Martin suggested a week ago (this comment is largely just "Martin was right a week ago, but now I have the code and tests to prove it"). Inordinately large TypeFloats would continue having a bad time, but large TypeInts would have less of a bad time, until they hit 64 bits and start having a real bad time again. But I still am researching and investigating the TypeFloat/TypeInt conversion and casting logic to see if we can even do this. If we can't, a boolean may need to be set on the schema.Resource, opting into the new behavior, which would be sad but would still offer a workaround.

Second, the helper/resource test harness also has a lot of exposure to JSON, and also doesn't always use UseNumber. It, similarly, needs to be upgraded, or things will work but the tests will insist there's a problem. (I lost several hours to investigating why my fix wasn't working, only to realise I hadn't fixed the test harness. Whoops.) This also needs to be done in such a way that users don't have to change their code, and I'm still investigating the impact of the needed changes (we will also need to change terraform-json and terraform-exec) to determine whether they're backwards compatible.

All of this to say: the problem is known (I still don't particularly understand why 0.14 exacerbated this; or if it didn't, as seems plausible given we have multiple reproductions of it in 0.12 and 0.13, why it hasn't been an issue before now) and we're making progress on solutions, fixing this just involves moving some pieces that our ecosystem relies directly on, and so just like replacing a load-bearing pillar, we're going to need to move carefully and that takes a little bit of time. But we are working on it.

@marekaf
Copy link

marekaf commented Dec 16, 2020

Hi,
is there any known workaround? We upgraded our state to terraform 0.14 and not we are stuck with this problem. Can't downgrade to 0.13 either.
Thanks!

paddycarver added a commit that referenced this issue Dec 16, 2020
Go's JSON decoder, by default, uses a lossy conversion of JSON integers
to float64s. For sufficiently large integers, this yields a loss of
precision, and that causes problems with diffs and plans not matching.
See #655 for more details.

This PR proposes a solution: keeping the lossless json.Number
representation of integers in state files. This will ensure that no
precision is lost, but it comes at a cost. json.Number is a string type,
not a float64 type. That means that existing state upgraders that
(correctly) cast a value to float64 will break, as the value will now be
surfaced to them as a json.Number, which cannot be cast to a float64.

To handle this, the schema.Resource type gains a `UseJSONNumber`
property. When set to new, users are opted into the new json.Number
values. When left false, the default, users get the existing float64
behavior.

If a resource with state upgraders wants to use the new json.Number
behavior, it must update all the state upgraders for that resource to
use json.Number instead of float64 or users with old state files will
panic during upgrades.

Note: the backwards compatibility properties of this commit are unknown
at this time, and there may be a way for us to avoid using
schema.Resource.UseJSONNumber, instead preserving the choice until we
get to the point where TypeInt casts the number to an int. Further
investigation needed.
@paultyng
Copy link
Contributor

paultyng commented Dec 17, 2020

@marekaf as mentioned above, this doesn't seem to be limited to 0.14 and seems to be affecting 0.13 and 0.12 since it is provider side. You could try to downgrade your provider (not the CLI), and see if older versions can handle this, they would need to be on the v1 SDK, as opposed to the v2 SDK I think, and I'm not sure where that switched in all providers.

15-17 integer digits is about the most you can safely carry in a float64 without any potential loss and this ID has crossed that threshold so it will occasionally have losses until the SDK fix is merged in to the provider or the typing is changed.

@marekaf
Copy link

marekaf commented Dec 17, 2020

@marekaf as mentioned above, this doesn't seem to be limited to 0.14 and seems to be affecting 0.13 and 0.12 since it is provider side. You could try to downgrade your provider (not the CLI), and see if older versions can handle this, they would need to be on the v1 SDK, as opposed to the v2 SDK I think, and I'm not sure where that switched in all providers.

15-17 integer digits is about the most you can safely carry in a float64 without any potential loss and this ID has crossed that threshold so it will occasionally have losses until the SDK fix is merged in to the provider or the typing is changed.

thanks!

I managed to work around it by just making sure the apply will not touch url_map (not possible via -targeting so I had to simply change manually in GCP web console everything to match the config in terraform hcl so that not drift would be detected). Everything else works for plan & apply.

paddycarver added a commit that referenced this issue Dec 18, 2020
Go's JSON decoder, by default, uses a lossy conversion of JSON integers
to float64s. For sufficiently large integers, this yields a loss of
precision, and that causes problems with diffs and plans not matching.
See #655 for more details.

This PR proposes a solution: keeping the lossless json.Number
representation of integers in state files. This will ensure that no
precision is lost, but it comes at a cost. json.Number is a string type,
not a float64 type. That means that existing state upgraders that
(correctly) cast a value to float64 will break, as the value will now be
surfaced to them as a json.Number, which cannot be cast to a float64.

To handle this, the schema.Resource type gains a `UseJSONNumber`
property. When set to new, users are opted into the new json.Number
values. When left false, the default, users get the existing float64
behavior.

If a resource with state upgraders wants to use the new json.Number
behavior, it must update all the state upgraders for that resource to
use json.Number instead of float64 or users with old state files will
panic during upgrades.

Note: the backwards compatibility properties of this commit are unknown
at this time, and there may be a way for us to avoid using
schema.Resource.UseJSONNumber, instead preserving the choice until we
get to the point where TypeInt casts the number to an int. Further
investigation needed.
@paddycarver
Copy link
Contributor

This should be resolved in 2.4.0 that is rolling out now. Resources impacted will need to set UseJSONNumber: true on the schema.Resource and any StateUpgraders on the resource will need to use json.Number instead of float64 for all numbers. Other than that, it should be a transparent upgrade process with no code changes needed.

@Abdelwaheb-Hnaien
Copy link

Abdelwaheb-Hnaien commented Dec 21, 2020

I'm having the same issue with google_compute_url_map resource since I upgraded to terraform v0.14.0
I created the url_map in my project using terraform v0.13.5
what I'm trying to do now is to make the url_map point to a new backend service (previously an instance group, now a bucket) and then I get this error:

Error: Provider produced inconsistent final plan


When expanding the plan for
module.google_compute_url_map_uat-mcm-url-map.google_compute_url_map.urlmap to
include new values learned so far during apply, provider
"registry.terraform.io/hashicorp/google" produced an invalid new value for
.map_id: was cty.NumberIntVal(6.438816093867301036e+18), but now
cty.NumberIntVal(6.438816093867301e+18).

This is a bug in the provider, which should be reported in the provider's own
issue tracker. 

I also tried with terraform v0.14.3, it's not working also.

this action (changing the backend) was working perfectly with terraform version 0.13.5

kmoe added a commit to hashicorp/terraform-provider-random that referenced this issue Jan 5, 2021
Prior to this change, integers in state that do not fit cleanly in float64s lose their precision, leading to permadiffs. As of SDK v2.4.0, setting UseJSONNumber on the resource schema changes number representation to json.Number throughout, preserving precision. Please see hashicorp/terraform-plugin-sdk#655 for more information.
@kmoe kmoe changed the title 0.14.0 rounding large int values from state bug: rounding large int values from state Jan 5, 2021
@kmoe
Copy link
Member

kmoe commented Jan 5, 2021

I've updated the title of this issue to prevent confusion when referring to it in the future, as this is not a 0.14 bug.

@ghost
Copy link

ghost commented Jan 18, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
10 participants