RawState distorted by schema-aware transformations #1667

t0yv0 · 2024-02-01T16:56:15Z

What happened?

Some migrations and replace plans are incorrect (example: pulumi/pulumi-gcp#1488) because Pulumi
bridged providers differ from canonical TF behavior in what they pass as RawState to the
provider's PlanResourceChange method.

Specifically, bridged providers store a Pulumi-oriented representation in the statefile. When
converting stored state back to the TF representation, bridged providers call the MakeTerraformState
method, which consults the provider's schema so that no information is lost. The schema is a necessary
part of the process because it informs decisions such as MaxItems=1 flattening, where the Pulumi and
TF representations do not naturally agree. The problem, however, is that when the state was written
with the V1 version of the provider and is being processed by the V2 version, it is the V2 schema
that informs the process. When the V1 and V2 schemas disagree, this may lead to incorrect results.

Migrations

Normally, upstream providers should use the state migration framework when making changes to the
schema:

https://github.com/hashicorp/terraform-plugin-sdk/blob/main/website/docs/plugin/sdkv2/resources/state-migration.mdx

This framework versions the schema with integers (1, 2, 3, and so on) and runs custom code to
migrate state forward from one version to the next. Pulumi supports this by storing the resource's
schema version under __meta. However, this area is not sufficiently tested, and Pulumi might be
deserializing the state under the wrong schema, using the current provider schema instead of the
schema the resource was written with. This is worth double-checking.
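The forward-migration idea can be sketched as a chain of per-version upgraders. This is a self-contained model loosely patterned on SDKv2's `SchemaVersion`/`StateUpgraders`, whose upgrade functions take and return the raw state as a `map[string]interface{}`; the `upgrader` type, `upgradeState`, and the field rename below are hypothetical. Each upgrader receives state shaped by the *old* schema, which is why reshaping it with the current schema beforehand defeats the mechanism.

```go
package main

import "fmt"

// upgrader is a hypothetical simplification of an SDKv2 StateUpgrader: it
// migrates raw state written at Version to Version+1.
type upgrader struct {
	Version int
	Upgrade func(raw map[string]interface{}) (map[string]interface{}, error)
}

// upgradeState applies upgraders in order, starting from the version the
// state was written with, until it reaches the current schema version.
func upgradeState(raw map[string]interface{}, from, current int, ups []upgrader) (map[string]interface{}, error) {
	for v := from; v < current; v++ {
		var up *upgrader
		for i := range ups {
			if ups[i].Version == v {
				up = &ups[i]
				break
			}
		}
		if up == nil {
			return nil, fmt.Errorf("no upgrader from schema version %d", v)
		}
		next, err := up.Upgrade(raw)
		if err != nil {
			return nil, err
		}
		raw = next
	}
	return raw, nil
}

func main() {
	ups := []upgrader{{
		Version: 0,
		// Illustrative migration from version 0 to 1: rename a field.
		Upgrade: func(raw map[string]interface{}) (map[string]interface{}, error) {
			out := map[string]interface{}{}
			for k, v := range raw {
				if k == "old_field" {
					k = "new_field"
				}
				out[k] = v
			}
			return out, nil
		},
	}}
	got, err := upgradeState(map[string]interface{}{"old_field": "x"}, 0, 1, ups)
	fmt.Println(got, err) // map[new_field:x] <nil>
}
```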

DiffCustomizers

Upstream providers do not always use migrations. In pulumi/pulumi-gcp#1488, upstream does not use the
state migration facility explicitly; instead it uses a customize diff function that assumes
previously-shaped data is passed in:

https://github.com/hashicorp/terraform-provider-google-beta/blob/main/google-beta/services/secretmanager/resource_secret_manager_secret.go#L36

Excerpt:

```go
func secretManagerSecretAutoCustomizeDiff(_ context.Context, diff *schema.ResourceDiff, meta interface{}) error {
	oAutomatic, nAutomatic := diff.GetChange("replication.0.automatic")
	_, nAuto := diff.GetChange("replication.0.auto")
	autoLen := len(nAuto.([]interface{}))

	// Do not ForceNew if we are removing "automatic" while adding "auto"
	if oAutomatic == true && nAutomatic == false && autoLen > 0 {
		return nil
	}

	if diff.HasChange("replication.0.automatic") {
		if err := diff.ForceNew("replication.0.automatic"); err != nil {
			return err
		}
	}

	if diff.HasChange("replication.0.auto") {
		if err := diff.ForceNew("replication.0.auto"); err != nil {
			return err
		}
	}

	return nil
}
```

This presents a problem because the original schema is not available, even through the migration
framework. Current Pulumi behavior is to drop the "automatic" data during translation, since the V2
schema does not account for it. This prevents the diff customize function above from doing its job
and causes unexpected replacement plans.
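Why dropping "automatic" turns into a replacement can be shown with a toy Go model of the excerpt's branches, where `diff.GetChange`/`diff.HasChange` are replaced by plain old/new values. The `change` type and `forcesReplacement` function are hypothetical, and `HasChange` on the `auto` block is simplified to a length comparison; this is not SDKv2 code.

```go
package main

import "fmt"

// change is a hypothetical stand-in for the old/new values that
// diff.GetChange would return for the two replication fields.
type change struct {
	OldAutomatic, NewAutomatic bool
	OldAutoLen, NewAutoLen     int
}

// forcesReplacement mirrors the branches of secretManagerSecretAutoCustomizeDiff.
func forcesReplacement(c change) bool {
	// Do not ForceNew if we are removing "automatic" while adding "auto".
	if c.OldAutomatic && !c.NewAutomatic && c.NewAutoLen > 0 {
		return false
	}
	if c.OldAutomatic != c.NewAutomatic {
		return true // HasChange("replication.0.automatic")
	}
	// HasChange("replication.0.auto"), simplified to a length check.
	return c.OldAutoLen != c.NewAutoLen
}

func main() {
	// Prior state preserved as V1 wrote it: automatic=true.
	fmt.Println(forcesReplacement(change{OldAutomatic: true, NewAutomatic: false, OldAutoLen: 0, NewAutoLen: 1})) // false
	// Prior state after translation dropped "automatic": it reads as false.
	fmt.Println(forcesReplacement(change{OldAutomatic: false, NewAutomatic: false, OldAutoLen: 0, NewAutoLen: 1})) // true
}
```

With the old value intact, the early return fires; once it is lost, the only visible change is the new `auto` block, which forces a replacement.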

Solutions

Opaque Raw State

Pulumi TF providers could write the TF state as-is, as an opaque blob under __meta using the
canonical TF JSON representation, then deserialize it accordingly and populate RawState from it.
This method fully removes the behavior discrepancy between Pulumi and TF.

It is easy to implement and roll out, but unfortunately adding an opaque blob costs extra space in
the test file (2x?), and may need masking to account for secret material in the state.
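A minimal Go sketch of the opaque-blob idea, assuming a hypothetical `__meta` key named `rawTFState`: the raw TF state is stored as JSON next to the Pulumi representation and, when present, is decoded directly into RawState instead of being reconstructed through the current schema. Real TF state is a `cty.Value`; plain JSON maps stand in for it here, and secret masking is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawStateKey is a hypothetical __meta key used only for this sketch.
const rawStateKey = "rawTFState"

// writeMeta stores the TF state verbatim as a JSON string under __meta.
// Note: any secret values would be copied into the blob unmasked.
func writeMeta(meta map[string]interface{}, tfState map[string]interface{}) error {
	b, err := json.Marshal(tfState)
	if err != nil {
		return err
	}
	meta[rawStateKey] = string(b)
	return nil
}

// readRawState prefers the opaque blob; if it is absent (state written by an
// older provider) it falls back to schema-driven reconstruction.
func readRawState(meta map[string]interface{}, fallback func() map[string]interface{}) (map[string]interface{}, error) {
	s, ok := meta[rawStateKey].(string)
	if !ok {
		return fallback(), nil
	}
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(s), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	meta := map[string]interface{}{}
	tf := map[string]interface{}{
		"replication": []interface{}{map[string]interface{}{"automatic": true}},
	}
	if err := writeMeta(meta, tf); err != nil {
		panic(err)
	}
	got, err := readRawState(meta, func() map[string]interface{} { return nil })
	fmt.Println(got, err) // map[replication:[map[automatic:true]]] <nil>
}
```

The fallback path keeps existing statefiles readable, which is what makes this option easy to roll out; the cost is that every resource's state is stored twice.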

New State Representation

Perhaps a new representation could be designed to replace what is currently stored in the
statefiles, so that space is not wasted. It could serve both to recover the raw TF state and to
track Pulumi metadata and secrets correctly. In this case the previous format for storing state
could be retired.

When rolling this out, care needs to be taken around backwards compatibility.

Incremental Representation

Perhaps something can be designed that encodes just enough extra information in the __meta field so
that the bridged provider can recover the TF RawState from the Pulumi representation without
resorting to the schema. This optimizes for minimal space and for avoiding migration problems (the
change is purely additive), at the cost of some code complexity that must be written carefully and
co-evolve with any future impedance-mismatch features in the bridge.

Deserialize using the right schema version

Pulumi could consult the resource's schema version from the statefile, pull the matching schema from the TF migration machinery, and deserialize against that. This is an incomplete solution, though, because:

- some resources do not use the state migration machinery;
- information about Pulumi flags, such as MaxItems=1 overrides, from the provider version that wrote the resource is still not available.

Example

See above.

Output of `pulumi about`

N/A

Additional context

N/A

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).


VenelinMartinov · 2024-02-01T18:15:00Z

may need masking to account for secret material in the state.

Would the same state which is secret in pulumi also be marked secret on tf side? I seem to remember some discussion around TF having weaker support for secrets - is this correct? Would that all be resolvable?

costs extra space in the test file

Do you mean the state file? How much of a concern is storage space here, given that this approach seems really simple to implement?

Perhaps something can be designed that encodes just enough extra information in the __meta field

I'm not sure I understand that - isn't the issue that the schema for some property changed, so we might no longer have access to the old schema - would we need to store this special information for all properties?

t0yv0 · 2024-02-01T19:00:15Z

TF representation does not have first class secrets, but Pulumi does. This is one of the impedance mismatch issues.

I'm not sure I understand that - isn't the issue that the schema for some property changed, so we might no longer have access to the old schema - would we need to store this special information for all properties?

This is another way of saying that we could develop a more sophisticated algorithm that stores the TF representation (a cty.Value) in the Pulumi statefile and recovers it as-is, satisfying TF's expectations around GetRawState(). Done incrementally, this would mean storing something like the paths at which flattening occurred, or similar delta information, to optimize for space. Serializing the schema could be one way, but that is likely too verbose.

iwahbe · 2024-02-02T10:12:55Z

Thanks for digging into this! @t0yv0 Do you have any sense of how many bugs correspond to this problem?

Incremental Representation seems like the most viable option, simply by eliminating other options. I don't think we can 2x our state. A New State Representation would require deep collaboration between all of Pulumi's engineering teams, and I think we would need a more compelling argument than improving TF state recovery.

t0yv0 · 2024-02-02T16:20:59Z

We have no easy way of telling. We can start cross-correlating bugs here. I suspect this condition is rare and specific to changing upstream resources schemas and upgrades, however when it does hit it can be very impactful like the GCP P1 issue.

Yes, IR can be interesting, but it introduces complexity and some brittleness that will need to be maintained going forward. Perhaps we can at least explore it to see exactly what it may look like.

iwahbe · 2024-04-25T22:48:06Z

A design document has been written, but we have decided to delay this item of work.

VenelinMartinov · 2024-08-29T13:24:50Z

Hit this again in pulumi/pulumi-aws#4410 (comment)

VenelinMartinov · 2024-08-29T13:26:01Z

One idea here for fixing this is potentially using the cty Type in the StateUpgrader for recovering the old version state in the cases where there is a migration.

iwahbe · 2024-09-05T09:00:18Z

I've explained the problem a couple of times recently, so I thought I'd write a quick diagram illustrating the issue.

Write TF State to Pulumi State

```mermaid
stateDiagram
TF_Provider --> TFState
TFState --> MakePulumiState
Provider@v1 --> (TFSchema,ProviderInfo)@v1
(TFSchema,ProviderInfo)@v1 --> MakePulumiState
MakePulumiState --> PulumiState
PulumiState --> gRPC(state_file)
```

Read TF State from Pulumi State

```mermaid
stateDiagram
gRPC(state_file) --> OldPulumiState
OldPulumiState --> MakeTerraformState
Provider@v2 --> (TFSchema,ProviderInfo)@v2
(TFSchema,ProviderInfo)@v2 --> MakeTerraformState: Takes old state and new info
MakeTerraformState --> TF_Provider
```

The Problem

When we read and write TF state, we use the "current" version of the provider. If we write with v1 and read with v2, then any information we take from v2 (its TF schema and ProviderInfo) may not be correct for state written by v1.

t0yv0 added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Feb 1, 2024

t0yv0 mentioned this issue Feb 1, 2024

secretmanager.Secret requires replace when updating from v6.62.0 to v7.4.0 pulumi/pulumi-gcp#1488

Closed

iwahbe removed the needs-triage Needs attention from the triage team label Feb 2, 2024

This was referenced Mar 6, 2024

Switch upgradeResourceState to use SDKv2's gRPC method #1735

Closed

[panic] State migration of BranchProtection from v5 -> v6 fails pulumi/pulumi-github#586

Closed

[PATCH]: BranchProtection State Migration pulumi/pulumi-github#595

Open

t0yv0 added this to the 0.102 milestone Mar 19, 2024

mjeffryes assigned iwahbe Apr 1, 2024

iwahbe mentioned this issue Apr 15, 2024

Support safely migrating from MaxItems=1 to no MaxItems #5

Open

VenelinMartinov mentioned this issue May 13, 2024

PlanResourceChange State Upgrade panics #1966

Closed

t0yv0 mentioned this issue May 20, 2024

Provider2 upgrade state rewrite #1998

Merged

iwahbe mentioned this issue May 21, 2024

Provider2 instance state fallback #2002

Merged

VenelinMartinov mentioned this issue Jun 4, 2024

Handle diags from gRPC state upgrader #2053

Merged

t0yv0 mentioned this issue Jun 4, 2024

TestJobQueueUpgrade panics under PlanResourceChange pulumi/pulumi-aws#4015

Closed

t0yv0 mentioned this issue Jun 12, 2024

Fix bridge not running state upgrades from 0 -> 1 #2081

Merged

iwahbe removed their assignment Jun 28, 2024

VenelinMartinov mentioned this issue Aug 29, 2024

PRC failures in AWS pulumi/pulumi-aws#4410

Closed


t0yv0 mentioned this issue Aug 29, 2024

Fix eks cluster PRC replace pulumi/pulumi-aws#4415

Merged

mjeffryes modified the milestones: 0.102, 0.110 Sep 12, 2024

mjeffryes removed this from the 0.110 milestone Oct 2, 2024

mjeffryes added this to the 0.111 milestone Oct 2, 2024

VenelinMartinov mentioned this issue Oct 3, 2024

Add state transform for compute.ForwardingRule pulumi/pulumi-gcp#2497

Merged

t0yv0 mentioned this issue Oct 16, 2024

Enable zero default schema version in AWS pulumi/pulumi-aws#4646

Closed

VenelinMartinov mentioned this issue Oct 28, 2024

Bridge State Upgrades Re-enablement Rollout #2133

Open


mjeffryes modified the milestones: 0.111, 0.112 Oct 30, 2024

mjeffryes removed this from the 0.112 milestone Nov 13, 2024
