Data sharing #17

Closed · wants to merge 5 commits
Conversation

sworisbreathing (Contributor)
Adds a redshift_datashare resource to manage data sharing between Redshift clusters. This should be defined on the producer cluster. For managing the schemas and objects, we use a nested attribute block:

resource "redshift_datashare" "share" {
  name = "my_datashare"
  owner = "fred"

  schema {
    name = redshift_schema.schema.name
    mode = "auto"
  }

  schema {
    name = redshift_schema.other_schema.name
    mode = "manual"
    tables = ["foo"]
    functions = ["bar"]
  }
}

Note that we're rolling up the management of schemas (and their objects) into two modes, auto and manual. This avoids a lot of weird edge cases that would otherwise come up if we'd tried to expose individual settings for ALL TABLES, ALL FUNCTIONS, and INCLUDENEW in the ALTER DATASHARE command:

  • in auto mode, we add ALL TABLES IN SCHEMA and ALL FUNCTIONS IN SCHEMA and SET INCLUDENEW=true FOR SCHEMA, so that newly-created tables/functions are automatically exposed to the datashare by the redshift cluster itself, without needing to re-run terraform.
  • in manual mode, we only expose the specific tables/functions that are configured in the schema block.
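Concretely, the two modes roughly translate into DDL generation like the following. This is a minimal, hypothetical sketch (the `grantStatements` function is mine, not the provider's actual code); the statement syntax follows the behavior described above and the Redshift ALTER DATASHARE documentation:

```go
package main

import "fmt"

// grantStatements sketches the DDL issued for one schema block.
// In auto mode the cluster itself picks up new objects via INCLUDENEW;
// in manual mode only the listed tables/functions are added.
func grantStatements(share, schema, mode string, tables, functions []string) []string {
	stmts := []string{fmt.Sprintf("ALTER DATASHARE %s ADD SCHEMA %s", share, schema)}
	switch mode {
	case "auto":
		stmts = append(stmts,
			fmt.Sprintf("ALTER DATASHARE %s ADD ALL TABLES IN SCHEMA %s", share, schema),
			fmt.Sprintf("ALTER DATASHARE %s ADD ALL FUNCTIONS IN SCHEMA %s", share, schema),
			fmt.Sprintf("ALTER DATASHARE %s SET INCLUDENEW = TRUE FOR SCHEMA %s", share, schema),
		)
	case "manual":
		for _, t := range tables {
			stmts = append(stmts, fmt.Sprintf("ALTER DATASHARE %s ADD TABLE %s.%s", share, schema, t))
		}
		for _, f := range functions {
			stmts = append(stmts, fmt.Sprintf("ALTER DATASHARE %s ADD FUNCTION %s.%s", share, schema, f))
		}
	}
	return stmts
}

func main() {
	for _, s := range grantStatements("my_datashare", "other_schema", "manual", []string{"foo"}, []string{"bar"}) {
		fmt.Println(s)
	}
}
```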

This PR turned out to be big enough that I decided it was best not to include the corresponding data source. I intend for the data source to follow the same structure as this PR, and to allow it to be defined on either the producer or the consumer.

The update code turned out to be far more involved than I'd originally hoped, due to limitations in terraform-plugin-sdk. What I really wanted was to keep the nested schema blocks but treat them in the backend as a map rather than a set/list (in other words, treat name as the unique identifier for each nested schema block). Unfortunately, schema.TypeMap can only store primitive types. Hashing the set elements on name alone meant terraform wasn't picking up changes to the tables/functions inside the schema block, and I tried this in combination with CustomizeDiff but never could get it to properly detect the changes.

I ended up taking inspiration from the aws_security_group resource in terraform-provider-aws, which has to work around the same issue of doing incremental updates to nested attributes in a schema.TypeSet. The terraform plan output makes it appear that we're completely dropping the schema from the datashare and then re-adding it (I'd hoped that hashing on the name would solve that problem, but then terraform simply didn't detect changes to the schema configuration at all). During the update, however, there's extra logic to figure out which tables/functions actually need adding or removing. It's uglier than I'd like, but it works where none of the other approaches did.
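The core of that extra update logic amounts to a set difference between what the datashare currently contains and what the configuration asks for. A minimal, hypothetical sketch (the `diff` function and its shape are mine, not the PR's actual code):

```go
package main

import "fmt"

// diff returns which elements must be added to and removed from the
// datashare so that its contents (remote) match the configuration (desired).
func diff(remote, desired []string) (toAdd, toRemove []string) {
	remoteSet := make(map[string]bool, len(remote))
	for _, t := range remote {
		remoteSet[t] = true
	}
	desiredSet := make(map[string]bool, len(desired))
	for _, t := range desired {
		desiredSet[t] = true
		if !remoteSet[t] {
			toAdd = append(toAdd, t) // in config, missing from the share
		}
	}
	for _, t := range remote {
		if !desiredSet[t] {
			toRemove = append(toRemove, t) // in the share, dropped from config
		}
	}
	return toAdd, toRemove
}

func main() {
	add, remove := diff([]string{"foo", "bar"}, []string{"bar", "baz"})
	fmt.Println(add, remove) // [baz] [foo]
}
```

Each entry in toAdd/toRemove then becomes an ALTER DATASHARE ... ADD/REMOVE statement during the update.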

sworisbreathing (Contributor, Author) commented Aug 12, 2021

I'm putting this in draft for now because there's still a bit of work to finish off the update logic, and probably some cleanup as well. But I wanted to go ahead and open the PR to show what it will look like.

Before putting this into ready for review I'll probably squash the commits.

sworisbreathing (Contributor, Author) commented Aug 13, 2021

One other thing to note: functions in manual mode don't work, and instead fail with syntax error at end of input when trying to run ALTER DATASHARE... ADD FUNCTION.... This appears to be either a bug in Redshift or incorrect documentation. I've got a support case open with AWS; they've reproduced the issue and escalated it to the engineering team for further investigation.

UPDATE: I've identified the issue, but it's kind of a deal-breaker for being able to manage functions in a data share. The issue is that when you manage UDFs in Redshift, you have to include the parameter list, like so:

ALTER DATASHARE ADD FUNCTION foo(varchar);

(NB: this is also the syntax you need in order to DROP FUNCTION...)

However you can't read back the parameter list - SVV_DATASHARE_OBJECTS only includes the function name but not the signature. This is problematic for 3 reasons:

  1. If we're updating a datashare schema from auto mode to manual mode, then the functions list is computed from SVV_DATASHARE_OBJECTS. If we need to drop some functions from the datashare, we don't have enough information to generate a workable DDL statement.
  2. Redshift allows function overloading - it's perfectly acceptable to have both foo(varchar) and foo(varchar, varchar) - but since SVV_DATASHARE_OBJECTS doesn't include the parameter list, we can't know which of our foo functions is exposed to the datashare and which isn't. It also means that if we specify the parameter list as functions = ["foo(varchar)"], terraform will always report pending changes, because the value read back has lost the full signature.
  3. INCLUDENEW applies to both functions and tables in the schema, and there is no tables-only variant like ALTER DATASHARE ... SET INCLUDENEW=true FOR TABLES IN SCHEMA.... This means that even if we removed the functions attribute, new functions would still silently slip into the share undetected in auto mode.
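To illustrate point 2: matching the bare name that SVV_DATASHARE_OBJECTS reports against the cluster's defined signatures can return more than one candidate, so there's no way to generate a unique REMOVE FUNCTION statement. A hypothetical sketch (the helper and the data are illustrative only, not read from a real cluster):

```go
package main

import (
	"fmt"
	"strings"
)

// candidates returns every defined signature whose bare name matches
// the name reported by SVV_DATASHARE_OBJECTS.
func candidates(defined []string, shared string) []string {
	var out []string
	for _, sig := range defined {
		if strings.HasPrefix(sig, shared+"(") {
			out = append(out, sig)
		}
	}
	return out
}

func main() {
	defined := []string{"foo(varchar)", "foo(varchar, varchar)", "bar(int)"}
	// SVV_DATASHARE_OBJECTS reports only "foo" - both overloads match,
	// so we can't tell which one is actually in the datashare.
	fmt.Println(candidates(defined, "foo")) // [foo(varchar) foo(varchar, varchar)]
}
```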

I can think of 3 ways of handling this:

  1. Rewrite the redshift_datashare resource so that you only pass in a list of schema names, and all schemas operate in auto mode. This is the simplest solution, but the least flexible.
  2. Drop support for the functions attribute. This still allows you to manage tables manually, but means that you can only share functions in auto mode, and terraform won't actually tell you what functions are exposed.
  3. Leave everything as-is for now, but add a bunch of warnings in the documentation recommending against using functions in manual mode, and advise that it'll be fixed up in a future release of the provider once AWS starts providing the missing signature information in SVV_DATASHARE_OBJECTS. This is probably what I'll do in the PR for now, but it's the one I'm least happy with.

@winglot what is your preferred approach? I think option 1 makes the most sense from a user standpoint, but I don't want to rip out all of the code I've written yet if you think there's value in being able to manage specific tables within a datashare/schema.

sworisbreathing (Contributor, Author)
Further to my comment above, I think what I'll probably do is create an alternative branch/PR for option 1 and close this one for the time being. That will make things work in a reliable, predictable way, and leave the door open to more fine-grained controls if AWS ever rolls out a fix for the above issue, without having to rewrite all the code from scratch.
