
Add Non Nullable DatasetConfig.ctl_dataset_id Field #2046

Merged

Conversation


@pattisdr pattisdr commented Dec 14, 2022

❗ Contains multiple migrations (schema and data)
πŸ‘‰ Note this PR is against a feature branch

Closes #1762
Closes #1764

Code Changes

  • Added a non-nullable datasetconfig.ctl_dataset_id column
  • Added a schema and data migration (see the sketch after this list). This is an attempt to also resolve "Add dataset migration CLI command" #1764
    • First, create a datasetconfig.ctl_dataset_id column that is nullable
    • Then, attempt to copy the contents of DatasetConfig.dataset into new ctl_dataset records, and link each ctl_dataset as the DatasetConfig.ctl_dataset_id. If there's a conflict with an existing ctl_datasets.fides_key, I error instead of attempting to upsert; the user should resolve it manually.
    • Adds a follow-up schema migration to then make the datasetconfig.ctl_dataset_id field non-nullable
  • Added a new API endpoint PATCH {{host}}/connection/{{connection_key}}/datasetconfig that takes in a fides_key (for the DatasetConfig) and an existing ctl_dataset_fides_key. It upserts a DatasetConfig, links it to the existing CtlDataset, then copies the CtlDataset contents back to DatasetConfig.dataset. Soon, the DatasetConfig.dataset field is going away.
  • Updated the existing PATCH dataset config JSON and YAML variants to temporarily keep working. A raw dataset passed in will attempt to upsert both the DatasetConfig and the CtlDataset object. I didn't want the UI to be broken on this feature branch.
    • Creating a saas connector from a template still works. The dataset in the template currently upserts both the DatasetConfig and the CtlDataset record.
  • In the locations where we retrieve a DatasetConfig.dataset (such as building a graph), return the ctl_dataset record instead of DatasetConfig.dataset.
  • Lots of test fixtures needed to be changed to create a ctl_dataset before creating a datasetconfig and then linking the two.
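For context, here is a minimal sketch of the two-phase migration pattern described above. This is not the PR's actual revision code: the table/column shapes are simplified to a few fields, and the real migration also validates each dataset with fideslang before saving.

# Sketch only: illustrates the nullable -> backfill -> non-nullable pattern.
# Column shapes and inserted fields are simplified for illustration.
import uuid

import sqlalchemy as sa
from alembic import op
from sqlalchemy.exc import IntegrityError


def upgrade() -> None:
    # Schema migration 1: add the column as nullable so existing rows pass.
    op.add_column(
        "datasetconfig",
        sa.Column("ctl_dataset_id", sa.String(), nullable=True),
    )

    # Data migration: copy each datasetconfig.dataset into its own
    # ctl_datasets row, then link it back via ctl_dataset_id.
    bind = op.get_bind()
    for row in bind.execute(sa.text("SELECT id, dataset FROM datasetconfig")).fetchall():
        new_id = str(uuid.uuid4())
        try:
            bind.execute(
                sa.text(
                    "INSERT INTO ctl_datasets (id, fides_key, name) "
                    "VALUES (:id, :fides_key, :name)"
                ),
                {
                    "id": new_id,
                    "fides_key": row.dataset["fides_key"],
                    "name": row.dataset.get("name"),
                },
            )
        except IntegrityError as exc:
            # Don't upsert over an existing ctl_dataset: surface the conflict
            # so the user can resolve fides_keys manually.
            raise Exception(f"Conflicting ctl_datasets.fides_key: {exc}")
        bind.execute(
            sa.text("UPDATE datasetconfig SET ctl_dataset_id = :ctl_id WHERE id = :id"),
            {"ctl_id": new_id, "id": row.id},
        )

    # Schema migration 2 (a separate follow-up revision in the PR):
    # now that every row is populated, enforce the constraint.
    op.alter_column(
        "datasetconfig", "ctl_dataset_id", existing_type=sa.String(), nullable=False
    )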

Steps to Confirm

Migration verification

  • Checkout main. Create a datasetconfig in your application database, then switch to this branch without dropping that database. If you switch and bring up the shell, migrations should run.
  • Verify that the datasetconfig table now has a ctl_dataset_id column, and that the column is non-nullable and populated.
  • Locate the corresponding ctl_datasets record. Compare each column against what was in datasetconfig.dataset. DatasetConfig.dataset itself should be untouched for now. Any fidesops_meta fields are likely converted to fides_meta fields.
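A quick way to spot-check the backfill from a shell - a sketch using plain SQLAlchemy; the connection URL is an assumption, so adjust it to your local dev database:

# Spot-check sketch: every datasetconfig should carry a non-null
# ctl_dataset_id that points at a real ctl_datasets row.
# The connection URL below is illustrative, not the project's config.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:fides@localhost:5432/fides")

with engine.connect() as conn:
    orphans = conn.execute(
        text("SELECT fides_key FROM datasetconfig WHERE ctl_dataset_id IS NULL")
    ).fetchall()
    assert not orphans, f"datasetconfigs missing ctl_dataset_id: {orphans}"

    pairs = conn.execute(
        text(
            "SELECT dc.fides_key, cd.fides_key "
            "FROM datasetconfig dc "
            "JOIN ctl_datasets cd ON cd.id = dc.ctl_dataset_id"
        )
    ).fetchall()
    for dataset_config_key, ctl_dataset_key in pairs:
        print(f"{dataset_config_key} -> {ctl_dataset_key}")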

Run a privacy request

  • Run nox -s test_env
  • Go to the privacy center, run a privacy request
  • Return to admin UI, approve
  • Verify that a json file lands in the fides_uploads folder and that its contents are roughly correct.

Existing endpoint parity (will soon be deprecated, but the UI still works here)

  • Run nox -s test_env
  • Go to http://localhost:3000, log in, then go to http://localhost:3000/datastore-connection.
  • Click the three horizontal dots on the Postgres connector > Configure. Jump to the Dataset configuration tab. Edit the description of the dataset. Verify the API request is successful (PATCH http://0.0.0.0:8080/api/v1/connection/postgres_connector/dataset)
  • Locate the updated datasetconfig.dataset field in the db and verify the description has changed. Similarly note that the ctl_datasets.description column has changed (select description from ctl_datasets where id = 'xxxxxx';)

Test creating saas connectors from a template

  • Go to http://localhost:3000/datastore-connection > Create new connection
  • Select mailchimp. Enter connector parameters - because we're not going to use this connector, your secrets can be fake
  • On the Dataset configuration tab, click Save Yaml System.
  • Verify successful API requests
  • Verify that new datasetconfig and ctl_datasets records exist and that the DatasetConfig has a ctl_dataset_id FK. The dataset should exist in both places.

New endpoint

  • In Postman or similar, create a new connection config resource: PATCH {{host}}/connection/
  • In Postman or similar, add a dataset config with the new endpoint: PATCH {{host}}/connection/{{existing connection key}}/datasetconfig. Create a new fides_key (this will be the identifier for the DatasetConfig), but select an existing ctl_dataset fides_key:
[{
    "fides_key": "new_dataset_config",
    "ctl_dataset_fides_key": "postgres_example_test_dataset"
}]
  • Verify a new DatasetConfig was created and the contents of the existing ctl_dataset were ported back into the new DatasetConfig.dataset.
  • Visit this new connector and dataset in the UI

Existing CTL datasets tab

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation Updated:
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

Description Of Changes

We are trying to move to storing the bulk of the contents of a Dataset solely on the ctl_datasets table. Right now, similar concepts exist in both the ctl_datasets table and the DatasetConfig.dataset column.

The idea with this increment is to add a non-nullable DatasetConfig.ctl_dataset_id field. DSRs can't run without an associated dataset, so I think we should keep this constraint from the beginning. I take the contents of existing DatasetConfig.datasets, attempt to create new ctl_dataset records, and then link them to the existing DatasetConfigs.

The next step is to keep writing to both places for now: DatasetConfig.dataset AND the linked ctl_dataset record get the same changes. This work also starts reading from the DatasetConfig.ctl_dataset record instead of DatasetConfig.dataset.

Follow-up work will deprecate some existing endpoints and stop writing to the DatasetConfig.dataset column.

- Add a data migration that takes existing datasetconfig.dataset and creates a new ctl_dataset record and links the new record back to the datasetconfig.
- Add a follow-up schema migration that makes the datasetconfig.ctl_dataset_id field non-nullable.
- …hat takes in a pair of a fides_key and ctl_dataset_fides_key. This request will create/update a DatasetConfig and link the ctl_dataset to it. As an incremental step, this endpoint copies the ctl_dataset and stores it on DatasetConfig.dataset.

- Update existing endpoint PATCH v1/connection/connection_key/dataset (which will be deprecated) to take the supplied dataset and upsert a ctl_dataset with it. This still allows a raw dataset to be supplied through this endpoint for the moment, so as not to break the UI.
- Both endpoints still try to update both DatasetConfig.dataset and the corresponding DatasetConfig.ctl_dataset resource. A follow-up will stop updating DatasetConfig.dataset.
- When fetching the dataset, get the contents of the ctl_dataset, not DatasetConfig.dataset, which is going away.
- Update the migration to validate the ctl_dataset created from a dataset before saving.
- Update a lot of DatasetConfig fixtures to have a ctl_dataset linked to them, storing the actual dataset contents.
@pattisdr pattisdr added the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Dec 14, 2022
@pattisdr pattisdr removed the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Dec 15, 2022
@pattisdr pattisdr self-assigned this Dec 15, 2022
@pattisdr pattisdr changed the title [DRAFT] Add Non Nullable DatasetConfig.ctl_dataset_id Field Add Non Nullable DatasetConfig.ctl_dataset_id Field Dec 15, 2022
@@ -1025,6 +1025,9 @@ dataset:
- name: connection_config_id
data_categories: [system.operations]
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
- name: ctl_dataset_id
data_categories: [ system.operations ]
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
@pattisdr (Contributor, Author) commented:

This yaml file has been adjusted to reflect the new datasetconfig.ctl_dataset_id field

Comment on lines +90 to +94
except IntegrityError as exc:
    raise Exception(
        f"Fides attempted to copy datasetconfig.datasets into their own ctl_datasets rows but got error: {exc}. "
        f"Adjust fides_keys in ctl_datasets table to not conflict."
    )
@pattisdr (Contributor, Author) commented Dec 15, 2022:

I attempt to create new ctl_dataset records as part of this data migration by default, so we can 1) make this field non-nullable from the start while 2) not merging into existing ctl_datasets and potentially getting it wrong.

I talked with Sean about this - he said the plan was to handle conflicts ad hoc with the customer. So if there's a conflict, my current plan is that they resolve it manually, which differs from the more detailed plan spelled out in #1764.

Comment on lines +205 to +213
@classmethod
def create_from_dataset_dict(cls, db: Session, dataset: dict) -> "Dataset":
    """Add a method to create directly using a synchronous session"""
    validated_dataset: FideslangDataset = FideslangDataset(**dataset)
    ctl_dataset = cls(**validated_dataset.dict())
    db.add(ctl_dataset)
    db.commit()
    db.refresh(ctl_dataset)
    return ctl_dataset
@pattisdr (Contributor, Author) commented:

I see we already have ctl-side endpoints/methods for creating ctl_datasets, but there's still a big division between ctl code, which largely uses asynchronous sessions, and ops code, which largely uses synchronous sessions. I don't want to take that on here, so I'm adding a small model method that uses a synchronous session; it gets used numerous times, largely in testing. A usage sketch follows.
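A sketch of how a test fixture might use the helper - the dataset contents, DatasetConfig fields, and import paths are illustrative, not the PR's actual fixtures:

# Illustrative fixture-style usage of the synchronous helper.
# Import paths are indicative and may differ by fides version;
# db and connection_config are assumed to come from existing fixtures.
from fides.api.ctl.sql_models import Dataset as CtlDataset
from fides.api.ops.models.datasetconfig import DatasetConfig

example_dataset = {
    "fides_key": "postgres_example_test_dataset",
    "name": "Postgres Example",
    "organization_fides_key": "default_organization",
    "collections": [],
}

# Create the ctl_dataset first with the new synchronous helper...
ctl_dataset = CtlDataset.create_from_dataset_dict(db, example_dataset)

# ...then create the DatasetConfig and link the two via the new FK.
# (The PR still writes DatasetConfig.dataset as well; omitted for brevity.)
dataset_config = DatasetConfig.create(
    db=db,
    data={
        "connection_config_id": connection_config.id,
        "fides_key": "postgres_example",
        "ctl_dataset_id": ctl_dataset.id,  # the new non-nullable link
    },
)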

A contributor replied:

This makes sense. IIRC the ctl endpoints are fairly generic and constructed differently

Comment on lines +191 to +194
ctl_dataset: CtlDataset = (
    db.query(CtlDataset)
    .filter_by(fides_key=dataset_pair.ctl_dataset_fides_key)
    .first()
)
@pattisdr (Contributor, Author) commented:

Originally I was using a ctl-side method to get this Dataset, but it wasn't playing well with ops tests. It worked fine when I ran the test file by itself but broke down when I ran the whole test suite, with sqlalchemy.dialects.postgresql.asyncpg.InterfaceError - cannot perform operation: another operation is in progress.

Comment on lines +169 to +181
def patch_dataset_configs(
    dataset_pairs: conlist(DatasetConfigCtlDataset, max_items=50),  # type: ignore
    db: Session = Depends(deps.get_db),
    connection_config: ConnectionConfig = Depends(_get_connection_config),
) -> BulkPutDataset:
    """
    Endpoint to create or update DatasetConfigs by passing in pairs of:
    1) A DatasetConfig fides_key
    2) The corresponding CtlDataset fides_key which stores the bulk of the actual dataset

    Currently this endpoint looks up the ctl dataset and writes its contents back to the DatasetConfig.dataset
    field for backwards compatibility, but soon DatasetConfig.dataset will go away.
    """

@pattisdr (Contributor, Author) commented:

New endpoint that the UI should switch to using in the ops "create a connector" workflow.

Andrew described a flow where two endpoints will be hit: the ctl dataset endpoint to create/update that dataset, and then this endpoint, passing the fides_key of that ctl_dataset. You could also select the ctl dataset from a dropdown. A sketch of the flow follows.
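A sketch of that two-call flow - the ctl-side dataset route, the payloads, and the auth header are assumptions for illustration, not confirmed API details:

# Illustrative two-call flow (endpoint paths, payloads, and auth are
# assumptions for this sketch, not confirmed API details).
import requests

HOST = "http://0.0.0.0:8080/api/v1"
headers = {"Authorization": "Bearer <access_token>"}

# 1) Create/update the ctl dataset itself via a ctl-side CRUD endpoint.
requests.post(
    f"{HOST}/dataset",
    headers=headers,
    json={
        "fides_key": "postgres_example_test_dataset",
        "name": "Postgres Example",
        "collections": [],
    },
)

# 2) Link it to a DatasetConfig on a connection with the new endpoint.
requests.patch(
    f"{HOST}/connection/postgres_connector/datasetconfig",
    headers=headers,
    json=[
        {
            "fides_key": "new_dataset_config",
            "ctl_dataset_fides_key": "postgres_example_test_dataset",
        }
    ],
)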

@@ -75,7 +148,7 @@ def get_graph(self) -> GraphDataset:
     the corresponding SaaS config is merged in as well
     """
     dataset_graph = convert_dataset_to_graph(
-        Dataset(**self.dataset), self.connection_config.key  # type: ignore
+        Dataset.from_orm(self.ctl_dataset), self.connection_config.key  # type: ignore
@pattisdr (Contributor, Author) commented:

When we build the graph to run a DSR (this is the starting point for that), I'm pulling from the ctl_dataset instead of DatasetConfig.dataset.

Comment on lines +78 to +80

ctl_dataset = CtlDataset.create_from_dataset_dict(db, bigquery_dataset)

@pattisdr (Contributor, Author) commented:

This PR is larger than it looks: three quarters of the edits just make sure that test DatasetConfig fixtures have a CtlDataset linked to them.

@@ -198,7 +259,13 @@ def patch_datasets(
"dataset": dataset.dict(),
@pattisdr (Contributor, Author) commented:

Validation on this endpoint makes sure data categories on the dataset exist in the database. Because we're accessing the database, it's done outside of a typical pydantic validator. If this endpoint goes away, we need a new place for this check. Does the existing ctl_datasets endpoint have this validation? A sketch of the check is below.
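Roughly what that out-of-band check does - a sketch; the function and the DataCategory import path are illustrative, not the PR's actual code:

# Sketch of database-backed data category validation (names illustrative).
from typing import List

from fides.api.ctl.sql_models import DataCategory  # assumed model path
from sqlalchemy.orm import Session


def validate_data_categories_exist(db: Session, data_categories: List[str]) -> None:
    """Raise if a dataset references a data category not defined in the db.

    This needs a Session, so it can't live in a plain pydantic validator.
    """
    defined = {key for (key,) in db.query(DataCategory.fides_key).all()}
    unknown = sorted(set(data_categories) - defined)
    if unknown:
        raise ValueError(f"Unknown data categories: {unknown}")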

A contributor replied:

CC: @ThomasLaPiana. You might be the right person to ask about the ctl side of this

@pattisdr (Contributor, Author) commented:

OK, so it looks like the existing crud endpoints don't do this. The tricky bit is that the crud endpoints are very generic: blocks of code that apply to updating an entire set of resources. Added a note to look into the best place to put this in the next ticket #1763

@@ -172,10 +233,10 @@ def patch_datasets(
Given a list of dataset elements, create or update corresponding Dataset objects
@pattisdr (Contributor, Author) commented:

I believe this endpoint {{host}}/connection/{{connection_key}}/dataset should be deprecated once the UI has been updated to use the new endpoint above.

Added some functionality here to keep it usable in the meantime: if a raw dataset is passed in, I write it to both the DatasetConfig.dataset field and the ctl_dataset record.

A contributor replied:

Should there be a follow-up ticket keeping track of all of the soon-to-be removed/deprecated routes?

@pattisdr (Contributor, Author) commented:

Good question Andrew - here's the follow-up ticket; we can wait to deprecate until the UI has been pointed at the new endpoints: #2092

Comment on lines +104 to +106
upsert_ctl_dataset(
    dataset.ctl_dataset
)  # Update existing ctl_dataset first.
@pattisdr (Contributor, Author) commented:

Here, I know the specific ctl_dataset_id because I got it off the existing DatasetConfig. However, if there is no dataset config yet, I look up the ctl dataset by fides_key.

So there's a little extra code here: sometimes I want to update the CtlDataset by id, other times by fides_key. A sketch of that branching follows.
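A sketch of that two-path lookup - the helper below is illustrative, not the PR's actual upsert_ctl_dataset implementation, and the import path is an assumption:

# Illustrative two-path upsert (not the PR's actual implementation).
from typing import Optional

from fides.api.ctl.sql_models import Dataset as CtlDataset  # assumed path
from sqlalchemy.orm import Session


def upsert_ctl_dataset_sketch(
    db: Session,
    ctl_dataset_data: dict,
    existing_dataset_config=None,  # an existing DatasetConfig, if any
) -> CtlDataset:
    if existing_dataset_config is not None:
        # A DatasetConfig already links to a row: update it by primary key.
        ctl_dataset = db.query(CtlDataset).get(
            existing_dataset_config.ctl_dataset_id
        )
    else:
        # No DatasetConfig yet: fall back to a fides_key lookup.
        ctl_dataset = (
            db.query(CtlDataset)
            .filter_by(fides_key=ctl_dataset_data["fides_key"])
            .first()
        )

    if ctl_dataset is None:
        ctl_dataset = CtlDataset(**ctl_dataset_data)
        db.add(ctl_dataset)
    else:
        for field, value in ctl_dataset_data.items():
            setattr(ctl_dataset, field, value)

    db.commit()
    db.refresh(ctl_dataset)
    return ctl_dataset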

…dataset. Add unit tests for temporary method.
@pattisdr (Contributor, Author) commented:

Test failures are just the timescale-related ones we're seeing on other branches ^

Comment on lines +283 to +284
dataset_config.delete(db)
ctl_dataset.delete(db)
A contributor commented:

Are these required? IIRC we have a fixture that runs after every test that clears out all of the tables.

fides/tests/ops/conftest.py

Lines 107 to 132 in e77d6dc

@pytest.fixture(autouse=True)
def clear_db_tables(db):
    """Clear data from tables between tests.

    If relationships are not set to cascade on delete they will fail with an
    IntegrityError if there are relationships present. This function stores tables
    that fail with this error then recursively deletes until no more IntegrityErrors
    are present.
    """
    yield

    def delete_data(tables):
        redo = []
        for table in tables:
            try:
                db.execute(table.delete())
            except IntegrityError:
                redo.append(table)
            finally:
                db.commit()
        if redo:
            delete_data(redo)

    db.commit()  # make sure all transactions are closed before starting deletes
    delete_data(Base.metadata.sorted_tables)

@pattisdr (Contributor, Author) commented:

Not technically, but I think this fixture that clears all the tables is too aggressive, because some resources are expected to be in the database.

I filed an issue to investigate this further (#2016); in the meantime I'm trying to make my new tests more self-sufficient.

@pattisdr pattisdr force-pushed the fides_1762_datasetconfig_ctl_dataset_id branch from 57aaae8 to 398d04e on December 20, 2022 15:49
@pattisdr (Contributor, Author) commented:

Failing tests are still the timescale-related ones that have been fixed on main. This will be resolved after this is merged and I get unified-fides-resources up to date with main.

@TheAndrewJackson (Contributor) left a review comment:

:shipit:

@pattisdr pattisdr merged commit 7337c50 into unified-fides-resources Dec 20, 2022
@pattisdr pattisdr deleted the fides_1762_datasetconfig_ctl_dataset_id branch December 20, 2022 20:16
@ThomasLaPiana (Contributor) commented Dec 20, 2022:

I'm taking a look through this now, will take me a bit

edit: nevermind lol
