
Add Non Nullable DatasetConfig.ctl_dataset_id Field #2046

Merged

Conversation


@pattisdr pattisdr commented Dec 14, 2022

❗ Contains multiple migrations (schema and data)
πŸ‘‰ Note this PR is against a feature branch

Closes #1762
Closes #1764

Code Changes

  • Added a non-nullable datasetconfig.ctl_dataset_id column
  • Added a schema and data migration (see the sketch after this list). This is an attempt to also resolve "Add dataset migration CLI command" #1764
    • First, create a datasetconfig.ctl_dataset_id column that is nullable
    • Then, attempt to copy the contents of DatasetConfig.dataset into new ctl_dataset records, and link each ctl_dataset as the DatasetConfig.ctl_dataset_id. If there's a conflict with an existing ctl_datasets.fides_key, I error instead of attempting to upsert; the user should resolve it manually.
    • Adds a follow-up schema migration to then make the datasetconfig.ctl_dataset_id field non-nullable
  • Added a new API endpoint PATCH {{host}}/connection/{{connection_key}}/datasetconfig that takes in a fides_key (for the DatasetConfig) and an existing ctl_dataset_fides_key. It upserts a DatasetConfig, links it to the existing CtlDataset, then copies the CtlDataset contents back to DatasetConfig.dataset. Soon, the DatasetConfig.dataset field is going away.
  • Updated the existing PATCH dataset config JSON and YAML variants to temporarily keep working. A raw dataset passed in will attempt to upsert both the DatasetConfig and the CtlDataset object. I didn't want the UI to be broken on this feature branch.
    • Creating a saas connector from a template still works. The dataset in the template currently upserts both the DatasetConfig and the CtlDataset record.
  • In the locations where we retrieve a DatasetConfig.dataset (such as building a graph), return the ctl_dataset record instead of DatasetConfig.dataset.
  • Lots of test fixtures needed to be changed to create a ctl_dataset before creating a datasetconfig and then linking the two.
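For context, here is a minimal sketch of the two-phase migration pattern described above. This is not the PR's actual revision code: the table/column shapes are simplified to a few fields, and the real migration also validates each dataset with fideslang before saving.

# Sketch only: illustrates the nullable -> backfill -> non-nullable pattern.
# Column shapes and inserted fields are simplified for illustration.
import uuid

import sqlalchemy as sa
from alembic import op
from sqlalchemy.exc import IntegrityError


def upgrade() -> None:
    # Schema migration 1: add the column as nullable so existing rows pass.
    op.add_column(
        "datasetconfig",
        sa.Column("ctl_dataset_id", sa.String(), nullable=True),
    )

    # Data migration: copy each datasetconfig.dataset into its own
    # ctl_datasets row, then link it back via ctl_dataset_id.
    bind = op.get_bind()
    for row in bind.execute(sa.text("SELECT id, dataset FROM datasetconfig")).fetchall():
        new_id = str(uuid.uuid4())
        try:
            bind.execute(
                sa.text(
                    "INSERT INTO ctl_datasets (id, fides_key, name) "
                    "VALUES (:id, :fides_key, :name)"
                ),
                {
                    "id": new_id,
                    "fides_key": row.dataset["fides_key"],
                    "name": row.dataset.get("name"),
                },
            )
        except IntegrityError as exc:
            # Don't upsert over an existing ctl_dataset: surface the conflict
            # so the user can resolve fides_keys manually.
            raise Exception(f"Conflicting ctl_datasets.fides_key: {exc}")
        bind.execute(
            sa.text("UPDATE datasetconfig SET ctl_dataset_id = :ctl_id WHERE id = :id"),
            {"ctl_id": new_id, "id": row.id},
        )

    # Schema migration 2 (a separate follow-up revision in the PR):
    # now that every row is populated, enforce the constraint.
    op.alter_column(
        "datasetconfig", "ctl_dataset_id", existing_type=sa.String(), nullable=False
    )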

Steps to Confirm

Migration verification

  • Checkout main. Create a datasetconfig in your application database, then switch to this branch without dropping that database. If you switch and bring up the shell, migrations should run.
  • Verify that the datasetconfig table now has a ctl_dataset_id column, and that the column is non-nullable and populated.
  • Locate the corresponding ctl_datasets record. Compare each column against what was in datasetconfig.dataset. DatasetConfig.dataset itself should be untouched for now. Any fidesops_meta fields are likely converted to fides_meta fields.
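A quick way to spot-check the backfill from a shell - a sketch using plain SQLAlchemy; the connection URL is an assumption, so adjust it to your local dev database:

# Spot-check sketch: every datasetconfig should carry a non-null
# ctl_dataset_id that points at a real ctl_datasets row.
# The connection URL below is illustrative, not the project's config.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:fides@localhost:5432/fides")

with engine.connect() as conn:
    orphans = conn.execute(
        text("SELECT fides_key FROM datasetconfig WHERE ctl_dataset_id IS NULL")
    ).fetchall()
    assert not orphans, f"datasetconfigs missing ctl_dataset_id: {orphans}"

    pairs = conn.execute(
        text(
            "SELECT dc.fides_key, cd.fides_key "
            "FROM datasetconfig dc "
            "JOIN ctl_datasets cd ON cd.id = dc.ctl_dataset_id"
        )
    ).fetchall()
    for dataset_config_key, ctl_dataset_key in pairs:
        print(f"{dataset_config_key} -> {ctl_dataset_key}")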

Run a privacy request

  • Run nox -s test_env
  • Go to the privacy center, run a privacy request
  • Return to admin UI, approve
  • Verify that a json file lands in the fides_uploads folder and that its contents are roughly correct.

Existing endpoint parity (will soon be deprecated, but the UI still works here)

  • Run nox -s test_env
  • Go to http://localhost:3000, log in, then go to http://localhost:3000/datastore-connection.
  • Click the three horizontal dots on the Postgres connector > Configure. Jump to the Dataset configuration tab. Edit the description of the dataset. Verify the API request is successful (PATCH http://0.0.0.0:8080/api/v1/connection/postgres_connector/dataset)
  • Locate the updated datasetconfig.dataset field in the db and verify the description has changed. Similarly note that the ctl_datasets.description column has changed (select description from ctl_datasets where id = 'xxxxxx';)

Test creating saas connectors from a template

  • Go to http://localhost:3000/datastore-connection > Create new connection
  • Select mailchimp. Enter connector parameters - because we're not going to use this connector, your secrets can be fake
  • On the Dataset configuration tab, click Save Yaml System.
  • Verify successful API requests
  • Verify that new datasetconfig and ctl_datasets records exist and that the DatasetConfig has a ctl_dataset_id FK. The dataset should exist in both places.

New endpoint

  • In Postman or similar, create a new connection config resource: PATCH {{host}}/connection/
  • In Postman or similar, add a dataset config with the new endpoint: PATCH {{host}}/connection/{{existing connection key}}/datasetconfig. Create a new fides_key (this will be the identifier for the DatasetConfig), but select an existing ctl_dataset fides_key:
[{
    "fides_key": "new_dataset_config",
    "ctl_dataset_fides_key": "postgres_example_test_dataset"
}]
  • Verify a new DatasetConfig was created and the contents of the existing ctl_dataset were ported back into the new DatasetConfig.dataset.
  • Visit this new connector and dataset in the UI

Existing CTL datasets tab

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation Updated:
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

Description Of Changes

We are trying to move to storing the bulk of the contents of a Dataset solely on the ctl_datasets table. Right now, similar concepts exist in both the ctl_datasets table and the DatasetConfig.dataset column.

The idea with this increment is to add a non-nullable DatasetConfig.ctl_dataset_id field. DSRs can't run without an associated dataset, so I think we should keep this constraint from the beginning. I take the contents of existing DatasetConfig.datasets, attempt to create new ctl_dataset records, and then link them to the existing DatasetConfigs.

The next step is to keep writing to both places for now: DatasetConfig.dataset AND the linked ctl_dataset record get the same changes. This work also starts reading from the DatasetConfig.ctl_dataset record instead of DatasetConfig.dataset.

Follow-up work will deprecate some existing endpoints and stop writing to the DatasetConfig.dataset column.

- Add a data migration that takes existing datasetconfig.dataset and creates a new ctl_dataset record and links the new record back to the datasetconfig.
- Add a follow-up schema migration that makes the datasetconfig.ctl_dataset_id field non-nullable.
- …hat takes in a pair of a fides_key and ctl_dataset_fides_key. This request will create/update a DatasetConfig and link the ctl_dataset to it. As an incremental step, this endpoint copies the ctl_dataset and stores it on DatasetConfig.dataset.

- Update existing endpoint PATCH v1/connection/connection_key/dataset (which will be deprecated) to take the supplied dataset and upsert a ctl_dataset with it. This still allows a raw dataset to be supplied through this endpoint for the moment, so as not to break the UI.
- Both endpoints still try to update both DatasetConfig.dataset and the corresponding DatasetConfig.ctl_dataset resource. A follow-up will stop updating DatasetConfig.dataset.
- When fetching the dataset, get the contents of the ctl_dataset, not DatasetConfig.dataset, which is going away.
- Update the migration to validate the ctl_dataset created from a dataset before saving.
- Update a lot of DatasetConfig fixtures to have a ctl_dataset linked to them, storing the actual dataset contents.
@pattisdr pattisdr added the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Dec 14, 2022
@pattisdr pattisdr removed the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Dec 15, 2022
@pattisdr pattisdr self-assigned this Dec 15, 2022
@pattisdr pattisdr changed the title [DRAFT] Add Non Nullable DatasetConfig.ctl_dataset_id Field Add Non Nullable DatasetConfig.ctl_dataset_id Field Dec 15, 2022
@@ -1025,6 +1025,9 @@ dataset:
- name: connection_config_id
data_categories: [system.operations]
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
- name: ctl_dataset_id
data_categories: [ system.operations ]
data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
@pattisdr (Contributor, Author) commented:

This yaml file has been adjusted to reflect the new datasetconfig.ctl_dataset_id field

Comment on lines +90 to +94
except IntegrityError as exc:
    raise Exception(
        f"Fides attempted to copy datasetconfig.datasets into their own ctl_datasets rows but got error: {exc}. "
        f"Adjust fides_keys in ctl_datasets table to not conflict."
    )
@pattisdr (Contributor, Author) commented Dec 15, 2022:

I attempt to create new ctl_dataset records as part of this data migration by default, so we can 1) make this field non-nullable from the start while 2) not merging into existing ctl_datasets and potentially getting it wrong.

I talked with Sean about this - he said the plan was to handle conflicts ad hoc with the customer. So if there's a conflict, my current plan is that they resolve it manually, which differs from the more detailed plan spelled out in #1764.

Comment on lines +205 to +213
@classmethod
def create_from_dataset_dict(cls, db: Session, dataset: dict) -> "Dataset":
    """Add a method to create directly using a synchronous session"""
    validated_dataset: FideslangDataset = FideslangDataset(**dataset)
    ctl_dataset = cls(**validated_dataset.dict())
    db.add(ctl_dataset)
    db.commit()
    db.refresh(ctl_dataset)
    return ctl_dataset
@pattisdr (Contributor, Author) commented:

I see we already have ctl-side endpoints/methods for creating ctl_datasets, but there's still a big division between ctl code, which largely uses asynchronous sessions, and ops code, which largely uses synchronous sessions. I don't want to take that on here, so I'm adding a small model method that uses a synchronous session; it gets used numerous times, largely in testing. A usage sketch follows.
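A sketch of how a test fixture might use the helper - the dataset contents, DatasetConfig fields, and import paths are illustrative, not the PR's actual fixtures:

# Illustrative fixture-style usage of the synchronous helper.
# Import paths are indicative and may differ by fides version;
# db and connection_config are assumed to come from existing fixtures.
from fides.api.ctl.sql_models import Dataset as CtlDataset
from fides.api.ops.models.datasetconfig import DatasetConfig

example_dataset = {
    "fides_key": "postgres_example_test_dataset",
    "name": "Postgres Example",
    "organization_fides_key": "default_organization",
    "collections": [],
}

# Create the ctl_dataset first with the new synchronous helper...
ctl_dataset = CtlDataset.create_from_dataset_dict(db, example_dataset)

# ...then create the DatasetConfig and link the two via the new FK.
# (The PR still writes DatasetConfig.dataset as well; omitted for brevity.)
dataset_config = DatasetConfig.create(
    db=db,
    data={
        "connection_config_id": connection_config.id,
        "fides_key": "postgres_example",
        "ctl_dataset_id": ctl_dataset.id,  # the new non-nullable link
    },
)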

A contributor replied:

This makes sense. IIRC the ctl endpoints are fairly generic and constructed differently

Comment on lines +191 to +194
ctl_dataset: CtlDataset = (
    db.query(CtlDataset)
    .filter_by(fides_key=dataset_pair.ctl_dataset_fides_key)
    .first()
)
@pattisdr (Contributor, Author) commented:

Originally I was using a ctl-side method to get this Dataset, but it wasn't playing well with ops tests. It worked fine when I ran the test file by itself but broke down when I ran the whole test suite, with sqlalchemy.dialects.postgresql.asyncpg.InterfaceError - cannot perform operation: another operation is in progress.

Comment on lines +169 to +181
def patch_dataset_configs(
    dataset_pairs: conlist(DatasetConfigCtlDataset, max_items=50),  # type: ignore
    db: Session = Depends(deps.get_db),
    connection_config: ConnectionConfig = Depends(_get_connection_config),
) -> BulkPutDataset:
    """
    Endpoint to create or update DatasetConfigs by passing in pairs of:
    1) A DatasetConfig fides_key
    2) The corresponding CtlDataset fides_key which stores the bulk of the actual dataset

    Currently this endpoint looks up the ctl dataset and writes its contents back to the DatasetConfig.dataset
    field for backwards compatibility, but soon DatasetConfig.dataset will go away.
    """

@pattisdr (Contributor, Author) commented:

New endpoint that the UI should switch to using in the ops "create a connector" workflow.

Andrew described a flow where two endpoints will be hit: the ctl dataset endpoint to create/update that dataset, and then this endpoint, passing the fides_key of that ctl_dataset. You could also select the ctl dataset from a dropdown. A sketch of the flow follows.
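A sketch of that two-call flow - the ctl-side dataset route, the payloads, and the auth header are assumptions for illustration, not confirmed API details:

# Illustrative two-call flow (endpoint paths, payloads, and auth are
# assumptions for this sketch, not confirmed API details).
import requests

HOST = "http://0.0.0.0:8080/api/v1"
headers = {"Authorization": "Bearer <access_token>"}

# 1) Create/update the ctl dataset itself via a ctl-side CRUD endpoint.
requests.post(
    f"{HOST}/dataset",
    headers=headers,
    json={
        "fides_key": "postgres_example_test_dataset",
        "name": "Postgres Example",
        "collections": [],
    },
)

# 2) Link it to a DatasetConfig on a connection with the new endpoint.
requests.patch(
    f"{HOST}/connection/postgres_connector/datasetconfig",
    headers=headers,
    json=[
        {
            "fides_key": "new_dataset_config",
            "ctl_dataset_fides_key": "postgres_example_test_dataset",
        }
    ],
)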

@@ -75,7 +148,7 @@ def get_graph(self) -> GraphDataset:
     the corresponding SaaS config is merged in as well
     """
     dataset_graph = convert_dataset_to_graph(
-        Dataset(**self.dataset), self.connection_config.key  # type: ignore
+        Dataset.from_orm(self.ctl_dataset), self.connection_config.key  # type: ignore
@pattisdr (Contributor, Author) commented:

When we build the graph to run a DSR (this is the starting point for that), I'm pulling from the ctl_dataset instead of DatasetConfig.dataset.

Comment on lines +78 to +80

ctl_dataset = CtlDataset.create_from_dataset_dict(db, bigquery_dataset)

@pattisdr (Contributor, Author) commented:

This PR is larger than it looks: three quarters of the edits just make sure that test DatasetConfig fixtures have a CtlDataset linked to them.

@@ -198,7 +259,13 @@ def patch_datasets(
"dataset": dataset.dict(),
@pattisdr (Contributor, Author) commented:

Validation on this endpoint makes sure data categories on the dataset exist in the database. Because we're accessing the database, it's done outside of a typical pydantic validator. If this endpoint goes away, we need a new place for this check. Does the existing ctl_datasets endpoint have this validation? A sketch of the check is below.
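Roughly what that out-of-band check does - a sketch; the function and the DataCategory import path are illustrative, not the PR's actual code:

# Sketch of database-backed data category validation (names illustrative).
from typing import List

from fides.api.ctl.sql_models import DataCategory  # assumed model path
from sqlalchemy.orm import Session


def validate_data_categories_exist(db: Session, data_categories: List[str]) -> None:
    """Raise if a dataset references a data category not defined in the db.

    This needs a Session, so it can't live in a plain pydantic validator.
    """
    defined = {key for (key,) in db.query(DataCategory.fides_key).all()}
    unknown = sorted(set(data_categories) - defined)
    if unknown:
        raise ValueError(f"Unknown data categories: {unknown}")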

A contributor replied:

CC: @ThomasLaPiana. You might be the right person to ask about the ctl side of this

@pattisdr (Contributor, Author) commented:

OK, so it looks like the existing crud endpoints don't do this. The tricky bit is that the crud endpoints are very generic: blocks of code that apply to updating an entire set of resources. Added a note to look into the best place to put this in the next ticket #1763

@@ -172,10 +233,10 @@ def patch_datasets(
Given a list of dataset elements, create or update corresponding Dataset objects
@pattisdr (Contributor, Author) commented:

I believe this endpoint {{host}}/connection/{{connection_key}}/dataset should be deprecated once the UI has been updated to use the new endpoint above.

Added some functionality here to keep it usable in the meantime: if a raw dataset is passed in, I write it to both the DatasetConfig.dataset field and the ctl_dataset record.

A contributor replied:

Should there be a follow-up ticket keeping track of all of the soon-to-be removed/deprecated routes?

@pattisdr (Contributor, Author) commented:

Good question Andrew - here's the follow-up ticket; we can wait to deprecate until the UI has been pointed at the new endpoints: #2092

Comment on lines +104 to +106
upsert_ctl_dataset(
    dataset.ctl_dataset
)  # Update existing ctl_dataset first.
@pattisdr (Contributor, Author) commented:

Here, I know the specific ctl_dataset_id because I got it off the existing DatasetConfig. However, if there is no dataset config yet, I look up the ctl dataset by fides_key.

So there's a little extra code here: sometimes I want to update the CtlDataset by id, other times by fides_key. A sketch of that branching follows.
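A sketch of that two-path lookup - the helper below is illustrative, not the PR's actual upsert_ctl_dataset implementation, and the import path is an assumption:

# Illustrative two-path upsert (not the PR's actual implementation).
from typing import Optional

from fides.api.ctl.sql_models import Dataset as CtlDataset  # assumed path
from sqlalchemy.orm import Session


def upsert_ctl_dataset_sketch(
    db: Session,
    ctl_dataset_data: dict,
    existing_dataset_config=None,  # an existing DatasetConfig, if any
) -> CtlDataset:
    if existing_dataset_config is not None:
        # A DatasetConfig already links to a row: update it by primary key.
        ctl_dataset = db.query(CtlDataset).get(
            existing_dataset_config.ctl_dataset_id
        )
    else:
        # No DatasetConfig yet: fall back to a fides_key lookup.
        ctl_dataset = (
            db.query(CtlDataset)
            .filter_by(fides_key=ctl_dataset_data["fides_key"])
            .first()
        )

    if ctl_dataset is None:
        ctl_dataset = CtlDataset(**ctl_dataset_data)
        db.add(ctl_dataset)
    else:
        for field, value in ctl_dataset_data.items():
            setattr(ctl_dataset, field, value)

    db.commit()
    db.refresh(ctl_dataset)
    return ctl_dataset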

…dataset. Add unit tests for temporary method.
@pattisdr (Contributor, Author) commented:

Test failures are just the timescale-related ones we're seeing on other branches ^

Comment on lines +283 to +284
dataset_config.delete(db)
ctl_dataset.delete(db)
A contributor commented:

Are these required? IIRC we have a fixture that runs after every test that clears out all of the tables.

fides/tests/ops/conftest.py

Lines 107 to 132 in e77d6dc

@pytest.fixture(autouse=True)
def clear_db_tables(db):
    """Clear data from tables between tests.

    If relationships are not set to cascade on delete they will fail with an
    IntegrityError if there are relationships present. This function stores tables
    that fail with this error then recursively deletes until no more IntegrityErrors
    are present.
    """
    yield

    def delete_data(tables):
        redo = []
        for table in tables:
            try:
                db.execute(table.delete())
            except IntegrityError:
                redo.append(table)
            finally:
                db.commit()
        if redo:
            delete_data(redo)

    db.commit()  # make sure all transactions are closed before starting deletes
    delete_data(Base.metadata.sorted_tables)

@pattisdr (Contributor, Author) commented:

Not technically, but I think this fixture that clears all the tables is too aggressive, because some resources are expected to be in the database.

I filed an issue to investigate this further (#2016); in the meantime I'm trying to make my new tests more self-sufficient.

@pattisdr pattisdr force-pushed the fides_1762_datasetconfig_ctl_dataset_id branch from 57aaae8 to 398d04e on December 20, 2022 15:49
@pattisdr (Contributor, Author) commented:

Failing tests are still the timescale-related ones that have been fixed on main. This will be resolved after this is merged and I get unified-fides-resources up to date with main.

@TheAndrewJackson (Contributor) left a review comment:

:shipit:

@pattisdr pattisdr merged commit 7337c50 into unified-fides-resources Dec 20, 2022
@pattisdr pattisdr deleted the fides_1762_datasetconfig_ctl_dataset_id branch December 20, 2022 20:16
@ThomasLaPiana (Contributor) commented Dec 20, 2022:

I'm taking a look through this now, will take me a bit

edit: nevermind lol
