Add Non Nullable DatasetConfig.ctl_dataset_id Field #2046
Conversation
- Add a data migration that takes existing datasetconfig.dataset and creates a new ctl_dataset record and links the new record back to the datasetconfig.
- Add a follow-up schema migration that makes the datasetconfig.ctl_dataset field not nullable.
…hat takes in a pair of a fides_key and ctl_dataset_fides_key. This request will create/update a DatasetConfig and link the ctl_dataset to it. As an incremental step, this endpoint copies the ctl_dataset and stores it on DatasetConfig.dataset.
- Update existing endpoint PATCH v1/connection/connection_key/dataset (that will be deprecated) to take the supplied dataset and upsert a ctl_dataset with it. This still allows a raw dataset to be supplied through this endpoint for the moment so as not to break the UI.
- Both endpoints still try to update both DatasetConfig.dataset and the corresponding DatasetConfig.ctl_dataset resource. A follow-up will stop updating DatasetConfig.dataset.
- When fetching the dataset, get the contents of the ctl_dataset, not DatasetConfig.dataset, which is going away.
- Update the migration to validate the ctl_dataset created from a dataset before saving.
- Update many DatasetConfig fixtures to have a ctl_dataset linked to them, storing the actual dataset contents.
Update docstring and rename upsert_with_ctl_dataset.
- Remove unnecessary CTLDataset in fixture file.
@@ -1025,6 +1025,9 @@ dataset:
  - name: connection_config_id
    data_categories: [system.operations]
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
  - name: ctl_dataset_id
    data_categories: [system.operations]
    data_qualifier: aggregated.anonymized.unlinked_pseudonymized.pseudonymized.identified
This yaml file has been adjusted to reflect the new datasetconfig.ctl_dataset_id field
except IntegrityError as exc:
    raise Exception(
        f"Fides attempted to copy datasetconfig.datasets into their own ctl_datasets rows but got error: {exc}. "
        f"Adjust fides_keys in ctl_datasets table to not conflict."
    )
I attempt to create new ctl_dataset records as part of this data migration by default so we can 1) go ahead and make this field non-nullable, while 2) not combining existing ctl_datasets and potentially doing it wrong.
I talked with Sean about this - he said the plan was to handle conflicts ad hoc with the customer. So if there's a conflict, my current plan is that they resolve it manually, which differs from the more detailed plan spelled out here: #1764
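For context, here is a rough sketch of the shape of that data migration. It is not the actual Alembic revision in this PR; the table and column names used in the SQL are assumptions based on the discussion above.

# Rough sketch only: copy each datasetconfig.dataset blob into its own
# ctl_datasets row and link it back; raise on fides_key conflicts instead of upserting.
import uuid

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session


def link_dataset_configs_to_new_ctl_datasets(db: Session) -> None:
    rows = db.execute(text("SELECT id, dataset FROM datasetconfig")).mappings().all()
    for row in rows:
        dataset = row["dataset"]  # the JSON previously stored on DatasetConfig.dataset
        try:
            ctl_id = str(uuid.uuid4())
            db.execute(
                text(
                    "INSERT INTO ctl_datasets (id, fides_key, name, description) "
                    "VALUES (:id, :fides_key, :name, :description)"
                ),
                {
                    "id": ctl_id,
                    "fides_key": dataset["fides_key"],
                    "name": dataset.get("name"),
                    "description": dataset.get("description"),
                },
            )
            db.execute(
                text("UPDATE datasetconfig SET ctl_dataset_id = :ctl_id WHERE id = :id"),
                {"ctl_id": ctl_id, "id": row["id"]},
            )
            db.commit()
        except IntegrityError as exc:
            # Mirror the behavior above: surface fides_key conflicts instead of upserting.
            raise Exception(
                f"Fides attempted to copy datasetconfig.datasets into their own "
                f"ctl_datasets rows but got error: {exc}. "
                f"Adjust fides_keys in ctl_datasets table to not conflict."
            )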
@classmethod
def create_from_dataset_dict(cls, db: Session, dataset: dict) -> "Dataset":
    """Add a method to create directly using a synchronous session"""
    validated_dataset: FideslangDataset = FideslangDataset(**dataset)
    ctl_dataset = cls(**validated_dataset.dict())
    db.add(ctl_dataset)
    db.commit()
    db.refresh(ctl_dataset)
    return ctl_dataset
I see we have endpoints/methods that already exist ctl-side for creating ctl_datasets, but there's still a big division between ctl code largely using asynchronous sessions and ops code largely using synchronous sessions. I don't want to take that on here, so I'm adding a small model method that uses a synchronous session and is used numerous times (largely in testing).
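A hypothetical usage sketch of that helper in a test fixture follows; the import path and the minimal dataset dict are assumptions for illustration, not taken from this PR.

import pytest

from fides.api.ctl.sql_models import Dataset as CtlDataset  # import path assumed


@pytest.fixture
def example_ctl_dataset(db):
    # Minimal, made-up dataset dict; real fixtures carry full collections/fields.
    dataset_dict = {
        "fides_key": "example_test_dataset",
        "name": "Example Test Dataset",
        "description": "Minimal dataset used only for this sketch",
        "collections": [],
    }
    ctl_dataset = CtlDataset.create_from_dataset_dict(db, dataset_dict)
    yield ctl_dataset
    # Clean up the row this sketch created.
    db.delete(ctl_dataset)
    db.commit()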
This makes sense. IIRC the ctl endpoints are fairly generic and constructed differently
ctl_dataset: CtlDataset = (
    db.query(CtlDataset)
    .filter_by(fides_key=dataset_pair.ctl_dataset_fides_key)
    .first()
)
Originally I was using a ctl method to get this Dataset, but it wasn't playing well with ops tests. It worked fine when I ran the test file by itself but broke down when I ran the whole test suite, failing with sqlalchemy.dialects.postgresql.asyncpg.InterfaceError: cannot perform operation: another operation is in progress.
def patch_dataset_configs(
    dataset_pairs: conlist(DatasetConfigCtlDataset, max_items=50),  # type: ignore
    db: Session = Depends(deps.get_db),
    connection_config: ConnectionConfig = Depends(_get_connection_config),
) -> BulkPutDataset:
    """
    Endpoint to create or update DatasetConfigs by passing in pairs of:
    1) A DatasetConfig fides_key
    2) The corresponding CtlDataset fides_key which stores the bulk of the actual dataset

    Currently this endpoint looks up the ctl dataset and writes its contents back to the DatasetConfig.dataset
    field for backwards compatibility but soon DatasetConfig.dataset will go away.
    """
New endpoint that the UI should switch to using in the ops "create a connector" workflow.
Andrew described a flow where two endpoints will be hit: the ctl dataset endpoint to create/update the ctl_dataset, and then the fides_key of that ctl_dataset will be passed to this endpoint. You could also select the ctl dataset from a dropdown.
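For illustration, a hedged sketch of what a call to the new endpoint could look like; the host, token, connection key, and fides_keys are placeholders.

import requests

# The body is a list of DatasetConfig fides_key / ctl_dataset_fides_key pairs,
# as described in the endpoint docstring above.
response = requests.patch(
    "http://0.0.0.0:8080/api/v1/connection/my_postgres_connector/datasetconfig",
    headers={"Authorization": "Bearer <access_token>"},
    json=[
        {
            "fides_key": "postgres_dataset_config",      # identifier for the DatasetConfig
            "ctl_dataset_fides_key": "postgres_example",  # existing ctl_dataset to link
        }
    ],
)
print(response.status_code, response.json())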
@@ -75,7 +148,7 @@ def get_graph(self) -> GraphDataset:
     the corresponding SaaS config is merged in as well
     """
     dataset_graph = convert_dataset_to_graph(
-        Dataset(**self.dataset), self.connection_config.key  # type: ignore
+        Dataset.from_orm(self.ctl_dataset), self.connection_config.key  # type: ignore
When we build the graph to run a DSR (this is the starting point for that), I'm pulling from the ctl_dataset instead of DatasetConfig.dataset.
ctl_dataset = CtlDataset.create_from_dataset_dict(db, bigquery_dataset)
This PR is larger than it looks. Three-quarters of the edits just make sure that test DatasetConfig fixtures have a CTL Dataset linked to them.
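The general shape of those fixture updates is sketched below. The fixture names (db, connection_config, example_dataset_dict), the DatasetConfig.create() signature, and the CtlDataset/DatasetConfig imports (omitted here) are assumptions, not code from this PR.

import pytest


# Sketch of the fixture pattern: back the DatasetConfig with a CtlDataset row
# and point ctl_dataset_id at it, while still writing the legacy dataset column.
@pytest.fixture
def dataset_config_with_ctl_dataset(db, connection_config, example_dataset_dict):
    ctl_dataset = CtlDataset.create_from_dataset_dict(db, example_dataset_dict)
    dataset_config = DatasetConfig.create(
        db=db,
        data={
            "connection_config_id": connection_config.id,
            "fides_key": example_dataset_dict["fides_key"],
            "dataset": example_dataset_dict,   # legacy column, still written for now
            "ctl_dataset_id": ctl_dataset.id,  # the new required link
        },
    )
    yield dataset_config
    # Clean up both records created by this fixture.
    dataset_config.delete(db)
    db.delete(ctl_dataset)
    db.commit()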
@@ -198,7 +259,13 @@ def patch_datasets(
    "dataset": dataset.dict(),
Validation on this endpoint makes sure data categories on the dataset exist in the database. Because we're accessing the database, it's done outside of a typical pydantic validator. If this endpoint goes away, we need a new place to stick this. Does the existing ctl_datasets endpoint have this validation?
CC: @ThomasLaPiana. You might be the right person to ask about the ctl side of this
OK, so it looks like the existing crud endpoints don't do this. The tricky bit is that the crud endpoints are very generic: they're blocks of code that apply to updating an entire set of resources. Added a note to look into the best place to put this in the next ticket #1763
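Not part of this PR, but a minimal sketch of the kind of database-backed check being discussed, verifying that the data categories referenced by a dataset exist before accepting it; the ctl_data_categories table name is an assumption.

from typing import List

from sqlalchemy import text
from sqlalchemy.orm import Session


def validate_data_categories_exist(db: Session, data_categories: List[str]) -> None:
    # Collect the known category keys from the database (table name assumed).
    known = {
        row[0]
        for row in db.execute(text("SELECT fides_key FROM ctl_data_categories"))
    }
    unknown = [category for category in data_categories if category not in known]
    if unknown:
        raise ValueError(f"Unknown data categories: {unknown}")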
@@ -172,10 +233,10 @@ def patch_datasets(
    Given a list of dataset elements, create or update corresponding Dataset objects
I believe this endpoint {{host}}/connection/{{connection_key}}/dataset should be deprecated once the UI has been updated to use the new endpoint above.
Added some functionality here to keep it usable: if a raw dataset is passed in, I write it to both the DatasetConfig.dataset field and the ctl_dataset record.
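Roughly, that dual-write behavior looks like the sketch below; upsert_ctl_dataset and DatasetConfig.create_or_update are assumed helper names, not the code as merged.

def upsert_dataset_config_from_raw_dataset(db, connection_config, dataset_dict):
    # Upsert the ctl_dataset from the raw dataset supplied to the old endpoint,
    # then keep writing the legacy DatasetConfig.dataset column as well.
    ctl_dataset = upsert_ctl_dataset(db, dataset_dict)  # hypothetical helper
    return DatasetConfig.create_or_update(  # assumed upsert method
        db=db,
        data={
            "connection_config_id": connection_config.id,
            "fides_key": dataset_dict["fides_key"],
            "dataset": dataset_dict,           # legacy column, still written for now
            "ctl_dataset_id": ctl_dataset.id,  # link to the upserted ctl_dataset
        },
    )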
Should there be a follow-up ticket keeping track of all of the soon-to-be removed/deprecated routes?
Good question Andrew. Here's the follow-up ticket; we can wait to deprecate until the UI has been pointed at the new endpoints: #2092
upsert_ctl_dataset(
    dataset.ctl_dataset
)  # Update existing ctl_dataset first.
Here, I know the specific ctl_dataset_id because I got it off the existing DatasetConfig. However, if there is no DatasetConfig, I'm looking up a ctl dataset by fides_key.
So there's a little extra code here: sometimes I want to update the CtlDataset by id, other times by fides_key.
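In sketch form (names assumed, not the merged code), the two lookup paths are:

def get_ctl_dataset_for_upsert(db, existing_dataset_config, ctl_dataset_fides_key):
    # If a DatasetConfig already exists, we know exactly which CtlDataset to update.
    if existing_dataset_config is not None:
        return db.query(CtlDataset).get(existing_dataset_config.ctl_dataset_id)
    # Otherwise fall back to looking the CtlDataset up by its fides_key.
    return db.query(CtlDataset).filter_by(fides_key=ctl_dataset_fides_key).first()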
…dataset. Add unit tests for temporary method.
Test failures are just the timescale-related ones that we're seeing on other branches ^
dataset_config.delete(db)
ctl_dataset.delete(db)
Are these required? IIRC we have a fixture that runs after every test that clears out all of the tables.
Lines 107 to 132 in e77d6dc
@pytest.fixture(autouse=True)
def clear_db_tables(db):
    """Clear data from tables between tests.

    If relationships are not set to cascade on delete they will fail with an
    IntegrityError if there are relationships present. This function stores tables
    that fail with this error then recursively deletes until no more IntegrityErrors
    are present.
    """
    yield

    def delete_data(tables):
        redo = []
        for table in tables:
            try:
                db.execute(table.delete())
            except IntegrityError:
                redo.append(table)
            finally:
                db.commit()
        if redo:
            delete_data(redo)

    db.commit()  # make sure all transactions are closed before starting deletes
    delete_data(Base.metadata.sorted_tables)
Not technically, but I think this fixture that clears all the tables is too aggressive, because there are resources that are expected to be in the database.
I filed an issue to investigate this further, and in the meantime I'm trying to make my new tests more self-sufficient: #2016
Force-pushed from 57aaae8 to 398d04e
Failing tests are still the timescale-related ones that have been fixed on main. This will be resolved after this is merged and I get …
I'm taking a look through this now, will take me a bit. Edit: nevermind lol
Contains multiple migrations (schema and data)
Note this PR is against a feature branch
Closes #1762
Closes #1764
Code Changes
- Add a new non-nullable datasetconfig.ctl_dataset_id column
- Add a schema migration that first adds the datasetconfig.ctl_dataset_id column that is nullable
- Add a data migration that takes existing DatasetConfig.datasets, creates new ctl_dataset records, and then links that ctl_dataset as the DatasetConfig.ctl_dataset_id. If there's a conflict with an existing ctl_datasets.fides_key, I error instead of attempting to upsert. The user should manually resolve.
- Add a follow-up schema migration that makes the datasetconfig.ctl_dataset_id field non-nullable
- Add a new endpoint PATCH {{host}}/connection/{{connection_key}}/datasetconfig that takes in a fides_key (for the DatasetConfig) and an existing ctl_dataset_fides_key. It upserts a DatasetConfig, links to the existing CTLDataset, then copies the CTLDataset contents back to the DatasetConfig.dataset. Soon, the DatasetConfig.dataset field is going away.
Steps to Confirm
Migration verification
- Have a datasetconfig in your application database and then switch to this branch without dropping that database. If you switch and bring up the shell, migrations should run. Verify that the existing datasetconfig table has a ctl_dataset_id column. Verify that this column is non-nullable and populated. Locate the corresponding ctl_datasets record. Compare each column and match to what was in datasetconfig.dataset.
- DatasetConfig.dataset itself should be untouched for now. Any fidesops_meta fields are likely converted to fides_meta fields.
Run a privacy request
- In nox -s test_env, run a privacy request and check that the fides_uploads folder and contents are roughly correct.
Existing endpoint parity (will soon be deprecated, but the UI still works here)
- In nox -s test_env, update a dataset through the existing endpoint (PATCH http://0.0.0.0:8080/api/v1/connection/postgres_connector/dataset) and verify the change also shows up on the linked ctl_datasets record (select description from ctl_datasets where id = 'xxxxxx';)
Test creating saas connectors from a template
- Verify both datasetconfig and ctl_datasets records are created and the DatasetConfig has a ctl_dataset_id FK. The dataset should exist in both places.
New endpoint
- Try PATCH {{host}}/connection/{{existing connection key}}/datasetconfig. Create a new fides_key (this will be the identifier for the DatasetConfig), but select an existing ctl_dataset fides_key.
Existing CTL datasets tab
Pre-Merge Checklist
CHANGELOG.md
Description Of Changes
We are trying to move to storing the bulk of the contents of a Dataset solely on the ctl_datasets table. Right now, similar concepts exist in both the ctl_datasets table and the DatasetConfig.dataset column.

The idea with this increment is to add a non-nullable DatasetConfig.ctl_dataset_id field. DSRs can't run without an associated dataset, so I think we should keep this as a constraint from the beginning. I take the contents of existing DatasetConfig.datasets and attempt to create new ctl_dataset records, and then link them to the existing DatasetConfig.

The next step is to keep writing to both places: DatasetConfig.dataset AND making the same changes to the ctl_dataset record. This work also starts reading from the DatasetConfig.ctl_dataset record instead of DatasetConfig.dataset.

Follow-up work will deprecate some existing endpoints and stop writing to the DatasetConfig.dataset column.
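To summarize the target shape, here is a simplified sketch (not the actual fides models) of the relationship this increment moves toward: DatasetConfig keeps its own configuration plus a required foreign key to ctl_datasets, and the legacy dataset column is deliberately omitted.

from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class CtlDataset(Base):
    # Simplified stand-in for the ctl_datasets table, which holds the dataset contents.
    __tablename__ = "ctl_datasets"
    id = Column(String, primary_key=True)
    fides_key = Column(String, unique=True, nullable=False)


class DatasetConfig(Base):
    # DatasetConfig keeps connector-specific configuration plus a required link
    # to the ctl_dataset; the legacy dataset column is intentionally left out here.
    __tablename__ = "datasetconfig"
    id = Column(String, primary_key=True)
    fides_key = Column(String, nullable=False)
    ctl_dataset_id = Column(String, ForeignKey("ctl_datasets.id"), nullable=False)
    ctl_dataset = relationship("CtlDataset")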