Typically expect catalog entries to have unique filepaths, protecting against overwrite #3993

Open
david-stanley-94 opened this issue Jul 5, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@david-stanley-94

Description

Data has been accidentally overwritten in the past after copy-pasting a catalog entry to derive a new one and forgetting to change the filepath. I feel it would be useful to protect against this kind of situation by expecting catalog entries to have unique filepaths by default and throwing an error when that is not the case, with certain sensible opt-outs the user / developer can add.

Context

This would prevent some accidental overwriting of data by users, while leaving behaviour unchanged where catalog entries are expected to share filepaths (e.g. SQLDatasets, transcoded entries).

Possible Implementation

By default, check for duplicate filepaths across the entire catalog and throw an error when any are found, with the following exceptions (a rough sketch of this check is given after the example below):

  • ignore transcoded entries (these are expected to share filepaths)
  • ignore entries flagged with overwrite: True (or something like this)
    • this might instead be a flag added to dataset classes (e.g. SQLDataset) rather than to catalog entries
    • the dataset-level setting could then be overruled by a flag on the catalog entry, so that CSV files, say, can be allowed to overwrite where desired, and SQL tables can be prevented from overwriting where desired

So for a catalog.yml with:

my_first_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_alt_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv
  overwrite: True

my_second_csv_dataset@pandas:
  type: pandas.CSVDataset
  filepath: path/to/second/csv

my_second_csv_dataset@spark:
  type: spark.SparkDataset
  filepath: path/to/second/csv

my_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table

my_edited_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table

my_alt_edited_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table
  overwrite: False

There would be

  • Errors for my_first_csv_dataset and my_first_edited_csv_dataset sharing filepaths, but not for my_first_alt_edited_csv_dataset
  • NO errors for my_second_csv_dataset@pandas and my_second_csv_dataset@spark
  • An error for my_alt_edited_sql_dataset, but not for my_sql_dataset or my_edited_sql_dataset
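
To make the proposal concrete, here is a rough, purely illustrative sketch of such a check operating on the parsed catalog.yml dictionary. The function name, the per-type allow-list and the way the overwrite flag is read are assumptions for illustration, not existing Kedro behaviour:

from collections import defaultdict

# Hypothetical: dataset types whose entries may share a filepath by default
TYPES_ALLOWING_SHARED_FILEPATHS = {"SQLDataset"}

def check_unique_filepaths(catalog_config: dict) -> None:
    """Raise if catalog entries share a filepath without opting in."""
    groups = defaultdict(list)
    for name, entry in catalog_config.items():
        filepath = entry.get("filepath")
        if filepath is None:
            continue
        base_name = name.split("@")[0]  # transcoded entries share a base name
        groups[filepath].append((name, base_name, entry))

    offenders = []
    for filepath, entries in groups.items():
        if len({base for _, base, _ in entries}) <= 1:
            continue  # only transcoded variants of a single dataset
        for name, _, entry in entries:
            # assumes "type" is the usual dotted string, e.g. pandas.CSVDataset
            type_allows = entry.get("type", "").split(".")[-1] in TYPES_ALLOWING_SHARED_FILEPATHS
            # a flag on the catalog entry, where present, overrules the type default
            if not entry.get("overwrite", type_allows):
                offenders.append((name, filepath))

    if offenders:
        raise ValueError(f"Catalog entries share a filepath without opting in: {offenders}")

Applied to the example above (with the [SQLDataset] placeholder standing for a real SQL dataset type in the allow-list), this would raise for my_first_csv_dataset, my_first_edited_csv_dataset and my_alt_edited_sql_dataset only, matching the expected errors.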

Possible Alternatives

Add a flag for running with no duplicate filepaths expected: throw an error if any are detected, otherwise don't. This could be made the default behaviour at a later date if it sees popular use. However, it is not a versatile solution, as some pipelines may have a mixture of catalog entries they would and would not expect to be overwritten.

david-stanley-94 added the Issue: Feature Request label on Jul 5, 2024
@datajoely
Contributor

I'm trying to think about how this could work - as part of @ElenaKhaustova and @iamelijahko's excellent DataCatalog research (#3934) there is now an initiative to make a consistent API for datasets to expose the file path as a public method: #3929

I think once the public API ticket is in, it would be really easy to write some sort of after_catalog_created validation hook where you just collect all the filepath attributes and throw an error if you see more than one instance. The only complication I can see in this pattern is ensuring we validate the rendered file path at runtime rather than any templated / factory file paths, which are expressed differently at rest.
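
To illustrate, here is a minimal sketch of what such a hook could look like today. The class name is made up, filepaths are read from the private _filepath attribute (pending #3929), catalog._get_dataset is likewise private, and everything after an @ in a dataset name is treated as transcoding:

from collections import defaultdict

from kedro.framework.hooks import hook_impl


class DuplicateFilepathCheckHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        filepaths = defaultdict(set)
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API until #3929 lands
            filepath = getattr(dataset, "_filepath", None)  # not every dataset has one
            if filepath is None:
                continue
            filepaths[str(filepath)].add(name.split("@")[0])  # collapse transcoded entries

        duplicates = {fp: names for fp, names in filepaths.items() if len(names) > 1}
        if duplicates:
            raise ValueError(f"Multiple catalog entries point at the same filepath: {duplicates}")

The hook would be registered through HOOKS in settings.py like any other project hook. As noted, it only sees paths as they are resolved when the catalog is created, so dataset factories rendered later would need separate handling.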

@Davide-Ragazzon

Davide-Ragazzon commented Jul 5, 2024

Maybe these checks could be performed by a separate optional function that does catalog validation.

This

  • Allows users to validate the catalog if needed
  • Does not restrict the cases where multiple datasets point to the same file to a specific subset of allowed cases.
    This reduces the risk that users need something we are not thinking of and the whole catalog breaks by default
  • Does not force users to add more flags, like the overwrite flag suggested above, unless they specifically decide to run the checks

E.g. a common way to update datasets in Kedro is to define an "input_dataset" and an "updated_dataset" pointing to the same file, so you can have a function that takes one as input and saves its result to the other.
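
For what it's worth, the opt-in flavour could be as simple as exposing the same check as a plain function the user invokes only when they want it, e.g. from a notebook or a custom CLI command, instead of registering an always-on hook. A minimal sketch, reusing the hypothetical DuplicateFilepathCheckHook from the comment above:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())  # assumes we are inside the Kedro project
with KedroSession.create() as session:
    catalog = session.load_context().catalog
    # run the duplicate-filepath check on demand rather than on every run
    DuplicateFilepathCheckHook().after_catalog_created(catalog)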
