Typically expect catalog entries to have unique filepaths, protecting against overwrite #3993

Open
david-stanley-94 opened this issue Jul 5, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@david-stanley-94

Description

Data has been accidentally overwritten in the past after copy-pasting a catalog entry to derive a new one and forgetting to change the filepath. I feel it would be useful to protect against this kind of situation by expecting catalog entries to have unique filepaths by default and throwing an error when that is not the case, with certain sensible opt-outs the user / developer can add.

Context

This would prevent some accidental overwriting of data by users, while leaving behaviour unchanged where catalog entries are expected to share filepaths (e.g. SQLDatasets, transcoded entries).

Possible Implementation

By default, check for duplicate filepaths across the entire catalog and throw an error when any are found, with the following exceptions (a rough sketch of this check is given after the example below):

  • ignore transcoded entries (these are expected to share filepaths)
  • ignore entries flagged with overwrite: True (or something like this)
    • this might instead be a flag added to dataset classes (e.g. SQLDataset) rather than to catalog entries
    • the dataset-level setting could then be overruled by a flag on the catalog entry, so that CSV files, say, can be allowed to overwrite where desired, and SQL tables can be prevented from overwriting where desired

So for a catalog.yml with:

my_first_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv

my_first_alt_edited_csv_dataset:
  type: pandas.CSVDataset
  filepath: path/to/csv
  overwrite: True

my_second_csv_dataset@pandas:
  type: pandas.CSVDataset
  filepath: path/to/second/csv

my_second_csv_dataset@spark:
  type: spark.SparkDataset
  filepath: path/to/second/csv

my_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table

my_edited_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table

my_alt_edited_sql_dataset:
  type: [SQLDataset]
  filepath: path/to/table
  overwrite: False

There would be

  • Errors for my_first_csv_dataset and my_first_edited_csv_dataset sharing filepaths, but not for my_first_alt_edited_csv_dataset
  • NO errors for my_second_csv_dataset@pandas and my_second_csv_dataset@spark
  • An error for my_alt_edited_sql_dataset, but not for my_sql_dataset or my_edited_sql_dataset
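
To make the proposal concrete, here is a rough, purely illustrative sketch of such a check operating on the parsed catalog.yml dictionary. The function name, the per-type allow-list and the way the overwrite flag is read are assumptions for illustration, not existing Kedro behaviour:

from collections import defaultdict

# Hypothetical: dataset types whose entries may share a filepath by default
TYPES_ALLOWING_SHARED_FILEPATHS = {"SQLDataset"}

def check_unique_filepaths(catalog_config: dict) -> None:
    """Raise if catalog entries share a filepath without opting in."""
    groups = defaultdict(list)
    for name, entry in catalog_config.items():
        filepath = entry.get("filepath")
        if filepath is None:
            continue
        base_name = name.split("@")[0]  # transcoded entries share a base name
        groups[filepath].append((name, base_name, entry))

    offenders = []
    for filepath, entries in groups.items():
        if len({base for _, base, _ in entries}) <= 1:
            continue  # only transcoded variants of a single dataset
        for name, _, entry in entries:
            # assumes "type" is the usual dotted string, e.g. pandas.CSVDataset
            type_allows = entry.get("type", "").split(".")[-1] in TYPES_ALLOWING_SHARED_FILEPATHS
            # a flag on the catalog entry, where present, overrules the type default
            if not entry.get("overwrite", type_allows):
                offenders.append((name, filepath))

    if offenders:
        raise ValueError(f"Catalog entries share a filepath without opting in: {offenders}")

Applied to the example above (with the [SQLDataset] placeholder standing for a real SQL dataset type in the allow-list), this would raise for my_first_csv_dataset, my_first_edited_csv_dataset and my_alt_edited_sql_dataset only, matching the expected errors.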

Possible Alternatives

Add a flag for running with no duplicate filepaths expected: throw an error if any are detected, otherwise don't. This could be made the default behaviour at a later date if it sees popular use. However, it is not a versatile solution, as some pipelines may have a mixture of catalog entries they would and would not expect to be overwritten.

david-stanley-94 added the Issue: Feature Request label on Jul 5, 2024
@datajoely
Contributor

I'm trying to think about how this could work - as part of @ElenaKhaustova and @iamelijahko's excellent DataCatalog research (#3934) there is now an initiative to make a consistent API for datasets to expose the file path as a public method: #3929

I think once the public API ticket is in, it would be really easy to write some sort of after_catalog_created validation hook where you just collect all the filepath attributes and throw an error if you see more than one instance. The only complication I can see in this pattern is ensuring we validate the rendered file path at runtime rather than any templated / factory file paths, which are expressed differently at rest.
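
To illustrate, here is a minimal sketch of what such a hook could look like today. The class name is made up, filepaths are read from the private _filepath attribute (pending #3929), catalog._get_dataset is likewise private, and everything after an @ in a dataset name is treated as transcoding:

from collections import defaultdict

from kedro.framework.hooks import hook_impl


class DuplicateFilepathCheckHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        filepaths = defaultdict(set)
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API until #3929 lands
            filepath = getattr(dataset, "_filepath", None)  # not every dataset has one
            if filepath is None:
                continue
            filepaths[str(filepath)].add(name.split("@")[0])  # collapse transcoded entries

        duplicates = {fp: names for fp, names in filepaths.items() if len(names) > 1}
        if duplicates:
            raise ValueError(f"Multiple catalog entries point at the same filepath: {duplicates}")

The hook would be registered through HOOKS in settings.py like any other project hook. As noted, it only sees paths as they are resolved when the catalog is created, so dataset factories rendered later would need separate handling.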

@Davide-Ragazzon

Davide-Ragazzon commented Jul 5, 2024

Maybe these checks could be performed by a separate optional function that does catalog validation.

This

  • Allows users to validate the catalog if needed
  • Does not restrict the cases where multiple datasets point to the same file to a specific subset of allowed cases.
    This reduces the risk that users need something we are not thinking of and the whole catalog breaks by default
  • Does not force users to add more flags, like the overwrite flag suggested above, unless they specifically decide to run the checks

E.g. a common way to update datasets in Kedro is to define an "input_dataset" and an "updated_dataset" pointing to the same file, so you can have a function that takes one as input and saves its result to the other.
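
For what it's worth, the opt-in flavour could be as simple as exposing the same check as a plain function the user invokes only when they want it, e.g. from a notebook or a custom CLI command, instead of registering an always-on hook. A minimal sketch, reusing the hypothetical DuplicateFilepathCheckHook from the comment above:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())  # assumes we are inside the Kedro project
with KedroSession.create() as session:
    catalog = session.load_context().catalog
    # run the duplicate-filepath check on demand rather than on every run
    DuplicateFilepathCheckHook().after_catalog_created(catalog)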
