
Use dataset factories to register default datasets #2668

Closed
Tracked by #2635
ankatiyar opened this issue Jun 12, 2023 · 10 comments · Fixed by #3332
Comments

@ankatiyar
Contributor

ankatiyar commented Jun 12, 2023

Description

After the introduction of dataset factories in #2423, the creation of default datasets, which is currently handled by the runners, should be handled by the DataCatalog.

Now that the feature has been merged, users can overwrite default datasets using the following pattern -

"{default}":
  filepath: data/02_intermediate/{default}.csv
  type: pandas.CSVDataSet

This ticket is to change the behaviour of the existing runners so that they use dataset factories to register default datasets.
Instead of adding a catalog entry for each default dataset -

for ds_name in unregistered_ds:
    catalog.add(ds_name, self.create_default_data_set(ds_name))

the runner should call a method on the catalog that registers a default catch-all pattern.

To dos -

  • runners register the default datasets as a catch-all pattern
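The runner-side change described in the to-do above can be sketched roughly as follows. This is an illustrative toy only: SimpleCatalog, add_pattern and get are hypothetical stand-ins for the idea, not Kedro's actual API.

```python
class SimpleCatalog:
    """Toy catalog supporting a catch-all default pattern (illustrative)."""

    def __init__(self):
        self._datasets = {}
        self._default_config = None

    def add_pattern(self, config):
        # Instead of the runner adding one entry per unregistered dataset,
        # it registers a single catch-all config once.
        self._default_config = config

    def get(self, name):
        # Lazily materialise a dataset from the catch-all pattern on first use.
        if name not in self._datasets and self._default_config is not None:
            self._datasets[name] = dict(self._default_config, name=name)
        return self._datasets[name]


catalog = SimpleCatalog()
# Runner-side change: one call replaces the per-dataset loop.
catalog.add_pattern({"type": "MemoryDataset"})
print(catalog.get("unregistered_output"))
```

The point of the sketch is that the runner no longer needs to know which datasets are unregistered; the catalog resolves them on demand.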
@ankatiyar ankatiyar added the Issue: Feature Request New feature or improvement to existing feature label Jun 12, 2023
@merelcht merelcht added Component: Configuration and removed Issue: Feature Request New feature or improvement to existing feature labels Jun 12, 2023
@noklam
Contributor

noklam commented Jun 12, 2023

Cc'ing @marrrcin for comments, as I think overwriting default datasets is done across the collection of plugins you created.

@marrrcin
Contributor

I've already had a call with @merelcht , maybe she can post some notes here.

@ankatiyar
Contributor Author

Current behaviour

  • creation of default datasets is handled in the runner -
  • Each runner has a create_default_dataset() method which returns a MemoryDataset (or SharedMemoryDataset for ParallelRunner) object for each "unregistered dataset" (one mentioned in the pipeline inputs or outputs but not as an explicit entry in the catalog) and adds it to the catalog.

Updating this behaviour to use dataset factories instead is straightforward for SequentialRunner and ThreadRunner -

  • Just add a catch-all pattern to the catalog with MemoryDataset :
"{default}":
  type: MemoryDataset
  • Add a function catalog.add_pattern(name, config) to the catalog, which can be called either -
    • with the current behaviour, in runner.run() after the "free outputs" have been calculated, as these datasets are returned as the output of the session
    • before runner.run() is called in the session, but then some logic to determine which datasets are "free outputs" needs to be added.
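A minimal sketch of how such a catch-all pattern could be resolved when an unregistered dataset is requested, in the spirit of dataset factories: the "{default}" placeholder is substituted into the config's string values. resolve_pattern is an illustrative helper, not part of Kedro.

```python
def resolve_pattern(config, dataset_name):
    """Substitute the matched dataset name into a '{default}' pattern config."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str):
            # A "{default}" pattern matches any dataset name.
            value = value.replace("{default}", dataset_name)
        resolved[key] = value
    return resolved


config = {"type": "pandas.CSVDataSet",
          "filepath": "data/02_intermediate/{default}.csv"}
print(resolve_pattern(config, "model_input"))
# {'type': 'pandas.CSVDataSet',
#  'filepath': 'data/02_intermediate/model_input.csv'}
```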

Issues with ParallelRunner

  • The default dataset for ParallelRunner is SharedMemoryDataset, which is a wrapper around MemoryDataset. It is not an AbstractDataset and therefore doesn't have the from_config() method that the catalog uses in catalog._get_dataset()/from_config() to initialise datasets on creation.
  • ParallelRunner has a _manager attribute which is passed to the SharedMemoryDataset object on initialisation (can't be done in a declarative way by simply adding a catalog entry)
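The _manager constraint in the second bullet can be illustrated with a small sketch: a wrapper whose constructor needs a live manager object cannot be built from a declarative catalog.yml entry alone. StubManager and SharedMemoryDatasetSketch are hypothetical names, not Kedro's code.

```python
class StubManager:
    """Stand-in for the runner's multiprocessing manager, so the sketch
    runs without spawning a manager process."""

    def dict(self):
        return {}


class SharedMemoryDatasetSketch:
    def __init__(self, manager):
        # The manager is a runtime object owned by ParallelRunner; there is
        # no declarative way to say "pass me the runner's manager" in a
        # catalog.yml entry, which is why from_config alone cannot build it.
        self._shared = manager.dict()

    def save(self, key, data):
        self._shared[key] = data

    def load(self, key):
        return self._shared[key]


ds = SharedMemoryDatasetSketch(StubManager())
ds.save("a", 1)
```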

Solution

I've managed to hack together a solution in the draft PR #3269

Some of these things can be cherry-picked and implemented regardless of whether we decide to go for the whole thing -

  • Moved SharedMemoryDataset to kedro/io - it's currently defined in kedro/runner/parallel_runner.py
  • Make SharedMemoryDataset a subclass of AbstractDataset - this also allows the catalog to register it through a "catalog entry" (catch-all pattern), but without initialising the underlying shared_memory_dataset
  • In the code for ParallelRunner, it initialises the shared_memory_dataset attribute of each SharedMemoryDataset with the (ParallelRunner's) manager.MemoryDataset() object
  • In the current implementation, the runner first checks for the existence of all pipeline datasets with catalog.exists(dataset), which calls catalog._get_dataset(dataset) - this registers the dataset factory datasets with the catalog. Then, for each SharedMemoryDataset, it initialises the underlying dataset. (This could be improved in the final implementation.)
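The two-phase initialisation described in the bullets above could look roughly like this: the catalog constructs the wrapper declaratively (phase 1), and the runner later injects the manager-backed store (phase 2). All class names are illustrative sketches, not the code in the draft PR.

```python
class AbstractDatasetSketch:
    """Minimal stand-in for AbstractDataset with a from_config hook."""

    @classmethod
    def from_config(cls, config):
        return cls(**config)


class SharedMemoryDatasetSketch(AbstractDatasetSketch):
    def __init__(self):
        # Phase 1: no underlying dataset yet, so a plain catalog entry
        # (or catch-all pattern) is enough to construct the wrapper.
        self.shared_memory_dataset = None

    def set_manager(self, manager):
        # Phase 2: the runner supplies its manager-backed store.
        self.shared_memory_dataset = manager.dict()

    def save(self, key, data):
        self.shared_memory_dataset[key] = data


class StubManager:
    """Stand-in for the runner's multiprocessing manager."""

    def dict(self):
        return {}


ds = SharedMemoryDatasetSketch.from_config({})  # catalog side
ds.set_manager(StubManager())                   # runner side
ds.save("x", 42)
```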

@marrrcin
Contributor

How does this affect the plugins? Our major plugins (AzureML, VertexAI and SageMaker) rely on the fact that we can swap the default dataset implementation transparently to the user (and the project). This makes the plugins' UX really good, because users still don't have to think about datasets that are not explicitly mentioned in the catalog - they have the same experience as with MemoryDataSet.

@ankatiyar
Contributor Author

ankatiyar commented Nov 15, 2023

@marrrcin This does not affect the user experience at all afaik. This is just an internal implementation detail: we currently set the default dataset in the runner, but we want to leverage dataset factories to set it instead.
It just adds a default pattern to the catalog -

"{default}":
  type: MemoryDataset

This does not override the default one that the user has set.

How do the plugins set the default dataset at the moment? Do they use the runner.create_default_dataset() method?

@marrrcin
Contributor

Yes, we rely on runner.create_default_dataset().

@ankatiyar
Contributor Author

@marrrcin In that case, it would be replaced by something like catalog.add_default_pattern(), mostly to move this responsibility away from the runners. That's a fair point about plugins needing to add default datasets, though. This was just a spike to see if this can work; we haven't made a decision either way yet.

@noklam
Contributor

noklam commented Nov 16, 2023

> @marrrcin This does not affect the user experience at all afaik. This is just internal implementation detail, where we currently set the default dataset in the runner but we want to leverage dataset factories to set it. It just adds a default pattern to the catalog -
>
> "{default}":
>   type: MemoryDataset
>
> This does not override the default one that the user has set.
>
> How do the plugins set the default dataset at the moment? Do they use the runner.create_default_dataset() method?

Moving it to the catalog seems like a good choice, but I wonder what happens if the user also creates a catch-all pattern in catalog.yml - which one would take precedence? Should we ban the use of catch-all patterns (if that's possible at all) and ask the user to configure this in settings.py instead?

We need to make sure add_default_pattern overrides the previous setting; I think set_default_pattern is a better name.
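The override semantics behind the set_default_pattern naming suggestion can be sketched as follows: with "set" semantics, the last writer wins unambiguously. PatternStore is a hypothetical stand-in, and the plugin dataset name is made up for illustration.

```python
class PatternStore:
    """Toy holder for a single default catch-all pattern (illustrative)."""

    def __init__(self):
        self._default = None

    def set_default_pattern(self, config):
        # Replaces whatever was set before: unambiguous override semantics,
        # unlike an "add" that might accumulate conflicting patterns.
        self._default = config


store = PatternStore()
store.set_default_pattern({"type": "MemoryDataset"})        # e.g. from catalog.yml
store.set_default_pattern({"type": "plugin.AzureDataset"})  # e.g. from a plugin
print(store._default["type"])  # plugin.AzureDataset
```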

@marrrcin
Contributor

> @marrrcin In that case, it would be replaced by something like catalog.add_default_pattern() mostly to move this responsibility away from the runners but that's a fair point about plugins needing to add default datasets. This was just a spike to see if this can work, we haven't made a decision either way yet.

When / where would the catalog.add_default_pattern() happen during Kedro's lifecycle?

@merelcht merelcht self-assigned this Nov 24, 2023
@merelcht merelcht linked a pull request Nov 24, 2023 that will close this issue
@merelcht
Member

merelcht commented Dec 7, 2023

Completed in #3332

@merelcht merelcht closed this as completed Dec 7, 2023