
Use dataset factories to register default datasets #2668

Closed
Tracked by #2635
ankatiyar opened this issue Jun 12, 2023 · 10 comments · Fixed by #3332
Comments

@ankatiyar
Contributor

ankatiyar commented Jun 12, 2023

Description

After the introduction of dataset factories in #2423, the creation of default datasets, which is currently handled by the runners, should be handled by the DataCatalog.

Now that the feature has been merged, users can overwrite default datasets using the following pattern -

"{default}":
  filepath: data/02_intermediate/{default}.csv
  type: pandas.CSVDataSet

This ticket is to change the behaviour of the existing runners so that they use dataset factories to register default datasets.
Instead of adding a catalog entry for each default dataset -

for ds_name in unregistered_ds:
    catalog.add(ds_name, self.create_default_data_set(ds_name))

the runner should call a method on the catalog that registers a default catch-all pattern.

To dos -

  • runners register the default datasets as a catch-all pattern
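The runner-side change described in the to-do above can be sketched roughly as follows. This is an illustrative toy only: SimpleCatalog, add_pattern and get are hypothetical stand-ins for the idea, not Kedro's actual API.

```python
class SimpleCatalog:
    """Toy catalog supporting a catch-all default pattern (illustrative)."""

    def __init__(self):
        self._datasets = {}
        self._default_config = None

    def add_pattern(self, config):
        # Instead of the runner adding one entry per unregistered dataset,
        # it registers a single catch-all config once.
        self._default_config = config

    def get(self, name):
        # Lazily materialise a dataset from the catch-all pattern on first use.
        if name not in self._datasets and self._default_config is not None:
            self._datasets[name] = dict(self._default_config, name=name)
        return self._datasets[name]


catalog = SimpleCatalog()
# Runner-side change: one call replaces the per-dataset loop.
catalog.add_pattern({"type": "MemoryDataset"})
print(catalog.get("unregistered_output"))
```

The point of the sketch is that the runner no longer needs to know which datasets are unregistered; the catalog resolves them on demand.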
@ankatiyar ankatiyar added the Issue: Feature Request New feature or improvement to existing feature label Jun 12, 2023
@merelcht merelcht added Component: Configuration and removed Issue: Feature Request New feature or improvement to existing feature labels Jun 12, 2023
@noklam
Contributor

noklam commented Jun 12, 2023

Cc'ing @marrrcin for comments, as I think overwriting default datasets is done across the collection of plugins you created.

@marrrcin
Contributor

I've already had a call with @merelcht , maybe she can post some notes here.

@ankatiyar
Contributor Author

Current behaviour

  • creation of default datasets is handled in the runner -
  • Each runner has a create_default_dataset() method which returns a MemoryDataset (or SharedMemoryDataset for ParallelRunner) object for each "unregistered dataset" (one mentioned in the pipeline inputs or outputs but not as an explicit entry in the catalog) and adds it to the catalog.

Updating this behaviour to use dataset factories instead is straightforward for SequentialRunner and ThreadRunner -

  • Just add a catch-all pattern to the catalog with MemoryDataset :
"{default}":
  type: MemoryDataset
  • Add a function catalog.add_pattern(name, config) to the catalog, which can be called either -
    • with the current behaviour, in runner.run() after the "free outputs" have been calculated, as these datasets are returned as the output of the session
    • before runner.run() is called in the session, but then some logic to determine which datasets are "free outputs" needs to be added.
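A minimal sketch of how such a catch-all pattern could be resolved when an unregistered dataset is requested, in the spirit of dataset factories: the "{default}" placeholder is substituted into the config's string values. resolve_pattern is an illustrative helper, not part of Kedro.

```python
def resolve_pattern(config, dataset_name):
    """Substitute the matched dataset name into a '{default}' pattern config."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str):
            # A "{default}" pattern matches any dataset name.
            value = value.replace("{default}", dataset_name)
        resolved[key] = value
    return resolved


config = {"type": "pandas.CSVDataSet",
          "filepath": "data/02_intermediate/{default}.csv"}
print(resolve_pattern(config, "model_input"))
# {'type': 'pandas.CSVDataSet',
#  'filepath': 'data/02_intermediate/model_input.csv'}
```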

Issues with ParallelRunner

  • The default dataset for ParallelRunner is SharedMemoryDataset, which is a wrapper around MemoryDataset. It is not an AbstractDataset and therefore doesn't have the from_config() method that the catalog uses in catalog._get_dataset()/from_config() to initialise datasets on creation.
  • ParallelRunner has a _manager attribute which is passed to the SharedMemoryDataset object on initialisation (can't be done in a declarative way by simply adding a catalog entry)
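The _manager constraint in the second bullet can be illustrated with a small sketch: a wrapper whose constructor needs a live manager object cannot be built from a declarative catalog.yml entry alone. StubManager and SharedMemoryDatasetSketch are hypothetical names, not Kedro's code.

```python
class StubManager:
    """Stand-in for the runner's multiprocessing manager, so the sketch
    runs without spawning a manager process."""

    def dict(self):
        return {}


class SharedMemoryDatasetSketch:
    def __init__(self, manager):
        # The manager is a runtime object owned by ParallelRunner; there is
        # no declarative way to say "pass me the runner's manager" in a
        # catalog.yml entry, which is why from_config alone cannot build it.
        self._shared = manager.dict()

    def save(self, key, data):
        self._shared[key] = data

    def load(self, key):
        return self._shared[key]


ds = SharedMemoryDatasetSketch(StubManager())
ds.save("a", 1)
```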

Solution

I've managed to hack together a solution in the draft PR #3269

Some of these things can be cherry-picked and implemented regardless of whether we decide to go for the whole thing -

  • Moved SharedMemoryDataset to kedro/io - it's currently defined in kedro/runner/parallel_runner.py
  • Make SharedMemoryDataset a subclass of AbstractDataset - this also allows the catalog to register it through a "catalog entry" (catch-all pattern), but without initialising the underlying shared_memory_dataset
  • In the code for ParallelRunner, it initialises the shared_memory_dataset attribute of each SharedMemoryDataset with the (ParallelRunner's) manager.MemoryDataset() object
  • In the current implementation, the runner first checks for the existence of all pipeline datasets with catalog.exists(dataset), which calls catalog._get_dataset(dataset) - this registers the dataset factory datasets with the catalog. Then, for each SharedMemoryDataset, it initialises the underlying dataset. (This could be improved in the final implementation.)
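The two-phase initialisation described in the bullets above could look roughly like this: the catalog constructs the wrapper declaratively (phase 1), and the runner later injects the manager-backed store (phase 2). All class names are illustrative sketches, not the code in the draft PR.

```python
class AbstractDatasetSketch:
    """Minimal stand-in for AbstractDataset with a from_config hook."""

    @classmethod
    def from_config(cls, config):
        return cls(**config)


class SharedMemoryDatasetSketch(AbstractDatasetSketch):
    def __init__(self):
        # Phase 1: no underlying dataset yet, so a plain catalog entry
        # (or catch-all pattern) is enough to construct the wrapper.
        self.shared_memory_dataset = None

    def set_manager(self, manager):
        # Phase 2: the runner supplies its manager-backed store.
        self.shared_memory_dataset = manager.dict()

    def save(self, key, data):
        self.shared_memory_dataset[key] = data


class StubManager:
    """Stand-in for the runner's multiprocessing manager."""

    def dict(self):
        return {}


ds = SharedMemoryDatasetSketch.from_config({})  # catalog side
ds.set_manager(StubManager())                   # runner side
ds.save("x", 42)
```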

@marrrcin
Contributor

How does this affect the plugins? Our major plugins (AzureML, VertexAI and SageMaker) rely on the fact that we can swap the default dataset implementation transparently to the user (and the project). This makes the plugins' UX really good, because users still don't have to think about datasets that are not explicitly mentioned in the catalog - they have the same experience as with MemoryDataSet.

@ankatiyar
Contributor Author

ankatiyar commented Nov 15, 2023

@marrrcin This does not affect the user experience at all afaik. This is just an internal implementation detail: we currently set the default dataset in the runner, but we want to leverage dataset factories to set it instead.
It just adds a default pattern to the catalog -

"{default}":
  type: MemoryDataset

This does not override the default one that the user has set.

How do the plugins set the default dataset at the moment? Do they use the runner.create_default_dataset() method?

@marrrcin
Contributor

Yes, we rely on runner.create_default_dataset().

@ankatiyar
Contributor Author

@marrrcin In that case, it would be replaced by something like catalog.add_default_pattern(), mostly to move this responsibility away from the runners. That's a fair point about plugins needing to add default datasets, though. This was just a spike to see if this can work; we haven't made a decision either way yet.

@noklam
Contributor

noklam commented Nov 16, 2023

> @marrrcin This does not affect the user experience at all afaik. This is just internal implementation detail, where we currently set the default dataset in the runner but we want to leverage dataset factories to set it. It just adds a default pattern to the catalog -
>
> "{default}":
>   type: MemoryDataset
>
> This does not override the default one that the user has set.
>
> How do the plugins set the default dataset at the moment? Do they use the runner.create_default_dataset() method?

Moving it to the catalog seems like a good choice, but I wonder what happens if the user also creates a catch-all pattern in catalog.yml - which one would take precedence? Should we ban the use of catch-all patterns (if that's possible at all) and ask the user to configure this in settings.py instead?

We need to make sure add_default_pattern overrides the previous setting; I think set_default_pattern is a better name.
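The override semantics behind the set_default_pattern naming suggestion can be sketched as follows: with "set" semantics, the last writer wins unambiguously. PatternStore is a hypothetical stand-in, and the plugin dataset name is made up for illustration.

```python
class PatternStore:
    """Toy holder for a single default catch-all pattern (illustrative)."""

    def __init__(self):
        self._default = None

    def set_default_pattern(self, config):
        # Replaces whatever was set before: unambiguous override semantics,
        # unlike an "add" that might accumulate conflicting patterns.
        self._default = config


store = PatternStore()
store.set_default_pattern({"type": "MemoryDataset"})        # e.g. from catalog.yml
store.set_default_pattern({"type": "plugin.AzureDataset"})  # e.g. from a plugin
print(store._default["type"])  # plugin.AzureDataset
```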

@marrrcin
Contributor

> @marrrcin In that case, it would be replaced by something like catalog.add_default_pattern() mostly to move this responsibility away from the runners but that's a fair point about plugins needing to add default datasets. This was just a spike to see if this can work, we haven't made a decision either way yet.

When / where would the catalog.add_default_pattern() happen during Kedro's lifecycle?

@merelcht merelcht self-assigned this Nov 24, 2023
@merelcht merelcht linked a pull request Nov 24, 2023 that will close this issue
@merelcht
Member

merelcht commented Dec 7, 2023

Completed in #3332

@merelcht merelcht closed this as completed Dec 7, 2023