Re-design io.core and io.data_catalog #1778

Open
antonymilne opened this issue Aug 10, 2022 · 9 comments

antonymilne (Contributor) commented Aug 10, 2022

Spun out of #1691 (comment)... Let's collect ideas here on the current problems with io. To me it feels like we've neglected it, and it's ripe for a re-design.

Note. Like the configuration overhaul, some of this would be non-breaking (behind-the-scenes implementation) and some would be breaking (changes to public API). In theory we're free to do as we please with any function starting with _, i.e. we can remove, rename and change arguments at will. In practice, however, some of these might be very commonly used (e.g. self._get_load_path() is used in lots of datasets), so changes to them should not be regarded as non-breaking.


#1691 and #1580 are actually just symptoms of a more fundamental underlying issue: the API and underlying workings of io.core and io.data_catalog are very confusing and should be rethought in general. These are very old components in kedro, and maybe some of the decisions originally made about their design should be revised. There are also very likely old bits of code that could now be removed or renamed (e.g. who would guess that something named add_feed_dict is used to add parameters to the catalog?). It feels like tech debt rather than intentional design currently.

I don't think they're massively wrong as they stand, but I think it would be a good exercise to go through them and work out exactly what functionality we should expose in the API and how we might like to rework them. e.g. in the case raised here there is quite a bit of confusion about how to get the filepath:

  • catalog.datasets is presumably the "official" route to get a dataset rather than _get_dataset, but catalog.datasets doesn't allow a namespaced dataset to be accessed without doing getattr. There are some very subtle and non-obvious differences between datasets and _get_dataset, and then there's also catalog._data_sets (which I think might just be a historical leftover... but not sure). In Improve resume pipeline suggestion for SequentialRunner #1795 @jmholzer used vars(catalog.datasets)[dataset_name]. All of these routes are sketched side by side after this list.
  • it also seems at a glance that _filepath is only defined for versioned datasets, which seems odd
  • to actually get the correct versioned filepath it's even harder: in our datasets we do get_filepath_str(self._get_load_path(), self._protocol), which is pretty obscure. Similar to Refactor load version logic #1654
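
To make the confusion concrete, here is a minimal sketch of those routes side by side (assuming a populated catalog; the dataset name is illustrative):

    from kedro.io import DataCatalog, MemoryDataSet

    catalog = DataCatalog({"my_dataset": MemoryDataSet(data=42)})

    # All of these reach the same dataset object today, each with caveats:
    ds = catalog.datasets.my_dataset              # attribute access; breaks for namespaced names
    ds = getattr(catalog.datasets, "my_dataset")  # needed once the name isn't a valid identifier
    ds = vars(catalog.datasets)["my_dataset"]     # the workaround used in #1795
    ds = catalog._get_dataset("my_dataset")       # protected, but the most robust route today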

So I think we should look holistically at the structures involved here and work out what the API should look like, so there's one clear way to access the things that people need to access. I actually don't think this is such a huge task. Then we can tell much more easily whether we need any new functionality in these structures (like a catalog.dumps) or whether it's just a case of better organising, documenting and clarifying what we already have.

antonymilne (Contributor, Author) commented Aug 10, 2022

@noklam also commented that we should consider what actually belongs to AbstractDataSet and what belongs to the implementations. Just to bring @noklam's comment to life a bit more, since it's something I've often thought about in the past too. We have the following bit of code repeated 20 times throughout our datasets:

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

and the following is repeated 37 times:

load_path = get_filepath_str(self._get_load_path(), self._protocol)

_release is not an abc.abstractmethod so doesn't have to be supplied. Why does it exist separately in so many datasets? Why do we need to access so many protected members (e.g. self._fs, self._get_load_path(), etc.)?

Is there anything we can do to make it easier to define a custom dataset? e.g. why is _describe a required method? Overall it feels to me like we have more boilerplate in dataset implementations than we really need; a rough sketch of one possible consolidation follows.
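
As an illustration of the kind of consolidation being suggested (the class and property names are hypothetical, not an agreed design), the repeated fsspec plumbing could live in one intermediate base class:

    from kedro.io.core import AbstractVersionedDataSet, get_filepath_str

    class _FsspecVersionedDataSet(AbstractVersionedDataSet):
        """Hypothetical base class that owns the fsspec plumbing,
        so concrete datasets no longer repeat it."""

        def _release(self) -> None:
            # Hoisted here once instead of being copied into ~20 datasets.
            super()._release()
            self._invalidate_cache()

        def _invalidate_cache(self) -> None:
            """Invalidate underlying filesystem caches."""
            filepath = get_filepath_str(self._filepath, self._protocol)
            self._fs.invalidate_cache(filepath)

        @property
        def _load_path_str(self) -> str:
            # One obvious spelling for the line currently repeated 37 times.
            return get_filepath_str(self._get_load_path(), self._protocol)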

antonymilne added the Type: Parent Issue and Stage: Technical Design 🎨 labels on Aug 10, 2022

noklam (Contributor) commented Aug 23, 2022

Adding this while I am looking at #1768 and some object-store related issues. Currently we actually put this into each CustomDataSet, where it needs to access self._fs; in at least 99% of cases it is copied & pasted, identical to this:

exists_function=self._fs.exists,
glob_function=self._fs.glob,

A potential solution for #1768 may be to pass some arguments into this line, but there is no easy way to pass in any arguments. I am also not sure how glob works across all the different filesystems; it is actually surprising that it currently works. I guess glob is an implicit API across different filesystems?

version_paths = sorted(self._glob_function(pattern), reverse=True)
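
For context, the two hooks quoted above get wired up in each dataset's constructor, roughly like this (an abbreviated sketch of the common pattern, not any single real dataset; _load/_save/_describe omitted):

    from pathlib import PurePosixPath

    import fsspec

    from kedro.io.core import AbstractVersionedDataSet, get_protocol_and_path

    class SomeDataSet(AbstractVersionedDataSet):
        def __init__(self, filepath, version=None, credentials=None, fs_args=None):
            protocol, path = get_protocol_and_path(filepath, version)
            self._protocol = protocol
            self._fs = fsspec.filesystem(protocol, **(credentials or {}), **(fs_args or {}))
            super().__init__(
                filepath=PurePosixPath(path),
                version=version,
                exists_function=self._fs.exists,  # copied verbatim in nearly every dataset
                glob_function=self._fs.glob,      # later relied on by version resolution
            )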

jmholzer (Contributor) commented Sep 5, 2022

A really important issue IMO and a great write-up.

catalog.datasets is presumably the "official" route to get a dataset rather than _get_dataset, but catalog.datasets doesn't allow a namespaced dataset to be accessed without doing getattr. There are some very subtle and non-obvious differences between datasets and _get_dataset, and then there's also catalog._data_sets (which I think might just be a historical leftover... but not sure).

I think the class whose instances catalog.datasets returns (_FrozenDatasets) is a good candidate for refactoring:

  1. It has no simple interface for the datasets it contains. Currently, the only linter-friendly ways are vars(catalog.datasets)[dataset_name] or catalog.datasets.__dict__[dataset_name]. I don't feel this level of (read) encapsulation is merited for an object assigned to a public attribute (see the sketch after this list).
  2. The class is poorly documented; it would be good to have docstrings, as the purpose of this class is not easy to grok.
  3. There is too much going on inside __init__; delegating most of it to a few new, well-documented methods would also make this class much easier to understand.
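
As a sketch of what point 1 could look like (hypothetical, not the current implementation), a dict-style interface would make access linter-friendly, including for namespaced names:

    class _FrozenDatasets:
        """Read-only, dict-like view of the catalog's datasets (sketch only)."""

        def __init__(self, datasets):
            # Point 3: keep __init__ thin; any name mangling or merging of
            # collections would move into small, documented helper methods.
            self._datasets = dict(datasets)

        def __getitem__(self, dataset_name):
            # Point 1: catalog.datasets["namespace.cars"] works directly,
            # with no vars()/__dict__ tricks and no getattr.
            return self._datasets[dataset_name]

        def __iter__(self):
            return iter(self._datasets)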

merelcht (Member) commented:

Related: #1936

noklam (Contributor) commented Oct 18, 2022

Added: Expose load version information when load_version=None

Another symptom here is:

  • self.resolve_load_version isn't doing anything; the real method we want people to use is self._get_load_path, but it's not clearly defined in the API. We only tell people they need to override _load and _save. As a result, the redundant resolve_load_version call is there to signal that this needs to be used, but I am not convinced this is an ideal way to do so. (A sketch of what a correct _load actually requires follows.)
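
To illustrate: the documented contract for a custom dataset is just "override _load and _save", yet a correct versioned _load also needs the protected machinery (a sketch; pandas is used purely for illustration, and _save/_describe are omitted):

    import pandas as pd

    from kedro.io.core import AbstractVersionedDataSet, get_filepath_str

    class MyDataSet(AbstractVersionedDataSet):
        def _load(self) -> pd.DataFrame:
            # Neither _get_load_path nor get_filepath_str appears in the
            # documented override contract, yet both are needed here:
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
            with self._fs.open(load_path, mode="rb") as f:
                return pd.read_csv(f)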

merelcht (Member) commented Oct 26, 2022

Notes from Technical Design:

Catalog API

Datasets API

  • The invalidate_cache method relies on fsspec, which is why it's not part of AbstractDataSet. It might be possible to add it to AbstractVersionedDataSet.
  • As a team we need to get a better understanding of how versioning works. It's one of the most complex aspects of datasets and the catalog and also where we expect to be able to improve a lot.

A related exercise is to completely re-design how the catalog and datasets work: #1981

astrojuanlu (Member) commented:

One more thing (yes, I think about this issue and #1936 several times a day, every single day):

The reason it's not straightforward to fetch datasets from the catalog directly is that the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, not for modifying it in any way.

I really love this design, and I clearly see how DataCatalog.load and .save could be the only interfaces you need, hence hiding the datasets from users.

But it turns out users want the underlying datasets for all sorts of things. Just today I had two users ask me how they could access the underlying dataset object for various "wicked" uses of Kedro (one was related to dynamic pipelines and the other to kedro-mlflow). I seem to always forget about the .datasets property, so I always recommend the protected catalog._get_dataset.
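
The intended public surface really is this narrow, which is why everything else feels like a workaround (names illustrative):

    df = catalog.load("cars")    # the catalog hands you data...
    catalog.save("cars", df)     # ...and takes data back

    dataset = catalog._get_dataset("cars")  # but plugins keep needing the object itself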

But if you think it's only users who should know better that are using this protected method, hold your horses: our own kedro-viz does, too!

https://github.com/kedro-org/kedro-viz/blob/a258e1fe525cd0fed7cfbfb69131877d49b6bdcf/package/kedro_viz/data_access/repositories/catalog.py#L116-L126

So I think we should definitely explore the possibility of "opening" this abstraction.

antonymilne (Contributor, Author) commented Oct 21, 2023

I am glad this is still on the radar because it also still pops into my head on a daily basis 😀

Just to add some more context and another couple of data points... Hiding datasets does indeed make sense when using kedro as a framework, but it makes life extremely difficult when it comes to writing plugins/extensions/integrations with kedro. There are IMO many non-nefarious reasons to want to pull dataset information from the catalog (especially now that there's the metadata attribute, I would have thought), and there should really be a public method for this.

I always recommend catalog._get_dataset too, because I believe it is the best method currently available, and, as per @astrojuanlu, it is the one used in kedro-viz, which gives me some confidence that it's not going to disappear any time soon. I would even go so far as to say that I consider this method "almost public" already; when writing the kedro integration in vizro I opted to use it, just as I have recommended its use within kedro core itself.

astrojuanlu (Member) commented:

Another flaw of the current inheritance model is that if users want to configure a different versioning strategy for their datasets, they have to create a custom versioned dataset and hack or break the inheritance chain. xref #1979
