Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve _FrozenDatasets class #3610

Closed
merelcht opened this issue Feb 9, 2024 · 1 comment
Closed

Improve _FrozenDatasets class #3610

merelcht opened this issue Feb 9, 2024 · 1 comment

Comments

@merelcht
Copy link
Member

merelcht commented Feb 9, 2024

Description

The _FrozenDatasets class is the class that returns catalog.datasets, which is supposed to be the official (and public) way for users to access datasets from the catalog through the API. However, it's been flagged by several team members that this class isn't very easy to work with:

  1. It has no simple interface for the datasets it contains. Currently, the only linter-friendly ways are to use vars(catalog.datasets)[dataset_name] or catalog.datasets.__dict__[dataset_name].
  2. The class is poorly documented; it would be good to have docstrings as the purpose of this class is not easy to grok.
  3. There is a lot going on inside __init__, delegating most of this to a few new, well-documented methods would also make this class much easier to understand.

On top of that we have evidence that users frequently resort to using private methods and attributes to access datasets e.g. catalog._datasets and trough _get_dataset(). So it also seems like the class isn't sufficiently allowing users to access datasets through the API. See e.g. https://github.com/Galileo-Galilei/kedro-mlflow/blob/e88679938b1d4c7633c3f631f6b402ff11ab61fe/kedro_mlflow/framework/hooks/mlflow_hook.py#L148

Observations about _FrozenDatasets

Context

The reason why it's not straightforward to fetch datasets from the catalog directly, is because the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving the data, but not modify in any way. The _FrozenDatasets class was added to make it possible to have tab completion for catalog datasets in ipython or jupyter sessions. The PR that added this functionality is on private-kedro: https://github.com/quantumblacklabs/private-kedro/pull/84/files. It's important to note that the _FrozenDatasets needs to be immutable, if users want to inject data they should use hooks.

Improvement suggestions

  1. Make _FrozenDatasets inherit from UserDict

Important

The above suggestions are based on ideas from several Kedro engineers see e.g. #1778. However, they are mostly solutions to improve developer experience, but we need a clear view on what user needs are as well. Any implementation should be preceded by user research: #1978

@noklam
Copy link
Contributor

noklam commented Feb 9, 2024

It's already mentioned that FrozenDataset is immutable because it is a public interface. I also want to mention that catalog.datasets.xxx is the easiest way to get auto-completion works on a IDE, which is a lot easy to type catalog.load("namespace.x.y.z.a.b.c") especially on larger pipeline. And because . can be used in catalog, and it conflicts with what catalog.datasets.namespace.dataset mean, so we replace . with __.

See previous PRs:

@kedro-org kedro-org locked and limited conversation to collaborators Mar 28, 2024
@merelcht merelcht converted this issue into discussion #3752 Mar 28, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

2 participants