Research summary of insights for redesigning Kedro's data catalog API #3934
Comments
In section 3 I have two ambitious wishes inspired by Dagster:
Questions/suggestions raised from tech design, for posterity:
Thank you for the questions and concerns raised; we will continue the discussion in ticket #3995 to keep all the context in one place.
Why are we doing this research?
Problem
The DataCatalog API, an older Kedro component, needs a refactor to better align with user needs. GitHub threads reveal confusion with the current API and its io.core and io.data_catalog workings, suggesting a general rethink is necessary.
This research aims to reimagine the DataCatalog as a flexible, intuitive tool for managing datasets, addressing the limitations of the current "frozen" datasets view, and reducing reliance on undocumented private APIs.
Hypothesis
We believe that by enhancing the flexibility and visibility of the DataCatalog, including providing official support for dynamic interactions with datasets, advanced Kedro users and plugin developers will achieve a more intuitive and efficient data management experience, reducing their reliance on undocumented private APIs and improving overall project outcomes.
What do we want to learn?
Objectives
Identify the specific limitations and challenges faced by advanced Kedro users and plugin developers when interacting with the current DataCatalog, particularly its "frozen" datasets view, and their reliance on undocumented private APIs.
Understand how these identified limitations impact the workflow, efficiency, and outcomes of projects using Kedro, particularly focusing on the user experience and the technical constraints they encounter.
Research Questions
Value Prop
Research Methodology
Who are our Advanced Users?
We define advanced users as those with experience in managing and accessing Kedro DataCatalog, using the "frozen" datasets view, and seeking undocumented private APIs for specific project needs.
These include 6 internal users/groups and 4 external users/groups.
Persona Archetypes
• Workflows include configuring the catalog, loading data, performing data analysis, and saving intermediate and final results.
• Workflows include configuring the catalog, validating it, ensuring it contains the correct data, accessing metadata, loading datasets, performing analysis, running the pipeline, and obtaining pipeline outputs.
• Workflows include accessing the catalog and datasets' metadata and modifying the catalog on the fly (dataset attributes) after it is created.
• Struggle with understanding debug errors.
• Are required to install all dependencies even for unused datasets.
• Struggle to find datasets within the catalog, particularly when dealing with a large number of datasets.
• Express the need for an improved visual representation of the catalog when printing.
• Express the need for autocomplete functionality.
• Challenges with accessing and managing dataset filepaths.
• Challenges with accessing and managing dataset filepaths.
• Confusion with FrozenDatasets public API.
• Complications in dataset pattern resolution.
• Complexity of accessing the catalog from Kedro session.
Overall Observations
1. Ease of Multi-Source Configuration
2. Frequent Use of ._get_dataset()
3. Limited Appeal of _FrozenDatasets
4. Catalog Modification Restrictions Do Not Work
5. Advanced Users Seek Complex Features
Synthesis
- Users express the need for an API to save and load catalogs after compilation or modification by converting catalogs to YAML format and back.
- Users encounter difficulties loading pickled DataCatalog objects when the Kedro version at load time differs from the version used for pickling, leading to compatibility issues. They require a way to serialize and deserialize the DataCatalog object that does not depend on the Kedro version.
- Users need to_yaml() and from_yaml() methods to avoid issues with pickling catalog objects.
- Function to compile catalog and showcase the result.
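The round trip users ask for could look roughly like the sketch below. It is only an illustration of serializing plain configuration instead of pickling live objects: `SketchCatalog`, `to_text()` and `from_text()` are hypothetical names, not Kedro API, and JSON stands in for YAML so that the example needs only the standard library.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SketchCatalog:
    """Hypothetical stand-in for a catalog; this is not the Kedro DataCatalog."""

    # Maps dataset name -> plain configuration dict (type, filepath, ...).
    datasets: dict = field(default_factory=dict)

    def to_text(self) -> str:
        # Only plain configuration is written, never live Python objects,
        # so the output does not depend on the library version that wrote it.
        return json.dumps(self.datasets, indent=2, sort_keys=True)

    @classmethod
    def from_text(cls, text: str) -> "SketchCatalog":
        # Rebuild the catalog from configuration alone.
        return cls(datasets=json.loads(text))


catalog = SketchCatalog(
    {"cars": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/cars.csv"}}
)
restored = SketchCatalog.from_text(catalog.to_text())
```

Because the serialized form is just configuration, the same text can be inspected, diffed, or edited by hand before it is loaded again.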
Exploration needed
New functionality
- Users report that accessing the catalog from a Kedro session is complex and requires an understanding of framework details, such as project creation and environment setup.
- Users report that acquiring the catalog involves writing a lot of code and navigating parameters that are outside the context of their work.
- Users find creating a Kedro session too heavy for simple catalog reading tasks.
- When creating a session, users have to care about the path to a Kedro project, env and other parameters which might be irrelevant for their use cases (Vizro).
- Creating a Kedro session seems too heavy for just reading the catalog.
- A way to create a catalog without instantiating a Kedro session, for read-only purposes.
- An easier method to access the catalog directly, without the need for a session or the complications of hooks, would significantly improve usability.
- Resolution logic residing in the private _get_dataset() method forces people to stick to private API since using the public exists() method instead is not straightforward.
- Developers often forget that dataset factory resolution requires _get_dataset(), leading to further bugs.
- Resolution logic is duplicated between the DataCatalog class and the CLI, making it harder to maintain.
- Explore the feasibility of implementing simpler resolution logic for dataset factories to ensure that datasets are resolved when needed without iterating through all of them.
- Enhance documentation for advanced users to clearly explain the dataset resolution process and the usage of dataset factories.
- Developers often forget that dataset factory resolution needs _get_dataset(), which leads to bugs.
- Dataset factory resolution needs the _get_dataset() method, which is why they call exists() even when it is not logically required.
- Dataset factories are resolved lazily - design choice on Kedro side.
- Provide an opportunity to call datasets by their exact names - get dataset by name function.
- The use of double underscores instead of dots for namespaces in the catalog is unintuitive for users.
- Attribute Replacement: C1 finds the replacement of characters like “.” or “@” with “__” in dataset names to be unclean and prefers calling datasets by their exact names.
- Increase awareness of the FrozenDatasets API among users through tutorials and documentation updates. Highlight the capabilities of the public API and provide guidance on how to use it effectively for dataset management and retrieval.
- Consider allowing DataCatalog modifications and getting rid of _FrozenDatasets - this is a broader question related to another issue that will be linked later.
- _FrozenDatasets: the class name itself starts with an underscore, so it doesn't really feel safe to loop over catalog.datasets.
- Not easy to iterate over all of the datasets: the public API does not allow it, so you have to iterate via names and use the private _get_dataset() method.
- With _FrozenDatasets you can only access datasets as attributes, not via a get_by_name() method.
- The public API is limited to searching by name, save and load, while access to more detailed metadata is not available.
- Implement "pretty printing" function specifically tailored to improve the visual representation of the catalog when printed or displayed.
- Need catalog pretty printing function.
- Function to compile catalog and showcase the result.
1. Catalog serialization and deserialization support
Insight:
Action:
Pain-point:
Feature request:
2. Simplify the way to access catalog
Insight:
Currently, there are two ways of accessing the catalog: use the DataCatalog.from_config() method, or instantiate a KedroSession, load the context and access the catalog from there.
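For reference, the two routes can be sketched as below. This assumes a reasonably recent Kedro release and uses the `DataCatalog.from_config()` classmethod; the wrapper function names are made up for this example, and keyword arguments may differ between Kedro versions. Kedro is imported inside the functions so the sketch itself loads without it.

```python
from pathlib import Path


def catalog_from_config(config):
    """Route 1: build the catalog directly from a configuration dictionary."""
    from kedro.io import DataCatalog  # lazy import: only needed when called

    return DataCatalog.from_config(config)


def catalog_from_session(project_path, env=None):
    """Route 2: go through a session and its context to reach the catalog."""
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path)
    # Project setup has to run before a session can be created.
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path, env=env) as session:
        return session.load_context().catalog
```

The second route needs a project path, an optional environment and the full project setup even when the caller only wants to read catalog entries, which is the overhead users describe.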
Action:
Pain-point:
Feature request:
3. Refactor dataset factory resolution logic
Insight:
Action:
Pain-point:
Feature request:
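To make the resolution discussion concrete, here is a minimal sketch of on-demand dataset factory resolution. It is not Kedro's implementation: the `resolve()` helper, the regex-based matching and the rule of preferring the pattern with the most literal characters are all assumptions made for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Turn a factory pattern such as '{namespace}.{name}_csv' into a regex."""
    # re.split with one capture group alternates: literal, placeholder, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    return re.compile(regex)


def literal_length(pattern):
    """Count the characters of a pattern that are not placeholders."""
    return len(re.sub(r"\{\w+\}", "", pattern))


def resolve(name, explicit, patterns):
    """Return the config for `name`, consulting patterns only when needed."""
    if name in explicit:
        return explicit[name]
    # Assumption: more literal characters means a more specific pattern.
    for pattern in sorted(patterns, key=literal_length, reverse=True):
        match = pattern_to_regex(pattern).fullmatch(name)
        if match:
            # Fill the placeholders of the template with the matched values.
            return {
                key: value.format(**match.groupdict())
                if isinstance(value, str)
                else value
                for key, value in patterns[pattern].items()
            }
    return None


explicit = {"cars": {"type": "pandas.CSVDataset", "filepath": "data/cars.csv"}}
patterns = {
    "{name}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{name}.csv"},
    "{namespace}.{name}_csv": {
        "type": "pandas.CSVDataset",
        "filepath": "data/{namespace}/{name}.csv",
    },
}
resolved = resolve("sales.orders_csv", explicit, patterns)
```

A single public entry point of this kind would let both the catalog and the CLI resolve one name at a time, without iterating over every dataset and without reaching into a private method.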
4. Improve the way to access namespaced datasets with _FrozenDatasets API
Insight:
Action:
Pain-point:
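The naming issue raised for namespaced datasets can be illustrated with two small stand-in classes. Both are hypothetical and only mimic the behaviour described in this research: `AttributeView` exposes datasets as attributes and therefore has to rewrite names, while `MappingView` looks datasets up by their exact names.

```python
import re
from collections.abc import Mapping


class AttributeView:
    """Hypothetical attribute-style view of datasets."""

    def __init__(self, datasets):
        for name, dataset in datasets.items():
            # Attribute names cannot contain '.' or '@', so such characters
            # are replaced with a double underscore.
            setattr(self, re.sub(r"\W+", "__", name), dataset)


class MappingView(Mapping):
    """Hypothetical read-only view that keeps the exact dataset names."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def __getitem__(self, name):
        return self._datasets[name]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)


datasets = {"sales.orders@pandas": "orders-dataset"}
by_attribute = AttributeView(datasets).sales__orders__pandas
by_name = MappingView(datasets)["sales.orders@pandas"]
```

With the mapping-style view the name used in code is the same string that appears in the catalog configuration, and iteration over all datasets comes for free.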
5. Exploring DataCatalog as a standalone component for broader adoption and integration
Insight:
Action:
6. Enhance _FrozenDatasets public API
Insight:
Action:
Consider allowing DataCatalog modifications and getting rid of _FrozenDatasets - this is a broader question related to another issue that will be linked later.
Pain-point:
_FrozenDatasets: the class name itself starts with an underscore, so it doesn't really feel safe to loop over catalog.datasets.
Not easy to iterate over all of the datasets: the public API does not allow it, so you have to iterate via names and use the private _get_dataset() method.
With _FrozenDatasets you can only access datasets as attributes, not via a get_by_name() method.
7. Pretty printing
Insight:
Action:
Pain-point:
Cannot compile catalog and showcase the result as compilation happens at runtime.
Feature request:
8. Autocompletion support for accessing datasets
Insight:
Feature request:
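One possible route to autocompletion in interactive environments is IPython's key-completion hook: an object that defines `_ipython_key_completions_()` gets its keys offered when the user types `obj["` and presses Tab. The sketch below assumes a dictionary-style catalog; `CompletableCatalog` is a hypothetical class, not part of Kedro.

```python
from collections.abc import Mapping


class CompletableCatalog(Mapping):
    """Hypothetical catalog-like mapping that supports key completion in IPython."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def __getitem__(self, name):
        return self._datasets[name]

    def __iter__(self):
        return iter(self._datasets)

    def __len__(self):
        return len(self._datasets)

    def _ipython_key_completions_(self):
        # IPython and Jupyter call this hook to suggest keys for obj["<Tab>.
        return list(self._datasets)


catalog = CompletableCatalog({"cars": "cars-dataset", "sales.orders@pandas": "orders"})
suggestions = catalog._ipython_key_completions_()
```

Since the hook returns the exact dataset names, completion also works for names containing characters such as "." or "@" that cannot be used as attribute names.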