Access dataset filepath via public API for file-backed datasets  #3929

Open
ElenaKhaustova opened this issue Jun 5, 2024 · 0 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

ElenaKhaustova commented Jun 5, 2024

Description

Users struggle to access and manage dataset filepaths. AbstractDataset has no mandatory filepath attribute and there is no standard API for accessing metadata, so users cannot reliably get a dataset's filepath or tell which dataset version was loaded. APIs also differ across dataset types, which forces users to write custom logic for dataset access and metadata retrieval.

We propose:

  1. Explore the feasibility of implementing a file-backed AbstractDataset and making the filepath attribute mandatory, so that users have a consistent and reliable way to access dataset filepaths.
  2. Develop a standard API for accessing metadata across different dataset types, and decide what the standard metadata should include for each dataset type.
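
As a rough illustration of the two proposals, the sketch below shows what a file-backed base class with a mandatory public filepath and a small metadata method could look like. It does not use Kedro at all: FileBackedDataset, TextDataset, the metadata() method and the keys it returns are hypothetical names chosen for this example, not an existing or agreed API.

```python
from abc import ABC, abstractmethod
from pathlib import Path, PurePosixPath
from typing import Any, Dict, Optional


class FileBackedDataset(ABC):
    """Hypothetical base class for file-backed datasets (not a Kedro class)."""

    def __init__(self, filepath: str, version: Optional[str] = None) -> None:
        # The filepath is a required constructor argument, so every
        # subclass is guaranteed to have one.
        self._filepath = PurePosixPath(filepath)
        self._version = version

    @property
    def filepath(self) -> PurePosixPath:
        """Public, read-only path that callers can rely on."""
        return self._filepath

    def metadata(self) -> Dict[str, Any]:
        """Assumed standard metadata; the keys are illustrative only."""
        return {
            "type": type(self).__name__,
            "filepath": str(self.filepath),
            "version": self._version,
        }

    @abstractmethod
    def load(self) -> Any:
        """Read the data stored at ``filepath``."""

    @abstractmethod
    def save(self, data: Any) -> None:
        """Write ``data`` to ``filepath``."""


class TextDataset(FileBackedDataset):
    """Minimal concrete dataset used only to exercise the base class."""

    def load(self) -> str:
        return Path(str(self.filepath)).read_text()

    def save(self, data: str) -> None:
        Path(str(self.filepath)).write_text(data)


dataset = TextDataset("data/01_raw/notes.txt", version="2024-06-05")
print(dataset.filepath)
print(dataset.metadata())
```

With a contract of this kind, code that consumes datasets would read dataset.filepath without needing to know the concrete dataset type.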

Relates to #1936

Context

  • The APIs of AbstractVersionedDataset and AbstractDataset are inconsistent; only the former has a filepath attribute: "It's kind of weird that when I switch from AbstractDataset to the AbstractVersionedDataset, suddenly the file path appears at that point. Like that feels quite weird to me that doesn't feel right."

  • Users have to take into account the dataset type to be able to get the filepath:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py#L48

  • It is hard to get the filepath and to understand which dataset version was loaded: "It's crazy confusing to actually get the correct file path"

  • Users want a standardised API for accessing different types of datasets, for example file-backed ones. They want to rely on an API contract when using DataCatalog and datasets, but following one is not mandatory today: "We have MlflowArtifactDataset which is a wrapper for any AbstractDataset which logs the dataset automatically in mlflow as an artifact when its save method is called. The lack of a formal AbstractDataset API for file paths leads to inconsistencies, relying heavily on the convention that file paths are included as a hidden property _file_path in the dataset’s implementation. Formalizing this attribute as a public property would enhance reliability and convenience across Kedro’s framework. Otherwise, the potential difficulties in maintaining this might arise with community-maintained or experimental datasets, as it would be super hard to enforce that"

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html

  • Users find it challenging to access critical metadata such as file paths directly through the public API, which often requires delving into less transparent, potentially private API elements. This adds complexity to what could otherwise be straightforward data management tasks.
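
The points above can be made concrete with a sketch of the kind of defensive lookup that consumers end up writing when no public filepath exists. It is a self-contained illustration, not code taken from Kedro or kedro-mlflow: resolve_filepath, the three stand-in classes and the list of private attribute names it probes are all assumptions made for this example.

```python
from pathlib import PurePosixPath
from typing import Any, Optional

# Private attribute names a dataset might use by convention. This list is
# an assumption for illustration; nothing enforces any of these names.
CANDIDATE_ATTRS = ("_filepath", "_file_path", "_path")


def resolve_filepath(dataset: Any) -> Optional[PurePosixPath]:
    """Best-effort lookup of a dataset's path through private attributes."""
    for name in CANDIDATE_ATTRS:
        value = getattr(dataset, name, None)
        if value is not None:
            return PurePosixPath(str(value))
    # The dataset is not file-backed, or it uses a name we did not guess.
    return None


class FileLikeDataset:
    """Stand-in for a dataset that stores a single file path."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


class FolderLikeDataset:
    """Stand-in for a dataset that points at a folder."""

    def __init__(self, path: str) -> None:
        self._path = path


class InMemoryLikeDataset:
    """Stand-in for a dataset with no backing file."""


print(resolve_filepath(FileLikeDataset("data/01_raw/cars.csv")))
print(resolve_filepath(FolderLikeDataset("data/01_raw/partitions")))
print(resolve_filepath(InMemoryLikeDataset()))
```

Every new naming convention means another entry in the candidate list, and a dataset that uses an unlisted name looks like it has no path at all. That fragility is what a formal public property would remove.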

