Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Narrow down kedro-datasets public API #139

Open
astrojuanlu opened this issue Mar 22, 2023 · 4 comments
Open

Narrow down kedro-datasets public API #139

astrojuanlu opened this issue Mar 22, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@astrojuanlu
Copy link
Member

Description

Right now there are more or less two ways of importing any given dataset:

>>> from kedro_datasets.biosequence import BioSequenceDataSet
>>> from kedro_datasets.biosequence.biosequence_dataset import BioSequenceDataSet as BioSequenceDataSet2
>>> BioSequenceDataSet is BioSequenceDataSet2
True

I think we should communicate to users that only the former belongs to the public API.

Context

Reasons to do this:

Possible Implementation

Renaming the appropriate submodules to prepend an underscore. For example, kedro_datasets/biosequence/biosequence_dataset.py would become kedro_datasets/biosequence/_biosequence_dataset.py.

Possible Alternatives

Haven't considered any alternatives.

@astrojuanlu astrojuanlu added the enhancement New feature or request label Mar 22, 2023
@antonymilne
Copy link
Contributor

antonymilne commented Mar 24, 2023

Love this idea, and I've thought similar things in the past. Actually I think similar arguments could be made to parts of framework too (e.g. pipeline import springs to mind), where the import paths and public API are not great - see kedro-org/kedro#712. But kedro-datasets definitely seems like a good place to start. It's a breaking change but probably not that breaking in this case, because the vast majority of references to datasets will be through the catalog, in which case the canonical way to refer to them is through the higher-level type: biosequence.BioSequenceDataSet rather than the lower-level type: biosequence.biosequence_dataset.BioSequenceDataSet (which would still work at the moment I think, and I suspect some people do it that way unintentionally).

The one thing that's put me off this in the past is that the only way I could think of doing it is as you suggest, by prepending underscores, which I just find looks a bit weird even though. But it's Python convention I know, so that's not a strong argument against it... Also I actually don't know if there's a problem using autocomplete imports at the moment? In PyCharm if I try to automatically import BioSequenceDataSet it already does it as the higher-level from kedro-datasets.biosequence import BioSequenceDataSet and doesn't give the option of the lower-level import. I don't know what mechanism it uses to work that out or how it works in other IDEs though. e.g. are there IDEs which don't currently work that way but would change to working the way we want if we added the _?

@astrojuanlu
Copy link
Member Author

I think the IDE might work this out by seeing the __all__ attribute in __init__.py - which is excellent, but still potentially someone could consider importing from the submodule.

To me the underscore convention looks natural, but that's a matter of taste I guess 😄

@deepyaman
Copy link
Member

deepyaman commented Apr 24, 2023

Technically a breaking change, but no objections (so far), so need to migrate to the core Kedro repo and add to the 0.19 milestone.

@astrojuanlu
Copy link
Member Author

Does this have to be on 0.19 though? Since kedro-datasets is its own package, I was thinking that maybe this would belong to kedro-datasets 2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: No status
Development

No branches or pull requests

4 participants