Should `kedro-datasets` be a namespace package? #1652

noklam · 2022-06-28T16:12:46Z

@noklam @MerelTheisenQB Bit late to the party, but... did you all consider/discuss keeping the import path as is? kedro-datasets would just expose the kedro.extras.datasets namespace package, and kedro would ideally exclude that namespace (even though it will probably still work, which is pretty neat from a compatibility perspective). This would mean that the change would largely be transparent to the user.

Originally posted by @deepyaman in kedro-org/kedro-plugins#38 (comment)

The text was updated successfully, but these errors were encountered:

noklam · 2022-06-28T16:16:32Z

noklam Pending
@deepyaman I don't think we have discussed the namespace package approach. From my understanding, if we want to do this, we will have a non-namespace package(kedro) + namespace package(kedro-datasets)

I don't know if this works well in general, there are a few people against this approach from the links you posted.
I think the move to the plugin, with the namespace changes, also signifies that it's a less "core" kedro component, potentially more community-driven. Although technically kedro won't work without the dataset installed.
There are also notes in the docs

Use native namespace packages. This type of namespace package is defined in PEP 420 and is available in Python 3.3 and later. This is recommended if packages in your namespace only ever need to support Python 3 and installation via pip.

This also worries me a bit if pip isn't used, especially if we are doing something further deviated from the standard PEP 420. Personally, I will stand on a more conservative side.

@AntonyMilneQB AntonyMilneQB 23 hours ago •
@deepyaman We didn't think of this but it's a very interesting idea. Where are the places where that would be less breaking than what we're doing here? I can think of the case where a user defines a custom dataset that inherits from one of our built in ones (and hence does from kedro.extras.datasets import ...). Are there others?

What I think @noklam was asking about here though is a separate question about the documentation, i.e. how do we get these dataset docs in the right place. This is an excellent question and not something we'd thought of either. This actually sounds quite tricky and hard to do well 😬 @noklam please could you make a ticket for this if it's what you meant?

@deepyaman deepyaman 20 hours ago
@AntonyMilneQB It would also be less breaking in cases where people specify the full path to the dataset--which some users inexplicably do. Finally, for better or for worse, new datasets would potentially work out of the box, with partial paths, with older versions of Kedro (that don't have an updated default lookup path).

I think another part of it just comes down to whether you all think sharing the kedro namespace would be nice--the user feels like they're using one cohesive product--or confusing.

@AntonyMilneQB AntonyMilneQB 8 hours ago
@deepyaman All very good points I think. If we keep kedro.extras.datasets as a namespace then eventually I hope we could drop the superfluous extras part anyway rather than maintain it indefinitely for backwards compatibility.

@AntonyMilneQB AntonyMilneQB 8 hours ago
I think sharing the kedro namespace would be nice, although I do see why it would be potentially confusing also. The main motivation for splitting off kedro-datasets is for release/distribution purposes. It's still a part of kedro I think.

deepyaman 34 minutes ago
@AntonyMilneQB @noklam I was just thinking about this more, in the context of something I'm working on--the alternative is to try and disassociate it from kedro as much as possible. Datasets don't actually use anything on the Kedro side, and I'm actually just now thinking about a case where I may just want the generic load/save abstraction that datasets provide. If I'm then leveraging this in a separate library, that really has nothing to do with kedro, it would feel awkward to be referencing the kedro namespace--I'd rather the top-level namespace just be called datasets (not even kedro_datasets).

Happy to talk this through, if it could be helpful for all involved; I think the tradeoffs are worth discussing. :)

@noklam noklam Pending
@deepyaman

Do you know what happens if two packages have the same top-level namespace? Since datasets is one of the most popular libraries (HF) out there. I am not sure how it will be resolved by Python or does pip allow this at all.

import datasets # Which library does it actually import?
https://huggingface.co/docs/datasets/load_hub

noklam · 2022-06-28T16:17:01Z

Just copying the conversation here to document this more transparently.

antonymilne · 2022-06-29T09:24:13Z

@noklam as I understand it, the import path as it stands is kedro_datasets.datasets.pandas.CSVDataSet. Is there any reason for that rather than kedro_datasets.pandas.CSVDataSet?

noklam · 2022-06-29T11:44:43Z

That's a good point, maybe we just need the top level namespace

merelcht · 2022-07-06T15:06:17Z

Notes from Technical Design session:

Discussion

Pros:

Namespacing the package would make the experience of using kedro and datasets better for users, because they would only need to do import ... from kedro and not import .. from kedro and import ... from kedro_datasets
We can make the move to the new repo non-breaking this way by keeping the import to kedro.extras.datasets

Cons:

This could make our CI/CD setup more complicated
If it should be possible to use kedro-datasets separately from Kedro, it might be a bad idea to package it under the "kedro" namespace. However, it's only a super small use case and AbstractDataSet is and will remain a part of core Kedro so it's a very unlikely use case.

Conclusion

The team reached a consensus to namespace the kedro-datasets package under the kedro namespace.

Implementation

To enable namespacing edit setup.py: https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages
For now implement all changes in a non-breaking way
Renaming from kedro.extras.datasets to kedro.datasets and from DataSet to Dataset will be done when Kedro 0.19.0 is going to be released.

antonymilne · 2022-07-06T15:20:05Z

How do we map the kedro version to the kedro-datasets version, e.g. when a user does pip install kedro[someDataSet] which version do we install?
kedro-datasets should be a dependency inside kedro with a strict bound. So when a user is using kedro 0.18.x that will be mapped to kedro-datasets version x.x.x If a user then wants a newer dataset version they can install it manually (and accept the risks that this might break something that they need to fix themselves).

I don't think we want to map strictly to kedro-datasets==x.y.z, just kedro-datasets~=x.0 (assuming we release kedro-datasets 1.0; otherwise it would be kedro-datasets~=x.y.0). I think this was what Ivan meant by pinning the version.

noklam · 2022-07-07T09:39:43Z

To enable namespacing edit setup.py: https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages

When we do the namespace package for kedro-datasets, we also need to rename the current namespace from kedro_datasets.datasets -> kedro.extras.datasets.

These 2 links will be useful for this ticket

Mixing of namespace package and non-namespace package
- Mixing namespace/non-namespace packaging is not advised, but possible. Should we also make kedro a namespace package in this case?
- The key different is you omit the __init__.py and let setuptools to do the works to figure out where is the package
Python Standard Doc for namespace package

While doing this, we should make sure these still works

import kedro
from kedro.extras.datasets import *  # From kedro-dataset
from kedro.io import *  # From kedro

merelcht · 2022-07-08T10:39:33Z

Closing this issue in favour of the ticket for implementing the change.

noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jun 28, 2022

merelcht added this to the Package kedro.extras.datasets into its own package milestone Jun 30, 2022

This was referenced Jul 7, 2022

Namespace kedro-datasets #1693

Closed

Make Kedro instantiate datasets from kedro-datasets with higher priority than kedro.extras.datasets #1494

Closed

merelcht closed this as completed Jul 8, 2022

noklam mentioned this issue Aug 4, 2022

kedro-datasets: dependencies and package structure. Are we doing the right thing? #1758

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should `kedro-datasets` be a namespace package? #1652

Should `kedro-datasets` be a namespace package? #1652

noklam commented Jun 28, 2022

noklam commented Jun 28, 2022

noklam commented Jun 28, 2022

antonymilne commented Jun 29, 2022

noklam commented Jun 29, 2022

merelcht commented Jul 6, 2022 •

edited

Loading

antonymilne commented Jul 6, 2022 •

edited

Loading

noklam commented Jul 7, 2022

merelcht commented Jul 8, 2022

Should kedro-datasets be a namespace package? #1652

Should kedro-datasets be a namespace package? #1652

Comments

noklam commented Jun 28, 2022

noklam commented Jun 28, 2022

noklam commented Jun 28, 2022

antonymilne commented Jun 29, 2022

noklam commented Jun 29, 2022

merelcht commented Jul 6, 2022 • edited Loading

Discussion

Conclusion

Implementation

antonymilne commented Jul 6, 2022 • edited Loading

noklam commented Jul 7, 2022

merelcht commented Jul 8, 2022

Should `kedro-datasets` be a namespace package? #1652

Should `kedro-datasets` be a namespace package? #1652

merelcht commented Jul 6, 2022 •

edited

Loading

antonymilne commented Jul 6, 2022 •

edited

Loading