Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should kedro-datasets be a namespace package? #1652

Closed
noklam opened this issue Jun 28, 2022 · 8 comments
Closed

Should kedro-datasets be a namespace package? #1652

noklam opened this issue Jun 28, 2022 · 8 comments
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@noklam
Copy link
Contributor

noklam commented Jun 28, 2022

@noklam @MerelTheisenQB Bit late to the party, but... did you all consider/discuss keeping the import path as is? kedro-datasets would just expose the kedro.extras.datasets namespace package, and kedro would ideally exclude that namespace (even though it will probably still work, which is pretty neat from a compatibility perspective). This would mean that the change would largely be transparent to the user.

Originally posted by @deepyaman in kedro-org/kedro-plugins#38 (comment)

@noklam noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jun 28, 2022
@noklam
Copy link
Contributor Author

noklam commented Jun 28, 2022

noklam Pending
@deepyaman I don't think we have discussed the namespace package approach. From my understanding, if we want to do this, we will have a non-namespace package(kedro) + namespace package(kedro-datasets)

I don't know if this works well in general, there are a few people against this approach from the links you posted.
I think the move to the plugin, with the namespace changes, also signifies that it's a less "core" kedro component, potentially more community-driven. Although technically kedro won't work without the dataset installed.
There are also notes in the docs

Use native namespace packages. This type of namespace package is defined in PEP 420 and is available in Python 3.3 and later. This is recommended if packages in your namespace only ever need to support Python 3 and installation via pip.

This also worries me a bit if pip isn't used, especially if we are doing something further deviated from the standard PEP 420. Personally, I will stand on a more conservative side.

@AntonyMilneQB AntonyMilneQB 23 hours ago
@deepyaman We didn't think of this but it's a very interesting idea. Where are the places where that would be less breaking than what we're doing here? I can think of the case where a user defines a custom dataset that inherits from one of our built in ones (and hence does from kedro.extras.datasets import ...). Are there others?

What I think @noklam was asking about here though is a separate question about the documentation, i.e. how do we get these dataset docs in the right place. This is an excellent question and not something we'd thought of either. This actually sounds quite tricky and hard to do well 😬 @noklam please could you make a ticket for this if it's what you meant?

@deepyaman deepyaman 20 hours ago
@AntonyMilneQB It would also be less breaking in cases where people specify the full path to the dataset--which some users inexplicably do. Finally, for better or for worse, new datasets would potentially work out of the box, with partial paths, with older versions of Kedro (that don't have an updated default lookup path).

I think another part of it just comes down to whether you all think sharing the kedro namespace would be nice--the user feels like they're using one cohesive product--or confusing.

@AntonyMilneQB AntonyMilneQB 8 hours ago
@deepyaman All very good points I think. If we keep kedro.extras.datasets as a namespace then eventually I hope we could drop the superfluous extras part anyway rather than maintain it indefinitely for backwards compatibility.

@AntonyMilneQB AntonyMilneQB 8 hours ago
I think sharing the kedro namespace would be nice, although I do see why it would be potentially confusing also. The main motivation for splitting off kedro-datasets is for release/distribution purposes. It's still a part of kedro I think.

deepyaman 34 minutes ago
@AntonyMilneQB @noklam I was just thinking about this more, in the context of something I'm working on--the alternative is to try and disassociate it from kedro as much as possible. Datasets don't actually use anything on the Kedro side, and I'm actually just now thinking about a case where I may just want the generic load/save abstraction that datasets provide. If I'm then leveraging this in a separate library, that really has nothing to do with kedro, it would feel awkward to be referencing the kedro namespace--I'd rather the top-level namespace just be called datasets (not even kedro_datasets).

Happy to talk this through, if it could be helpful for all involved; I think the tradeoffs are worth discussing. :)

@noklam noklam Pending
@deepyaman

Do you know what happens if two packages have the same top-level namespace? Since datasets is one of the most popular libraries (HF) out there. I am not sure how it will be resolved by Python or does pip allow this at all.

import datasets # Which library does it actually import?
https://huggingface.co/docs/datasets/load_hub

@noklam
Copy link
Contributor Author

noklam commented Jun 28, 2022

Just copying the conversation here to document this more transparently.

@antonymilne
Copy link
Contributor

@noklam as I understand it, the import path as it stands is kedro_datasets.datasets.pandas.CSVDataSet. Is there any reason for that rather than kedro_datasets.pandas.CSVDataSet?

@noklam
Copy link
Contributor Author

noklam commented Jun 29, 2022

That's a good point, maybe we just need the top level namespace

@merelcht
Copy link
Member

merelcht commented Jul 6, 2022

Notes from Technical Design session:

Discussion

Pros:

  • Namespacing the package would make the experience of using kedro and datasets better for users, because they would only need to do import ... from kedro and not import .. from kedro and import ... from kedro_datasets
  • We can make the move to the new repo non-breaking this way by keeping the import to kedro.extras.datasets

Cons:

  • This could make our CI/CD setup more complicated
  • If it should be possible to use kedro-datasets separately from Kedro, it might be a bad idea to package it under the "kedro" namespace. However, it's only a super small use case and AbstractDataSet is and will remain a part of core Kedro so it's a very unlikely use case.

Other concerns and questions raised:

  • Can users still use the standalone datacatalog if we namespace kedro-datasets?
    Yes, because in order to use the datacatalog they need Kedro anyway, so this won't change anything to that workflow.
  • Should we rename kedro.extras.datasets to kedro.datasets and DataSet to Dataset?
    Yes, but not straight away. Both of these are breaking changes and should only be done when Kedro 0.19.0 will be released.
    Alternatively, we could alias the datasets which could be a fun exercise. It's still a question of whether it's worth the effort.
  • It might be confusing for users that they can import kedro-datasets through import ... from kedro, but need to install a dataset with pip install kedro-datasets[xxx] and not pip install kedro[xxx].
    We should add a redirect to make this possible
  • How do we map the kedro version to the kedro-datasets version, e.g. when a user does pip install kedro[someDataSet] which version do we install?
    kedro-datasets should be a dependency inside kedro with a strict bound. So when a user is using kedro 0.18.x that will be mapped to kedro-datasets version x.x.x If a user then wants a newer dataset version they can install it manually (and accept the risks that this might break something that they need to fix themselves).

Conclusion

The team reached a consensus to namespace the kedro-datasets package under the kedro namespace.

Implementation

@antonymilne
Copy link
Contributor

antonymilne commented Jul 6, 2022

  • How do we map the kedro version to the kedro-datasets version, e.g. when a user does pip install kedro[someDataSet] which version do we install?
    kedro-datasets should be a dependency inside kedro with a strict bound. So when a user is using kedro 0.18.x that will be mapped to kedro-datasets version x.x.x If a user then wants a newer dataset version they can install it manually (and accept the risks that this might break something that they need to fix themselves).

I don't think we want to map strictly to kedro-datasets==x.y.z, just kedro-datasets~=x.0 (assuming we release kedro-datasets 1.0; otherwise it would be kedro-datasets~=x.y.0). I think this was what Ivan meant by pinning the version.

@noklam
Copy link
Contributor Author

noklam commented Jul 7, 2022

To enable namespacing edit setup.py: https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages

  • When we do the namespace package for kedro-datasets, we also need to rename the current namespace from kedro_datasets.datasets -> kedro.extras.datasets.

These 2 links will be useful for this ticket

While doing this, we should make sure these still works

import kedro
from kedro.extras.datasets import *  # From kedro-dataset
from kedro.io import *  # From kedro

@merelcht
Copy link
Member

merelcht commented Jul 8, 2022

Closing this issue in favour of the ticket for implementing the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Archived in project
Development

No branches or pull requests

3 participants