Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make kedro-datasets a dependency of kedro? #1776

Closed
antonymilne opened this issue Aug 10, 2022 · 7 comments
Closed

Make kedro-datasets a dependency of kedro? #1776

antonymilne opened this issue Aug 10, 2022 · 7 comments
Assignees

Comments

@antonymilne
Copy link
Contributor

antonymilne commented Aug 10, 2022

My latest proposal in the ongoing discussion of kedro-datasets. See #1758 for context.

Proposal

  1. We continue with kedro-datasets becoming a separate namespace package as we are currently in [PAUSED] Namespace kedro-datasets as kedro.datasets kedro-plugins#49
  2. We move almost everything (see below for what) in kedro.io to kedro-datasets also. This would be another namespace package, so import paths would remain the same as kedro.io
  3. kedro-datasets becomes a core dependency of kedro. i.e. pip install kedro also does pip install kedro-datasets
  4. Extra dependencies like pandas.CSVDataSet are defined now. We implement extras_require in kedro so that pip install kedro[pandas.CSVDataSet] also does pip install kedro-datasets[pandas.CSVDataSet]

To be clear, the structure of kedro-datasets would be:

kedro
├── datasets
│   ├── __init__.py
│   └── all the stuff that's there now in https://github.com/kedro-org/kedro-plugins/pull/49
└── io
    ├── __init__.py
    ├── cached_dataset.py
    ├──  core.py
    └── ...
# no __init__.py as this level!

Pros

  • Users have the same flow as they currently have, which is nice and simple: pip install kedro[pandas.CSVDataSet]. They don't need to worry about the existence of kedro-datasets apart from if they want to manually update to a version beyond the "officially supported" one specified in the kedro requirements
  • In theory you could pip install kedro-datasets and use the datasets by themselves (probably). In practice this might not happen much, but having a self-contained fully-functioning package feels like a clean way to split things up
  • Definitions of AbstractDataSet etc. live in the same place as the implementations. e.g. if we change AbstractDataSet (which I think we should in the not too distant future: see Re-design io.core and io.data_catalog #1778) then this is much easier to handle
  • We break the circular dependencies problem of kedro-datasets: dependencies and package structure. Are we doing the right thing? #1758 because point (1) no longer holds: kedro-datasets does not depend on kedro
  • Documentation problem of How to make the documentations works for kedro-datasets? #1651 becomes simpler, since all we need to check is which version of kedro-datasets to fetch based on kedro's requirements.txt

Cons

  • Arguably things like AbstractDataSet are closer to core components (e.g. runner, pipeline) than they are to dataset implementations, which are really the optional extra thing that should be split out.
  • Probably some other things we haven't thought of yet

Key questions

  • Why was all the stuff in kedro.io supposed to remain in kedro rather than move to kedro-datasets? How important is that?
  • Do the files I suggest moving make sense? Did I miss any dependencies here? I only glanced through to get a rough idea; would be good to do it more carefully.
  • What did I miss here?
@antonymilne
Copy link
Contributor Author

antonymilne commented Aug 10, 2022

What stuff moves in kedro.io

Here's the files we have in kedro.io at the moment:

cached_dataset.py
core.py  # defines AbstractDataSet and AbstractVersionedDataSet
data_catalog.py  # defines DataCatalog
lambda_dataset.py
memory_dataset.py
partitioned_dataset.py

Out of these probably only data_catalog.py should remain part of kedro. Everything else would move to kedro-datasets.

  • looking through the imports in all these files, they are already a very well isolated chunk of code, so it should be possible to separate them from kedro without needing kedro-datasets to import anything from kedro. This is very important to ensure that kedro-datasets doesn't depend on kedro. I believe the only dependency that lives outside kedro.io is load_obj from kedro.utils, which could probably just be copied and pasted to kedro-datasets.
  • all the files ending _dataset.py are implementations of AbstractDataSet, just like all the things that live in kedro.datasets. They're more fundamental to kedro in a sense, which is why they are currently in io. Actually I think the only one that's really fundamental in the sense of being in practice non-optional would be MemoryDataSet.
  • data_catalog.py defines the DataCatalog, which is a core component of kedro that shouldn't be split out (probably?)
  • it would be nice to have core.py live in the same place as the implementations of its abstract classes

Note Moving io to a namespace package in kedro-datasets is completely non-breaking, since all imports remain the same, i.e. from kedro.io import ....

@antonymilne antonymilne changed the title Should kedro-datasets be a dependency of kedro? Make kedro-datasets a dependency of kedro? Aug 11, 2022
@noklam
Copy link
Contributor

noklam commented Aug 24, 2022

@AntonyMilneQB quick question, how will this work if kedro.io.data_catalog stays in kedro and kedro.io.xxxx stays in kedro-datasets?

@antonymilne
Copy link
Contributor Author

How will it work in terms of Python packaging? I'm not sure 😅 We should definitely check that!

@noklam
Copy link
Contributor

noklam commented Aug 24, 2022

Yes, it seems a weird thing to have a shared namespace with different distributions.

I suspect it may still work since pip probably just copies the files into the same folder, but we should check.

@antonymilne
Copy link
Contributor Author

antonymilne commented Aug 24, 2022

Yeah, definitely. Especially given this warning:

It is extremely important that every distribution that uses the namespace package omits the init.py or uses a pkgutil-style init.py. If any distribution does not, it will cause the namespace logic to fail and the other sub-packages will not be importable.

If we do have problems with this then I think there would probably be a way around it using some sort of aliasing so that importing kedro.io.data_catalog still works. Eventually, if that's the only file left in io I guess we might put it somewhere else, the import path would change, and then there wouldn't be any problem since all io stuff would live in kedro-datasets. But for the time being we definitely can't break that import so need to check this works somehow.

@merelcht
Copy link
Member

merelcht commented Aug 25, 2022

We discussed this issue in a Technical Design session:

  • The team is in agreement that kedro-datasets a dependency of kedro seems like a good idea.
  • Things that need to be clarified:
    • Why should we not move AbstractDataSet to kedro-datasets? What was @idanov 's reasoning to keep it in core kedro?
    • What parts of kedro.io should move to kedro-datasets. @AntonyMilneQB has already made a suggestion for this, but would like to get a second opinion. @noklam raised a good question above: "How will this work if kedro.io.data_catalog stays in kedro and kedro.io.xxxx stays in kedro-datasets"?

@AhdraMeraliQB AhdraMeraliQB self-assigned this Aug 30, 2022
@antonymilne
Copy link
Contributor Author

antonymilne commented Sep 1, 2022

Following a discussion with @idanov, we are not going with this proposal. All of io will remain in kedro and just the datasets will go to kedro-datasets. Ivan thinks all implementations of AbstractDataSet should be treated like they are written by a third party and split out. If we change AbstractDataSet then all third parties (us + anyone else implementing a dataset) will need to modify to match the new interface. MemoryDataSet and the like are "core" dataset implementations and so remain in kedro. @Galileo-Galilei also made a very good point with his Question 2, point 2 in #1758 that would make the proposal here unworkable.

The solution will instead be to make kedro a dependency of kedro-datasets and forget about the "nice to have" pip install kedro[...]. See #1758 for more.

@merelcht merelcht removed the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

4 participants