kedro-datasets: dependencies and package structure. Are we doing the right thing? #1758

Closed
antonymilne opened this issue Aug 4, 2022 · 11 comments
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@antonymilne
Contributor

antonymilne commented Aug 4, 2022

Context

Let’s pause and take stock of where we are in #1457. This is where I think things stand:

  • @idanov planned for us to move kedro datasets into a new package kedro-datasets. This would mean users do pip install kedro-datasets[pandas.CSVDataSet] and imports become from kedro-datasets import ...
  • @deepyaman suggested using a namespaced package for kedro-datasets. In short, this would mean that it’s still a separate pip installable package but the import path would still come from the kedro namespace: from kedro.datasets import ...
  • this was generally agreed to be a good idea. The motivation for splitting out kedro-datasets is more for distribution purposes rather than us suggesting that datasets could be used independently of kedro
  • this would mean that instead of doing pip install kedro[pandas.CSVDataSet], a user would do pip install kedro kedro-datasets[pandas.CSVDataSet]. I argued that this is not a very smooth user journey, and that it is also a bit confusing to pip install kedro-datasets but then import from kedro.datasets rather than from kedro-datasets
  • hence we decided we would maintain the “redirect” in which kedro’s extras_require would ensure that doing pip install kedro[pandas.CSVDataSet] would work as it does now. This is intended not purely for backwards compatibility but as the recommended way to install kedro-datasets, so that e.g. even in requirements.txt files you would not specify kedro-datasets but instead kedro[...]. See Update all starters to install kedro-datasets[xxxxx] instead of kedro[xxxx] #1495 for more details
  • @noklam raised a very good question about how documentation would work for kedro-datasets: How to make the documentations works for kedro-datasets? #1651. We decided that it should remain part of the core kedro documentation (i.e. live in same place as API docs on RTD that it does now). I set out a plan for how we could achieve this, but it’s very complicated and not 100% satisfactory
  • while trying to come up with a solution for the documentation question, my spidey sense started tingling. Something didn’t feel quite right, and I thought that beyond the complexity of handling documentation, there may be some deeper issue here with how we’re handling kedro-datasets. I discussed with @deepyaman briefly, who had some interesting ideas

Note. Regardless of whether it's a namespace package or not, most times something from kedro-datasets is used you wouldn’t actually need to do this import explicitly, since in the data catalog you don’t need to specify the full import path to the dataset type but rather just pandas.CSVDataSet.
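
To illustrate the note above, here is a minimal sketch of how a short type string could be resolved by trying a list of package prefixes. This is not Kedro's actual implementation: resolve_type and the prefixes in the usage lines are hypothetical, and standard-library modules stand in for dataset packages so that the snippet runs anywhere.

```python
import importlib
import json.decoder
from typing import Sequence


def resolve_type(type_name: str, prefixes: Sequence[str]) -> type:
    """Return the first class found by prepending each prefix to type_name."""
    for prefix in prefixes:
        module_path, _, class_name = (prefix + type_name).rpartition(".")
        if not module_path:
            continue  # nothing to import for a bare class name
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # this prefix does not lead to an importable module
        if hasattr(module, class_name):
            return getattr(module, class_name)
    raise ValueError(f"Could not resolve {type_name!r} using prefixes {list(prefixes)}")


# "decoder.JSONDecoder" plays the role of "pandas.CSVDataSet" here.
resolved = resolve_type("decoder.JSONDecoder", ["not_a_real_package.", "json."])
print(resolved)
```

With a prefix such as "kedro.extras.datasets." (or a namespaced "kedro.datasets."), the same kind of lookup would let a catalog entry say just pandas.CSVDataSet, whichever distribution actually ships the module.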

Concerns

The current concerns are (feel free to add if anyone has any others):

  • are we making bad circular dependencies?
  • is there just a whole better packaging model that we’ve not considered? e.g. metapackage, which doesn’t directly conflict with the current approach but would influence decisions we make now

Overall, the kedro-datasets work is quite complex. When it was first planned, we were not aware of the possibility of a namespace package, which changes the way we think about it quite a bit. I am concerned that we have not quite got the scheme right yet and might be missing something that would reduce overall complexity. My suggestion to resolve the situation:

  • let’s discuss the circular dependencies issue. Hopefully it’s not a problem at all, but I would like to feel more confident about this
  • let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for kedro-datasets might have been inspired by how django packages different components (?). @deepyaman mentioned jupyter’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here

We don’t need to completely pause work on kedro-datasets while we resolve these questions, but I think the outcome does affect some of the tickets (e.g, #1651 #1495). I do think, however, that we shouldn’t release kedro-datasets before we’re really confident on these.

Circular dependencies

This is what first set my spidey sense tingling.

  1. kedro is a dependency of kedro-datasets.
  2. to enable pip install kedro[pandas.CSVDataSet], kedro-datasets becomes an optional dependency of kedro through extras_require

(1) initially seemed to be non-negotiable to me but @deepyaman pointed out maybe that's not right (see below conversation). We don’t have to do (2) since we can just require people to pip install kedro-datasets, but it felt like at least a “nice to have” before.

Key question: is this form of circular dependency going to cause problems?

  • if yes, we need to change one of the above 2 points, i.e. either not specify kedro as a dependency of kedro-datasets or revert the decision to enable pip install kedro[pandas.CSVDataSet] and go back to pip install kedro-datasets. This would overall simplify things quite a bit but comes with some disadvantages (most important: not such a smooth user experience, less important: import paths don’t match package name)
  • if no, great. Let’s continue as we are. But we need to think carefully about exactly what kedro’s extras_require points to (e.g. kedro-datasets~=1.0 is the current plan) and likewise what kedro-datasets specifies as its kedro version specifier
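
The two points above can be written down as a tiny dependency graph to make the cycle explicit. This is only a sketch: find_cycle is a hypothetical helper, extras are collapsed into plain distribution names, and nothing here reflects how pip itself resolves dependencies.

```python
from typing import Dict, List, Optional


def find_cycle(requires: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""

    def visit(node: str, trail: List[str]) -> Optional[List[str]]:
        if node in trail:
            # We came back to a package already on the current path.
            return trail[trail.index(node):] + [node]
        for dependency in requires.get(node, []):
            cycle = visit(dependency, trail + [node])
            if cycle:
                return cycle
        return None

    for start in requires:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# Points (1) and (2) together: each distribution can pull in the other.
with_extra = {"kedro": ["kedro-datasets"], "kedro-datasets": ["kedro"]}
# Dropping point (2): only kedro-datasets depends on kedro.
without_extra = {"kedro": [], "kedro-datasets": ["kedro"]}

print(find_cycle(with_extra))
print(find_cycle(without_extra))
```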

My discussion with @deepyaman:
[Screenshot of the conversation with @deepyaman; image not preserved]

(Note the last comment here is considering that we should not allow pip install kedro[...] and instead be explicit about pip install kedro-datasets.)

Are we missing something?

Maybe there is a whole different way of handling the kedro vs. kedro-datasets split which would resolve the question of dependencies, what a user should pip install, how to handle the namespace, etc. e.g. @deepyaman suggested a kedro metapackage in which kedro-framework and kedro-datasets are both namespaced packages underneath that.

We don’t need to commit to implementing the kedro-framework split now if we don’t want to, but I think it would be good to get a feeling for whether this is a route we might want to go down in future because it influences our current decision on how to handle kedro-datasets. e.g. it might convince us that pip install kedro[pandas.CSVDataSet] is good or bad.
[Two screenshots; images not preserved]

@merelcht
Member

merelcht commented Aug 4, 2022

I think I like the sound of the metapackage. When I was reading through the python docs for Packaging namespace packages I found it describes a structure of kedro with kedro-framework and kedro-datasets inside rather than having a separate package altogether under the same namespace. This blog post kind of describes what we were trying to achieve but the conclusion is that namespacing doesn't offer a great solution.

We'll need to figure out if the metapackage solution will solve the problem of different release velocities, which I think it does, but would like to see how it works in practice.

@noklam
Contributor

noklam commented Aug 4, 2022

Just to add one more thing🔥: the current setup, with kedro as a normal package and kedro-datasets as a namespace package, seems to break the dev install (pip install -e .). There may be a solution for it, but I don't know of one yet.

This is the exact issue that I was worried about in #1652. Regardless of the final solution, we should definitely add this to the checklist.

  • Making sure development install will work properly.

CC @AhdraMeraliQB

@antonymilne
Contributor Author

antonymilne commented Aug 4, 2022

One more thing to throw into the mix... While talking through kedro-org/kedro-plugins#49 with @AhdraMeraliQB I found that she was using distribution name kedro.datasets rather than kedro-datasets as I had been expecting, i.e. you would do pip install kedro.datasets

Actually it turns out that . and - and _ are all treated the same way by pip install. Note that surprisingly you can already do pip install kedro_viz or kedro-viz or kedro.viz. This actually answers the minor issue of pip install kedro-datasets not matching the import path from kedro.datasets because you can do pip install kedro.datasets. What it doesn't address is the more important issue of user experience since you would still need to pip install a separate package. I also don't know whether it's actually a good idea because it doesn't seem common to name packages with . this way even if it works.
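
The equivalence of ., - and _ comes from the project name normalization rule in PEP 503, which package indexes and installers apply before comparing names. The regular expression below is the one given in that PEP; the function name is just for illustration.

```python
import re


def normalize_distribution_name(name: str) -> str:
    """Normalize a distribution name as described in PEP 503."""
    # Runs of "-", "_" and "." collapse to a single "-", and case is ignored.
    return re.sub(r"[-_.]+", "-", name).lower()


for candidate in ("kedro-datasets", "kedro_datasets", "kedro.datasets", "Kedro_Viz"):
    print(candidate, "->", normalize_distribution_name(candidate))
```

Note that this only concerns the distribution name used with pip install; it has no effect on the import path, which is decided by the packages the distribution contains.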

If you're not sure about the difference between distribution name and package name I highly recommend these: https://stackoverflow.com/questions/53346450/is-it-acceptable-to-have-python-package-names-with-numbers-in-it https://stackoverflow.com/questions/62834928/how-to-find-the-package-name-for-a-specific-module. The classic example is that you do pip install scikit-learn but from sklearn import ....

Note. Just to put down my latest thoughts on this while I remember: I think kedro-datasets being a namespaced package that is dependent on kedro still feels like the right thing to do. What I'm not so sure about is whether we should allow pip install kedro[...] or just force an explicit pip install kedro-datasets.

@noklam doing pip install -e . on kedro-datasets works ok for me but it's possible I've missed something here or am using a contaminated environment...

@noklam
Contributor

noklam commented Aug 4, 2022

@AntonyMilneQB Would be great if you can check the output of these

import sys
import kedro
import kedro.datasets

# Show where each module was loaded from and which paths are searched
print(kedro)
print(kedro.datasets)
print(sys.path)

My hypothesis is that it doesn't work because Python's module finder has a "first match wins" concept. As soon as it finds kedro in xxxx/site-packages/kedro/, it will think kedro.datasets is in xxxx/site-packages/kedro/datasets/. Therefore, pip install . would work fine because it installs the namespaced package in site-packages/kedro/*, but a development install doesn't do that.
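
A simplified reproduction of this hypothesis, using throwaway directories rather than a real kedro install. The package name demo_core and both directory names are made up: one directory holds a regular package (with __init__.py) and the other holds only a subpackage under the same top-level name, as a source checkout of a namespace portion would.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_regular_package_shadows_namespace_portion() -> str:
    """Try to import a subpackage that lives outside the regular package's directory."""
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site-packages"  # regular package: has __init__.py
        dev = Path(tmp) / "dev-checkout"    # namespace portion: no top-level __init__.py
        (site / "demo_core").mkdir(parents=True)
        (site / "demo_core" / "__init__.py").write_text("")
        (dev / "demo_core" / "datasets").mkdir(parents=True)
        (dev / "demo_core" / "datasets" / "__init__.py").write_text("VALUE = 1\n")

        sys.path[:0] = [str(dev), str(site)]
        importlib.invalidate_caches()
        try:
            importlib.import_module("demo_core.datasets")
            return "imported"
        except ModuleNotFoundError:
            return "not found"
        finally:
            # Undo the changes so the demo leaves the interpreter untouched.
            sys.path.remove(str(dev))
            sys.path.remove(str(site))
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_core"]:
                del sys.modules[name]


print(demo_regular_package_shadows_namespace_portion())
```

Once a regular package named demo_core is found, its __path__ contains only its own directory, so the datasets subpackage in the other location is never searched even though that location is on sys.path. Whether this is exactly what happens with the editable install discussed here would still need checking.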

@antonymilne
Contributor Author

@noklam ah yes, you are absolutely right - I must have been doing something wrong before. Your theory sounds like a good one, but judging by this it seems like this should work because Python looks through all the places in sys.path until it finds kedro.datasets. Maybe we've got something wrong here since the right path is definitely in my sys.path (cwd always is I think) but Python's import isn't picking up on it 🤔 Either way, it's irritating but not a show stopper since it only affects develop install.

I'm actually also getting some weird behaviour on kedro-org/kedro-plugins#49 where pip install ".[pandas.CSVDataSet]" seems to install all the requirements rather than just the relevant extras_require (doesn't matter if it's develop install or not). Looking at setup.py I can see why this would be, but I was sure that was working differently this afternoon so not quite sure what's going on there 🤔 @AhdraMeraliQB

@deepyaman
Member

  • let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for kedro-datasets might have been inspired by how django packages different components (?). @deepyaman mentioned jupyter’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here

Agree that it would make sense to see/learn from the experience of more projects here. If somebody can find/share how Django does this, that would be great, because I haven't found it yet. :) While I think the metapackage approach sounds clean in theory, I wonder if it's overcomplicating things, if Kedro-Framework is essentially required, and Kedro-Datasets is the only additional package. Also, which (if any) of these approaches expect the underlying packages to be independent, and which support packages depending on each other (possibly again going back to the question of avoiding circular dependencies)?

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Aug 8, 2022
@antonymilne
Contributor Author

antonymilne commented Aug 10, 2022

Notes from technical design discussion on 10 August

  • Cost-benefit of doing the split. In general several people did not feel convinced that the cost of this work (engineering time to setup, complexity of multiple release flows, maintenance of docs) justified the benefits. Possibly this is because we do not fully understand the benefits and need @idanov to explain them more. From what I can tell, the benefits would be:

    • original reasons outlined in Package kedro.extras.datasets into its own kedro-datasets package #1457, which are to do with different release cadence of datasets vs. core and breaking changes. However, this only seems to have been a problem once or twice in the past, and the release cadence of new Python versions seems to be once per year for recent 3.x, which does not feel too far out of sync with kedro breaking releases. @yetudada noted that people often pin their kedro version to not even allow for patch releases (I think our telemetry data demonstrates this when you look at the number of e.g. 0.17.4 downloads), which makes minor breaking changes in patches less of a problem. Overall, based on our current understanding and after quite a bit of discussion, people did not find the "breaking changes" argument to be compelling in practice
    • CI should be simpler and more stable in kedro (a reason I gave in Package kedro.extras.datasets into its own kedro-datasets package #1457). But we could achieve this without doing the kedro-datasets split; it would just be a nice extra we get out of doing it
    • philosophical reasons: we should have a lean core and then be pluggable. Not discussed anywhere, but personally I quite like this perspective on Kedro.
  • @yetudada confirmed what was stated above that point 2 of Circular dependencies is a "nice to have" and not essential. i.e. it's not ideal but is acceptable if users need to do pip install kedro kedro-datasets[pandas.CSVDataSet]. If we are happy to do this then it would immediately resolve the circular dependencies problem and Update all starters to install kedro-datasets[xxxxx] instead of kedro[xxxx] #1495 should be done after all

  • @yetudada confirmed that having the kedro-datasets documentation in the same place as it currently sits in the API docs of kedro is important. I'm not sure whether this is made easier or harder with namespace package - see How to make the documentations works for kedro-datasets? #1651 (comment)

  • We questioned whether point 1 of Circular dependencies (kedro is a dependency of kedro-datasets) was actually necessary. We think the reasoning for this originated from keeping AbstractDataSet etc. in kedro, but no one thought it was obvious why we should do that (see next steps)

  • It was generally agreed that we should look at how other packages handle this situation. No one had any particular knowledge on this already so it would need some further investigation

Next steps

@Galileo-Galilei
Contributor

Galileo-Galilei commented Aug 30, 2022

Hi kedro team,

I've followed with great attention your journey on making kedro-datasets an independent package, and I'd like to share my thoughts on some of the questions which still seem open on this topic.

Question 1 : Should kedro really split kedro-datasets in a separate package?

In my opinion, this is a big yes because it will tremendously improve enterprise support, provided some specific implementation that I'll detail further.

The major benefit I expect from this split, apart from the ones summarised above by @AntonyMilneQB, is the ability to upgrade only partially between major versions of the framework (technically, in SemVer terms, I am talking about minor versions, but you understand what I mean: kedro-0.16, kedro-0.17, kedro 0.18).

Kedro is becoming more and more prevalent in the industry, but users can't pay migration costs very often. My team moved this summer from 0.16.5 to 0.18.2, and reading the discord or the various github issues, it seems that many users are still stuck on 0.16 and 0.17 versions. The download statistics on pepy also indicate that 0.17.x is more used than the 0.18.x series, and that 0.16.x, albeit less downloaded, is not completely abandoned by users.

I feel from personal experience (maybe it would need some user research to confirm / quantify it) that what scares users and prevents them from migrating are the template changes. This is a bit ironic, since changing the template is often a matter of a couple of minutes, but there is a cost to understanding where objects go in each new template. My intuition is that most of them would migrate much more often if they could just pip install kedro with the newest version.

Some good news though: the motivation for migration is very often (once again, based on personal experience) to get some improvements for datasets, for instance:

  • newer datasets that do not exist in old versions
  • annoying bugs in some datasets (e.g. old implementations of MatplotlibWriter)
  • incomplete features for some datasets (e.g. old implementations of ApiDataSet)
  • new fsspec protocol in more recent versions (smb, ftp, abfss...)
  • upgrade old dependencies in kedro requirements which create conflicts with other libraries (e.g. fsspec<0.7 in kedro-0.16 is breaking many packages!)

It would feel much more modular and safer to be able to upgrade an application in production gradually by upgrading only the kedro-datasets version in its requirements rather than modifying the entire template, and it would address all of the common feature requests above.

Obviously, users will have to migrate entirely at some point, but being able to upgrade datasets much faster than we are able to do now would be a tremendous improvement for production maintenance (my team has maintained custom plugins for fsspec connections with unsupported protocol for two years because we were not able to migrate, while it would be awesome to just upgrade a version number with kedro-datasets!).

Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?

Three scenarios are on the table at the moment. Based on q1, if we want to enable upgrading kedro-datasets with very old kedro-versions:

  1. kedro-datasets imports kedro and vice versa.
    This scenario is a no go for me. Apart from the circular dependency issue you are facing and discussing above, this makes the desired feature of easily upgrading only kedro-datasets (cf. question 1) almost impossible to achieve. Indeed, kedro-datasets would reinstall a newer version of kedro incompatible with the template of your old project, unless the requirement bounds are very wide, which is unlikely.

  2. kedro imports kedro-datasets but not the opposite.
    This feels quite natural, because it avoids asking users to install both packages. However, I would find this very unpleasant if the kedro-datasets upper bound were too tight and prevented me from upgrading easily. This is very likely if any upper bound is set, because many breaking changes in kedro-datasets would not be breaking from the kedro point of view (i.e. a breaking change will occur in one specific dataset implementation, but no breaking change in the "core" module, i.e. the AbstractDataSet will still have load and save methods). It is very likely that users will want to benefit from breaking changes to specific dataset implementations and upgrade the package, which will raise pip version conflicts if the upper bound is set too tight.

  3. Ask users to install kedro and kedro-datasets separately.
    This is my preferred option, because this would make the updates over versions very easy, since the user would be responsible for managing the dependencies.

I understand that it is less user-friendly and that you would likely get a lot of users saying that they'd like to have both installed automatically, but if they get a very clear error message on their first kedro run, I guess it should be pretty OK. Another possibility is to make kedro-datasets a dependency of kedro with no upper bound, but I'm pretty sure you won't like this option :)
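
As a rough idea of what such a "very clear error message" could look like, here is a minimal sketch. import_with_hint is a hypothetical helper, not an existing Kedro function, and the module name in the demo call is deliberately fake.

```python
import importlib
from types import ModuleType


def import_with_hint(module_name: str, install_hint: str) -> ModuleType:
    """Import a module, replacing a bare ModuleNotFoundError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        top_level = module_name.split(".")[0]
        if exc.name == top_level:
            # The distribution itself is missing, so tell the user how to get it.
            raise ModuleNotFoundError(
                f"'{module_name}' is not installed. Try: {install_hint}",
                name=exc.name,
            ) from exc
        raise  # some other module is missing; do not hide that error


try:
    import_with_hint("surely_missing_pkg_for_demo.pandas", "pip install surely-missing-pkg")
except ModuleNotFoundError as error:
    print(error)
```

In the scenario discussed here, the call would presumably look like import_with_hint("kedro_datasets.pandas", "pip install kedro-datasets[pandas.CSVDataSet]"), raised at the point where the catalog first needs a dataset class.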

As a side note, I totally agree that documentation should still be hosted in the same place whatever is decided in the end, for 2 reasons:

  • it would make clear users have to install kedro-datasets
  • everything will be searchable in the same place

Question 3 : what part of the kedro.io folder should move to kedro-datasets?

Basically there seems to be a consensus that all specific implementations + lambda / memory / partitioned / cached datasets (as well as load_obj utils) should be moved to kedro-datasets, and it feels completely natural.

Regarding the AbstractDataSet and AbstractVersionedDataSet, I am completely convinced they belong in kedro-datasets. The key arguments are:

  • not moving it would make kedro-datasets have kedro as a dependency, which is my worst scenario as described in question 2
  • this would make it possible to create custom datasets without importing kedro. From an upgrade perspective, it would be great to benefit from any improvement to this dataset for a custom implementation inside a project.
  • kedro should not know how AbstractDataSet works under the hood. The only "contract" between the two is that a dataset has a load and a save method. This is already the case elsewhere, since you assume the pickle library has load and dumps methods.
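
The load/save "contract" in the last bullet can be expressed as a structural type, so that the consuming code never imports a base class. This is only an illustration of the idea: DatasetLike, InMemoryExample and round_trip are hypothetical names, and Kedro itself uses an abstract base class rather than a Protocol.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DatasetLike(Protocol):
    """Anything with load() and save() counts as a dataset for this sketch."""

    def load(self) -> Any: ...

    def save(self, data: Any) -> None: ...


class InMemoryExample:
    """A toy dataset that never inherits from, or imports, a framework base class."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


def round_trip(dataset: DatasetLike, data: Any) -> Any:
    """The only thing the caller relies on is the load/save pair."""
    dataset.save(data)
    return dataset.load()


print(isinstance(InMemoryExample(), DatasetLike))
print(round_trip(InMemoryExample(), [1, 2, 3]))
```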

Regarding the DataCatalog, I have less strong feelings, but I feel that it should be part of kedro-datasets too. This is the native "container" for datasets, and I don't think people have ever customized it (but I may be wrong!), and if someone wants to use the package without the rest of kedro, it seems natural to have these utilities accessible directly.

Question 4: should kedro namespace kedro-datasets ?

From the first time this idea has been suggested, I feel there are much more drawbacks than advantages, but I understand the arguments at stake here.

Overall, I think that there are many cons to this :

  • the engineering setup cost seems higher than what you expected at first (but actually, this is not really a con, it is totally up to you to estimate if it is worth the cost)
  • it is very confusing for users to know what's going on internally. It seems quite easy to understand that kedro_datasets.pandas.CSVDataSet imports the module (and it is eventually easy to go check the code), while kedro.pandas.CSVDataSet obfuscates a lot the fact that the code lies in kedro-datasets package.
  • if I understand well, the main motivation for this namespacing is to help people who use absolute imports in their catalog instead of the usual relative import to upgrade transparently (i.e. the ones who currently use kedro.extras.datasets.pandas.CSVDataSet instead of pandas.CSVDataSet). This does not seem a good motivation because:
    • these people are very likely a very small part of users
    • these people will have to suffer migration costs to 0.19.0 whatsoever, and I am deeply convinced that upgrading the catalog path will be extremely easy for them because they understand the underlying import mechanism.
    • even worse, this should be counterproductive because :
      • it may be counterintuitive for them (why should I still use kedro.extras when the code is in kedro-datasets?)
      • it likely goes against their initial motivation (I guess that using the absolute path is to make clear to readers where the code is, and if you read an import written as kedro.extras.datasets.pandas.CSVDataSet but there is no such folder in the kedro repo, this is very confusing)
  • There are very dangerous side effects :

Non answered questions :

  • what is the way to package and release a distribution of subpackages instead of a single package?
    I am no expert and don't know what the recommended best practices are here, but tidyverse is a well-known distribution in R which may be informative. The key idea is that you can install each package separately (e.g. kedro-datasets and kedro-framework) AND install the entire distribution (pip install kedro), so you get both flexibility and ease of upgrade (if packages are installed separately) and ease of install (if the user installs the entire distribution).

@antonymilne
Contributor Author

antonymilne commented Sep 1, 2022

Following a discussion between @idanov, @AntonyMilneQB, @AhdraMeraliQB and @noklam this afternoon, here's where things stand:

  1. Circular dependencies was indeed deemed to be an issue with the current approach. The solution to this is to drop assumption 2 in the top post. i.e. we will no longer enable pip install kedro[pandas.CSVDataSet]. This was always a "nice to have" so fine to drop. kedro will be a dependency of kedro-datasets and not vice versa. We must think carefully before upper-bounding the kedro dependency so as to avoid conflicts as suggested by @Galileo-Galilei above.
  2. @idanov felt strongly that io should remain in kedro. Hence kedro-datasets will contain just the datasets. See Make kedro-datasets a dependency of kedro? #1776 (comment) for the reasoning.
  3. We would still like to have kedro-datasets as a namespace package as in [PAUSED] Namespace kedro-datasets as kedro.datasets kedro-plugins#49, but this is just a "nice to have". There are issues currently around pytest (probably related to the develop install) which we are going to try to resolve. If it turns out that what we’re doing (mixing namespace kedro + non-namespace kedro package) is really not recommended and does not work well then we will make kedro-datasets a non-namespace package. This is an easy change to make.
  4. We should not release kedro-datasets until the namespace package question is fully resolved.
  5. A kedro metapackage is a nice idea that would resolve the circular dependency issue and still allow for pip install kedro. However, it feels like overkill for now, i.e. even more additional complexity with not enough obvious benefit.

Next steps


@Galileo-Galilei thank you very much, as ever, for your extremely carefully thought out and helpful response! We discussed this all again this afternoon - see above for the summarised outcomes. Let me respond to each of your points in turn here.

Question 1 : Should kedro really split kedro-datasets in a separate package?

Your answer to this really helps to motivate what we're doing and has given me a lot more confidence that it's a worthwhile change, thank you. It hugely helps to have your outside perspective of using kedro in the wild here 🙇

Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?

This is very helpful, not least because it's identified a major weakness of the proposal in #1776, which is now off the table. Your point about pip dependency conflicts is very important. If our solution does not allow for users to easily do a breaking upgrade to kedro-datasets while leaving kedro version unchanged then we've failed to achieve the main incentive for this piece of work. Our solution here is your preferred solution 3. kedro-datasets will have a kedro dependency but with a suitable version specified so that pip dependency conflicts will not be an issue (so no upper bound I guess).
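
To make the upper-bound point concrete, here is a small sketch contrasting a compatible-release specifier such as ~=1.0 with a plain lower bound. It handles only simple dotted numeric versions and is a simplification of the PEP 440 rules (real tooling should rely on a proper version library); the function names are hypothetical.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse a simple dotted numeric version such as '1.4.2'."""
    return tuple(int(part) for part in version.split("."))


def _padded(version: Tuple[int, ...], length: int) -> Tuple[int, ...]:
    """Pad with zeros so that 1 and 1.0 compare as equal."""
    return version + (0,) * (length - len(version))


def at_least(version: str, minimum: str) -> bool:
    """Lower bound only, i.e. '>=minimum'."""
    v, m = parse(version), parse(minimum)
    width = max(len(v), len(m))
    return _padded(v, width) >= _padded(m, width)


def compatible_release(version: str, spec: str) -> bool:
    """'~=spec': at least spec, and equal to spec on all but its last component."""
    prefix = parse(spec)[:-1]
    return at_least(version, spec) and parse(version)[: len(prefix)] == prefix


for candidate in ("1.0", "1.4.2", "2.0.0"):
    print(candidate, compatible_release(candidate, "1.0"), at_least(candidate, "1.0"))
```

Under ~=1.0 a breaking 2.0.0 release of the dependency is rejected, which is exactly the kind of pin that would produce pip conflicts on a breaking upgrade; a bare lower bound accepts it.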

Question 3 : what part of the kedro.io folder should move to kedro-datasets?

Here we have gone in the opposite direction and decided that all of io should in fact remain in kedro. This is something of a change from where the consensus was heading, but @idanov made it clear that kedro dataset implementations are really what we're trying to split off here, and not the AbstractDataSet or even "core" dataset implementations like MemoryDataSet.

Question 4: should kedro namespace kedro-datasets

This is still something of an open question. In principle we are still in favour of using a namespace package but only if we can get round some of the technical difficulties that we're currently facing in kedro-org/kedro-plugins#49. The main argument in favour of a namespace package is not really to keep the imports the same - as you say, that is a small number of users since we are dropping extras from the path so those people will need to change the import paths anyway. Instead it's the "feel" you get that kedro-datasets is still part of the kedro package once it has been pip installed. We do understand that this use of namespace packages is not so well known and less obvious to users though. @deepyaman curious if you have any more thoughts here.

Non answered questions

Your tidyverse example sounds very similar to the metapackage idea that we were considering in #1777. We think this probably does have some advantages but it was not clear overall that the additional complexity would be worth it at the moment.

@noklam
Contributor

noklam commented Sep 2, 2022

This is mostly just re-stating the same thing in different ways. Feel free to edit this. cc @AntonyMilneQB @AhdraMeraliQB

Requirements

  • Easy to upgrade kedro-datasets without upgrading kedro (Primary Goal)
  • keeping the kedro namespace (nice to have) - backward compatibility is not relevant here, since we are moving kedro.extras.datasets -> kedro.datasets anyway. i.e. (from kedro.extras.datasets import SomeDataSet)
  • Possible to keep pip install kedro[xxx]
  • Editable install or pytest should be possible without installing the package first
  • Did I miss anything else?

| Requirement | 1. Original Proposal - no namespace, classic kedro_dataset | 2. Namespace Package | 3. Namespace-ish Package - dask, dask.distributed | 4. Metapackage kedro, kedro-framework, kedro-datasets |
| --- | --- | --- | --- | --- |
| Easy to upgrade | Depends on how we set the upper bound (if any); assume `kedro-dataset` depends on `kedro` | Depends on how we set the upper bound (if any); assume `kedro-dataset` depends on `kedro` | Depends on how we set the upper bound (if any); assume `kedro-dataset` depends on `kedro` | Depends on how we set the upper bound (if any) |
| Keep kedro namespace | No | Yes | Yes-ish | Not sure |
| Circular Dependency - possible to do pip install kedro[xxx] | Circular dependency exists | Circular dependency exists | Circular dependency exists | No circular dependency |
| Editable install or pytest should be possible without installing the package first | Not a problem | Current approach is problematic - need more research on namespace package | Not a problem | Not a problem |

More explanation for approach 3 - the dask way

Dask has a package dask but also a dask.distributed namespace. It doesn't use a real Python namespace package but uses a trick instead.

In short, the real packages here are dask and distributed, but dask keeps a dask/distributed.py that just imports everything from distributed into the dask.distributed namespace. As a result, pytest just works, but in a slightly odd way: the tests use import distributed instead of dask.distributed, as evidenced here

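# dask/distributed.py
# The real module defines this message before the try block; the wording here is abbreviated.
_import_error_message = "dask.distributed is not installed. Please install the distributed package."

try:
    # Re-export everything so that dask.distributed mirrors the distributed package
    from distributed import *
except ImportError as e:
    if e.msg == "No module named 'distributed'":
        # distributed itself is missing: replace the error with an install hint
        raise ImportError(_import_error_message) from e
    else:
        # some other import failed inside distributed: surface it unchanged
        raise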

@AhdraMeraliQB
Contributor

AhdraMeraliQB commented Sep 15, 2022

Notes from technical design discussion on 14 September

After consideration of the 4 approaches outlined above, we agreed that the most correct way to proceed would be the metapackage (option 4), but the engineering costs involved were not justified by the added value of being able to import from kedro.datasets instead of kedro_datasets. Additionally, once implemented, it is very difficult to reverse metapackaging whilst minimising the effect on users - it is currently just too high a commitment. As such, we will be closing this issue and #1693.

Points to follow up on

@yetudada highlighted the addition in complexity for the users should we continue to separate out kedro-datasets without namespacing. We should conduct some user interviews to gauge how they feel about splitting out the datasets.
