-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kedro-datasets: dependencies and package structure. Are we doing the right thing? #1758
Comments
I think I like the sound of the metapackage. When I was reading through the python docs for Packaging namespace packages I found it describes a structure of We'll need to figure out if the metapackage solution will solve the problem of different release velocities, which I think it does, but would like to see how it works in practice. |
Just to add one more thing🔥, the current way of This is the exact issue that I was worried about in #1652, regardless the final solution we should definitely added this into the checklist.
|
One more thing to throw into the mix... While talking through kedro-org/kedro-plugins#49 with @AhdraMeraliQB I found that she was using distribution name Actually it turns out that If you're not sure about the difference between distribution name and package name I highly recommend these: https://stackoverflow.com/questions/53346450/is-it-acceptable-to-have-python-package-names-with-numbers-in-it https://stackoverflow.com/questions/62834928/how-to-find-the-package-name-for-a-specific-module. The classic example is that you do
@noklam doing |
@AntonyMilneQB Would be great if you can check the output of these import kedro
import kedro.datasets
import sys
print(kedro), print(kedro.datasets),print(sys.path) My hypothesis is that it doesn't work because of the Python's module Finder has the "first match wins" concept. As soon as it finds |
@noklam ah yes, you are absolutely right - I must have been doing something wrong before. Your theory sounds like a good one, but judging by this it seems like this should work because Python looks through all the places in I'm actually also getting some weird behaviour on kedro-org/kedro-plugins#49 where |
Agree that it would make sense to see/learn from the experience of more projects here. If somebody can find/share how Django does this, that would be great, because I haven't found it yet. :) While I think the metapackage approach sounds clean in theory, I wonder if it's overcomplicating things, if Kedro-Framework is essentially required, and Kedro-Datasets is the only additional package. Also, which (if any) of these approaches expect the underlying packages to be independent, and which support packages depending on each other (possibly again going back to the question of avoiding circular dependencies)? |
Notes from technical design discussion on 10 August
Next steps
|
Hi kedro team, I've followed with great attention your journey on making Question 1 : Should kedro really split kedro-datasets in a separate package?In my opinion, this is a big yes because it will tremendously improve enterprise support, provided some specific implementation that I'll detail further. The major benefit I expect from this split, apart from the ones summarised above by @AntonyMilneQB, is the ability to upgrade only partially between major versions of the framework (technically in terms of SemVer, i am taliking of minor version, but your understand what I mean: kedro-0.16, kedro-0.17, kedro 0.18). Kedro is becoming more and more prevalent in the industry, but users can't pay migrations costs very often. My team moved this summer from 0.16.5 to 0.18.2, and reading the discord or the various github issues, it seems that many users are still stuck in 0.16 and 0.17 versions. The download statistics on pepy also indicate that 0.17.x is more used than 0.18.x series, and that 0.16.x, albeit less downloaded, is not completely abandoned by users. I feel from personal experience (maybe it would need some users research to confirm / quantify it) that what scares users and prevents them from migrating are the template changes. This is a bit ironical since changing the template is often a matter of a couple of minutes, but there is a cost of understanding where objects goes in each new template. My intuition is that most of them would migrate much more often if they could just Some good news though: the motivation for migration is very often (once again, based on personal experience) to get some improvements for datasets, for instance:
It would feel much more modular and safer to be able to upgrade an application in production gradually by upgrading only the Obviously, users will have to migrate entirely at some point, but being able to upgrade datasets much faster than we are able to do now would be a tremendous improvement for production maintenance (my team has maintained custom plugins for Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?Three scenarios are on the table at the moment. Based on q1, if we want to enable upgrading kedro-datasets with very old kedro-versions:
This feels quite natural, because it avoids asking users to both packages. However, I would find this very unpleasant if
This is my preferred option, because this would make the updates over versions very easy, since the user would be responsible for managing the dependencies. I understand that it is less users friendly and that you would likely get a lot of users claiming that they'd like to have both installed automatically, but if they get a very clear error message on their first As a side note, I totally agree that documentaiton should still be hosted in the same place whatever is decided in the end for 2 reasons:
Question 3 : what part of the
|
Following a discussion between @idanov, @AntonyMilneQB, @AhdraMeraliQB and @noklam this afternoon, here's where things stand:
Next steps
@Galileo-Galilei thank you very much, as ever, for your extremely carefully thought out and helpful response! We discussed this all again this afternoon - see above for the summarised outcomes. Let me respond to each of your points in turn here.
Your answer to this really helps to motivate what we're doing and has given me a lot more confidence that it's a worthwhile change, thank you. It hugely helps to have your outside perspective of using kedro in the wild here 🙇
This is very helpful, not least because it's identified a major weakness of the proposal in #1776, which is now off the table. Your point about pip dependency conflicts is very important. If our solution does not allow for users to easily do a breaking upgrade to
Here we have gone in the opposite direction and decided that all of
This is still something of an open question. In principle we are still in favour of using a namespace package but only if we can get round some of the technical difficulties that we're currently facing in kedro-org/kedro-plugins#49. The main argument in favour of a namespace package is not really to keep the imports the same - as you say, that is a small number of users since we are dropping
Your |
This is mostly just re-stating the same thing in different ways. Feel free to edit this. cc @AntonyMilneQB @AhdraMeraliQB Requirements
More explanation for approach 3 - the dask wayDask has a package In short, the real package here is # dask/distributed.py
try:
from distributed import *
except ImportError as e:
if e.msg == "No module named 'distributed'":
raise ImportError(_import_error_message) from e
else:
raise |
Notes from technical design discussion on 14 SeptemberAfter consideration of the 4 approaches outlined above, we agreed that the most correct way to proceed would be to Metapackage (option 4), but the engineering costs involved were not justified by the value addition of being able to import from Points to follow up on@yetudada highlighted the addition in complexity for the users should we continue to separate out |
Context
Let’s pause and take stock of where we are in #1457. This is where I think things stand:
kedro-datasets
. This would mean users dopip install kedro-datasets[pandas.CSVDataSet]
and imports becomefrom kedro-datasets import ...
kedro-datasets
. In short, this would mean that it’s still a separatepip install
able package but the import path would still come from thekedro
namespace:from kedro.datasets import ...
kedro-datasets
is more for distribution purposes rather than us suggesting that datasets could be used independently of kedropip install kedro[pandas.CSVDataSet]
, a user would dopip install kedro kedro-datasets[pandas.CSVDataSet]
. I argued that this doesn’t seem like such a smooth user journey and also it’s actually a bit confusing topip install kedro-datasets
but then importfrom kedro.datasets
rather thanfrom kedro-datsets
kedro
’sextras_require
would ensure that doingpip install kedro[pandas.CSVDataSet]
would work as it does now. The intention with this is not purely for backwards compatibility but the recommended way to installkedro-datasets
, so that e.g. even in requirements.txt files you would not specifykedro-datasets
but insteadkedro[...]
. See Update all starters to installkedro-datasets[xxxxx]
instead ofkedro[xxxx]
#1495 for more detailskedro-datasets
: How to make the documentations works forkedro-datasets
? #1651. We decided that it should remain part of the core kedro documentation (i.e. live in same place as API docs on RTD that it does now). I set out a plan for how we could achieve this, but it’s very complicated and not 100% satisfactorykedro-datasets
. I discussed with @deepyaman briefly, who had some interesting ideasConcerns
The current concerns are (feel free to add if anyone has any others):
Overall, the
kedro-datsets
work is quite complex. When it was first planned, we were not aware of the possibility of a namespace package, which changes the way we think about it quite a bit. I am concerned that we have not quite got the scheme right yet and might be missing something that would reduce overall complexity. My suggestion to resolve the situation:kedro-datasets
might have been inspired by howdjango
packages different components (?). @deepyaman mentionedjupyter
’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big hereWe don’t need to completely pause work on
kedro-datasets
while we resolve these questions, but I think the outcome does affect some of the tickets (e.g, #1651 #1495). I do think, however, that we shouldn’t releasekedro-datasets
before we’re really confident on these.Circular dependencies
This is what first set my spidey sense tingling.
kedro
is a dependency ofkedro-datsets
.pip install kedro[pandas.CSVDataSet]
,kedro-datasets
becomes an optional dependency ofkedro
throughextra_requires
(1) initially seemed to be non-negotiable to me but @deepyaman pointed out maybe that's not right (see below conversation). We don’t have to do (2) since we can just require people to
pip install kedro-datasets
, but it felt like at least a “nice to have” before.Key question: is this form of circular dependency going to cause problems?
kedro
as a dependency ofkedro-datasets
or revert the decision to enablepip install kedro[pandas.CSVDataSet]
and go back topip install kedro-datasets
. This would overall simplify things quite a bit but comes with some disadvantages (most important: not such a smooth user experience, less important: import paths don’t match package name)kedro
’sextra_requires
points to (e.g.kedro-datasets~=1.0
is the current plan) and likewise whatkedro-datasets
specifies as itskedro
version specifierMy discussion with @deepyaman:
(Note the last comment here is considering that we should not allow
pip install kedro[...]
and instead be explicit aboutpip install kedro-datasets
.)Are we missing something?
Maybe there is a whole different way of handling the
kedro
vs.kedro-datasets
split which would resolve the question of dependencies, what a user shouldpip install
, how to handle the namespace, etc. e.g. @deepyaman suggested akedro
metapackage in whichkedro-framework
andkedro-datasets
are both namespaced packages underneath that.We don’t need to commit to implementing the
kedro-framework
split now if we don’t want to, but I think it would be good to get a feeling for whether this a route we might want to go down in future because it influences our current decision on how to handlekedro-datasets
. e.g. it might convince us thatpip install kedro[pandas.CSVDataSet]
is good or bad.The text was updated successfully, but these errors were encountered: