Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify api hierarchy #712

Closed
WaylonWalker opened this issue Mar 5, 2021 · 19 comments
Closed

Simplify api hierarchy #712

WaylonWalker opened this issue Mar 5, 2021 · 19 comments
Labels
Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@WaylonWalker
Copy link
Contributor

Description

When using kedro as a user it can sometimes be hard to remember where even the most common kedro features are located.

Context

This will lower the barrier to entry to using kedro, and make it easier to use without needing to use the docs for even the most simple task such as "How do I make a node".

Possible Implementation

To simplify the barrier to entry into kedro import the most commonly used features right into main.

WARNING - for the person browsing through, this is not the real api, this is simply a suggestion.

import kedro as kd

my_pipeline = kd.Pipeline([kd.node(create_cars, None, "cars")])

Alternatively users can also use the same api as so.

from kedro import Pipeline, node

my_pipeline = Pipeline([node(create_cars, None, "cars")])

At a minimum I feel that the very core features that every kedro user needs should be at the top of the api.

  • Pipeline
  • node

The next priority would be those used by hooks authors.

  • KedroSession
  • ConfigLoader

Finally the full scope of what should be considered to be made easy to reach should be every occurance of from kedro import x in the project template.

@WaylonWalker WaylonWalker added the Issue: Feature Request New feature or improvement to existing feature label Mar 5, 2021
@limdauto
Copy link
Contributor

limdauto commented Mar 6, 2021

I think this is a great idea.

Just thinking out loud here: one thing that will land in 0.17.2 is the ability to import user-defined pipelines as follows:

from kedro.framework.project import pipelines

This pipelines reference will be bounded with the pipelines value defined by users in their project at run time. Hopefully this will be the way forward to access other user-defined values such as catalog and config_loader as well (we are still looking for use cases to support making this move wholesale, so suggestion is very welcome). We do have a big use case that needs pipelines to be importable though, rather than accessing through the context as it is at the moment.

So essentially there will 2 things you can to import from kedro:

  • Things you define in your project, e.g. settings, pipelines
  • Library components define by kedro, e.g. Pipeline, node(), pipeline(), etc.

For the first one, importing from kedro.framework.project makes sense because it reads as "importing pipelines from my project". I wonder if we can come up with a similar API for the second one?

@WaylonWalker
Copy link
Contributor Author

Do other frameworks do this? I feel that its a bit confusing to get project defined objects from kedro rather than from my_project import settings, pipelines What is the benefit of importing from kedro rather than from my_project. I do not have any familiarity with a framwork that behaves this way, so it might be more common than I realize.

@mzjp2
Copy link
Contributor

mzjp2 commented Mar 7, 2021

Do other frameworks do this?

Django is a (pretty big) example that I've seen work really well. You define your settings in project/setting.py and then use them in the rest of your app by doing:

# project/settings.py
MY_SETTING = 1 if os.environ["env"] == "production" else 0

# project/any_file.py
from django.conf import settings

if settings.MY_SETTING == 1:
  blah

with this justification (that I've seen work well in practice):

Also note that your code should not import from either global_settings or your own settings file. django.conf.settings abstracts the concepts of default settings and site-specific settings; it presents a single interface. It also decouples the code that uses settings from the location of your settings

@limdauto
Copy link
Contributor

limdauto commented Mar 7, 2021

The main reason why we have to provide these imports on the kedro namespace and not the user-defined namespace is because we have to refer to the user-defined objects in the framework as well. Kedro doesn't know which value my_project will be ahead of time.

As @mzjp2 pointed out, this approach is very similar to Django's and dynaconf's.

pull bot pushed a commit to vishalbelsare/kedro that referenced this issue Apr 4, 2021
Pinning setuptools's upper bound will cause conflict with dependencies in user's projects that don't pin it. For example, consider a user's project which has the following in their requirements.txt, which is very common:

```
ipython
kedro
```

Ipython also specifies setuptools as a dependency without an upper bound and it is always listed before Kedro in alphabetical order. When pip tries to resolve dependency, at the moment, since it only performs conflicts resolution in a "first-encounter wins" manner, it will register latest setuptools from IPython as a dependency, which will conflict with Kedro's pinned version. If user moves IPython below Kedro, it should work correctly again.

So to be absolutely certain, this commit unpins setuptools.
@lorenabalan lorenabalan changed the title Simplify api heirarchy Simplify api hierarchy Apr 19, 2021
@lorenabalan lorenabalan added the pinned Issue shouldn't be closed by stale bot label Apr 21, 2021
@WaylonWalker
Copy link
Contributor Author

This is a very interesting approach that is a mix of mind-blowing and confusing to me, mostly because I have not used a framework that does this and not something I expect. It's definitely a big mental shift to import something that I created starting from from kedro..

@lorenabalan lorenabalan removed the pinned Issue shouldn't be closed by stale bot label Mar 2, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label May 16, 2022
@antonymilne
Copy link
Contributor

antonymilne commented Jul 1, 2022

I think we should make this happen. pipeline (note lowercase P) and node are now the recommended user-facing API, and we can make this recommendation clearer by exposing these (but not the underlying classes) as a higher level namespace.

As @WaylonWalker says, we don't have to work everything out now, but pipeline and node would be a great starting place as they seem very stable. There's also a suggestion to alias kedro.extras.extensions.ipython at a higher level, like kedro.ipython.

Overall the only question in my mind here is what exactly the structure should be:

  1. kedro.node, kedro.pipeline as per the initial suggestion
  2. kedro.something.node, kedro.something.pipeline as per @limdauto's idea that imports from the kedro library would contrast with imports from kedro.framework.project (maybe kedro.library.node, kedro.library.pipeline? I don't like that though)

1 seems easiest for users to remember, and I don't imagine us ending up with too many components at that top level, but maybe 2 is more correct somehow. I don't know - thoughts very welcome! Maybe @deepyaman @noklam?

Given this issue has an unprecedented number of positive reactions, it's clearly popular and so it would be nice if we could move forward with it.

@noklam
Copy link
Contributor

noklam commented Jul 1, 2022

@AntonyMilneQB I think there are definitely opportunities to simplify the kedro, I agree with kedro.extras.extensions.ipython together with the line magic need to be simplified.

I am not sure though if the original issue suggests changing our folder structure. I thought it can be done with a kedro/__init__.py for the most common kedro object. This comes with the drawback that importing any sub-module will also import whatever defined the kedro/__init__.py.

This is more inline with from kedro import node (What the issue suggests) since having a kedro.node module will still resulted in from kedro.node import node.

We should also do something with __all__, as we don't really want people to import Node but node only. So even when the user do from kedro.pipeline.node import *.

In reality, I don't know how often people actually need to import these stuff, but the most useful one for me is inside an interactive environment like Jupyter, so I can do kedro. + Tab to discover what I need to import. There are some comments above mention catalog too, I think we can also make auto-discovery happens, user should be able to do catalog. + Tab and see the datasets without checking catalog.yml, this will be valuable to interactive workflow.

@limdauto
Copy link
Contributor

limdauto commented Jul 1, 2022

I think the autosuggestion of dataset names in the catalog in an interactive workflow is already available. I think @datajoely added it iirc.

Re API: I have been doing a lot of Django lately and they expose some commonly used functions through a shortcuts namespace: https://docs.djangoproject.com/en/4.0/topics/http/shortcuts/ -- I personally don't like it too much because I hardly ever type in the import statement myself anymore (always the IDE doing it), but could be an option.

@noklam
Copy link
Contributor

noklam commented Jul 1, 2022

The last time I tried I don't remember seeing the datasets shortcuts, maybe I did something wrong there.

I think there are some gaps between using IDE and Jupyter/Databricks, it doesn't come with things like IntelliSense / Pylance that guess the import for you.

@limdauto
Copy link
Contributor

limdauto commented Jul 1, 2022

A bit off topic but @noklam if you are interested in IDE/Notebook tooling I had this project that I haven't found the time to develop further: https://github.com/Kedro-Zero-to-Hero/kedro-lsp -- if you are interested, pls let me know.

@datajoely
Copy link
Contributor

My view is that LSP is the right solution for all of this :)

@antonymilne
Copy link
Contributor

@noklam - sorry I wasn't clear. What I meant by kedro.node was exactly from kedro import node rather than a new kedro.node package. I don't think we should be changing any of our folder structure here (that would be a huge breaking change for a start), just aliasing things so you can import from the top level.

@limdauto that's interesting about shortcuts. I don't think I like the idea of kedro.shortcuts, but I also very rarely type in the import myself due to IDE autocompletion. Do you think it would be bad to make these things available at top level like from kedro import node? Presumably Django decided against it for some reason in favour of a dedicated shortcuts namespace, but they're a much bigger package.

@noklam
Copy link
Contributor

noklam commented Jul 1, 2022

@limdauto This is something that I have no experience but it looks very interesting! Any recommended resource for a complete beginner :) I'll try to look at it sometimes

@limdauto
Copy link
Contributor

limdauto commented Jul 2, 2022

@noklam I think kedro-lsp is a PoC so it's dead simple and therefore could be a good place to start. Apart from that, I think this is a good guide to language server protocol: https://microsoft.github.io/language-server-protocol/#:~:text=A%20Language%20Server%20is%20meant,servers%20and%20development%20tools%20communicate.

In the end of the day it's a JSON-RPC server that responds to requests from an IDE.

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 6, 2022
@merelcht
Copy link
Member

merelcht commented Jul 21, 2022

Notes from Technical Design session 20/7:

Discussion

This issue was discussed during a Technical Design session and the following points where brought up:

The idea of moving our imports to higher level or even top level for the core Kedro components, sounds like a good idea, but it depends on a couple of things:

  • If you do from kedro import ... , would it import everything under this path? This wouldn't be desirable so we need to investigate if this is the case.
  • Investigate which other packages do top-level imports and how they handle it.

Instead of moving imports all the way to top-level we could have something like a "shortcuts" or "library" section, which is what Django is doing.

  • Why is Django doing it this way?

  • How deep are the imports? Is it worth changing it? E.g. if an import currently is from kedro.pipelines import Pipeline making it from kedro.shortcuts import Pipeline wouldn't improve anything.

  • If we decide to move components to higher level imports which components would that be and which components from Kedro would we want to be importable from the top level? One good candidate for change is the line magic imports. These are very long and could become from kedro.ipython

  • Is this just for the interactive workflow, (IDE can autocomplete), if so can we put these in ipython extension?

@noklam
Copy link
Contributor

noklam commented Aug 26, 2022

If you do from kedro import ... , would it import everything under this path? This wouldn't be desirable so we need to investigate if this is the case.

I am pretty sure this is the case and it's just how Python works, and there isn't much we can do here.

For example if you have a __init__.py which just has 1 line.

import kedro.xxx.yyy.zzz

Everytime you do import kedro, it will also import kedro.xxx and kedro.xxx.yyy and kedro.xxx.yyy.zzz.
Equally when people do from xxx.yyy.zzz import something, it doesn't save any memory magically as the top-level library is imported, it just import it into a local namespace and optionally have an alias.

Direct Import

image

From import

image

@deepyaman
Copy link
Member

Vomiting some initial thoughts (based on what I remember):

  • A lot of libraries have a top-level api module in the package that defines the public API (e.g. from kedro.api import pipeline).
  • Some larger libraries take this further by having api modules for each submodule. Then, the higher-level APIs can import * from each of the submodules.
    • Personal preference: I like when the imports from each submodule are explicit, instead of using import *, because that makes it really difficult to track where some things are actually defined IMO.
  • There are also libraries like https://pypi.org/project/atpublic/ to keep __all__ more maintainable, though I haven't used it (and honestly only seen it used in one place).

@astrojuanlu
Copy link
Member

Importing everything in kedro/__init__.py has the potential of making initialization even slower #1476

And, as mentioned in #712 (comment), some imports are already only 1 level deep.

At the moment (2.5 years after this issue was opened), pipeline code does the following:

from kedro.pipeline import Pipeline, node, pipeline

from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles

I don't think that's excessively verbose, and changing it to

from kedro.api import pipeline, node

certainly does not seem like a huge life saver compared to the churn it would cause.

The issue of tab-completion confusing users is being tracked in #2805.

I am voting to close this one.

@astrojuanlu
Copy link
Member

We decided not to do this for now, we are open to reconsidering this in the future. If you'd like to see this happening, please drop a comment here.

@astrojuanlu astrojuanlu closed this as not planned Won't fix, can't repro, duplicate, stale Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Archived in project
Development

No branches or pull requests