
[KED-2004] Manage hook_manager lifecycle in session #1153

Merged
merged 24 commits into from
Feb 4, 2022

Conversation

merelcht
Member

@merelcht merelcht commented Jan 13, 2022

Description

Back in August 2020 it was discovered that the global hook_manager could hold out-of-date hooks. To resolve this, _clear_hook_manager was added to the tests and the ipython workflow. It was also suggested at the time that it might be a good idea to have a hook_manager per session, discarded when the KedroSession was finished.

Development notes

Changes I made:

  • Instead of registering hooks in the configure_project call (which was called from e.g. KedroSession.create(), bootstrap_project), this now happens when a new KedroSession is instantiated.
  • When the KedroSession is closed the hook manager gets cleared.
  • KedroSession passes self._hook_manager to KedroContext and Runner.run().

Problem with this implementation

This implementation works fine, except when you use the ParallelRunner and have a plugin installed with hook implementations. Custom hooks created inside your project work, but hooks coming from an installed plugin do not. And the plugin hooks work fine if you don't use the ParallelRunner. I've added a test that uncovers this issue.
This happens because the PluginManager isn't serialisable: it is set to None when it gets pickled, which causes issues down the line.
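The failure mode can be reproduced with any object that holds unpicklable state; pluggy's PluginManager keeps dynamically built hook-caller objects internally. A minimal stdlib stand-in (not the actual kedro or pluggy code):

```python
import pickle

class FakeHookManager:
    """Stand-in for pluggy's PluginManager: its hook callers are not picklable."""
    def __init__(self):
        # A lambda stands in for the dynamically built hook-caller objects.
        self.hook = lambda: "before_node_run"

try:
    pickle.dumps(FakeHookManager())
    serialisable = True
except (pickle.PicklingError, AttributeError, TypeError):
    serialisable = False

print(serialisable)  # False: the manager cannot cross a process boundary
```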

👩🏼‍🔧 Fix: the solution to the above issue is to create a new hook_manager instance when using multiprocessing in the ParallelRunner.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@merelcht merelcht changed the title Ked 2004 hook manager [KED-2004] Manage hook_manager() lifecycle in session Jan 13, 2022
Comment on lines -184 to -187
# set up all hooks so we can discover all pipelines
hook_manager = get_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Member Author

Moved this logic to happen when a new session is instantiated instead.

Contributor

My theory is that this is why the code fails when running with ParallelRunner. When we use ParallelRunner, we make sure each subprocess looks the same, even in spawn mode. That's why we call _bootstrap_subprocess, which configures a) the logging to be the same and b) the project, via configure_project. configure_project also ensured that the hook manager had all the right hooks set up. Now that we don't do this here anymore, I believe the code should be replicated in _bootstrap_subprocess.
Not entirely sure why the pluggy hook manager doesn't get pickled/unpickled to the same object.

Member Author

Yes, this is exactly what's happening! On first try it looked like the PluginManager could be pickled with the latest pluggy version, but that wasn't actually the case.

@merelcht merelcht changed the title [KED-2004] Manage hook_manager() lifecycle in session [KED-2004] Manage hook_manager lifecycle in session Jan 13, 2022
@lorenabalan
Contributor

For transparency I'll leave here the same comment from our conversation:

I think we might actually be fine with the hook manager. I was worried it would leak across projects or across projects with the same name, but I haven't been able to prove that. The only remaining issue was with the interactive workflow: if you change the hooks inside an ipython/jupyter session and run using session.run(), it will still run with the old hooks. Doing %reload_kedro doesn't help either, as it just reloads from the written code files, so you are still running with old hooks.

But I think that's okay; updating settings.HOOKS from an interactive session is a very niche thing to do, and arguably should be discouraged. Users can just update settings.py and reload the ipython session every time. I'm still thinking if there's any case where having the same global hook manager over multiple different sessions is really bad, but can't think of any as of yet. I liked the idea of having one manager per session; intuitively it made sense to me. But given we'd have to pass it around, like you said, maybe it's not as good as it could be.

TL;DR: I agree that maybe it's not the best time to make this change, and we can revisit at another time, but more for conceptual reasons rather than it actually being a sore problem.

@antonymilne
Contributor

antonymilne commented Jan 17, 2022

I've spent a while thinking about this and I am also not sure about several things... So this isn't going to be a very useful review, but I'll just list my various questions and comments here.

  1. I'm struggling to imagine a case where having out-of-date hooks would be a big issue. @lorenabalan did you have an example in mind here that would help illustrate? "For example if they have a code workflow and an interactive workflow (where they play around with different hooks), one may leak into the other." - you're thinking of someone in kedro jupyter doing hook_manager = get_hook_manager(), adding hooks to it and doing some trial kedro runs, then adding a new hook, doing another kedro run, etc.?

  2. Unless I'm missing something, this seems like a pretty edge case to me, so I basically wouldn't worry about it too much either way. The current solution seems OK to me, but clearing hook managers when you close the session does seem tidier.

  3. "I'm assuming we don't want to go and pass on/reference the hook manager instance from the session so that's why I didn't touch the calls." My initial instinct was that if the hook manager is something that's started at the beginning of a session and cleared at the end, then it would naturally be good as an attribute of the session. But given that the code that uses the hook manager (like the runner) doesn't have access to the session, this wouldn't work I guess?

Edit: looks like Lorena commented at exactly the same time as me. Agreed with everything she says 👍 Having one manager per session makes sense to me also, but I don't immediately see a nice way of doing it.

@merelcht
Member Author

Thanks both for your thoughts on this!

@AntonyMilneQB on this one:

  1. "I'm assuming we don't want to go and pass on/reference the hook manager instance from the session so that's why I didn't touch the calls." My initial instinct was that if the hook manager is something that's started at the beginning of a session and cleared at the end, then it would naturally be good as an attribute of the session. But given that the code that uses the hook manager (like the runner) doesn't have access to the session, this wouldn't work I guess?

The problem is that, indeed, the code that fetches the hook manager (e.g. Runner and KedroContext) doesn't have access to the session, and it wouldn't be desirable for the session to be passed to these classes.

@limdauto
Contributor

My 2 cents: this is the conceptually correct thing to do. There used to be two kinds of hooks: registration and life-cycle. Managing them using the same hook manager was a mistake. Now that registration hooks are gone, scoping the life-cycle hook manager to a session makes sense because:

  • Life-cycle hooks are only ever triggered along the execution timeline, which is managed by a session. It doesn't exist without a session.
  • Each life-cycle hook should actually have access to the session (conceptually). That should replace all of the use cases for get_current_session(). For example, if I want to track the run status of each node so I can visualise it in viz, the natural place is writing run status to the session store in after_node_run. But I don't have access to the session in after_node_run and can't write to the store.

@antonymilne
Contributor

@limdauto what you say definitely makes sense. Given that get_current_session was removed, what would be the right way to give hooks access to the session though? Would we need to pass session into the runner?

@limdauto
Contributor

limdauto commented Jan 18, 2022

@AntonyMilneQB:

Step 1: Add session to each hook spec.
Step 2: If hook_manager is managed by session, when we invoke the hook, we can easily do it with something like session.hook_manager.hook.before_node_run(..., session)
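The two steps could look roughly like this using pluggy directly (assuming pluggy is installed; the hook name mirrors kedro's specs, but the session argument is the hypothetical addition being discussed, and the classes here are illustrative):

```python
import pluggy

hookspec = pluggy.HookspecMarker("kedro")
hookimpl = pluggy.HookimplMarker("kedro")

class NodeSpecs:
    @hookspec
    def before_node_run(self, node, session):  # step 1: session added to the spec
        """Called before a node runs; now receives the owning session."""

class StatusTracker:
    @hookimpl
    def before_node_run(self, node, session):
        return f"{node} running in {session}"

manager = pluggy.PluginManager("kedro")
manager.add_hookspecs(NodeSpecs)
manager.register(StatusTracker())

# Step 2: the session owns the manager and passes itself along when invoking.
results = manager.hook.before_node_run(node="train_model", session="session-1")
print(results)
```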

@antonymilne
Contributor

@limdauto makes sense, but that would mean passing session into the runner. Given that we currently do

with KedroSession.create(metadata.package_name) as session:
    session.run()

that sounds a bit strange to me, but I just might be misunderstanding what should have access to what. 🤔

@idanov
Member

idanov commented Jan 18, 2022

I think the long-term solution is to create the hook_manager in the session and then pass it on to all other classes that require it upon their creation. The current state came from making it effectively a global variable, and as with all global variables, it introduced a lot of hidden interdependencies between different classes and unclear lifecycles. It doesn't help that the hooks themselves are in a way global (at least that's how pluggy shows how to use them in their docs).

Nevertheless, I did a quick search here on GitHub for where get_hook_manager is being used, and it seems that it's only in the Runner, the KedroContext and the KedroSession (I deliberately omit the part where we register the hooks, i.e. in configure_project, and where we need to clean them, i.e. in the IPython extension).

So if we were to map out all the actors here, we have:

  • The initialiser of the hook manager (configure_project)
  • The clients of the hook manager (KedroSession, KedroContext and Runner)
  • The unexpected side-effect victim (IPython extension)

One thing we can notice here is that the lifecycles of both the KedroContext and the Runner are completely tied to, and in fact entirely determined by, the KedroSession. So to me the most obvious solution is to give KedroSession ownership of the hook manager (as @MerelTheisenQB has done in this draft PR) and then have it pass the manager on to the other clients in their constructor methods or by some other more explicit means, thus avoiding the need for a global hook manager.
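That ownership model could be sketched like this; the class names match kedro's, but the bodies are bare stand-ins for illustration only:

```python
class _HookManager:
    """Stand-in for the pluggy PluginManager the session would create."""

class KedroContext:
    def __init__(self, hook_manager):
        self._hook_manager = hook_manager  # injected, not fetched from a global

class Runner:
    def run(self, pipeline, hook_manager):
        return hook_manager  # would trigger life-cycle hooks during the run

class KedroSession:
    def __init__(self):
        # The session is the single owner: the manager's lifetime equals the
        # session's lifetime, so no global clearing is needed.
        self._hook_manager = _HookManager()

    def load_context(self):
        return KedroContext(hook_manager=self._hook_manager)

    def run(self, pipeline=None):
        return Runner().run(pipeline, hook_manager=self._hook_manager)

session = KedroSession()
# All clients see the one manager owned by this session.
assert session.load_context()._hook_manager is session.run()
```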

No other actors should be involved in this, unless they are also clients of the object. Eliminating the initialiser and tying the lifetime of the hook manager to that of KedroSession will hopefully be sufficient to remove the need to clear the hooks on session closure or IPython reloading: who cares if those hooks are registered in an object which is never used again? I guess only the garbage collector does 😃

All of this is subject to pluggy allowing multiple PluginManager instances with the same name without mixing them up (which I hope it does). If that's not the case, we can't escape the global variable forced on us, but it means that we can always clear the PluginManager upon creation, lest someone registered something before the currently instantiated KedroSession.

I think we should do these changes now, in order not to kick the can down the road to 0.19, since these would be breaking changes.

FOLLOW-UP:
I just checked whether you can have two instances of PluginManager with the same name and different sets of registered plugins, and you can indeed. So pluggy keeps no global state, i.e. we should not keep a global instance of the PluginManager either, and should only keep it in the session.
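That check is easy to reproduce (assuming pluggy is installed; this mirrors the experiment described above, not kedro code):

```python
import pluggy

class MyPlugin:
    pass

pm1 = pluggy.PluginManager("kedro")
pm2 = pluggy.PluginManager("kedro")  # same project name, separate instance

plugin = MyPlugin()
pm1.register(plugin, name="my_plugin")

print(pm1.get_plugin("my_plugin") is plugin)  # True
print(pm2.get_plugin("my_plugin") is None)    # True: no shared global state
```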

@limdauto
Contributor

@idanov very well written. I just want to point out that we already have two pluggy manager instances: one for CLI hooks and one for life-cycle hooks, so all of this is completely possible.

@merelcht merelcht marked this pull request as ready for review February 1, 2022 17:33
Comment on lines 123 to 125
hook_manager = create_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Member Author

Create a new hook_manager when doing multiprocessing, because the PluginManager can't be serialised.

Contributor

PluginManager can't be serialised

Is that why we need it in fork mode as well? Or can we move it up under the if branch?

We should also update the docstrings; they still mention "activating the session", which we don't do anymore.

Member Author

Yes, the problems from the PluginManager not being serialisable happen in all modes, because we try to do the serialisation further up the stack. I tested this by creating a PluginManager and calling pickle.dump() on it, and that fails.

And yes good point on the docstring!

"_get_pipelines_registry_callable",
return_value=mock_get_pipelines_registry_callable,
)
return mock_get_pipelines_registry_callable()
Member Author

I did some cleanup here, because this fixture already exists in conftest.py

Contributor

@lorenabalan lorenabalan left a comment

LOVE this! ❤️ 😍 Fantastic job!! 🔥 👏 👏 👏
Don't forget to add a few lines in the release notes about the breaking changes, like different signatures, public API, and the fact that the hook manager is no longer global, but unique per session.

Contributor

@limdauto limdauto left a comment

Amazing work. Love the fact that we are now on pluggy 1 as well 🎉

Contributor

@antonymilne antonymilne left a comment

Amazing work, especially getting it working with the parallel runner!!

Generally looks 🌟 but I just have a few questions which might be best answered by @idanov actually:

  1. Why do we actually need _clear_hook_manager at all - can't we get rid of it entirely? Since the hook manager is now contained within a session then I don't see why we would need to clean anything up, so we could simplify this even more. If I understand correctly, this is what Ivan meant when he previously said "Eliminating the initialiser and tying the lifetime of the hook manager with the one of KedroSession will hopefully be sufficient to remove the need to clear the hooks on session closure or IPython reloading - who cares if those hooks are registered in an object which is no longer being used ever?"

  2. The current clean-up strategy seems inconsistent in ipython, since session.close() doesn't get called there. Either we care about clearing the hook manager or we don't, but I think we want the same behaviour within a kedro run and kedro ipython? If, as I suspect, we don't in fact need _clear_hook_manager at all, then it's fine that we don't do any hook manager clean-up in ipython as you currently have it, but I wonder if it's still worth putting in session.close() just for consistency (since this also calls _deactivate_session and potentially saves to the session store).

  3. If in the future we do this to pass the session to hooks, does it mean passing the session to the Runner as well? Adding both hook manager and session arguments to the runner somehow feels a bit bloated to me, especially given these arguments get cascaded down to run_node etc. I'm perfectly happy with the changes made here and have no better method to propose, but I'm just wondering where this might take us in the future.

@merelcht
Copy link
Member Author

merelcht commented Feb 3, 2022

  1. Why do we actually need _clear_hook_manager at all - can't we get rid of it entirely?

Yes, I think you're right, I forgot about Ivan's comment. I will remove this.

  2. The current clean-up strategy seems inconsistent in ipython, since session.close() doesn't get called there. Either we care about clearing the hook manager or we don't, but I think we want the same behaviour within a kedro run and kedro ipython? If, as I suspect, we don't in fact need _clear_hook_manager at all, then it's fine that we don't do any hook manager clean-up in ipython as you currently have it, but I wonder if it's still worth putting in session.close() just for consistency (since this also calls _deactivate_session and potentially saves to the session store).

Where would you call session.close()? We create a session in reload_kedro, but we shouldn't close it there. Maybe the reason why we don't close the session is because there isn't really a good way to do it in the ipython flow?

  3. If in the future we do this to pass the session to hooks, does it mean passing the session to the Runner as well? Adding both hook manager and session arguments to the runner somehow feels a bit bloated to me, especially given these arguments get cascaded down to run_node etc. I'm perfectly happy with the changes made here and have no better method to propose, but I'm just wondering where this might take us in the future.

I think we need to discuss and design the proposal of passing the session to hooks in more detail. I don't really like the idea of passing the session to the Runner.

@antonymilne
Copy link
Contributor

antonymilne commented Feb 3, 2022

Where would you call session.close()? We create a session in reload_kedro, but we shouldn't close it there. Maybe the reason why we don't close the session is because there isn't really a good way to do it in the ipython flow?

Oh yes, of course. I was thinking that because _clear_hook_manager used to be called there, we should be doing session.close(), but looking at it again I see that _clear_hook_manager was called before session.create rather than at the end. I don't see any good way to put session.close() into the flow, so all good as you have it 👍

Contributor

@antonymilne antonymilne left a comment

⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐

@merelcht merelcht merged commit 93d01c8 into develop Feb 4, 2022
@merelcht merelcht deleted the KED-2004-hook-manager branch February 4, 2022 10:45