
[KED-2004] Manage hook_manager lifecycle in session #1153

Merged
merged 24 commits into from
Feb 4, 2022

Conversation

merelcht
Member

@merelcht merelcht commented Jan 13, 2022

Description

Back in August 2020 it was discovered that the global hook_manager could hold out-of-date hooks. To resolve this, _clear_hook_manager was added to the tests and the ipython workflow. It was also suggested at the time that it might be a good idea to have a hook_manager per session, discarded when the KedroSession was finished.

Development notes

Changes I made:

  • Instead of registering hooks in the configure_project call (which was called from e.g. KedroSession.create(), bootstrap_project), this now happens when a new KedroSession is instantiated.
  • When the KedroSession is closed the hook manager gets cleared.
  • KedroSession passes self._hook_manager to KedroContext and Runner.run().

Problem with this implementation

This implementation works fine, except when you use the ParallelRunner and have a plugin installed with hook implementations. Custom hooks created inside your project work, but hooks coming from an installed plugin do not. And the plugin hooks work fine if you don't use the ParallelRunner. I've added a test that uncovers this issue.
This happens because the PluginManager isn't serialisable: it is set to None when it gets pickled, which causes issues down the line.
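The failure mode can be reproduced with any object that holds unpicklable state; pluggy's PluginManager keeps dynamically built hook-caller objects internally. A minimal stdlib stand-in (not the actual kedro or pluggy code):

```python
import pickle

class FakeHookManager:
    """Stand-in for pluggy's PluginManager: its hook callers are not picklable."""
    def __init__(self):
        # A lambda stands in for the dynamically built hook-caller objects.
        self.hook = lambda: "before_node_run"

try:
    pickle.dumps(FakeHookManager())
    serialisable = True
except (pickle.PicklingError, AttributeError, TypeError):
    serialisable = False

print(serialisable)  # False: the manager cannot cross a process boundary
```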

👩🏼‍🔧 Fix: the solution to the above issue is to create a new hook_manager instance when using multiprocessing in the ParallelRunner.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@merelcht merelcht changed the title Ked 2004 hook manager [KED-2004] Manage hook_manager() lifecycle in session Jan 13, 2022
Comment on lines -184 to -187
# set up all hooks so we can discover all pipelines
hook_manager = get_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Member Author

Moved this logic to happen when a new session is instantiated instead.

Contributor

My theory is that this is why the code fails when running with ParallelRunner. When we use ParallelRunner, we make sure each subprocess looks the same, even in spawn mode. That's why we call _bootstrap_subprocess, which configures a) the logging to be the same and b) the project, via configure_project. configure_project also ensured that the hook manager had all the right hooks set up. Now that we don't do this here anymore, I believe the code should be replicated in _bootstrap_subprocess.
Not entirely sure why the pluggy hook manager doesn't get pickled/unpickled to the same object.

Member Author

Yes, this is exactly what's happening! On first try it looked like the PluginManager could be pickled with the latest pluggy version, but that wasn't actually the case.

@merelcht merelcht changed the title [KED-2004] Manage hook_manager() lifecycle in session [KED-2004] Manage hook_manager lifecycle in session Jan 13, 2022
@lorenabalan
Contributor

For transparency I'll leave here the same comment from our conversation:

I think we might actually be fine with the hook manager. I was worried it would leak across projects or across projects with the same name, but I haven't been able to prove that. The only remaining issue was with the interactive workflow: if you change the hooks inside an ipython/jupyter session and run using session.run(), it will still run with the old hooks. Doing %reload_kedro doesn't help either, as it just reloads from the written code files, so you are still running with old hooks.

But I think that's okay; updating settings.HOOKS from an interactive session is a very niche thing to do, and arguably should be discouraged. Users can just update settings.py and reload the ipython session every time. I'm still thinking if there's any case where having the same global hook manager over multiple different sessions is really bad, but can't think of any as of yet. I liked the idea of having one manager per session; intuitively it made sense to me. But given we'd have to pass it around, like you said, maybe it's not as good as it could be.

TL;DR: I agree that maybe it's not the best time to make this change, and we can revisit at another time, but more for conceptual reasons rather than it actually being a sore problem.

@antonymilne
Contributor

antonymilne commented Jan 17, 2022

I've spent a while thinking about this and I am also not sure about several things... So this isn't going to be a very useful review, but I'll just list my various questions and comments here.

  1. I'm struggling to imagine a case where having out-of-date hooks would be a big issue. @lorenabalan did you have an example in mind here that would help illustrate? "For example if they have a code workflow and an interactive workflow (where they play around with different hooks), one may leak into the other." - you're thinking of someone in kedro jupyter doing hook_manager = get_hook_manager(), adding hooks to it and doing some trial kedro runs, then adding a new hook, doing another kedro run, etc.?

  2. Unless I'm missing something, this seems like a pretty edge case to me, so I basically wouldn't worry about it too much either way. The current solution seems OK to me, but clearing hook managers when you close the session does seem tidier.

  3. "I'm assuming we don't want to go and pass on/reference the hook manager instance from the session so that's why I didn't touch the calls." My initial instinct was that if the hook manager is something that's started at the beginning of a session and cleared at the end, then it would naturally be good as an attribute of the session. But given that the code that uses the hook manager (like the runner) doesn't have access to the session, this wouldn't work I guess?

Edit: looks like Lorena commented at exactly the same time as me. Agreed with everything she says 👍 Having one manager per session makes sense to me also, but I don't immediately see a nice way of doing it.

@merelcht
Member Author

Thanks both for your thoughts on this!

@AntonyMilneQB on this one:

  1. "I'm assuming we don't want to go and pass on/reference the hook manager instance from the session so that's why I didn't touch the calls." My initial instinct was that if the hook manager is something that's started at the beginning of a session and cleared at the end, then it would naturally be good as an attribute of the session. But given that the code that uses the hook manager (like the runner) doesn't have access to the session, this wouldn't work I guess?

The problem is that, indeed, the code that fetches the hook manager (e.g. Runner and KedroContext) doesn't have access to the session, and it wouldn't be desirable for the session to be passed to these classes.

@limdauto
Contributor

My 2 cents: this is the conceptually correct thing to do. There used to be two kinds of hooks: registration and life-cycle. Managing them using the same hook manager was a mistake. Now that registration hooks are gone, scoping the life-cycle hook manager to a session makes sense because:

  • Life-cycle hooks are only ever triggered along the execution timeline, which is managed by a session. It doesn't exist without a session.
  • Each life-cycle hook should actually have access to the session (conceptually). That should replace all of the use cases for get_current_session(). For example, if I want to track the run status of each node so I can visualise it in viz, the natural place is writing run status to the session store in after_node_run. But I don't have access to the session in after_node_run and can't write to the store.

@antonymilne
Contributor

@limdauto what you say definitely makes sense. Given that get_current_session was removed, what would be the right way to give hooks access to the session though? Would we need to pass session into the runner?

@limdauto
Contributor

limdauto commented Jan 18, 2022

@AntonyMilneQB:

Step 1: Add session to each hook spec.
Step 2: If hook_manager is managed by session, when we invoke the hook, we can easily do it with something like session.hook_manager.hook.before_node_run(..., session)
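The two steps could look roughly like this using pluggy directly (assuming pluggy is installed; the hook name mirrors kedro's specs, but the session argument is the hypothetical addition being discussed, and the classes here are illustrative):

```python
import pluggy

hookspec = pluggy.HookspecMarker("kedro")
hookimpl = pluggy.HookimplMarker("kedro")

class NodeSpecs:
    @hookspec
    def before_node_run(self, node, session):  # step 1: session added to the spec
        """Called before a node runs; now receives the owning session."""

class StatusTracker:
    @hookimpl
    def before_node_run(self, node, session):
        return f"{node} running in {session}"

manager = pluggy.PluginManager("kedro")
manager.add_hookspecs(NodeSpecs)
manager.register(StatusTracker())

# Step 2: the session owns the manager and passes itself along when invoking.
results = manager.hook.before_node_run(node="train_model", session="session-1")
print(results)
```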

@antonymilne
Contributor

@limdauto makes sense, but that would mean passing session into the runner. Given that we currently do

with KedroSession.create(metadata.package_name) as session:
    session.run()

that sounds a bit strange to me, but I just might be misunderstanding what should have access to what. 🤔

@idanov
Member

idanov commented Jan 18, 2022

I think the long-term solution is to create the hook_manager in the session and then pass it on to all other classes that require it upon their creation. The current state came from making it effectively a global variable, and as with all global variables, it introduced a lot of hidden interdependencies between different classes and unclear lifecycles. It doesn't help that the hooks themselves are in a way global (at least that's how pluggy shows how to use them in their docs).

Nevertheless, I did a quick search here on GitHub for where get_hook_manager is being used, and it seems that it's only in the Runner, the KedroContext and the KedroSession (I deliberately omit the part where we register the hooks, i.e. in configure_project, and where we need to clean them, i.e. in the IPython extension).

So if we were to map out all the actors here, we have:

  • The initialiser of the hook manager (configure_project)
  • The clients of the hook manager (KedroSession, KedroContext and Runner)
  • The unexpected side-effect victim (IPython extension)

One thing we can notice here is that the lifecycles of both the KedroContext and the Runner are completely tied to, and in fact entirely determined by, the KedroSession. So to me the most obvious solution is to give KedroSession ownership of the hook manager (as @MerelTheisenQB has done in this draft PR) and then have it pass the manager on to the other clients in their constructor methods or by some other more explicit means, thus avoiding the need for a global hook manager.
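That ownership model could be sketched like this; the class names match kedro's, but the bodies are bare stand-ins for illustration only:

```python
class _HookManager:
    """Stand-in for the pluggy PluginManager the session would create."""

class KedroContext:
    def __init__(self, hook_manager):
        self._hook_manager = hook_manager  # injected, not fetched from a global

class Runner:
    def run(self, pipeline, hook_manager):
        return hook_manager  # would trigger life-cycle hooks during the run

class KedroSession:
    def __init__(self):
        # The session is the single owner: the manager's lifetime equals the
        # session's lifetime, so no global clearing is needed.
        self._hook_manager = _HookManager()

    def load_context(self):
        return KedroContext(hook_manager=self._hook_manager)

    def run(self, pipeline=None):
        return Runner().run(pipeline, hook_manager=self._hook_manager)

session = KedroSession()
# All clients see the one manager owned by this session.
assert session.load_context()._hook_manager is session.run()
```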

No other actors should be involved in this, unless they are also clients of the object. Eliminating the initialiser and tying the lifetime of the hook manager to that of KedroSession will hopefully be sufficient to remove the need to clear the hooks on session closure or IPython reloading: who cares if those hooks are registered in an object which is never used again? I guess only the garbage collector does 😃

All of this is subject to pluggy allowing multiple PluginManager instances with the same name without mixing them up (which I hope it does). If that's not the case, we can't escape the global variable forced on us, but it means that we can always clear the PluginManager upon creation, lest someone registered something before the currently instantiated KedroSession.

I think we should do these changes now, in order not to kick the can down the road to 0.19, since these would be breaking changes.

FOLLOW-UP:
I just checked whether you can have two instances of PluginManager with the same name and different sets of registered plugins, and you can indeed. So pluggy keeps no global state, i.e. we should not keep a global instance of the PluginManager either, and should only keep it in the session.
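That check is easy to reproduce (assuming pluggy is installed; this mirrors the experiment described above, not kedro code):

```python
import pluggy

class MyPlugin:
    pass

pm1 = pluggy.PluginManager("kedro")
pm2 = pluggy.PluginManager("kedro")  # same project name, separate instance

plugin = MyPlugin()
pm1.register(plugin, name="my_plugin")

print(pm1.get_plugin("my_plugin") is plugin)  # True
print(pm2.get_plugin("my_plugin") is None)    # True: no shared global state
```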

@limdauto
Contributor

@idanov very well written. I just want to point out that we already have two pluggy manager instances: one for CLI hooks and one for life-cycle hooks, so all of this is completely possible.

@merelcht merelcht marked this pull request as ready for review February 1, 2022 17:33
Comment on lines 123 to 125
hook_manager = create_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Member Author

Create a new hook_manager when doing multiprocessing, because the PluginManager can't be serialised.

Contributor

PluginManager can't be serialised

Is that why we need it in fork mode as well? Or can we move it up under the if branch?

We should also update the docstrings; they still mention "activating the session", which we don't do anymore.

Member Author

Yes, the problems from the PluginManager not being serialisable happen in all modes, because we try to do the serialisation further up the stack. I tested this by creating a PluginManager and calling pickle.dump() on it, and that fails.

And yes good point on the docstring!

"_get_pipelines_registry_callable",
return_value=mock_get_pipelines_registry_callable,
)
return mock_get_pipelines_registry_callable()
Member Author

I did some cleanup here, because this fixture already exists in conftest.py

Contributor

@lorenabalan lorenabalan left a comment

LOVE this! ❤️ 😍 Fantastic job!! 🔥 👏 👏 👏
Don't forget to add a few lines in the release notes about the breaking changes, like different signatures, public API, and the fact that the hook manager is no longer global, but unique per session.

Contributor

@limdauto limdauto left a comment

Amazing work. Love the fact that we are now on pluggy 1 as well 🎉

Contributor

@antonymilne antonymilne left a comment

Amazing work, especially getting it working with the parallel runner!!

Generally looks 🌟 but I just have a few questions which might be best answered by @idanov actually:

  1. Why do we actually need _clear_hook_manager at all - can't we get rid of it entirely? Since the hook manager is now contained within a session then I don't see why we would need to clean anything up, so we could simplify this even more. If I understand correctly, this is what Ivan meant when he previously said "Eliminating the initialiser and tying the lifetime of the hook manager with the one of KedroSession will hopefully be sufficient to remove the need to clear the hooks on session closure or IPython reloading - who cares if those hooks are registered in an object which is no longer being used ever?"

  2. The current clean-up strategy seems inconsistent in ipython, since session.close() doesn't get called there. Either we care about clearing the hook manager or we don't, but I think we want the same behaviour within a kedro run and kedro ipython? If, as I suspect, we don't in fact need _clear_hook_manager at all, then it's fine that we don't do any hook manager clean-up in ipython as you currently have it, but I wonder if it's still worth putting in session.close() just for consistency (since this also calls _deactivate_session and potentially saves to the session store).

  3. If in the future we do this to pass the session to hooks, does it mean passing the session to the Runner as well? Adding both hook manager and session arguments to the runner somehow feels a bit bloated to me, especially given these arguments get cascaded down to run_node etc. I'm perfectly happy with the changes made here and have no better method to propose, but I'm just wondering where this might take us in the future.

@merelcht
Copy link
Member Author

merelcht commented Feb 3, 2022

  1. Why do we actually need _clear_hook_manager at all - can't we get rid of it entirely?

Yes, I think you're right, I forgot about Ivan's comment. I will remove this.

  2. The current clean-up strategy seems inconsistent in ipython, since session.close() doesn't get called there. Either we care about clearing the hook manager or we don't, but I think we want the same behaviour within a kedro run and kedro ipython? If, as I suspect, we don't in fact need _clear_hook_manager at all, then it's fine that we don't do any hook manager clean-up in ipython as you currently have it, but I wonder if it's still worth putting in session.close() just for consistency (since this also calls _deactivate_session and potentially saves to the session store).

Where would you call session.close()? We create a session in reload_kedro, but we shouldn't close it there. Maybe the reason why we don't close the session is because there isn't really a good way to do it in the ipython flow?

  3. If in the future we do this to pass the session to hooks, does it mean passing the session to the Runner as well? Adding both hook manager and session arguments to the runner somehow feels a bit bloated to me, especially given these arguments get cascaded down to run_node etc. I'm perfectly happy with the changes made here and have no better method to propose, but I'm just wondering where this might take us in the future.

I think we need to discuss and design the proposal of passing the session to hooks in more detail. I don't really like the idea of passing the session to the Runner.

@antonymilne
Copy link
Contributor

antonymilne commented Feb 3, 2022

Where would you call session.close()? We create a session in reload_kedro, but we shouldn't close it there. Maybe the reason why we don't close the session is because there isn't really a good way to do it in the ipython flow?

Oh yes, of course. I was thinking that because _clear_hook_manager used to be called there, we should be doing session.close(), but looking at it again I see that _clear_hook_manager was called before session.create rather than at the end. I don't see any good way to put session.close() into the flow, so all good as you have it 👍

Contributor

@antonymilne antonymilne left a comment

⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐

@merelcht merelcht merged commit 93d01c8 into develop Feb 4, 2022
@merelcht merelcht deleted the KED-2004-hook-manager branch February 4, 2022 10:45