
Reconsider how editing environments works #886

Open
krassowski opened this issue Sep 19, 2024 · 11 comments
Labels
area: user experience 👩🏻‍💻 · impact: high 🟥 · needs: discussion 💬 · needs: investigation 🔎 · type: enhancement 💅🏼

Comments

@krassowski

Context

Currently, editing an environment:

  • creates a new build directory
  • swaps the symlink (roughly as sketched below)
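In rough pseudo-code terms, today's behaviour is something like the following. The paths, names, and the `mamba` call are illustrative assumptions, not conda-store's actual implementation:

```python
import subprocess
from pathlib import Path

# Hypothetical layout, used for illustration only.
BUILDS_DIR = Path("/opt/conda-store/builds")
ENVS_DIR = Path("/opt/conda-store/envs")


def rebuild_and_swap(env_name: str, new_build: str, spec_file: Path) -> None:
    """Simplified sketch of the current flow: solve into a fresh prefix, then repoint the symlink."""
    new_prefix = BUILDS_DIR / new_build
    subprocess.run(
        ["mamba", "env", "create", "--prefix", str(new_prefix), "--file", str(spec_file)],
        check=True,
    )

    # The environment name is a symlink that gets swapped to the new prefix,
    # so every path a running kernel has already imported from goes stale.
    link = ENVS_DIR / env_name
    tmp = ENVS_DIR / (env_name + ".tmp")
    tmp.symlink_to(new_prefix, target_is_directory=True)
    tmp.replace(link)
```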

This means that autoreloading does not work. For example, when used with Jupyter/IPython:

  • the kernelspec needs to be reloaded, hence the need for back-and-forth switching between environments to see changes applied; in my experience this wastes about 3 minutes per change to an environment (compared to a workflow without conda-store)
  • newly installed packages do not become available in a running kernel, which means that possibly hours of computation may be lost because the kernel needs to be restarted to pick up the smallest change in the env
  • changed packages cannot be picked up by the IPython autoreloader either

If it instead worked like this:

  • copy existing build to a new folder
  • the copied folder becomes the archival build version
  • update the environment:
    • ideally by updating in place rather than rebuilding from scratch
    • if need be, by rebuilding (which will still work, but will likely change some versions inadvertently if new versions were released; that is a separate issue)
  • there is no need to swap the symlink, so autoreloading and everything else works! (A rough sketch of this flow follows.)
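A minimal sketch of the proposed flow, under the same hypothetical layout as the sketch above; the in-place `mamba env update` call stands in for whatever incremental solve conda-store would actually perform:

```python
import shutil
import subprocess
from pathlib import Path

# Same hypothetical layout as the earlier sketch.
BUILDS_DIR = Path("/opt/conda-store/builds")


def archive_then_update_in_place(live_build: str, archive_build: str, spec_file: Path) -> None:
    """Copy the live build aside as the archival version, then update the live prefix in place.

    The path that kernelspecs and running kernels point at never changes, so
    autoreloading keeps working; only the archived copy is new on disk.
    """
    live_prefix = BUILDS_DIR / live_build
    archive_prefix = BUILDS_DIR / archive_build

    # 1. Preserve the current state as an archival build.
    shutil.copytree(live_prefix, archive_prefix, symlinks=True)

    # 2. Update the live prefix in place (ideally incrementally, rebuilding only if need be).
    subprocess.run(
        ["mamba", "env", "update", "--prefix", str(live_prefix), "--file", str(spec_file)],
        check=True,
    )

    # 3. No symlink swap: the environment name still resolves to live_prefix.
```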

Value and/or benefit

Many minutes to hours of productivity gained (or rather, not lost) for the use case of interactive environment creation by a senior data analyst.

Anything else?

No response

krassowski added the needs: triaging 🚦 label Sep 19, 2024
@krassowski
Author

@kcpevey mentioned to me that this may be a footgun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately if a shared env is changed by someone else, activating it after the change will cause the same issue.

And questioned the UX around user awareness:

What if you are running a notebook, stop to kick off a rebuild of the env which takes 20 minutes, and while that's going you keep working in the notebook. At some point, the env build is complete - what happens to your running notebook? Does the kernel remain on the old env until you restart the kernel? Does the user get a warning that the kernel has been replaced?

Here I would mention that auto-reloading is not enabled by default, and users who enable it know what they are doing. Also, rebuilding an env should not take 20 minutes (but it does). I do, however, agree that a notification that an environment build has completed should be shown when conda-store is used with JupyterLab, which is tracked in:

@dharhas
Member

dharhas commented Sep 19, 2024

As an FYI, historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

@dharhas
Member

dharhas commented Sep 19, 2024

But this ties in with another discussion I had about packaging at PyCon: we actually have multiple target audiences (devs, end users, etc.) for environment management, and we are using the same tools for all of them.

@dharhas
Member

dharhas commented Sep 19, 2024

newly installed packages do not become available in a running kernel, which means that possibly hours of computation may be lost because the kernel needs to be restarted to pick up the smallest change in the env

Is this actually a valid use case? How reliably does it work? For pure Python packages, maybe. To me it seems that if you change the underlying environment, all bets are off on whether your Python objects are even valid if an install changed something under the hood. Seems like a better option would be to make sure you serialize your results.

@krassowski
Author

Is this actually a valid use case?

Yes, in IPython, installing/updating packages via the %pip and %conda magics is a supported and valid use case. These magics warn that for some packages restarting the kernel may be required, but with autoreload on that is rarely the case (it is the case when updating non-pure-Python packages).
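For illustration, this is the kind of interactive flow being described; `somepkg` is a made-up package name, and only standard IPython magics are used:

```python
# In a running IPython/Jupyter kernel:
%load_ext autoreload
%autoreload 2                    # re-import changed modules before each cell runs

%pip install "somepkg==1.2.4"    # grab a specific bugfix release (example name)

import somepkg
# With autoreload on, pure-Python changes in somepkg are picked up without a
# kernel restart; non-pure-Python (compiled) changes still require one.
```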

How reliably does it work? For pure Python packages, maybe.

Very well, in my experience. And Databricks considers it a valid use case too; they are contributing enhancements in ipython/ipython#14500.

To me it seems that if you change the underlying environment, all bets are off on whether your Python objects are even valid if an install changed something under the hood.

This is my call to make as an experienced user. I can tell whether I will need to restart the kernel or not, and I often know what specific changes will be made. It is not for updating numpy from 1.x to 2.x; it is for grabbing patches for very specific bugfixes.

Seems like a better option would be to make sure you serialize your results.

No, I disagree here. This sounds like blaming the user, but in fact, even with the best serialization and caching, there are operations that always take time, like loading large files or training small/medium models. I don't think that using conda-store should be incompatible with data scientists analysing big data or training baseline/statistical models in notebooks. But maybe I misunderstand the target audience of conda-store.

@krassowski
Author

historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

On the other hand, the current conda-store approach leads to broken notebooks for data scientists who are not used to working with conda-store:

  • one time, my Python version was updated when I added a small unrelated package, because a new Python version had been released; it broke half of the packages I had installed
  • my numpy was upgraded to 2.0 while I was making an unrelated change to the env, because it had just been released, and it broke my code
  • another time, when I upgraded one dependency, it downgraded another dependency without telling me

Why not give advanced users a choice of whether to update in place or not? If the old environment is copied as an archival build, that carries zero risk, right?

Thinking about it, what I really miss is:

  • a) hot-reloading support (as discussed in this issue)
  • b) a confirmation screen saying what changes will be made so that I can adjust pins
  • c) an easy way to pin all packages to their current versions when modifying the environment (e.g. if numpy 1.2 was auto-installed, I would want to apply a "numpy>=1.2.0,<2.0" pin); see the sketch after this list
    • right now the version information is actually a bit hidden
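As a rough illustration of (c): pins could be populated from the currently installed versions, for example by parsing `conda list --json`. The helper below and its upper-bound policy are assumptions made for the sake of the example, not an existing conda-store feature:

```python
import json
import subprocess


def pins_from_installed(prefix: str) -> list[str]:
    """Build 'name>=current,<next-major' pins from whatever is installed in a prefix."""
    installed = json.loads(
        subprocess.run(
            ["conda", "list", "--prefix", prefix, "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
    )
    pins = []
    for pkg in installed:
        version = pkg["version"]
        major = version.split(".")[0]
        if major.isdigit():  # skip epoch/letter-style versions; this is only a sketch
            pins.append(f'{pkg["name"]}>={version},<{int(major) + 1}')
    return pins


# e.g. pins_from_installed("/opt/conda-store/envs/my-env") -> ['numpy>=1.2.0,<2', ...]
```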

@krassowski
Author

Part of the delay is that even after an environment is built, we need to wait for up to a minute for it to be refreshed on the nb_conda_kernels side: https://github.com/anaconda/nb_conda_kernels/blob/04c5fc605c08a4ced0cc45d2a6507dea40897600/nb_conda_kernels/manager.py#L18

So, as a user, I keep restarting the environment until it clicks. If my new/edited dependency is used lower down in the notebook, I can waste many minutes there.

For the interactive use case, we need to either rewrite nb_conda_kernels to watch kernelspec changes on disk, or emit an event from conda-store and make nb_conda_kernels refresh; a rough sketch of the file-watching option follows the snippet below. Interaction with nb_conda_kernels appears to be in scope, as it is included in the Dockerfile:

python=${python_version} nb_conda_kernels nodejs=18 yarn constructor \
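A sketch of the file-watching option mentioned above, using the third-party watchdog package; the `refresh_callback` is a hypothetical placeholder for however nb_conda_kernels would invalidate its cached kernelspec list, not its actual API:

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class KernelspecChangeHandler(FileSystemEventHandler):
    """Trigger a refresh whenever a kernel.json appears, changes, or disappears."""

    def __init__(self, refresh_callback):
        self._refresh = refresh_callback

    def on_any_event(self, event):
        if str(event.src_path).endswith("kernel.json"):
            # Refresh immediately instead of waiting out a fixed cache timeout.
            self._refresh()


def watch_kernelspecs(kernelspec_dir: str, refresh_callback) -> Observer:
    """Start watching a kernelspec directory tree for changes."""
    observer = Observer()
    observer.schedule(KernelspecChangeHandler(refresh_callback), kernelspec_dir, recursive=True)
    observer.start()
    return observer
```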

@peytondmurray
Contributor

peytondmurray commented Oct 3, 2024

Part of the goal of the 2024 conda-store roadmap is to eliminate some of the painfully slow tasks that users regularly encounter, and from the description this seems like a major UX annoyance. From what I can tell, this is particularly irritating because:

  • There's no user feedback to indicate when an environment is updated
  • Once it is updated, the new environment may have dependency versions that differ from what the user expects (as in the case where numpy 2.0 was released and picked up), with no feedback to the user
  • Swapping environments back and forth takes a long time because the kernelspec needs to be reloaded to see the applied changes
  • Running kernels can't take advantage of newly installed packages, so if you're partway through executing a notebook and realize you need another package, you have to re-execute everything
  • The IPython autoreloader doesn't work with the current symlink swapping scheme

So it's a combination of slow iteration and not enough feedback to the user. It sounds like there are downsides to both symlinking and updating in place. However, I feel like I don't have the full context, especially with regard to:

historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

In conda-store, each build is a separate specification with a corresponding lockfile, though. The idea of adding a new package to an existing environment with a conda install <package> doesn't apply. Or am I missing something?

@kcpevey mentioned to me that this may be a foot gun for shared environments:

The problem with autoreloading the environment is that the environment can change underneath you - other people could have updated the environment without your knowledge.

I somewhat agree, but ultimately if a shared env is changed by someone else, activating it after the change will cause the same issue.

This really sounds like a problem that could be fixed by notifying the user. We could also give them the option to stay on the old build or bump their own build to the latest version, although if we do opt for hot-reloading by eliminating the symlinking mechanism, users who stick with the old build would be the ones who need to reload (to target the old build)?

trallard added the type: enhancement 💅🏼, area: user experience 👩🏻‍💻, and impact: high 🟥 labels and removed the needs: triaging 🚦 label Oct 14, 2024
@krassowski
Author

From what I can tell this is particularly irritating because

I agree with your summary. Just one more thing:

In conda-store, each build is a separate specification with a corresponding lockfile, though. The idea of adding a new package to an existing environment with a conda install <package> doesn't apply. Or am I missing something?

How can a user achieve the closest possible thing to "add a new package to the environment without changing the dependencies of anything else unless necessary", like conda install <package> does? Is my only choice to manually add pins for every single one of the ~30 packages that I have in the "Requested packages" section?

As a user, I now have a fear of adding anything to an environment (but I have to!). What is the safe path?

@peytondmurray
Contributor

How can a user achieve the closest possible thing to "add a new package to the environment without changing the dependencies of anything else unless necessary", like conda install <package> does? Is my only choice to manually add pins for every single one of the ~30 packages that I have in the "Requested packages" section?

With conda-store you can't do this at the moment, because the environment gets re-solved when a new specification is submitted. This stems from previous experience with incremental updates:

As an FYI, historically, updating in place rather than rebuilding from scratch has been a really bad idea and has ended up with folks having non-reproducible, bespoke/broken environments, because to recreate the environment you have to recreate every update step and that is not tracked anywhere.

I have trouble understanding how we'd be able to reproduce the environment you'd find yourself with if there were an option to add a new package without changing dependencies of anything else unless necessary; isn't this what happens when you pip install <package>? pip downloads the requested version of your package, and if the currently installed dependencies meet the requirements of the package you're trying to install, those dependencies don't change. The only way you'd be able to get back to your particular environment would be either by

  1. Pinning every dependency, or
  2. Recreating the incremental build steps that you followed from the initial state of your environment

But maybe I'm missing something or there's another way to do this?


About user-facing messaging: are we currently passing messages to the jupyterlab-conda-store extension? If not, it sounds like we'll need to do so in order to fix the messaging/notification part of this issue.

@krassowski
Author

I have trouble understanding how we'd be able to reproduce the environment you'd find yourself with if there were an option to add a new package without changing dependencies of anything else unless necessary

Well, by having a lock file / pip freeze output committed. Right now conda-store does not solve the problem, it just shifts it: instead of pinning every dependency in a freeze/lock file, I have to pin it in the specification file.

The conda-store approach might be fine, but if the only sane way to add a new package is to have everything pinned, then I think it should have a button to populate pins for all packages in the spec from the currently installed versions.
