
Added non-git source puller functionality #194

Open
wants to merge 45 commits into
base: main

Conversation

sean-morris
Contributor

@sean-morris sean-morris commented Jul 21, 2021

This PR adds support for downloading non-git compressed archives (zip and tar-compatible) of Jupyter notebooks and files to nbgitpuller. Documentation of plugin development, as well as examples of plugins that download from Google Drive, Dropbox, and any publicly available web address, can be found in this issue.

The optional third command-line argument, repo_dir, is now a keyword
argument named target-dir; this avoids having to pass an empty
branch_name as the second argument when you only want to set the
repo_dir (now target-dir).

In the previous configuration, we used positional arguments:
gitpuller git_url branch_name repo_dir

Now, we use a keyword argument for target-dir:
gitpuller git_url [branch_name] --target-dir [TARGET_DIR]
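The argument layout described above can be sketched with argparse; this is an illustrative sketch, not the actual gitpuller CLI code, showing how branch_name stays an optional positional while the old repo_dir positional becomes the --target-dir keyword option:

```python
import argparse

# Minimal sketch of the new argument layout (illustrative, not the real
# nbgitpuller parser): branch_name is an optional positional, and the old
# third positional repo_dir is replaced by the --target-dir option.
parser = argparse.ArgumentParser(prog="gitpuller")
parser.add_argument("git_url", help="URL of the source repository or archive")
parser.add_argument("branch_name", nargs="?", default=None,
                    help="branch to pull (optional)")
parser.add_argument("--target-dir", default=None,
                    help="directory to pull into (replaces the repo_dir positional)")

# --target-dir can now be given without supplying branch_name:
args = parser.parse_args(["https://example.com/repo.git", "--target-dir", "materials"])
print(args.git_url, args.branch_name, args.target_dir)
```

With positional arguments alone, setting the third argument always required supplying the second; the keyword form removes that coupling.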
@sean-morris sean-morris force-pushed the non-git branch 6 times, most recently from d02f6cd to 656420e Compare July 22, 2021 22:53
Contributor

@yuvipanda yuvipanda left a comment


Very quick review, particularly around plugin discovery with pluggy. I'll provide another pass early next week.

Excited to get this landed!

nbgitpuller/handlers.py (outdated, resolved)
nbgitpuller/handlers.py (outdated, resolved)
nbgitpuller/hookspecs.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
Handles non-git source compressed
archives from Google Drive, Dropbox, and any publicly
available web address.
nbgitpuller/handlers.py (outdated, resolved)
Contributor

@yuvipanda yuvipanda left a comment


A more thorough review! I wrote this up yesterday but GitHub ate it :(

nbgitpuller/plugins/zip_puller.py (outdated, resolved)
MANIFEST.in (outdated, resolved)
nbgitpuller/hookspecs.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
nbgitpuller/plugins/plugin_helper.py (outdated, resolved)
@choldgraf
Member

hey all - what's the status on this one? @sean-morris do you have time to finish this up?

@sean-morris
Contributor Author

hey all - what's the status on this one? @sean-morris do you have time to finish this up?

@choldgraf somehow this message slipped by me .... yes I have been talking to @yuvipanda. It is embarrassing how long this took. The move to async was rough ... !

@sean-morris
Contributor Author

sean-morris commented Sep 13, 2021

Ok, I have a few pieces:

  1. I needed two packages for testing -- aioresponses and pytest-asyncio. I put them in setup.py but that does not seem correct because they are not needed in production. What is the best practice here?
  2. I have a series of URLs for whoever is reviewing this PR to test with. Each link is either a zip or tar archive on Google Drive, Dropbox, or GitHub (which acts as a publicly accessible URL). I could copy them here?
  3. I committed a change to tox.ini setting the max line length to 150 -- not sure we want that?
  4. I also set up the pullers for google drive, dropbox, and a standard publicly available URL as separate plugins. This can be changed, I am not sure how we want to do this.
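Regarding point 1, one common convention is to declare test-only packages under extras_require in setup.py so a plain pip install skips them; developers install them with `pip install -e ".[test]"`. This is a sketch of that convention (the extra name "test" is just a convention, not something nbgitpuller prescribes):

```python
# Hypothetical sketch: keep test-only packages out of install_requires by
# declaring them as a "test" extra. The package names match the ones
# mentioned above; a real setup.py would pass this dict to setup().
extras_require = {
    "test": [
        "pytest",
        "pytest-asyncio",  # async test support
        "aioresponses",    # mock aiohttp responses
    ],
}

# in setup.py:
# setup(..., extras_require=extras_require)
```

With this layout, `pip install nbgitpuller` pulls only the runtime dependencies, while `pip install "nbgitpuller[test]"` adds the testing tools.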

@yuvipanda
Contributor

@sean-morris great! I'll try to do a review this week. From now on, can we just add additional commits and not force push to make reviewing changes easier?

You also need to rebase against master locally at some point to resolve the conflicts

@sean-morris
Contributor Author

sean-morris commented Sep 14, 2021

@sean-morris great! I'll try to do a review this week. From now on, can we just add additional commits and not force push to make reviewing changes easier?

You also need to rebase against master locally at some point to resolve the conflicts

@yuvipanda Maybe I blew this, but I think all the changes since the last review are in the second commit on September 12th? Yes on the rebase -- of course. I read the error wrong. Good deal.

nbgitpuller/handlers.py (outdated, resolved)
This includes moving most of the functions in plugin_helper.py
to async generators. The GitSyncView also opens the console
view whenever anything is written to it -- error or just progress
output.
@consideRatio
Member

@sean-morris wow this looks like a very thorough PR with lots of functionality added! Could you write an overview of this PR to help me review it?

To help you, help me, help you, help everyone with this contribution, I considered what I would like to learn from such overview, and here are some guiding questions that would be helpful to me and perhaps other reviewers.

  • About the goal outcome as understandable from an end user of nbgitpuller
  • About what changes you've opted to make towards the goal
  • Your expectations about it being a breaking change or not, and in what regard you could suspect it could be breaking if it would be breaking in any way.
  • About how you test the functionality
  • About anything not included in PR you consider relevant to strengthen in the code to make it more robust and maintainable

@sean-morris
Contributor Author

sean-morris commented Oct 26, 2021 via email

@sean-morris
Contributor Author

About the goal outcome as understandable from an end-user of nbgitpuller:

  • We are allowing nbgitpuller to accept compressed archives (zip or tar-compatible) of
    notebooks (or any type of file, really) instead of only handling git repositories.
  • A user provides, as part of an nbgitpuller link, a publicly available URL to a compressed archive,
    a compressed archive in Google Drive, or a folder in Dropbox.

About what changes you've opted to make towards the goal

  • There are no changes that affect the current functionality, aside from the UI choices in
    the link generator application (a separate PR). The SyncHandler's get method in
    handlers.py checks whether the URL query parameters contain the "provider" key.
    If so, a series of steps are taken to download the compressed archive, mimic a remote git repo in the hidden
    folders of a user's hub account, and then pass off the handling (pull, merge, etc.) to
    GitPuller itself. This approach allowed me to leverage the current puller functionality.

Your expectations about it being a breaking change or not, and in what regard you could suspect it could be breaking if it would be breaking in any way.

  • This causes no breaking changes. I did have a problem in the link generator PR regarding not including
    the branch in the URL query parameters. I have updated that PR so that branch is required again.

About how you test the functionality

  • I have a series of links to Google Drive, Dropbox, and GitHub (which operates as a standard web server at a publicly available URL); I use these links to create nbgitpuller links.
  • I ran a local instance of JupyterHub and tested the links to be sure the archives
    are downloaded, decompressed, set up to mimic a remote git repo, and then merged into any current files in the user's drive on the hub, as GitPuller would normally do.

About anything not included in PR you consider relevant to strengthen in the code to make it more robust and maintainable
It is unclear where to put the plugins (Google, Dropbox, standard) themselves. I have them in the nbgitpuller repo under the nbgitpuller/plugins directory. Should they be put in their own git projects, versioned, etc.?
I install them separately into the virtual environment from each plugin's directory (pip install nbgitpuller/plugins/x).
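For reviewers unfamiliar with pluggy, the hook wiring described in this PR looks roughly like the sketch below. The hook name handle_files matches the PR discussion, but the argument names and return values here are illustrative, not the exact nbgitpuller API:

```python
import pluggy

# Markers scoped to a project name; specs and impls must use the same name.
hookspec = pluggy.HookspecMarker("nbgitpuller")
hookimpl = pluggy.HookimplMarker("nbgitpuller")

class ContentProviderSpec:
    @hookspec
    def handle_files(self, query_args):
        """Download and unpack the archive described by query_args."""

class DropboxDownloaderPlugin:
    @hookimpl
    def handle_files(self, query_args):
        # a real plugin would download and decompress the archive here
        return {"provider": "dropbox", "url": query_args["repo"]}

pm = pluggy.PluginManager("nbgitpuller")
pm.add_hookspecs(ContentProviderSpec)
pm.register(DropboxDownloaderPlugin())

# Hook calls take keyword arguments and return a list of results,
# one per registered implementation.
results = pm.hook.handle_files(query_args={"repo": "https://example.com/archive.zip"})
```

Distributing each plugin as its own pip-installable package (as done here) fits pluggy's model well, since plugins can be discovered via setuptools entry points and registered at startup.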

A summary of what is involved:
SyncHandler in handlers.py:
The provider key on the URL query parameter is checked for existence.

If it does exist:

  • a download queue is created to help facilitate monitoring of the download process and
    reporting progress to the UI
  • the plugin manager calls the handle_files hook function implemented by the plugin itself
  • the handle_files function determines the extension used on the archived file
  • the asyncio event loop is retrieved, and the handle_files helper as well as the progress loop function
    are executed
  • the progress loop function emits the progress from the async yields to the UI
  • the handle_files_helper uses an async generator to:
    • initialize a local repo (that will mimic a remote repo), called origin_repo
    • clone the origin_repo (which is empty!) to a temporary folder at the path temp_download_repo
    • download the compressed archive
    • decompress the archive
    • commit and push the downloaded files to the local origin_repo (remember, this mimics a remote origin)
    • return the path to the local origin_repo to the SyncHandler

From here, the original GitPuller takes over.
The repo it uses is either:

  • the local origin_repo path created when a provider is present, or
  • the git URL passed with the nbgitpuller link when no provider is given.
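The async-generator flow described above can be sketched roughly like this; every name and message here is illustrative, not the actual nbgitpuller code, but it shows the shape of the pattern: the generator yields progress strings for the UI and a final dict carrying the result.

```python
import asyncio

# Illustrative sketch of the handle_files_helper pattern: an async
# generator yields progress messages, and its final yield carries the
# result (the local origin_repo path).
async def handle_files_helper(url, origin_repo="/tmp/origin_repo"):
    yield f"initializing local origin repo at {origin_repo}"
    yield "cloning the (empty) origin_repo into temp_download_repo"
    yield f"downloading archive from {url}"
    yield "decompressing archive"
    yield "committing and pushing downloaded files to origin_repo"
    yield {"output_dir": origin_repo}  # final yield carries the result

async def progress_loop(gen):
    result = None
    async for msg in gen:
        if isinstance(msg, dict):
            result = msg   # final result, not a progress line
        else:
            print(msg)     # in nbgitpuller this would be emitted to the UI
    return result

result = asyncio.run(progress_loop(handle_files_helper("https://example.com/a.zip")))
```

The driver loop is what lets the SyncHandler stream progress to the browser while the download proceeds, then hand the finished origin_repo path off to GitPuller.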

I could also get on Zoom and walk through what I did to pull all this off. Please tell me if I can add more detail.

Thanks!

Member

@consideRatio consideRatio left a comment


This looks amazing! I felt I left very few comments on changes to consider given how large this PR is.

Other suggestions

  • Add a README.md file to the root of the plugin folder, describing the plugin system's use a bit.
    • Clarify whether these plugins are installed by default, or whether they are something you need to enable, or similar.
  • Add a README.md file within each plugin folder where the plugin is very briefly described. I'm thinking, for example, one paragraph of 3-4 sentences and perhaps an example of what kind of URL you may specify to use that specific plugin.

As you may note, I put quite a bit of focus here on documentation for comprehension.

tests/test_files/hw/hw01/hw01.ipynb (outdated, resolved)
nbgitpuller/plugin_helper.py (outdated, resolved)
nbgitpuller/hookspecs.py (outdated, resolved)
nbgitpuller/handlers.py (outdated, resolved)
nbgitpuller/handlers.py (outdated, resolved)
nbgitpuller/handlers.py (outdated, resolved)
nbgitpuller/handlers.py (outdated, resolved)
@manics
Member

manics commented Nov 3, 2021

What are your thoughts on how this is tested and maintained long term? Do you think the tests you've added are sufficient for us to release future versions, or merge PRs not directly related to this functionality, without worrying about things accidentally breaking? Or would you recommend some manual testing?

- removed the global declaration of TEMP_DOWNLOAD_REPO_DIR
- the directory is now created by tempfile.TemporaryDirectory
@sean-morris
Contributor Author

sean-morris commented Nov 30, 2021

Almost there!

  • I'm not that experienced with Python and I've never used the global keyword, but I recall it is a practice to avoid. I think that is plausible to avoid, but let's not bother unless someone with more experience can point us towards a concrete alternative solution.

@manics @consideRatio
I went all over the place on this and ended up with the global variable solution from this reference. I would not call it definitive, but it got me through!

  • I'm a bit concerned about seeing that we seem to rely on a single temp downloader directory. Can we instead rely on something like the following ideas to avoid conflicts?

    1. A root location for temp downloads, but we have separate folders under it to avoid conflicts
    2. Use of a Python tempfile.TemporaryDirectory

Btw, if we use a TemporaryDirectory, I think you can pass dir=... and make it create the temporary directory within some other directory.

I got this in -- good deal
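The TemporaryDirectory-under-a-root pattern suggested above can be sketched like this; the prefix and paths are illustrative:

```python
import os
import tempfile

# A shared root for downloads, with a unique per-request folder under it
# created by TemporaryDirectory(dir=...), so concurrent downloads cannot
# collide on a single path.
root = tempfile.mkdtemp(prefix="nbgitpuller-downloads-")

with tempfile.TemporaryDirectory(dir=root) as temp_download_repo:
    # each request gets its own unique directory under the shared root
    assert os.path.dirname(temp_download_repo) == root
    archive_path = os.path.join(temp_download_repo, "archive.zip")
    # ... download and decompress into temp_download_repo here ...

# the per-request directory is removed automatically on exiting the block
assert not os.path.exists(temp_download_repo)
os.rmdir(root)  # clean up the shared root in this demo
```

Because each request creates its own directory, no module-level state is needed to track the download location.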

@manics
Member

manics commented Nov 30, 2021

@sean-morris I think global is fine for single-threaded scripts as in the StackOverflow post, but since this is a web application you could have multiple async requests, either intentional or not. Debugging multithreaded/async apps can be very difficult so I think we need to handle that case to avoid problems in future.

- can now pass in a custom download function and/or custom
download function parameters.
- the temp_download_file is added to custom download_func_params -- I
did this so that a plugin does not need to know about the
temp_download_file nor try to handle it.
- the download_archive function now uses keywords: repo and
temp_download_file
- moved the tempfile cleanup to a finally
- removed the dir specification for tempfile
@sean-morris
Contributor Author

Removing the TEMP_DOWNLOAD_DIR from plugin_helper.py meant that the plugins did not have access to the temporary directory created by plugin_helper.handle_files_helper. It got a little confusing as to how to handle this.

I ended up having the plugin_helper handle adding the temp_download_dir to the download_func_params. I am not sure this is the best method. I was trying to give plugins the flexibility to customize their download functionality while still leveraging the other pieces of the download process I had written in plugin_helper. It ends up getting a bit complicated on the plugin-developer side. Open to ideas here.

We needed this to deal with the dir_names being set in
the async generator but needing to be returned by
handle_files_helper
@sean-morris
Contributor Author

@sean-morris I think global is fine for single-threaded scripts as in the StackOverflow post, but since this is a web application you could have multiple async requests, either intentional or not. Debugging multithreaded/async apps can be very difficult so I think we need to handle that case to avoid problems in future.

@manics @consideRatio
I moved the async generator and handle_files_helper into a class as @manics suggested. It needs a bit more testing but I do think this is much better. Do you see any problems here?
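A rough sketch of that class-based refactor, with illustrative names rather than the actual nbgitpuller code: per-request state lives on a helper object instead of in module-level globals, so concurrent async requests each get their own temp directory and results.

```python
import asyncio
import tempfile

class HandleFilesHelper:
    def __init__(self, url):
        self.url = url
        self.dir_names = None
        # unique per-request download directory; nothing shared globally
        self.temp_download_dir = tempfile.mkdtemp(prefix="nbgitpuller-")

    async def handle_files_helper(self):
        # ... download and unpack into self.temp_download_dir here ...
        self.dir_names = {"output_dir": self.temp_download_dir}
        return self.dir_names

# each request constructs its own helper, so nothing is shared between them
a = HandleFilesHelper("https://example.com/a.zip")
b = HandleFilesHelper("https://example.com/b.zip")
asyncio.run(a.handle_files_helper())
asyncio.run(b.handle_files_helper())
assert a.temp_download_dir != b.temp_download_dir
```

This addresses the concern about overlapping async requests: two in-flight downloads can no longer clobber each other's dir_names or temp directory.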

@consideRatio
Member

I moved the async generator and handle_files_helper into a class as @manics suggested. It needs a bit more testing but I do think this is much better. Do you see any problems here?

Seems OK, I think! Seems like a more robust and easy-to-understand logic.

(Brief review check from mobile)

@consideRatio
Member

@sean-morris wrote:

I am actually hating this whole download function abstraction mostly because it creates a very confusing set of moments about the temporary download directory. I am wondering if a better solution would be to mandate a directory be created by the plugin? This would get the creation of the directory out of plugin_helper and may limit the confusion for someone trying to create a custom downloader plugin.

I'm thinking that a lot of the complexity stems from not having a clear abstraction layer. Ideally, in my mind, the default git content provider should be able to act as a plugin - the default one. It could be feature creep, but it could also be a helpful strategy for arriving at a clear, logical separation of concerns for the various parts of code we work with.

I don't yet understand the technical details or hurdles well enough to think clearly about this, though. Perhaps we can have a video meet at the end of next week @sean-morris to sync a bit? I'd like to make sure we get this to land without you overworking yourself. I'm projecting, but if I'd worked so thoroughly and for so long on a feature, I'd be quite tired of feeling the goal posts moving around. I'd like to ensure we get to the goal while also ensuring you don't feel alone with the responsibility of making it happen.

@sean-morris
Contributor Author

@sean-morris wrote:

I am actually hating this whole download function abstraction mostly because it creates a very confusing set of moments about the temporary download directory. I am wondering if a better solution would be to mandate a directory be created by the plugin? This would get the creation of the directory out of plugin_helper and may limit the confusion for someone trying to create a custom downloader plugin.

I'm thinking that a lot of the complexity stems from not having a clear abstraction layer. Ideally, in my mind, the default git content provider should be able to act as a plugin - the default one. It could be feature creep, but it could also be a helpful strategy for arriving at a clear, logical separation of concerns for the various parts of code we work with.

I don't yet understand the technical details or hurdles well enough to think clearly about this, though. Perhaps we can have a video meet at the end of next week @sean-morris to sync a bit? I'd like to make sure we get this to land without you overworking yourself. I'm projecting, but if I'd worked so thoroughly and for so long on a feature, I'd be quite tired of feeling the goal posts moving around. I'd like to ensure we get to the goal while also ensuring you don't feel alone with the responsibility of making it happen.

It makes sense to me to have all content providers, including git sources, be plugins, although that doesn't solve some of these temporary download directory problems; the git sources don't need it. Side note: I actually think I have a solution to my problem with the temporary download directory that will be clear. But generally, it makes sense to me to bring git sources into a plugin as well.

Thanks for your concern. It is definitely a bit of a push. I felt pressure to keep it moving in the face of bunches of other work piling up! Yes, I can definitely talk over video. I had wanted to suggest this. Some of the details are hard to get clear in writing. In the meantime, I am going to move git sources to a plugin and implement the temp download directory solution. We can see what it looks like.

So you are nine hours ahead of me. I could do a 6:00 AM meeting, which puts you around 3:00 PM and gives me enough flexibility to not screw up the rest of my life! Name the day.

@consideRatio
Member

@sean-morris ❤️ I'm contacting you on slack!

The downloader-plugin utilities and tests are now in their
own repo with the other downloader plugins.
# This allows us to nest usage of the asyncio event loop
# already being used by tornado in the Jupyter distribution.
# Ref: https://medium.com/@vyshali.enukonda/how-to-get-around-runtimeerror-this-event-loop-is-already-running-3f26f67e762e
nest_asyncio.apply()
Member


This monkeypatches the asyncio event loop, is this event loop shared across the whole of jupyter-server? If so are there any potential issues that might be caused in jupyter-server, extensions, or kernels that we should be aware of?

Contributor Author


@consideRatio @manics Erik do you have any sense of this? I completely based this solution on the reference I linked to in the comments. He seemed to be trying to solve a similar nested event loop issue in Jupyter.

I read a bunch more pieces.

  • Here is a discussion about where the error occurs (a RuntimeError related to trying to start an already-running loop); the ipykernel seems to be the issue. The nest_asyncio solution becomes part of the thread in August 2020.
  • Here is another from the Jupyter blog in 2018; at that point, they didn't want to integrate the nest_asyncio package until there was more testing.
  • I also tried upgrading the tornado and ipykernel packages; a random comment seemed to indicate it would solve the problem. After upgrading tornado to version 6.1 (no change in my environment) and ipykernel to 6.6.1 (up from 6.6.0), it didn't work. I still need the nest_asyncio package.

pip3 install --upgrade tornado
pip3 install --upgrade ipykernel

Contributor Author


I moved the nest_asyncio calls into the nbgitpuller-downloader-plugins themselves, and added notes to the documentation that the import and call to nest_asyncio need to be included in any nbgitpuller-downloader-plugin.

The nest_asyncio import and call are moved to the nbgitpuller
downloader plugins
The logic related to downloading compressed archives is now initiated
by the same thread GitPuller uses.
We moved all the logic related to pulling compressed archives to the
GitPuller class. These changes reflect changes to the downloader-plugins,
which now encapsulate classes and store the handle_files outputs in an
instance variable that we access from nbgitpuller when the downloader
plugin is complete. This required the addition of a function to
automatically register nbgitpuller-downloader-plugin classes with pluggy.
@ericvd-ucb

Where is this project? Users want the cool new functionality!

@sean-morris
Contributor Author

sean-morris commented Oct 11, 2022 via email

@choldgraf
Member

Just another note that I think this would be useful as well. Jupyter Book creates ipynb files that are bundled with the generated websites, and I think it'd be useful if we could generate interact links that point to those ipynb files rather than the ones on GitHub.

@sean-morris
Contributor Author

sean-morris commented Jan 1, 2023 via email

@ericvd-ucb

I thought I stopped by this old issue once a year to say hi, but it looks like it's been two whole years!
