Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Torch implementation of ParametricUMAP #1103

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jh83775
Copy link
Contributor

@jh83775 jh83775 commented Mar 23, 2024

Hi All,
I've popped in a pull request with a version of ParametricUMAP written in PyTorch. It mostly follows the discussions in the threads below:

I'm not quite sure where it best fits in the codebase - currently I've just put it in umap/torch.py so you can import with from umap.torch import ParametricUMAP, which feels intuitive. It should be fairly easy to edit and include a nonparametric UMAP as well, via the torch.nn.Embedding class, but I can put in a fresh pull request for that when I find the time. There could also be a discussion around what parameters are accessible to the user while not bloating the number of optional args. For example, in my implementation the user can manually change the batch size and learning rate, but not the optimizer or negative sample rate. What is included in the optional args is relatively arbitrary based on what I tended to be changing the most for my use case. It will be relatively easy to expose any currently hardcoded parameter choices to the user though.

I've also added a few examples in a Jupyter notebook (see notebooks/Parametric_UMAP/08.0-torch-parametric-umap.ipynb) to give an indication of how to get started.

If any PyTorch experts want to take a look, I'll gladly add in any optimizations (I'm sure there will be a few things here and there that can be changed and improve the runtime).

I've tested it with pytorch==1.12.1 & CUDA==11.6. I think it should work with other versions too, but haven't tested more widely.

@pep8speaks
Copy link

pep8speaks commented Mar 23, 2024

Hello @jh83775! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 297:1: W293 blank line contains whitespace

Comment last updated at 2024-03-26 11:59:02 UTC

@lmcinnes
Copy link
Owner

Thanks for this. There is currently work underway to support Keras 3 (the initial support is done) such that multiple backends, including pytorch, could be used. You may want to check in on the refactoring discussion on PR #1101

@jh83775
Copy link
Contributor Author

jh83775 commented Mar 23, 2024

Thanks for this. There is currently work underway to support Keras 3 (the initial support is done) such that multiple backends, including pytorch, could be used. You may want to check in on the refactoring discussion on PR #1101

That looks helpful, thanks. I'll update to reflect the discussions on that thread. Looks like the keras-pytorch version might supersede this work, although hopefully this can still be useful for those more familiar with torch than keras

@lmcinnes
Copy link
Owner

lmcinnes commented Mar 23, 2024 via email

@lmcinnes
Copy link
Owner

lmcinnes commented Apr 4, 2024

So as far as I can tell this will just sit alongside the existing Keras3 implementation (which should now support a pytorch backed) just fine as an extra pytorch specific ParametricUMAP. I would be happy to just merge this if you think it has sufficient value as-is for a standalone pytorch speciifc implementation.

@timsainb
Copy link
Collaborator

timsainb commented Apr 4, 2024

So as far as I can tell this will just sit alongside the existing Keras3 implementation (which should now support a pytorch backed) just fine as an extra pytorch specific ParametricUMAP. I would be happy to just merge this if you think it has sufficient value as-is for a standalone pytorch speciifc implementation.

I'm also glad to see this implementation. Thanks!

I do think the overlap in functionality between this and the keras implementation is pretty high.

This code is missing some key functionality (it's been a few days since I read it). One of the benefits of parametric UMAP is it allows you to balance the global MDS loss and auto encoder loss, which iirc are not in this implementation.

One thing that the keras code is missing that this code has is the data iterator. Currently to run the keras code, even if you use torch as a backend you still rely on on the tensorflow dataset, meaning you need to have keras, tensorflow, and torch installed. If we pull the iterator from this code into the keras implementation we will overcome that issue (this is something I've been meaning to do but haven't had time yet)

@turian
Copy link

turian commented May 9, 2024

@timsainb as per discussion on HN by Max Woolf, one big limitation of the current parametric UMAP implementation is OOM. Caused by tensorflow dataset pulling the entire graph into memory.

So a parametric UMAP implementation that works iterating over the dataset would be great.

@timsainb
Copy link
Collaborator

@turian I think this is out of the scope of the umap python library, but I give an example of how to build your own graph and iterator here: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing

You could switch out the graph and the interator for one that supports a graph larger than memory and an iterator that grabs from that graph.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants