What is the best way to regularize supervised UMAP? #1116
Comments
I don't have a good answer. Since nobody else has answered and I like the question, I thought I'd follow up with a bad answer and an offer to try some things out.

Clarifying the question: it isn't 100% clear to me what terms like "bias," "variance," and "overfitting" should mean in the context of supervised embedding without any underlying data model. Based on your description of what you're looking for in your visualization, I'm going to guess that you basically want the combined workflow of (UMAP) + (KNN regression) to not overfit; that is, if KNN regression fails on your test set, the labels of the training data shouldn't vary smoothly with the embedding either! I don't think there is anything in UMAP that explicitly tries to solve this regularization problem, although some elements of UMAP do regularize similar-looking problems, and if you wanted to dive into the code a bit more, there are a few places where people have worked on regularization strategies that would be easy to add.

None of these approaches directly regularizes the thing you're looking for. Some other approaches could easily be tweaked to do this (e.g. the fully model-based approach https://arxiv.org/abs/2304.07658), but the approaches I'm aware of in this direction are very slow.

I'm interested in this question but don't check GitHub very often. If you feel like following up (especially if there is a version of the problem that you can share), my very-public email address is username asmi28, domain uottawa CA. I'm happy to write up some notebooks and such and post them if it feels interesting.
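To make the (UMAP) + (KNN regression) framing concrete, here is a minimal sketch of that overfitting check, assuming the umap-learn and scikit-learn packages; the synthetic data and all parameter values are placeholders, not recommendations:

```python
# Minimal sketch of the (UMAP) + (KNN regression) overfitting check,
# assuming the umap-learn and scikit-learn packages. The synthetic data
# and parameter values are placeholders, not recommendations.
import numpy as np
import umap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))                  # stand-in feature matrix
y = X[:, 0] + rng.normal(scale=0.1, size=len(X))  # stand-in continuous target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Supervised UMAP on the training split only; target_metric="l2" treats y
# as a real-valued regression target rather than class labels.
reducer = umap.UMAP(n_components=2, target_metric="l2").fit(X_tr, y_tr)

knn = KNeighborsRegressor(n_neighbors=10).fit(reducer.embedding_, y_tr)

# A large gap between these two scores is the "overfitting" in question:
# the target varies smoothly over the training embedding but not the test one.
print("train R2:", knn.score(reducer.embedding_, y_tr))
print("test  R2:", knn.score(reducer.transform(X_te), y_te))
```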
I am working on a regression problem where I am attempting to use UMAP for supervised feature embedding and then use the resulting low-dimensional embeddings as input variables for a subsequent regression model.
Using the L2 target metric (target_metric="l2"), UMAP successfully transforms my very high-dimensional sparse feature matrix into a low-dimensional one for the training set. Visualizations of the resulting embeddings show a nice, smooth variation of the target variable with the UMAP features. However, when I embed the data of my validation set using the trained UMAP model, the model generalizes poorly to unseen data, which looks like a typical overfitting problem.
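For reference, a minimal sketch of this workflow, assuming the umap-learn package; dense random arrays stand in for the actual sparse feature matrix, and all shapes and parameter values are placeholders:

```python
# Minimal sketch of the fit/transform workflow described above, assuming
# the umap-learn package. Dense random arrays stand in for the actual
# sparse feature matrix; shapes and parameters are placeholders.
import numpy as np
import umap

rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 5000))  # stand-in for the high-dimensional features
y_train = rng.normal(size=800)          # continuous regression target
X_valid = rng.normal(size=(200, 5000))

# target_metric="l2" makes supervised UMAP treat y as a real-valued target.
reducer = umap.UMAP(n_components=2, target_metric="l2")
Z_train = reducer.fit_transform(X_train, y_train)  # smooth in-sample picture...
Z_valid = reducer.transform(X_valid)               # ...which may not carry over
```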
I tried to perform hyper-parameter optimization of the composite model (UMAP plus the regression model on top of it) by maximizing the R² of the downstream regression. So far, tweaking several hyper-parameters of the UMAP model has not yielded a good bias-variance tradeoff in the embedding. The manual also does not seem to address the question of how to regularize supervised UMAP.
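One way to set up that composite search, as a hedged sketch: wrap UMAP and a downstream regressor in a scikit-learn Pipeline (which forwards y to UMAP's supervised fit) and let GridSearchCV maximize R² under cross-validation. The grids below are illustrative guesses, and KNeighborsRegressor is a stand-in for the actual regression model:

```python
# Sketch of a composite hyper-parameter search over UMAP + regressor,
# assuming the umap-learn and scikit-learn packages. All data, grid
# values, and the choice of regressor are placeholders.
import numpy as np
import umap
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))  # stand-in feature matrix
y = rng.normal(size=500)         # stand-in continuous target

pipe = Pipeline([
    ("umap", umap.UMAP(target_metric="l2")),  # Pipeline passes y to UMAP's fit
    ("reg", KNeighborsRegressor()),
])

# target_weight (0.0 = pure data topology, 1.0 = pure target topology) is
# the knob that most directly trades off supervision against raw geometry,
# so it is a natural candidate to include in the search.
param_grid = {
    "umap__n_neighbors": [15, 50, 200],
    "umap__min_dist": [0.0, 0.5],
    "umap__target_weight": [0.2, 0.5, 0.8],
    "reg__n_neighbors": [5, 20],
}
search = GridSearchCV(pipe, param_grid, scoring="r2", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```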
I would be extremely grateful for any advice as to which hyper-parameters to focus on in order to achieve better generalization properties of the supervised UMAP algorithm.