
[BUG] Unexpected outliers in TSNE results #3057

Closed
cjnolet opened this issue Oct 23, 2020 · 4 comments · Fixed by #3084
Labels
bug (Something isn't working) · CUDA / C++ (CUDA issue)

Comments

@cjnolet
Member

cjnolet commented Oct 23, 2020

With cuML version 0.16, I'm suddenly noticing some strange outliers in the rapids-single-cell-examples notebooks. Please refer to the notebook in the repository for the expected output. Below is the output I get when running the notebook with 0.16:

[image: TSNE output from the notebook with cuML 0.16, showing unexpected outliers]

The same issue is happening on the 1M cells notebook, though the outliers look much more extreme:

[image: TSNE output from the 1M cells notebook, with more extreme outliers]
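For context, the notebook's embedding step boils down to something like the sketch below. The data shape, parameter values, and variable names are placeholders rather than the notebook's actual code; the real workflow feeds a PCA-reduced single-cell expression matrix into cuml.manifold.TSNE and plots the 2-D result, which is where the outliers show up.

```python
# Rough, hypothetical reduction of the notebook's t-SNE step.
# The random matrix is a stand-in for a cells x genes expression matrix.
import numpy as np
from cuml.decomposition import PCA
from cuml.manifold import TSNE

X = np.random.rand(10000, 1000).astype(np.float32)  # placeholder data
X_pca = PCA(n_components=50).fit_transform(X)        # reduce before t-SNE
emb = TSNE(n_components=2).fit_transform(X_pca)      # outliers appear in this embedding
```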

I installed the 0.16 environment using the following yaml file (CUDA toolkit version is 10.2 and driver version is 11.0):

channels:
  - rapidsai
  - nvidia
  - conda-forge
  - bioconda
dependencies:
  - rapids=0.16*
  - python=3.8
  - cudatoolkit=10.2
  - cudf
  - dask-cuda
  - dask-cudf
  - cuml
  - cugraph
  - scipy
  - ucx-py
  - ucx-proc=*=gpu
  - scikit-learn=0.23.1
  - louvain
  - cupy=8*
  - scanpy
  - umap-learn
  - ipykernel
  - jupyterlab
  - pip
  - pip:
      - jupyter-server-proxy
      - git+https://github.com/dask/dask.git
      - git+https://github.com/dask/distributed.git

Since there have been recent changes to TSNE, it would probably be best to bisect through the commit history in 0.16 to find where this started.

cjnolet added the bug and CUDA / C++ labels on Oct 23, 2020
cjnolet changed the title from "[BUG] Outliers in TSNE" to "[BUG] Unexpected outliers in TSNE results" on Oct 23, 2020
@cjnolet
Member Author

cjnolet commented Oct 24, 2020

Ok, in addition to isolating the cause of these outliers, I think this exposes a larger problem, which is that we need a better way to test for potential issues like this.

Something to keep in mind: I wonder if we could create a test harness on some real-world datasets and find a good density-based or graph clustering to use for validation, in addition to trustworthiness.
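A rough sketch of the kind of harness that sentence describes, assuming scikit-learn supplies the metrics: embed a labeled dataset, then require both a minimum trustworthiness and a minimum agreement between a density-based clustering of the embedding and the known labels. The dataset, clustering parameters, and thresholds below are placeholders, not anything cuML's test suite actually uses.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.manifold import trustworthiness
from sklearn.metrics import adjusted_rand_score

from cuml.manifold import TSNE

X, y = load_digits(return_X_y=True)
X = X.astype(np.float32)

# With NumPy input, recent cuML mirrors the input type on output; adjust the
# conversion if your cuML version hands back a device array or cuDF object.
emb = TSNE(n_components=2, random_state=42).fit_transform(X)

# Trustworthiness checks neighborhood preservation, but it can stay high even
# when a cluster collapses or a handful of points fly off as outliers.
tw = trustworthiness(X, emb, n_neighbors=15)

# Clustering the embedding and comparing against the true labels is the extra
# check: collapsed or exploded clusters tank the agreement score.
pred = DBSCAN(eps=3.0, min_samples=10).fit_predict(emb)
ari = adjusted_rand_score(y, pred)

assert tw > 0.90, f"trustworthiness regressed: {tw:.3f}"
assert ari > 0.50, f"cluster agreement regressed: {ari:.3f}"
```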

@zbjornson
Contributor

zbjornson commented Oct 29, 2020

Bisected to 6a93762 with ~95% confidence. This is difficult to assess because that commit aimed to fix another bug that caused funky plots (see next comment); in other words, there were artifacts both before and after, but I think I can tell the difference visually, and the artifacts seem to happen more frequently after that commit. The key visual telltale is that the "new bad" sometimes makes a cluster disappear or collapse to a pinpoint, whereas the "old bad" just gave clusters unnatural shapes.

Old bad (3316718e4-1, k60): green is oddly shaped and pink is spread out, but all 8 clusters are present.
New bad (6a93762b1-3, k60): yellow is tiny, with lots of spreading.

Note that even the best plots before that commit are still not quite as "pretty" as I'd hope (compared to exact, CannyLab's FFT or #3058).
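For reference, a minimal way to generate that comparison from Python, assuming a cuML build whose TSNE exposes method="exact" alongside the default Barnes-Hut solver (the FFT solver is what #3058 adds); the dataset here is just an illustrative stand-in:

```python
from sklearn.datasets import load_digits
from cuml.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Default approximate solver vs. the exact solver used as the reference layout.
emb_bh = TSNE(n_components=2, method="barnes_hut", random_state=0).fit_transform(X)
emb_exact = TSNE(n_components=2, method="exact", random_state=0).fit_transform(X)
# Plot both embeddings side by side to compare cluster shapes and spread.
```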

@zbjornson
Contributor

zbjornson commented Oct 29, 2020

that commit aimed to fix another bug that caused funky plots

Scratch that: that commit aimed to fix a deadlock; I was thinking of a different commit.

Now I'm even more curious how changing the cache pref for the summarization kernel between Shared and L1 can cause a deadlock or change the output. Seems like it has to be a timing/synchronization bug, right?

zbjornson added a commit to zbjornson/cuml that referenced this issue Oct 29, 2020
@cjnolet
Member Author

cjnolet commented Oct 30, 2020

@zbjornson, here's the TSNE projection from our single-cell examples with your PR:
[image: TSNE projection with the PR applied]

And here's the TSNE projection pre-0.16:
[image: TSNE projection from cuML pre-0.16]

At first glance, it appears your PR fixes the problem.

cjnolet added a commit that referenced this issue Nov 2, 2020