AttributeError: module 'cuml.cluster.hdbscan' has no attribute 'all_points_membership_vectors' #912

Closed
DominikMann opened this issue Jan 6, 2023 · 8 comments

Comments

@DominikMann

Hi Maarten,

after updating to the new version 0.13.0, a new error occurred in my code:
[screenshot: hdbscan_issue]

I have read in the changelog that you have made changes to support cuML's HDBSCAN, which I am using. When using the "normal" hdbscan package, I get the following error:
[screenshot: hdbscan_issue]

I instantiated the hdbscan model like this: hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=20, gen_min_span_tree=True)

When I switch back to version 0.12.0 I get no errors and everything runs as it should. Is there a problem on my end or is this behaviour not intended?

All the best,
Dominik

@lorenzobalzani

Just to confirm, same here with bertopic==0.13.0.

@MaartenGr
Owner

Which version of cuML are you using? Also, could you share your entire code for training the model? That makes it a bit easier to see what exactly is going on.

@MaartenGr
Owner

Also, I believe when using the original HDBSCAN model, you will need to set prediction_data=True to generate the probabilities.

@DominikMann
Author

DominikMann commented Jan 6, 2023

When using the original HDBSCAN with prediction_data=True it actually works, thank you.

For the cuML part:
I am using version 21.10.2, which is the default on Kaggle, I suppose.

The code for training:

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

stopwords = list(stopwords.words('english'))
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords)
embedding_model = SentenceTransformer('all-mpnet-base-v2')

umap_model = UMAP(n_neighbors=9, n_components=4, min_dist=0.05, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=20, gen_min_span_tree=True)

embeddings = embedding_model.encode(docs, show_progress_bar=True)

model_bert = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    top_n_words=15,
    language='english',
    calculate_probabilities=True,
    verbose=True,
    diversity=0.3,
    nr_topics=50
)

topics, probs = model_bert.fit_transform(docs, embeddings)

And just a side question:
I noticed that with the original HDBSCAN package I get more outliers than with the cuML package. Is that normal?

@MaartenGr
Owner

I am using version 21.10.2, which is the default on Kaggle, I suppose.

Ah, you will need to have 22.10 at the very least in order to use those probabilities. I definitely should have made that clear in the documentation. Having said that, you can also use Google Colab using the instructions here.
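A minimal sketch of checking that requirement before training, assuming the installed version is available as a string (cuML exposes it as cuml.__version__). The helper names parse_release and supports_membership_vectors are hypothetical, and the 22.10 threshold is simply the one stated in this thread:

```python
import re

# Minimum cuML release named in this thread for membership-vector probabilities.
MIN_CUML = (22, 10)


def parse_release(version: str) -> tuple:
    """Return (major, minor) from a version string such as '22.10.01'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_membership_vectors(version: str) -> bool:
    """True if the given cuML version meets the minimum stated above."""
    return parse_release(version) >= MIN_CUML


# Version strings that appear in this thread.
for v in ("21.10.2", "22.10.01", "22.12"):
    print(v, supports_membership_vectors(v))
```

In a real environment the argument would be cuml.__version__. Meeting the version requirement is necessary but, as a later comment shows, not sufficient on its own.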

I noticed that with the original HDBSCAN package I get more outliers than with the cuML package. Is that normal?

That depends on the parameters that you are using compared with the original. I believe the two implementations are not exactly one-to-one comparable, so making sure all parameters are equal should help a bit.
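One way to check whether the parameters really are equal is to diff the two parameter dictionaries. The sketch below assumes both models expose a scikit-learn-style get_params() that returns a dict; the function diff_params and the example values are hypothetical and are not the real defaults of either library:

```python
_MISSING = "<not set>"


def diff_params(cpu_params: dict, gpu_params: dict) -> dict:
    """Map each parameter whose value differs to a (cpu, gpu) pair."""
    names = sorted(set(cpu_params) | set(gpu_params))
    return {
        name: (cpu_params.get(name, _MISSING), gpu_params.get(name, _MISSING))
        for name in names
        if cpu_params.get(name, _MISSING) != gpu_params.get(name, _MISSING)
    }


# Illustrative values only; in practice these would come from
# cpu_model.get_params() and gpu_model.get_params().
cpu = {"min_cluster_size": 20, "min_samples": 20, "cluster_selection_method": "eom"}
gpu = {"min_cluster_size": 20, "min_samples": 10, "max_cluster_size": 0}

for name, (a, b) in diff_params(cpu, gpu).items():
    print(f"{name}: cpu={a!r} gpu={b!r}")
```

Parameters that exist in only one implementation show up as "<not set>" on the other side, which is a hint that the two models cannot be configured identically.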

@DominikMann
Author

DominikMann commented Jan 6, 2023

Thank you very much for your fast help! I will try it on Google Colab in the next few days.

Edit: It worked with Google Colab! Thank you!

@p-dre

p-dre commented Jan 19, 2023

I get a similar error message with bertopic==0.13, related to all_points_membership_vectors.
I have tried both cuml==22.12 and cuml==22.10.

  File "test_rapids.py", line 37, in <module>
    topics, probs = topic_model.fit_transform(docs)
  File "/home/p/p_drec01/miniconda3/envs/bertopic_0_13/lib/python3.8/site-packages/bertopic/_bertopic.py", line 354, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
  File "/home/p/p_drec01/miniconda3/envs/bertopic_0_13/lib/python3.8/site-packages/bertopic/_bertopic.py", line 2888, in _cluster_embeddings
    probabilities = hdbscan_delegator(self.hdbscan_model, "all_points_membership_vectors")
  File "/home/p/p_drec01/miniconda3/envs/bertopic_0_13/lib/python3.8/site-packages/bertopic/cluster/_utils.py", line 36, in hdbscan_delegator
    return cuml_hdbscan.all_points_membership_vectors(model)
  File "prediction.pyx", line 137, in cuml.cluster.hdbscan.prediction.all_points_membership_vectors
ValueError: PredictionData not generated. Please call clusterer.fit again with prediction_data=True

import pandas as pd
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

umap_model = UMAP()
hdbscan_model = HDBSCAN()


print('start model')
# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, language = "multilingual", calculate_probabilities=True)
print('fit model')
topics, probs = topic_model.fit_transform(docs)

conda list
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
aiohttp                   3.8.3            py38h0a891b7_1    conda-forge
aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
arrow-cpp                 9.0.0           py38he270906_2_cpu    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attrs                     22.2.0             pyh71513ae_0    conda-forge
aws-c-cal                 0.5.11               h95a6274_0    conda-forge
aws-c-common              0.6.2                h7f98852_0    conda-forge
aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
aws-c-io                  0.10.5               hfb6a706_0    conda-forge
aws-checksums             0.1.11               ha31a3da_7    conda-forge
aws-sdk-cpp               1.8.186              hecaee15_4    conda-forge
bertopic                  0.13.0             pyhd8ed1ab_0    conda-forge
bokeh                     3.0.3              pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py38h0a891b7_1005    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
cachetools                5.2.1              pyhd8ed1ab_0    conda-forge
certifi                   2022.12.7          pyhd8ed1ab_0    conda-forge
cffi                      1.15.1           py38h4a40e3a_3    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
click                     8.1.3           unix_pyhd8ed1ab_2    conda-forge
cloudpickle               2.2.0              pyhd8ed1ab_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
contourpy                 1.0.7            py38hfbd4bf9_0    conda-forge
cryptography              39.0.0           py38h1724139_0    conda-forge
cubinlinker               0.2.2            py38h7144610_0    rapidsai
cuda-python               11.8.1           py38h241159d_2    conda-forge
cudatoolkit               11.5.1              h59c8dcf_11    conda-forge
cudf                      22.10.01        cuda_11_py38_gca9a422da9_2    rapidsai
cuml                      22.10.01        cuda11_py38_ge3f4f57d1_0    rapidsai
cupy                      11.4.0           py38h405e1b6_0    conda-forge
cython                    0.29.33          py38h8dc9893_0    conda-forge
cytoolz                   0.12.0           py38h0a891b7_1    conda-forge
dask                      2022.9.2           pyhd8ed1ab_0    conda-forge
dask-core                 2022.9.2           pyhd8ed1ab_0    conda-forge
dask-cuda                 22.10.00        py38_g382e519_0    rapidsai
dask-cudf                 22.10.01        cuda_11_py38_gca9a422da9_2    rapidsai
dataclasses               0.8                pyhc8e2a94_3    conda-forge
datasets                  2.7.1              pyhd8ed1ab_0    conda-forge
dill                      0.3.6              pyhd8ed1ab_1    conda-forge
distributed               2022.9.2           pyhd8ed1ab_0    conda-forge
dlpack                    0.5                  h9c3ff4c_0    conda-forge
faiss-proc                1.0.0                      cuda    rapidsai
fastavro                  1.7.0            py38h0a891b7_0    conda-forge
fastrlock                 0.8              py38hfa26641_3    conda-forge
filelock                  3.9.0              pyhd8ed1ab_0    conda-forge
freetype                  2.12.1               hca18f0e_1    conda-forge
frozenlist                1.3.3            py38h0a891b7_0    conda-forge
fsspec                    2022.11.0          pyhd8ed1ab_0    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
glog                      0.6.0                h6f12383_0    conda-forge
grpc-cpp                  1.47.1               hbad87ad_6    conda-forge
hdbscan                   0.8.29           py38h26c90d9_1    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
huggingface_hub           0.11.1             pyhd8ed1ab_0    conda-forge
icu                       70.1                 h27087fc_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib-metadata        6.0.0              pyha770c72_0    conda-forge
importlib_metadata        6.0.0                hd8ed1ab_0    conda-forge
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
jpeg                      9e                   h166bdaf_2    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
lcms2                     2.14                 hfd0df8a_1    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libabseil                 20220623.0      cxx17_h05df665_6    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
libbrotlidec              1.0.9                h166bdaf_8    conda-forge
libbrotlienc              1.0.9                h166bdaf_8    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
libcudf                   22.10.01        cuda11_gca9a422da9_2    rapidsai
libcuml                   22.10.01        cuda11_ge3f4f57d1_0    rapidsai
libcumlprims              22.10.00        cuda11_gfdb85e0_0    nvidia
libcurl                   7.87.0               h6312ad2_0    conda-forge
libcusolver               11.4.2.57                     0    nvidia
libcusparse               12.0.0.76                     0    nvidia
libdeflate                1.17                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               h9b69904_4    conda-forge
libfaiss                  1.7.0           cuda112h5bea7ad_8_cuda    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgoogle-cloud           2.1.0                h9ebe8e8_2    conda-forge
libhwloc                  2.8.0                h32351e8_1    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libjpeg-turbo             2.1.4                h166bdaf_0    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libllvm11                 11.1.0               he0ac6c6_5    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libprotobuf               3.20.2               h6239696_0    conda-forge
libraft-distance          22.10.01        cuda11_gf7d2335_0    rapidsai
libraft-headers           22.10.01        cuda11_gf7d2335_0    rapidsai
libraft-nn                22.10.01        cuda11_gf7d2335_0    rapidsai
librmm                    22.10.01        cuda11_gd98b8719_0    rapidsai
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libthrift                 0.16.0               h491838f_2    conda-forge
libtiff                   4.5.0                h6adf6a1_2    conda-forge
libutf8proc               2.8.0                h166bdaf_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libwebp-base              1.2.4                h166bdaf_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxml2                   2.10.3               h7463322_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               15.0.7               h0cdce71_0    conda-forge
llvmlite                  0.39.1           py38h38d86a4_1    conda-forge
locket                    1.0.0              pyhd8ed1ab_0    conda-forge
lz4                       4.2.0            py38hd012fdc_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
markupsafe                2.1.2            py38h1de0b5d_0    conda-forge
mkl                       2022.2.1         h84fe81f_16997    conda-forge
msgpack-python            1.0.4            py38h43d8883_1    conda-forge
multidict                 6.0.4            py38h1de0b5d_0    conda-forge
multiprocess              0.70.14          py38h0a891b7_3    conda-forge
nccl                      2.14.3.1             h0800d71_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
ninja                     1.11.0               h924138e_0    conda-forge
nltk                      3.8.1              pyhd8ed1ab_0    conda-forge
numba                     0.56.4           py38h9a4aae9_0    conda-forge
numpy                     1.23.5           py38h7042d01_0    conda-forge
nvtx                      0.2.3            py38h0a891b7_2    conda-forge
openjpeg                  2.5.0                hfec8fc6_2    conda-forge
openssl                   1.1.1s               h0b41bf4_1    conda-forge
orc                       1.7.6                h6c59b99_0    conda-forge
packaging                 23.0               pyhd8ed1ab_0    conda-forge
pandas                    1.5.2            py38hdc8b05c_2    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
partd                     1.3.0              pyhd8ed1ab_0    conda-forge
pillow                    9.4.0            py38hb32c036_0    conda-forge
pip                       22.3.1             pyhd8ed1ab_0    conda-forge
plotly                    5.12.0             pyhd8ed1ab_1    conda-forge
pooch                     1.6.0              pyhd8ed1ab_0    conda-forge
protobuf                  3.20.2           py38hfa26641_1    conda-forge
psutil                    5.9.4            py38h0a891b7_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptxcompiler               0.7.0            py38h241159d_3    conda-forge
pyarrow                   9.0.0           py38h097c49a_2_cpu    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pylibraft                 22.10.01        cuda11_py38_gf7d2335_0    rapidsai
pynndescent               0.5.8              pyh1a96a4e_0    conda-forge
pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
pyopenssl                 23.0.0             pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.8.15          h257c98d_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-xxhash             3.2.0            py38h1de0b5d_0    conda-forge
python_abi                3.8                      3_cp38    conda-forge
pytorch                   1.12.1          cpu_py38h39c826d_0    conda-forge
pytz                      2022.7.1           pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1            py38h0a891b7_4    conda-forge
raft-dask                 22.10.01        cuda11_py38_gf7d2335_0    rapidsai
re2                       2022.06.01           h27087fc_1    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
regex                     2022.10.31       py38h0a891b7_0    conda-forge
requests                  2.28.2             pyhd8ed1ab_0    conda-forge
responses                 0.18.0             pyhd8ed1ab_0    conda-forge
rmm                       22.10.01        cuda11_py38_gd98b8719_0    rapidsai
s2n                       1.0.10               h9b69904_0    conda-forge
sacremoses                0.0.53             pyhd8ed1ab_0    conda-forge
scikit-learn              1.2.0            py38h1e1a916_0    conda-forge
scipy                     1.10.0           py38h10c12cc_0    conda-forge
sentence-transformers     2.2.2              pyhd8ed1ab_0    conda-forge
sentencepiece             0.1.96           py38h43d8883_1    conda-forge
setuptools                66.0.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sleef                     3.5.1                h9b69904_2    conda-forge
snappy                    1.1.9                hbd366e4_2    conda-forge
sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
spdlog                    1.8.5                h4bd325d_1    conda-forge
tbb                       2021.7.0             h924138e_1    conda-forge
tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
tenacity                  8.1.0              pyhd8ed1ab_0    conda-forge
threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tokenizers                0.13.1           py38hb35c9e2_2    conda-forge
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
torchvision               0.14.0          cpu_py38hb98b4bf_0    conda-forge
tornado                   6.1              py38h0a891b7_3    conda-forge
tqdm                      4.64.1             pyhd8ed1ab_0    conda-forge
transformers              4.24.0             pyhd8ed1ab_0    conda-forge
treelite                  3.0.0            py38h8e2129e_1    conda-forge
treelite-runtime          3.0.0                    pypi_0    pypi
typing-extensions         4.4.0                hd8ed1ab_0    conda-forge
typing_extensions         4.4.0              pyha770c72_0    conda-forge
ucx                       1.13.1               h538f049_1    conda-forge
ucx-proc                  1.0.0                       gpu    rapidsai
ucx-py                    0.28.00         py38_g8292636_0    rapidsai
umap-learn                0.5.3            py38h578d9bd_0    conda-forge
urllib3                   1.26.14            pyhd8ed1ab_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xxhash                    0.8.1                h0b41bf4_0    conda-forge
xyzservices               2022.9.0           pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.8.2            py38h0a891b7_0    conda-forge
zict                      2.2.0              pyhd8ed1ab_0    conda-forge
zipp                      3.11.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zstd                      1.5.2                h3eb15da_5    conda-forge

@MaartenGr
Owner

@p-dre I should update the documentation, but the error message already gives you a hint as to what should be changed. In order to generate those probabilities, you should set prediction_data=True when instantiating the HDBSCAN model:

hdbscan_model = HDBSCAN(min_samples=10, min_cluster_size=10, gen_min_span_tree=True, prediction_data=True)
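To avoid forgetting the flag, the instantiation can be wrapped in a small helper. This is only a sketch: with_prediction_data is a hypothetical name, not part of BERTopic, cuML, or hdbscan, and FakeHDBSCAN is a stand-in class so the example runs without a GPU. In real use the first argument would be the HDBSCAN class imported from cuml.cluster or hdbscan:

```python
def with_prediction_data(hdbscan_cls, calculate_probabilities=True, **params):
    """Instantiate hdbscan_cls, forcing prediction_data=True when probabilities are wanted."""
    if calculate_probabilities:
        params["prediction_data"] = True
    return hdbscan_cls(**params)


class FakeHDBSCAN:
    """Stand-in that only records its keyword arguments (not a real clusterer)."""

    def __init__(self, **params):
        self.params = params


model = with_prediction_data(
    FakeHDBSCAN, min_samples=10, min_cluster_size=10, gen_min_span_tree=True
)
print(model.params)
```

The resulting model would then be passed to BERTopic as hdbscan_model together with calculate_probabilities=True, as in the snippets earlier in this thread.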

MaartenGr added a commit that referenced this issue Feb 4, 2023
MaartenGr mentioned this issue Feb 8, 2023
MaartenGr added a commit that referenced this issue Feb 14, 2023
* Add representation models
  * bertopic.representation.KeyBERTInspired
  * bertopic.representation.PartOfSpeech
  * bertopic.representation.MaximalMarginalRelevance
  * bertopic.representation.Cohere
  * bertopic.representation.OpenAI
  * bertopic.representation.TextGeneration
  * bertopic.representation.LangChain
  * bertopic.representation.ZeroShotClassification
* Fix topic selection when extracting repr docs
* Improve documentation, #769, #954, #912
* Add wordcloud example to documentation
* Add title param for each graph, #800
* Improved nr_topics procedure
* Fix #952, #903, #911, #965. Add #976