large number of representative docs for topic 0 #965

jchen04 · 2023-01-26T22:49:59Z


I'm trying to build BERTopic models on a set of ~130k tweets, preprocessed in the same way as the DTM example in the docs. For some reason, topic 0 has over 200 representative docs. Topic 1 has 6 docs, and the rest of the topics have just 3 docs (which seems to be the expected/correct number). This makes me also wonder if the selected keywords for topics 0/1, as well as the docs assigned to those topics, can be trusted.

My code is below. I'm running BERTopic v0.13.0 in a Jupyter notebook. Any ideas would be appreciated, thanks!

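import time

import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP

# df (a DataFrame with a 'text' column) and seed_topic_list are defined earlier in the notebook
umap_model = UMAP(random_state=42)

# Pre-compute the embeddings so they can be passed to fit_transform
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(df['text'].values, show_progress_bar=True)

ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
np.random.seed(100)
topic_model = BERTopic(calculate_probabilities=False,
                       nr_topics='auto',
                       ctfidf_model=ctfidf_model,
                       seed_topic_list=seed_topic_list,
                       umap_model=umap_model,
                       min_topic_size=100,
                       )
t1 = time.time()
topics, probs = topic_model.fit_transform(df['text'].values, embeddings)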

MaartenGr (Maintainer) · 2023-01-27T06:39:53Z

When you use "auto" to reduce the topics, the representative documents of the topics that are merged are merged as well. When BERTopic was first developed, representative documents were extracted through some of HDBSCAN's internal functions, which prevented re-calculating them after topics were merged. Currently, that process is much easier by using c-TF-IDF with randomly sampled documents in each topic. In other words, the more representative documents a topic contains, the more topics were merged into that resulting topic. I might change this in an upcoming release, but that will require some testing.
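To make the merging behaviour concrete, here is a minimal toy sketch in plain Python. It is not BERTopic's internal code: `merge_representative_docs`, `estimate_merged_counts`, the `mapping` dictionary, and the assumption that every pre-merge topic contributes exactly 3 representative documents are all hypothetical and used only for illustration.

```python
from typing import Dict, List

# Assumption: each topic found before reduction carries 3 representative docs
DOCS_PER_CLUSTER = 3


def merge_representative_docs(
    repr_docs: Dict[int, List[str]], mapping: Dict[int, int]
) -> Dict[int, List[str]]:
    """Toy model of topic reduction: a merged topic inherits the
    representative docs of every topic that was folded into it."""
    merged: Dict[int, List[str]] = {}
    for old_topic, docs in repr_docs.items():
        # Topics missing from the mapping keep their own id
        new_topic = mapping.get(old_topic, old_topic)
        merged.setdefault(new_topic, []).extend(docs)
    return merged


def estimate_merged_counts(
    repr_docs: Dict[int, List[str]], docs_per_cluster: int = DOCS_PER_CLUSTER
) -> Dict[int, int]:
    """Rough estimate of how many original topics ended up in each topic."""
    return {topic: len(docs) // docs_per_cluster for topic, docs in repr_docs.items()}


# Five original topics with three placeholder docs each
original = {t: [f"doc_{t}_{i}" for i in range(DOCS_PER_CLUSTER)] for t in range(5)}

# Hypothetical reduction: topics 3 and 4 are merged into 0, topic 2 into 1
mapping = {3: 0, 4: 0, 2: 1}

merged = merge_representative_docs(original, mapping)
counts = {topic: len(docs) for topic, docs in merged.items()}
estimates = estimate_merged_counts(merged)
```

Under these assumptions, a topic with 6 representative documents would correspond to two merged topics, and one with over 200 to several dozen. On a fitted model, the dictionary returned by `topic_model.get_representative_docs()` could presumably be passed to `estimate_merged_counts` in the same way, though the result is only as good as the 3-docs-per-topic assumption.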
