How can I get kmeans clustered features? #22

Open
Remaxic opened this issue Feb 27, 2024 · 6 comments
Comments

@Remaxic

Remaxic commented Feb 27, 2024

Hi,
I ran the checkpoint_best_legacy_100.pt model with the inference code under the fairseq framework and found that the generated features were unclustered. Your paper says that clustering the output is optional, so how can I choose to output the clustered features?

Meanwhile, I clustered the output using learn_kmeans.py and dump_km_label.py in the fairseq framework. I chose n=50 and then decoded the result using a trained decoder, but the results were very poor. I'm wondering if this is because your model was trained for n=100, so even though the output is continuous features, it only presents the best performance at n=100?

Remaxic changed the title from "How can I get the results after kmeans clustering?" to "How can I get kmeans clustered features?" on Feb 27, 2024
@auspicious3000
Owner

Clustering is a separate step. You need to use the code in the fairseq framework to do that, just like you did above.
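For readers following along, the two-step flow described here (learn a codebook from the continuous features, then map every frame to its nearest centroid) can be sketched roughly as follows. This is a plain NumPy stand-in, not the actual code of learn_kmeans.py or dump_km_label.py; the function names `fit_kmeans` and `assign_labels`, the random placeholder features, and the feature dimension are all assumptions made for illustration.

```python
import numpy as np


def fit_kmeans(features, n_clusters, n_iter=20, seed=0):
    """Lloyd's k-means on features of shape (frames, dim); returns centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    init_idx = rng.choice(len(features), size=n_clusters, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(n_iter):
        labels = assign_labels(features, centroids)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids


def assign_labels(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    # Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2.
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)


# Placeholder data: random vectors standing in for extracted features.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 16))

centroids = fit_kmeans(features, n_clusters=50)  # step 1: learn the codebook
labels = assign_labels(features, centroids)      # step 2: one unit id per frame
print(labels[:10])
```

In a real pipeline the `features` array would come from the model's inference output, and the number of clusters is whatever the downstream task calls for.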

"I'm wondering if this is because your model was trained for n=100, so even though the output is continuous features, it only presents the best performance at n=100?"
Not necessarily. You can cluster the features into any number of clusters you want. The key here is to retrain the decoder, because even if you cluster into 100 classes, the class ids are going to be different every time you run the clustering.
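A small sketch of why the ids are not stable across runs. It assumes the best case, where a second clustering run finds exactly the same centroids but stores them in a different order (simulated here by shuffling); real reruns would usually move the centroids as well. All names and the random data are illustrative only.

```python
import numpy as np


def assign_labels(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
n_clusters, dim = 100, 8
centroids_run1 = rng.normal(size=(n_clusters, dim))

# Assumption: a second run finds the same centroids but in another order.
perm = rng.permutation(n_clusters)
centroids_run2 = centroids_run1[perm]

features = rng.normal(size=(500, dim))  # placeholder frames
labels_run1 = assign_labels(features, centroids_run1)
labels_run2 = assign_labels(features, centroids_run2)

# The frames are grouped identically, but the integer ids differ.
agreement = (labels_run1 == labels_run2).mean()
same_partition = np.array_equal(perm[labels_run2], labels_run1)
print(f"ids agree on {agreement:.1%} of frames; same partition: {same_partition}")
```

A decoder trained on the ids from the first run would read the second run's ids as different units, which is why it has to be retrained (or the same k-means model reused) whenever the clustering is redone.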

@Remaxic
Author

Remaxic commented Feb 28, 2024

I see! Thank you very much!
"Clustering is a separate step", so what's the difference between the model with classes=100 and classes=500?

@auspicious3000
Owner

That's the number of clusters used for the teacher labels.

@Remaxic
Author

Remaxic commented Feb 28, 2024

Thank you!

@huangf79

@Remaxic Hi. Have you obtained good clustering results? Could you share your script?

@Remaxic
Author

Remaxic commented Mar 26, 2024

@huangf79 Hi, I just extracted the features of my dataset using the contentvec model and trained k-means clustering models with k=50 and k=100 by calling learn_kmeans.py and dump_km_label.py under the fairseq framework. I found that the former performs nowhere near as well as the latter and does not even meet the basic needs of my downstream task.

I read the papers by the HuBERT authors, hoping to find their particular method of training a perfect clustering model. But there doesn't seem to be one: they didn't perform dimensionality reduction or other special operations, and the only difference I can see is that their dataset (100h) is much larger than mine (about 44h). Considering that the model performs well with k=100, I'm guessing it has something to do with contentvec's feature extraction capabilities. Perhaps it is not suitable for small codebook tasks, or perhaps a better discretisation idea is needed.
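One rough way to put a number on the k=50 versus k=100 gap before training a decoder is to compare the quantization error of the two codebooks on the same features. The sketch below is only an illustration: `quantization_error` is a hypothetical helper, the features are random placeholders, and the codebooks are random frames rather than trained k-means centroids. In practice the centroids would be taken from the two k-means models that were actually trained.

```python
import numpy as np


def quantization_error(features, centroids):
    """Mean squared distance from each frame to its nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float(dists.min(axis=1).mean())


rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))  # placeholder for extracted features

# Placeholder codebooks: random frames instead of trained k-means centroids.
picked = rng.choice(len(features), size=100, replace=False)
codebook_50 = features[picked[:50]]
codebook_100 = features[picked]

err_50 = quantization_error(features, codebook_50)
err_100 = quantization_error(features, codebook_100)
print(f"k=50: {err_50:.3f}  k=100: {err_100:.3f}")
```

Some drop in error from k=50 to k=100 is expected with any features; an unusually large drop on the real features might hint that 50 units are too coarse for them, though this is a diagnostic rather than a fix.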

If you have a better clustering idea and would like to let me know, I would be very grateful!
