I have a question about the code for "Index IVFFlat-Clustering". #150
Comments
Hi @yyy007, thanks for the observation! This is indeed a bug. It should read:
Thank you, @mdouze!
Absolutely. The purpose of this function is to get out of bad initializations for realistic datasets without crashes and infinite loops. It could be improved to handle corner cases like the one you mention. Feel free to suggest another approach. Maybe making the perturbation random and asymmetric should do the trick. @jegou, any thoughts on this?
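To make the "random and asymmetric" suggestion concrete, here is a minimal sketch in plain Python (not the faiss C++ code). The function name `split_random`, the value `EPS = 1/1024`, and the choice of an independent uniform factor per coordinate are all assumptions made for illustration, not anything taken from the library.

```python
import random

EPS = 1 / 1024  # assumed perturbation size, for illustration only


def split_random(centroid, rng, eps=EPS):
    """Return two perturbed copies of `centroid` (hypothetical helper).

    Each coordinate of each copy is scaled by its own random factor in
    [1 - eps, 1 + eps], so the two copies are not mirror images of
    each other.
    """
    ci = [v * (1 + rng.uniform(-eps, eps)) for v in centroid]
    cj = [v * (1 + rng.uniform(-eps, eps)) for v in centroid]
    return ci, cj


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ci, cj = split_random([1.0, 1.0], rng)
print(ci, cj)
```

A random perturbation does not guarantee that any particular pair of points gets separated. It only makes an exact tie across all points very unlikely.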
@yyy007: The case you mention is a corner case that has an infinitesimal probability of happening for sufficiently long vectors. Even if it happens, it is not much of an issue: k-means can also produce ties, and resolving them with any tie-breaking rule changes the situation at the next iteration, because the vector will contribute to the update of one of the centroids. Compared to the alternative of splitting the cluster in two by keeping the original cluster
In faiss/utils.cpp (from line 1431), for "ci" and "cj", why is the code for "j%2==0" and "j%2==1" the same?
For an extreme case (two dimensions), if the cluster to split contains two vectors, (2, 0) and (0, 2), the centroid is (1, 1), and after the "small symmetric perturbation" the two centroids become (1+EPS, 1+EPS) and (1-EPS, 1-EPS).
The two vectors are at the same distance from each of the two new centroids. Would both vectors be assigned to the same centroid, leaving the other new centroid still empty after the split?
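The two-dimensional example can be checked numerically. The sketch below is plain Python rather than the faiss C++ code. `EPS = 1/1024`, the helper names, and the lowest-index tie-breaking rule are assumptions made for illustration. It compares a perturbation with the same sign in every dimension against a variant that flips the sign with the parity of `j`, which is one plausible reading of what separate `j%2` branches could be for.

```python
EPS = 1 / 1024  # assumed perturbation size, for illustration only


def sq_dist(x, c):
    """Squared Euclidean distance between point x and centroid c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def split_uniform(centroid, eps=EPS):
    """Same sign in every dimension, as described in the question."""
    ci = [v * (1 + eps) for v in centroid]
    cj = [v * (1 - eps) for v in centroid]
    return ci, cj


def split_alternating(centroid, eps=EPS):
    """Sign flips with the parity of the dimension index j (hypothetical variant)."""
    ci, cj = [], []
    for j, v in enumerate(centroid):
        s = eps if j % 2 == 0 else -eps
        ci.append(v * (1 + s))
        cj.append(v * (1 - s))
    return ci, cj


def assign(points, centroids):
    """Nearest-centroid assignment; ties go to the lowest centroid index."""
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


points = [(2.0, 0.0), (0.0, 2.0)]
centroid = [1.0, 1.0]

uniform_assignment = assign(points, split_uniform(centroid))
alternating_assignment = assign(points, split_alternating(centroid))
print(uniform_assignment, alternating_assignment)
```

With the uniform perturbation, each point is exactly equidistant from both new centroids (squared distance 2 + 2·EPS² in each case), so under this tie-breaking rule both points land on the first centroid and the second stays empty. With the alternating signs, the centroids become (1+EPS, 1-EPS) and (1-EPS, 1+EPS), and the two points are assigned to different centroids.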