Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I have an question about the codes for "Index IVFFlat-Clustering". #150

Closed
jayoungo opened this issue Jul 6, 2017 · 4 comments
Closed
Labels

Comments

@jayoungo
Copy link

jayoungo commented Jul 6, 2017

In faiss/utils.cpp (from line 1431), for "ci" and "cj", why the code for "j%2==0" and "j%2==1" are the same?

if (j % 2 == 0) {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
}else {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
}

For a extreme case (two dimension), if a cluster for split contains two vector: (2, 0) and (0, 2), the centroid is (1, 1), and after "small symmetric pertubation", the two centroids become (1+ESP, 1+ESP) and (1-ESP, 1-ESP).
The two vectors has the same distances with the two new centroids, would the two vectors be assigned to the same centroid, and the other new centroid is still void after split?

@mdouze mdouze added the bug label Jul 6, 2017
@mdouze
Copy link
Contributor

mdouze commented Jul 6, 2017

Hi @yyy007

Thanks for the observation! This is indeed a bug, it should read:

if (j % 2 == 0) {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
}else {
    centroids[ci * d + j] *= 1 - EPS;
    centroids[cj * d + j] *= 1 + EPS;
}

@jayoungo
Copy link
Author

jayoungo commented Jul 6, 2017

Thank you! @mdouze
But for another assumption, for vectors (0, 0) and (2, 2), the new centroids will be (1+ESP, 1-ESP) and (1-ESP, 1+ESP). It seems that the two vectors still have the same distances with the two new centroids.

@mdouze
Copy link
Contributor

mdouze commented Jul 6, 2017

Absolutely.

The purpose of this function is to get out of bad initializations for realistic datasets without crashes and infinite loops.

It could be improved to handle corner cases like the one you mention. Feel free to suggest another approach. Maybe making the perturbation random and asymmetric should do the trick. @jegou, any thoughts on this?

@jegou
Copy link
Contributor

jegou commented Jul 6, 2017

@yyy007: The case that you mention is a corner case that has infinitesimal probability to happen for vectors long enough. Even if it happens, it is not particularly an issue: you can also have some times ties in k-means, whose resolution by any untying rule will change the situation at the next iteration because the vector will contribute to the update of one of the centroid.

Compared to the alternative of splitting the cluster in two by keeping the original cluster c and adding a new cluster c+epsilon, our proposal is much better to fix the following issue:
If you are in high-dimensional space, but that locally around the centroids your vectors belong to a subspace, then at the next iteration all the vectors will be assigned with high probability to the original centroid, and the new centroid c+epsilon will have no vector assigned to it. This is because this new centroid has some energy in the local null subspace. By making the perturbation symmetrical, we ensure that the two new centroids (c-epsilon and c+epsilon) have exactly the same amount of energy in this null space, and none is favored for the assignment. As a result (and we empirically observe it in the corner cases that this choice is intended to solve), you reduce the probability of generating void clusters along the iterations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants