I have a question about the code for "Index IVFFlat-Clustering". #150

jayoungo · 2017-07-06T03:40:55Z

In faiss/utils.cpp (from line 1431), why is the code for "ci" and "cj" the same in the "j%2==0" and "j%2==1" branches?

if (j % 2 == 0) {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
} else {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
}

For an extreme case (two dimensions), if the cluster to split contains two vectors, (2, 0) and (0, 2), the centroid is (1, 1), and after the "small symmetric perturbation" the two centroids become (1+EPS, 1+EPS) and (1-EPS, 1-EPS).
Both vectors are at the same distance from the two new centroids. Would both vectors be assigned to the same centroid, leaving the other new centroid still void after the split?
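The tie described in this example can be checked numerically. The sketch below is not the faiss code: `sq_dist`, `perturb_as_reported` and `is_tie` are hypothetical helpers written for this illustration, and it uses `double` instead of the `float` arrays in faiss. It only reproduces the effect of the two identical branches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared L2 distance between two vectors of equal length.
double sq_dist(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        double diff = a[j] - b[j];
        s += diff * diff;
    }
    return s;
}

// Mimics the snippet as quoted in the question: both branches of the if/else
// are identical, so every coordinate of ci is scaled by (1 + eps) and every
// coordinate of cj by (1 - eps).
void perturb_as_reported(std::vector<double>& ci, std::vector<double>& cj,
                         double eps) {
    assert(ci.size() == cj.size());
    for (std::size_t j = 0; j < ci.size(); ++j) {
        ci[j] *= 1 + eps;
        cj[j] *= 1 - eps;
    }
}

// Returns true when x is exactly equidistant from both centroids.
bool is_tie(const std::vector<double>& x, const std::vector<double>& ci,
            const std::vector<double>& cj) {
    return sq_dist(x, ci) == sq_dist(x, cj);
}
```

Starting from ci = cj = (1, 1) and calling `perturb_as_reported` with eps = 1/1024 (an assumed value, chosen as a power of two so the arithmetic is exact in `double`), `is_tie` returns true for both (2, 0) and (0, 2).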


mdouze · 2017-07-06T05:20:52Z

Hi @yyy007

Thanks for the observation! This is indeed a bug; it should read:

if (j % 2 == 0) {
    centroids[ci * d + j] *= 1 + EPS;
    centroids[cj * d + j] *= 1 - EPS;
} else {
    centroids[ci * d + j] *= 1 - EPS;
    centroids[cj * d + j] *= 1 + EPS;
}
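To see what the corrected branches change for the (2, 0) / (0, 2) example from the question, here is a minimal sketch. `perturb_alternating`, `sq_dist` and `nearest` are hypothetical names for this illustration, and the code works on `std::vector<double>` rather than on the flat `float` centroid table used in faiss.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared L2 distance between two vectors of equal length.
double sq_dist(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        double diff = a[j] - b[j];
        s += diff * diff;
    }
    return s;
}

// Alternating symmetric perturbation, following the corrected snippet:
// even dimensions scale ci up and cj down, odd dimensions do the opposite.
void perturb_alternating(std::vector<double>& ci, std::vector<double>& cj,
                         double eps) {
    assert(ci.size() == cj.size());
    for (std::size_t j = 0; j < ci.size(); ++j) {
        if (j % 2 == 0) {
            ci[j] *= 1 + eps;
            cj[j] *= 1 - eps;
        } else {
            ci[j] *= 1 - eps;
            cj[j] *= 1 + eps;
        }
    }
}

// Returns 0 if x is strictly closer to ci, 1 if strictly closer to cj,
// and -1 if the two distances are equal.
int nearest(const std::vector<double>& x, const std::vector<double>& ci,
            const std::vector<double>& cj) {
    double di = sq_dist(x, ci);
    double dj = sq_dist(x, cj);
    if (di < dj) return 0;
    if (dj < di) return 1;
    return -1;
}
```

With ci = cj = (1, 1) and an assumed eps = 1/1024, the centroids become (1+eps, 1-eps) and (1-eps, 1+eps); `nearest` then returns 0 for (2, 0) and 1 for (0, 2), so each new centroid receives one vector.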

jayoungo · 2017-07-06T05:53:17Z

Thank you! @mdouze
But consider another case: for vectors (0, 0) and (2, 2), the new centroids will be (1+EPS, 1-EPS) and (1-EPS, 1+EPS). It seems that both vectors are still equidistant from the two new centroids.
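The tie in this second example can be verified by expanding the squared distances. The notation is introduced here for illustration only: ε stands for EPS, and c_i = (1+ε, 1-ε), c_j = (1-ε, 1+ε) are the two perturbed centroids.

```latex
\begin{align*}
\|(0,0) - c_i\|^2 &= (1+\varepsilon)^2 + (1-\varepsilon)^2 = 2 + 2\varepsilon^2, \\
\|(0,0) - c_j\|^2 &= (1-\varepsilon)^2 + (1+\varepsilon)^2 = 2 + 2\varepsilon^2, \\
\|(2,2) - c_i\|^2 &= (1-\varepsilon)^2 + (1+\varepsilon)^2 = 2 + 2\varepsilon^2, \\
\|(2,2) - c_j\|^2 &= (1+\varepsilon)^2 + (1-\varepsilon)^2 = 2 + 2\varepsilon^2.
\end{align*}
```

All four distances are equal, so the assignment of both vectors depends entirely on how ties are broken.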

mdouze · 2017-07-06T10:03:29Z

Absolutely.

The purpose of this function is to get out of bad initializations for realistic datasets without crashes and infinite loops.

It could be improved to handle corner cases like the one you mention. Feel free to suggest another approach. Maybe making the perturbation random and asymmetric should do the trick. @jegou, any thoughts on this?
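One way to read the "random and asymmetric" suggestion is sketched below. This is only an interpretation of the idea floated in this comment, not what faiss does: `perturb_random` is a hypothetical function, and drawing an independent uniform scale factor per coordinate is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical variant of the split: each coordinate of each centroid gets its
// own independent scale factor drawn from [1 - eps, 1 + eps), so the two
// perturbations are no longer mirror images of each other.
void perturb_random(std::vector<double>& ci, std::vector<double>& cj,
                    double eps, std::mt19937& rng) {
    assert(ci.size() == cj.size());
    std::uniform_real_distribution<double> factor(1.0 - eps, 1.0 + eps);
    for (std::size_t j = 0; j < ci.size(); ++j) {
        ci[j] *= factor(rng);
        cj[j] *= factor(rng);
    }
}
```

Because the factors are independent, an exact tie such as the (0, 0) / (2, 2) case becomes very unlikely, but the two centroids no longer move by equal and opposite amounts.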

jegou · 2017-07-06T11:09:34Z

@yyy007: The case you mention is a corner case with an infinitesimal probability of occurring for sufficiently long vectors. Even if it happens, it is not really an issue: ties can occur in k-means anyway, and resolving them with any tie-breaking rule changes the situation at the next iteration, because the vector will contribute to the update of one of the centroids.

Compared to the alternative of splitting the cluster in two by keeping the original centroid c and adding a new centroid c+epsilon, our proposal is much better at fixing the following issue:
If you are in a high-dimensional space, but locally around the centroid your vectors lie in a subspace, then at the next iteration all the vectors will, with high probability, be assigned to the original centroid, and the new centroid c+epsilon will have no vector assigned to it. This is because the new centroid has some energy in the local null subspace. By making the perturbation symmetric, we ensure that the two new centroids (c-epsilon and c+epsilon) have exactly the same amount of energy in this null space, so neither is favored in the assignment. As a result (and we observe this empirically in the corner cases this choice is intended to solve), you reduce the probability of generating void clusters over the iterations.
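The null-subspace argument can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not faiss code: it uses an additive epsilon (as in the c+epsilon notation above) rather than the multiplicative EPS of the actual function, and `Vec`, `sq_dist`, `add_scaled` and `count_closer_to` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical helper: squared L2 distance between two vectors of equal length.
double sq_dist(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        double diff = a[j] - b[j];
        s += diff * diff;
    }
    return s;
}

// Hypothetical helper: returns a + s * b, coordinate by coordinate.
Vec add_scaled(const Vec& a, const Vec& b, double s) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) {
        out[j] = a[j] + s * b[j];
    }
    return out;
}

// Counts the points that are strictly closer to `a` than to `b`.
std::size_t count_closer_to(const std::vector<Vec>& pts, const Vec& a,
                            const Vec& b) {
    std::size_t n = 0;
    for (const Vec& p : pts) {
        if (sq_dist(p, a) < sq_dist(p, b)) ++n;
    }
    return n;
}
```

Take four points in the plane z = 0, namely (0,0,0), (2,0,0), (0,2,0) and (2,2,0), with centroid c = (1,1,0) and an assumed epsilon = (1/64, 0, 1/4) whose third component lies in the null subspace. With the asymmetric split, `count_closer_to(pts, add_scaled(c, eps, 1.0), c)` is 0: the z component adds the same penalty to every point, so c+epsilon stays void. With the symmetric split, c+epsilon and c-epsilon carry the same penalty, and each of them is the closer centroid for two of the four points.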

mdouze added the bug label Jul 6, 2017

jegou closed this as completed Jul 6, 2017

q423462798 mentioned this issue May 30, 2018

infinite loop when clustering #463

Closed
