This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Improve top-level item selection with k-means++ initialisation #24

Open
lsorber opened this issue Mar 10, 2019 · 2 comments

Comments


lsorber commented Mar 10, 2019

Currently, the top-level items are selected randomly [1], and are then pruned by removing items that are too similar [2]. This procedure might result in suboptimal top-level items being selected.

One trick for selecting cluster centroids that has worked exceptionally well for k-means is the k-means++ initialisation [3]. The idea is to start with a randomly chosen first centroid, and then to select each subsequent centroid with probability proportional to its squared distance to the nearest already-selected centroid.

An implementation of this algorithm should be relatively straightforward, with potentially large benefits. Would there be any interest in adding such an initialisation to PySparNN?

[1] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L127
[2] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L135
[3] https://en.wikipedia.org/wiki/K-means%2B%2B#Improved_initialization_algorithm
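For reference, the k-means++ seeding rule described above can be sketched as follows. This is an illustrative, standalone implementation on dense NumPy arrays (PySparNN itself operates on scipy sparse matrices, so an actual PR would need to use its sparse distance routines); the function name `kmeanspp_init` is just a placeholder, not part of PySparNN's API.

```python
import numpy as np

def kmeanspp_init(points, k, rng=None):
    """Select k seed indices from `points` via k-means++ seeding.

    points: (n, d) dense array of vectors.
    Returns a list of k row indices to use as initial centroids.
    """
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    # 1) Pick the first centroid uniformly at random.
    centers = [int(rng.integers(n))]
    # Squared distance from every point to its nearest chosen centroid.
    d2 = np.sum((points - points[centers[0]]) ** 2, axis=1)
    for _ in range(1, k):
        # 2) Sample the next centroid with probability proportional to d2,
        #    so far-away points are more likely to become centroids.
        probs = d2 / d2.sum()
        next_idx = int(rng.choice(n, p=probs))
        centers.append(next_idx)
        # 3) Refresh the nearest-centroid distances with the new centroid.
        new_d2 = np.sum((points - points[next_idx]) ** 2, axis=1)
        d2 = np.minimum(d2, new_d2)
    return centers
```

Because an already-chosen centroid has squared distance zero to itself, it can never be sampled again, so the k seeds are always distinct (assuming the points themselves are distinct).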


kchaliki commented Feb 5, 2020

@lsorber does this help your use case?

#28

I realize it's not the same approach but hopefully achieves similar results in practice.

spencebeecher (Contributor) commented

@lsorber - apologies for missing this. If you still have interest in submitting a PR, I would happily integrate it. I think @kchaliki has a very good alternative, which I plan on integrating over the next few days.
