This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Improve top-level item selection with k-means++ initialisation #24

Open
lsorber opened this issue Mar 10, 2019 · 2 comments

Comments


lsorber commented Mar 10, 2019

Currently, the top-level items are selected randomly [1], and are then pruned by removing items that are too similar [2]. This procedure might result in suboptimal top-level items being selected.

One trick for selecting cluster centroids that has worked exceptionally well for k-means is the k-means++ initialisation [3]. The idea is to start with a randomly chosen first centroid, and then to select each subsequent centroid with probability proportional to its squared distance to the nearest already-selected centroid.

An implementation of this algorithm should be relatively straightforward, with potentially large benefits. Would there be any interest in adding such an initialisation to PySparNN?

[1] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L127
[2] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L135
[3] https://en.wikipedia.org/wiki/K-means%2B%2B#Improved_initialization_algorithm
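For reference, the k-means++ seeding rule described above can be sketched as follows. This is an illustrative, standalone implementation on dense NumPy arrays (PySparNN itself operates on scipy sparse matrices, so an actual PR would need to use its sparse distance routines); the function name `kmeanspp_init` is just a placeholder, not part of PySparNN's API.

```python
import numpy as np

def kmeanspp_init(points, k, rng=None):
    """Select k seed indices from `points` via k-means++ seeding.

    points: (n, d) dense array of vectors.
    Returns a list of k row indices to use as initial centroids.
    """
    rng = np.random.default_rng(rng)
    n = points.shape[0]
    # 1) Pick the first centroid uniformly at random.
    centers = [int(rng.integers(n))]
    # Squared distance from every point to its nearest chosen centroid.
    d2 = np.sum((points - points[centers[0]]) ** 2, axis=1)
    for _ in range(1, k):
        # 2) Sample the next centroid with probability proportional to d2,
        #    so far-away points are more likely to become centroids.
        probs = d2 / d2.sum()
        next_idx = int(rng.choice(n, p=probs))
        centers.append(next_idx)
        # 3) Refresh the nearest-centroid distances with the new centroid.
        new_d2 = np.sum((points - points[next_idx]) ** 2, axis=1)
        d2 = np.minimum(d2, new_d2)
    return centers
```

Because an already-chosen centroid has squared distance zero to itself, it can never be sampled again, so the k seeds are always distinct (assuming the points themselves are distinct).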


kchaliki commented Feb 5, 2020

@lsorber does this help your use case?

#28

I realize it's not the same approach but hopefully achieves similar results in practice.

spencebeecher (Contributor) commented

@lsorber - apologies for missing this. If you still have interest in submitting a PR, I would happily integrate it. I think @kchaliki has a very good alternative, which I plan on integrating over the next few days.
