This project implements three clustering algorithms - KMeans, KMode, KPrototype, and DBSCAN - using both local and global parallelized approach with PySpark. The algorithms are implemented separately and organized into different subdirectories within the repository.
-
KMeans: KMeans is a popular clustering algorithm that partitions data into 'k' clusters based on similarity of features.
-
KMode: KMode is a clustering algorithm specifically designed for categorical data, where clusters are formed based on the mode of categories. Both local and global implementations are provided in this directory, along with an experimental notebook demonstrating how the size-up and num partition affect runtime. There are also notebooks that includes a step-by-step explanation of implementation.
-
KPrototype: KPrototype extends KMeans to handle both numerical and categorical data, allowing for more versatile clustering in mixed data types.
-
DBSCAN: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together data points that are closely packed, while marking outliers as noise.
During the mapping phase, centroids
After computing merged count hashed data for each cluster, representing (cluster, merged count-hash) pairs, they are transmitted to the driver code. Subsequently, new centroids
The iterative process continues until convergence, indicated by the Hamming distance between new and old centroids reaching the stopping criterion of 0, signifying identical centroids.
In this approach, we adopt a parallelized strategy where the sequential model learns from each data partition concurrently. Thus, dividing the full dataset into N partitions and selecting K as the number of clusters results in the mapping phase generating NxK centroids as output, transmitted to the driver. These centroids, each representing the mode of clusters within its partition, collectively approximate the behavior of the sequential algorithm. Considering these centroids as a dataset, KMode clustering is performed in the driver code to derive the final centroids for K clusters.
The notebooks in this folder contain both the Kmode global and local approaches. We also keep versions showing failures and mistakes to demonstrate what we've learned from this project.
Please refer to kmode/experiments_notebook/K-Mode-global3.ipynb
and kmode/experiments_notebook/K-Mode-local2.ipynb
if you want to try out our implementation of the global and local approaches, respectively. The required packages to run our code are provided in requirements.txt.
-
kmode/experiments_notebook/K-Mode-local.ipynb
: The initial version of the local implementation where there was an issue withnp.apply_along_axis
, resulting in strings having only the first character, leading to incorrect centroids. This has been fixed in./experiments_notebook/K-Mode-local2.ipynb
. -
kmode/experiments_notebook/K-Mode-global.ipynb
: The first version of the global implementation, where we simply mapped centroids in the mapping phase and grouped all the data in the reducing phase, then sent them back to the driver to calculate the mode. This was inefficient as it involved a lot of data movement and didn't utilize the reducing process well. -
kmode/experiments_notebook/K-Mode-global2.ipynb
: The second version of the global implementation, where we counted the number of occurrences of elements within the reducer phase to reduce data movement from the reducer phase to the driver. However, the data movement from the mapping phase to the reducing phase remained the same. We then implemented mini-reducers in the final version./experiments_notebook/K-Mode-global3.ipynb
.
The final implementation has been rewritten in a class object style, but we suggest testing using notebooks for better visibility.
-
kmode/kmode_global.py
: This file contains the global implementation of KMode as a Python class object. -
kmode/kmode_local.py
: This file contains the local implementation of KMode as a Python class object. In each partition, it uses a sequential KMode object fromkmode_sequential.py
to learn on data in their partition.
The results from the global and sequential approaches are an exact match; however, the local approach provides only an approximation.
Feature | Global & Sequential | Local | ||||
---|---|---|---|---|---|---|
Cluster | 1 | 2 | 3 | 1 | 2 | 3 |
cap-shape | f | x | f | x | x | x |
cap-surface | nan | y | x | nan | nan | h |
cap-color | n | n | x | n | n | n |
does-bruise-or-bleed | f | f | f | f | f | f |
gill-attachment | d | a | d | d | a | d |
gill-spacing | c | nan | c | c | nan | c |
gill-color | w | w | w | w | w | w |
stem-root | nan | nan | nan | nan | nan | nan |
stem-surface | nan | nan | nan | nan | nan | nan |
stem-color | w | w | w | w | w | w |
veil-type | nan | nan | nan | nan | nan | nan |
veil-color | nan | nan | nan | nan | nan | nan |
has-ring | f | t | f | f | f | t |
ring-type | f | f | f | f | f | e |
spore-print-color | nan | nan | nan | nan | nan | nan |
habitat | d | d | d | d | d | d |
season | u | a | u | a | a | u |
Table: Comparison of the final centroids obtained from clustering secondary mushroom dataset with KMode algorithm with sequential, global and local approach
Feature | Global & Sequential Cluster 1 | Global & Sequential Cluster 2 | Global & Sequential Cluster 3 | Local Cluster 1 | Local Cluster 2 | Local Cluster 3 |
---|---|---|---|---|---|---|
bin_0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
bin_3 | F | T | F | F | F | T |
bin_4 | N | N | N | N | Y | N |
nom_0 | Red | Red | Blue | Red | Red | Blue |
nom_1 | Polygon | Trapezoid | Triangle | Triangle | Triangle | Triangle |
nom_2 | Hamster | Hamster | Dog | Axolotl | Hamster | Hamster |
nom_3 | India | Finland | Costa Rica | India | India | India |
nom_4 | Bassoon | Theremin | Theremin | Theremin | Theremin | Theremin |
nom_5 | nan | nan | nan | nan | nan | nan |
nom_6 | nan | nan | nan | nan | nan | nan |
nom_7 | nan | nan | nan | nan | nan | nan |
nom_8 | nan | nan | nan | nan | nan | nan |
nom_9 | nan | nan | nan | nan | nan | nan |
ord_0 | 2.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 |
ord_1 | Expert | Novice | Contributor | Novice | Novice | Novice |
ord_2 | Freezing | Freezing | Freezing | Freezing | Freezing | Freezing |
ord_3 | n | n | a | n | n | a |
ord_4 | Y | N | R | N | P | N |
ord_5 | nan | nan | nan | nan | nan | nan |
day | 1.0 | 6.0 | 7.0 | 3.0 | 5.0 | 3.0 |
month | 6.0 | 7.0 | 5.0 | 8.0 | 8.0 | 5.0 |
Table: Comparison of the final centroids obtained from clustering Kaggle categorical encoding competition II dataset using the KMode algorithm with sequential, global, and local approaches.
To get started with using these clustering algorithms:
- Navigate to the respective subdirectory for the algorithm you're interested in.
- Follow the instructions provided in the README file within each directory for installation, usage, and examples.
- Ensure you have PySpark installed and configured to run the algorithms in a distributed environment.