
Implements behavioral segmentation of 3,999 airline customers using K-Means and Hierarchical Clustering, and explores possible target marketing options.


imbottlebird/customer_segmentation


Customer Segmentation

  1. Data exploration and preprocessing
  2. K-Means clustering
  3. Hierarchical clustering
  4. Potential targeted marketing based on clusters

1. Data exploration and preprocessing

The dataset covers 3,999 customers from the loyalty program of a former airline.

Six numerical variables describe each customer:

  • Balance: Number of miles eligible for award travel
  • BonusTrans: Number of non-flight bonus transactions in the past 12 months
  • BonusMiles: Number of miles earned from those transactions
  • FlightTrans: Number of flight transactions
  • FlightMiles: Number of miles earned from those transactions
  • DaysSinceEnroll: Tenure in the program (days)

Data preprocessing (scaling)

  • First, 'center' the data by subtracting its mean from each column (the mean of each column becomes 0)
  • Then, 'scale' the data by dividing each column by its standard deviation (the standard deviation of each column becomes 1)
#step 1: create the pre-processor using preProcess
# normalization for each col: (X_i-mean)/std
pp <- preProcess(airline, method=c("center", "scale"))
class(pp)
pp
pp$mean

#step 2: apply it to the dataset
airline.scaled <- predict(pp, airline)

# Sanity check
# mean is (approximately) 0 for all columns.
colMeans(airline)
colMeans(airline.scaled)
apply(airline.scaled,2,sd)
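The same center-and-scale transform can be reproduced with base R's scale(), independent of the caret preprocessing above (a quick self-contained check on a toy matrix):

```r
# Toy matrix: two columns with different means and spreads
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# scale() subtracts each column's mean and divides by its standard deviation
m.scaled <- scale(m)

colMeans(m.scaled)       # ~0 for each column
apply(m.scaled, 2, sd)   # 1 for each column
```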
[Figures: the data before and after scaling]

2. K-Means clustering

K-means starts from a random initialization, with the centroids placed at random locations. It then iterates the following two steps until convergence:

  • Assign each observation to the nearest centroid
  • Recalculate centroids as average of assigned observations
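The two steps can be sketched from scratch (a minimal illustration on synthetic 2-D data, with no empty-cluster restarts; the analysis itself relies on the built-in kmeans function):

```r
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)      # 100 synthetic 2-D observations
k <- 3
centroids <- X[sample(nrow(X), k), ]   # random start: k observations as initial centroids

for (iter in 1:100) {
  # Step 1: assign each observation to the nearest centroid (squared Euclidean distance)
  d2 <- sapply(1:k, function(j) rowSums((X - matrix(centroids[j, ], nrow(X), 2, byrow = TRUE))^2))
  clusters <- apply(d2, 1, which.min)
  # Step 2: recompute each centroid as the average of its assigned observations
  new.centroids <- t(sapply(1:k, function(j) {
    pts <- X[clusters == j, , drop = FALSE]
    if (nrow(pts) == 0) centroids[j, ] else colMeans(pts)  # keep old centroid if a cluster empties
  }))
  if (all(abs(new.centroids - centroids) < 1e-9)) break    # converged: centroids stopped moving
  centroids <- new.centroids
}

table(clusters)  # size of each cluster
```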
# The kmeans function creates the clusters
# set the number of k=8
km <- kmeans(airline.scaled, centers = 8, iter.max=100) 
# centers randomly selected from rows of airline.scaled

class(km) # class: kmeans
names(km)

# cluster centroids. Store this result
km.centroids <- km$centers
km.centroids
# cluster for each point. Store this result.
km.clusters <- km$cluster
km.clusters

# the sum of the squared distances of each observation from its cluster centroid => cluster dissimilarity
km$tot.withinss  # cluster dissimilarity

# the number of observations in each cluster
km.size <- km$size
km.size

Scree plot

For k-means, try many values of k and compare their dissimilarity; here, let's test every k from 1 to 100.

k.data <- data.frame(k = 1:100)
k.data$SS <- sapply(k.data$k, function(k) {
  kmeans(airline.scaled, centers = k, iter.max = 100)$tot.withinss
})

# Plot the scree plot.
plot(k.data$k, k.data$SS, type="l")
plot(k.data$k, k.data$SS, type="l", xlim=c(0,40))
axis(side = 1, at = 1:10)

Let's zoom on the smallest k values (1-40) to take a closer look.

To select a "good" k value, pick the one at the corner of the "L", where the slope changes from steep to shallow (the elbow). Here, k=8 seems to be a good pick.
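The elbow can also be read off numerically as the point where the per-cluster improvement in tot.withinss collapses. A self-contained sketch on two synthetic blobs (the same diff can be taken on k.data$SS for the airline data):

```r
set.seed(2)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))  # two well-separated blobs

SS <- sapply(1:6, function(k) kmeans(X, centers = k, iter.max = 100, nstart = 5)$tot.withinss)

# Drop in dissimilarity from each extra cluster; the elbow is where the drops collapse
improvement <- head(SS, -1) - tail(SS, -1)
round(improvement)  # large drop going to k = 2, small drops afterwards
```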



3. Hierarchical Clustering

Hierarchical clustering does not require pre-specifying the number of clusters, as K-means does. It also has the advantage over K-means of visualizing the clustering process as a tree-based representation called a dendrogram.

Hierarchical clustering computes all pairwise Euclidean distances between the observations. It starts with as many clusters as data points (here, 3,999), then iteratively merges the pair of clusters with the smallest dissimilarity ("closest" to each other) until only a single cluster remains.
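As a toy illustration of such a pairwise distance matrix and the first merge (three made-up 2-D points, using hclust's default complete linkage rather than the ward.D2 criterion used below):

```r
toy <- data.frame(x = c(0, 3, 0), y = c(0, 4, 1))
d.toy <- dist(toy)        # Euclidean by default

as.matrix(d.toy)[1, 2]    # sqrt((0-3)^2 + (0-4)^2) = 5
hc.toy <- hclust(d.toy)   # complete linkage by default
hc.toy$merge              # first merge: points 1 and 3, the closest pair (distance 1)
```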

d <- dist(airline.scaled) # method = "euclidean"
class(d)

# Creates the Hierarchical clustering
hclust.mod <- hclust(d, method="ward.D2")

# The "method=ward.D2" indicates the criterion to select the pair of clusters to be merged at each iteration

# Now, plot the hierarchy structure (dendrogram)
# labels=F (false) to not print text for each of the 3999 observations
plot(hclust.mod, labels=F, ylab="Dissimilarity", xlab = "", sub = "")

Scree Plot

Create the scree plot: dissimilarity for each k.

hc.dissim <- data.frame(k = seq_along(hclust.mod$height),   # index: 1,2,...,length(hclust.mod$height)
                        dissimilarity = rev(hclust.mod$height)) # merge heights in decreasing order
head(hc.dissim)

# Scree plot
plot(hc.dissim$k, hc.dissim$dissimilarity, type="l")

# Let's zoom on the smallest k values:
plot(hc.dissim$k, hc.dissim$dissimilarity, type="l", xlim=c(0,40))
axis(side = 1, at = 1:10)

Based on the same elbow reasoning as before, k=7 seems to be a good pick.

# Improvement in dissimilarity from each additional cluster
hc.dissim.dif <- head(hc.dissim$dissimilarity, -1) - tail(hc.dissim$dissimilarity, -1)
head(hc.dissim.dif, 10)

# construct the clusters with k=7
h.clusters <- cutree(hclust.mod, 7)
h.clusters

# The *centroid* for a cluster is the mean value of all points in the cluster: 
aggregate(airline.scaled, by=list(h.clusters), mean) # Compute centroids

# *size* of each cluster
table(h.clusters)

# Cross-tabulate the two assignments; many zeros mean the clusters
# from kmeans and hierarchical clustering largely "match up"
table(h.clusters, km.clusters)

Visualization

We can visualize the clusters using fviz_cluster.

# install.packages("factoextra"), if not installed
library(factoextra)
# k-means
fviz_cluster(km, data=airline.scaled, geom = "point", alpha=0.4)
# hclust
fviz_cluster(list(data = airline.scaled, cluster = h.clusters), geom="point", alpha=0.4)
[Figures: cluster plots for K-means (k=8) and hierarchical clustering (k=7)]


4. Potential targeted marketing options

K-Means Clusters

| Original variables | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|
| Balance | 61,201 | 57,207 | 127,761 | 91,719 | 31,165 | 168,964 | 566,040 | 168,897 |
| BonusMiles | 19,073 | 7,565 | 58,156 | 16,360 | 2,308 | 44,062 | 52,696 | 46,301 |
| BonusTrans | 17 | 8 | 21 | 17 | 3 | 33 | 19 | 43 |
| FlightMiles | 118 | 147 | 333 | 2,763 | 114 | 5,851 | 1,133 | 14,244 |
| FlightTrans | 0 | 0 | 1 | 8 | 0 | 18 | 4 | 33 |
| DaysSinceEnroll | 2,923 | 6,074 | 5,484 | 3,964 | 2,300 | 5,157 | 6,312 | 3,446 |
| Cluster size | 893 | 1,124 | 504 | 212 | 1,107 | 69 | 76 | 14 |

Option 1: Dormant customer

Clusters 2 and 5 are low-activity customers.
→ Provide promotional one-time events to incentivize new purchases.

Option 2: Point seeker

Customers in Clusters 1 and 3 focus on bonus transactions.
→ Provide targeted bonuses for flying and special offers on bonus transactions.

Option 3: Old guard

Clusters 6 and 7 are long-standing customers with moderate spending.
→ Provide thank-you gifts or special offers for their loyalty.

Option 4: New oil

Clusters 4 and 8 are recent customers with very high spending; retaining them should be a priority.
→ Provide bonus miles, perks, etc.
