Identifying Customer Segments for Mail-Order Sales Company with PCA and KMeans Clustering

In this project, I explore data from a mail-order sales company and use unsupervised machine learning techniques to help them identify the segments of the population where direct marketing campaigns would have the highest rate of return. You can find the full code here and below.


Here’s a quick overview of the data and analysis that took place.

Data

The project included three main data components:

  1. Data on the general population. 
  2. Data on the company’s customers.
  3. A features dictionary, which listed the codes used for missing or unknown values in each column.

Analysis

This project primarily applied unsupervised machine learning algorithms to solve the problem. I followed a methodical process for the analysis:

  • Clean and Analyze the General Population data:
    • Clean and pre-process general population data.
      • Using the Pandas Library.
    • Create a cleaning function to reuse later on the customers’ data (a sketch of this function appears after this list).
      • Note: This approach works here because the customers’ data set has a similar structure to the general population data set.
    • Reduce dimensionality with PCA on the general population data.
      • Using the Scikit-Learn Library.
    • Apply KMeans clustering to general population data.
      • Using the Scikit-Learn Library.
  • Clean and Analyze the Customers’ data:
    • Clean the customers’ data set with the cleaning function.
    • Reduce dimensionality with PCA on the customers’ data.
    • Apply KMeans clustering to customers’ data.
  • Compare results and derive insights.
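
As referenced in the list above, the cleaning step was wrapped in a function so it could be reused on the customers’ data. Here is a minimal sketch of what such a function might look like, assuming a hypothetical feat_info DataFrame with an 'attribute' column and a 'missing_or_unknown' column holding the list of codes per feature; the 30% drop threshold is also an assumption for illustration:

```python
import numpy as np
import pandas as pd

def clean_data(df, feat_info):
    """Replace coded missing/unknown values with NaN and drop sparse columns.

    df        : raw demographics data (general population or customers)
    feat_info : hypothetical feature dictionary with columns
                'attribute' and 'missing_or_unknown' (list of codes)
    """
    df = df.copy()

    # Convert the documented "missing or unknown" codes to NaN, column by column.
    for _, row in feat_info.iterrows():
        col = row['attribute']
        if col in df.columns:
            df[col] = df[col].replace(row['missing_or_unknown'], np.nan)

    # Drop columns with a large share of missing values (threshold is an assumption).
    missing_share = df.isnull().mean()
    df = df.drop(columns=missing_share[missing_share > 0.3].index)

    return df
```

Because the customers’ data shares the same structure, the exact same function can be applied to both data sets, which keeps the two cleaning passes consistent.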

First, Dimensionality Reduction with PCA on the General Population Data.

“PCA finds a new set of dimensions (or a set of basis of views) such that all the dimensions are orthogonal (and hence linearly independent) and ranked according to the variance of data along them. It means the more important principal axis occurs first. (more important = more variance/more spread out data)”

Read more in this excellent article: Understanding Principal Component Analysis

After normalizing and scaling the data properly, I applied Principal Component Analysis to understand the ratio of variance explained by each principal component, as well as the cumulative variance explained. Based on the results shown in the graph below, I chose 28 components, since capturing roughly 85% of the cumulative variance seemed sufficient to retain the majority of the variability in the data without keeping redundant components.
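
As a rough illustration of this step, here is a minimal sketch, assuming the cleaned, imputed general population data lives in a DataFrame called azdias_clean (a hypothetical name); it scales the features, fits PCA, and plots the per-component and cumulative explained variance ratios:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the cleaned data so each feature contributes comparably to the components.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(azdias_clean)  # azdias_clean: hypothetical cleaned DataFrame

# Fit PCA with all components first, just to inspect the explained variance.
pca = PCA()
pca.fit(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)

# Scree plot: per-component and cumulative explained variance.
components = range(1, len(cum_var) + 1)
plt.bar(components, pca.explained_variance_ratio_, alpha=0.5, label='per component')
plt.step(components, cum_var, where='mid', label='cumulative')
plt.axhline(0.85, color='red', linestyle='--', label='85% threshold')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()

# Number of components needed to reach ~85% of the variance.
print(np.argmax(cum_var >= 0.85) + 1)
```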


Then, KMeans Clustering on the General Population Data.

After refitting the PCA with the chosen number of components (28) and transforming the data accordingly, I fit the data with the KMeans clustering algorithm and graphed the average within-cluster distance from each point to its assigned cluster’s centroid in order to decide how many clusters to keep.
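
The scoring loop behind that graph could look like the sketch below, assuming X_pca is a hypothetical array holding the general population data projected onto the 28 retained components; it uses KMeans inertia divided by the number of points as the average within-cluster measure:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X_pca: hypothetical array of the PCA-transformed general population data.
avg_distances = []
k_values = range(1, 21)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_pca)
    # inertia_ is the sum of squared distances to the closest centroid;
    # dividing by the number of points gives an average per-point value.
    avg_distances.append(kmeans.inertia_ / X_pca.shape[0])

plt.plot(k_values, avg_distances, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average within-cluster squared distance')
plt.show()
```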

Line Chart explaining the relationship between K (number of centroids) and SSE (Sum of Squared Errors)

If the line chart looks like an arm, then the “elbow” on the arm is the value of k that is the best. The idea is that we want a small SSE, but that the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).

Read more on how to select the best K: Introduction to K-means Clustering
Another visual way to see the clustering on the data set.

I chose 15 clusters because it seemed like a representative number, and in the graph above we can see that 15 clusters corresponds to a fairly small SSE. However, in these situations it is always helpful to ask a professional with domain expertise for their opinion: 15 clusters might be just right, but their advice can help us adjust the model for the best results.


Once these parameters were set, I applied the same pipeline to the customers’ data set and, finally, compared the results.
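
That final step could be sketched as follows, assuming scaler, pca (refit with 28 components), and kmeans (with 15 clusters) were all fitted on the general population data, and that customers_clean is the hypothetical result of running the customers’ data through the same cleaning function:

```python
import pandas as pd

# Apply the pipeline fitted on the general population to the customers' data.
customers_scaled = scaler.transform(customers_clean)
customers_pca = pca.transform(customers_scaled)
customer_clusters = kmeans.predict(customers_pca)

# Cluster assignments for the general population, using the already-fitted model.
population_clusters = kmeans.predict(X_pca)

# Compare the share of each cluster in the two data sets.
comparison = pd.DataFrame({
    'general_population': pd.Series(population_clusters).value_counts(normalize=True).sort_index(),
    'customers': pd.Series(customer_clusters).value_counts(normalize=True).sort_index(),
})
comparison['difference'] = comparison['customers'] - comparison['general_population']
print(comparison)
```

Clusters with a large positive difference are the over-represented segments that the comparison in the next paragraph is based on.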

In the analysis above, we can clearly see that some clusters are represented far more heavily in the customers’ data set than in the general population data. For example, clusters 1, 2, and 8 are significantly more represented than the others. We can also see that the company gravitates strongly towards people who are identified as rational, traditional, combative (aggressive), and who are investors; those categories are overrepresented in the customers’ data compared to the general population. The analysis also shows that young, dreamful people who are financial minimalists with few investments have little representation in the customers’ data.


Overall, this analysis was both exciting and challenging. The challenging part was not having direct contact with industry experts to determine the best number of clusters. Yet it was an incredibly rewarding project, because the analysis produced actionable results that will hopefully increase the ROI of the mail-order sales company.

Here is a link to the code in full page mode:

Identifying Customer Segments

The information above is from the Udacity Data Scientist Nanodegree.