In this post, I examine credit risk data and create a classification model using the K-Nearest Neighbor Algorithm (K-NN) to predict credit risk rating.
In this exercise, the client had several new loan applicants and needed to assess their credit risk. They provided historical data that included customers’ purchases, financial and demographic information, their stated loan purpose on the application, and the assigned credit risk. The dataset is not big, with only 425 records, but it was still useful for assessing the credit risk of the new applicants. Although more data would likely yield better results, the analysis was carried out to answer the main question:
Question: What level of credit risk does each new applicant present? Can you classify them as High or Low?
Answer: This question was answered in the visualization below!
Please note that each finding is explained in more detail below. I also included my methodology and some notes on the classification technique towards the end.
Methodology
- Convert categorical string values into numerical values.
- Take a sample of 200 data points.
- Partition the sample into training data (60%) and validation data (40%).
- The split ratio is relatively subjective here.
- The test data is the new data.
- Here is a really cool article About Train, Validation and Test Sets in Machine Learning.
- Apply the K-Nearest Neighbors algorithm to the data.
- Normalize the data (my predictors had different scales).
- Choose the best K parameter (the value with the smallest classification error on both the training and the validation data sets).
- Here is a great post on Optimal Tuning Parameters; it even includes very helpful Python code.
- Run the code.
- Examine your newly classified data and visualize it.
- All the resulting lift charts are included at the end of this post.
- “A lift chart graphically represents the improvement that a mining model provides when compared against a random guess, and measures the change in terms of a lift score.” For more on Lift Charts, read this.
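The analysis in this post was run in XLMiner, so the Python below is only a minimal sketch of the steps above, not the code behind the results. It assumes synthetic records (the column names, value ranges and the rule that generates the risk label are all made up for illustration), min-max scaling as the normalization, Euclidean distance, and odd values of K to avoid tied votes. It also picks K on validation error alone, since training error at K = 1 is always zero.

```python
import math
import random
from collections import Counter

def encode_categories(values):
    """Map each distinct string to an integer code (sorted for determinism)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def min_max_fit(rows):
    """Learn per-column min and max from the training rows only."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def min_max_apply(rows, lo, hi):
    """Rescale columns using the training min/max; constant columns map to 0."""
    return [[(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(r, lo, hi)]
            for r in rows]

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train_X, train_y, X, y, k):
    """Share of rows in (X, y) that the K-NN vote gets wrong."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != label for x, label in zip(X, y))
    return wrong / len(y)

# Hypothetical stand-in for the 425 historical records (NOT the client's data).
random.seed(0)
records = []
for _ in range(425):
    purpose = random.choice(["car", "education", "furniture"])
    income = random.uniform(20_000, 90_000)
    balance = random.uniform(0, 5_000)
    risk = "Low" if balance + income / 20 > 5_000 else "High"  # made-up rule
    records.append((purpose, income, balance, risk))

# Step 1: convert the categorical string column into numbers.
purpose_codes, _ = encode_categories([r[0] for r in records])
X = [[float(c), r[1], r[2]] for c, r in zip(purpose_codes, records)]
y = [r[3] for r in records]

# Steps 2-3: sample 200 points, then split 60% training / 40% validation.
sample_idx = random.sample(range(len(X)), 200)
train_idx, valid_idx = sample_idx[:120], sample_idx[120:]

# Step 4: normalize, using statistics from the training rows only.
lo, hi = min_max_fit([X[i] for i in train_idx])
train_X = min_max_apply([X[i] for i in train_idx], lo, hi)
valid_X = min_max_apply([X[i] for i in valid_idx], lo, hi)
train_y = [y[i] for i in train_idx]
valid_y = [y[i] for i in valid_idx]

# Step 5: try several odd K values and keep the one with the lowest validation error.
errors = {k: error_rate(train_X, train_y, valid_X, valid_y, k) for k in range(1, 20, 2)}
best_k = min(errors, key=errors.get)
print(f"best K = {best_k}, validation error = {errors[best_k]:.3f}")
```

Integer codes follow the "convert to numerical values" step literally, but they impose an artificial order on categories such as loan purpose; one-hot columns are a common alternative when that ordering distorts distances.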
Findings
Before I conducted my analysis on the new data, I wanted to understand the historical data and see whether any relationships stood out. The visualization below shows the credit risk rating across several categories.
- Overall credit risk distribution was relatively even, although slightly more people had a low risk rating.
- Age: it wasn’t a particularly strong indicator of credit risk, even when segmented across various loan purposes.
- Gender: Women with a low risk rating made up only 13.41% of the historical records, compared with 36.94% for men. However, women were also less represented in the high risk category, at 18.25% versus 31.29% for men. Because these four shares add up to roughly 100%, they mostly reflect that women account for fewer loan records than men, not that women were automatically assigned a high risk rating.
- Marital Status: On its own, marital status also did not have an effect on the credit risk rating.
- Job Type and Income: The last part examines the amount of money in applicants’ checking and savings accounts in conjunction with their job type. Even then, no clear distinction emerges.
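The gender percentages quoted above can be re-expressed as rates within each gender. The short calculation below is a sketch that assumes the four figures are shares of all historical records (they sum to about 99.89%, presumably because of rounding); if they were computed differently, the derived rates would not apply.

```python
# Shares of all historical records, as quoted in the Gender finding.
shares = {
    ("Women", "Low"): 13.41,
    ("Women", "High"): 18.25,
    ("Men", "Low"): 36.94,
    ("Men", "High"): 31.29,
}

total = sum(shares.values())  # about 99.89, so close to 100% of records

within = {}
for gender in ("Women", "Men"):
    group = shares[(gender, "Low")] + shares[(gender, "High")]  # gender's share of records
    low_rate = shares[(gender, "Low")] / group                  # Low-risk rate inside that gender
    within[gender] = (group, low_rate)
    print(f"{gender}: {group:.2f}% of records, {low_rate:.1%} of them rated Low")
```

Under that assumption, women make up about 31.66% of the records and men about 68.23%, with roughly 42.4% of women and 54.1% of men rated Low.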
The only way to truly find similarities is to apply a classification method: train a machine learning model on the historical data, validate it on a subset of that data, and then apply it to the new test data.
New Data Classification
The new data was classified using the same predictors from the historical data that the model learned from. A large share of the new applicants were classified as low credit risk, so they are likely to receive the loans they applied for.
The K-Nearest Neighbors algorithm is a powerful technique for classifying data, identifying patterns and facilitating decision making. I would highly recommend learning this technique because, when it comes to data, regression is not always the right tool. Learning how to classify data and understanding the underlying mechanisms will make it much easier to apply, regardless of whether you do so by clicking through a program or writing classification code.
Software and Lift Charts
The software used here is the Analytic Solver Platform for Education (XLMiner), a comprehensive data mining add-in for Excel. (Here is the online guide for how to use it.)
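For readers who want to see what sits behind a lift chart, here is a minimal Python sketch of the cumulative lift calculation. It is an illustration only and not XLMiner's implementation: the function name, the choice of "High" as the target class, and the tiny set of scores are all assumptions. For K-NN, the score could be the fraction of the K nearest neighbors that belong to the target class.

```python
def cumulative_lift(scores, actuals, positive="High"):
    """Rank cases by score (highest first) and return one row per cutoff:
    (cases taken, target cases captured so far, lift vs. random guessing)."""
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    total_pos = sum(1 for _, a in ranked if a == positive)
    if total_pos == 0:
        raise ValueError("need at least one case of the target class")
    base_rate = total_pos / len(ranked)  # hit rate of a random guess
    rows, hits = [], 0
    for n, (_, actual) in enumerate(ranked, start=1):
        hits += actual == positive
        rows.append((n, hits, (hits / n) / base_rate))
    return rows

# Made-up validation scores and true ratings, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = ["High", "High", "Low", "High", "Low", "Low", "High", "Low"]

rows = cumulative_lift(scores, actuals)
for n, hits, lift in rows:
    print(f"top {n}: {hits} High-risk cases captured, lift {lift:.2f}")
```

A lift above 1 at a given cutoff means the model's ranking finds more target cases than a random guess would in the same number of records; by construction the lift returns to 1 once every record is included.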
The information above is from the Graduate Certificate in Business Analytics: Descriptive Analytics course at Penn State University.