Commercial banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, I've built an automatic credit card approval predictor using machine learning techniques, just like the real banks do.
I've used the Credit Card Approval dataset from the UCI Machine Learning Repository.
The structure of this notebook is as follows:
- First, I started off by loading and viewing the dataset.
- The dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries.
- Then, I preprocessed the dataset to ensure the machine learning model we choose can make good predictions.
- After the data gets into good shape, I did some exploratory data analysis to build intuitions.
- Finally, I built a machine learning model that can predict if an individual's application for a credit card will be accepted.
Jump Straight to The Video Tutorial
-
First, loading and viewing the dataset. I found that since this data is confidential, the contributor of the dataset has anonymized the feature names.
-
Then I tried to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect the privacy, but this blog gives us a pretty good overview of the probable features. The probable features in a typical credit card application are
Gender
,Age
,Debt
,Married
,BankCustomer
,EducationLevel
,Ethnicity
,YearsEmployed
,PriorDefault
,Employed
,CreditScore
,DriversLicense
,Citizen
,ZipCode
,Income
and finally theApprovalStatus
. This gives a pretty good starting point, and we can map these features with respect to the columns in the output.
As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing.
- Then, I fixed some missing values for both numerical as well as non-numerical columns.
*There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. I divided these remaining preprocessing steps into three main tasks:
- Convert the non-numeric data into numeric.
- Split the data into train and test sets.
- Scale the feature values to a uniform range.
First, I converted all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.
- After successfully converting all the non-numeric values to numeric ones, now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.
Also, features like DriversLicense
and ZipCode
are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection.
-
The data is now split into two separate sets - train and test sets respectively.
-
Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.
This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.
Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).
-
We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.
Accuracy of logistic regression classifier: 0.8377192982456141 Confusion matrix: [[93 10] [27 98]]