1: Introduction
In this course, we will walk through the full data science life cycle, from data cleaning and feature selection to machine learning. We will focus on credit modelling, a well known data science problem that focuses on modeling a borrower's credit risk. Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace here.
Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using historical data (and their own data science process!) and assigns an interest rate to the borrower. The interest rate is the percent in addition to the requested loan amount the borrower has to pay back. You can read more about the interest rate that Lending Club assigns here. Lending Club also tries to verify each piece of information the borrower provides, but it can't always verify all of the information (usually for regulatory reasons).
A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a grade according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.
Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.
The borrower then makes monthly payments back to Lending Club over either 36 months or 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.
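To make the monthly payment concrete, here is a minimal sketch that assumes the standard fixed-rate amortization formula. The source does not state how Lending Club computes installments, so the formula and the helper name monthly_installment are assumptions for illustration. The inputs (5000 borrowed at 10.65% over 36 months) are the sample loan values that appear in the column tables later in this mission.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment under the standard amortization formula.

    Assumption: this is a textbook formula, not a documented
    Lending Club calculation.
    """
    monthly_rate = annual_rate_pct / 100 / 12
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_months)


# Sample loan: 5000 at 10.65% annual interest, repaid over 36 months.
payment = monthly_installment(5000, 10.65, 36)
total_interest = payment * 36 - 5000

print(round(payment, 2))
print(round(total_interest, 2))
```

Under these assumptions the payment comes out at about 162.87, which matches the installment value listed for that sample loan.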
Here's a diagram from Bible Money Matters that sums up the process:
While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors would put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay.
Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this course, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.
2: Introduction To The Data
Lending Club releases data for all of the approved and declined loan applications periodically on their website. You can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.
Towards the bottom of the page, you'll also find a data dictionary (in XLS format) which contains information on the different column names. We recommend downloading the data dictionary so you can refer to it whenever you want to learn more about what a column represents in the datasets. Here's a link to the data dictionary file hosted on Google Drive.
Before diving into the datasets themselves, let's get familiar with the data dictionary. The LoanStats sheet describes the approved loans datasets and the RejectStats describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on data on approved loans only.
The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:
- Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?
Before we can start doing machine learning, we need to define what features we want to use and which column represents the target column we want to predict. Let's start by reading in the dataset and exploring it.
3: Reading In To Pandas
In this mission, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.
To ensure that code runs fast on our platform, we reduced the size of LoanStats3a.csv by:
- removing the first line:
  - because it contains the extraneous text Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action) instead of the column titles, which prevents the dataset from being parsed by the pandas library properly
- removing the desc column:
  - which contains a long text explanation for each loan
- removing the url column:
  - which contains a link to each loan on Lending Club which can only be accessed with an investor account
- removing all columns containing more than 50% missing values:
  - which allows us to move faster since we can spend less time trying to fill these values
The following code replicates this process, if you want to replicate the dataset to work with it on your own:
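import pandas as pd
# skiprows=1 skips the extraneous first line so the header row is parsed as column names
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
# Keep only the columns that have at least half of their values present
half_count = len(loans_2007) // 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)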
We named the filtered dataset loans_2007.csv instead in case we want to explore the raw dataset (LoanStats3a.csv) without mixing up the two. First things first, let's read the dataset into a Dataframe so we can start to explore the data and the remaining features.
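The effect of skiprows and the thresh parameter is easier to see on a tiny example. The sketch below uses a made-up four-row CSV (the values and the mostly_missing column are hypothetical, not taken from LoanStats3a.csv) and applies the same steps as the code above.

```python
import io

import pandas as pd

# Hypothetical miniature of the raw file: a junk first line, then the header.
raw_text = (
    "Notes offered by Prospectus\n"
    "id,loan_amnt,desc,mostly_missing\n"
    "1,5000,text a,\n"
    "2,2500,text b,\n"
    "3,2400,text c,7.0\n"
    "4,10000,text d,\n"
)

# skiprows=1 discards the junk line so the second line becomes the header.
toy = pd.read_csv(io.StringIO(raw_text), skiprows=1)

# thresh is the minimum number of non-null values a column needs to survive.
half_count = len(toy) // 2
toy = toy.dropna(thresh=half_count, axis=1)

# Drop the free-text column, as was done for desc in the real dataset.
toy = toy.drop(["desc"], axis=1)

print(list(toy.columns))
```

Here mostly_missing has only one non-null value out of four, which is below the threshold of two, so it is removed along with desc.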
Instructions
- Read loans_2007.csv into a DataFrame named loans_2007.
- Use the print function to display:
  - the first row of loans_2007, and
  - the number of columns in loans_2007.
import pandas as pd
loans_2007 = pd.read_csv("loans_2007.csv")
# drop_duplicates returns a new Dataframe, so assign the result back
loans_2007 = loans_2007.drop_duplicates()
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
4: First Group Of Columns
The Dataframe contains many columns and can be cumbersome to try to explore all at once. Let's break up the columns into 3 groups of 18 columns and use the data dictionary to become familiar with what each column represents. As you understand each feature, you want to pay attention to any features that:
- leak information from the future (after the loan has already been funded)
- don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
- are formatted poorly and need to be cleaned up
- require more data or a lot of processing to turn into a useful feature
- contain redundant information
We need to especially pay attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans. We encourage you to spend as much time as you need to understand each column, because a poor understanding could cause you to make mistakes in the data analysis and modeling process. As you go through the dictionary, keep in mind that we need to select one of the columns as the target column we want to use when we move on to the machine learning phase.
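As a rough illustration of why leakage is dangerous, the sketch below uses made-up toy rows (not real Lending Club records). The column names follow the dataset, but the values are invented. It shows that a column recorded after funding, here total_rec_prncp, can reveal the outcome on its own.

```python
import pandas as pd

# Made-up toy rows for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000],
    "total_rec_prncp": [5000.0, 456.46, 2400.0, 3100.0],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Charged Off"],
})

# A "model" that only checks whether all principal has been received.
guess_paid = toy["total_rec_prncp"] >= toy["loan_amnt"]
actually_paid = toy["loan_status"] == "Fully Paid"

# Fraction of rows where the leaky rule agrees with the true outcome.
print((guess_paid == actually_paid).mean())
```

The rule is right on every toy row, but only because it reads information that does not exist when an investor is deciding whether to fund the loan. A model trained on such a column would look excellent in testing and be useless on new loans.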
In this screen and the next few screens, let's focus on just columns that we need to remove from consideration. Then, we can circle back and further dissect the columns we decided to keep.
To make this process easier, we created a table that contains the name, data type, first row's value, and description from the data dictionary for the first 18 rows.
name | dtype | first value | description |
---|---|---|---|
id | object | 1077501 | A unique LC assigned ID for the loan listing. |
member_id | float64 | 1.2966e+06 | A unique LC assigned Id for the borrower member. |
loan_amnt | float64 | 5000 | The listed amount of the loan applied for by the borrower. |
funded_amnt | float64 | 5000 | The total amount committed to that loan at that point in time. |
funded_amnt_inv | float64 | 49750 | The total amount committed by investors for that loan at that point in time. |
term | object | 36 months | The number of payments on the loan. Values are in months and can be either 36 or 60. |
int_rate | object | 10.65% | Interest Rate on the loan |
installment | float64 | 162.87 | The monthly payment owed by the borrower if the loan originates. |
grade | object | B | LC assigned loan grade |
sub_grade | object | B2 | LC assigned loan subgrade |
emp_title | object | NaN | The job title supplied by the Borrower when applying for the loan. |
emp_length | object | 10+ years | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
home_ownership | object | RENT | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER. |
annual_inc | float64 | 24000 | The self-reported annual income provided by the borrower during registration. |
verification_status | object | Verified | Indicates if income was verified by LC, not verified, or if the income source was verified |
issue_d | object | Dec-2011 | The month which the loan was funded |
loan_status | object | Charged Off | Current status of the loan |
pymnt_plan | object | n | Indicates if a payment plan has been put in place for the loan |
purpose | object | car | A category provided by the borrower for the loan request. |
After analyzing each column, we can conclude that the following features need to be removed:
- id: randomly generated field by Lending Club for unique identification purposes only
- member_id: also a randomly generated field by Lending Club for unique identification purposes only
- funded_amnt: leaks data from the future (after the loan has already started to be funded)
- funded_amnt_inv: also leaks data from the future (after the loan has already started to be funded)
- grade: contains redundant information as the interest rate column (int_rate)
- sub_grade: also contains redundant information as the interest rate column (int_rate)
- emp_title: requires other data and a lot of processing to potentially be useful
- issue_d: leaks data from the future (after the loan is already completely funded)
Recall that Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.
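Note that int_rate is stored as text such as 10.65%, so it has to be converted before a model can treat it as a continuous value. This mission does not perform that cleaning. The following is only a sketch of one possible approach, on toy values in the same format.

```python
import pandas as pd

# Toy values in the same text format as the int_rate column.
int_rate = pd.Series(["10.65%", "15.27%", "7.90%"])

# Strip the trailing percent sign, then convert the text to floats.
int_rate_numeric = int_rate.str.rstrip("%").astype(float)

print(int_rate_numeric.tolist())
```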
Let's now drop these columns from the Dataframe before moving onto the next group of columns.
Instructions
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
id
member_id
funded_amnt
funded_amnt_inv
grade
sub_grade
emp_title
issue_d
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
5: Second Group Of Features
Let's now look at the next 18 columns:
name | dtype | first value | description |
---|---|---|---|
title | object | Computer | The loan title provided by the borrower |
zip_code | object | 860xx | The first 3 numbers of the zip code provided by the borrower in the loan application. |
addr_state | object | AZ | The state provided by the borrower in the loan application |
dti | float64 | 27.65 | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
delinq_2yrs | float64 | 0 | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years |
earliest_cr_line | object | Jan-1985 | The month the borrower's earliest reported credit line was opened |
inq_last_6mths | float64 | 1 | The number of inquiries in past 6 months (excluding auto and mortgage inquiries) |
open_acc | float64 | 3 | The number of open credit lines in the borrower's credit file. |
pub_rec | float64 | 0 | Number of derogatory public records |
revol_bal | float64 | 13648 | Total credit revolving balance |
revol_util | object | 83.7% | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
total_acc | float64 | 9 | The total number of credit lines currently in the borrower's credit file |
initial_list_status | object | f | The initial listing status of the loan. Possible values are: W, F |
out_prncp | float64 | 0 | Remaining outstanding principal for total amount funded |
out_prncp_inv | float64 | 0 | Remaining outstanding principal for portion of total amount funded by investors |
total_pymnt | float64 | 5863.16 | Payments received to date for total amount funded |
total_pymnt_inv | float64 | 5833.84 | Payments received to date for portion of total amount funded by investors |
total_rec_prncp | float64 | 5000 | Principal received to date |
Within this group of columns, we need to drop the following columns:
- zip_code: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which can only be used to identify the state the borrower lives in)
- out_prncp: leaks data from the future (after the loan has already started to be paid off)
- out_prncp_inv: also leaks data from the future (after the loan has already started to be paid off)
- total_pymnt: also leaks data from the future (after the loan has already started to be paid off)
- total_pymnt_inv: also leaks data from the future (after the loan has already started to be paid off)
- total_rec_prncp: also leaks data from the future (after the loan has already started to be paid off)
The out_prncp and out_prncp_inv both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns as well as the total_pymnt column describe properties of the loan after it's fully funded and started to be paid off. This information isn't available to an investor before the loan is fully funded and we don't want to include it in our model.
Let's go ahead and remove these columns from the Dataframe.
Instructions
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
zip_code
out_prncp
out_prncp_inv
total_pymnt
total_pymnt_inv
total_rec_prncp
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
6: Third Group Of Features
Let's now move on to the last group of features:
name | dtype | first value | description |
---|---|---|---|
total_rec_int | float64 | 863.16 | Interest received to date |
total_rec_late_fee | float64 | 0 | Late fees received to date |
recoveries | float64 | 0 | post charge off gross recovery |
collection_recovery_fee | float64 | 0 | post charge off collection fee |
last_pymnt_d | object | Jan-2015 | Last month payment was received |
last_pymnt_amnt | float64 | 171.62 | Last total payment amount received |
last_credit_pull_d | object | Jun-2016 | The most recent month LC pulled credit for this loan |
collections_12_mths_ex_med | float64 | 0 | Number of collections in 12 months excluding medical collections |
policy_code | float64 | 1 | publicly available policy_code=1 new products not publicly available policy_code=2 |
application_type | object | INDIVIDUAL | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
acc_now_delinq | float64 | 0 | The number of accounts on which the borrower is now delinquent. |
chargeoff_within_12_mths | float64 | 0 | Number of charge-offs within 12 months |
delinq_amnt | float64 | 0 | The past-due amount owed for the accounts on which the borrower is now delinquent. |
pub_rec_bankruptcies | float64 | 0 | Number of public record bankruptcies |
tax_liens | float64 | 0 | Number of tax liens |
In the last group of columns, we need to drop the following columns:
- total_rec_int: leaks data from the future (after the loan has already started to be paid off)
- total_rec_late_fee: also leaks data from the future (after the loan has already started to be paid off)
- recoveries: also leaks data from the future (after the loan has already started to be paid off)
- collection_recovery_fee: also leaks data from the future (after the loan has already started to be paid off)
- last_pymnt_d: also leaks data from the future (after the loan has already started to be paid off)
- last_pymnt_amnt: also leaks data from the future (after the loan has already started to be paid off)
All of these columns leak data from the future, meaning that they're describing aspects of the loan after it's already been fully funded and started to be paid off by the borrower.
Instructions
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
total_rec_int
total_rec_late_fee
recoveries
collection_recovery_fee
last_pymnt_d
last_pymnt_amnt
Use the print function to display:
- the first row of loans_2007, and
- the number of columns in loans_2007.
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
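As a quick sanity check on the bookkeeping, the three drop lists from the last three screens can be tallied. This is only a sketch: the list names first_group, second_group and third_group are introduced here for illustration. Subtracting the tally from the starting column count should match the number printed above.

```python
# Columns removed in each of the three groups, as listed in the text.
first_group = ["id", "member_id", "funded_amnt", "funded_amnt_inv",
               "grade", "sub_grade", "emp_title", "issue_d"]
second_group = ["zip_code", "out_prncp", "out_prncp_inv",
                "total_pymnt", "total_pymnt_inv", "total_rec_prncp"]
third_group = ["total_rec_int", "total_rec_late_fee", "recoveries",
               "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"]

all_dropped = first_group + second_group + third_group

# No column should appear in more than one drop list.
assert len(all_dropped) == len(set(all_dropped))

print(len(all_dropped))
```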
7: Target Column
Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32. We now need to decide on a target column that we want to use for modeling.
We should use the loan_status column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values and we need to convert it to a numerical one for training a model. Let's explore the different values in this column and come up with a strategy for converting them.
Instructions
- Use the Dataframe method value_counts to return the frequency of the unique values in the loan_status column.
- Display the frequency of each unique value using the print function.
print(loans_2007['loan_status'].value_counts())
8: Binary Classification
There are 9 different possible values for the loan_status column. You can read about most of the different loan statuses on the Lending Club website. The 2 values that start with "Does not meet the credit policy" aren't explained, unfortunately. A quick Google search takes us to explanations from the lending community here and here.
We've compiled the explanation for each status as well as the counts in the Dataframe in the following table:
Loan Status | Count | Meaning |
---|---|---|
Fully Paid | 33136 | Loan has been fully paid off. |
Charged Off | 5634 | Loan for which there is no longer a reasonable expectation of further payments. |
Does not meet the credit policy. Status:Fully Paid | 1988 | While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace. |
Does not meet the credit policy. Status:Charged Off | 761 | While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace. |
In Grace Period | 20 | The loan is past due but still in the grace period of 15 days. |
Late (16-30 days) | 8 | Loan hasn't been paid in 16 to 30 days (late on the current payment). |
Late (31-120 days) | 24 | Loan hasn't been paid in 31 to 120 days (late on the current payment). |
Current | 961 | Loan is up to date on current payments. |
Default | 3 | Loan is defaulted on and no payment has been made for more than 121 days. |
From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing and where the jury is still out on if the borrower will pay back the loan on time or not. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance. You can read about the difference here.
Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case. While there are a few different ways to transform all of the values in a column, we'll use the Dataframe method replace. According to the documentation, we can pass the replace method a nested mapping dictionary in the following format:
mapping_dict = {
    "date": {
        "january": 1,
        "february": 2,
        "march": 3
    }
}
df = df.replace(mapping_dict)
Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. This class imbalance is a common problem in binary classification. During training, the model can end up with a strong bias towards predicting the class with more observations in the training set, and will rarely predict the class with fewer observations. The stronger the imbalance, the more biased the model becomes. There are a few different ways to tackle this class imbalance, which we'll explore later.
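To put a number on the imbalance, the short sketch below uses only the two counts quoted above. The variable names are introduced here for illustration.

```python
# Counts taken from the loan status table.
fully_paid = 33136
charged_off = 5634

total = fully_paid + charged_off
share_paid = fully_paid / total

# Share of the positive class, and how many paid loans exist per charged-off loan.
print(round(share_paid, 3))
print(round(fully_paid / charged_off, 2))
```

Roughly 85% of the remaining loans are fully paid, so a model that always predicts 1 would already be right about 85% of the time without learning anything. This is why plain accuracy can be a misleading metric on this dataset.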
Instructions
- Remove all rows from loans_2007 that contain values other than Fully Paid or Charged Off for the loan_status column.
- Use the Dataframe method replace to replace:
  - Fully Paid with 1
  - Charged Off with 0
# Keep only the loans whose outcome is final
loans_2007 = loans_2007[(loans_2007['loan_status'] == "Fully Paid") | (loans_2007['loan_status'] == "Charged Off")]
# Nested mapping: column name -> {old value: new value}
status_replace = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)
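The same filter-and-encode step can also be written with isin and map. This is an alternative sketch on made-up toy rows, not the approach this mission uses. The name status_codes is hypothetical.

```python
import pandas as pd

# Made-up toy rows for illustration only.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Current", "Charged Off", "Fully Paid"],
    "loan_amnt": [5000, 3000, 2500, 2400],
})

status_codes = {"Fully Paid": 1, "Charged Off": 0}

# Keep only rows whose status is one of the two final outcomes.
toy = toy[toy["loan_status"].isin(list(status_codes))].copy()

# Translate the remaining text labels into 1 and 0.
toy["loan_status"] = toy["loan_status"].map(status_codes)

print(toy["loan_status"].tolist())
```

Using map on a single column keeps the change local to loan_status, whereas replace with a nested dictionary scans the Dataframe for the named column. Both give the same result here.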
9: Removing Single Value Columns
To wrap up this mission, let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore further in the next mission.
We'll need to compute the number of unique values in each column and drop the columns that contain only one unique value. While the Series method unique returns the unique values in a column, it also counts the Pandas missing value object nan as a value:
# Returns 0 and nan.
unique_values = loans_2007['tax_liens'].unique()
Since we're trying to find columns that contain one true unique value, we should first drop the null values then compute the number of unique values:
non_null = loans_2007['tax_liens'].dropna()
unique_non_null = non_null.unique()
num_true_unique = len(unique_non_null)
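The difference is easy to confirm on a small Series. The values below are a toy stand-in for the tax_liens column, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a column that holds a single real value plus missing entries.
tax_liens = pd.Series([0.0, np.nan, 0.0, np.nan])

# unique() treats nan as a value of its own.
with_nan = len(tax_liens.unique())

# Dropping nulls first leaves only the true unique values.
without_nan = len(tax_liens.dropna().unique())

print(with_nan, without_nan)
```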
Instructions
- Remove any columns from loans_2007 that contain only one unique value:
  - Create an empty list, drop_columns, to keep track of which columns you want to drop
  - For each column:
    - Use the Series method dropna to remove any null values and then use the Series method unique to return the set of non-null unique values
    - Use the len() function to return the number of values in that set
    - Append the column to drop_columns if it contains only 1 unique value
  - Use the Dataframe method drop to remove the columns in drop_columns from loans_2007
- Use the print function to display drop_columns so we know which ones were removed
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    # Unique values in the column, ignoring nulls
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
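The loop above can also be expressed with the Dataframe method nunique, which skips null values by default. The sketch below is an alternative, shown on a made-up toy Dataframe. The name single_value_columns is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy rows for illustration only.
toy = pd.DataFrame({
    "policy_code": [1.0, 1.0, 1.0],
    "tax_liens": [0.0, np.nan, 0.0],
    "loan_amnt": [5000.0, 2500.0, 2400.0],
})

# nunique() counts distinct non-null values per column.
unique_counts = toy.nunique()
single_value_columns = unique_counts[unique_counts == 1].index.tolist()

toy = toy.drop(single_value_columns, axis=1)

print(single_value_columns)
```

As with the loop, a column made up entirely of nulls has zero unique values and would not be selected by this check.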
10: Next Steps
It looks like we were able to remove 9 more columns since they only contained 1 unique value.
In this mission, we started to become familiar with the columns in the dataset and removed many columns that aren't useful for modeling. We also selected our target column and decided to focus our modeling efforts on binary classification. In the next mission, we'll explore the individual features in greater depth and work towards training our first machine learning model.