How to Prepare a Dataset for Machine Learning
Cleaning and preparing data is a critical first step in any machine learning project. In this blog post, Dataquest student Daniel Osei takes us through examining a dataset, selecting columns for features, exploring the data visually, and then encoding the features for machine learning.
This post is based on a Dataquest ‘Monthly Challenge’, where our students are given a free-form task to complete.
After first reading about machine learning on Quora in 2015, Daniel became excited at the prospect of an area that could combine his love of mathematics and programming. After reading this article on how to learn data science, Daniel started following the steps, eventually joining Dataquest to learn data science with us in April 2016.
We’d like to thank Daniel for his hard work and for generously letting us publish this post. This walkthrough uses Python 3.5 and Jupyter Notebook.
Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower’s credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower.
The loan is then listed on the Lending Club marketplace. You can read more about their marketplace here.
Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower’s credit score, the purpose for the loan, and other information from the application.
Once an investor decides to fund a loan, the borrower then makes monthly payments back to Lending Club. Lending Club redistributes these payments to the investors. This means that investors don’t have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren’t completely paid off on time, however, and some borrowers default on the loan.
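To make the repayment arithmetic concrete, here is a minimal sketch that assumes a standard fixed-rate amortization schedule. That schedule is an assumption, since Lending Club's exact calculation is not described in this post. The inputs are taken from the first sample loan displayed later on ($5,000 at 10.65% over 36 months), and the function name `monthly_installment` is hypothetical.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Fixed monthly payment under standard amortization (assumed schedule)."""
    r = annual_rate / 12  # monthly interest rate
    return principal * r / (1 - (1 + r) ** -n_months)


payment = monthly_installment(5000, 0.1065, 36)
total_repaid = payment * 36
interest_earned = total_repaid - 5000  # collected by investors in total, before any fees

print(round(payment, 2), round(interest_earned, 2))
```

Under these assumptions the payment lands close to the `installment` value recorded for that loan, which suggests the schedule is a reasonable approximation.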
Suppose an investor has approached us and asked us to build a machine learning model that can reliably predict whether a loan will be paid off. This investor describes themselves as a conservative investor who only wants to invest in loans that have a good chance of being paid off on time. This client therefore cares most about a model that does a good job of filtering out a high percentage of loan defaulters.
Our task is to construct a machine learning model that achieves a True Positive Rate of greater than 70% while maintaining a False Positive Rate of less than 7%.
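As a reminder of how these two rates are computed, here is a small sketch working from confusion-matrix counts. The counts are made up purely for illustration, and which outcome is treated as the "positive" class is an assumption, since this section does not fix that choice.

```python
def true_positive_rate(tp, fn):
    """Share of actual positives that the model correctly labels positive."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that the model wrongly labels positive."""
    return fp / (fp + tn)


# Hypothetical counts, for illustration only.
tp, fn, fp, tn = 720, 280, 60, 940

tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
meets_target = tpr > 0.70 and fpr < 0.07
print(tpr, fpr, meets_target)
```

Both conditions have to hold at once, so a model that raises the True Positive Rate by labeling nearly everything positive would fail the False Positive Rate limit.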
Lending Club periodically releases data for all the approved and declined loan applications on their website. You can select different year ranges to download the dataset (in CSV format) for both approved and declined loans.
You’ll also find a data dictionary (in XLS format), towards the bottom of the page, which contains information on the different column names. The data dictionary is useful to help understand what a column represents in the dataset.
The data dictionary contains two sheets:
We’ll be using the LoanStats sheet since we’re interested in the approved loans dataset.
The approved loans dataset contains information on current loans, completed loans, and defaulted loans. For this challenge, we’ll be working with approved loans data for the years 2007 to 2011.
First, let’s import some of the libraries that we’ll be using, and set some parameters to make the output easier to read.

    import pandas as pd
    import numpy as np

    # Show more columns and wider cells so wide tables are not truncated.
    # Fully qualified option names avoid ambiguous-key errors in newer pandas.
    pd.set_option('display.max_columns', 120)
    pd.set_option('display.max_colwidth', 5000)

    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    # Default figure size for all plots in the notebook.
    plt.rcParams['figure.figsize'] = (12, 8)

We’ve downloaded our dataset and named it `lending_club_loans.csv`, but now we need to load it into a pandas DataFrame to explore it.
To ensure that our code runs fast, we need to reduce the size of `lending_club_loans.csv` by doing the following:
We’ll also name the filtered dataset `loans_2007`, and at the end of this section we’ll save it as `loans_2007.csv` to keep it separate from the raw data. This is good practice: it ensures we still have the original data in case we need to go back and retrieve anything we’re removing.
Now, let’s go ahead and perform these steps:
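The individual reduction steps are not reproduced in this copy of the post, so the following is only a sketch of the general pattern described above: load the raw CSV, derive a smaller `loans_2007` frame, and save it under its own file name. The in-memory CSV, the `mostly_empty` column, and the rule of dropping columns that are less than half populated are all assumptions made for illustration.

```python
import io
import os
import tempfile

import pandas as pd

# Tiny stand-in for the raw file; the real lending_club_loans.csv is far larger.
raw_csv = io.StringIO(
    "id,loan_amnt,mostly_empty\n"
    "1,5000,\n"
    "2,2500,\n"
    "3,2400,7\n"
)
raw = pd.read_csv(raw_csv)

# Assumed filter: keep only columns with at least half of their rows populated.
min_non_null = (len(raw) + 1) // 2
loans_2007 = raw.dropna(thresh=min_non_null, axis=1)

# Save the reduced frame under a separate name so the raw data stays untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "loans_2007.csv")
    loans_2007.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(loans_2007.columns.tolist())
```

Writing to a separate file means any later mistake in the filtering can be undone by rerunning from the raw CSV.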
Let’s use the pandas `head()` method to display the first three rows of the `loans_2007` DataFrame, just to make sure we were able to load the dataset properly:

    loans_2007.head(3)

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | NaN | 10+ years | RENT | 24000.0 | Verified | Dec-2011 | Fully Paid | n | credit_card | Computer | 860xx | AZ | 27.65 | 0.0 | Jan-1985 | 735.0 | 739.0 | 1.0 | 3.0 | 0.0 | 13648.0 | 83.7% | 9.0 | f | 0.0 | 0.0 | 5863.155187 | 5833.84 | 5000.00 | 863.16 | 0.0 | 0.00 | 0.00 | Jan-2015 | 171.62 | Sep-2016 | 744.0 | 740.0 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | Ryder | < 1 year | RENT | 30000.0 | Source Verified | Dec-2011 | Charged Off | n | car | bike | 309xx | GA | 1.00 | 0.0 | Apr-1999 | 740.0 | 744.0 | 5.0 | 3.0 | 0.0 | 1687.0 | 9.4% | 4.0 | f | 0.0 | 0.0 | 1008.710000 | 1008.71 | 456.46 | 435.17 | 0.0 | 117.08 | 1.11 | Apr-2013 | 119.66 | Sep-2016 | 499.0 | 0.0 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | NaN | 10+ years | RENT | 12252.0 | Not Verified | Dec-2011 | Fully Paid | n | small_business | real estate business | 606xx | IL | 8.72 | 0.0 | Nov-2001 | 735.0 | 739.0 | 2.0 | 2.0 | 0.0 | 2956.0 | 98.5% | 10.0 | f | 0.0 | 0.0 | 3005.666844 | 3005.67 | 2400.00 | 605.67 | 0.0 | 0.00 | 0.00 | Jun-2014 | 649.91 | Sep-2016 | 719.0 | 715.0 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Let’s also use the pandas `.shape` attribute to view the number of samples and features we’re dealing with at this stage:
(42538, 56)
It’s a great idea to spend some time to familiarize ourselves with the columns in the dataset, to understand what each feature represents. This is important, because a poor understanding of the features could cause us to make mistakes in the data analysis and the modeling process.
We’ll be using the data dictionary Lending Club provided to help us become familiar with the columns and what each represents in the dataset. To make the process easier, we’ll create a DataFrame to contain the names of the columns, data type, first row’s values, and description from the data dictionary.
To make this easier, we’ve pre-converted the data dictionary from Excel format to a CSV.

    data_dictionary = pd.read_csv('LCDataDictionary.csv')  # Loading in the data dictionary
    print(data_dictionary.shape[0])
    print(data_dictionary.columns.tolist())

117
['LoanStatNew', 'Description']
| | LoanStatNew | Description |
|---|---|---|
| 0 | acc_now_delinq | The number of accounts on which the borrower is now delinquent. |
| 1 | acc_open_past_24mths | Number of trades opened in past 24 months. |
| 2 | addr_state | The state provided by the borrower in the loan application |
| 3 | all_util | Balance to credit limit on all trades |
| 4 | annual_inc | The self-reported annual income provided by the borrower during registration. |
Now that we’ve got the data dictionary loaded, let’s join the first row of `loans_2007` to the `data_dictionary` DataFrame to give us a preview DataFrame with the following columns:
- `name` – contains the column names of `loans_2007`.
- `dtypes` – contains the data types of the `loans_2007` columns.
- `first value` – contains the values of the first row of `loans_2007`.
- `description` – explains what each column in `loans_2007` represents.

We can build this preview as follows:

    # Rename the dictionary columns so the merge key and output names line up.
    data_dictionary = data_dictionary.rename(columns={'LoanStatNew': 'name',
                                                      'Description': 'description'})

    # One row per loans_2007 column, holding its dtype.
    loans_2007_dtypes = pd.DataFrame(loans_2007.dtypes, columns=['dtypes'])
    loans_2007_dtypes = loans_2007_dtypes.reset_index()
    loans_2007_dtypes['name'] = loans_2007_dtypes['index']
    loans_2007_dtypes = loans_2007_dtypes[['name', 'dtypes']]

    # Attach the first row's values, then the matching descriptions.
    loans_2007_dtypes['first value'] = loans_2007.loc[0].values
    preview = loans_2007_dtypes.merge(data_dictionary, on='name', how='left')

| | name | dtypes | first value | description |
|---|---|---|---|---|
| 0 | id | object | 1077501 | A unique LC assigned ID for the loan listing. |
| 1 | member_id | float64 | 1.2966e+06 | A unique LC assigned Id for the borrower member. |
| 2 | loan_amnt | float64 | 5000 | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| 3 | funded_amnt | float64 | 5000 | The total amount committed to that loan at that point in time. |
| 4 | funded_amnt_inv | float64 | 4975 | The total amount committed by investors for that loan at that point in time. |
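To see the mechanics of this preview in isolation, here is a self-contained sketch on a two-column toy frame. The toy values and descriptions are placeholders, and the rename of the dictionary columns to `name` and `description` is an assumed step, inferred from the column names that appear in the preview.

```python
import pandas as pd

# Two-column toy frame standing in for loans_2007.
loans = pd.DataFrame({"id": ["1077501"], "loan_amnt": [5000.0]})

# Toy data dictionary using the column names printed earlier.
dictionary = pd.DataFrame({
    "LoanStatNew": ["id", "loan_amnt"],
    "Description": ["Loan listing ID", "Amount applied for"],
})
# Assumed step: rename so the merge key is called 'name' on both sides.
dictionary = dictionary.rename(
    columns={"LoanStatNew": "name", "Description": "description"}
)

# One row per column: its name, its dtype, and the value in the first row.
dtypes = loans.dtypes.rename("dtypes").rename_axis("name").reset_index()
dtypes["first value"] = loans.loc[0].values

# A left merge keeps every column of the frame, even without a description.
preview = dtypes.merge(dictionary, on="name", how="left")
print(preview.columns.tolist())
```

Using `how="left"` matters here: a column missing from the dictionary still shows up in the preview, just with an empty description.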
When we printed the shape of `loans_2007` earlier, we noticed that it had 56 columns, which means this preview DataFrame has 56 rows. It can be cumbersome to explore all the rows of `preview` at once, so instead we’ll break it up into three parts and look at a smaller selection of features each time.
As you explore the features to better understand each of them, you’ll want to pay attention to any column that:
I’ll say it again because it’s important: we need to pay especially close attention to data leakage, which can cause the model to overfit. A leaky feature lets the model learn from information that won’t be available when we use it to make predictions on future loans.
Let’s display the first 19 rows of `preview` and analyze them:

    preview[:19]

| | name | dtypes | first value | description |
|---|---|---|---|---|
| 0 | id | object | 1077501 | A unique LC assigned ID for the loan listing. |
| 1 | member_id | float64 | 1.2966e+06 | A unique LC assigned Id for the borrower member. |
| 2 | loan_amnt | float64 | 5000 | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| 3 | funded_amnt | float64 | 5000 | The total amount committed to that loan at that point in time. |
| 4 | funded_amnt_inv | float64 | 4975 | The total amount committed by investors for that loan at that point in time. |
| 5 | term | object | 36 months | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| 6 | int_rate | object | 10.65% | Interest Rate on the loan |
| 7 | installment | float64 | 162.87 | The monthly payment owed by the borrower if the loan originates. |
| 8 | grade | object | B | LC assigned loan grade |
| 9 | sub_grade | object | B2 | LC assigned loan subgrade |
| 10 | emp_title | object | NaN | The job title supplied by the Borrower when applying for the loan.* |
| 11 | emp_length | object | 10+ years | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| 12 | home_ownership | object | RENT | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER. |
| 13 | annual_inc | float64 | 24000 | The self-reported annual income provided by the borrower during registration. |
| 14 | verification_status | object | Verified | Indicates if income was verified by LC, not verified, or if the income source was verified |
| 15 | issue_d | object | Dec-2011 | The month which the loan was funded |
| 16 | loan_status | object | Fully Paid | Current status of the loan |
| 17 | pymnt_plan | object | n | Indicates if a payment plan has been put in place for the loan |
| 18 | purpose | object | credit_card | A category provided by the borrower for the loan request. |
After analyzing the columns, we can conclude that the following features can be removed:
- `id` – randomly generated field by Lending Club for unique identification purposes only.
- `member_id` – also a randomly generated field by Lending Club for identification purposes only.
- `funded_amnt` – leaks information from the future (after the loan has already started to be funded).
- `funded_amnt_inv` – also leaks data from the future.
- `sub_grade` – contains redundant information that is already in the `grade` column (more below).
- `int_rate` – also included within the `grade` column.
- `emp_title` – requires other data and a lot of processing to become potentially useful.
- `issue_d` – leaks data from the future.

Lending Club uses a borrower’s grade and payment term (36 or 60 months) to assign an interest rate (you can read more about Rates & Fees). This causes variations in interest rate within a given grade. What may be more useful for our model is to focus on clusters of borrowers instead of individuals. That’s exactly what grading does: it segments borrowers based on their credit score and other behaviors, which is why we should keep the `grade` column and drop `int_rate` and `sub_grade`.
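The variation of interest rate within a grade can be seen even in the three sample loans displayed earlier. The sketch below uses only those three rows, so it illustrates the idea rather than describing the full dataset; the helper column `int_rate_pct` is a name introduced here for illustration.

```python
import pandas as pd

# Grade and interest rate of the three sample loans shown earlier.
sample = pd.DataFrame({
    "grade": ["B", "C", "C"],
    "int_rate": ["10.65%", "15.27%", "15.96%"],
})

# int_rate is stored as text, so strip the percent sign before converting.
sample["int_rate_pct"] = sample["int_rate"].str.rstrip("%").astype(float)

# Lowest and highest interest rate observed inside each grade.
spread = sample.groupby("grade")["int_rate_pct"].agg(["min", "max"])
print(spread)
```

Both grade C loans carry different rates, which is the within-grade variation that `grade` alone smooths over.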
Let’s drop these columns from the DataFrame before moving on to the next group of columns.
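The drop itself is not shown in this copy of the post. A minimal sketch, using a one-row toy frame that contains only some of the columns discussed above, might look like this:

```python
import pandas as pd

# One-row toy frame holding a few of the columns from the first group.
loans = pd.DataFrame({
    "id": ["1077501"], "member_id": [1296599.0], "loan_amnt": [5000.0],
    "funded_amnt": [5000.0], "funded_amnt_inv": [4975.0], "grade": ["B"],
    "sub_grade": ["B2"], "int_rate": ["10.65%"], "emp_title": [None],
    "issue_d": ["Dec-2011"],
})

# Columns identified above as identifiers, redundant, or leaking future data.
drop_cols = ["id", "member_id", "funded_amnt", "funded_amnt_inv",
             "sub_grade", "int_rate", "emp_title", "issue_d"]
loans = loans.drop(columns=drop_cols)
print(loans.columns.tolist())
```

Note that the column is spelled `issue_d` in the dataset, so that is the name the drop has to use.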
Let’s move on to the next 19 columns:

    preview[19:38]

| | name | dtypes | first value | description |
|---|---|---|---|---|
| 19 | title | object | Computer | The loan title provided by the borrower |
| 20 | zip_code | object | 860xx | The first 3 numbers of the zip code provided by the borrower in the loan application. |
| 21 | addr_state | object | AZ | The state provided by the borrower in the loan application |
| 22 | dti | float64 | 27.65 | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| 23 | delinq_2yrs | float64 | 0 | The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years |
| 24 | earliest_cr_line | object | Jan-1985 | The month the borrower’s earliest reported credit line was opened |
| 25 | fico_range_low | float64 | 735 | The lower boundary range the borrower’s FICO at loan origination belongs to. |
| 26 | fico_range_high | float64 | 739 | The upper boundary range the borrower’s FICO at loan origination belongs to. |
| 27 | inq_last_6mths | float64 | 1 | The number of inquiries in past 6 months (excluding auto and mortgage inquiries) |
| 28 | open_acc | float64 | 3 | The number of open credit lines in the borrower’s credit file. |
| 29 | pub_rec | float64 | 0 | Number of derogatory public records |
| 30 | revol_bal | float64 | 13648 | Total credit revolving balance |
| 31 | revol_util | object | 83.7% | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| 32 | total_acc | float64 | 9 | The total number of credit lines currently in the borrower’s credit file |
| 33 | initial_list_status | object | f | The initial listing status of the loan. Possible values are – W, F |
| 34 | out_prncp | float64 | 0 | Remaining outstanding principal for total amount funded |
| 35 | out_prncp_inv | float64 | 0 | Remaining outstanding principal for portion of total amount funded by investors |
| 36 | total_pymnt | float64 | 5863.16 | Payments received to date for total amount funded |
| 37 | total_pymnt_inv | float64 | 5833.84 | Payments received to date for portion of total amount funded by investors |
In this group, take note of the `fico_range_low` and `fico_range_high` columns. Both are in this second group of columns, but because they relate to some other columns, we’ll talk more about them after looking at the last group of columns.
We can drop the following columns:
- `zip_code` – mostly redundant with the `addr_state` column, since only the first 3 digits of the 5-digit zip code are visible.
- `out_prncp` – leaks data from the future.
- `out_prncp_inv` – also leaks data from the future.
- `total_pymnt` – also leaks data from the future.
- `total_pymnt_inv` – also leaks data from the future.

Let’s go ahead and remove these 5 columns from the DataFrame:
Let’s analyze the last group of features:

    preview[38:]

| | name | dtypes | first value | description |
|---|---|---|---|---|
| 38 | total_rec_prncp | float64 | 5000 | Principal received to date |
| 39 | total_rec_int | float64 | 863.16 | Interest received to date |
| 40 | total_rec_late_fee | float64 | 0 | Late fees received to date |
| 41 | recoveries | float64 | 0 | post charge off gross recovery |
| 42 | collection_recovery_fee | float64 | 0 | post charge off collection fee |
| 43 | last_pymnt_d | object | Jan-2015 | Last month payment was received |
| 44 | last_pymnt_amnt | float64 | 171.62 | Last total payment amount received |
| 45 | last_credit_pull_d | object | Sep-2016 | The most recent month LC pulled credit for this loan |
| 46 | last_fico_range_high | float64 | 744 | The upper boundary range the borrower’s last FICO pulled belongs to. |
| 47 | last_fico_range_low | float64 | 740 | The lower boundary range the borrower’s last FICO pulled belongs to. |
| 48 | collections_12_mths_ex_med | float64 | 0 | Number of collections in 12 months excluding medical collections |
| 49 | policy_code | float64 | 1 | publicly available policy_code=1; new products not publicly available policy_code=2 |
| 50 | application_type | object | INDIVIDUAL | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| 51 | acc_now_delinq | float64 | 0 | The number of accounts on which the borrower is now delinquent. |
| 52 | chargeoff_within_12_mths | float64 | 0 | Number of charge-offs within 12 months |
| 53 | delinq_amnt | float64 | 0 | The past-due amount owed for the accounts on which the borrower is now delinquent. |
| 54 | pub_rec_bankruptcies | float64 | 0 | Number of public record bankruptcies |
| 55 | tax_liens | float64 | 0 | Number of tax liens |
In this last group of columns, we need to drop the following, all of which leak data from the future:
- `total_rec_prncp`
- `total_rec_int`
- `total_rec_late_fee`
- `recoveries`
- `collection_recovery_fee`
- `last_pymnt_d`
- `last_pymnt_amnt`
Let’s drop our last group of columns:
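With leaky columns spread across three groups, it is easy to miss one. The following sketch collects the columns flagged as leaking in this post into a set and reports any that are still present; the helper name `remaining_leaky_columns` and the toy frame are hypothetical.

```python
import pandas as pd

# Columns flagged in this post as recording events after the loan was funded.
LEAKY = {
    "funded_amnt", "funded_amnt_inv", "issue_d",
    "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv",
    "total_rec_prncp", "total_rec_int", "total_rec_late_fee",
    "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt",
}


def remaining_leaky_columns(df):
    """Return, in sorted order, any flagged columns still present in df."""
    return sorted(LEAKY.intersection(df.columns))


toy = pd.DataFrame({
    "loan_amnt": [5000.0], "recoveries": [0.0], "total_rec_int": [863.16],
})
still_leaky = remaining_leaky_columns(toy)
cleaned = toy.drop(columns=still_leaky)
print(still_leaky, remaining_leaky_columns(cleaned))
```

Running a check like this after the final drop gives a quick confirmation that none of the flagged columns made it into the modeling data.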
Now, besides the explanations provided in the Description column, let’s learn more about `fico_range_low`, `fico_range_high`, `last_fico_range_low`, and `last_fico_range_high`.
A FICO score is a credit score: a number used by banks and credit card issuers to represent how creditworthy a person is. While there are a few types of credit scores used in the United States, the FICO score is the best known and most widely used.
When a borrower applies for a loan, Lending Club gets the borrower’s credit score from FICO. They are given the lower and upper limits of the range that the borrower’s score belongs to, and they store those values as `fico_range_low` and `fico_range_high`. After that, any updates to the borrower’s score are recorded as `last_fico_range_low` and `last_fico_range_high`.
A key part of any data science project is to do everything you can to understand the data. While researching this data set, I found a project done in 2014 by a group of students from Stanford University on this same dataset.
In the report for the project, the group listed the current credit score (`last_fico_range`), along with late fees and recovery fees, as fields they had mistakenly added to the features, and stated that they later learned these columns all leak information from the future.
However, following this group’s project, another group from Stanford worked on this same Lending Club dataset. They used the FICO score columns in their modeling, dropping only `last_fico_range_low`. This second group’s report described `last_fico_range_high` as one of the more important features in predicting accurate results.
The question we must answer is: do the FICO credit scores leak information from the future? Recall that a column is considered to leak information when it won’t be available at the time we use our model, which in this case is when we apply it to future loans.
This blog examines in depth the FICO scores for Lending Club loans. It notes that while the trend of the FICO scores is a great predictor of whether a loan will default, Lending Club continues to update the scores after a loan is funded, so a defaulting loan can lower the borrower’s score. In other words, the updated scores leak data.
Therefore we can safely use fico_range_low and fico_range_high, but not last_fico_range_low and last_fico_range_high. Let’s take a look at the values in these columns:
因此,我们可以安全地使用fico_range_low
和fico_range_high
,但不能使用last_fico_range_low
和last_fico_range_high
。 让我们看一下这些列中的值:
print(loans_2007['fico_range_low'].unique())
print(loans_2007['fico_range_high'].unique())
[ 735. 740. 690. 695. 730. 660. 675. 725. 710. 705. 720. 665.
670. 760. 685. 755. 680. 700. 790. 750. 715. 765. 745. 770.
780. 775. 795. 810. 800. 815. 785. 805. 825. 820. 630. 625.
nan 650. 655. 645. 640. 635. 610. 620. 615.]
[ 739. 744. 694. 699. 734. 664. 679. 729. 714. 709. 724. 669.
674. 764. 689. 759. 684. 704. 794. 754. 719. 769. 749. 774.
784. 779. 799. 814. 804. 819. 789. 809. 829. 824. 634. 629.
nan 654. 659. 649. 644. 639. 614. 624. 619.]
Let’s get rid of the missing values, then plot histograms to look at the ranges of the two columns:
让我们摆脱缺失的值,然后绘制直方图以查看两列的范围:
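The code for this step is not shown, so here is a minimal sketch of what it could look like. The small `loans_2007` frame below is a hypothetical stand-in for the real data, and the 20-bin choice is an assumption. The bin counts are computed with NumPy rather than drawn, so the sketch runs without a plotting library.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for loans_2007: a few FICO range values and one missing row.
loans_2007 = pd.DataFrame({
    "fico_range_low": [735.0, 740.0, 690.0, np.nan, 660.0],
    "fico_range_high": [739.0, 744.0, 694.0, np.nan, 664.0],
})

fico_columns = ["fico_range_high", "fico_range_low"]

rows_before = loans_2007.shape[0]
# Drop rows where either FICO column is missing.
loans_2007 = loans_2007.dropna(subset=fico_columns)
rows_after = loans_2007.shape[0]
print(rows_before, rows_after)

# Compute the data behind a histogram: 20 equal-width bins over a shared range.
low = loans_2007[fico_columns].min().min()
high = loans_2007[fico_columns].max().max()
for col in fico_columns:
    counts, edges = np.histogram(loans_2007[col], bins=20, range=(low, high))
    print(col, counts.tolist())
```

With matplotlib installed, `loans_2007[fico_columns].plot.hist(alpha=0.5, bins=20)` would draw the two distributions on one chart instead of printing counts.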
Let’s now go ahead and create a column for the average of the fico_range_low and fico_range_high columns and name it fico_average. Note that this is not the average FICO score for each borrower, but rather an average of the high and low range that we know the borrower is in.
现在,让我们为fico_range_low
和fico_range_high
列的平均值创建一列,并将其命名为fico_average
。 请注意,这不是每个借款人的平均FICO得分,而是我们知道借款人所处的最高和最低范围的平均值。
loans_2007['fico_average'] = (loans_2007['fico_range_high'] + loans_2007['fico_range_low']) / 2
Let’s check what we just did.
让我们检查一下我们刚才做了什么。
| | fico_range_low | fico_range_high | fico_average |
|---|---|---|---|
| 0 | 735.0 | 739.0 | 737.0 |
| 1 | 740.0 | 744.0 | 742.0 |
| 2 | 735.0 | 739.0 | 737.0 |
| 3 | 690.0 | 694.0 | 692.0 |
| 4 | 695.0 | 699.0 | 697.0 |
Good! We got the mean calculations and everything right. Now, we can go ahead and drop the fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns.
好! 我们得到了均值计算,一切都正确。 现在,我们可以继续删除fico_range_low
, fico_range_high
, last_fico_range_low
和last_fico_range_high
列。
drop_cols = ['fico_range_low', 'fico_range_high', 'last_fico_range_low',
             'last_fico_range_high']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
loans_2007.shape
(42535, 33)
Notice just by becoming familiar with the columns in the dataset, we’re able to reduce the number of columns from 56 to 33.
注意,只要熟悉数据集中的列,我们就能将列数从56减少到33。
Now, let’s decide on the appropriate column to use as a target column for modeling – keep in mind the main goal is to predict who will pay off a loan and who will default.
现在,让我们确定合适的列以用作建模的目标列–请记住,主要目标是预测谁将还清贷款以及谁会违约。
We learned from the description of columns in the preview DataFrame that loan_status is the only field in the main dataset that describes a loan status, so let’s use this column as the target column.
我们从预览DataFrame中的列描述中得知, loan_status
是主数据集中描述贷款状态的唯一字段,因此让我们将此列用作目标列。
| | name | dtypes | first value | description |
|---|---|---|---|---|
| 16 | loan_status | object | Fully Paid | Current status of the loan |
Currently, this column contains text values that need to be converted to numerical values before they can be used to train a model.
当前,此列包含需要转换为数值才能用于训练模型的文本值。
Let’s explore the different values in this column and come up with a strategy for converting them. We’ll use the pandas Series method value_counts() to return the frequency of the unique values in the loan_status column.
让我们探索此列中的不同值,并提出一种转换此列中的值的策略。 我们将使用DataFrame方法value_counts()
来返回loan_status
列中唯一值的频率。
loans_2007["loan_status"].value_counts()
Fully Paid 33586
Charged Off 5653
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Current 513
In Grace Period 16
Late (31-120 days) 12
Late (16-30 days) 5
Default 1
Name: loan_status, dtype: int64
The loan status has nine different possible values!
贷款状态有九种可能的值!
Let’s learn about these unique values to determine the ones that best describe the final outcome of a loan, and also the kind of classification problem we’ll be dealing with.
让我们了解这些独特的值,以确定最能描述贷款最终结果的值,以及我们将要处理的分类问题。
You can read about most of the different loan statuses on the Lending Club website as well as these posts on the Lend Academy and Orchard forums. I have pulled that data together in a table below so we can see the unique values, their frequency in the dataset and what each means:
您可以在Lending Club网站上以及Lend Academy和Orchard论坛上了解有关大多数不同贷款状态的信息 。 我将这些数据汇总到下表中,以便我们可以看到唯一值,它们在数据集中的出现频率以及各自的含义:
| | Loan Status | Count | Meaning |
|---|---|---|---|
| 0 | Fully Paid | 33586 | Loan has been fully paid off. |
| 1 | Charged Off | 5653 | Loan for which there is no longer a reasonable expectation of further payments. |
| 2 | Does not meet the credit policy. Status:Fully Paid | 1988 | While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace. |
| 3 | Does not meet the credit policy. Status:Charged Off | 761 | While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace. |
| 4 | Current | 513 | Loan is up to date on current payments. |
| 5 | In Grace Period | 16 | The loan is past due but still in the grace period of 15 days. |
| 6 | Late (31-120 days) | 12 | Loan hasn’t been paid in 31 to 120 days (late on the current payment). |
| 7 | Late (16-30 days) | 5 | Loan hasn’t been paid in 16 to 30 days (late on the current payment). |
| 8 | Default | 1 | Loan is defaulted on and no payment has been made for more than 121 days. |
Remember, our goal is to build a machine learning model that can learn from past loans in trying to predict which loans will be paid off and which won’t. From the above table, only the Fully Paid and Charged Off values describe the final outcome of a loan. The other values describe loans that are still ongoing, and even though some loans are late on payments, we can’t jump the gun and classify them as Charged Off.
请记住,我们的目标是建立一个机器学习模型,该模型可以从过去的贷款中学习,以试图预测哪些贷款将得到还清,而哪些则不会。 在上表中,仅“已付清”和“已清还”值描述了贷款的最终结果。 其他值描述的是仍在继续的贷款,即使有些贷款延迟付款,我们也无法将其归类为“冲销”。
Also, while the Default status resembles the Charged Off status, in Lending Club’s eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance. Therefore, we should use only samples where the loan_status column is 'Fully Paid' or 'Charged Off'.
同样,虽然“默认”状态类似于“已注销”状态,但在Lending Club看来,已注销的贷款基本上没有机会偿还,而“默认”的机会很小。 因此,我们应该仅使用loan_status
'Fully Paid'
或'Charged Off'
示例。
We’re not interested in any statuses that indicate that the loan is ongoing or in progress, because predicting that something is in progress doesn’t tell us anything.
我们对表示贷款正在进行或进行中的任何状态都不感兴趣,因为预测正在发生的事情不会告诉我们任何事情。
Since we’re interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as binary classification.
由于我们有兴趣预测贷款将属于这两个值中的哪个值,因此可以将问题视为二进制分类 。
Let’s remove all the loans that don’t contain either 'Fully Paid' or 'Charged Off' as the loan’s status and then transform the 'Fully Paid' values to 1 for the positive case and the 'Charged Off' values to 0 for the negative case.
让我们删除所有不包含'Fully Paid'
或'Charged Off'
作为贷款状态的贷款,然后将正数情况下的'Fully Paid'
值转换为1
,将正情况下的'Charged Off'
值转换为0
。否定情况。
This will mean that out of the ~42,000 rows we have, we’ll be removing just over 3,000.
这意味着在我们拥有的约42,000行中,我们将删除3,000多行。
There are a few different ways to transform all of the values in a column; we’ll use the DataFrame method replace().
转换列中所有值的方法很少,我们将使用DataFrame方法replace()
。
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |
                        (loans_2007["loan_status"] == "Charged Off")]

mapping_dictionary = {"loan_status": {"Fully Paid": 1, "Charged Off": 0}}
loans_2007 = loans_2007.replace(mapping_dictionary)
These plots indicate that a significant number of borrowers in our dataset paid off their loan: 85.62% of borrowers paid off the amount borrowed, while 14.38% unfortunately defaulted. It is these ‘defaulters’ that we’re most interested in filtering out as much as possible, to reduce losses on investment returns.
这些图表明,在我们的数据集中,有大量借款人还清了他们的贷款,其中85.62%的借款人还清了借入的金额,而不幸的是有14.38%的违约。 从我们的贷款数据来看,正是这些“违约者”使我们对尽可能多地过滤掉以减少投资回报损失更感兴趣。
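The shares behind such plots can be read directly from the target column. The sketch below uses a tiny hypothetical `loan_status` Series (not the real data) to show how `value_counts(normalize=True)` yields the class proportions.

```python
import pandas as pd

# Hypothetical miniature target column: 1 = Fully Paid, 0 = Charged Off.
loan_status = pd.Series([1, 1, 1, 1, 1, 1, 0], name="loan_status")

# Absolute counts per class.
counts = loan_status.value_counts()
# Relative frequency per class, expressed as a percentage.
shares = loan_status.value_counts(normalize=True) * 100

print(counts)
print(shares.round(2))
```

On the real column, the same two calls would give the counts and percentages that the plots summarize.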
In part two of our walkthrough, we’ll learn that the significant percentage difference, or class imbalance, in the target variable needs to be considered when we build our model.
在本演练的第二部分中,我们将学习在构建模型时需要考虑目标变量中的显着百分比差异或类不平衡 。
To wrap up this section, let’s look for any columns that contain only one unique value and remove them. These columns won’t be useful for the model since they don’t add any information to each loan application. In addition, removing these columns will reduce the number of columns we’ll need to explore further in the next stage.
为了结束本节,让我们查找仅包含一个唯一值的所有列并将其删除。 这些列不会对模型有用,因为它们不会向每个贷款申请添加任何信息。 此外,删除这些列将减少我们在下一阶段需要进一步探索的列数。
The pandas Series method nunique() returns the number of unique values, excluding any null values. We can apply this method across the dataset to remove these columns in one easy step.
熊猫Series方法nunique()
返回唯一值的数量,不包括任何空值。 我们可以在整个数据集中使用此方法,只需一个简单的步骤即可删除这些列。
loans_2007 = loans_2007.loc[:, loans_2007.apply(pd.Series.nunique) != 1]
Again, there may be some columns with more than one unique value where one of the values has an insignificant frequency in the dataset. Let’s find out and drop such column(s):
同样,可能有一些列具有不止一个唯一值,但其中一个值在数据集中的频率不重要。 让我们找出并删除这样的列:
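The code that produced the counts below is not shown. One way to get a similar listing is to print value counts only for columns with very few distinct values. This is a sketch on a hypothetical miniature `loans_2007`, and the cutoff of fewer than 4 unique values is an assumption chosen to match the columns that appear in the output.

```python
import pandas as pd

# Hypothetical miniature loans_2007; only the column names come from the walkthrough.
loans_2007 = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
    "pymnt_plan": ["n", "n", "n", "y"],
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
})

# Assumed cutoff: columns with fewer than 4 distinct values are worth inspecting.
low_cardinality = [col for col in loans_2007.columns
                   if loans_2007[col].nunique() < 4]

for col in low_cardinality:
    print(loans_2007[col].value_counts())
    print()
```

A column where one value appears only a handful of times, such as `pymnt_plan` here, carries almost no signal and is a candidate for dropping.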
36 months 29096
60 months 10143
Name: term, dtype: int64
Not Verified 16845
Verified 12526
Source Verified 9868
Name: verification_status, dtype: int64
1 33586
0 5653
Name: loan_status, dtype: int64
n 39238
y 1
Name: pymnt_plan, dtype: int64
The payment plan column (pymnt_plan) has two unique values, 'y' and 'n', with 'y' occurring only once. Let’s drop this column:
付款计划列( pymnt_plan
)具有两个唯一值'y'
和'n'
,其中'y'
仅发生一次。 让我们删除此列:
print(loans_2007.shape[1])
loans_2007 = loans_2007.drop('pymnt_plan', axis=1)
print("We've been able to reduce the features to => {}".format(loans_2007.shape[1]))
25
We've been able to reduce the features to => 24
Lastly, let’s save our work in this section to a CSV file.
最后,让我们将本节中的工作保存到CSV文件中。
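The save step itself is not shown. A minimal sketch follows, assuming the file goes to the same `processed_data/filtered_loans_2007.csv` path that is read back in the next section. The tiny DataFrame is a hypothetical stand-in, and `index=False` is an assumption made so that no extra index column appears on reload.

```python
import os
import pandas as pd

# Hypothetical stand-in for the cleaned loans_2007 DataFrame.
loans_2007 = pd.DataFrame({"loan_amnt": [5000.0, 2500.0], "loan_status": [1, 0]})

# Make sure the target directory exists before writing.
os.makedirs("processed_data", exist_ok=True)

# index=False keeps the row index out of the file.
loans_2007.to_csv("processed_data/filtered_loans_2007.csv", index=False)

# Read the file back to confirm that the round trip preserves the shape.
reloaded = pd.read_csv("processed_data/filtered_loans_2007.csv")
print(reloaded.shape)
```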
In this section, we’ll prepare the filtered_loans_2007.csv data for machine learning. We’ll focus on handling missing values, converting categorical columns to numeric columns and removing any other extraneous columns.
在本节中,我们将准备filtered_loans_2007.csv
数据用于机器学习。 我们将专注于处理缺失值,将分类列转换为数字列并删除任何其他无关的列。
We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will raise an error if you try to train a model like linear regression or logistic regression using data that contains missing or non-numeric values.
在将数据输入机器学习算法之前,我们需要处理缺失值和分类特征 ,因为大多数机器学习模型所基于的数学假定数据是数值的并且不包含缺失值。 为了加强此要求,如果在使用线性回归和逻辑回归等模型时尝试使用包含缺失值或非数值的数据训练模型,则scikit-learn将返回错误。
Here’s an outline of what we’ll be doing in this stage:
这是我们在此阶段将要做的事情的概要:
First though, let’s load in the data from last section’s final output:
不过首先,让我们从上一节的最终输出中加载数据:
import pandas as pd

filtered_loans = pd.read_csv('processed_data/filtered_loans_2007.csv')
print(filtered_loans.shape)
filtered_loans.head()
(39239, 24)
| | loan_amnt | term | installment | grade | emp_length | home_ownership | annual_inc | verification_status | loan_status | purpose | title | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | last_credit_pull_d | pub_rec_bankruptcies | fico_average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 36 months | 162.87 | B | 10+ years | RENT | 24000.0 | Verified | 1 | credit_card | Computer | AZ | 27.65 | 0.0 | Jan-1985 | 1.0 | 3.0 | 0.0 | 13648.0 | 83.7% | 9.0 | Sep-2016 | 0.0 | 737.0 |
| 1 | 2500.0 | 60 months | 59.83 | C | < 1 year | RENT | 30000.0 | Source Verified | 0 | car | bike | GA | 1.00 | 0.0 | Apr-1999 | 5.0 | 3.0 | 0.0 | 1687.0 | 9.4% | 4.0 | Sep-2016 | 0.0 | 742.0 |
| 2 | 2400.0 | 36 months | 84.33 | C | 10+ years | RENT | 12252.0 | Not Verified | 1 | small_business | real estate business | IL | 8.72 | 0.0 | Nov-2001 | 2.0 | 2.0 | 0.0 | 2956.0 | 98.5% | 10.0 | Sep-2016 | 0.0 | 737.0 |
| 3 | 10000.0 | 36 months | 339.31 | C | 10+ years | RENT | 49200.0 | Source Verified | 1 | other | personel | CA | 20.00 | 0.0 | Feb-1996 | 1.0 | 10.0 | 0.0 | 5598.0 | 21% | 37.0 | Apr-2016 | 0.0 | 692.0 |
| 4 | 5000.0 | 36 months | 156.46 | A | 3 years | RENT | 36000.0 | Source Verified | 1 | wedding | My wedding loan I promise to pay back | AZ | 11.20 | 0.0 | Nov-2004 | 3.0 | 9.0 | 0.0 | 7963.0 | 28.3% | 12.0 | Jan-2016 | 0.0 | 732.0 |
Let’s compute the number of missing values and determine how to handle them. We can return the number of missing values across the DataFrame by:
让我们计算缺失值的数量并确定如何处理它们。 我们可以通过以下方式返回整个DataFrame中缺失值的数量:
- First, use the pandas DataFrame method isnull() to return a DataFrame containing Boolean values:
  - True if the original value is null
  - False if the original value isn’t null
- Then, use the pandas DataFrame method sum() to calculate the number of null values in each column.
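The two steps above can be chained in one expression. The sketch below runs them on a small hypothetical `filtered_loans` frame; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for filtered_loans with a couple of missing entries.
filtered_loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "title": ["Computer", None, "bike"],
    "revol_util": ["83.7%", "9.4%", np.nan],
})

# isnull() marks each cell True/False; sum() counts the True values per column.
null_counts = filtered_loans.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))
```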
Number of null values in each column:
loan_amnt 0
term 0
installment 0
grade 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
purpose 0
title 10
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 2
pub_rec_bankruptcies 697
fico_average 0
dtype: int64
Notice that while most of the columns have 0 missing values, title has 10 missing values, revol_util has 50, last_credit_pull_d has 2, and pub_rec_bankruptcies contains 697 rows with missing values. Let’s remove columns entirely where more than 1% (392) of the rows for that column contain a null value. In addition, we’ll remove the remaining rows containing null values, which means we’ll lose a bit of data, but in return keep some extra features to use for prediction.
请注意,虽然大多数列都有0个缺失值, title
有9个缺失值, revol_util
有48个,而pub_rec_bankruptcies
包含675个缺失值的行。 让我们完全删除那些该列中超过1%(392)的行包含空值的列。 此外,我们将删除其余包含空值的行,这意味着我们将丢失一些数据,但作为回报,保留一些额外的功能以用于预测。
This means that we’ll keep the title and revol_util columns, just removing rows containing missing values, but drop the pub_rec_bankruptcies column entirely since more than 1% of the rows have a missing value for this column.
这意味着我们将保留title
和revol_util
列,只删除包含缺失值的行,但由于有1%以上的行具有该列的缺失值,因此将pub_rec_bankruptcies
列完全删除。
Here’s a list of steps we can use to achieve that:
这是我们可以用来实现这一目标的步骤列表:
filtered_loans = filtered_loans.drop("pub_rec_bankruptcies", axis=1)
filtered_loans = filtered_loans.dropna()
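The 1% rule can also be applied programmatically rather than by reading the null counts by eye. The following is a sketch under stated assumptions: the frame and its column names `a`, `b`, `c` are hypothetical, and the threshold is passed as a fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 200 rows: "b" has 5 nulls (2.5%), "c" has 1 null (0.5%).
filtered_loans = pd.DataFrame({
    "a": np.arange(200, dtype=float),
    "b": [np.nan] * 5 + [1.0] * 195,
    "c": [np.nan] + ["x"] * 199,
})

# Fraction of null values per column.
null_share = filtered_loans.isnull().mean()

# Columns where more than 1% of the rows are null get dropped entirely.
too_sparse = null_share[null_share > 0.01].index.tolist()
filtered_loans = filtered_loans.drop(too_sparse, axis=1)

# Remaining rows that still contain a null value are removed.
filtered_loans = filtered_loans.dropna()
print(too_sparse, filtered_loans.shape)
```

Here column `b` is dropped as a whole, while the single null in `c` costs only one row.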
Next, we’ll focus on the categorical columns.
接下来,我们将集中讨论类别列。
Keep in mind, the goal in this section is to have all the columns as numeric columns (int or float data type), and containing no missing values. We just dealt with the missing values, so let’s now find out the number of columns that are of the object data type and then move on to process them into numeric form.
请记住,本节的目标是使所有列都为数字列(int或float数据类型),并且不包含任何缺失值。 我们只是处理缺少的值,所以现在让我们找出对象数据类型的列数,然后继续将其处理为数字形式。
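Counting columns per data type takes one line: `dtypes` gives the type of each column and `value_counts()` tallies them. The sketch below uses a hypothetical three-column frame, so the counts differ from the real output.

```python
import pandas as pd

# Hypothetical stand-in for filtered_loans with one column of each kind.
filtered_loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],
    "grade": ["B", "C"],
    "loan_status": [1, 0],
})

# dtypes is a Series of column types; value_counts() tallies each type.
dtype_counts = filtered_loans.dtypes.value_counts()
print("Data types and their frequency\n{}".format(dtype_counts))
```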
Data types and their frequency
float64 11
object 11
int64 1
dtype: int64
We have 11 object columns that contain text which needs to be converted into numeric features. Let’s select just the object columns using the DataFrame method select_dtypes, then display a sample row to get a better sense of how the values in each column are formatted.
我们有11个对象列,其中包含需要转换为数字特征的文本。 让我们使用DataFrame方法select_dtype只选择对象列,然后显示一个示例行,以更好地了解每一列中的值如何格式化。
object_columns_df = filtered_loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])
term 36 months
grade B
emp_length 10+ years
home_ownership RENT
verification_status Verified
purpose credit_card
title Computer
addr_state AZ
earliest_cr_line Jan-1985
revol_util 83.7%
last_credit_pull_d Sep-2016
Name: 0, dtype: object
Notice that the revol_util column contains numeric values, but is formatted as object. We learned from the description of columns in the preview DataFrame earlier that revol_util is the revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit (read more here).
请注意, revol_util
列包含数字值,但其格式设置为对象。 我们从前面的preview
DataFrame中的列描述中了解到, revol_util
是循环使用率或借款人相对于所有可用信用使用的信用量( 在此处了解更多信息 )。
We need to format revol_util
as numeric values. Here’s what we should do:
我们需要将revol_util
格式化为数值。 这是我们应该做的:
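- Use the str.rstrip() string method to strip the right trailing percent sign (%).
- On the resulting Series object, use the astype() method to convert to the type float.
- Assign the new Series of float values back to the revol_util column in filtered_loans.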
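These steps can be sketched as follows. The three sample values are taken from the rows previewed earlier, but the one-column frame itself is a hypothetical stand-in for `filtered_loans`.

```python
import pandas as pd

# Hypothetical stand-in holding only the column of interest.
filtered_loans = pd.DataFrame({"revol_util": ["83.7%", "9.4%", "21%"]})

# Strip the trailing percent sign, convert to float, and assign the result back.
filtered_loans["revol_util"] = (
    filtered_loans["revol_util"].str.rstrip("%").astype("float")
)
print(filtered_loans["revol_util"].tolist())
```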
Moving on, these columns seem to represent categorical values:
继续,这些列似乎代表分类值:
- home_ownership – home ownership status, can only be 1 of 4 categorical values according to the data dictionary.
- verification_status – indicates if income was verified by Lending Club.
- emp_length – number of years the borrower was employed upon time of application.
- term – number of payments on the loan, either 36 or 60.
- addr_state – borrower’s state of residence.
- grade – LC assigned loan grade based on credit score.
- purpose – a category provided by the borrower for the loan request.
- title – loan title provided by the borrower.

To be sure, let’s confirm by checking the number of unique values in each of them.
可以肯定的是,通过检查每个值中的唯一值来进行确认。
Also, based on the first row’s values for purpose
and title
, it appears these two columns reflect the same information. We’ll explore their unique value counts separately to confirm if this is true.
同样,基于第一行的purpose
和title
值,看起来这两列反映了相同的信息。 我们将分别探索其唯一值计数,以确认是否为真。
Lastly, notice that the first row’s values for both the earliest_cr_line and last_credit_pull_d columns are dates, which would require a good amount of feature engineering to be potentially useful:
最后,注意第一行的值都earliest_cr_line
和last_credit_pull_d
列包含这将需要功能的工程量好日期值,他们是潜在的有用:
- earliest_cr_line – The month the borrower’s earliest reported credit line was opened
- last_credit_pull_d – The most recent month Lending Club pulled credit for this loan

We’ll remove these date columns from the DataFrame.
我们将从DataFrame中删除这些日期列。
First, let’s explore the unique value counts of the six columns that seem like they contain categorical values:
首先,让我们探索看起来好像包含分类值的六列的唯一值计数
cols = ['home_ownership', 'grade', 'verification_status', 'emp_length', 'term', 'addr_state']
for name in cols:
    print(name, ':')
    print(object_columns_df[name].value_counts(), '\n')
home_ownership :
RENT 18677
MORTGAGE 17381
OWN 3020
OTHER 96
NONE 3
Name: home_ownership, dtype: int64
grade :
B 11873
A 10062
C 7970
D 5194
E 2760
F 1009
G 309
Name: grade, dtype: int64
verification_status :
Not Verified 16809
Verified 12515
Source Verified 9853
Name: verification_status, dtype: int64
emp_length :
10+ years 8715
< 1 year 4542
2 years 4344
3 years 4050
4 years 3385
5 years 3243
1 year 3207
6 years 2198
7 years 1738
8 years 1457
9 years 1245
n/a 1053
Name: emp_length, dtype: int64
term :
36 months 29041
60 months 10136
Name: term, dtype: int64
addr_state :
CA 7019
NY 3757
FL 2831
TX 2693
NJ 1825
IL 1513
PA 1493
VA 1388
GA 1381
MA 1322
OH 1197
MD 1039
AZ 863
WA 830
CO 777
NC 772
CT 738
MI 718
MO 677
MN 608
NV 488
SC 469
WI 447
OR 441
AL 441
LA 432
KY 319
OK 294
KS 264
UT 255
AR 241
DC 211
RI 197
NM 187
WV 174
HI 170
NH 169
DE 113
MT 84
WY 83
AK 79
SD 61
VT 53
MS 19
TN 17
IN 9
ID 6
IA 5
NE 5
ME 3
Name: addr_state, dtype: int64
Most of these columns contain discrete categorical values which we can encode as dummy variables and keep. The addr_state column, however, contains too many unique values, so it’s better to drop it.
这些列大多数包含离散的分类值,我们可以将其编码为虚拟变量并保留。 但是, addr_state
列包含太多唯一值,因此最好删除它。
Next, let’s look at the unique value counts for the purpose
and title
columns to understand which columns we want to keep.
接下来,让我们看一下purpose
列和title
列的唯一值计数,以了解我们要保留哪些列。
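The code for this comparison is not shown. A sketch on a hypothetical four-row frame illustrates how the two columns could be compared, with the header text mirroring the output that follows.

```python
import pandas as pd

# Hypothetical stand-in: purpose is a fixed category, title is free text.
filtered_loans = pd.DataFrame({
    "purpose": ["credit_card", "car", "debt_consolidation", "debt_consolidation"],
    "title": ["Computer", "bike", "Debt Consolidation", "debt consolidation"],
})

for name in ["purpose", "title"]:
    print("Unique Values in column: {}\n".format(name))
    print(filtered_loans[name].value_counts(), "\n")
```

Even in this toy example the free-text `title` splits one purpose into several spellings, which is the pattern the real counts show at a much larger scale.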
Unique Values in column: purpose
debt_consolidation 18355
credit_card 5073
other 3921
home_improvement 2944
major_purchase 2178
small_business 1792
car 1534
wedding 940
medical 688
moving 580
vacation 377
house 372
educational 320
renewable_energy 103
Name: purpose, dtype: int64
Unique Values in column: title
Debt Consolidation 2142
Debt Consolidation Loan 1670
Personal Loan 650
Consolidation 501
debt consolidation 495
Credit Card Consolidation 354
Home Improvement 350
Debt consolidation 331
Small Business Loan 317
Credit Card Loan 310
Personal 306
Consolidation Loan 255
Home Improvement Loan 243
personal loan 231
personal 217
Loan 210
Wedding Loan 206
Car Loan 198
consolidation 197
Other Loan 187
Credit Card Payoff 153
Wedding 152
Major Purchase Loan 144
Credit Card Refinance 143
Consolidate 126
Medical 120
Credit Card 115
home improvement 109
My Loan 94
Credit Cards 92
...
toddandkim4ever 1
Remainder of down payment 1
Building a Financial Future 1
Higher interest payoff 1
Chase Home Improvement Loan 1
Sprinter Purchase 1
Refi credit card-great payment record 1
Karen's Freedom Loan 1
Business relocation and partner buyout 1
Update My New House 1
tito 1
florida vacation 1
Back to 0 1
Bye Bye credit card 1
britschool 1
Consolidation 16X60 1
Last Call 1
Want to be debt free in "3" 1
for excellent credit 1
loaney 1
jamal's loan 1
Refying Lending Club-I LOVE THIS PLACE! 1
Consoliation Loan 1
Personal/ Consolidation 1
Pauls Car 1
Road to freedom loan 1
Pay it off FINALLY! 1
MASH consolidation 1
Destination Wedding 1
Store Charge Card 1
Name: title, dtype: int64
It appears the purpose
and title
columns do contain overlapping information, but the purpose
column contains fewer discrete values and is cleaner, so we’ll keep it and drop title
.
看起来purpose
和title
列确实包含重叠的信息,但是purpose
列包含的离散值更少并且更整洁,因此我们将其保留并删除title
。
Let’s drop the columns we’ve decided not to keep so far:
让我们删除到目前为止我们决定不保留的列:
drop_cols = ['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line']
filtered_loans = filtered_loans.drop(drop_cols, axis=1)
First, let’s understand the two types of categorical features we have in our dataset and how we can convert each to numerical features:
首先,让我们了解数据集中的两种分类特征,以及如何将它们转换为数字特征:
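- Ordinal values: these categorical values have a natural order, so we can sort them in increasing or decreasing order. For instance, Lending Club grades loans from A to G, and we can order the grades by risk:

A < B < C < D < E < F < G ; where < means less risky than

- Nominal values: these are regular categorical values that can’t be ordered. For instance, while we can order loan applicants in the employment length column (emp_length) based on years spent in the workforce:

year 1 < year 2 < year 3 … < year N,

we can’t do that with the column purpose. It wouldn’t make sense to say:

car < wedding < education < moving < house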
These are the columns we now have in our dataset:
这些是我们现在在数据集中的列:
- Ordinal values: grade, emp_length
- Nominal values: home_ownership, verification_status, purpose, term
There are different approaches to handle each of these two types. In the steps following, we’ll convert each of them accordingly.
有两种不同的方法来处理这两种类型。 在下面的步骤中,我们将相应地对其进行转换。
To map the ordinal values to integers, we can use the pandas DataFrame method replace() to map both grade and emp_length to appropriate numeric values:
要将序数值映射为整数,我们可以使用pandas DataFrame方法replace()
将grade
和emp_length
都emp_length
为适当的数值
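The mapping code is not shown, so the sketch below illustrates one possible mapping. It swaps in `Series.map` per column in place of `replace()`, because `map` returns an integer column directly when every value is covered. Mapping `"n/a"` to 0 is an assumption, and the four-row frame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in with the two ordinal columns.
filtered_loans = pd.DataFrame({
    "emp_length": ["10+ years", "< 1 year", "3 years", "n/a"],
    "grade": ["B", "C", "A", "G"],
})

mapping_dict = {
    "emp_length": {
        "10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7,
        "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3,
        "2 years": 2, "1 year": 1, "< 1 year": 0,
        "n/a": 0,  # assumption: treat unknown employment length as 0
    },
    "grade": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7},
}

# Apply each column's mapping; values missing from a mapping would become NaN.
for column, mapping in mapping_dict.items():
    filtered_loans[column] = filtered_loans[column].map(mapping)

print(filtered_loans[["emp_length", "grade"]].head())
```

The same nested dictionary can be passed to `filtered_loans.replace(mapping_dict)` as the text describes; depending on the pandas version, the result may then need an explicit integer cast.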
| | emp_length | grade |
|---|---|---|
| 0 | 10 | 2 |
| 1 | 0 | 3 |
| 2 | 10 | 3 |
| 3 | 10 | 3 |
| 4 | 3 | 1 |
Perfect! Let’s move on to the Nominal Values. The approach to converting nominal features into numerical features is to encode them as dummy variables. The process will be:
完善! 让我们继续看名义值。 将名义特征转换为数字特征的方法是将其编码为虚拟变量。 该过程将是:
- Use pandas’ get_dummies() method to return a new DataFrame containing a new column for each dummy variable
- Use the concat() method to add these dummy columns back to the original DataFrame

Let’s go ahead and encode the nominal columns that we now have in our dataset.
让我们继续进行编码,以编码现在数据集中的标称列。
nominal_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(filtered_loans[nominal_columns])
filtered_loans = pd.concat([filtered_loans, dummy_df], axis=1)
filtered_loans = filtered_loans.drop(nominal_columns, axis=1)
| | loan_amnt | installment | grade | emp_length | annual_inc | loan_status | dti | delinq_2yrs | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | fico_average | home_ownership_MORTGAGE | home_ownership_NONE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT | verification_status_Not Verified | verification_status_Source Verified | verification_status_Verified | purpose_car | purpose_credit_card | purpose_debt_consolidation | purpose_educational | purpose_home_improvement | purpose_house | purpose_major_purchase | purpose_medical | purpose_moving | purpose_other | purpose_renewable_energy | purpose_small_business | purpose_vacation | purpose_wedding | term_ 36 months | term_ 60 months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 162.87 | 2 | 10 | 24000.0 | 1 | 27.65 | 0.0 | 1.0 | 3.0 | 0.0 | 13648.0 | 83.7 | 9.0 | 737.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2500.0 | 59.83 | 3 | 0 | 30000.0 | 0 | 1.00 | 0.0 | 5.0 | 3.0 | 0.0 | 1687.0 | 9.4 | 4.0 | 742.0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2400.0 | 84.33 | 3 | 10 | 12252.0 | 1 | 8.72 | 0.0 | 2.0 | 2.0 | 0.0 | 2956.0 | 98.5 | 10.0 | 737.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 10000.0 | 339.31 | 3 | 10 | 49200.0 | 1 | 20.00 | 0.0 | 1.0 | 10.0 | 0.0 | 5598.0 | 21.0 | 37.0 | 692.0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5000.0 | 156.46 | 1 | 3 | 36000.0 | 1 | 11.20 | 0.0 | 3.0 | 9.0 | 0.0 | 7963.0 | 28.3 | 12.0 | 732.0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
To wrap things up, let’s inspect our final output from this section to make sure all the features are of the same length, contain no null values, and are numeric.
Let’s use the pandas info() method to inspect the filtered_loans DataFrame:
filtered_loans.info()
Int64Index: 39177 entries, 0 to 39238
Data columns (total 39 columns):
loan_amnt 39177 non-null float64
installment 39177 non-null float64
grade 39177 non-null int64
emp_length 39177 non-null int64
annual_inc 39177 non-null float64
loan_status 39177 non-null int64
dti 39177 non-null float64
delinq_2yrs 39177 non-null float64
inq_last_6mths 39177 non-null float64
open_acc 39177 non-null float64
pub_rec 39177 non-null float64
revol_bal 39177 non-null float64
revol_util 39177 non-null float64
total_acc 39177 non-null float64
fico_average 39177 non-null float64
home_ownership_MORTGAGE 39177 non-null uint8
home_ownership_NONE 39177 non-null uint8
home_ownership_OTHER 39177 non-null uint8
home_ownership_OWN 39177 non-null uint8
home_ownership_RENT 39177 non-null uint8
verification_status_Not Verified 39177 non-null uint8
verification_status_Source Verified 39177 non-null uint8
verification_status_Verified 39177 non-null uint8
purpose_car 39177 non-null uint8
purpose_credit_card 39177 non-null uint8
purpose_debt_consolidation 39177 non-null uint8
purpose_educational 39177 non-null uint8
purpose_home_improvement 39177 non-null uint8
purpose_house 39177 non-null uint8
purpose_major_purchase 39177 non-null uint8
purpose_medical 39177 non-null uint8
purpose_moving 39177 non-null uint8
purpose_other 39177 non-null uint8
purpose_renewable_energy 39177 non-null uint8
purpose_small_business 39177 non-null uint8
purpose_vacation 39177 non-null uint8
purpose_wedding 39177 non-null uint8
term_ 36 months 39177 non-null uint8
term_ 60 months 39177 non-null uint8
dtypes: float64(12), int64(3), uint8(24)
memory usage: 5.7 MB
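The same three checks can also be made programmatically rather than by reading the info() output. The following is a minimal sketch, not part of the original walkthrough: the small filtered_loans frame is a stand-in built from a few of the columns above, and check_model_ready is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for the real filtered_loans DataFrame (assumption: only a few of
# the 39 columns are reproduced here, with values from the rows shown above).
filtered_loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "grade": [2, 3, 3],
    "loan_status": [1, 0, 1],
    "term_ 36 months": [1, 0, 1],
})


def check_model_ready(df):
    """Hypothetical helper: summarize whether a DataFrame is all-numeric and null-free."""
    # Columns whose dtype is not numeric (object, datetime, bool, ...).
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    # Number of missing values per column.
    null_counts = df.isnull().sum()
    return {
        "non_numeric_columns": non_numeric,
        "columns_with_nulls": null_counts[null_counts > 0].index.tolist(),
        "n_rows": len(df),
    }


report = check_model_ready(filtered_loans)
print(report)
```

Both lists should come back empty for a frame that is ready for modeling. Note that select_dtypes does not treat boolean columns as numeric, so dummy columns stored as bool (rather than uint8 as in the output above) would be flagged by this check.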
It is good practice to store the final output of each section or stage of your workflow in a separate CSV file. One benefit of this practice is that we can change a later step in our data processing flow without having to recalculate everything that came before it.
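As an illustration of that checkpointing habit, here is a minimal sketch using pandas to_csv and read_csv. The file name cleaned_loans_2007.csv and the temporary directory are assumptions made for the example, and the small frame stands in for the real filtered_loans.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real filtered_loans DataFrame (assumption: toy values).
filtered_loans = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "grade": [2, 3, 3],
    "loan_status": [1, 0, 1],
})

# Hypothetical output location; in a real project this would be a data folder.
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_loans_2007.csv")

# index=False keeps the row index out of the file, so reloading does not
# produce an extra unnamed column.
filtered_loans.to_csv(out_path, index=False)

# A later stage of the workflow can start from the saved file.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

One caveat: CSV files do not store dtypes, so read_csv infers them again on load. Compact types such as uint8 will generally come back as int64 unless a dtype is passed explicitly.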
In this post, we used the data dictionary provided by Lending Club, together with the first row of the loans_2007 DataFrame, to become familiar with the columns in the dataset, and we were able to remove many columns that aren’t useful for modeling. We also selected loan_status as our target column and decided to focus our modeling efforts on binary classification.
Then, we performed the last bit of data preparation needed to get the features into data types that can be fed into machine learning algorithms. We converted all columns of the object data type (categorical features) to numerical values, because scikit-learn can only work with numerical values.
Translated from: https://www.pybloggers.com/2016/12/machine-learning-walkthrough-part-one-preparing-the-data/