Lending Club Loan Data Analysis: Data Analysis (Part 1)

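Data Analysis

Continuing from the previous post,
this part runs a simple exploratory analysis of several aspects of the dataset.
The main areas are:

Basic loan overview

Borrower profile

Platform business analysis

Main aspects of the data analysis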

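Conclusions first:

Basic loan overview:

Good loans dominate, at 92.46% of all loans, while bad loans make up less than 8%, which is an encouraging picture;

Lending Club grew rapidly between 2011 and 2015, and the number of bad loans fell markedly in 2015, which suggests the platform had started to take risk control seriously;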

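Borrower profile 1:

Borrowers are concentrated in California, where Lending Club is headquartered and has developed its local business most deeply; New York and Texas follow;

The most common job titles are teacher and manager;

A large share of borrowers have worked for 10 or more years; the rest are spread very evenly across 1 to 9 years, with similar counts for each;

The vast majority of borrowers earn more than 20,000 USD a year, and more than 50% earn above 60,000 USD;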

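Borrower profile 2:

Most borrowers take out loans for debt consolidation (borrowing new debt to repay old debt) or to pay off credit cards;

More than 50% of borrowers are still paying a mortgage, 40% rent, and fewer than 10% own their home outright;

The vast majority of borrowers have fewer than three derogatory public records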

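Platform business analysis

The average loan amount per borrower rises year by year, from under 10,000 USD in 2009 to 16,000 USD in 2015. Total lending climbs steeply as well, from under 1 billion in 2012 to nearly 6 billion in 2015.

The higher the credit grade, the lower the interest rate; the higher the credit grade, the lower the bad-loan rate

Most borrowers have a DTI below 35% and face relatively light repayment pressure; a small share reach a DTI of 40%, face heavy repayment pressure, and carry a risk of going bad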

1. Loan quality

loanData.loan_status.value_counts()
# Current                                                558269
# Fully Paid                                             197119
# Charged Off                                             41288
# Late (31-120 days)                                      10683
# In Grace Period                                          5778
# Late (16-30 days)                                        2155
# Does not meet the credit policy. Status:Fully Paid       1862
# Default                                                  1131
# Does not meet the credit policy. Status:Charged Off       697
# Issued   40

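To simplify loan quality, treat every loan more than 15 days overdue as a bad loan. In the code below, any status that is not in the good_loan list (including charged-off, defaulted, and grace-period loans) is labeled bad.

good_loan = [
    'Current', 'Fully Paid',
    'Does not meet the credit policy. Status:Fully Paid',
    'Issued'  # no trailing space, so the string matches the status exactly
]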


def loan_condition(status):
    if status in good_loan:
        return 'good_loan'
    else:
        return 'bad_loan'


loanData['loan_condition'] = loanData.loan_status.apply(loan_condition)
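
As a quick sanity check on the labeling above, the share of each class can be read off with value_counts(normalize=True). The sketch below is only an illustration: it uses a tiny made-up frame called toy (not the real Lending Club data) and assumes the same status strings as the listing above. It also uses isin plus map as a vectorized alternative to applying loan_condition row by row.

```python
import pandas as pd

# Toy stand-in for loanData; these rows are invented for illustration only.
toy = pd.DataFrame({
    'loan_status': ['Current', 'Fully Paid', 'Charged Off', 'Issued',
                    'Late (31-120 days)', 'Current', 'In Grace Period',
                    'Fully Paid']
})

good_statuses = {
    'Current', 'Fully Paid',
    'Does not meet the credit policy. Status:Fully Paid', 'Issued',
}

# Vectorized labeling: True -> good_loan, False -> bad_loan.
toy['loan_condition'] = toy.loan_status.isin(good_statuses).map(
    {True: 'good_loan', False: 'bad_loan'})

# Percentage of loans in each class.
share = toy.loan_condition.value_counts(normalize=True) * 100
print(share.round(2))
```

On the real data the same two lines at the end should reproduce the percentages shown in the pie chart.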

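Loan quality and loan volume

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# loanData is the DataFrame loaded in the previous post.
# Convert the issue date to a datetime and extract the year
loanData['issue_d'] = pd.to_datetime(loanData.issue_d)
loanData['issue_year'] = loanData.issue_d.dt.year

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Share of good vs. bad loans
loanData.loan_condition.value_counts().plot.pie(
    autopct='%1.2f%%', ax=ax1, fontsize=12, startangle=70)
ax1.set_title('GOOD OR BAD')
ax1.set_ylabel("% of Loan Condition")
ax1.legend()
# The two classes are heavily imbalanced, which will be a major problem for modeling later

# Loans issued per year, split by loan condition.
# The estimator returns each group's share of all loans (a count-based percentage).
sns.barplot(
    x='issue_year',
    y='loan_amnt',
    data=loanData,
    hue='loan_condition',
    estimator=lambda x: len(x) / len(loanData) * 100,
    ax=ax2)
ax2.set_title('Loan Amount by Year')
ax2.set_ylabel('%')
ax2.set_xlabel('Issue Year')
ax2.legend()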

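Loan quality and percentage of loans by year

Bad loans account for less than 8% of all loans, but the amount of money involved is still substantial.
After 2011, lending volume surges every year.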
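
The right-hand panel counts loans rather than dollars, so the money tied up in bad loans is easier to see from a grouped sum of loan_amnt. A minimal sketch, assuming the issue_year, loan_condition, and loan_amnt columns used above; the toy frame is invented for illustration and does not reflect the real figures.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'issue_year': [2014, 2014, 2014, 2015, 2015, 2015],
    'loan_condition': ['good_loan', 'bad_loan', 'good_loan',
                       'good_loan', 'good_loan', 'bad_loan'],
    'loan_amnt': [10000, 5000, 15000, 20000, 12000, 8000],
})

# Total amount per year and loan condition, one column per condition.
amount_by_year = (toy.groupby(['issue_year', 'loan_condition'])['loan_amnt']
                  .sum()
                  .unstack(fill_value=0))
print(amount_by_year)
```

Calling amount_by_year.plot(kind='bar', stacked=True) on the result would give a dollar-based counterpart to the count-based chart.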

Number of borrowers

f1, (ax3, ax4) = plt.subplots(1, 2, figsize=(16, 6))
day_dist = loanData.groupby(['issue_d']).size()
day_dist.plot(ax=ax3)
ax3.set_title('Amount of Borrowers by Day')
ax3.set_ylabel('Amount of borrowers')
ax3.set_xlabel('Time')
year_dist = loanData.groupby(['issue_year']).size()
year_dist.plot(kind='bar', ax=ax4)
ax4.set_title('Amount of Borrowers by Year')
ax4.set_ylabel('Amount of borrowers')
ax4.set_xlabel('Time')

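The number of borrowers rises sharply year over year

After 2012 Lending Club grew rapidly and its customer base expanded quickly; despite some fluctuation, the overall trend is upward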

2. Borrower profile 1

1-1 Geographic distribution

loanData.addr_state.value_counts()[:20].plot(kind='bar', figsize=(8, 4))

Borrower counts do not differ much from state to state

1-2 Top 20 job titles

# Borrower profile 1-2: top 20 job titles
loanData.emp_title.value_counts()[:20].plot(kind='bar', figsize=(8, 4))

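Job title distribution

Borrowers come from all walks of life; somewhat surprisingly, teachers are the largest group, followed by managers.

1-3: Employment length distribution

# Borrower profile 1-3: employment length distribution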
loanData.emp_length.value_counts().plot(kind='bar')

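Employment length distribution

Is it easier to get a loan the longer you have worked? The distribution suggests so

1-4: Annual income (USD) distribution

Annual income is split roughly into three bands:
20,000 or below counts as low income, 20,000 to 60,000 as middle income, and above 60,000 as high income

# Borrower profile 1-4: annual income distribution
def inc_strata(income):
    if income <= 20000:
        return 'low'
    elif income <= 60000:  # 20000 < income <= 60000
        return 'mid'
    else:
        return 'high'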


loanData['inc_strata'] = loanData.annual_inc.apply(inc_strata)
loanData.inc_strata.value_counts().plot(kind='bar')

Annual income distribution

Most borrowers earn more than 20,000 a year
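
The same three income bands can also be produced without a row-by-row function. The sketch below uses pd.cut with the 20,000 and 60,000 cut points from the text; the incomes series is made up for illustration, and the variable names are not from the original post.

```python
import pandas as pd

# Made-up incomes chosen to sit on and around the band edges.
incomes = pd.Series([15000, 20000, 20001, 60000, 60001, 120000])

# Bins are right-inclusive by default: (-inf, 20000], (20000, 60000], (60000, inf).
strata = pd.cut(
    incomes,
    bins=[float('-inf'), 20000, 60000, float('inf')],
    labels=['low', 'mid', 'high'],
)
print(strata.value_counts())
```

One difference worth knowing: pd.cut leaves a missing income as NaN, whereas the if/else function sends it to the final branch, because every comparison with NaN is False.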

# Loan quality vs. annual income
sns.countplot(x='inc_strata', data=loanData, hue='loan_condition')

Loan quality vs. annual income

The middle-income group has the largest number of bad loans
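
A raw count of bad loans partly reflects how many borrowers fall in each band, so it can help to look at the bad-loan rate per band alongside the count. A minimal sketch on invented toy data, assuming the inc_strata and loan_condition columns created earlier:

```python
import pandas as pd

# Toy data, invented for illustration; not drawn from the Lending Club dataset.
toy = pd.DataFrame({
    'inc_strata': ['low'] * 2 + ['mid'] * 8 + ['high'] * 5,
    'loan_condition': (['bad_loan', 'good_loan']
                       + ['bad_loan'] * 2 + ['good_loan'] * 6
                       + ['bad_loan'] + ['good_loan'] * 4),
})

is_bad = toy.loan_condition == 'bad_loan'

# Count and rate of bad loans within each income band.
summary = pd.DataFrame({
    'bad_count': is_bad.groupby(toy.inc_strata).sum(),
    'bad_rate': is_bad.groupby(toy.inc_strata).mean(),
})
print(summary)
```

In this toy example the mid band has the most bad loans by count while the low band has the highest rate, which is the kind of gap the countplot alone cannot show.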

2-1 Loan purpose distribution

# Borrower profile 2-1: loan purpose distribution
loanData.purpose.value_counts().plot(kind='barh')

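Loan purpose

Borrowers mainly take out loans for debt consolidation and credit card repayment. Debt consolidation means taking a new loan to pay off existing debts, so in practice it overlaps heavily with credit card repayment

2-2 Home ownership distribution

# Borrower profile 2-2: home ownership distribution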
loanData.home_ownership.value_counts().plot.pie(
    autopct='%.3f%%', figsize=(5, 5))

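Home ownership

Half of the borrowers have a mortgage and about 40% rent. Fewer than 10% own their home

2-3 Number of derogatory public records

# Borrower profile 2-3: number of derogatory public records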
loanData.pub_rec.value_counts()[:3].plot(kind='bar')

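Number of derogatory public records

It appears that people with derogatory records find it hard to get a loan approved

Business analysis

1-1 Loan volume detail: average loan amount per borrower by year, total loan amount by year

# Business analysis 1-1 Loan volume detail: average loan amount per borrower by year, total loan amount by year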
f1, (ax4, ax5) = plt.subplots(1, 2, figsize=(16, 6))
loanData.groupby(['issue_year'])['loan_amnt'].mean().plot(kind='bar', ax=ax4)
ax4.set_xlabel('Year')
ax4.set_ylabel('Loan Amount per Capita')
loanData.groupby(['issue_year'])['loan_amnt'].sum().plot(kind='bar', ax=ax5)
ax5.set_xlabel('Year')
ax5.set_ylabel('Total Loan Amount')

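Loan volume detail

LC grew rapidly from 2012 to 2015 and was able to lend more and more money

1-2 Average interest rate vs. credit grade, and loan condition vs. credit grade

# Business analysis 1-2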

f2, (ax6, ax7) = plt.subplots(1, 2, figsize=(20, 6))
groupby_grade = loanData.groupby(['grade'])
groupby_grade['int_rate'].mean().plot(kind='bar', ax=ax6)
ax6.set_title('Interest Rate vs Grade')
ax6.set_ylabel('Interest Rate')
ax6.set_xlabel('Grade')
#
sns.countplot(x='grade', data=loanData, hue='loan_condition', ax=ax7)
ax7.set_title('Amount of Borrower vs Grade')
ax7.set_ylabel('Amount of Borrower')

Credit grade makes a big difference

The lower the credit grade, the higher the interest rate
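
The countplot above shows how many good and bad loans each grade has, but not the proportion. To compare grades directly, a row-normalized cross-tabulation gives the good and bad shares within each grade. This is a sketch on made-up data, assuming the grade and loan_condition columns used above:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'grade': ['A'] * 4 + ['C'] * 4,
    'loan_condition': ['good_loan', 'good_loan', 'good_loan', 'bad_loan',
                       'good_loan', 'bad_loan', 'good_loan', 'bad_loan'],
})

# normalize='index' makes each row (grade) sum to 1.
rates = pd.crosstab(toy.grade, toy.loan_condition, normalize='index')
print(rates)
```

On the real data, rates['bad_loan'].plot(kind='bar') would show how the bad-loan rate moves across grades.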

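1-3 DTI distribution

DTI (debt-to-income ratio): monthly debt payments as a share of monthly income

# Business analysis 1-3 DTI distribution
# DTI: monthly debt payments as a share of monthly income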
f3, ax8 = plt.subplots(1, 1, figsize=(8, 4))
loanData.dti.plot(kind='hist', bins=100, ax=ax8)
ax8.set_xlim(left=0, right=50)

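DTI distribution

Most borrowers have a DTI below 35%, which suggests repayment pressure is modest
A small share of borrowers reach a DTI of 45%, which carries risk
In the later feature engineering, DTI will be split into two classes at the 35% threshold
A very small number of loans lie beyond the right edge of the plot (the x-axis is clipped at 50); these are essentially very high-risk loans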
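
The 35% split mentioned above could look something like the following. This is only a sketch: the feature name high_dti is hypothetical, the dti values are made up, and whether exactly 35 belongs to the high class is an assumption made here, since the text does not say.

```python
import pandas as pd

# Made-up DTI values (in percent) for illustration only.
dti = pd.Series([5.0, 18.2, 34.9, 35.0, 41.3, 120.0])

# Hypothetical binary feature: 1 if DTI is at or above 35, else 0.
high_dti = (dti >= 35).astype(int)

# Share of loans hidden by the histogram's x-axis limit of 50.
beyond_plot = (dti > 50).mean()

print(high_dti.tolist())
print(beyond_plot)
```

Checking beyond_plot on the real data would quantify how small the group to the right of the plot actually is.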

1-4 Loan term distribution

# Business analysis 1-4 Loan term distribution
sns.countplot(x='term', data=loanData)

Loan term

The LC platform is dominated by short-term loans, though the share of long-term loans is not small either
