Tianchi Learning Competition: Financial Risk Control, Loan Default Prediction (02)

Exploratory data analysis (EDA) reveals how each variable is distributed, which helps us understand the data and guides both preprocessing and feature engineering, ultimately leading to a more accurate model. In this post we run an EDA on the loan default prediction data and see what it has to tell us.
Import the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling

import warnings
warnings.filterwarnings("ignore")

Load the data

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("testA.csv")

Check the number of rows and features in the training and test sets:

print(train_df.shape)
print(test_df.shape)

Print summaries of the training and test sets:

print(train_df.info())
print("---" * 17)
print(test_df.info())

Check the missing-value ratio of each feature in the training and test sets:

(train_df.isnull().sum()/len(train_df)).sort_values(ascending=False)[:25]
(test_df.isnull().sum()/len(test_df)).sort_values(ascending = False)[:20]

Visualize the training-set features that contain missing values:

missing = train_df.isnull().sum()/len(train_df)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot(kind = "bar", figsize = (12, 6))
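Once the missing ratios are known, a common follow-up is simple imputation. The sketch below is one minimal approach, not part of the original pipeline: it fills numeric columns with the median and object columns with the mode, on a toy frame whose column names mirror the dataset but whose values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values here are made up for illustration.
df = pd.DataFrame({"loanAmnt": [1000.0, np.nan, 3000.0],
                   "grade": ["A", None, "B"]})

# Numeric columns: fill with the median; object columns: fill with the mode.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
```

Whether median/mode imputation is appropriate depends on the model; tree-based models such as LightGBM can often handle NaN directly.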

Check which features are numeric and which are categorical:

numerical_fea = list(train_df.select_dtypes(exclude=['object']).columns)
category_fea = list(train_df.select_dtypes(include=['object']).columns)
print(numerical_fea)
print('--------'*9)
print(category_fea)

Look at the distribution of the categorical variable grade:

train_df['grade'].value_counts()
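Since grade is ordinal (A is the best rating, G the worst), a natural encoding for later feature engineering is a letter-to-integer map that preserves the order. A minimal sketch on a toy series standing in for train_df['grade']:

```python
import pandas as pd

# Toy series standing in for train_df['grade'].
grade = pd.Series(["A", "C", "B", "E"])

# A..G -> 1..7, preserving the ordinal ranking.
grade_map = {g: i for i, g in enumerate("ABCDEFG", start=1)}
grade_num = grade.map(grade_map)
```

Unlike one-hot encoding, this keeps the ranking information in a single column.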

Among the numeric features, separate the continuous variables from the discrete ones (here, a feature with at most 10 unique values is treated as discrete):

def get_numerical_serial_fea(data, feas):
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea

numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(train_df,numerical_fea)

print(numerical_serial_fea)
print('------'*18)
print(numerical_noserial_fea)

Take term and homeOwnership as examples and look at the distributions of discrete variables:

print(train_df['term'].value_counts())  # feature: term
print("------" * 4)
print(train_df['homeOwnership'].value_counts())  # feature: homeOwnership
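For a nominal discrete feature such as homeOwnership, one-hot encoding with pd.get_dummies is a common preprocessing choice later on. A sketch on made-up values standing in for train_df['homeOwnership']:

```python
import pandas as pd

# Toy column standing in for train_df['homeOwnership'] (integer-coded categories).
home = pd.Series([0, 1, 2, 1], name="homeOwnership")

# One indicator column per category.
dummies = pd.get_dummies(home, prefix="homeOwnership")
```

Each row then has exactly one active indicator, so no spurious ordering is imposed on the categories.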

Plot the distribution of every continuous variable:

f = pd.melt(train_df, value_vars = numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.histplot, "value", kde=True)  # distplot is removed in newer seaborn
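Many of these continuous loan features turn out to be right-skewed; pandas' .skew() quantifies this, and a log transform (as applied to loanAmnt further below) can make the distribution far more symmetric. A toy illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values; real features such as loanAmnt behave similarly.
x = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0])

skew_before = x.skew()          # strongly positive: long right tail
skew_after = np.log(x).skew()   # ~0: the log values are evenly spaced
```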

Visualize the distribution of the categorical variable employmentLength:

plt.figure(figsize=(12, 8))
emp_counts = train_df["employmentLength"].value_counts(dropna=False)[:20]
sns.barplot(x=emp_counts.values, y=emp_counts.index)  # keyword args required in newer seaborn
plt.show()
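employmentLength is stored as strings, so a frequent cleaning step is converting it to numbers. The sketch below assumes LendingClub-style values ('< 1 year', '1 year', ..., '10+ years'), which is what this dataset uses; the toy series stands in for the real column.

```python
import numpy as np
import pandas as pd

# Toy values standing in for train_df['employmentLength'].
emp = pd.Series(["< 1 year", "2 years", "10+ years", np.nan])

def emp_length_to_int(s):
    """Map '< 1 year' -> 0, '10+ years' -> 10, 'n years' -> n; keep NaN as-is."""
    if pd.isnull(s):
        return s
    if s == "< 1 year":
        return 0
    if s == "10+ years":
        return 10
    return int(s.split()[0])

emp_num = emp.apply(emp_length_to_int)
```

Capping at 10 loses the distinction beyond ten years, but that is exactly how the raw strings encode it.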

Take loanAmnt as an example and compare a continuous variable's distribution across the two labels:

fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
train_df.loc[train_df['isDefault'] == 1]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Fraud', color='r', xlim=(-3, 10), ax=ax1)
train_df.loc[train_df['isDefault'] == 0]['loanAmnt'].apply(np.log).plot(
    kind='hist', bins=100, title='Log Loan Amt - Not Fraud', color='b', xlim=(-3, 10), ax=ax2)
total = len(train_df)
total_amt = train_df.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12,5))
plt.subplot(121)
plot_tr = sns.countplot(x='isDefault',data=train_df)
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
    height = p.get_height()
    plot_tr.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/total*100),
            ha="center", fontsize=15) 
    
percent_amt = (train_df.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt  \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
    height = p.get_height()
    plot_tr_2.text(p.get_x()+p.get_width()/2.,height + 3,'{:1.2f}%'.format(
    height/total_amt * 100), ha="center", fontsize=15)

Process the date feature and look at its distribution.
After conversion, the new feature issueDateDT holds the number of days between each record's issueDate and the earliest date in the dataset (2007-06-01).

import datetime
train_df['issueDate'] = pd.to_datetime(train_df['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
train_df['issueDateDT'] = train_df['issueDate'].apply(lambda x: x-startdate).dt.days
# Convert the test set the same way
test_df['issueDate'] = pd.to_datetime(test_df['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
test_df['issueDateDT'] = test_df['issueDate'].apply(lambda x: x-startdate).dt.days
plt.figure(figsize = (10, 6))
plt.hist(train_df['issueDateDT'], label='train')
plt.hist(test_df['issueDateDT'], label='test')
plt.legend()
plt.title('Distribution of issueDateDT dates')

The plot above shows that the issueDateDT ranges of the training and test sets overlap heavily, so a time-based split would not be a sensible validation scheme.
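The overlap claim can also be checked numerically by comparing the issueDate ranges of the two sets. A sketch with made-up dates (on the real data, substitute train_df['issueDate'] and test_df['issueDate']):

```python
import pandas as pd

# Made-up date ranges standing in for the real issueDate columns.
train_dates = pd.to_datetime(pd.Series(["2010-01-01", "2015-06-01"]))
test_dates = pd.to_datetime(pd.Series(["2012-03-01", "2014-09-01"]))

# Two ranges overlap when neither one ends before the other begins.
overlap = (train_dates.min() <= test_dates.max()) and (test_dates.min() <= train_dates.max())
```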
That essentially wraps up the exploratory analysis.
Finally, there is an excellent Python library, pandas_profiling, that makes it easy to get a quick overview of the data: built on top of the pandas DataFrame, it generates a complete EDA report with a single line of code. For example:

pfr = pandas_profiling.ProfileReport(train_df)
pfr
