Home Credit Default Risk Competition Notes

2018/7/12

1. Description

Home Credit uses alternative data, including telecom and other transactional data, to predict a client's repayment ability (as a probability).

2. Evaluation

Area under the ROC curve (AUC)
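
To make the metric concrete, here is a minimal sketch of AUC read as the share of (defaulter, non-defaulter) pairs in which the defaulter gets the higher score. The helper name `roc_auc` and the toy labels and scores are made up for illustration; they are not from the competition data. In practice a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead of this quadratic loop.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels (1 = default) and predicted probabilities, made up for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

Because only the ranking of the scores matters, the predicted probabilities do not need to be calibrated to score well on this metric.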

3. Data

application.csv

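Gender, car ownership, number of children, income, price of the goods for consumer loans, loan credit amount, loan annuity, who accompanied the client when applying, income type, education level, family status, housing type, and the population of the client's region of residence.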

2018/7/14

EDA

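Visual exploratory analysis of application_train.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: application_train.csv from the competition data is in the working directory.
app_train = pd.read_csv('application_train.csv')

"""
Look at the distribution of the labels
"""
app_train['TARGET'].value_counts()
"""
The histogram shows that the classes are imbalanced
"""
app_train['TARGET'].astype(int).plot.hist();
plt.show()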

[Figure 1: histogram of TARGET values]
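
To put a number on the imbalance rather than reading it off a histogram, a minimal sketch using only the standard library is shown here. The helper `class_shares` and the toy labels are made up for illustration and do not reflect the real class proportions. On the actual dataframe, `app_train['TARGET'].value_counts(normalize=True)` should give the same kind of shares.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels, made up for illustration: 92 repaid (0) and 8 defaulted (1).
toy_labels = [0] * 92 + [1] * 8
shares = class_shares(toy_labels)
print(shares)

# Accuracy of a model that always predicts the majority class.
majority_baseline_accuracy = max(shares.values())
print(majority_baseline_accuracy)
```

A high majority-class baseline like this is one reason a ranking metric is more informative than plain accuracy on imbalanced labels.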

 

"""
Check missing values
"""
def missing_values_table(df):
    # Number of missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Combine both into one table and name the columns
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print a short summary
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(40)

Correlation between the features and the target

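# Find correlations with the target and sort
# numeric_only=True (pandas >= 1.5) skips non-numeric columns, which would otherwise raise an error in pandas 2.x
correlations = app_train.corr(numeric_only=True)['TARGET'].sort_values()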

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
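
For reference, pandas' `corr()` computes the Pearson coefficient by default. A minimal sketch of that formula, with a hypothetical helper `pearson` and made-up toy data (one feature and a binary target), looks like this:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data, made up for illustration.
toy_feature = [1, 2, 3, 4]
toy_target = [0, 0, 1, 1]
r = pearson(toy_feature, toy_target)
print(r)
```

The sketch does not handle constant columns, where the denominator is zero. Note also that this coefficient only measures linear association, so a feature with a weak value here can still be useful to a non-linear model.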

Bin age into intervals and compute statistics per bin

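# Age information into a separate dataframe (copy to avoid SettingWithCopyWarning)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
# DAYS_BIRTH is stored as negative days relative to the application date, so take the absolute value
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'].abs() / 365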

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)

# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups

plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

[Figure 2: bar chart of failure-to-repay rate by age group]

Feature engineering uses PolynomialFeatures.
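
The notes do not say which columns are expanded or with what degree, so the following is only a sketch of what a degree-2 polynomial expansion produces. The helper `polynomial_features` is hypothetical and hand-rolled with the standard library. It is meant to illustrate the terms that scikit-learn's `PolynomialFeatures(degree=2)` generates with its default settings (a constant, the original features, their squares, and pairwise products), not to replace it.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2):
    """Expand one row of numeric features into all monomials up to `degree`.

    The first output is the constant 1 (the empty product).
    """
    expanded = []
    for d in range(degree + 1):
        # Every multiset of d feature indices corresponds to one monomial.
        for idx in combinations_with_replacement(range(len(row)), d):
            expanded.append(prod(row[i] for i in idx))
    return expanded


# Toy row with two features a = 2 and b = 3, made up for illustration.
toy_row = [2, 3]
features = polynomial_features(toy_row, degree=2)
print(features)  # terms in the order 1, a, b, a^2, a*b, b^2
```

The number of generated terms grows quickly with the number of input features and the degree, so this kind of expansion is usually applied to a small set of selected columns.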

 

 
