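Given financial data, predict whether a loan customer will become overdue.
(`status` is the label: 0 means not overdue, 1 means overdue.)
Task5 (Feature Engineering 1 - Data Preprocessing) - data type conversion, removal of useless features, missing-value handling (try different fill strategies and compare the results), and data exploration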
import numpy as np
import pandas as pd
# Load and preview the dataset
data = pd.read_csv('data.csv')
data.drop_duplicates(inplace=True)  # remove duplicate rows
y = data.status
X = data.drop('status', axis=1)
# Check whether positive and negative samples are balanced
y.value_counts()
# Inspect the variable types
set(X.dtypes)  # Output: {dtype('int64'), dtype('float64'), dtype('O')}
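The class-balance check above returns raw counts only; relative frequencies and a majority/minority ratio are often easier to read. A minimal sketch on invented toy labels (`y_demo` is a hypothetical stand-in for `y`, not data from the original dataset):

```python
import pandas as pd

# Hypothetical toy labels: 0 = not overdue, 1 = overdue
y_demo = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="status")

# Absolute counts and relative frequencies per class
class_counts = y_demo.value_counts()
class_ratio = y_demo.value_counts(normalize=True)

# Ratio of the majority class to the minority class
imbalance = class_counts.max() / class_counts.min()

print(class_ratio)
print(f"majority/minority ratio: {imbalance:.1f}")
```

A ratio far above 1 suggests that plain accuracy will be a misleading metric later on.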
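Removal criteria
1) Features with a single value;
2) Inspect the feature values and the meaning of the label, and pick out features unrelated to the prediction
1) Features with a single value - 'bank_card_no' and 'source' each take only one value, so they cannot discriminate between samples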
for col in X.columns:
    if len(X[col].unique()) == 1:
        print(col, X[col].unique())
        X.drop(col, axis=1, inplace=True)
Output:
bank_card_no ['卡号1']
source ['xs']
2) Inspect the feature values and the meaning of the label to judge relevance - 'Unnamed: 0', 'custid', 'trade_no' and 'id_name' are unrelated to the prediction
for col in X.columns:
    cnt = X[col].count()  # count() excludes missing values
    if len(list(X[col].unique())) in [cnt, cnt+1]:
        print(col)
# X['Unnamed: 0']
# X['custid']
# X['trade_no']
# X['id_name'].value_counts()
X.drop(['Unnamed: 0', 'custid', 'trade_no', 'id_name'], axis=1, inplace=True)
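Both checks above (single-valued columns and columns whose values are all distinct) can be collected into one helper that only reports candidates and leaves the decision to drop them to the reader. A minimal sketch; `find_constant_and_id_like` and the toy frame are hypothetical, not part of the original code:

```python
import pandas as pd

def find_constant_and_id_like(df):
    """Return (constant_cols, id_like_cols) as lists of column names."""
    constant_cols, id_like_cols = [], []
    for col in df.columns:
        n_non_null = df[col].count()  # count() ignores missing values
        if df[col].nunique(dropna=False) == 1:
            # Only one distinct value (NaN counted as a value)
            constant_cols.append(col)
        elif n_non_null > 0 and df[col].nunique() == n_non_null:
            # Every non-missing value is distinct: looks like an identifier
            id_like_cols.append(col)
    return constant_cols, id_like_cols

# Invented toy data
demo = pd.DataFrame({
    "custid": [101, 102, 103, 104],
    "source": ["xs", "xs", "xs", "xs"],
    "income": [3200.5, 4100.0, 2875.25, 5000.0],
    "is_high_user": [0, 1, 0, 0],
})

constant_cols, id_like_cols = find_constant_and_id_like(demo)
print(constant_cols)
print(id_like_cols)
```

Note that the continuous column `income` is flagged as ID-like too, so the second list is a set of candidates to inspect by hand, not a list to drop blindly.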
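(This part mainly deals with time features, and the time features here contain dates only)
Date feature processing steps
1) Convert float-typed dates to strings
2) Extract the date and build features such as year, month and weekday
3) Further (feature construction): use groupby to compute statistics over the features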
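Step 3 is not implemented in this post. One possible reading of it, as a rough sketch: group rows by an extracted date part and attach a group-level statistic to each row. The column `loans_count` and all values are invented for illustration:

```python
import pandas as pd

# Invented toy data; `loans_count` is a hypothetical numeric feature
demo = pd.DataFrame({
    "first_transaction_time_year": [2013, 2013, 2014, 2014, 2014],
    "loans_count": [2, 4, 1, 3, 8],
})

# Mean of the numeric feature within each year, broadcast back to every row
grouped = demo.groupby("first_transaction_time_year")["loans_count"]
demo["loans_count_year_mean"] = grouped.transform("mean")

# How far each row deviates from the mean of its own year
demo["loans_count_minus_year_mean"] = (
    demo["loans_count"] - demo["loans_count_year_mean"]
)
print(demo)
```

`transform` keeps the original row order and length, so the new columns line up with the existing frame without a merge.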
dateFeatures = ['first_transaction_time', 'latest_query_time', 'loans_latest_time']
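X_date = X[dateFeatures].copy()  # copy so the assignments below do not touch X
1) Convert float-typed dates to strings
# First fill missing values
X_date['first_transaction_time'] = X_date['first_transaction_time'].fillna(X_date['first_transaction_time'].median())
# Convert to a date string of the form yyyy-mm-dd
X_date['first_transaction_time'] = X_date['first_transaction_time'].apply(lambda x: str(x)[:4] + '-' + str(x)[4:6] + '-' + str(x)[6:8])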
2) Extract features: year, month, weekday
X_date['first_transaction_time_year'] = pd.to_datetime(X_date['first_transaction_time']).dt.year
X_date['first_transaction_time_month'] = pd.to_datetime(X_date['first_transaction_time']).dt.month
X_date['first_transaction_time_weekday'] = pd.to_datetime(X_date['first_transaction_time']).dt.weekday
X_date['latest_query_time_year'] = pd.to_datetime(X_date['latest_query_time']).dt.year
X_date['latest_query_time_month'] = pd.to_datetime(X_date['latest_query_time']).dt.month
X_date['latest_query_time_weekday'] = pd.to_datetime(X_date['latest_query_time']).dt.weekday
X_date['loans_latest_time_year'] = pd.to_datetime(X_date['loans_latest_time']).dt.year
X_date['loans_latest_time_month'] = pd.to_datetime(X_date['loans_latest_time']).dt.month
X_date['loans_latest_time_weekday'] = pd.to_datetime(X_date['loans_latest_time']).dt.weekday
# Fill missing values with the median (assign the result back instead of using inplace on a column selection)
for col in ['latest_query_time', 'loans_latest_time']:
    for part in ['year', 'month', 'weekday']:
        name = col + '_' + part
        X_date[name] = X_date[name].fillna(X_date[name].median())
X_date.drop(dateFeatures, axis = 1, inplace=True)
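The three date columns go through the same extract-then-fill steps, so the logic can be folded into small helpers. A minimal sketch, assuming the dates arrive as yyyymmdd floats; `yyyymmdd_to_datetime` and `date_parts` are hypothetical names and the toy values are invented:

```python
import numpy as np
import pandas as pd

def yyyymmdd_to_datetime(s):
    """Parse floats such as 20130817.0 into datetimes; missing values become NaT."""
    text = s.map(lambda v: str(int(v)) if pd.notna(v) else None)
    return pd.to_datetime(text, format="%Y%m%d", errors="coerce")

def date_parts(parsed, prefix):
    """Build year / month / weekday columns from a datetime Series."""
    return pd.DataFrame({
        f"{prefix}_year": parsed.dt.year,
        f"{prefix}_month": parsed.dt.month,
        f"{prefix}_weekday": parsed.dt.weekday,  # Monday = 0
    })

# Invented toy column with one missing value
raw = pd.Series([20130817.0, np.nan, 20140101.0, 20140305.0])
parsed = yyyymmdd_to_datetime(raw)
parts = date_parts(parsed, "first_transaction_time")

# Median fill of the extracted parts, as in the code above
filled = parts.fillna(parts.median())
print(filled)
```

Filling after extraction avoids taking the median of raw yyyymmdd numbers, which is not guaranteed to be a valid calendar date.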
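1) Encode string-typed categorical features
2) Fill missing values
3) Label and One-Hot encoding of categorical features
Common ways to fill missing values in categorical features: binning (treat the missing values as a separate category) and mode filling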
# Inspect the values and column names to pick out categorical features
for col in X:
    cnt = len(X[col].unique())
    if cnt < 15:
        print(col, cnt, X[col].unique())
Output
regional_mobility 6 [ 3. 4. 1. 2. 5. nan]
student_feature 3 [nan 1. 2.]
is_high_user 2 [0 1]
avg_consume_less_12_valid_month 13 [ 7. 5. 6. 8. 9. 3. 4. 11. 10. 0. 2. 1. nan]
top_trans_count_last_1_month 9 [0.15 0.05 0.65 1. 0.1 0.3 0.4 0.2 nan]
reg_preference_for_trad 6 ['一线城市' '三线城市' '境外' '二线城市' '其他城市' nan]
railway_consume_count_last_12_month 7 [ 0. 1. 2. 4. nan 3. 30.]
jewelry_consume_count_last_6_month 8 [ 0. 1. nan 2. 6. 3. 4. 5.]
categoryFeatures = ['regional_mobility', 'student_feature', 'is_high_user', 'avg_consume_less_12_valid_month', 'reg_preference_for_trad']
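X_cate = X[categoryFeatures].copy()  # copy so the assignments below do not touch X
1) Encode string-typed categorical features
dic = {}
# unique() includes NaN, so NaN also receives a code of its own in dic
for i, val in enumerate(list(X_cate['reg_preference_for_trad'].unique())):
    dic[val] = i
X_cate['reg_preference_for_trad'] = X_cate['reg_preference_for_trad'].map(dic)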
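2) Fill missing values: separate category / mode filling
More than half of the student_feature values are missing, so its missing values are filled as a separate category (using -1).
X_cate['student_feature'].value_counts()
X_cate['student_feature'] = X_cate['student_feature'].fillna(-1)
The other features have few missing values, so mode filling is used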
for col in X_cate.columns:
    summ = X_cate[col].isnull().sum()
    if summ:
        X_cate[col] = X_cate[col].fillna(X_cate[col].mode()[0])
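The two strategies used above (a separate category when most values are missing, the mode otherwise) can be expressed as one rule keyed on the missing ratio. A minimal sketch; `fill_categorical`, the 0.5 threshold and the toy series are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def fill_categorical(s, sentinel=-1, threshold=0.5):
    """Fill with a sentinel category if more than `threshold` of the values
    are missing, otherwise fill with the most frequent value."""
    missing_ratio = s.isna().mean()
    if missing_ratio > threshold:
        return s.fillna(sentinel)
    return s.fillna(s.mode()[0])

# Invented toy columns
student = pd.Series([np.nan, 1.0, np.nan, np.nan, 2.0])  # 60% missing
region = pd.Series([3.0, 4.0, 3.0, np.nan, 1.0])         # 20% missing

student_filled = fill_categorical(student)
region_filled = fill_categorical(region)
print(student_filled.tolist())
print(region_filled.tolist())
```

Keeping heavily missing columns as their own category preserves the information that the value was absent, which a mode fill would erase.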
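3) Label and One-Hot encoding of categorical features
Encoding | Description | Pros | Cons | When to use |
---|---|---|---|---|
Label encoding | Assigns a number to each category, turning the categorical variable into a numeric one | (omitted) | The assigned numbers carry no numeric meaning | |
One-hot encoding | Values are only 0/1; each category is stored in its own orthogonal dimension | xx | The feature space becomes very large when there are many categories | ① Use one-hot for every model except tree models, because label codes carry no numeric meaning |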
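
The difference between the two encodings in the table is easy to see on a small column. A minimal sketch using `pd.factorize` and `pd.get_dummies`; the English category names are invented stand-ins for the real values of `reg_preference_for_trad`:

```python
import pandas as pd

# Invented toy column
city = pd.Series(["tier-1", "tier-3", "overseas", "tier-1"],
                 name="reg_preference_for_trad")

# Label encoding: one integer per category, in order of first appearance
codes, categories = pd.factorize(city)

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(city, prefix="reg", dtype=int)

print(codes)
print(list(categories))
print(one_hot)
```

The integer codes imply an order (0 < 1 < 2) that the categories do not have, which is why linear and distance-based models usually get the one-hot columns instead.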
No additional non-numeric features need to be handled here ~
X_str = X.select_dtypes(include=['O']).copy()
X_str.head()
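1) Missing-value handling
2) Drop features with little variation: compute the standard deviation of each column and remove features whose standard deviation is below 0.1
Common way to fill missing values in continuous features: median filling; the mean is generally not used (it is affected too strongly by extreme values).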
X_num = X.select_dtypes(exclude=['O']).copy()
for col in X_num.columns:
    if col in dateFeatures + categoryFeatures:
        print(col)
        X_num.drop(col, axis=1, inplace=True)
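1) Missing-value handling
The main fill methods are the mode, the median, model-based filling and so on; the mean is generally not used (it is affected too strongly by extreme values).
# Compute the percentage of missing values in each column
col_missing = {}
for col in X_num.columns:
    summ = X_num[col].isnull().sum()
    if summ:
        col_missing[col] = float('%.4f' % (summ * 100 / len(data)))
col_missing = sorted(col_missing.items(), key=lambda d: -d[1])
for col, rate in col_missing[:10]:
    print(rate, '%', col)
Missing values are filled with the median.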
for col in X_num.columns:
    summ = X_num[col].isnull().sum()
    if summ:
        X_num[col] = X_num[col].fillna(X_num[col].median())
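The claim that the mean is distorted by extreme values can be checked on a tiny skewed column. A minimal sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# Invented toy column: three ordinary values, one extreme value, one missing
amount = pd.Series([1.0, 2.0, 3.0, np.nan, 1000.0])

mean_value = amount.mean()      # pulled towards the outlier
median_value = amount.median()  # stays near the bulk of the data

filled_with_mean = amount.fillna(mean_value)
filled_with_median = amount.fillna(median_value)

print("mean fill:  ", filled_with_mean[3])
print("median fill:", filled_with_median[3])
```

Here the mean fill lands far from any typical value, while the median fill stays between the ordinary observations.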
2) Drop features with little variation
for col in X_num.columns:
    rate = X_num[col].std()
    if rate < 0.1:
        print(col, rate)
        X_num.drop(col, axis=1, inplace=True)
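The same filter can be written without dropping columns inside the loop, by first collecting the low-variation column names. A minimal sketch; `low_std_columns` and the toy frame are hypothetical, and the 0.1 threshold is the one used above:

```python
import pandas as pd

def low_std_columns(df, threshold=0.1):
    """Return the names of columns whose standard deviation is below threshold."""
    stds = df.std()
    return stds[stds < threshold].index.tolist()

# Invented toy data: `a` barely varies, `b` varies a lot
demo = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 1.01],
    "b": [0.0, 10.0, 20.0, 30.0],
})

to_drop = low_std_columns(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Because the standard deviation depends on the unit of each column, a fixed threshold such as 0.1 treats a feature measured in thousands differently from one measured in fractions; that is worth keeping in mind when reading the result.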
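3) Normalization (shown only as an example; this code was not actually run)
Normalization can speed up the convergence of gradient descent and may also help accuracy
# Min-max normalization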
# X_num = X_num.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
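# The commented-out one-liner above divides by (max - min), which is zero for a
# constant column. A minimal sketch of a guarded version follows; `min_max_scale`
# and the toy frame are hypothetical and were not part of the original run.

```python
import pandas as pd

def min_max_scale(df):
    """Scale every column to [0, 1]; constant columns become all zeros."""
    col_min = df.min()
    col_range = df.max() - col_min
    col_range = col_range.replace(0, 1)  # avoid division by zero
    return (df - col_min) / col_range

# Invented toy data: `x` varies, `c` is constant
demo = pd.DataFrame({"x": [0, 5, 10], "c": [7, 7, 7]})
scaled = min_max_scale(demo)
print(scaled)
```

In a train/test setting the minimum and range would normally be computed on the training split only and then reused on the test split.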
X = pd.concat([X_date, X_cate, X_num], axis=1)
import pickle
with open('feature.pkl', 'wb') as f:
    pickle.dump(X, f)
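A later task would read the saved features back. A minimal round-trip sketch that writes to a temporary directory so it does not overwrite `feature.pkl`; the file name `feature_demo.pkl` and the toy frame are invented:

```python
import os
import pickle
import tempfile

import pandas as pd

# Invented stand-in for the assembled feature matrix
features = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Write to a temporary location, then read it back
path = os.path.join(tempfile.mkdtemp(), "feature_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(features, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.equals(features))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.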
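If a feature has as many distinct values as there are rows, is it necessarily unrelated to the prediction? What if it is something like a continuous feature?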
The code is available on GitHub: https://github.com/libihan/Exercise-ML