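The code below is generally meant to be run in a Jupyter notebook; any exceptions are noted.
These functions will later be wrapped up so they are easier to call and maintain.
1. %%time shows the execution time of a cell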
%%time
train = pd.read_table("filename")
CPU times: user 7.78 s, sys: 606 ms, total: 8.39 s
Wall time: 8.43 s
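%%time is an IPython cell magic, so it only works inside a notebook or an IPython session. Outside of one, a small context manager can give a similar wall-clock reading. This is a minimal sketch; the name `timer` is hypothetical and not part of the original notes, and it reports wall time only, not the CPU times that %%time also prints.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label="block"):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        # Store the elapsed time so the caller can also read it afterwards
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")


with timer("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```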
2. Show the distribution of target (binary classification) and plot it
train['target'].value_counts()
train['target'].astype(int).plot.hist()
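For a binary target it is often handy to look at proportions as well as raw counts, since they make class imbalance obvious. A small sketch on toy data (the `train_demo` frame is made up for illustration):

```python
import pandas as pd

# Toy data standing in for the real training set (assumption)
train_demo = pd.DataFrame({"target": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Absolute counts per class
counts = train_demo["target"].value_counts()
# Share of each class; normalize=True divides by the number of rows
ratios = train_demo["target"].value_counts(normalize=True)

print(counts)
print(ratios)
```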
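3. Check for missing values; the function takes a dataframe
It returns a dataframe with the count and percentage of missing values per column, sorted by percentage in descending order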
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_table_ren_columns.shape[0]) +
           " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
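To see what such a table looks like, here is a compact variant run on toy data. It is a sketch under assumptions, not the function above: `missing_summary` and the `demo` frame are hypothetical names, and it uses `isnull().mean()` to get the missing fraction directly.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "Missing Values": df.isnull().sum(),
        "% of Total Values": 100 * df.isnull().mean(),
    })
    # Keep only the columns that actually have missing values
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("% of Total Values", ascending=False).round(1)


demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],   # 2 of 4 missing
    "b": ["x", "y", None, "z"],        # 1 of 4 missing
    "c": [1, 2, 3, 4],                 # nothing missing
})
result = missing_summary(demo)
print(result)
```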
4. Count the number of columns of each dtype
train.dtypes.value_counts()
float64 65
int64 41
object 16
dtype: int64
5. Select the columns of one dtype and show the number of distinct values in each (useful for object columns)
train.select_dtypes('object').apply(pd.Series.nunique, axis=0)
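The distinct-value counts can then be used to split the categorical columns into those with at most two values and those with more. A sketch on toy data; the frame and the names `binary_cols` / `multi_cols` are made up here, and the columns are created with an explicit object dtype so that `select_dtypes("object")` picks them up regardless of the pandas version's default string dtype.

```python
import pandas as pd

demo = pd.DataFrame({
    "gender": pd.Series(["M", "F", "F", "M"], dtype="object"),
    "city": pd.Series(["A", "B", "C", "A"], dtype="object"),
    "income": [1.0, 2.0, 3.0, 4.0],
})

# Number of distinct values in each object column
n_unique = demo.select_dtypes("object").nunique()

# Columns with at most 2 values vs. columns with more than 2
binary_cols = n_unique[n_unique <= 2].index.tolist()
multi_cols = n_unique[n_unique > 2].index.tolist()

print(n_unique)
```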
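6. Encoding: a categorical variable with only two distinct values can simply be mapped to 0/1; for variables with more than two, use one-hot encoding
from sklearn.preprocessing import LabelEncoder

def label_encoder(app_train, app_test):
    # Create the sklearn label encoder
    le = LabelEncoder()
    le_count = 0
    # Iterate over every column
    for col in app_train:
        if app_train[col].dtype == 'object':  # only object columns
            # ...that have at most 2 distinct values
            if len(list(app_train[col].unique())) <= 2:
                # Fit on train only, but transform both train and test
                le.fit(app_train[col])
                app_train[col] = le.transform(app_train[col])
                app_test[col] = le.transform(app_test[col])
                le_count += 1  # count how many columns were encoded
    print('%d columns were label encoded' % le_count)
    return app_train, app_test

app_train, app_test = label_encoder(app_train, app_test)
# One-hot encode the remaining categorical columns
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)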
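One thing to watch with the fit-on-train, transform-both pattern: `LabelEncoder.transform` raises a `ValueError` if test contains a value that was not seen during fit. A pandas-only alternative, sketched here on made-up data, fixes the categories from train and maps anything unseen to -1:

```python
import pandas as pd

# Toy columns (assumption); "maybe" only appears in test
train_col = pd.Series(["Y", "N", "Y", "N"])
test_col = pd.Series(["N", "Y", "maybe"])

# Categories are learned from train only
categories = sorted(train_col.dropna().unique())

# .codes gives the integer position of each value in `categories`;
# values outside the category list get the code -1
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(train_codes), list(test_codes))
```

How to treat the -1 rows (drop, impute, or keep as their own class) depends on the dataset.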
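7. Align the training and test sets so that both have the same columns; the inner join keeps only the columns present in both, and TARGET (which only exists in train) is added back afterwards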
train_labels = app_train['TARGET']
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)
app_train['TARGET'] = train_labels
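The reason alignment is needed: after one-hot encoding, a category that appears in only one of the two sets produces a column the other set lacks. A small sketch on toy frames (all names and values here are made up):

```python
import pandas as pd

train_demo = pd.DataFrame({
    "TARGET": [0, 1, 0],
    "city": ["A", "B", "C"],   # "C" never appears in test
    "x": [1, 2, 3],
})
test_demo = pd.DataFrame({"city": ["A", "B", "A"], "x": [4, 5, 6]})

train_enc = pd.get_dummies(train_demo)   # has a city_C column
test_enc = pd.get_dummies(test_demo)     # has no city_C column

# Save the labels, keep only the shared columns, then restore the labels
labels = train_enc["TARGET"]
train_enc, test_enc = train_enc.align(test_enc, join="inner", axis=1)
train_enc["TARGET"] = labels

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```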
8. Look at the Pearson correlation coefficients to see how linearly correlated each feature is with the target
correlations = app_train.corr()['TARGET'].sort_values()
print('Most positive correlations:\n', correlations.tail(15))
print('\nMost negative correlations:\n', correlations.head(15))
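Since strongly negative and strongly positive correlations are equally informative, it can also help to rank features by absolute value. A sketch on toy data (the frame and the name `strongest` are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 1],
    "up": [1, 2, 3, 4, 5],        # rises with TARGET
    "down": [5, 4, 3, 2, 1],      # falls with TARGET
    "noise": [1, -1, 0, 1, -1],   # unrelated to TARGET
})

# Pearson correlation of every feature with TARGET
corr = demo.corr()["TARGET"].drop("TARGET")

# Reorder by absolute value, strongest first, keeping the original signs
strongest = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(strongest)
```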
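9. Fill missing values with scikit-learn's SimpleImputer (the old sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')  # fill with the median
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)  # fill test with the medians learned from train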
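Note that the transform step returns a NumPy array by default, so column names are lost. If keeping a DataFrame matters, the same idea (medians from train, applied to both sets) can be sketched in plain pandas. The toy frames below are made up for illustration:

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})
test_demo = pd.DataFrame({"a": [np.nan, 2.0], "b": [np.nan, np.nan]})

# Medians are computed on train only (NaN values are skipped)
medians = train_demo.median()

# fillna with a Series fills each column with its matching value
train_filled = train_demo.fillna(medians)
test_filled = test_demo.fillna(medians)

print(test_filled)
```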
10. Feature importance display function: shows the importances as a plot
import matplotlib.pyplot as plt

def plot_feature_importances(df):
    # Sort the features by importance, highest first
    df = df.sort_values('importance', ascending = False).reset_index()
    # Normalize the importances so they sum to 1
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    # Set up the matplotlib figure
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    # Horizontal bars for the top 15, reversed so the most important is on top
    ax.barh(list(reversed(list(df.index[:15]))),
            df['importance_normalized'].head(15),
            align = 'center', edgecolor = 'k')
    # Set the y ticks and their labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    return df
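The function expects a dataframe with a 'feature' column and an 'importance' column. How that dataframe is built is not shown in these notes; a common source would be a fitted model's importance scores, which is an assumption here. The sketch below uses made-up numbers and only reproduces the sorting and normalization steps, without the plot:

```python
import pandas as pd

# Hypothetical importances; in practice these would come from a model
fi = pd.DataFrame({
    "feature": ["f1", "f2", "f3"],
    "importance": [10, 30, 60],
})

# Same preparation as in the plotting function: sort, then normalize
fi = fi.sort_values("importance", ascending=False).reset_index(drop=True)
fi["importance_normalized"] = fi["importance"] / fi["importance"].sum()

print(fi)
```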