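The previous post shared a baseline; this post improves on it through model selection and feature engineering.
The models are evaluated with cross-validation. The figure below shows a typical cross-validation workflow for model training.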
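To make the workflow concrete, here is a minimal sketch of k-fold cross-validation with scikit-learn. The synthetic dataset from make_classification, the 5 folds, and the macro F1 metric are all assumptions for illustration; they stand in for the competition data and settings, which are not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# 5-fold CV: each fold is held out once while the model trains on the other four.
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

# cross_val_predict returns one out-of-fold prediction per training row.
oof_pred = cross_val_predict(model, X, y, cv=5)

print("per-fold macro F1:", np.round(scores, 3))
print("out-of-fold predictions shape:", oof_pred.shape)
```

The rest of this post uses cross_val_predict, because the out-of-fold predictions can be passed straight to classification_report.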
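This classifier implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate. SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results with the default learning rate schedule, the data should have zero mean and unit variance.
This implementation works with data represented as dense or sparse arrays of floating-point values for the features. The model it fits is controlled by the loss parameter; by default it fits a linear support vector machine (SVM).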
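# Train and validate SGDClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

pred = cross_val_predict(
    SGDClassifier(max_iter=10),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))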
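Since SGD works best on data with zero mean and unit variance, one option is to put a StandardScaler in front of the classifier. The sketch below is not part of the original experiment: it uses synthetic, deliberately badly scaled data and assumed settings (max_iter=1000, batches of 100 rows) to show both the scaling pipeline and minibatch learning with partial_fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, shifted and stretched so it is far from zero mean / unit variance.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X = X * 100.0 + 50.0

# The scaler lives inside the pipeline, so it is refit on each training fold
# and never sees the held-out fold.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, random_state=0),
)
pred = cross_val_predict(scaled_sgd, X, y, cv=5)
print("out-of-fold accuracy with scaling:", (pred == y).mean())

# Minibatch learning: partial_fit needs the full list of classes on the first call.
X_scaled = StandardScaler().fit_transform(X)
classes = np.unique(y)
online = SGDClassifier(random_state=0)
for start in range(0, len(X_scaled), 100):
    batch = slice(start, start + 100)
    online.partial_fit(X_scaled[batch], y[batch], classes=classes)
print("classes seen by the online model:", online.classes_)
```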
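Decision trees are a non-parametric supervised learning method used for classification and regression. The goal is to build a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.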
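Some advantages of decision trees: they are simple to understand and interpret, and they require little data preparation.
Their disadvantages include a tendency to overfit and instability, since small variations in the data can produce a very different tree.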
# Train and validate DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

pred = cross_val_predict(
    DecisionTreeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
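To see what "simple decision rules" look like in practice, a fitted tree can be printed as text. This is only an illustration on the iris dataset bundled with scikit-learn, with max_depth=2 chosen here to keep the output short; it is not the tree trained on the competition data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else style rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each leaf predicts a single class, which is why the model behaves like a piecewise constant function of the features.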
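Naive Bayes classifier for multinomial models.
The multinomial naive Bayes classifier is suitable for classification with discrete features (for example, word counts for text classification). The multinomial distribution normally requires integer feature counts. In practice, however, fractional counts such as tf-idf may also work.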
# Train and validate MultinomialNB
from sklearn.naive_bayes import MultinomialNB

pred = cross_val_predict(
    MultinomialNB(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
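Because MultinomialNB models feature counts, it expects non-negative inputs. The sketch below uses made-up Poisson count features (an assumption, not the competition data) to show a fit on count-like data and what happens when negative values are passed in.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical count features: the two classes differ in which column is large.
X0 = rng.poisson(lam=[5, 1, 1], size=(100, 3))
X1 = rng.poisson(lam=[1, 1, 5], size=(100, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = MultinomialNB().fit(X, y)
print("training accuracy:", clf.score(X, y))

# Negative feature values are rejected with a ValueError.
rejected = False
try:
    MultinomialNB().fit(X - 10, y)
except ValueError as err:
    rejected = True
    print("negative features rejected:", err)
```

If some columns of the real data can be negative, they would need to be shifted, binned, or dropped before using this classifier.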
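A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. The sub-sample size is controlled by the max_samples parameter if bootstrap=True (the default); otherwise the whole dataset is used to build each tree.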
# Train and validate RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

pred = cross_val_predict(
    RandomForestClassifier(n_estimators=5),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
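The "averaging" in the description can be checked directly. In this sketch (synthetic data, and max_samples=0.5 picked only as an example value), the forest's predicted probabilities are compared with the mean of the probabilities from its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# bootstrap=True is the default; max_samples=0.5 draws half of the rows
# (with replacement) for each tree.
forest = RandomForestClassifier(
    n_estimators=5, max_samples=0.5, random_state=0
).fit(X, y)

# Average the class probabilities predicted by the individual trees.
tree_avg = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)

same = np.allclose(tree_avg, forest.predict_proba(X))
print("forest probabilities equal the per-tree average:", same)
```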
# Day-of-month feature extracted from the timestamp
train_data['common_ts_day'] = train_data['common_ts'].dt.day
test_data['common_ts_day'] = test_data['common_ts'].dt.day

# Frequency and target-mean encodings, both computed from the training set
for col in ['x1', 'x2', 'x3', 'x4', 'x6', 'x7', 'x8']:
    freq = train_data[col].value_counts()
    target_mean = train_data.groupby(col)['target'].mean()
    train_data[col + '_freq'] = train_data[col].map(freq)
    test_data[col + '_freq'] = test_data[col].map(freq)
    train_data[col + '_mean'] = train_data[col].map(target_mean)
    test_data[col + '_mean'] = test_data[col].map(target_mean)
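A toy example may help show what the frequency and target-mean features contain. The tiny DataFrames below are invented for illustration, and filling unseen categories with a count of 0 and the global target mean is one possible choice assumed here, not something the code above does.

```python
import pandas as pd

# Toy data: "d" appears in the test set but never in the training set.
train = pd.DataFrame({
    "x1": ["a", "a", "b", "c", "a", "b"],
    "target": [1, 0, 1, 0, 1, 1],
})
test = pd.DataFrame({"x1": ["a", "b", "d"]})

freq = train["x1"].value_counts()                    # a: 3, b: 2, c: 1
target_mean = train.groupby("x1")["target"].mean()   # a: 2/3, b: 1.0, c: 0.0

train["x1_freq"] = train["x1"].map(freq)
train["x1_mean"] = train["x1"].map(target_mean)
test["x1_freq"] = test["x1"].map(freq)
test["x1_mean"] = test["x1"].map(target_mean)

# Categories unseen in training map to NaN; fill them explicitly (assumed policy).
test["x1_freq"] = test["x1_freq"].fillna(0)
test["x1_mean"] = test["x1_mean"].fillna(train["target"].mean())

print(test)
```

Note that the target-mean feature of a training row is computed from labels that include the row's own label, so cross-validation scores on these features can look better than the scores on truly unseen data.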
After adding the features, the cross-validation scores of each classifier are as follows:
SGDClassifier :
DecisionTreeClassifier :
MultinomialNB :
RandomForestClassifier :
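A comparison like the one above can be produced with a short loop. The sketch below runs on synthetic data, so its numbers say nothing about the competition results; macro F1 and the specific hyperparameters are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Shift every column to be non-negative so MultinomialNB accepts the data.
X = X - X.min(axis=0)

models = {
    "SGDClassifier": SGDClassifier(max_iter=1000, random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=5, random_state=0),
}

# Score every model on the same out-of-fold predictions protocol.
results = {}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=5)
    results[name] = f1_score(y, pred, average="macro")

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```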
Building on the previous baseline, this post improved the model by trying different classifiers and adding feature engineering.