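In traditional machine learning, feature engineering is the most important step: starting from a set of candidate features, it transforms, reduces, and filters them according to chosen criteria in order to improve model training. As the saying goes, data and features set the upper bound of what machine learning can achieve, while models and algorithms only approximate that bound. Good data and features are therefore a prerequisite for models and algorithms to be effective. This post offers one possible modeling workflow.
The dataset is a real-world one containing audio features for 1059 tracks from 33 countries or regions. Each track has 116 audio features and 2 coordinates of its place of origin (latitude and longitude). In this experiment, tracks are labeled by hemisphere: positive (eastern hemisphere) or negative (western hemisphere).
Fang Zhou, Claire Q, and Ross D. King. Predicting the Geographical Origin of Music. ICDM, 2014.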
http://archive.ics.uci.edu/ml/datasets/Geographical+Original+of+Music
First come the three workhorses of data analysis: NumPy, pandas, and Matplotlib, plus SciPy for computing the Pearson correlation coefficient.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
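Feature selection and model building rely mainly on scikit-learn. The pieces used are weight-based feature selection (SelectFromModel), random splitting (train_test_split), cross-validation (StratifiedKFold, cross_val_score), and model evaluation (roc_auc_score, calibration_curve).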
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve
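The candidate classifiers are Linear Discriminant Analysis, Logistic Regression, Random Forest, Support Vector Machine (a linear SVM via LinearSVC), Decision Tree, and a neural network (Multilayer Perceptron). In addition, LASSO regression (linear regression with L1 regularization) is used for feature reduction.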
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
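Read the data and randomly split it into a training set and a test set. All feature selection and model building is done on the training set; the test set is used only to evaluate the final model. Because the features in this dataset are already Z-score standardized, no further preprocessing is needed.
stratify = labels ensures that the training and test sets are sampled with the same ratio of positive to negative samples as in labels (stratified sampling).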
data = pd.read_csv('E:\\default_plus_chromatic_features_1059_tracks.txt', header = None)
print(data.shape)
features = np.array(data.iloc[:, :-2])
labels = np.array(data.iloc[:, -1] > 0)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 3 / 10, random_state = 1, stratify = labels)
print(features_train.shape, features_test.shape)
'''
(1059, 118)
(741, 116) (318, 116)
'''
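The effect of stratify can be checked in isolation. The following is a minimal sketch on made-up data: toy_labels only mirrors the 800:259 class counts quoted in this post, and toy_features is random noise, so none of it comes from the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 800 positives, 259 negatives, 3 noise features
rng = np.random.default_rng(0)
toy_labels = np.array([True] * 800 + [False] * 259)
toy_features = rng.normal(size=(toy_labels.size, 3))

_, _, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=1, stratify=toy_labels
)

# With stratify, the positive rate of each part stays close to the overall rate
print(toy_labels.mean(), y_train.mean(), y_test.mean())
```

Without stratify, the two positive rates can drift apart by chance, which matters more as the classes get more imbalanced.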
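If two features are highly correlated, they are essentially the same quantity expressed in different units and contribute similarly to the prediction; two are no better than one, so one of each highly correlated pair can usually be removed. Here the Pearson correlation coefficient measures linear correlation and, following the paper below, the feature removed from each correlated pair is the one with the higher mean correlation with all other features.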
Park HJ, Park B, Park SY, et al. Preoperative prediction of postsurgical outcomes in mass-forming intrahepatic cholangiocarcinoma based on clinical, radiologic, and radiomics features. Eur Radiol. 2021;31(11):8638-8648.
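Note that another way to use Pearson correlation is to compute the correlation between each single feature and the label and keep the features with high correlation. This amounts to univariate analysis and is not recommended when building prediction models.
From a graph-theory point of view, the correlation heatmap computed from the Pearson correlation coefficient (PCC) can be viewed, approximately, as the adjacency matrix of a weighted undirected graph. In the example figure, with a threshold of 0.95, four pairs among the 5 features are highly correlated, and one feature from each pair should be removed. For a given pair such as F1:F2, F2 is removed because its summed correlation with the other features is larger (dividing the sum by the number of features gives the mean correlation). F3 is removed in the same way. In effect, this method spreads the remaining nodes as far apart as possible.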
def pcc(features_array_train, features_array_test, threshold):
    """Drop one feature from every pair whose training-set Pearson r >= threshold."""
    feature_num = features_array_train.shape[1]
    pccmatrix = np.zeros((feature_num, feature_num))
    for i in range(feature_num):
        for j in range(feature_num):
            pccmatrix[i, j] = pearsonr(features_array_train[:, i], features_array_train[:, j])[0]
    # Sum of Pearson correlation coefficients per feature (the matrix is symmetric)
    spr = np.sum(pccmatrix, 0)
    del_feature = []
    for i in range(feature_num):
        for j in range(i + 1, feature_num):
            # Signed comparison: strongly negative correlations are not flagged
            if pccmatrix[i, j] >= threshold:
                del_feature.append(_compare_spr(i, j, spr))
    # Keep the surviving column indices in their original order
    remain_feature = sorted(set(range(feature_num)).difference(del_feature))
    features_array_train = features_array_train[:, remain_feature]
    features_array_test = features_array_test[:, remain_feature]
    return features_array_train, features_array_test

def _compare_spr(i, j, spr):
    """Return whichever of i and j has the larger correlation sum (i on ties)."""
    return i if spr[i] >= spr[j] else j
As shown, the Pearson filter removed 45 highly correlated features.
features_train, features_test = pcc(features_train, features_test, threshold = 0.95)
print(features_train.shape, features_test.shape)
'''
(741, 71) (318, 71)
'''
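The pruning rule is easier to follow on a tiny example. The sketch below is a self-contained variant that uses np.corrcoef instead of pearsonr; the function name drop_correlated and the synthetic columns are assumptions made for illustration, not part of the original pipeline.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return the sorted indices of the columns kept after pairwise pruning."""
    corr = np.corrcoef(X, rowvar=False)   # Pearson correlation between columns
    col_sum = corr.sum(axis=0)            # summed correlation per feature
    n = corr.shape[0]
    dropped = set()
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:
                # Drop the member of the pair with the larger correlation sum
                dropped.add(i if col_sum[i] >= col_sum[j] else j)
    return sorted(set(range(n)) - dropped)

# Hypothetical data: columns 0/1 and 2/3 are near-duplicates, column 4 is independent
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
noise = 0.01 * rng.normal(size=(2, 500))
toy = np.column_stack([a, a + noise[0], b, b + noise[1], c])

kept = drop_correlated(toy)
print(kept)
```

One column survives from each near-duplicate pair, and the independent column is always kept. Which member of a pair is dropped depends on the correlation sums, so it can change with the data.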
Both L1 regularization (LASSO regression) and L2 regularization (Ridge regression) reduce overfitting by shrinking feature weights. Unlike L2, L1 regularization of a linear regression drives some weights exactly to 0, so LASSO is often used for feature reduction by keeping only the features with non-zero weights. Here, at alpha = 0.01, LASSO removed 40 features.
def Lasso_selection(features_array_train, labels_array_train, features_array_test, alpha):
    Lasso_model = Lasso(alpha = alpha).fit(features_array_train, labels_array_train)
    # Boolean mask of the features whose LASSO weight is non-zero
    remain_feature = Lasso_model.coef_ != 0
    features_array_train = features_array_train[:, remain_feature]
    features_array_test = features_array_test[:, remain_feature]
    return features_array_train, features_array_test
features_train, features_test = Lasso_selection(features_train, labels_train, features_test, 0.01)
print(features_train.shape, features_test.shape)
'''
(741, 31) (318, 31)
'''
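The difference between L1 and L2 shrinkage can be seen directly on synthetic data. This is a minimal sketch, assuming a made-up regression problem in which only the first two of 20 columns carry signal; the alpha values are arbitrary and unrelated to the 0.01 used above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 20 features, only the first two influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count the coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

LASSO sets most of the uninformative coefficients to exactly zero, while Ridge only makes them small, which is why the mask coef_ != 0 is a usable selection rule for LASSO but not for Ridge.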
Iterative feature selection repeatedly fits the model, drops the feature with the lowest weight each time, and finally keeps the feature subset with the best evaluation metric (AUC here) as the final features. Unlike scikit-learn's Sequential Feature Selection, the number of features in the result is not fixed in advance. Sequential feature selection is typically given a target number of features and greedily adds or removes features until it reaches that number, so its result is only the best subset the greedy search found at that particular size.
def iterative_feature_selection(features_array_train, labels_array_train, features_array_test, estimator):
    feature_num = features_array_train.shape[1]
    performance = pd.DataFrame(columns = ['feature number', 'select_mask', 'AUROC'])
    train_feature = features_array_train.copy()
    for i in range(feature_num, 0, -1):
        # Keep the i features with the largest weights (threshold = -inf disables the cutoff)
        selector = SelectFromModel(estimator = estimator, threshold = - np.inf, max_features = i).fit(train_feature, labels_array_train)
        select_mask = selector.get_support(indices = True)
        del_feature = list(set(range(feature_num)).difference(set(select_mask)))
        # Zero out dropped columns instead of deleting them so column indices stay stable
        train_feature[:, del_feature] = 0
        skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 1)
        auroc = cross_val_score(estimator, train_feature, labels_array_train, scoring = 'roc_auc', cv = skf)
        performance.loc[i] = [i, select_mask, auroc.mean()]
    performance = performance.sort_values(by = ['AUROC'], ascending = False)
    print(performance.head(5))
    # Column indices of the best-scoring subset
    remain_feature = performance['select_mask'].iat[0]
    features_array_train = features_array_train[:, remain_feature]
    features_array_test = features_array_test[:, remain_feature]
    return features_array_train, features_array_test
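Logistic regression is used as the estimator here. After plotting the results and checking the table, the cross-validated AUC is highest with 27 features.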
features_train, features_test = iterative_feature_selection(features_train, labels_train, features_test, LogisticRegression())
print(features_train.shape, features_test.shape)
'''
feature number select_mask AUROC
27 27 [0, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 1... 0.827418
26 26 [0, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, ... 0.826284
25 25 [0, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 17,... 0.826274
28 28 [0, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 1... 0.825522
29 29 [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... 0.825428
(741, 27) (318, 27)
'''
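For comparison, scikit-learn's sequential selector can be run with a fixed target size. The sketch below assumes scikit-learn 0.24 or later (where SequentialFeatureSelector is available) and uses a synthetic dataset from make_classification; the target of 4 features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=1
)

# Backward elimination: start from all features, greedily remove one at a time
sfs = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=4,
    direction='backward',
    scoring='roc_auc',
    cv=5,
)
sfs.fit(X_demo, y_demo)
selected = sfs.get_support(indices=True)
print(selected)
```

The selector stops as soon as 4 features remain, whereas the iterative procedure above scores every subset size and picks the size afterwards.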
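Using 10-fold cross-validation, each candidate classifier is fitted on the selected training features, the mean evaluation metric (AUC here) is computed, and the classifier with the highest score is chosen.
def classifier_selection(features_array_train, labels_array_train, candidate_classifiers = None):
    # Avoid a mutable default argument; fall back to logistic regression
    if candidate_classifiers is None:
        candidate_classifiers = [LogisticRegression()]
    performance = pd.DataFrame(columns = ['classifier', 'AUROC'])
    for index, classifier in enumerate(candidate_classifiers):
        skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 1)
        auroc = cross_val_score(classifier, features_array_train, labels_array_train, scoring = 'roc_auc', cv = skf)
        performance.loc[index] = [classifier, auroc.mean()]
    performance = performance.sort_values(by = ['AUROC'], ascending = False)
    print(performance)
    # Return the (still unfitted) classifier with the highest mean AUC
    return performance['classifier'].iat[0]
The classifier selected in the end is the MLP, i.e. a neural network (of the simplest kind).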
candidate_classifiers = [
    LogisticRegression(random_state = 1),
    RandomForestClassifier(random_state = 1),
    LinearSVC(random_state = 1),
    DecisionTreeClassifier(random_state = 1),
    MLPClassifier(random_state = 1),
    LinearDiscriminantAnalysis()
]
classifier = classifier_selection(features_train, labels_train, candidate_classifiers)
'''
classifier AUROC
4 MLPClassifier(random_state=1) 0.841980
1 RandomForestClassifier(random_state=1) 0.829509
0 LogisticRegression(random_state=1) 0.827418
2 LinearSVC(random_state=1) 0.826854
5 LinearDiscriminantAnalysis() 0.824582
3 DecisionTreeClassifier(random_state=1) 0.646449
'''
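With the features and the model chosen, the next step is to fit the model on the whole training set and then check its performance on the test set.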
final_model = classifier.fit(features_train, labels_train)
#AUC
train_auc = roc_auc_score(labels_train, final_model.predict_proba(features_train)[:, 1])
test_auc = roc_auc_score(labels_test, final_model.predict_proba(features_test)[:, 1])
print('AUC_train:', train_auc,'AUC_test:', test_auc)
#calibration
fig, ax = plt.subplots()
fraction_of_positives, mean_predicted_value = calibration_curve(labels_train, final_model.predict_proba(features_train)[:, 1], n_bins = 5)
ax.plot(mean_predicted_value, fraction_of_positives, 'o-', label = 'train')
fraction_of_positives, mean_predicted_value = calibration_curve(labels_test, final_model.predict_proba(features_test)[:, 1], n_bins = 5)
ax.plot(mean_predicted_value, fraction_of_positives, 'o-', label = 'test')
ax.plot([0, 1], [0, 1], ':')
ax.set_title('Calibration plots')
ax.set_xlabel('mean_predicted_value')
ax.set_ylabel('fraction_of_positives')
ax.set_aspect('equal')
ax.legend()
plt.show()
'''
AUC_train: 0.9949388318863457 AUC_test: 0.8221153846153847
'''
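The evaluation code above calls predict_proba, which works for the MLP selected here, but LinearSVC does not provide that method. A minimal sketch of a fallback follows; positive_scores is a hypothetical helper and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

def positive_scores(model, X):
    """Hypothetical helper: ranking scores for the positive class."""
    if hasattr(model, 'predict_proba'):
        return model.predict_proba(X)[:, 1]
    # Signed distance to the decision boundary; enough for ranking metrics
    return model.decision_function(X)

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
svm = LinearSVC(random_state=1, max_iter=10000).fit(X_demo, y_demo)
logreg = LogisticRegression().fit(X_demo, y_demo)

auc_svm = roc_auc_score(y_demo, positive_scores(svm, X_demo))
auc_lr = roc_auc_score(y_demo, positive_scores(logreg, X_demo))
print(auc_svm, auc_lr)
```

AUC depends only on the ranking of the scores, so decision_function is sufficient for it. calibration_curve, however, expects probabilities in [0, 1], so raw SVM scores would have to be calibrated first.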
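The selected features and classifier fit the training set almost perfectly (AUC = 0.99; it is a neural network, after all) and reach an AUC of 0.82 on the test set, although the gap between the two suggests some overfitting. The calibration curve shows that the model's predicted probabilities agree closely with the observed fraction of positive samples. Overall, the modeling worked well.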
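To make explicit what the calibration plot compares, the per-bin quantities can be computed by hand. The following is a minimal sketch assuming equal-width bins on [0, 1]; manual_calibration is a hypothetical helper, and the probabilities are simulated so that they are well calibrated by construction.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def manual_calibration(y_true, y_prob, n_bins=5):
    """Per-bin observed positive rate and mean predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins; a probability of exactly 1.0 goes into the last bin
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    frac_pos, mean_pred = [], []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():  # empty bins are skipped
            frac_pos.append(y_true[in_bin].mean())
            mean_pred.append(y_prob[in_bin].mean())
    return np.array(frac_pos), np.array(mean_pred)

# Simulated scores: each label is positive with exactly the predicted probability
rng = np.random.default_rng(0)
demo_prob = rng.uniform(size=2000)
demo_true = (rng.uniform(size=2000) < demo_prob).astype(int)

frac_manual, pred_manual = manual_calibration(demo_true, demo_prob)
frac_sk, pred_sk = calibration_curve(demo_true, demo_prob, n_bins=5)
print(np.round(frac_manual, 3), np.round(pred_manual, 3))
```

Each point on the curve is one bin: the x value is the mean predicted probability in the bin and the y value is the fraction of samples in it that are actually positive. Points near the diagonal mean the two agree.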
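The modeling still has several methodological imperfections. First, the dataset is quite imbalanced (P:N = 800:259). Second, before LASSO there were 71 features, so the ratio of positive samples to features was about 11 (800/71), which means some bias may have been introduced while building the model; since the sample size is reasonable, this bias is probably small. Third, neither the SVM nor the MLP converged in this experiment, and the results might differ after convergence; moreover, the classifier hyperparameters were not tuned, so the classifier selection step leaves room for improvement. Finally, this experiment avoided univariate analyses such as ANOVA, the chi-square test, and Relief, because P-value-based univariate analysis can overlook interactions between variables and can introduce bias when the sample is small or the ratio of positive samples to features is low. With larger samples, however, these methods are still a reasonable means of feature selection.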
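One way to address the convergence and tuning points is to raise max_iter and search a small hyperparameter grid with cross-validation. This is only a sketch on synthetic data: the grid values and max_iter = 2000 are assumptions, not settings taken from this experiment, and a larger max_iter does not by itself guarantee convergence.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

# Hypothetical grid over the L2 penalty of the MLP
param_grid = {'alpha': [1e-4, 1e-2, 1.0]}
search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=1),
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

In the workflow of this post, such a search would be run on the training set only, so that the test set remains untouched until the final evaluation.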