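The figure above shows the training set. The first element of each row (index 0) is the label, and the next 22 elements are the 11 features of A followed by the 11 features of B, for 23 columns in total.
This is a special case in the training set: a node can appear in more than one row. For example, A is compared with B, and A is also compared with C. Make sure you understand the data thoroughly.
This is a sample of the prediction output. Id is the row number of the compared node pair in the test set, and Choice expresses how influential A is relative to B: a value above 0.5 means A is more influential than B, otherwise B is more influential.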
In Python (using the metrics module of scikit-learn):
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(true_labels, predictions, pos_label=1)
auc = metrics.auc(fpr, tpr)
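As a quick sanity check of the snippet above, here is a minimal sketch on made-up labels and scores (the toy arrays are assumptions, not competition data). It also compares the result with `metrics.roc_auc_score`, which computes the same quantity in one call.

```python
import numpy as np
from sklearn import metrics

# Toy data (hypothetical): 1 means "A is more influential than B".
true_labels = np.array([0, 0, 1, 1])
predictions = np.array([0.1, 0.4, 0.35, 0.8])

# Same two calls as in the evaluation snippet.
fpr, tpr, thresholds = metrics.roc_curve(true_labels, predictions, pos_label=1)
auc = metrics.auc(fpr, tpr)

# One-call equivalent for binary labels.
auc_direct = metrics.roc_auc_score(true_labels, predictions)
print(auc, auc_direct)
```

Three of the four (positive, negative) pairs are ranked correctly here, so both values come out as 0.75.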
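Wikipedia describes ROC as follows:
In signal detection theory, the receiver operating characteristic curve (ROC curve) is a graphical analysis tool used to:
(1) select the best signal detection model and discard the weaker ones;
(2) set the best threshold within a single model.
When making a decision, ROC analysis gives objective, neutral advice that is independent of cost/benefit considerations.
For example, when blood pressure is used to test whether a person has hypertension, the measured value is a continuous real number. With systolic 140 / diastolic 90 as the threshold, readings at or above the threshold are diagnosed as hypertension and readings below it as no hypertension.
How an ROC curve is drawn: given m positive and n negative examples, sort the examples by the learner's predicted score. First set the classification threshold to the maximum, so that every example is predicted negative; the true positive rate and the false positive rate are then both 0, and a point is marked at (0,0). Next, set the threshold to each example's predicted value in turn, which classifies the examples as positive one at a time: each newly included positive moves the point up by 1/m, and each newly included negative moves it right by 1/n. Connecting adjacent points with line segments gives the curve.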
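The drawing procedure can be sketched in a few lines of NumPy. This is a minimal illustration, assuming no two examples share the same score; `roc_points` and `area_under` are hypothetical helper names, and the toy labels and scores are made up.

```python
import numpy as np

def roc_points(labels, scores):
    """Sweep the threshold from high to low and return (fpr, tpr) arrays."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)              # highest score first
    sorted_labels = labels[order]
    m = float(np.sum(labels == 1))           # number of positives
    n = float(np.sum(labels == 0))           # number of negatives
    # Leading 0: threshold above every score, nothing predicted positive -> (0, 0).
    tp = np.concatenate(([0], np.cumsum(sorted_labels == 1)))
    fp = np.concatenate(([0], np.cumsum(sorted_labels == 0)))
    return fp / n, tp / m

def area_under(fpr, tpr):
    """Trapezoid rule over the line segments joining adjacent points."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels, scores)
area = area_under(fpr, tpr)
print(list(zip(fpr, tpr)), area)
```

Each positive lifts the curve by 1/m and each negative shifts it right by 1/n, so the curve always runs from (0,0) to (1,1).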
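The diagonal from (0,0) to (1,1) splits ROC space into an upper-left and a lower-right region. Points above this line represent good classification results (better than random), while points below it represent poor results (worse than random).
- A perfect prediction is the point in the top-left corner, at ROC coordinates (0,1): X=0 means there are no false positives, and Y=1 means there are no false negatives.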
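To make the ROC-space coordinates concrete, the sketch below computes the (false positive rate, true positive rate) point for a set of hard 0/1 predictions. `roc_space_point` is a hypothetical helper and the labels are invented for illustration.

```python
def roc_space_point(labels, predicted):
    """Return (fpr, tpr) for hard 0/1 predictions."""
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fpr = fp / float(fp + tn)   # X axis: share of negatives flagged as positive
    tpr = tp / float(tp + fn)   # Y axis: share of positives that were found
    return fpr, tpr

labels = [1, 1, 1, 0, 0, 0]
perfect = roc_space_point(labels, labels)             # lands in the top-left corner
noisy = roc_space_point(labels, [1, 1, 0, 1, 0, 0])   # one miss, one false alarm
above_diagonal = noisy[1] > noisy[0]                  # better than random guessing
print(perfect, noisy, above_diagonal)
```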
First, take a look at the data:
import pandas as pd
data = pd.read_csv('train.csv')
print(data.describe())
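Because each row of the given data represents a pair of nodes, the pair has to be split apart and each node given its own label. For example, the first row has Choice=0, which means B is more influential than A, so after splitting A gets label 0 and B gets label 1.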
# load train data
import pandas as pd
import numpy as np
data = pd.read_csv('train.csv')
num = len(data)
data_matrix = data.values  # DataFrame.as_matrix() was removed in pandas 1.0
# rows 0..num-1 hold the more influential node of each pair (label 1),
# rows num..2*num-1 hold the less influential one (label 0)
train = np.zeros((num*2, 12))
for i in range(num):
    if data_matrix[i][0] == 1.:  # A is more influential
        train[i][0] = 1
        train[i][1:] = data_matrix[i][1:12]
        train[i+num][0] = 0
        train[i+num][1:] = data_matrix[i][12:23]
    else:  # B is more influential
        train[i][0] = 1
        train[i][1:] = data_matrix[i][12:23]
        train[i+num][0] = 0
        train[i+num][1:] = data_matrix[i][1:12]
train = pd.DataFrame(data=train, index=range(2*num), columns=data.axes[1][0:12])
print(train.describe())
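The same split can be written without an explicit Python loop. The sketch below is one possible vectorized variant on a made-up miniature table (one label column plus two features per node instead of eleven); the column names `A_f1`, `A_f2`, `B_f1`, `B_f2` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of train.csv: label, 2 features for A, 2 features for B.
data = pd.DataFrame(
    {
        "Choice": [1, 0, 1],
        "A_f1": [10.0, 1.0, 7.0], "A_f2": [20.0, 2.0, 8.0],
        "B_f1": [3.0, 30.0, 5.0], "B_f2": [4.0, 40.0, 6.0],
    },
    columns=["Choice", "A_f1", "A_f2", "B_f1", "B_f2"],
)

k = 2                                    # features per node (11 in the real data)
values = data.values
a_feats = values[:, 1:1 + k]
b_feats = values[:, 1 + k:1 + 2 * k]
a_wins = (values[:, 0] == 1)[:, None]    # column vector, broadcasts over features

winners = np.where(a_wins, a_feats, b_feats)   # more influential node of each pair
losers = np.where(a_wins, b_feats, a_feats)    # less influential node of each pair

num = len(data)
labels = np.concatenate([np.ones(num), np.zeros(num)])
train = pd.DataFrame(
    np.column_stack([labels, np.vstack([winners, losers])]),
    columns=list(data.columns[:1 + k]),
)
print(train)
```

As in the loop version, the first `num` rows carry label 1 and the last `num` rows carry label 0.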
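Because all feature values are numeric and none are missing, no special handling is needed. The feature ranges vary widely and the distributions have no clear bounds, so the data is standardized to zero mean and unit variance.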
# normalize the data
from sklearn import preprocessing
index, columns = train.axes
# Note: scale() standardizes every column of train, including the "Choice" label column
train = preprocessing.scale(train)
train = pd.DataFrame(data=train, index=index, columns=columns)
print(train.describe())
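For reference, the standardization performed by `preprocessing.scale` is z = (x - mean) / std per column, using the population standard deviation. A minimal sketch on an invented two-column array:

```python
import numpy as np
from sklearn import preprocessing

# Made-up features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Column-wise z-score; NumPy's std() defaults to the population form (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = preprocessing.scale(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so features with large raw ranges no longer dominate.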
Because I don't have much insight into the features, I don't construct any new ones and go straight to a correlation analysis.
# select better feature
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
features = train.axes[1][1:12]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[features], train["Choice"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores.
plt.bar(range(len(features)), scores)
plt.xticks(range(len(features)), features, rotation='vertical')
plt.show()
As the figure shows, the three features with the lowest relevance are dropped, which leaves:
features = ["A_follower_count","A_listed_count","A_mentions_sent","A_retweets_sent","A_posts","A_network_feature_1","A_network_feature_2","A_network_feature_3"]
1) Since this is a binary classification problem, first try the simplest model, linear regression.
# Linear regression
from sklearn.linear_model import LinearRegression
# scikit-learn also has a helper that makes it easy to do cross validation
from sklearn.model_selection import KFold

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for this dataset. split() yields the row indices corresponding to train and test.
# shuffle=True mixes the rows before splitting; no random_state is set, so the splits differ from run to run.
kf = KFold(n_splits=5, shuffle=True)
predictions = []
for train_item, test_item in kf.split(train):
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = train[features].iloc[train_item, :]
    # The target we're using to train the algorithm.
    train_target = train["Choice"].iloc[train_item]
    # Train the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(train[features].iloc[test_item, :])
    predictions.append(test_predictions)

# Linear regression evaluation
# The predictions are in one numpy array per fold. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1.
predictions[predictions <= .5] = .0
# Note: predictions are in shuffled fold order here, while train["Choice"][i] follows the original row order.
accuracy = .0
for i in range(len(predictions)):
    if predictions[i] == train["Choice"][i]:
        accuracy += 1.
accuracy /= len(predictions)
print(accuracy)
- output: 0.0250909090909
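Accuracy this low is quite a blow, and I started to question everything.
When running this algorithm, pay attention to KFold first. I shuffle before splitting because the first half of my dataset is all 1s and the second half is all 0s, so splitting it directly would sample the classes unevenly.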
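The effect of shuffling on a label-sorted dataset can be seen on a toy example. This sketch uses `KFold` from `sklearn.model_selection` (the current scikit-learn location, which may differ from the version used in this post); the 12 labels and the `random_state` value are made up.

```python
import numpy as np
from sklearn.model_selection import KFold

# Label-sorted toy data: six 1s followed by six 0s, like the split training set.
y = np.array([1] * 6 + [0] * 6)
X = np.zeros((len(y), 1))   # features are irrelevant for this illustration

# Without shuffling, folds are contiguous blocks of rows.
plain = [y[test] for _, test in KFold(n_splits=3).split(X)]

# With shuffling, rows are permuted before being cut into folds.
shuffled_idx = [test for _, test in
                KFold(n_splits=3, shuffle=True, random_state=0).split(X)]
shuffled = [y[test] for test in shuffled_idx]

print("plain:   ", [fold.tolist() for fold in plain])
print("shuffled:", [fold.tolist() for fold in shuffled])
```

Without shuffling, the first test fold contains only 1s and the last only 0s, so a model trained on the remaining rows sees a skewed class balance.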
2) Don't give up. Next, try a random forest.
# Random forest
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Initialize our algorithm with a fixed seed and a few explicitly set parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
scores = cross_val_score(alg, train[features], train["Choice"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
- output:0.691273202642
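Accuracy went up by more than an order of magnitude, which honestly stunned me.
Then I started to wonder: why is the gap between linear regression and random forest so large?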
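One thing worth checking, assuming the code ran exactly as listed: `preprocessing.scale` was applied to the whole `train` table, label column included. The sketch below uses an invented, balanced label column and invented predictions to show what that does to 0/1 labels and to an equality-based accuracy check. It is an illustration only and does not reproduce the numbers reported above.

```python
import numpy as np
from sklearn import preprocessing

# Made-up balanced labels: first half 1, second half 0.
choice = np.array([1.0] * 4 + [0.0] * 4)

# Standardizing a balanced 0/1 column (mean 0.5, std 0.5) turns it into +1/-1.
scaled_choice = preprocessing.scale(choice.reshape(-1, 1)).ravel()

# Hypothetical regression outputs after thresholding to 0/1.
thresholded = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# Equality against the scaled labels: a predicted 0 can never match -1.
acc_vs_scaled = float(np.mean(np.isclose(thresholded, scaled_choice)))
# Equality against the original 0/1 labels, for comparison.
acc_vs_original = float(np.mean(thresholded == choice))

print(scaled_choice.tolist(), acc_vs_scaled, acc_vs_original)
```

Under these assumptions, only predictions of 1 on positive rows can ever count as correct in the hand-written accuracy loop. A classifier such as the random forest treats -1 and +1 simply as two class names, so it would be unaffected. The loop also compares predictions in shuffled fold order against labels in the original row order, which may depress the linear regression figure further.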