Kaggle: Influencers in Social Networks

Dataset

  • Each datapoint describes two individuals, A and B.
  • For each person, 11 pre-computed, non-negative numeric features based on Twitter activity
    • such as volume of interactions, number of followers, etc.

[Figure 1: preview of the training set]

The figure above shows the training set. The first element (column 0) of each row is the label, and the next 22 elements are the 11 features of A followed by the 11 features of B, for 23 columns in total.

One peculiarity of the training set is that a node does not appear in only one row: for example, A is compared with B in one row and with C in another, so it pays to understand the data thoroughly.

[Figure 2: example of the expected prediction output]

This is an example of the prediction output. Id is the row number of the compared node pair in the test set, and Choice is the predicted probability that A is the more influential one: a value > 0.5 means A is more influential than B, and a value below 0.5 means B is more influential.
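For reference, a minimal sketch of how such a file could be written with pandas. The prediction values below are placeholders, and the assumption (not stated in the original) is that Ids simply count the test rows starting at 1:

# write a submission file in the expected Id / Choice format (sketch with placeholder values)
import pandas as pd
predictions = [0.7, 0.3, 0.55]                        # placeholder probabilities, one per test row
submission = pd.DataFrame({
    "Id": range(1, len(predictions) + 1),             # row number of the node pair in the test set
    "Choice": predictions,                            # predicted probability that A is more influential
})
submission.to_csv("submission.csv", index=False, columns=["Id", "Choice"])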

Goal

  • The binary label represents a human judgement about which one of the
    two individuals is more influential.
    • A label ‘1’ means A is more influential than B.
    • 0 means B is more influential than A.
  • The goal of the challenge is to train a machine learning model which, for pairs of individuals, predicts the human judgement on who is more influential with high accuracy.

Evaluation

  • Submissions are judged on area under the ROC curve.

In python (using the metrics module of scikit-learn):

fpr, tpr, thresholds = metrics.roc_curve(true_labels, predictions, pos_label=1)
auc = metrics.auc(fpr,tpr)
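A minimal self-contained version of the same computation, using made-up labels and scores just to show the call pattern:

# toy AUC computation with the same two calls
from sklearn import metrics
true_labels = [0, 0, 1, 1]
predictions = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = metrics.roc_curve(true_labels, predictions, pos_label=1)
auc = metrics.auc(fpr, tpr)
print(auc)   # 0.75 for these toy values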

Wikipedia describes the ROC curve as follows:

In signal detection theory, a receiver operating characteristic (ROC) curve is a graphical analysis tool used to:
(1) select the best signal-detection model and discard the suboptimal ones;
(2) set the optimal threshold within a single model.
When making decisions, ROC analysis gives objective, neutral advice that is unaffected by cost/benefit considerations.

For example, suppose blood pressure readings are used to test whether a person has hypertension. The measured value is a continuous real number; with a threshold of 140 systolic / 90 diastolic, readings at or above the threshold are diagnosed as hypertension, and readings below it are diagnosed as no hypertension.

  • A binary classification model's prediction for a single case has four possible outcomes:
    • True positive (TP): diagnosed as hypertensive and actually hypertensive.
    • False positive (FP): diagnosed as hypertensive but actually not.
    • True negative (TN): diagnosed as not hypertensive and actually not.
    • False negative (FN): diagnosed as not hypertensive but actually hypertensive.
  • TPR: the proportion of all actual positives that are correctly classified as positive.
    • TPR = TP/(TP+FN)
  • FPR: the proportion of all actual negatives that are incorrectly classified as positive (see the small sketch after this list).
    • FPR = FP/(FP+TN)
  • ROC space uses the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis. Given a binary classification model and a threshold, a single coordinate point (X=FPR, Y=TPR) can be computed from the true (positive/negative) labels and the predictions for all samples.
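A small, self-contained sketch of these two rates on made-up diagnosis results (1 = hypertensive, 0 = not):

# toy computation of TPR and FPR from true and predicted labels
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
tpr = float(tp) / (tp + fn)   # TPR = TP/(TP+FN) -> 0.75 here
fpr = float(fp) / (fp + tn)   # FPR = FP/(FP+TN) -> 0.25 here
print(tpr)
print(fpr)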

Plotting an ROC curve: given m positive examples and n negative examples, sort the samples by the learner's predicted scores. First set the classification threshold to the maximum, i.e. predict every sample as negative, so that both the true positive rate and the false positive rate are 0, and mark a point at (0, 0). Then lower the threshold to each sample's predicted score in turn, so that one more sample is classified as positive each time, and mark the corresponding point; connecting adjacent points with line segments gives the ROC curve.
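A rough sketch of this construction on toy scores (assuming the samples are already sorted by decreasing score and there are no ties; in practice one would simply call metrics.roc_curve):

# manual ROC-curve construction: lower the threshold one sample at a time
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.1]   # predicted scores, sorted high to low
labels = [1,   1,   0,   1,    1,   0,   0,   0]     # true labels of the same samples
m = sum(labels)                 # number of positive examples
n = len(labels) - m             # number of negative examples
points = [(0.0, 0.0)]           # threshold above every score: everything predicted negative
tp = fp = 0
for score, label in zip(scores, labels):
    if label == 1:              # this sample now falls above the threshold and counts as predicted positive
        tp += 1
    else:
        fp += 1
    points.append((float(fp) / n, float(tp) / m))    # (FPR, TPR) after lowering the threshold to this score
print(points)                   # connect consecutive points with line segments to get the ROC curve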

The diagonal from (0, 0) to (1, 1) divides ROC space into an upper-left and a lower-right region. Points above this line represent good classification results (better than random guessing), while points below it represent poor results (worse than random guessing).

  • A perfect prediction is the point in the upper-left corner of ROC space, at coordinates (0, 1): X = 0 means no false positives and Y = 1 means no false negatives.

Data Preprocessing

First, take a look at the data.

import pandas as pd
data = pd.read_csv('train.csv')
print data.describe()

Since each row of the given data represents a pair of nodes, we need to split each row into two and assign each half a label. For example, if the first row has Choice = 0, B is more influential than A, so after splitting, A gets label 0 and B gets label 1.

  • First open the csv file with pandas and store it as data.
  • Then convert data into an ndarray with as_matrix and store it as data_matrix.
  • Create an ndarray named train of shape [11000, 12] to hold all the data points and their labels.
  • Loop over every row of data_matrix, storing the more influential node in the first 5500 rows of train and the other node in the last 5500 rows.
  • Finally convert train back into a DataFrame; its first column is still "Choice", and A and B share the same feature names, which makes it convenient to store.
# load train data
import pandas as pd
import numpy as np
data = pd.read_csv('train.csv')
num = len(data)
data_matrix = data.as_matrix()
train = np.zeros((num*2,12)) 
for i in range(num):  
    if data_matrix[i][0] == 1.: # A is more influential: A goes into the first half (label 1)
        train[i][0] = 1
        train[i][1:] = data_matrix[i][1:12]
        train[i+num][0] = 0
        train[i+num][1:] = data_matrix[i][12:23]
    else:                       # B is more influential: B goes into the first half (label 1)
        train[i][0] = 1
        train[i][1:] = data_matrix[i][12:23]
        train[i+num][0] = 0
        train[i+num][1:] = data_matrix[i][1:12]
train = pd.DataFrame(data=train, index=range(2*num), columns=data.axes[1][0:12]) 
print train.describe() 

[Figure 3: describe() output of the reshaped training set]
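As a quick, optional sanity check (reusing num and train from the block above), the label column should now contain 5500 ones followed by 5500 zeros:

# sanity check: the first num rows should carry label 1 and the last num rows label 0
print(train["Choice"].value_counts())       # expect 5500 rows of 1.0 and 5500 rows of 0.0
print(train["Choice"].iloc[:num].mean())    # expect 1.0 (first half all labeled 1)
print(train["Choice"].iloc[num:].mean())    # expect 0.0 (second half all labeled 0)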

Since all feature values are numeric and there are no missing values, no special handling is needed. However, the feature values vary over a wide range and their distributions have no obvious boundaries, so we standardize them to zero mean and unit variance.

  • First use the DataFrame's axes attribute to save the training set's index and columns.
  • Then normalize the training data with sklearn's preprocessing package; calling scale performs zero-mean, unit-variance standardization.
  • Since scale returns an ndarray, convert the result back into a DataFrame.
# normalize the data
from sklearn import preprocessing
index,columns = train.axes
train = preprocessing.scale(train)  # note: this standardizes every column of train, including "Choice"
train = pd.DataFrame(data=train, index=index, columns=columns)   
print train.describe()

[Figure 4: describe() output after standardization]
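For intuition, here is a minimal sketch of what this standardization does to a single toy column: each value becomes z = (x - mean) / std, so the column ends up with mean 0 and standard deviation 1 (numpy's default population std, ddof=0, matches what preprocessing.scale uses):

# what preprocessing.scale does to one column, written out by hand on toy values
import numpy as np
x = np.array([10., 200., 3000., 40000.])   # toy column with a wide range of values
z = (x - x.mean()) / x.std()               # zero-mean, unit-variance standardization
print(z)
print(z.mean())                            # approximately 0
print(z.std())                             # approximately 1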

Feature Selection

Since we do not have much insight into the features, we do not construct new ones and instead go straight to a relevance analysis.

# select better feature
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
features = train.axes[1][1:12]
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(train[features], train["Choice"])
# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores. 
plt.bar(range(len(features)), scores)
plt.xticks(range(len(features)), features, rotation='vertical')
plt.show()


As shown in the figure, the three features with the lowest relevance are dropped, leaving:

features = ["A_follower_count","A_listed_count","A_mentions_sent","A_retweets_sent","A_posts","A_network_feature_1","A_network_feature_2","A_network_feature_3"]

Building a Classifier

1) Since this is a binary classification problem, first try the simplest approach, linear regression.

# Linear regression
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the dataset. It returns the row indices corresponding to train and test.
# We shuffle before splitting because the rows are ordered by label (all 1s first, then all 0s).
kf = KFold(train.shape[0], n_folds=5, shuffle=True)
predictions = []
for train_item, test_item in kf:
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = (train[features].iloc[train_item,:])
    # The target we're using to train the algorithm.
    train_target = train["Choice"].iloc[train_item]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(train[features].iloc[test_item,:])
    predictions.append(test_predictions)

# Linear regression evaluation
# The predictions are in five separate numpy arrays. Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1.
predictions[predictions <= .5] = .0
accuracy = .0
for i in range(len(predictions)):
    if predictions[i] == train["Choice"][i]:
        accuracy += 1.
accuracy /= len(predictions)
print accuracy
  • output: 0.0250909090909

Such a low accuracy was quite a blow, and I started to question everything.

When running this algorithm, pay attention to the KFold step: I shuffle the data before splitting because the first part of my dataset is all 1s and the rest is all 0s; splitting without shuffling would give unevenly sampled folds.
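A small sketch of why the shuffle matters, on toy labels ordered the same way as the training set (all 1s first, then all 0s); the fold sizes and random_state below are illustrative assumptions:

# compare fold composition with and without shuffling on labels ordered 1...1 0...0
from sklearn.cross_validation import KFold
labels = [1] * 10 + [0] * 10
for shuffle in (False, True):
    kf = KFold(len(labels), n_folds=4, shuffle=shuffle, random_state=1)
    fold_means = [sum(labels[i] for i in test_idx) / float(len(test_idx))
                  for _, test_idx in kf]
    print(fold_means)   # fraction of positive labels per test fold; without shuffle this is [1.0, 1.0, 0.0, 0.0]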

2) Don't give up; next, try a random forest.

# Random forest 
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, train[features], train["Choice"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
  • output: 0.691273202642

The accuracy jumped by more than an order of magnitude, which honestly stunned me.
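Since the competition itself is scored on ROC AUC rather than plain accuracy, it may also be worth cross-validating with scoring='roc_auc'. A sketch reusing alg, train, and features from above; note that the standardization step earlier also rescaled the "Choice" column, so it is converted back to 0/1 labels here:

# cross-validated ROC AUC for the same random forest (the metric the competition uses)
binary_choice = (train["Choice"] > 0).astype(int)   # undo the scaling applied to the label column
auc_scores = cross_validation.cross_val_score(alg, train[features], binary_choice,
                                              cv=3, scoring='roc_auc')
print(auc_scores.mean())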

That got me thinking: why is the gap between linear regression and random forest so large?
