大数据挑战赛-鼠标轨迹识别

声明:本文属于原创,如想转载,请务必在抬头注明出处。

大数据挑战赛-鼠标轨迹识别,竞赛官网:http://bdc.saikr.com/c/cql/34541

1.我们看一下整个竞赛的详情

赛题描述

      鼠标轨迹识别当前广泛运用于多种人机验证产品中,不仅便于用户的理解记忆,而且极大增加了暴力破解难度。但攻击者可通过黑产工具产生类人轨迹批量操作以绕过检测,并在对抗过程中不断升级其伪造数据以持续绕过同样升级的检测技术。我们期望用机器学习算法来提高人机验证中各种机器行为的检出率,其中包括对抗过程中出现的新的攻击手段的检测。

比赛数据

     本题目数据来源于某人机验证产品采集的鼠标轨迹,经过脱敏处理,数据分为3部分(数据量分别为3000条,10万,200万)。

     赛事分为三个阶段(初赛、复赛、决赛答辩):5月26日初赛提供3000条数据作为训练样本,供参赛者下载进行建模和模型优化,同时提供10万条正式比赛数据供下载评测,识别结果为初赛得分;复赛提供200万条比赛数据(不可下载,数据不可见,仅供评测),识别结果为复赛得分;决赛将以现场答辩会的形式进行。

【训练数据】

训练数据表名称:dsjtzs_txfz_training

字段

类型

解释

a1

bigint

编号id

a2

string

鼠标移动轨迹(x,y,t)

a3

string

目标坐标(x,y)

label

string

类别标签:1-正常轨迹,0-机器轨迹

训练样例数据:见 dsjtzs_txfz_training_sample.txt

【测试数据】

初赛测试表名称:dsjtzs_txfz_test1

复赛测试表名称:dsjtzs_txfz_test2

字段

类型

解释

a1

bigint

编号id

a2

string

鼠标移动轨迹(x,y,t)

a3

string

目标坐标(x,y)

测试样例数据:见 dsjtzs_txfz_test_sample.txt

测评标准

选手请将识别为机器行为的编号id提交到计算平台,需要提交的结果表,只包含一个字段:编号id。

初赛提交表名:dsjtzs_txfzjh_preliminary

复赛提交表名:dsjtzs_txfzjh _semifinal

设定Precision为P,Recall为R,白样本为正常轨迹,黑样本为机器轨迹其中:

P = 判黑的数据中真正为黑的数量/判黑的数据总量,

R = 判黑的数据中真正为黑的数量/真实黑数据总量,

比如10w条数据,其中8w条为白样本,2w条为黑样本,参赛者一共将1w条判断为黑样本(其中真正的黑样本有8000条,错将2000条白样本判黑),那么,

P=8000/10000 = 80%,

R=8000/20000=40%,

参赛队伍最终得分F = 5PR/(2P+3R)*100。最终排名按照F值评判,F值越大,代表结果越优,排名越靠前。

2.竞赛的进程安排

初赛(5月26日—7月21日)

(1)参赛队伍可从大赛官方网站下载数据,在本地进行算法设计和调试,规定时间内在报名官网提交结果,每支队伍在一天内只能提交一次结果;

(2)5月26日起,系统向选手开放训练样本数据3000条(2600白样本,400条黑样本)供参赛者下载进行建模和模型优化,同时提供正式比赛数据10万条供参赛者下载评测;

复赛(7月25日-8月14日)

(1)所有比赛数据不可下载,选手需在腾讯数据平台部DIX平台上完成数据处理、建模、算法调试、产出结果等所有环节,可使用基于Spark、xgBoost及平台提供的机器学习相关基础算法。

(2)7月25日起系统提供200万条正式比赛数据(对参赛选手不可见,仅供平台对参赛作品进行评测);

决赛(8月20日)

1.  决赛将以现场答辩会的形式进行,具体安排另行通知;

2.  参赛队伍应提前准备现场答辩材料,包括PPT、算法代码;

综上所述:

每个竞赛的阶段数据集情况
比赛阶段 训练集(条) 测试集(条)
初赛(stage1) 3000 100000
初赛(stage2) 3000 100000
复赛 3000 2000000

3.训练数据和测试数据如下所示:

训练数据

70 276,2555,1234;290,2555,1261;339,2555,1306;374,2555,1357;409,2555,1405;430,2555,1456;451,2555,1567;451,2555,1879;458,2555,2338;479,2555,2365;507,2555,2404;591,2555,2458;745,2568,2509;801,2568,2557;822,2568,2608;829,2568,2656; 643.5,553 1
75 262,2503,316;262,2516,376;297,2516,406;353,2516,439;416,2490,472;493,2490,502;605,2477,532;717,2425,565;794,2412,598;857,2412,628;934,2412,664;955,2412,691;990,2412,724;1018,2412,757;1025,2412,787;1039,2412,880;1060,2412,913;1067,2412,946;1095,2386,1006;1109,2386,1069;1123,2386,1129;1130,2386,1312;1123,2386,2035;1109,2386,2215;1095,2386,2395;1088,2386,2608;1074,2386,2638;1067,2386,2671;1060,2386,2704;1053,2386,2764;1046,2399,2797;1039,2399,3937; 843.0,358 1
249 612,2607,352;836,2607,724;871,2607,1165;885,2607,2341;899,2607,2797;913,2607,3328;927,2607,3706;934,2607,4162;941,2607,4621;948,2607,5053;969,2607,5629;969,2620,8749;962,2620,10234;920,2620,11749;913,2633,12625;920,2633,15493;934,2633,16405;948,2633,16717;976,2633,17821; 727.5,189 1
2983 325,2438,484;381,2464,1201;458,2555,1276;479,2555,1312;528,2451,1396;570,2555,1501;640,2555,1606;703,2555,1699;745,2451,1723;801,2542,3730;829,2555,3796;892,2555,3841;920,2555,3907;976,2516,5986;1004,2555,9784;1046,2555,14935;1074,2555,17041;1088,2529,17083;1137,2555,22609;1193,2555,22642;1207,2555,27553;1235,2555,27619;1256,2555,32953;1270,2412,33028;1319,2425,33052;1333,2464,38623;1333,2555,38704; 1123.0,195.5 0
2984 297,2529,193;311,2594,2251;346,2594,6838;381,2594,10930;437,2594,10951;486,2594,15874;528,2594,15895;556,2672,21235;591,2633,21283;612,2594,21337;633,2594,21367;647,2503,25906;682,2568,28411;689,2594,28480;731,2594,33868;745,2594,33931; 528.0,670 0
2985 269,2542,217;311,2555,2650;346,2555,7423;367,2555,7513;395,2555,7594;402,2555,7633;444,2555,7669;472,2555,7735;528,2555,7813;570,2555,7837;605,2555,7906;640,2685,8014;654,2594,8122;689,2555,8170;696,2698,8221;773,2594,8266;794,2555,8305;808,2555,8371; 601.5,358 0

其中包含三个正样本(label=1)和三个副样本(label=0),没一行代表一个样本。从数据中可以看到,第一列都是轨迹的标号,如70,75,249,2983等等,而没一行的最后一列,为0或1,表示样本轨迹的label。

测试集数据

1 234,2620,196;241,2620,226;248,2620,256; 647.0,189
5 374,2503,1174;402,2503,1204;430,2503,1252;479,2503,1267;514,2503,1297;563,2503,1330;619,2503,1363;654,2503,1399;682,2503,1423;724,2503,1456;745,2503,1492;766,2503,1522;773,2503,1549;780,2503,1762;794,2503,2800;808,2503,2830;822,2503,2863;843,2503,2893;857,2503,2956;864,2503,3016; 545.5,189
67 514,2347,46;591,2373,76;668,2399,103;745,2425,130;822,2451,157;899,2477,187;976,2503,214;1053,2529,241;1130,2555,265;1207,2581,292;1284,2607,319;1361,2568,346;1438,2529,373;1515,2490,400;1592,2451,424; 1424.0,1359
92 486,2334,40;535,2334,61;584,2334,82;633,2334,103;682,2334,121;731,2334,142;780,2347,163;829,2360,184;878,2373,202;927,2386,223;976,2386,241;1025,2399,262;1074,2412,280;1123,2425,301;1172,2438,319;1221,2412,340;1270,2386,358;1319,2360,379;1368,2334,397;1417,2308,418;1466,2282,436;1515,2256,457; 1312.0,189
93 234,2386,61;234,2438,160;234,2464,181;241,2477,229;255,2490,280;269,2490,316;318,2503,364;367,2516,421;416,2516,469;458,2516,511;493,2516,562;514,2503,613;535,2503,664;556,2490,715;563,2490,766;563,2477,808;570,2477,847;577,2477,1102;584,2477,1150;605,2477,1213;619,2477,1261;668,2477,1312;766,2477,1363;878,2477,1411;1053,2477,1462;1144,2477,1513;1235,2477,1561;1284,2477,1615;1312,2477,1660;1326,2477,1711;1333,2477,1762;1347,2477,1825;1361,2477,1861;1382,2477,1915;1403,2477,1963;1438,2477,2011;1452,2477,2062;1466,2477,2110;1473,2477,2161;1466,2477,2413;1459,2477,2437;1466,2477,3091;1473,2477,3181;1480,2477,3253;1487,2477,3355;1494,2477,3412;1501,2477,3463;1508,2477,3517;1515,2477,3589;1501,2477,4237; 1361.0,202

此处共给出5个样本,测试集除了最后一列没有label外,其他的跟训练集一样。

现在我们仔细解析一下数据样本的结构:

根据第一部分的竞赛题目,以及对数据集的解析。我们可以明白,所有的样本都由一个序列构成的,我们选择训练样本中的id=75的样本来分析一下:

75 262,2503,316;262,2516,376;297,2516,406;353,2516,439;416,2490,472;493,2490,502;605,2477,532;717,2425,565;794,2412,598;857,2412,628;934,2412,664;955,2412,691;990,2412,724;1018,2412,757;1025,2412,787;1039,2412,880;1060,2412,913;1067,2412,946;1095,2386,1006;1109,2386,1069;1123,2386,1129;1130,2386,1312;1123,2386,2035;1109,2386,2215;1095,2386,2395;1088,2386,2608;1074,2386,2638;1067,2386,2671;1060,2386,2704;1053,2386,2764;1046,2399,2797;1039,2399,3937; 843.0,358 1

第一列75为样本id,最后一列是label=1,表明是正样本,中间是一串由分号作为分隔符的序列,分隔的每一部分包含三个数值如一个部分为(262,2503,316),根据题目可知,这里面的三个值对应(x,y,t)。根据题目的背景,整个序列是一个鼠标轨迹,那么,轨迹在采样的过程中,就是通过一个个点构成,所以(x,y,t)就是一个点的位置和时间参数。

另外,需要格外指出的是,每条轨迹坐标后面紧接着有一个由逗号分隔开的坐标如 843.0,358,该坐标就构成了这条轨迹的目标点。根据题目的意思,也就是说,每条轨迹最终的目的都是想挪动到目标点位置的。

 

4.样本轨迹的特征提取

看整个样本数据集,不同的轨迹长度是不一样的,或者说采样点的个数是不一样的。当我们最初接触到这个题目的时候,我们本想把整个轨迹点作为特征,直接用Xgboost训练,后来发现尴尬了,轨迹长度都不一样,还怎么训练。

思路一:

其实,既然有了不同的轨迹,也就是不同的序列,可以有一种很常见的做法,就是直接使用长短期记忆网络(LSTM)来进行训练并预测,基本上可以无脑训练了。但是,由于官方给出的建议是不推荐使用深度学来打比赛,因为最后决赛提供的腾de讯的机器学习平台不给开放深度学习资源。所以这种方法,只是从脑子里面过了一下,没有去执行。

思路二:

既然,给的是每条曲线的轨迹,我们完全可以把每条轨迹做图然后生成图片保存起来,然后通过深度学习,进行图像的分类。因为训练集本身比较少(才3000条),而测试集却相当大(初赛10万条、复赛200万条),所以可以想到使用迁移学习的方法,将已经在较大数据集如ImageNet、Ms COCO上预训练好的模型,如ResNet、Inception-V3等等来初始化自己的网络,然后在自己的数据集上进行fine-tuning。这确实不失为一种可以尝试的方法。但是,同样的原因,官方不提供深度学习资源。就我们自己的破笔记本,根本想都别想。

思路三:

前面的方法都行不通了,作为菜鸟级选手,我只能想到通过提取每条样本轨迹的统计特征了。如平均值,方差,极差,偏差,最大值,最小值,中值。然后,既然有了轨迹,就会有速度和加速度,所以可以继续求出速度的前述统计特征,当然还可以想到,如果当作一个时间序列的话,我们可以计算序列的一阶差分二阶差分,然后继续计算统计特征。这么一下来,每条轨迹的统计特征已经不少了,足够进行Xgboost训练了。

为了纪念一下整个竞赛的不容易,下面贴出我写的提取特征的baseline(简单看看就行,可直接略过): 

# -*- coding: utf-8 -*-
import pandas as pd
import fileinput
import numpy as np
import matplotlib.pyplot as plt
import sys
import os
sys.path.append(os.getcwd() + '/scr')
from subFunction import *


train_path = './data/dsjtzs_txfz_training.txt'
test_path = './data/dsjtzs_txfz_test1.txt'



mouseTrack_features_labels = ['numRecode', 'xmean', 'ymean', 'xEnt', 'yEnt', 'MaxTimeInterval', 'MinTimeInterval',
                              'tailXdiff', 'tailYdiff', 'tailTdiff', 'tailDis_xy', 'Vmean', 'Vmax', 'Vmin', 'Vvar',
                              'Vstd', 'Accmean', 'Accmax', 'Accmin', 'Accvar', 'Accstd', 'LastT20var',
                              'LastT20std',  'Vcov', 'VCorrelationCoefficient', 'XdiffVar', 'YdiffVar']
featureLabels = ['id']
featureLabels.extend(mouseTrack_features_labels)
featureLabels.extend(['YTarget', 'XTarget', 'label'])

#获取x,y的平均值 .2个特征
def get_XYmean(df_MouseTrack):
    x = df_MouseTrack["x"]
    y = df_MouseTrack["y"]
    xmean = np.mean(x)
    ymean = np.mean(y)
    xymean = [xmean, ymean]
    return xymean

#获取x,y的熵 2个特征
def get_XYentropy(df_MouseTrack):
    x = df_MouseTrack["x"]
    y = df_MouseTrack["y"]
    xEnt = calcShannonEnt(x)
    yEnt = calcShannonEnt(y)
    xyEnt = [xEnt, yEnt]
    return xyEnt

#获取间隔时间的最大值和最小值 2个特征
def get_MaxMinT(df_MouseTrack):
    if len(df_MouseTrack) < 2:
        maxT = df_MouseTrack['t'].max()
        minT = df_MouseTrack['t'].min()
    else:
        t = df_MouseTrack['t']
#获取最后一个时间段x,y,t的差分值,以及两个点之间的欧氏距离 4个特征
def get_tailFeature(df_MouseTrack):
    if len(df_MouseTrack) < 2:
        xdiff = 0
        ydiff = 0
        tdiff = 0
        dis = 0
        diff_and_Dis = [xdiff, ydiff, tdiff, dis]
    else:
        xdiff = calDiffenceResult(df_MouseTrack['x'])
        ydiff = calDiffenceResult(df_MouseTrack['y'])
        tdiff = calDiffenceResult(df_MouseTrack['t'])
        dis = np.sqrt(xdiff[-1]**2 + ydiff[-1]**2)
        diff_and_Dis = [xdiff[-1], ydiff[-1], tdiff[-1], dis]
    return diff_and_Dis

#速度的平均值,最大值,最小值,方差,标准差 5个特征
def get_xv_var(df_MouseTrack):
    if len(df_MouseTrack) < 2:
        speedMean = 0
        speedMax = 0
        speedMin = 0
        speedVar = 0
        speedStd = 0
    else:
        speed = calSpeed(df_MouseTrack)
        speedMean = speed.mean()
        speedMax = speed.max()
        speedMin = speed.min()
        speedVar = speed.var()
        speedStd = speed.std()
    speedFeat = [speedMean, speedMax, speedMin, speedVar, speedStd]
    return speedFeat


#加速度的平均值,最大值,最小值,方差,标准差 5个特征
def get_Acc_feat(df_MouseTrack):
    if len(df_MouseTrack) < 3:
        meanAcc = 0
        maxAcc = 0
        minAcc = 0
        varAcc = 0
        stdAcc = 0
    else:
        t = df_MouseTrack['t']
        t1 = np.array(t[0:-1])
        t2 = np.array(t[1:])
        v_t = (t1 + t2)/2
        v_tdiff = calDiffenceResult(v_t)
        speed = calSpeed(df_MouseTrack)
        SPdiff = calDiffenceResult(speed)
        Accelearation = SPdiff/v_tdiff
        if len(Accelearation) == 0:
            end = 1
        meanAcc = Accelearation.mean()
        maxAcc = Accelearation.max()
        minAcc = Accelearation.min()
        varAcc = Accelearation.var()
        stdAcc = Accelearation.std()
        # reciprocalAcc = 1/Accelearation
    AccFeat = [meanAcc, maxAcc, minAcc, varAcc, stdAcc]
    return AccFeat


#采样最后20段时间的方差和标准差 2个特征
def get_t_last20_var(df_MouseTrack):
    tdiff = calDiffenceResult(df_MouseTrack['t'])
    if len(tdiff) >= 20:
        useTdiff = tdiff[-20:]
    else:
        useTdiff = tdiff
    Tvar = useTdiff.var()
    Tstd = useTdiff.std()
    T20feat = [Tvar, Tstd]
    return T20feat

#记录速度的协方差,及相关系数 2个特征
def get_vx_cov_reverse(df_MouseTrack):
    if len(df_MouseTrack) < 4:
        CovXY = 0
        CorrelationCoefficient = 0
    else:
        v = calSpeed(df_MouseTrack)
        v1 = v[0:-1]
        v2 = v[1:]
        vCov = np.cov(v1, v2) #协方差矩阵
        CovXY = vCov[0, 1] #v1和v2的协方差
        CorrelationCoefficient = CovXY/(np.sqrt(vCov[0, 0])*np.sqrt(vCov[1, 1])) #求解相关系数
    vFeat = [CovXY, CorrelationCoefficient]
    return vFeat

#水平和垂直位移的方差 2个特征
def get_XYvar(df_MouseTrack):
    xdiff = calDiffenceResult(df_MouseTrack['x'])
    ydiff = calDiffenceResult(df_MouseTrack['y'])
    XDiffvar = xdiff.var()
    YDiffvar = ydiff.var()
    disVar = [XDiffvar, YDiffvar]
    return disVar

#时间噪声
def get_t_noisiness(df_MouseTrack):
    end = 1

#获取鼠标轨迹特征
def getFeatures(df_MouseTrack):
    m = len(df_MouseTrack)
    features = []
    features.append(m)

    XYmean = get_XYmean(df_MouseTrack)
    features.extend(XYmean)

    XYEnt = get_XYentropy(df_MouseTrack)
    features.extend(XYEnt)
    MaxMinT = get_MaxMinT(df_MouseTrack)
    features.extend(MaxMinT)

    tailfeat = get_tailFeature(df_MouseTrack)
    features.extend(tailfeat)

    vfeat = get_xv_var(df_MouseTrack)
    features.extend(vfeat)

    accfeat = get_Acc_feat(df_MouseTrack)
    features.extend(accfeat)

    last20tVar = get_t_last20_var(df_MouseTrack)
    features.extend(last20tVar)

    diffVfeat = get_vx_cov_reverse(df_MouseTrack)
    features.extend(diffVfeat)

    xyVar = get_XYvar(df_MouseTrack)
    features.extend(xyVar)

    return features


def make_train_data():

    traindata = pd.DataFrame(np.random.randn(1, len(featureLabels)), columns=featureLabels)
    for i, line in enumerate(fileinput.input(train_path)):
        features = []
        line = line.split()
        a1 = int(line[0]) #获取编号id
        features.append(a1)

        a2 = line[1].split(";")
        temp = [x.split(',') for x in a2]
        temp.pop()
        a2 = np.mat(temp, dtype=float)
        a2 = pd.DataFrame(a2, columns=list('xyt'))
        a2 = a2.groupby('t', as_index=False).first()
        if len(a2) < 2:
            continue
        a2_feature = getFeatures(a2)
        features.extend(a2_feature)

        a3 = line[2].split(',') #目标点的坐标
        a3_x = float(a3[0])
        a3_y = float(a3[1])
features.append(a3_x) #x坐标
        features.append(a3_y) #y坐标

        label = int(line[3])#标签
        features.append(label)

        traindata.ix[i] = pd.Series(np.array(features), index=traindata.columns)
    return traindata

def make_test_data():
    testFeatlabels = featureLabels
    testFeatlabels.pop() #测试样本数据集比训练样本集少一个标签“label”
    testdata = pd.DataFrame(np.random.randn(1, len(testFeatlabels)), columns=testFeatlabels)
    for i, line in enumerate(fileinput.input(test_path)):
        features = []
        line = line.split()
        a1 = int(line[0])#获取编号id
        features.append(a1)

        a2 = line[1].split(";")
        temp = [x.split(',') for x in a2]
        temp.pop()
        a2 = np.mat(temp, dtype=float)
        a2 = pd.DataFrame(a2, columns=list('xyt'))
        if len(a2) < 3:
            pause = 1
        a2 = a2.groupby('t', as_index=False).first()
        a2_feature = getFeatures(a2)
        features.extend(a2_feature)

        a3 = line[2].split(',') #目标点的坐标
        a3_x = float(a3[0])
        a3_y = float(a3[1])
        features.append(a3_x) #x坐标
        features.append(a3_y) #y坐标

        testdata.ix[i] = pd.Series(np.array(features), index=testdata.columns)
       

    return testdata
if __name__ == '__main__':
    traindata = make_train_data()
    testData = make_test_data()
    
    end = 1

说实话,就上面提取特征的方法,如果提取训练集的样本3000条,速度还可以,不是很慢,但是提取测试及的10万条就得花个半个小时,那就更别说最后复赛的200万条了,就是要命了。

5.特征提取实践

上一部分分析了,我们提取特征的思路,这一部分就来重点实现这个操作。就是要明白一点,我们的特征大多是统计特征,少部分是分析样本数据集之后,得出的特征,还有就是通过阅读文献得到的特征。最终提取的特征及说明如下:

特征及说明
特征 表示 说明
轨迹点个数 count 因为不同轨迹长短不一样,所以考虑使用,count来表征轨迹的长度
x的最小值 x_min 整个轨迹中,x坐标的最小值
轨迹x方向走一般花的时间比 x_ratio 因为鼠标轨迹在x方向上移动时,速度可能不叫均匀,所以行走到x方向一半所花的时间占整体时间比较大,但是如果人拖动的话,开始速度很快,后面快接近目标点的时候,会变慢,所以走一半的路花的时间会少一些。
y坐标的最小值 y_min  
y坐标的最大值 y_max  
x坐标一阶差分后标准差 x_diff_std  
x坐标一阶差分后的最大值 x_diff_max  
x坐标一阶差分后的最小值 x_diff_min  
x坐标一阶差分后的偏度 x_diff_skew  
y坐标一阶差分后的平均值 y_diff_std  
x坐标回退标记 x_back_num 因为有的轨迹在往x正方向移动的时候,是有一个目标点的,但是如果x移动越过了目标点,就会往回走。所以,通过该特征来表征轨迹是否有回退的情况
  DisPoint  
  Disx_forlat  
时间轴t差分后的均值 t_diff_mean  
t进行一阶差分后的标准差 t_diff_std  
总时间比上x轴总路程 duration_mean  
走一半路花的时间 timehalf  
所有相邻样本点距离最大值 xy_diff_std  
相邻样本点速度的标准差 vxy_std  
相邻样本点速度的平均值 vxy_mean  
轨迹最后两个样本点的速度 vxylast  
轨迹最开始两个样本点的速度 vxyfirst  
速度序列一阶差分的中值 vxy_diff_median  
速度序列一阶差分的平均值 vxy_diff_mean  
速度序列一阶差分的最大值 vxy_diff_max  
速度序列一阶差分的标准差 vxy_diff_std  
相邻点构成的角度序列的标准差 angle_std  
相邻轨迹点的角度序列的峰度 angle_kurt  
角度序列一阶差分的均值 angle_diff_mean  
角度序列一阶差分的标准差 angle_diff_std  
  Dis_pt2dst_diff_max  
  Dis_pt2st_diff_std  
  angle_upTriangle_num 样本轨迹中,相邻三个点构成的的上三角形的数量
  angle_downTriangle_num 样本轨迹中,相邻三个点构成的的下三角形的数量
     

特征提取代码如下(采用pyspark提取特征,速度非常快):

#coding=utf-8

import matplotlib.pyplot as plt
import numpy as np
from pyspark import SparkContext
import sys

output_file = sys.argv[1]
input_file  = sys.argv[2]

def get_TES_feat(element):
    feat = []
    data = element.split(" ")
    id = int(data[0])
    trace = data[1][:-1]
    trace = trace.split(';')
    trace = [[int(record.split(',')[0]), int(record.split(',')[1]), int(record.split(',')[2])] for record in trace]
    trace = np.array(trace)
    aim_x = float(data[2].split(',')[0])
    aim_y = float(data[2].split(',')[1])

    x = trace[:, 0]
    y = trace[:, 1]
    t = trace[:, 2]

    count = len(x)

    if len(x) == 1:
        x = np.array([x[0]]*3)
        y = np.array([y[0]]*3)
        t = np.array([t[0]]*3)
    elif len(x) == 2:
        x = np.array([x[0]]+[x[1]] * 2)
        y = np.array([y[0]]+[y[1]] * 2)
        t = np.array([t[0]]+[t[1]] * 2)

    x_min = x.min()
    x_ratio = 1.0*(x[len(x)-1] - x[0]) / len(x)
    y_min = y.min()
    y_max = y.max()

    x_diff = x[1:] - x[0:-1]
    y_diff = y[1:] - y[0:-1]
    t_diff = t[1:] - t[0:-1]

    x_diff_std = x_diff.std()
    x_diff_max = x_diff.max()
    x_diff_min = x_diff.min()
    x_diff_skew = ((x_diff**3).mean() - 3*x_diff.mean()*x_diff.var() - x_diff.mean()**3) / (x_diff.var() ** 1.5 + 0.000000001)

    y_diff_mean = np.fabs(y_diff[y_diff != 0]).mean()
    y_diff_std  = y_diff[y_diff != 0].std()

    x_back_num = (x_diff < 0).sum()

    DisPoint = 1.0 * sum((x_diff ** 2 + y_diff ** 2) ** 0.5) / len(x)
    Disx_forlat = sum(x_diff[0:len(x_diff) / 2]) / (sum(x_diff[len(x_diff) / 2:len(x_diff)]) + 0.000000001)

    t_diff_mean = t_diff.mean()
    t_diff_min = t_diff.min()
    t_diff_std = t_diff.std()
    duration_mean = 1.0 * (t[len(t) - 1] - t[0]) / len(x)
    timehalf = np.log1p((t[len(t) / 2] - t[0])) - np.log10(t[len(t) - 1] - t[len(t) / 2])

    xy_diff = (x_diff ** 2 + y_diff ** 2) ** 0.5
    xy_diff_max = xy_diff.max()

    Vxy = np.log1p(xy_diff) - np.log1p(t_diff)
    Vxy_diff = Vxy[1:] - Vxy[0:-1]
    Vxy = Vxy[(Vxy > 0) | (Vxy < 1)]
    Vxy_diff = Vxy_diff[(Vxy_diff > 0) | (Vxy_diff < 1)]
    if len(Vxy) < 1:
        vxy_std = 0
        vxy_mean = 0
        vxyfirst = 0
        vxylast = 0
    else:
        vxy_std = Vxy.std()
        vxy_mean = Vxy.mean()
        vxyfirst = Vxy[0]
        vxylast = Vxy[len(Vxy) - 1]


    if len(Vxy_diff) < 1:
        vxy_diff_median = 0
        vxy_diff_mean = 0
        vxy_diff_std = 0
        vxy_diff_max = 0
    else:
        Vxy_diff.sort()
        vxy_diff_median = (Vxy_diff[len(Vxy_diff) / 2] + Vxy_diff[~len(Vxy_diff) / 2]) * 1.0 / 2
        vxy_diff_mean = Vxy_diff.mean()
        vxy_diff_std = Vxy_diff.std()
        vxy_diff_max = Vxy_diff.max()

    angles = np.log1p(y_diff) - np.log1p(x_diff)
    angle_diff = angles[1:] - angles[0:-1]
    angle_diff = angle_diff[(angle_diff > 0) | (angle_diff < 1)]
    angles = angles[(angles > 0) | (angles < 1)]
    if len(angles)<1:
        angle_std = 0
        angle_kurt = 0
    else:
        angle_std = angles.std()
        angle_kurt = (angles ** 4).mean() / (angles.var() + 0.000000001)


    if len(angle_diff) == 0:
        angle_diff_mean = 0
        angle_diff_std = 0
    else:
        angle_diff_mean = angle_diff.mean()
        angle_diff_std = angle_diff.std()

    Dis_pt2dst = ((x - np.array([aim_x] * len(x))) ** 2 +
                 (y - np.array([aim_y] * len(y))) ** 2) ** 0.5
    Dis_pt2dst_diff = Dis_pt2dst[1:] - Dis_pt2dst[0:-1]
    Dis_pt2dst_diff_max = Dis_pt2dst_diff.max()
    Dis_pt2dst_diff_std = Dis_pt2dst_diff.std()

    #方向角
    DirectAngle = np.sign(x_diff).astype(int).astype(str).astype(object) + np.sign(y_diff).astype(int).astype(str).astype(object)
    ConnectDirectAngle = DirectAngle[1:] + DirectAngle[0:-1]
    angle_upTriangle_num = len(ConnectDirectAngle[ConnectDirectAngle == '111-1'])
    angle_downTriangle_num = len(ConnectDirectAngle[ConnectDirectAngle == '1-111'])

    feat = [count, x_min, x_ratio, y_min, y_max, x_diff_std, x_diff_max, x_diff_min, x_diff_skew, y_diff_mean,
            y_diff_std, x_back_num, DisPoint, Disx_forlat, t_diff_mean, t_diff_min, t_diff_std, duration_mean, timehalf,
            xy_diff_max, vxy_std, vxy_mean, vxyfirst, vxylast, vxy_diff_median, vxy_diff_mean, vxy_diff_max,
            vxy_diff_std, angle_std, angle_kurt, angle_diff_mean, angle_diff_std, Dis_pt2dst_diff_max,
            Dis_pt2dst_diff_std, angle_upTriangle_num, angle_downTriangle_num]
    feat = list(np.nan_to_num(feat))

    feat_str_list = [str(item) for item in feat]
    feat_str = ' '.join(feat_str_list)

    return feat_str
sc = SparkContext(appName='Test')
rdd = sc.textFile(input_file)
result = rdd.map(get_TES_feat)
result.saveAsTextFile(output_file)
end = 1

6.特征筛选

步骤5已经给出了提取特征的方法以及代码,但是事实上,如果按照统计特征的思维去提取特征的话,少说也得上百维特征了,但是最终我们,只是选取除了其中的36维特征,是因为我们在特征筛选的过程中去除掉了一些表现不好的特征。

你可能感兴趣的:(机器学习)