Spark Data Mining: Network Traffic Anomaly Detection with K-means Clustering (1): Data Exploration and a First Model

1 Introduction

Classification and regression are powerful, easy-to-learn machine learning techniques, but note what they require: to predict unknown values for new samples, they must first learn from a large number of samples whose target values are known. Techniques of this kind are collectively called supervised learning. The focus below is instead on an unsupervised algorithm: K-means clustering. In practice you will often run into data with no correct target value at all. For example: given an e-commerce site's data on users' habits and tastes, segment the users into groups. The input is features of their purchases, clicks, profile pictures, browsing, orders and so on; the output is a group label for each user. Perhaps one group turns out to be fashion-conscious users, while another prefers cheap goods. Don't worry: problems like this can be solved with unsupervised learning. These techniques do not predict target values by learning from target values, because there are none. Instead they learn the structure of the data, discovering which samples resemble each other, or which inputs tend to occur and which do not. Rest assured, the problem above will be worked through step by step, hands-on with Spark, and solved in a first pass below.

2 Anomaly Detection

Anomaly detection, as the name suggests, aims to find unusual things. If you already have a dataset labeled as anomalous, plenty of supervised learning algorithms can detect anomalies easily: they learn to tell "anomalous" from "non-anomalous" using the labeled data. Real-world anomalies, however, are by definition things nobody knows about yet. Put another way, if you could already observe or fully characterize an anomaly, it would no longer be one.

Important applications of anomaly detection include fraud detection, network attack detection, and mechanical faults reported by sensors on servers and other equipment. In these settings a key difficulty is that new types of anomalies keep appearing that have never been seen before: new frauds, new attacks, new causes of service failure.

Unsupervised learning is very helpful in these cases: it learns from the structure of the data what normal data looks like, so when the model encounters data that does not resemble the normal data it has seen, it can flag an anomaly. The sketch below illustrates the idea.
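
To make this concrete, here is a minimal sketch in Scala with Spark MLlib (the API used later in this article) of how a clustering model can serve as an anomaly detector. It assumes a KMeansModel already trained on mostly-normal traffic; the helper names and the caller-supplied threshold are invented for illustration, not part of any fixed API.

  import org.apache.spark.mllib.clustering.KMeansModel
  import org.apache.spark.mllib.linalg.Vector

  // Euclidean distance between two MLlib vectors
  def distance(a: Vector, b: Vector): Double =
    math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

  // distance from a point to the center of the cluster it falls in
  def distToCenter(model: KMeansModel, point: Vector): Double =
    distance(point, model.clusterCenters(model.predict(point)))

  // flag a point as anomalous when it lies unusually far from every center;
  // the threshold must be calibrated on the training data
  def isAnomaly(model: KMeansModel, point: Vector, threshold: Double): Boolean =
    distToCenter(model, point) > threshold

In practice the threshold could be set, for example, from a high percentile of the distances observed on the training data.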

3 K-means Clustering

Algorithm name: K-means clustering (K-means Clustering, KMC)

Description: K-means is one of the best-known unsupervised algorithms. Its goal is to partition the data into groups, maximizing the distance between clusters while minimizing the distance within each cluster. The key parameter K specifies the number of clusters; it matters a great deal and is hard to choose. The other choice to make is the distance measure; Spark currently supports only Euclidean distance.

Principle (illustrated by the sketch after this summary):
1. Randomly choose K points as the initial cluster centers.
2. Compute each point's distance to every center and assign the point to the nearest cluster.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the iteration limit is reached or the centers move less than a given tolerance, then stop.

Use case: unsupervised clustering.

Pros: 1. easy to implement.
Cons: 1. may converge to a local minimum; 2. converges slowly on large datasets; 3. handles only numeric data (categorical variables must be encoded and a suitable distance measure chosen); 4. results are hard to evaluate.

References: 1. Algorithm: Machine Learning in Action. 2. Spark implementation: MLlib - K-means Clustering.
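
To make the four steps above concrete, here is a small self-contained sketch of the algorithm in plain Scala (no Spark). The function and parameter names are invented for illustration, and it assumes points.length >= k:

  import scala.util.Random

  def kMeans(points: Array[Array[Double]], k: Int, maxIter: Int, tol: Double): Array[Array[Double]] = {
    def sqdist(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    // Step 1: pick k random points as the initial centers
    var centers = Random.shuffle(points.toList).take(k).toArray
    var moved = Double.MaxValue
    var iter = 0
    while (iter < maxIter && moved > tol) {
      // Step 2: assign every point to its nearest center
      val clusters = points.groupBy(p => centers.indices.minBy(i => sqdist(p, centers(i))))
      // Step 3: recompute each center as the mean of its cluster
      val newCenters = centers.indices.toArray.map { i =>
        clusters.get(i) match {
          case Some(ps) => ps.transpose.map(col => col.sum / ps.length)
          case None     => centers(i) // an empty cluster keeps its old center
        }
      }
      // Step 4: stop once the largest (squared) center movement falls below
      // the tolerance, or the iteration limit is reached
      moved = centers.zip(newCenters).map { case (a, b) => sqdist(a, b) }.max
      centers = newCenters
      iter += 1
    }
    centers
  }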

4 Network Attacks

So-called network attacks are everywhere these days. Some attacks try to flood a server with so much traffic that it can no longer handle legitimate requests. Others try to exploit vulnerabilities in network software to gain access to the server. The first kind is fairly obvious and easy to detect and handle, but spotting an exploit among all those requests is like finding a needle in a haystack.

Some attackers follow fixed patterns. For example, probing every port on a machine is something normal software rarely does. It is typically an attacker's first step, meant to discover which applications on the server might be vulnerable.

If you count how many different ports the same source accesses within a short time window, that count makes a good feature for detecting port-scan attacks (a sketch of this feature follows below). Likewise, other features can be used to detect other kinds of attacks: the number of bytes sent and received, TCP error types, and so on.
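
Here is a hedged sketch of the port-count feature. The record type, its field names, and the choice of window length are all invented for illustration; the real input format depends on your network logs:

  import org.apache.spark.rdd.RDD

  // hypothetical connection record: timestamp in seconds, source host, destination port
  case class Conn(ts: Long, srcHost: String, dstPort: Int)

  // number of distinct destination ports each source host touches per time window
  def distinctPortCounts(conns: RDD[Conn], windowSec: Long): RDD[((String, Long), Int)] =
    conns
      .map(c => ((c.srcHost, c.ts / windowSec), c.dstPort)) // bucket by (host, window)
      .distinct()                                           // keep each port once per bucket
      .map { case (key, _) => (key, 1) }
      .reduceByKey(_ + _)                                   // distinct-port count per bucket
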
But what do you do when the attack you face has never been seen before? Suppose you had never encountered a port scan: would you have thought of the port-count feature? The greatest threat comes precisely from attacks that have never been discovered and classified.

K-means can be used to detect such unknown attacks, but only if a set of reasonably general features has been defined first, which is something of a paradox: how do you come up with the right features before you know the attack exists? This is where the human brain does the work: experience, insight, logical reasoning. There is no universal recipe, and this complex question will not be pursued here. Assume the features have all been chosen and the data collected, and move on to the next step.

5 A First Model on the KDD Cup 1999 Dataset

KDD Cup is an annual data mining competition run by a special interest group of the ACM. Each year the organizers pose a machine learning problem together with a dataset, and researchers with outstanding solutions are invited to publish detailed papers. In 1999 the problem posed was network intrusion detection, and the dataset is still available. Below, Spark is used to learn to detect network anomalies from this dataset. Note that this really is only an example: more than a decade on, the situation is very different.

Fortunately, the organizers have already processed the raw request packets into structured records, one per connection, stored as CSV: 708 MB in total, 4,898,431 connection records, each with 41 features, the last column identifying the attack type or marking a normal connection. Note: even though the last column carries attack labels, the model will not try to learn from the known attack types, because the goal here is to discover unknown attacks; for now, pretend nothing is known about any attack.

First download the dataset from http://bit.ly/1ALCuZN, then unpack it with tar or 7-zip to get the main file for this study:
kddcup.data.corrected

5.1 Simple Data Exploration

The exploration code:

  import org.apache.spark.rdd.RDD

  /**
   * Simple exploration of kddcup.data.corrected
   */
  def dataExplore(data: RDD[String]) = {
    val splitData = data.map{
      line => line.split(",")
    }
    // print the first 10 records
    splitData.take(10).foreach{
      line => println(line.mkString(","))
    }
    // count records per label (the last column), most frequent first
    val sample = splitData.map(_.last).countByValue().toSeq.sortBy(_._2).reverse
    sample.foreach(println)
  }

The exploration output:

# the first 10 sample records
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,1,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,2,1.00,0.00,0.50,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,3,3,1.00,0.00,0.33,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,4,4,1.00,0.00,0.25,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,238,1282,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,5,5,1.00,0.00,0.20,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,6,6,1.00,0.00,0.17,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,234,1364,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,7,7,1.00,0.00,0.14,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,1295,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,7,7,0.00,0.00,0.00,0.00,1.00,0.00,0.00,8,8,1.00,0.00,0.12,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
# record counts per attack type
(smurf.,2807886)
(neptune.,1072017)
(normal.,972781)
(satan.,15892)
(ipsweep.,12481)
(portsweep.,10413)
(nmap.,2316)
(back.,2203)
(warezclient.,1020)
(teardrop.,979)
(pod.,264)
(guess_passwd.,53)
(buffer_overflow.,30)
(land.,21)
(warezmaster.,20)
(imap.,12)
(rootkit.,10)
(loadmodule.,9)
(ftp_write.,8)
(multihop.,7)
(phf.,4)
(perl.,3)
(spy.,2)

The meaning and type of each column of the sample data are listed below (continuous means numeric, symbolic means nominal); the final line lists the possible values of the label column:

duration: continuous.
protocol_type: symbolic.
service: symbolic.
flag: symbolic.
src_bytes: continuous.
dst_bytes: continuous.
land: symbolic.
wrong_fragment: continuous.
urgent: continuous.
hot: continuous.
num_failed_logins: continuous.
logged_in: symbolic.
num_compromised: continuous.
root_shell: continuous.
su_attempted: continuous.
num_root: continuous.
num_file_creations: continuous.
num_shells: continuous.
num_access_files: continuous.
num_outbound_cmds: continuous.
is_host_login: symbolic.
is_guest_login: symbolic.
count: continuous.
srv_count: continuous.
serror_rate: continuous.
srv_serror_rate: continuous.
rerror_rate: continuous.
srv_rerror_rate: continuous.
same_srv_rate: continuous.
diff_srv_rate: continuous.
srv_diff_host_rate: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
dst_host_same_srv_rate: continuous.
dst_host_diff_srv_rate: continuous.
dst_host_same_src_port_rate: continuous.
dst_host_srv_diff_host_rate: continuous.
dst_host_serror_rate: continuous.
dst_host_srv_serror_rate: continuous.
dst_host_rerror_rate: continuous.
dst_host_srv_rerror_rate: continuous.
back,buffer_overflow,ftp_write,guess_passwd,imap,ipsweep,land,loadmodule,multihop,neptune,nmap,normal,perl,phf,pod,portsweep,rootkit,satan,smurf,spy,teardrop,warezclient,warezmaster.

5.2 First Clustering Attempt

Note first that the data contains non-numeric features (categorical variables with no ordering; for example the second column takes only the values tcp, udp, or icmp), while the K-means algorithm requires numeric variables. For now, simply drop these columns: columns 2, 3, 4 and the last one. The data preparation code:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.rdd.RDD

  /**
   * Drop the categorical variables from the raw data and turn each record
   * into a labeled vector of doubles
   * @param data the raw dataset
   * @return an RDD of (label, feature vector) pairs
   */
  def dataPrepare(data: RDD[String]) = {
    val labelsAndData = data.map {
      line =>
        val buffer = line.split(",").toBuffer
        // remove 3 elements starting at index 1, i.e. the 2nd, 3rd and 4th
        // columns (protocol_type, service, flag)
        buffer.remove(1, 3)
        // the last element is the label
        val label = buffer.remove(buffer.length - 1)
        val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
        (label, vector)
    }
    labelsAndData
  }

Note that only the values of the Tuple2 pairs produced above are used to train the KMeansModel. Also note that no attention is paid to parameters or model quality yet; the goal for now is simply to get the training to run end to end.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.KMeans

  // local test
  def main(args: Array[String]) {
    val rootDir = "please set your path"
    val conf = new SparkConf().setAppName("SparkInAction").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val kddcupData = sc.textFile(rootDir + "/kddcup.data.corrected")
    // initial data exploration
    dataExplore(kddcupData)
    val data = dataPrepare(kddcupData).values
    data.cache()
    // arbitrary parameters: two clusters, 50 iterations
    val numClusters = 2
    val numIterations = 50
    val clusters = KMeans.train(data, numClusters, numIterations)
    // print each cluster center the model found
    clusters.clusterCenters.foreach(println)
    // evaluate the model with the within-set sum of squared errors
    val wssse = clusters.computeCost(data)
    println("Within Set Sum of Squared Errors = " + wssse)
  }

As the output below shows, k-means fit two clusters. The cluster centers and the within-set sum of squared errors (the sum, over all points, of the squared Euclidean distance to the nearest center) print as follows:

[48.34019491959669,1834.6215497618625,826.2031900016945,5.7161172049003456E-6,6.487793027561892E-4,7.961734678254053E-6,0.012437658596734055,3.205108575604837E-5,0.14352904910348827,0.00808830584493399,6.818511237273984E-5,3.6746467745787934E-5,0.012934960793560386,0.0011887482315762398,7.430952366370449E-5,0.0010211435092468404,0.0,4.082940860643104E-7,8.351655530445469E-4,334.9735084506668,295.26714620807076,0.17797031701994304,0.17803698940272675,0.05766489875327384,0.05772990937912762,0.7898841322627527,0.021179610609915762,0.02826081009629794,232.98107822302248,189.21428335201279,0.753713389800417,0.030710978823818437,0.6050519309247937,0.006464107887632785,0.1780911843182427,0.17788589813471198,0.05792761150001037,0.05765922142400437]

[10999.0,0.0,1.309937401E9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,255.0,1.0,0.0,0.65,1.0,0.0,0.0,0.0,1.0,1.0]

# the cost
Within Set Sum of Squared Errors = 4.6634585670252554E18

Predictably, the result is poor: the dataset contains at least 23 known classes, yet only two clusters were fit. Still, it is worth getting a direct feel for how the true classes are distributed within each cluster the model produced (for instance, within the records the model assigned to cluster 1, what does the distribution of original classes look like? Ideally each cluster is pure: if a model cluster corresponded to exactly one original class, or at least contained only attack classes, that would be perfect). The inspection code:

  // a small trick: countByValue aggregates over both columns at once
  val clusterLabelCount = labelsAndData.map {
    case (label, data) =>
      val predLabel = clusters.predict(data)
      (predLabel, label)
  }.countByValue()
  clusterLabelCount.toSeq.sorted.foreach {
    case ((cluster, label), count) =>
      println(f"$cluster%1s$label%18s$count%8s")
  }

And indeed the result is far from ideal:

0             back.    2203
0  buffer_overflow.      30
0        ftp_write.       8
0     guess_passwd.      53
0             imap.      12
0          ipsweep.   12481
0             land.      21
0       loadmodule.       9
0         multihop.       7
0          neptune. 1072017
0             nmap.    2316
0           normal.  972781
0             perl.       3
0              phf.       4
0              pod.     264
0        portsweep.   10412
0          rootkit.      10
0            satan.   15892
0            smurf. 2807886
0              spy.       2
0         teardrop.     979
0      warezclient.    1020
0      warezmaster.      20
1        portsweep.       1
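
With at least 23 known classes and only two clusters, k is clearly far too small. A natural next step is sketched below: train one model per candidate k and compare the costs (the candidate values here are arbitrary). The WSSSE should fall as k grows, and an "elbow" in the curve is one common heuristic for choosing k.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // train one model per candidate k and report its cost on the same data
  def costsByK(data: RDD[Vector], ks: Seq[Int], numIterations: Int = 50): Seq[(Int, Double)] =
    ks.map { k =>
      val model = KMeans.train(data, k, numIterations)
      (k, model.computeCost(data))
    }

  // e.g.: costsByK(data, Seq(5, 10, 15, 20, 25, 30)).foreach(println)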
