@(Blog Post)[spark]
Reference: *Advanced Analytics with Spark*, Chapter 2.
Based on training data, this project works out, for each pair of records, which class the pair falls into (true or false, i.e. whether the two records match), and uses that decision for the next step of prediction. See the analysis in Part 2 for details.
Only Spark's basic API is used here; none of MLlib's algorithms.
Download the dataset and unzip it into ~/Downloads/donation:
https://archive.ics.uci.edu/ml/machine-learning-databases/00210/donation.zip
This example first runs in local mode:
bin/spark-shell
Or upload the files to HDFS:
hadoop fs -put ./donation/ /tmp/
and then use:
bin/spark-shell --master yarn-client
scala> val rawblocks = sc.textFile("/Users/liaoliuqing/Downloads/donation")
rawblocks: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
scala> rawblocks.first
res0: String = "id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
scala> rawblocks.take(5).foreach(println)
"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE
39086,47614,1,?,1,?,1,1,1,1,1,TRUE
70031,70237,1,?,1,?,1,1,1,1,1,TRUE
84795,97439,1,?,1,?,1,1,1,1,1,TRUE
scala> rawblocks.count
res2: Long = 5749142
The output above shows that the first row is a header describing what each column means. We don't need it for the actual analysis, so we remove it.
scala> val noheader = rawblocks.filter(line => !line.contains("id_1"))
noheader: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter at <console>:23
scala> noheader.count
res6: Long = 5749132
We drop every line containing "id_1"; these are the header rows (any other header field would work as the filter criterion just as well). After removal, the count drops by 10: the directory holds exactly 10 files, and removing the first line of each accounts for the 10 rows.
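As a quick cross-check (the number also follows from the two counts above: 5749142 - 5749132 = 10), we can count the header lines directly:

rawblocks.filter(_.contains("id_1")).count() // expect 10, one header per file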
The data contains missing values, represented by ?, so they must be handled first; calling toDouble directly on ? would throw an exception:
def myToDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}
About NaN: in computing, NaN (not a number) is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point calculations.
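A small aside on NaN semantics in Scala, which matters for the code below: NaN is never equal to itself under ==, so it has to be tested with isNaN (or, as it happens, with equals on boxed doubles):

Double.NaN == Double.NaN           // false: IEEE 754 comparison
Double.NaN.equals(Double.NaN)      // true: boxed java.lang.Double.equals treats NaN as equal
java.lang.Double.isNaN(Double.NaN) // true: the explicit test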
Verify the method:
scala> myToDouble("4")
res10: Double = 4.0
scala> myToDouble("?")
res11: Double = NaN
def parse(line: String) = {
  val pieces = line.split(",")
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(myToDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}
This function exposes the first two fields as IDs, maps the middle nine fields through myToDouble into an Array[Double], and parses the last field as a Boolean match flag. Its signature is:
parse: (line: String)(Int, Int, Array[Double], Boolean)
Verify the function:
scala> noheader.take(5).map(parse).foreach(println)
(37291,53113,[D@2138bd8c,true)
(39086,47614,[D@1424435e,true)
(70031,70237,[D@58c2daa6,true)
(84795,97439,[D@60a0f5d0,true)
(36950,42116,[D@676a5c3f,true)
The result is a 4-element tuple (the Array[Double] prints unhelpfully as [D@... because JVM arrays don't override toString). Next we wrap the fields in a case class instead.
scala> case class MatchData(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)
defined class MatchData
def parse(line: String) = {
  val pieces = line.split(",")
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(myToDouble)
  val matched = pieces(11).toBoolean
  MatchData(id1, id2, scores, matched)
}
Check the results again:
scala> noheader.take(5).map(parse).foreach(println)
MatchData(37291,53113,[D@dd278c2,true)
MatchData(39086,47614,[D@74f60fa4,true)
MatchData(70031,70237,[D@467d13f9,true)
MatchData(84795,97439,[D@3daa6496,true)
MatchData(36950,42116,[D@7db1d37a,true)
scala> val parsed = noheader.map(line => parse(line))
parsed: org.apache.spark.rdd.RDD[MatchData] = MapPartitionsRDD[5] at map at <console>:31
scala> parsed.first
res15: MatchData = MatchData(37291,53113,[D@4e0d2c7f,true)
OK, the data is now parsed; let's dig into it.
Group the parsed data by the matched field:
scala> val grouped = parsed.groupBy(md => md.matched)
grouped: org.apache.spark.rdd.RDD[(Boolean, Iterable[MatchData])] = ShuffledRDD[7] at groupBy at <console>:33
scala> grouped.mapValues(x => x.size).foreach(println)

Note that groupBy shuffles every record across the cluster just to count group sizes, which is expensive; countByValue produces the same counts much more cheaply:
scala> val matchCount = parsed.map(md => md.matched).countByValue()
matchCount: scala.collection.Map[Boolean,Long] = Map(true -> 20931, false -> 5728201)
Sort the counts for display:
scala> val matchCountsSeq = matchCount.toSeq
matchCountsSeq: Seq[(Boolean, Long)] = ArrayBuffer((true,20931), (false,5728201))
scala> matchCountsSeq.sortBy(_._1).foreach(println)
(false,5728201)
(true,20931)
scala> matchCountsSeq.sortBy(_._2).foreach(println)
(true,20931)
(false,5728201)
scala> matchCountsSeq.sortBy(_._2).reverse.foreach(println)
(false,5728201)
(true,20931)
We first convert the Map to a Seq, then sort with sortBy; reverse gives descending order.
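Equivalently, you can pass a reversed Ordering straight to sortBy instead of sorting and then reversing (a minor aside, not from the book):

matchCountsSeq.sortBy(_._2)(Ordering[Long].reverse).foreach(println)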
Spark provides stats() to compute summary statistics over an RDD[Double]; it is an action made available through an implicit conversion to DoubleRDDFunctions.
scala> parsed.map(md => md.scores(0)).stats()
res12: org.apache.spark.util.StatCounter = (count: 5749132, mean: NaN, stdev: NaN, max: NaN, min: NaN)
The NaN values poison the computation, so filter them out:
scala> import java.lang.Double.isNaN
import java.lang.Double.isNaN
scala> parsed.map(md => md.scores(0)).filter(!isNaN(_)).stats()
res13: org.apache.spark.util.StatCounter = (count: 5748125, mean: 0.712902, stdev: 0.388758, max: 1.000000, min: 0.000000)
If you like, you can compute this summary for every score column:
val stats = (0 until 9).map(i => {
  parsed.map(md => md.scores(i)).filter(!isNaN(_)).stats()
})
The sample data has 12 columns: the first two are IDs; columns 3 through 11 hold nine numeric values comparing the two records on nine attributes; the last column is a Boolean flag saying whether the two IDs refer to the same entity (or person):
"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE
39086,47614,1,?,1,?,1,1,1,1,1,TRUE
70031,70237,1,?,1,?,1,1,1,1,1,TRUE
Our task is to analyze these nine values and derive a model that, given the nine values for a pair of IDs, judges whether they refer to the same entity.
Create case classes to hold each parsed record:
case class MatchData(id1: Int, id2: Int,
    scores: Array[Double], matched: Boolean)
case class Scored(md: MatchData, score: Double)
See Part 1 for how to download the data.
val rawblocks = sc.textFile("file:///Users/liaoliuqing/Downloads/donation2")
Of course, reading the data from HDFS is more common. Note that loading the entire donation dataset may exhaust memory on a single machine, so trim the directory down to two of the files (a single file also fails with an error).
def isHeader(line: String) = line.contains("id_1")
val noheader = rawblocks.filter(x => !isHeader(x))
The first line of each file is a header; strip it out first.
The records are full of ? marks denoting missing values; convert them to NaN, since calling toDouble on ? would throw an exception:
def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  MatchData(id1, id2, scores, matched)
}
val parsed = noheader.map(line => parse(line))
parsed.cache()
val matchCounts = parsed.map(md => md.matched).countByValue()
Sort and print the counts:
val matchCountsSeq = matchCounts.toSeq
matchCountsSeq.sortBy(_._2).reverse.foreach(println)
Output:
(false,1145640)
(true,4186)
Only 4186 of the samples are true; the rest are false.
val stats = (0 until 9).map(i => {
  parsed.map(_.scores(i)).filter(!_.isNaN).stats()
})
stats.foreach(println)
The output:
(count: 1149603, mean: 0.712452, stdev: 0.389030, max: 1.000000, min: 0.000000)
(count: 20650, mean: 0.898884, stdev: 0.273071, max: 1.000000, min: 0.000000)
(count: 1149826, mean: 0.315906, stdev: 0.334438, max: 1.000000, min: 0.000000)
(count: 465, mean: 0.326669, stdev: 0.366702, max: 1.000000, min: 0.000000)
(count: 1149826, mean: 0.955133, stdev: 0.207011, max: 1.000000, min: 0.000000)
(count: 1149678, mean: 0.225125, stdev: 0.417664, max: 1.000000, min: 0.000000)
(count: 1149678, mean: 0.488465, stdev: 0.499867, max: 1.000000, min: 0.000000)
(count: 1149678, mean: 0.222706, stdev: 0.416062, max: 1.000000, min: 0.000000)
(count: 1147303, mean: 0.005550, stdev: 0.074288, max: 1.000000, min: 0.000000)
stats() summarizes the elements of an RDD[Double]: count, mean, standard deviation, maximum, and minimum.
This step has no direct bearing on the analysis that follows and can be skipped.
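For intuition, StatCounter can also be driven directly on local data; a minimal sketch (its companion object accepts a sequence of doubles):

import org.apache.spark.util.StatCounter
val s = StatCounter(Seq(1.0, 2.0, 3.0, 4.0))
println(s) // (count: 4, mean: 2.500000, stdev: 1.118034, max: 4.000000, min: 1.000000)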
The NAStatCounter class defined below holds two fields: the number of missing values, and a StatCounter object. StatCounter itself tracks five running values:
private var n: Long = 0 // Running count of our values
private var mu: Double = 0 // Running mean of our values
private var m2: Double = 0 // Running variance numerator (sum of (x - mean)^2)
private var maxValue: Double = Double.NegativeInfinity // Running max of our values
private var minValue: Double = Double.PositiveInfinity // Running min of our values
These are exactly the figures in the stats() output above.
NAStatCounter's add method defines how a single value is folded in: if the value is NaN, the missing counter is incremented; otherwise the value is merged into the StatCounter via StatCounter.merge, which is defined as:
def merge(value: Double): StatCounter = {
  val delta = value - mu
  n += 1
  mu += delta / n
  m2 += delta * (value - mu)
  maxValue = math.max(maxValue, value)
  minValue = math.min(minValue, value)
  this
}
That is, it simply performs an online update of the running count, mean, variance numerator, maximum, and minimum.
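To see that this online update reproduces the batch statistics, fold a few values through the same rule in plain Scala (no Spark needed):

val xs = Seq(1.0, 2.0, 4.0)
var n = 0L; var mu = 0.0; var m2 = 0.0
for (x <- xs) {
  val delta = x - mu
  n += 1
  mu += delta / n
  m2 += delta * (x - mu)
}
// mu == 2.333... (the batch mean, 7/3)
// m2 == 4.666... (the batch sum of squared deviations from the mean)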
The toString override makes the contents readable when printed.
Finally, the companion object defines apply, so NAStatCounter(x) builds a counter already seeded with x.
class NAStatCounter extends Serializable {
  val stats: StatCounter = new StatCounter()
  var missing: Long = 0

  // Fold in one value: count it as missing if NaN, otherwise merge it.
  def add(x: Double): NAStatCounter = {
    if (x.isNaN) {
      missing += 1
    } else {
      stats.merge(x)
    }
    this
  }

  // Combine two counters of the same column.
  def merge(other: NAStatCounter): NAStatCounter = {
    stats.merge(other.stats)
    missing += other.missing
    this
  }

  override def toString: String = {
    "stats: " + stats.toString + " NaN: " + missing
  }
}

object NAStatCounter extends Serializable {
  def apply(x: Double) = new NAStatCounter().add(x)
}
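A quick local sanity check of the class (the values here are made up for illustration):

val a = NAStatCounter(1.0)
a.add(Double.NaN).add(0.5)
val b = NAStatCounter(Double.NaN)
a.merge(b)
println(a)
// stats: (count: 2, mean: 0.750000, stdev: 0.250000, max: 1.000000, min: 0.500000) NaN: 2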
Convert each score in every record into an NAStatCounter, reduce column-wise, and print:
val nasRDD = parsed.map(md => {
  md.scores.map(d => NAStatCounter(d))
})
val reduced = nasRDD.reduce((n1, n2) => {
  n1.zip(n2).map { case (a, b) => a.merge(b) }
})
reduced.foreach(println)
Again, this step does not feed into the final result; it is only for intermediate inspection.
Output:
stats: (count: 1149603, mean: 0.712452, stdev: 0.389030, max: 1.000000, min: 0.000000) NaN: 223
stats: (count: 20650, mean: 0.898884, stdev: 0.273071, max: 1.000000, min: 0.000000) NaN: 1129176
stats: (count: 1149826, mean: 0.315906, stdev: 0.334438, max: 1.000000, min: 0.000000) NaN: 0
stats: (count: 465, mean: 0.326669, stdev: 0.366702, max: 1.000000, min: 0.000000) NaN: 1149361
stats: (count: 1149826, mean: 0.955133, stdev: 0.207011, max: 1.000000, min: 0.000000) NaN: 0
stats: (count: 1149678, mean: 0.225125, stdev: 0.417664, max: 1.000000, min: 0.000000) NaN: 148
stats: (count: 1149678, mean: 0.488465, stdev: 0.499867, max: 1.000000, min: 0.000000) NaN: 148
stats: (count: 1149678, mean: 0.222706, stdev: 0.416062, max: 1.000000, min: 0.000000) NaN: 148
stats: (count: 1147303, mean: 0.005550, stdev: 0.074288, max: 1.000000, min: 0.000000) NaN: 2523
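To make the per-column zip/merge concrete, here is what the reduce does on two tiny hand-built rows (plain local Scala):

val n1 = Array(NAStatCounter(1.0), NAStatCounter(Double.NaN))
val n2 = Array(NAStatCounter(3.0), NAStatCounter(2.0))
val merged = n1.zip(n2).map { case (a, b) => a.merge(b) }
merged.foreach(println)
// column 0: stats over {1.0, 3.0}, NaN: 0
// column 1: stats over {2.0}, NaN: 1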
Define statsWithMissing to compute per-column statistics together with missing counts:
def statsWithMissing(rdd: RDD[Array[Double]]): Array[NAStatCounter] = {
  val nastats = rdd.mapPartitions((iter: Iterator[Array[Double]]) => {
    // Seed one NAStatCounter per column from the partition's first row
    // (note: this assumes every partition is non-empty).
    val nas: Array[NAStatCounter] = iter.next().map(d => NAStatCounter(d))
    iter.foreach(arr => {
      nas.zip(arr).foreach { case (n, d) => n.add(d) }
    })
    Iterator(nas)
  })
  // Combine the per-partition arrays column by column.
  nastats.reduce((n1, n2) => {
    n1.zip(n2).map { case (a, b) => a.merge(b) }
  })
}
val statsm = statsWithMissing(parsed.filter(_.matched).map(_.scores))
val statsn = statsWithMissing(parsed.filter(!_.matched).map(_.scores))
statsm.zip(statsn).map { case (m, n) =>
  (m.missing + n.missing, m.stats.mean - n.stats.mean)
}.foreach(println)
The output (for each column: total missing values, and the difference in mean between matches and non-matches):
(223,0.286371147556274)
(1129176,0.09237251848914796)
(0,0.6840609479157178)
(1149361,0.7866299180271783)
(0,0.03376179754806352)
(148,0.7736308747874063)
(148,0.5112459666546485)
(148,0.7760586525457857)
(2523,0.9562752950948621)
From this, columns 2, 5, 6, 7, and 8 (0-based) combine a large difference in means between the two classes with few missing values (column 3 differs strongly too, but is missing for almost every record), so we select these five features.
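The same selection can be read off programmatically; a small sketch reusing statsm and statsn from above to print each column's mean difference and missing count:

statsm.zip(statsn).zipWithIndex.foreach { case ((m, n), i) =>
  println(s"col $i: diff = ${m.stats.mean - n.stats.mean}, missing = ${m.missing + n.missing}")
}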
Now let's evaluate this simple model.
We simply sum the five selected feature values as the score:
// Map NaN to 0.0. Double.NaN.equals(d) works here because both sides are
// boxed to java.lang.Double, whose equals treats NaN as equal to NaN;
// d.isNaN would be the more idiomatic test.
def naz(d: Double) = if (Double.NaN.equals(d)) 0.0 else d

val ct = parsed.map(md => {
  val score = Array(2, 5, 6, 7, 8).map(i => naz(md.scores(i))).sum
  Scored(md, score)
})
ct is an RDD of Scored objects, pairing each MatchData with its score.
Using thresholds of 4.0 and 2.0, recount the true and false labels among the pairs that pass:
ct.filter(s => s.score >= 4.0).
  map(s => s.md.matched).countByValue().foreach(println)
ct.filter(s => s.score >= 2.0).
  map(s => s.md.matched).countByValue().foreach(println)
The results:
(false,134)
(true,4175)
(false,119766)
(true,4186)
Compare with the raw label counts:
(false,1145640)
(true,4186)
* At a threshold of 4.0, i.e. the five feature values sum to at least 4.0, we recover nearly all of the true pairs while letting through only a handful of false ones.
* At a threshold of 2.0, i.e. the sum is at least 2.0, we recover every true pair, but a large number of false pairs come with them.
So the threshold depends on the application: to capture as many true pairs as possible, lower it; to keep out false pairs while still capturing true ones, raise it moderately.
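The tradeoff can be quantified as precision and recall from the counts above (at threshold 4.0: 4175 true positives, 134 false positives, 4186 actual trues):

val tp = 4175.0
val fp = 134.0
val actualTrues = 4186.0
val precision = tp / (tp + fp) // ≈ 0.969
val recall = tp / actualTrues  // ≈ 0.997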
In real use, besides the training data there should also be validation data, used to check the model's accuracy.
We test on the local machine first, hence setMaster("local[2]") and a file:/// path.
To run on a cluster, remove the setMaster call and pass an HDFS path in as a parameter.
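A minimal sketch of that change, assuming the input path arrives as the first program argument (args(0) is my illustration, not part of the original program):

val conf = new SparkConf().setAppName("Intro") // master supplied by spark-submit
val sc = new SparkContext(conf)
val rawblocks = sc.textFile(args(0)) // e.g. an hdfs:// path

The complete local-mode program follows: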
package com.lujinhong.sparkdemo.ml.basic

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

case class MatchData(id1: Int, id2: Int,
    scores: Array[Double], matched: Boolean)
case class Scored(md: MatchData, score: Double)

object RunIntro extends Serializable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Intro").setMaster("local[2]"))
    val rawblocks = sc.textFile("file:///Users/liaoliuqing/Downloads/donation2")

    def isHeader(line: String) = line.contains("id_1")
    val noheader = rawblocks.filter(x => !isHeader(x))

    def toDouble(s: String) = {
      if ("?".equals(s)) Double.NaN else s.toDouble
    }

    def parse(line: String) = {
      val pieces = line.split(',')
      val id1 = pieces(0).toInt
      val id2 = pieces(1).toInt
      val scores = pieces.slice(2, 11).map(toDouble)
      val matched = pieces(11).toBoolean
      MatchData(id1, id2, scores, matched)
    }

    val parsed = noheader.map(line => parse(line))
    parsed.cache()

    val matchCounts = parsed.map(md => md.matched).countByValue()
    val matchCountsSeq = matchCounts.toSeq
    matchCountsSeq.sortBy(_._2).reverse.foreach(println)

    val stats = (0 until 9).map(i => {
      parsed.map(_.scores(i)).filter(!_.isNaN).stats()
    })
    stats.foreach(println)

    val nasRDD = parsed.map(md => {
      md.scores.map(d => NAStatCounter(d))
    })
    val reduced = nasRDD.reduce((n1, n2) => {
      n1.zip(n2).map { case (a, b) => a.merge(b) }
    })
    reduced.foreach(println)

    val statsm = statsWithMissing(parsed.filter(_.matched).map(_.scores))
    val statsn = statsWithMissing(parsed.filter(!_.matched).map(_.scores))
    statsm.zip(statsn).map { case (m, n) =>
      (m.missing + n.missing, m.stats.mean - n.stats.mean)
    }.foreach(println)

    def naz(d: Double) = if (Double.NaN.equals(d)) 0.0 else d
    val ct = parsed.map(md => {
      val score = Array(2, 5, 6, 7, 8).map(i => naz(md.scores(i))).sum
      Scored(md, score)
    })
    ct.filter(s => s.score >= 4.0).
      map(s => s.md.matched).countByValue().foreach(println)
    ct.filter(s => s.score >= 2.0).
      map(s => s.md.matched).countByValue().foreach(println)
  }

  def statsWithMissing(rdd: RDD[Array[Double]]): Array[NAStatCounter] = {
    val nastats = rdd.mapPartitions((iter: Iterator[Array[Double]]) => {
      val nas: Array[NAStatCounter] = iter.next().map(d => NAStatCounter(d))
      iter.foreach(arr => {
        nas.zip(arr).foreach { case (n, d) => n.add(d) }
      })
      Iterator(nas)
    })
    nastats.reduce((n1, n2) => {
      n1.zip(n2).map { case (a, b) => a.merge(b) }
    })
  }
}

class NAStatCounter extends Serializable {
  val stats: StatCounter = new StatCounter()
  var missing: Long = 0

  def add(x: Double): NAStatCounter = {
    if (x.isNaN) {
      missing += 1
    } else {
      stats.merge(x)
    }
    this
  }

  def merge(other: NAStatCounter): NAStatCounter = {
    stats.merge(other.stats)
    missing += other.missing
    this
  }

  override def toString: String = {
    "stats: " + stats.toString + " NaN: " + missing
  }
}

object NAStatCounter extends Serializable {
  def apply(x: Double) = new NAStatCounter().add(x)
}