Source: http://mocom.xmu.edu.cn/article/show/584d1fc5bd8177b41ebbd8bc/0/1
Spark currently supports Pearson's chi-squared tests, including the goodness-of-fit test and the test of independence.
First, we import the necessary packages:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics._
Next, we pick the data to analyse from the dataset; here we take the first two records of the iris dataset as v1 and v2. The input type determines which test is run: the goodness-of-fit test takes a Vector, while the independence test takes a Matrix.
scala> val v1: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).first
v1: org.apache.spark.mllib.linalg.Vector = [5.1,3.5,1.4,0.2]
scala> val v2: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).take(2).last
v2: org.apache.spark.mllib.linalg.Vector = [4.9,3.0,1.4,0.2]
Goodness of fit: this test checks whether the observed frequency distribution of a sample differs from a theoretical distribution. Its null hypothesis (H0) is that the observed frequencies of events in the sample follow a specified theoretical distribution. The observed counts from a multinomial experiment are compared with the expected counts under the null hypothesis to see how closely they agree; in other words, it uses sample data to test whether the population follows a particular distribution.
Usually this theoretical distribution is the uniform distribution, which is also Spark's default. The code is as follows:
scala> val goodnessOfFitTestResult = Statistics.chiSqTest(v1)
goodnessOfFitTestResult: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 5.588235294117647
pValue = 0.1334553914430291
No presumption against null hypothesis: observed follows the same distribution as expected..
The output shows the p-value, the degrees of freedom, the test statistic, the method used, and the conclusion about the null hypothesis. Let us briefly explain what each field means:
method: the method used; here Pearson's method.
statistic: the test statistic, i.e. the evidence used to decide whether the null hypothesis can be rejected. It is computed from the sample data and summarises the information in the sample. The larger its absolute value, the stronger the grounds for rejecting the null hypothesis; the smaller it is, the stronger the grounds for not rejecting it.
degrees of freedom: the number of sample observations that are free to vary.
pValue: the p-value obtained from the significance test. Conventionally P < 0.05 is considered significant and P < 0.01 highly significant, meaning the probability that the observed difference is due to sampling error alone is below 0.05 or 0.01.
In practice, looking at the p-value is usually enough. Here pValue = 0.133, so the difference is not significant: given the observed values of v1, [5.1, 3.5, 1.4, 0.2], we cannot reject the hypothesis that they follow the expected distribution (by default, the uniform distribution).
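To see where the statistic comes from, it can be reproduced by hand. The short sketch below (an illustration, not part of the original article) compares the values of v1 against a uniform expected distribution, which is Spark's default:
// Sketch: recompute the goodness-of-fit statistic with a uniform expectation
val observed = Array(5.1, 3.5, 1.4, 0.2)               // the values of v1
val expectedUniform = observed.sum / observed.length    // 10.2 / 4 = 2.55
val gofStat = observed.map(o => math.pow(o - expectedUniform, 2) / expectedUniform).sum
// gofStat == 5.588235294117647, matching the statistic reported by chiSqTest(v1)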
The chi-squared test of independence checks whether two attributes are independent of each other. One attribute forms the rows and the other the columns of a contingency table, and the test examines whether an apparent association between them is real, for example between temperature changes and the incidence of pneumonia.
First, we build a Matrix from v1 and v2 and then run the independence test:
scala> val mat: Matrix =Matrices.dense(2,2,Array(v1(0),v1(1),v2(0),v2(1)))
mat: org.apache.spark.mllib.linalg.Matrix =
5.1 4.9
3.5 3.0
scala> val a =Statistics.chiSqTest(mat)
a: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.012787584067389817
pValue = 0.90996538641943
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
The two attributes being tested for independence here are the sample index and the sample values. With pValue = 0.91, we cannot reject the hypothesis that the sample index and the values are unrelated. This matches the data: v1 and v2 are two records drawn from the same dataset, so their index should have nothing to do with their values.
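The independence statistic can also be reproduced by hand. In the sketch below (an illustration, not part of the original article), the expected count for each cell is derived from the row and column totals of the 2x2 contingency table:
// Sketch: recompute the independence statistic from the row/column marginals of mat
val table = Array(Array(5.1, 4.9), Array(3.5, 3.0))             // rows of mat
val rowSums = table.map(_.sum)                                   // 10.0, 6.5
val colSums = Array(table.map(_(0)).sum, table.map(_(1)).sum)    // 8.6, 7.9
val total = rowSums.sum                                          // 16.5
val indepStat = (for (i <- 0 until 2; j <- 0 until 2) yield {
  val e = rowSums(i) * colSums(j) / total    // expected count under independence
  math.pow(table(i)(j) - e, 2) / e
}).sum
// indepStat is about 0.012788, matching the statistic reported above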
We can also treat v1 as the observed sample and v2 as the expected values and run a chi-squared test:
scala> val c1 = Statistics.chiSqTest(v1, v2)
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 0.03717820461517941
pValue = 0.9981145601231336
No presumption against null hypothesis: observed follows the same distribution as expected..
Here pValue = 0.998, so there is no significant difference between the sample v1 and a distribution whose expected values are v2. Indeed, v1 = [5.1, 3.5, 1.4, 0.2] and v2 = [4.9, 3.0, 1.4, 0.2] are very similar, and v1 can be regarded as a sample drawn from a distribution with expected values v2.
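One detail worth noting (confirmed by reproducing the number below, though not mentioned in the original article) is that when the expected vector does not sum to the same total as the observed vector, MLlib first rescales the expected values to the observed total:
// Sketch: recompute chiSqTest(v1, v2) after rescaling v2 to the total of v1
val obsV = Array(5.1, 3.5, 1.4, 0.2)       // v1, treated as observed counts
val expV = Array(4.9, 3.0, 1.4, 0.2)       // v2, treated as expected counts
val scale = obsV.sum / expV.sum            // 10.2 / 9.5
val obsExpStat = obsV.zip(expV).map { case (o, e) =>
  math.pow(o - e * scale, 2) / (e * scale)
}.sum
// obsExpStat is about 0.037178, matching the statistic reported above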
Similarly, labelled data (label/feature pairs as LabeledPoint) can be tested for independence. Here we build LabeledPoint records from the iris data:
scala> val data=sc.textFile("G:/spark/iris.data")
data: org.apache.spark.rdd.RDD[String] = G:/spark/iris.data MapPartitionsRDD[13] at textFile at <console>:44
scala> val obs = data.map{ line =>
| val parts = line.split(',')
| LabeledPoint(if(parts(4)=="Iris-setosa") 0.toDouble else if (parts(4)=="Iris-versicolor") 1.toDouble else 2.toDouble,
| Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))}
obs: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[14] at map at <console>:46
Running the independence test returns an array with one chi-squared test result for each feature against the label:
scala> val featureTestResults= Statistics.chiSqTest(obs)
featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi...
What happens here is that every column of the feature data is tested for independence against the label. The p-values are all very small, so we can reject the hypothesis that a given column is unrelated to the label; in other words, every feature column can be considered correlated with the label. We print the full results with foreach:
scala> var i = 1
i: Int = 1
scala> featureTestResults.foreach { result =>
| println(s"Column $i:\n$result")
| i += 1
| }
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 4:
Chi squared test summary:
method: pearson
degrees of freedom = 42
statistic = 271.75
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
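If only the significant columns are needed, the result array can be filtered by p-value directly; a small sketch (not part of the original example):
// Sketch: collect the (1-based) indices of feature columns significant at the 5% level
val significantColumns = featureTestResults.zipWithIndex
  .collect { case (result, idx) if result.pValue < 0.05 => idx + 1 }
println(s"Columns related to the label: ${significantColumns.mkString(", ")}")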
Spark also supports the Kolmogorov-Smirnov test; the steps are shown below:
scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
// run a KS test for the sample versus a standard normal distribution
scala> val testResult = Statistics.kolmogorovSmirnovTest(test, "norm", 0, 1)
testResult: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult =
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.999991460094529
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
// perform a KS test using a cumulative distribution function of our making
scala> val myCDF: Double => Double = (p=>p*2)
myCDF: Double => Double = <function1>
scala> val testResult2 = Statistics.kolmogorovSmirnovTest(test, myCDF)
testResult2: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult = Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 14.806666666666668
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
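The standard normal N(0,1) is rejected here simply because the iris sepal lengths lie roughly between 4 and 8, far from 0. A more natural comparison (a sketch, not from the original article) is against a normal distribution fitted to the sample's own mean and standard deviation:
// Sketch: KS test against N(mean, sd) estimated from the sample itself
val sampleMean = test.mean()
val sampleSd = test.stdev()
val fittedResult = Statistics.kolmogorovSmirnovTest(test, "norm", sampleMean, sampleSd)
println(fittedResult)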
RandomRDDs is a utility for generating RDDs of random numbers drawn from a given distribution. It currently supports the normal, Poisson and uniform distributions, and can produce either RDDs of random doubles or RDDs of random vectors.
The example below generates a random double RDD whose values follow the standard normal distribution N(0,1), and then maps it to N(1,4).
First, import the necessary packages:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._
Generate an RDD[Double] of 10,000,000 values following the standard normal distribution N(0,1), spread across 10 partitions:
scala> val u = normalRDD(sc, 10000000L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[35] at RDD at RandomRDD.scala:38
Transform the generated values to follow the normal distribution N(1,4):
scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[36] at map at <console>:50
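Besides normalRDD, the same RandomRDDs utility offers generators for the other distributions mentioned above. A short sketch (the sizes and parameters here are chosen only for illustration):
// Sketch: Poisson and uniform double RDDs, and a normal vector RDD
val pois = poissonRDD(sc, 2.0, 1000L, 10)       // 1000 values from Poisson(mean = 2.0)
val uni = uniformRDD(sc, 1000L, 10)             // 1000 values from U(0, 1)
val normVec = normalVectorRDD(sc, 1000L, 4, 10) // 1000 vectors of 4 standard-normal values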
Spark MLlib provides a utility class, KernelDensity, for kernel density estimation. Kernel density estimation infers an unknown probability density from an observed sample and is one of the non-parametric methods. The idea is this: if a value appears among the observations, its probability density can be taken to be high; values close to it also get relatively high density, while values far from it get low density. Spark 1.6.2 supports the Gaussian kernel.
First, import the necessary packages:
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
Note the data we have already loaded:
scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
We build the kernel density estimator from sample data; here we use the first iris attribute, already loaded in the hypothesis-testing section, as the sample:
scala> val kd = new KernelDensity().setSample(test).setBandwidth(3.0)
kd: org.apache.spark.mllib.stat.KernelDensity = org.apache.spark.mllib.stat.KernelDensity@26216fa3
Here setBandwidth sets the width of the Gaussian kernel; it is a smoothing parameter and can be viewed as the standard deviation of the Gaussian kernel.
With the kernel density estimator kd constructed, we can estimate the density at given points:
scala> val densities = kd.estimate(Array(-1.0, 2.0, 5.0, 5.8))
densities: Array[Double] = Array(0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114)
This means that at the points -1.0, 2.0, 5.0 and 5.8, the estimated probability density values are 0.011372003554433524, 0.059925911357198915, 0.12365409462424519 and 0.12816280708978114 respectively.
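Each of these values is simply the average of Gaussian kernels centred at the sample points. A minimal sketch of the same calculation done directly from the Gaussian-kernel formula (an illustration, assuming the standard formulation):
// Sketch: Gaussian kernel density estimate computed directly from
// f(x) = (1 / (n * h)) * sum_i exp(-(x - x_i)^2 / (2 * h^2)) / sqrt(2 * Pi)
val h = 3.0                        // same bandwidth as setBandwidth(3.0)
val samplePoints = test.collect()  // the iris first-attribute values used as the sample
def gaussianKde(x: Double): Double =
  samplePoints.map(xi => math.exp(-math.pow(x - xi, 2) / (2 * h * h))).sum /
    (samplePoints.length * h * math.sqrt(2 * math.Pi))
gaussianKde(5.0)                   // should be close to kd.estimate(Array(5.0))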