------------------1.Bucketizer----------------------------------
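Bucketing (binning) converts a continuous numeric column into discrete categories. One application is filtering out outliers.
splits (the split points): n+1 split points define n buckets. A bucket defined by the splits x and y holds values in the half-open range [x, y), except the last bucket, which also includes y. The splits must be strictly increasing. A value that falls outside the range covered by the splits is treated as an error. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note:
If the lower and upper bounds of the data are not known, add Double.NegativeInfinity and Double.PositiveInfinity as the outermost splits so that no value falls out of range.
The splits must be strictly increasing, for example: s0 < s1 < s2 < ... < sn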
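The bucketing rule described above can be sketched in plain Scala. This is only an illustration of the documented semantics, not Spark's internal code; the helper name bucketIndex is hypothetical, and NaN handling is left out on purpose.

```scala
// Hypothetical helper that mirrors the documented rule:
// bucket i covers [splits(i), splits(i + 1)), and the last bucket also includes its upper bound.
def bucketIndex(splits: Array[Double], value: Double): Int = {
  require(splits.length >= 3, "need at least three split points")
  require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")
  require(!value.isNaN, "NaN is not handled in this sketch")
  if (value < splits.head || value > splits.last)
    throw new IllegalArgumentException(s"value $value is outside the splits")
  // The upper bound of the last bucket belongs to the last bucket.
  if (value == splits.last) splits.length - 2
  else splits.lastIndexWhere(_ <= value)
}

val demoSplits = Array(Double.NegativeInfinity, -100.0, -10.0, 0.0, 10.0, 90.0, Double.PositiveInfinity)
println(bucketIndex(demoSplits, -100.0)) // lower bounds are inclusive
println(bucketIndex(demoSplits, 90.0))
```

With bounded splits such as Array(0.0, 1.0, 2.0), the value 2.0 still lands in the last bucket (index 1), while 2.5 raises an error, which is why the note above recommends infinite outer splits when the data range is unknown.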
import org.apache.spark.ml.feature.Bucketizer
// Double.NegativeInfinity: negative infinity
// Double.PositiveInfinity: positive infinity
// Six buckets: [-inf, -100), [-100, -10), [-10, 0), [0, 10), [10, 90), [90, +inf)
val splits = Array(Double.NegativeInfinity, -100, -10, 0.0, 10, 90, Double.PositiveInfinity)
val data: Array[Double] = Array(-180,-160,-100,-50,-70,-20,-8,-5,-3, 0.0, 1,3,7,10,30,60,90,100,120,150)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val bucketizer = new Bucketizer().setInputCol("features").setOutputCol("bucketedFeatures").setSplits(splits)
val bucketedData = bucketizer.transform(dataFrame)
bucketedData.show(50,truncate=false)
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|-180.0 |0.0 |
|-160.0 |0.0 |
|-100.0 |1.0 |
|-50.0 |1.0 |
|-70.0 |1.0 |
|-20.0 |1.0 |
|-8.0 |2.0 |
|-5.0 |2.0 |
|-3.0 |2.0 |
|0.0 |3.0 |
|1.0 |3.0 |
|3.0 |3.0 |
|7.0 |3.0 |
|10.0 |4.0 |
|30.0 |4.0 |
|60.0 |4.0 |
|90.0 |5.0 |
|100.0 |5.0 |
|120.0 |5.0 |
|150.0 |5.0 |
+--------+----------------+
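Application: filtering outliers. The example below keeps only the rows in buckets 0 and 1, i.e. values below -10 (the $"..." column syntax needs import spark.implicits._, which spark-shell provides automatically).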
bucketedData.filter($"bucketedFeatures" <= 1).show(50,truncate=false)
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|-180.0 |0.0 |
|-160.0 |0.0 |
|-100.0 |1.0 |
|-50.0 |1.0 |
|-70.0 |1.0 |
|-20.0 |1.0 |
+--------+----------------+
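For example, if the feature is age, a continuous value, and it has to be converted into discrete categories (minor, young adult, middle-aged, senior), Bucketizer is the tool to use.
Complete code example: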
import org.apache.spark.ml.feature.Bucketizer
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)
// Transform original data into its bucket index.
val bucketedData = bucketizer.transform(dataFrame)
println(s"Bucketizer output with ${bucketizer.getSplits.length-1} buckets")
bucketedData.show()
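The age scenario mentioned above could look roughly like the following. The age boundaries (18, 40, 60) and the column names age and ageGroup are assumptions chosen for illustration only; the Bucketizer calls are the same ones used in the example above. In spark-shell the spark session already exists, so getOrCreate simply reuses it.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("AgeBuckets").getOrCreate()

// Assumed boundaries: 0 = minor [0, 18), 1 = young adult [18, 40),
// 2 = middle-aged [40, 60), 3 = senior [60, +inf)
val ageSplits = Array(0.0, 18.0, 40.0, 60.0, Double.PositiveInfinity)
val ages = Array(5.0, 17.0, 18.0, 35.0, 59.0, 60.0, 82.0)
val ageDF = spark.createDataFrame(ages.map(Tuple1.apply)).toDF("age")

val ageBucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("ageGroup")
  .setSplits(ageSplits)

// Each age is replaced by the index of the bucket it falls into.
val ageBuckets = ageBucketizer.transform(ageDF)
ageBuckets.show()
```

Because the lowest split is 0.0 rather than Double.NegativeInfinity, a negative age would be rejected as out of range, which may be useful as a sanity check on the input.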
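----------------------------------------2.QuantileDiscretizer (quantile discretization)----------------------------------------------
Like Bucketizer, QuantileDiscretizer converts a continuous numeric feature into a discrete categorical feature. QuantileDiscretizer does not extend the Bucketizer class; it is an Estimator whose fit() returns a Bucketizer as the fitted model. The difference in use is that you no longer define the splits yourself: you only specify how many buckets you want. QuantileDiscretizer computes the quantiles itself and performs the discretization.
- Parameter 1: the number of buckets is set by the numBuckets parameter.
- Parameter 2: the precision of the approximation is set by the relativeError parameter. When it is set to 0, exact quantiles are computed (which is expensive).
- Parameter 3: the bucket ranges are chosen by an approximate quantile algorithm (approxQuantile). The lower and upper bounds are set to -Infinity and +Infinity, so the buckets cover all real values.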
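To make the idea of quantile-based splits concrete, here is a minimal plain-Scala sketch. It computes exact quantiles by sorting, which is an assumption made for simplicity; Spark uses the approximate approxQuantile algorithm instead, and the helper name quantileSplits is hypothetical.

```scala
// Hypothetical helper: exact-quantile split points, NOT Spark's approxQuantile.
def quantileSplits(values: Seq[Double], numBuckets: Int): Array[Double] = {
  require(numBuckets >= 2, "need at least two buckets")
  require(values.nonEmpty, "need at least one value")
  val sorted = values.sorted.toIndexedSeq
  // Interior cut points at probabilities 1/n, 2/n, ..., (n-1)/n,
  // using rank = ceil(i * size / n) computed with integer arithmetic.
  val interior = (1 until numBuckets).map { i =>
    val rank = (i * sorted.length + numBuckets - 1) / numBuckets
    sorted(math.max(rank - 1, 0))
  }
  // Infinite outer bounds cover all real values; distinct removes repeated
  // cut points so the result stays strictly increasing.
  (Double.NegativeInfinity +: interior.distinct :+ Double.PositiveInfinity).toArray
}

val hourValues = Seq(18.0, 19.0, 8.0, 5.0, 2.2)
val hourSplits = quantileSplits(hourValues, 3)
println(hourSplits.mkString(", "))
```

For these five values the sketch yields the split points -Infinity, 5.0, 18.0, Infinity. When many values are identical, fewer distinct cut points remain, so the number of buckets can end up smaller than numBuckets.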
For example:
val discretizer = new QuantileDiscretizer().setInputCol("hour").setOutputCol("result")
  .setNumBuckets(3)      // split into 3 buckets
  .setRelativeError(0.1) // relative error allowed for the approximate quantiles
val result = discretizer.fit(df).transform(df)
Example:
Suppose we have a DataFrame with the columns id and hour:
id | hour
----|------
0 | 18.0
----|------
1 | 19.0
----|------
2 | 8.0
----|------
3 | 5.0
----|------
4 | 2.2
After the transformation:
id | hour | result
----|------|------
0 | 18.0 | 2.0
----|------|------
1 | 19.0 | 2.0
----|------|------
2 | 8.0 | 1.0
----|------|------
3 | 5.0 | 1.0
----|------|------
4 | 2.2 | 0.0
Code example:
import org.apache.spark.ml.feature.QuantileDiscretizer
val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
val result = discretizer.fit(df).transform(df)
result.show()
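Since fit() returns a Bucketizer, the split points that QuantileDiscretizer learned can be inspected with getSplits. The sketch below repeats the data from the example above under new variable names (hourDF, qd, fitted are illustrative names, not from the original). The exact interior split values depend on the approximation, so no specific output is claimed here.

```scala
import org.apache.spark.ml.feature.{Bucketizer, QuantileDiscretizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("InspectSplits").getOrCreate()

val hours = Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val hourDF = spark.createDataFrame(hours).toDF("id", "hour")

val qd = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)

// fit() learns the split points from the data and returns a Bucketizer.
val fitted: Bucketizer = qd.fit(hourDF)
val learnedSplits = fitted.getSplits

// The first and last entries are the infinite outer bounds.
println(learnedSplits.mkString("[", ", ", "]"))
```

The fitted Bucketizer can then be reused on other DataFrames with the same column, so new data is bucketed with the split points learned from the training data.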