VectorAssembler 用法

从源数据中提取特征指标数据,这是一个比较典型且通用的步骤,因为我们的原始数据集里,经常会包含一些非指标数据,如 ID,Description 等。为方便后续模型进行特征输入,需要部分列的数据转换为特征向量,并统一命名,VectorAssembler类完成这一任务。VectorAssembler是一个transformer,将多列数据转化为单列的向量列。

  • 代码如下
object VectorAssemblerExample {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local[*]").appName("QuantileDiscretizerExample").getOrCreate()
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext
    import sqlContext.implicits._
    
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "mobile", "userFeatures"))
      .setOutputCol("features")

    val output = assembler.transform(dataset)
    output.show(false)
    println(output.select("features", "clicked").first())
    sc.stop()
  }
}
  • 原始数据
+---+----+------+--------------+-------+
| id|hour|mobile|  userFeatures|clicked|
+---+----+------+--------------+-------+
|  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|
+---+----+------+--------------+-------+
  • 转换之后的数据
+---+----+------+--------------+-------+-----------------------+
|id |hour|mobile|userFeatures  |clicked|features               |
+---+----+------+--------------+-------+-----------------------+
|0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[18.0,1.0,0.0,10.0,0.5]|
+---+----+------+--------------+-------+-----------------------+
  • 特征数据
[[18.0,1.0,0.0,10.0,0.5],1.0]

你可能感兴趣的:(VectorAssembler 用法)