Spark DataFrame: counting the number of distinct values in a column

Scala version

import spark.implicits._

// Build a small sample DataFrame with a few categorical columns
val data1 = Seq(
  ("0", "2002", "196", "1", "bai"),
  ("1", "4004", "192", "2", "wang"),
  ("0", "7007", "95", "3", "wang"),
  ("0", "4004", "4", "4", "wang"),
  ("0", "7007", "15", "5", "wang"),
  ("1", "0",    "14", "6", "zhu"),
  ("0", "9009", "96", "7", "bai"),
  ("1", "9009", "126", "8", "bai"),
  ("0","10010", "19219", "9", "wei")
).toDF("label", "AMOUNT", "Pclass", "MAC_id", "PAYER_CODE")

data1.show()

Counting the number of distinct elements in a column

  • Method 1: approx_count_distinct
import org.apache.spark.sql.functions._

// Approximate distinct count (HyperLogLog++-based); cheaper than an exact count on large data
data1.agg(approx_count_distinct("label")).first().get(0).toString

Result:
res661: String = 2
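
As a sketch not in the original post, approx_count_distinct also accepts a maximum relative estimation error (rsd); the 0.01 below is an illustrative value, not one the post uses:

import org.apache.spark.sql.functions.approx_count_distinct

// Allow roughly 1% relative standard deviation in the estimate (illustrative setting)
data1.agg(approx_count_distinct("label", 0.01)).first().getLong(0)
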
  • Method 2: countDistinct
import org.apache.spark.sql.functions._

// Exact distinct count of the label column
data1.agg(countDistinct("label")).first().get(0).toString
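
Both methods return 2 for label here, since the column only contains "0" and "1". As an extra usage sketch (not from the original post), countDistinct can also be applied to every column in a single pass:

import org.apache.spark.sql.functions.{col, countDistinct}

// One exact distinct count per column, keeping the original column names
data1.select(data1.columns.map(c => countDistinct(col(c)).alias(c)): _*).show()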

Python version

The same aggregations are available in PySpark; a rough sketch follows below.
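
A minimal PySpark sketch, assuming a SparkSession named spark already exists and re-creating a few of the sample rows from the Scala example:

from pyspark.sql import functions as F

# Rebuild a small sample frame (same schema as the Scala example)
data1 = spark.createDataFrame(
    [("0", "2002", "196", "1", "bai"),
     ("1", "4004", "192", "2", "wang"),
     ("0", "7007", "95", "3", "wang")],
    ["label", "AMOUNT", "Pclass", "MAC_id", "PAYER_CODE"])

# Exact and approximate distinct counts of the label column
data1.agg(F.countDistinct("label")).collect()[0][0]
data1.agg(F.approx_count_distinct("label")).collect()[0][0]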
