Spark Big Data Analysis: Spark Structured Streaming (23) Deduplicating Data


import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

    // Build a local SparkSession for the demo
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("test")
      .getOrCreate()

    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    // Stream text lines from a TCP socket (host/port of the test machine)
    val lines = spark.readStream
      .format("socket")
      .option("host", "linux01")
      .option("port", 9999)
      .load()

    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    // Each input line has the form "uid,timestamp,word";
    // parse the timestamp into a java.sql.Timestamp to use as event time
    val words = lines.as[String].map(s => {
      val arr = s.split(",")
      val date = sdf.parse(arr(1))
      (arr(0), new Timestamp(date.getTime), arr(2))
    }).toDF("uid", "ts", "word")

    // Deduplicate by uid; the 2-minute watermark bounds how long
    // per-uid state is kept before it can be evicted
    val wordCounts = words
      .withWatermark("ts", "2 minutes")
      .dropDuplicates("uid")


    // Emit only first-seen rows to the console; append mode fits
    // deduplication, and ProcessingTime(0) triggers as fast as possible
    val query = wordCounts.writeStream
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(0))
      .format("console")
      .start()

    query.awaitTermination()
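To try the query end to end, run a socket server on the configured host (for example with `nc -lk 9999`, assuming netcat is available) and type comma-separated rows. With dropDuplicates("uid"), only the first row per uid within the watermark is emitted; the sample values below are hypothetical:

    u1,2024-05-01 10:00:00,hello   // first time uid u1 is seen: emitted
    u1,2024-05-01 10:00:30,hi      // uid u1 again within the watermark: dropped
    u2,2024-05-01 10:01:00,world   // new uid u2: emitted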

dropDuplicates specifies the columns used as the deduplication key: if two rows have identical uid values, they are treated as duplicates. Multiple column names may be passed.
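For example, to treat rows as duplicates only when both uid and word match, pass both column names (a minimal sketch reusing the words DataFrame from above; dedupByUidAndWord is just an illustrative name):

    // A row is a duplicate only if an earlier row had the same uid AND word
    val dedupByUidAndWord = words
      .withWatermark("ts", "2 minutes")
      .dropDuplicates("uid", "word")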
dropDuplicates cannot be used after an aggregation; that is, a DataFrame/Dataset produced by an aggregation cannot then have dropDuplicates called on it.
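A sketch of the pattern that is ruled out, assuming the restriction above; on a streaming Dataset, Spark rejects such a plan when the query starts (typically with an AnalysisException):

    // Aggregation first...
    val counts = words.groupBy("word").count()
    // ...then dropDuplicates on the aggregated result is NOT allowed:
    // counts.dropDuplicates("word")   // rejected for a streaming query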
