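Case study

Simplified requirement

A Hive table has three columns, id, score, and date, holding the user ID, the credit score, and the month (yyyyMM). The data looks like this: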

1,50,201909
2,60,201909
1,30,201910
2,62,201910

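Each month a credit score is computed from the user's spending, with one constraint: this month's score must not be lower than last month's.

Requirement analysis

In other words, once this month's score has been computed, it is compared with last month's score; if this month's score is lower, last month's score replaces it. But how do we implement that? An ordinary (non-transactional) Hive table does not support row-level updates...
After a lot of head-scratching, I decided to change the approach: all I really need is the larger of two values, and a window function can handle exactly that.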
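The reasoning above can be sketched in plain Scala, without Spark, to make the "take the larger value per user" idea concrete. This is only an illustration under stated assumptions: `Record`, `FloorByPreviousMonth`, and `adjusted` are hypothetical names that do not appear in the original job, and the sketch assumes the input holds only last month's (already adjusted) scores and this month's raw scores.

```scala
// Hypothetical in-memory stand-in for one row of the Hive table (id, score, date).
final case class Record(id: String, score: Int, date: String)

object FloorByPreviousMonth {
  // The four sample rows from the table above.
  val rows: Seq[Record] = Seq(
    Record("1", 50, "201909"),
    Record("2", 60, "201909"),
    Record("1", 30, "201910"),
    Record("2", 62, "201910")
  )

  // For each user, keep the highest score among the given rows
  // and stamp it with the current month.
  def adjusted(rows: Seq[Record], currentMonth: String): Seq[Record] =
    rows
      .groupBy(_.id)
      .map { case (id, rs) => Record(id, rs.map(_.score).max, currentMonth) }
      .toSeq
      .sortBy(_.id)

  def main(args: Array[String]): Unit =
    adjusted(rows, "201910").foreach(println)
}
```

For the sample rows, user 1 keeps last month's 50 because 30 is lower, while user 2 moves up to 62.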

Implementation

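import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    // Silence Spark's INFO/WARN logging so only the result is printed.
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Load the id,score,date lines. The date column is not needed here, because
    // the file is assumed to hold only last month's and this month's rows.
    spark.read.textFile("./data/comp")
      .map(_.split(","))
      // Parse score as Int so that "order by score desc" compares numbers, not strings.
      .map(x => (x(0), x(1).trim.toInt))
      .toDF("id", "score")
      .createTempView(viewName = "view")

    import spark.sql
    // Rank each user's scores from high to low and keep only the highest one.
    sql(sqlText = "select id,score from (select id,score,row_number() " +
      "over(partition by id order by score desc) as rn from view) as t where t.rn = 1")
      .createTempView(viewName = "view1")

    // Label the surviving score with the current month (hard-coded for this example).
    sql(sqlText = "select id,score,'201910' as date from view1").show()
    spark.stop()
  }
}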

Result

+---+-----+------+
| id|score|  date|
+---+-----+------+
|  1|   50|201910|
|  2|   62|201910|
+---+-----+------+
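
As a variation, the same rule can be applied to every month at once with a running maximum over each user's history, instead of hard-coding the current month. The following is a minimal sketch, not the job used above: `RunningMaxSketch` and `withFloor` are hypothetical names, and it assumes the date column sorts correctly as a yyyyMM string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

object RunningMaxSketch {
  // Replace each month's score with the highest score the user has had up to that month.
  def withFloor(scores: DataFrame): DataFrame = {
    val upToThisMonth = Window
      .partitionBy("id")
      .orderBy("date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    scores.withColumn("score", max(col("score")).over(upToThisMonth))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RunningMaxSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as the table at the top of the post.
    val scores = Seq(
      ("1", 50, "201909"), ("2", 60, "201909"),
      ("1", 30, "201910"), ("2", 62, "201910")
    ).toDF("id", "score", "date")

    withFloor(scores).orderBy("date", "id").show()
    spark.stop()
  }
}
```

With the sample rows, the 201909 scores stay as they are, user 1's 201910 score is raised from 30 to 50, and user 2's 201910 score stays at 62. A running maximum is used rather than a comparison with only the previous raw row, so that a floor set several months ago still holds after a low month in between.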
