Fixing duplicate columns after joining two DataFrames in Spark SQL

When working with Spark, we often need to join tables. Take these two DataFrames as an example:

import spark.implicits._

val data1 = Seq(
  ("1", "ming", "hlj"),
  ("2", "tian", "jl"),
  ("3", "wang", "ln"),
  ("4", "qi", "bj"),
  ("5", "sun", "tj")
).toDF("useid", "name", "live")

val data2 = Seq(
  ("1", "ming", "sing"),
  ("2", "tian", "dance"),
  ("3", "kun", "rap"),
  ("4", "qi", "lanqiu"),
  ("5", "ke", "zuqiu")
).toDF("useid", "name", "hobby")


Joining data1 and data2 is simple, but a small problem comes up:

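// Register both DataFrames as temp views so that SQL can refer to them
data1.createOrReplaceTempView("tmp_1")
data2.createOrReplaceTempView("tmp_2")

val frame = spark.sql(
  "select * from " +
  "(select * from tmp_1) aa inner join " +
  "(select * from tmp_2) bb " +
  "on aa.useid = bb.useid and aa.name = bb.name")
frame.show()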
The result has six columns: useid, name, live, useid, name, hobby.

The useid and name columns each appear twice.
How do we keep only one copy of each?

Solution:

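    var str = ""
    val db1 = "data1" // table names, also used as column suffixes
    val db2 = "data2"
    val keynames = List("name", "useid")
    // Register the source DataFrames so they can be queried by name
    data1.createOrReplaceTempView(db1)
    data2.createOrReplaceTempView(db2)
    var frame1 = spark.sql(s"select * from $db1")
    var frame2 = spark.sql(s"select * from $db2")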

    // Append the table name to each join key so the two copies can be told apart
    for (keyname<-keynames){
      println("++++++++++++++++"+keyname)
      frame1 = frame1.withColumnRenamed(keyname, keyname+s"_$db1")
      frame2 = frame2.withColumnRenamed(keyname, keyname+s"_$db2")

      str += s" aa.${keyname}_$db1 = bb.${keyname}_$db2 and"
    }
    frame1.createOrReplaceTempView(s"tmp_1")
    frame2.createOrReplaceTempView(s"tmp_2")
    str = str.dropRight(3)
    val frame = spark.sql(s"select" +
      s" *" +
      " from" +
      s"(select * from tmp_1" +
      s") aa inner join " +
      s"(select * from  tmp_2 " +
      s") bb" +
      s" on" +
      str +
      s"")


    // Drop the db1 copies of the key columns and rename the db2 copies back to their original names
    var framefinal = frame
    for (keyname<-keynames){
      framefinal = framefinal.drop(s"${keyname}_$db1")
      framefinal = framefinal.withColumnRenamed(s"${keyname}_$db2", keyname)
    }
    framefinal.show()
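
The join condition assembled in the loop above can also be built with map and mkString, which avoids appending a trailing "and" and then trimming it with dropRight(3). This is a minimal sketch in plain Scala. The object name JoinConditionSketch and the helper buildJoinCondition are hypothetical names introduced here for illustration; the aa/bb aliases and the table-name suffix follow the code above.

```scala
object JoinConditionSketch {
  // Hypothetical helper: produces "aa.<key>_<db1> = bb.<key>_<db2>" for each key
  // and joins the pieces with " and ", so no trailing "and" has to be removed.
  def buildJoinCondition(keynames: Seq[String], db1: String, db2: String): String =
    keynames
      .map(keyname => s"aa.${keyname}_$db1 = bb.${keyname}_$db2")
      .mkString(" and ")

  def main(args: Array[String]): Unit = {
    val condition = buildJoinCondition(List("name", "useid"), "data1", "data2")
    println(condition)
  }
}
```

The returned string can be placed after " on " in the spark.sql call in place of str.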

framefinal is the final DataFrame. It has a single useid column and a single name column, alongside live and hobby.
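
As an alternative to renaming and dropping columns, the DataFrame API has a join overload that takes the key column names as a Seq[String]; Spark then keeps only one copy of each key column in the result. This is a different technique from the one above, sketched here under the assumption of a local SparkSession. The object name UsingColumnsJoinSketch and the helper joinOnKeys are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UsingColumnsJoinSketch {
  // Hypothetical helper: inner-joins two DataFrames on the shared key columns.
  // With this overload each key column appears only once in the output.
  def joinOnKeys(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame =
    left.join(right, keys)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("using-columns-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val data1 = Seq(
      ("1", "ming", "hlj"),
      ("2", "tian", "jl"),
      ("3", "wang", "ln"),
      ("4", "qi", "bj"),
      ("5", "sun", "tj")
    ).toDF("useid", "name", "live")

    val data2 = Seq(
      ("1", "ming", "sing"),
      ("2", "tian", "dance"),
      ("3", "kun", "rap"),
      ("4", "qi", "lanqiu"),
      ("5", "ke", "zuqiu")
    ).toDF("useid", "name", "hobby")

    val joined = joinOnKeys(data1, data2, Seq("useid", "name"))
    joined.show()

    spark.stop()
  }
}
```

For an inner join this should give the columns useid, name, live, hobby, with no duplicated key columns to clean up afterwards.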
