2021.5.20 Spark SQL DataFrame API (work in progress)

When I actually went to use the DataFrame API, I realized I had forgotten most of it, so I put together this summary of the commonly used methods. The company code can't be shared here,
so some test cases are written below instead:

Recording a friend's environment settings and version numbers here for future reference. Last time I helped this friend download dependencies, and now I can't find my own environment config, emmm:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hive-common.version>3.0.0-cdh6.3.2</hive-common.version>
    <hive-client.version>3.0.0-cdh6.3.2</hive-client.version>
    <hadoop-hdfs.version>3.0.0-cdh6.3.2</hadoop-hdfs.version>
    <spark.version>2.4.0-cdh6.3.2</spark.version>
</properties>

No matter, I still have spark-shell. Tough going:
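In spark-shell the spark session already exists with Hive support wired in. For reference, in a standalone application it would be built roughly like this (a minimal sketch; the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

// Build a SparkSession with Hive support so spark.sql can see Hive databases and tables
val spark = SparkSession.builder()
  .appName("DataFrameApiNotes")
  .enableHiveSupport()
  .getOrCreate()

// Enables the $"col" syntax and toDF on Scala collections (spark-shell imports this automatically)
import spark.implicits._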

The first topic is joining tables:

The join method:
I'm connected to the Hive database through Spark, and the DataFrames are generated with Spark SQL.
1. My exam database contains this table: ex_exam_record

// switch to the exam database
scala> spark.sql("use exam")
res38: org.apache.spark.sql.DataFrame = []
// list the tables in the exam database
scala> spark.sql("show tables").show
+--------+----------------+-----------+
|database|       tableName|isTemporary|
+--------+----------------+-----------+
|    exam| ex_exam_anlysis|      false|
|    exam|ex_exam_question|      false|
|    exam|  ex_exam_record|      false|
+--------+----------------+-----------+
// inspect the table schema
scala> spark.sql("desc ex_exam_record").show
+-----------+---------+-------+
|   col_name|data_type|comment|
+-----------+---------+-------+
|   topic_id|   string|   null|
| student_id|   string|   null|
|question_id|   string|   null|
|      score|    float|   null|
+-----------+---------+-------+
// query 10 rows from the table; output shown below
scala> spark.sql("select * from ex_exam_record limit 10").show
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+

2. Turn this table into a DataFrame, named recordDf

// create recordDf
scala> val recordDf =  spark.sql("select * from ex_exam_record")
recordDf: org.apache.spark.sql.DataFrame = [topic_id: string, student_id: string ... 2 more fields]
// show the first 10 rows
scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows
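Side note (not part of the original session): spark.table reads a Hive table directly and is equivalent to the select * query above; recordDf2 is just an illustrative name.

val recordDf2 = spark.table("ex_exam_record")   // same schema as recordDf above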

3. To demonstrate the join, I need a table that shares a column, so I create a new one called test1.
We saw that topic_id in the record table is a string;
in the new table, let's make topic_id an int.

scala> spark.sql("create table test1 ( topic_id int comment '主题id', teacher_id int comment '教师id' )")
res46: org.apache.spark.sql.DataFrame = []

scala> spark.sql("desc test1")
res47: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

scala> spark.sql("desc test1").show()
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|  topic_id|      int| 主题id|
|teacher_id|      int| 教师id|
+----------+---------+-------+

4. Insert some rows into the table; note the original table contains ids such as 34434412 and 34434459.

scala> spark.sql("insert into table test1 values (34434412,1),(34434459,2)")
res50: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from test1").show
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412|         1|
|34434459|         2|
+--------+----------+

5. The two rows were inserted successfully. Now try a join; note that the two topic_id columns have different types.

// first turn the test1 table into a DataFrame
scala> val test1Df = spark.sql("select * from test1")
test1Df: org.apache.spark.sql.DataFrame = [topic_id: int, teacher_id: int]
scala> test1Df.show()
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412|         1|
|34434459|         2|
+--------+----------+
// then try joining recordDf with test1Df
scala> test1Df.join(recordDf,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+----------+--------+-------------+-----------+-----+
|topic_id|teacher_id|topic_id|   student_id|question_id|score|
+--------+----------+--------+-------------+-----------+-----+
|34434412|         1|34434412|8195023659593|       8080|  1.0|
|34434412|         1|34434412|8195023659591|       8775|  0.0|
|34434412|         1|34434412|8195023659594|       2439|  0.5|
|34434412|         1|34434412|8195023659593|       9496|  1.0|
|34434412|         1|34434412|8195023659598|       7854|  0.0|
+--------+----------+--------+-------------+-----------+-----+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
| 3443449|8195023659591|       1736|  0.0|    null|      null|
|34434473|8195023659591|       7982|  0.0|    null|      null|
|34434444|8195023659595|       4753|  0.0|    null|      null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
// the join succeeded even though the types differ; check the schemas of both DataFrames
scala> recordDf.schema
res60: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,StringType,true), StructField(student_id,StringType,true), StructField(question_id,StringType,true), StructField(score,FloatType,true))
scala> test1Df.schema
res61: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,IntegerType,true), StructField(teacher_id,IntegerType,true))

6. So when joining, if the column types differ, Spark attempts an automatic type conversion. Here is how to convert a column type manually:

scala> val newRecordDf = recordDf.withColumn("new_topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res64: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,StringType,true), 
StructField(student_id,StringType,true), 
StructField(question_id,StringType,true), 
StructField(score,FloatType,true), 
StructField(new_topic_id,IntegerType,true)
)
//Conclusion: a new column new_topic_id of type IntegerType was added;
//            the original topic_id column is still there and its type is unchanged
scala> val newRecordDf = recordDf.withColumn("topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res65: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,IntegerType,true), 
StructField(student_id,StringType,true), 
StructField(question_id,StringType,true), 
StructField(score,FloatType,true)
)
//Conclusion: if the new column name matches an existing column, that column is overwritten and its type converted
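Two other ways to make the cast explicit (a sketch, not from the original notes):

// 1) SQL-style cast via selectExpr
val castedDf = recordDf.selectExpr("cast(topic_id as int) as topic_id", "student_id", "question_id", "score")
// 2) cast inside the join condition instead of relying on the implicit coercion seen above
recordDf.join(test1Df, recordDf("topic_id").cast("int") === test1Df("topic_id"), "left").show(5)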

Update 5.23:
How to use explode:
explode works on columns of array or map type; calling it on a string column (such as topic_id) fails with:

org.apache.spark.sql.AnalysisException: 
cannot resolve 'explode(exam.ex_exam_record.`topic_id`)' due to data type mismatch: 
input to function explode should be array or map type, not string;
DDL statements to alter the table:
ALTER TABLE c_employee ADD COLUMNS (work string); -- add a column
spark.sql("ALTER TABLE test1 ADD COLUMNS (mapCol map) ")

.withColumn("region_country", explode(col("region_countries")))

Question 1:
How do you insert data into a map-type column using INSERT INTO?
And how else can the data be loaded without INSERT INTO? Load a text file into the table's directory on HDFS, minding the default delimiters \001, \002, \003.
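A possible answer to both questions, untested here (the table name, file path, and map<string,string> value types are assumptions): create the table with explicit Hive delimiters and LOAD a text file, or use insert ... select with the map() SQL function.

// Hive-style table whose map column uses the default \001/\002/\003 delimiters
spark.sql("""
  CREATE TABLE test_map (
    id int,
    attrs map<string,string>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
  STORED AS TEXTFILE
""")
// load a text file already formatted with those delimiters from HDFS
spark.sql("LOAD DATA INPATH '/tmp/test_map.txt' INTO TABLE test_map")
// insert via select, building the map with the map() function
spark.sql("INSERT INTO test_map SELECT 1, map('k1', 'v1', 'k2', 'v2')")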

Add a column and assign the new column a value:

scala> test1Df.withColumn("topic_id1",lit($"topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412|         1| 34434412|
|34434459|         2| 34434459|
+--------+----------+---------+
scala> test1Df.withColumn("topic_id1",lit("topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412|         1| topic_id|
|34434459|         2| topic_id|
+--------+----------+---------+

The DataFrame equivalent of SQL's CASE WHEN:

scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593|       8080|  1.0|
|34434459|8195023659592|       3762|  0.0|
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
|34434473|8195023659597|       7680|  1.0|
|34434417|8195023659593|       3727|  0.0|
|34434454|8195023659598|       6108|  1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows

scala> recordDf.withColumn("grade",when(col("score")>0.0,lit("优秀")).otherwise(col("score"))).show(10)
+--------+-------------+-----------+-----+-----+
|topic_id|   student_id|question_id|score|grade|
+--------+-------------+-----------+-----+-----+
|34434412|8195023659593|       8080|  1.0| 优秀|
|34434459|8195023659592|       3762|  0.0|  0.0|
| 3443449|8195023659591|       1736|  0.0|  0.0|
|34434473|8195023659591|       7982|  0.0|  0.0|
|34434444|8195023659595|       4753|  0.0|  0.0|
|34434425|8195023659597|       7130|  1.0| 优秀|
|34434497|8195023659594|       3365|  1.0| 优秀|
|34434473|8195023659597|       7680|  1.0| 优秀|
|34434417|8195023659593|       3727|  0.0|  0.0|
|34434454|8195023659598|       6108|  1.0| 优秀|
+--------+-------------+-----------+-----+-----+
only showing top 10 rows
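when() calls can also be chained for multiple branches (a sketch; the extra grade labels are made up, and org.apache.spark.sql.functions._ is assumed imported as above):

recordDf.withColumn("grade",
  when(col("score") >= 1.0, lit("优秀"))
    .when(col("score") >= 0.5, lit("及格"))
    .otherwise(lit("不及格"))
).show(10)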

To summarize, several ways to reference a column:

col("score")
$"score"
recordDf("score")

The different join types:
The one to watch is left_anti, which keeps the left-table rows that have no match on the right.

scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
| 3443449|8195023659591|       1736|  0.0|    null|      null|
|34434473|8195023659591|       7982|  0.0|    null|      null|
|34434444|8195023659595|       4753|  0.0|    null|      null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows


scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left_anti").show(5)
+--------+-------------+-----------+-----+
|topic_id|   student_id|question_id|score|
+--------+-------------+-----------+-----+
| 3443449|8195023659591|       1736|  0.0|
|34434473|8195023659591|       7982|  0.0|
|34434444|8195023659595|       4753|  0.0|
|34434425|8195023659597|       7130|  1.0|
|34434497|8195023659594|       3365|  1.0|
+--------+-------------+-----------+-----+
only showing top 5 rows


scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id")).show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434459|8195023659592|       3762|  0.0|34434459|         2|
|34434459|8195023659591|       5657|  0.5|34434459|         2|
|34434412|8195023659591|       8775|  0.0|34434412|         1|
|34434459|8195023659596|       8248|  0.5|34434459|         2|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows

scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"right").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id|   student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593|       8080|  1.0|34434412|         1|
|34434412|8195023659591|       8775|  0.0|34434412|         1|
|34434412|8195023659594|       2439|  0.5|34434412|         1|
|34434412|8195023659593|       9496|  1.0|34434412|         1|
|34434412|8195023659598|       7854|  0.0|34434412|         1|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
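The same pattern works for the remaining join type strings; output omitted, just a sketch:

// left_semi keeps only the left rows that do match, without bringing in the right-side columns
recordDf.join(test1Df, recordDf("topic_id") === test1Df("topic_id"), "left_semi").show(5)
// full (full outer) keeps unmatched rows from both sides
recordDf.join(test1Df, recordDf("topic_id") === test1Df("topic_id"), "full").show(5)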

This : _* construct shows up a lot: it expands all elements of a sequence as varargs.

val selectedFields = "student_id,question_id,score"
recordDf.selectExpr(selectedFields.split(","): _*).show(5)
+-------------+-----------+-----+
|   student_id|question_id|score|
+-------------+-----------+-----+
|8195023659593|       8080|  1.0|
|8195023659592|       3762|  0.0|
|8195023659591|       1736|  0.0|
|8195023659591|       7982|  0.0|
|8195023659595|       4753|  0.0|
+-------------+-----------+-----+
only showing top 5 rows
Selects every element of the array.
scala> recordDf.selectExpr(selectedFields.split(",")(0) ).show(5)
+-------------+
|   student_id|
+-------------+
|8195023659593|
|8195023659592|
|8195023659591|
|8195023659591|
|8195023659595|
+-------------+
only showing top 5 rows
Selects the element at index 0 of the array.
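selectExpr also accepts full SQL expressions, not just column names (a sketch; score_pct is a made-up alias):

recordDf.selectExpr("student_id", "cast(score * 100 as int) as score_pct").show(5)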
scala> recordDf.select("topic_id","student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows

scala> recordDf.select($"topic_id",$"student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows


scala> val aaa = recordDf("topic_id")
aaa: org.apache.spark.sql.Column = topic_id

scala> recordDf.select(aaa,$"student_id").show(2)
+--------+-------------+
|topic_id|   student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows

scala> val aaa=$"topic_id"
aaa: org.apache.spark.sql.ColumnName = topic_id
scala> val aaa="topic_id"
aaa: String = topic_id
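A plain String works in select() only because select has a String overload; building an expression still requires a Column (a sketch):

recordDf.select(aaa).show(2)                            // ok: select(col: String, cols: String*)
recordDf.select(col("score") + 1, $"topic_id").show(2)  // expressions need Column values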
