When I actually had to use the DataFrame API, I realized I had forgotten most of it, so I put together a summary of the commonly used calls. The code at work can't be shared, so I'll write some test cases below instead.
First, a record of a friend's environment and version numbers for future reference (last time I helped a friend download dependencies I couldn't find my own environment config, emmm):
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hive-common.version>3.0.0-cdh6.3.2</hive-common.version>
    <hive-client.version>3.0.0-cdh6.3.2</hive-client.version>
    <hadoop-hdfs.version>3.0.0-cdh6.3.2</hadoop-hdfs.version>
    <spark.version>2.4.0-cdh6.3.2</spark.version>
</properties>
No matter, I still have spark-shell. Rough, but it works:
The join method:
I connect to the Hive database through Spark and generate DataFrames with Spark SQL.
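For reference, in a standalone application (instead of spark-shell) the entry point is built roughly like this; a minimal sketch, with the app name made up:
import org.apache.spark.sql.SparkSession

// a SparkSession with Hive support; spark-shell already provides this as `spark`
val spark = SparkSession.builder()
  .appName("dataframe-api-notes")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._   // enables the $"colName" syntax used below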
1. My exam database has a table called ex_exam_record.
//switch to the exam database
scala> spark.sql("use exam")
res38: org.apache.spark.sql.DataFrame = []
//list the tables in the exam database
scala> spark.sql("show tables").show
+--------+----------------+-----------+
|database| tableName|isTemporary|
+--------+----------------+-----------+
| exam| ex_exam_anlysis| false|
| exam|ex_exam_question| false|
| exam| ex_exam_record| false|
+--------+----------------+-----------+
//inspect the table structure
scala> spark.sql("desc ex_exam_record").show
+-----------+---------+-------+
| col_name|data_type|comment|
+-----------+---------+-------+
| topic_id| string| null|
| student_id| string| null|
|question_id| string| null|
| score| float| null|
+-----------+---------+-------+
//query 10 rows from the table, output shown below
scala> spark.sql("select * from ex_exam_record limit 10").show
+--------+-------------+-----------+-----+
|topic_id| student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593| 8080| 1.0|
|34434459|8195023659592| 3762| 0.0|
| 3443449|8195023659591| 1736| 0.0|
|34434473|8195023659591| 7982| 0.0|
|34434444|8195023659595| 4753| 0.0|
|34434425|8195023659597| 7130| 1.0|
|34434497|8195023659594| 3365| 1.0|
|34434473|8195023659597| 7680| 1.0|
|34434417|8195023659593| 3727| 0.0|
|34434454|8195023659598| 6108| 1.0|
+--------+-------------+-----------+-----+
2. Turn this table into a DataFrame named recordDf.
//create recordDf
scala> val recordDf = spark.sql("select * from ex_exam_record")
recordDf: org.apache.spark.sql.DataFrame = [topic_id: string, student_id: string ... 2 more fields]
//show the first 10 rows
scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id| student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593| 8080| 1.0|
|34434459|8195023659592| 3762| 0.0|
| 3443449|8195023659591| 1736| 0.0|
|34434473|8195023659591| 7982| 0.0|
|34434444|8195023659595| 4753| 0.0|
|34434425|8195023659597| 7130| 1.0|
|34434497|8195023659594| 3365| 1.0|
|34434473|8195023659597| 7680| 1.0|
|34434417|8195023659593| 3727| 0.0|
|34434454|8195023659598| 6108| 1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows
3. To demonstrate joins I want another table that shares a column, so let's create a new one, say test1.
topic_id in the record table is a string,
so in the new table let's deliberately make topic_id an int.
scala> spark.sql("create table test1 ( topic_id int comment '主题id', teacher_id int comment '教师id' )")
res46: org.apache.spark.sql.DataFrame = []
scala> spark.sql("desc test1")
res47: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]
scala> spark.sql("desc test1").show()
+----------+---------+-------+
| col_name|data_type|comment|
+----------+---------+-------+
| topic_id| int| 主题id|
|teacher_id| int| 教师id|
+----------+---------+-------+
4. Insert some data into the table; the original record table contains ids such as 34434412 and 34434459.
scala> spark.sql("insert into table test1 values (34434412,1),(34434459,2)")
res50: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from test1").show
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412| 1|
|34434459| 2|
+--------+----------+
5. The two rows are inserted. Now try a join; note that the two topic_id columns have different types.
//first turn the test1 table into a DataFrame
scala> val test1Df = spark.sql("select * from test1")
test1Df: org.apache.spark.sql.DataFrame = [topic_id: int, teacher_id: int]
scala> test1Df.show()
+--------+----------+
|topic_id|teacher_id|
+--------+----------+
|34434412| 1|
|34434459| 2|
+--------+----------+
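(As an aside, for quick experiments a small DataFrame like test1Df can also be built from a local Seq without touching Hive; a minimal sketch:)
// requires import spark.implicits._, which spark-shell provides automatically
val test1Local = Seq((34434412, 1), (34434459, 2)).toDF("topic_id", "teacher_id")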
//then try joining recordDf with test1Df
scala> test1Df.join(recordDf,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+----------+--------+-------------+-----------+-----+
|topic_id|teacher_id|topic_id| student_id|question_id|score|
+--------+----------+--------+-------------+-----------+-----+
|34434412| 1|34434412|8195023659593| 8080| 1.0|
|34434412| 1|34434412|8195023659591| 8775| 0.0|
|34434412| 1|34434412|8195023659594| 2439| 0.5|
|34434412| 1|34434412|8195023659593| 9496| 1.0|
|34434412| 1|34434412|8195023659598| 7854| 0.0|
+--------+----------+--------+-------------+-----------+-----+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id| student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593| 8080| 1.0|34434412| 1|
|34434459|8195023659592| 3762| 0.0|34434459| 2|
| 3443449|8195023659591| 1736| 0.0| null| null|
|34434473|8195023659591| 7982| 0.0| null| null|
|34434444|8195023659595| 4753| 0.0| null| null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
//they joined even though the types differ; check the DataFrame schemas
scala> recordDf.schema
res60: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,StringType,true), StructField(student_id,StringType,true), StructField(question_id,StringType,true), StructField(score,FloatType,true))
scala> test1Df.schema
res61: org.apache.spark.sql.types.StructType = StructType(StructField(topic_id,IntegerType,true), StructField(teacher_id,IntegerType,true))
6. So when joining, columns of different types are implicitly cast as needed. For the record, here is how to cast manually:
scala> val newRecordDf = recordDf.withColumn("new_topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res64: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,StringType,true),
StructField(student_id,StringType,true),
StructField(question_id,StringType,true),
StructField(score,FloatType,true),
StructField(new_topic_id,IntegerType,true)
)
//conclusion: a new column new_topic_id of type IntegerType is added;
//the original topic_id column is still there and its type is unchanged
scala> val newRecordDf = recordDf.withColumn("topic_id",col("topic_id") cast "Int")
scala> newRecordDf.schema
res65: org.apache.spark.sql.types.StructType = StructType(
StructField(topic_id,IntegerType,true),
StructField(student_id,StringType,true),
StructField(question_id,StringType,true),
StructField(score,FloatType,true)
)
//conclusion: if the new column name matches an existing one, that column is overwritten in place and its type converted
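An alternative sketch is to cast while selecting, so there is never a duplicate or renamed column to clean up:
// cast topic_id to int on the fly; the other columns pass through unchanged
val castedRecordDf = recordDf.selectExpr("cast(topic_id as int) as topic_id",
  "student_id", "question_id", "score")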
Update on 5/23:
How to use explode:
explode works on columns of array or map type; calling it on a plain string column throws:
org.apache.spark.sql.AnalysisException:
cannot resolve 'explode(exam.ex_exam_record.`topic_id`)' due to data type mismatch:
input to function explode should be array or map type, not string;
DDL to modify a table:
ALTER TABLE c_employee ADD COLUMNS (work string); -- add a column
spark.sql("ALTER TABLE test1 ADD COLUMNS (mapCol map<string,string>)")
//typical usage of explode: one output row per element of the array column
.withColumn("region_country", explode(col("region_countries")))
Question 1:
How do you insert data into a map-typed column with insert into?
And without insert into, how else can the data be loaded? One way is to load a text file into the table's directory on HDFS, minding the default delimiters \001, \002 and \003.
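A sketch of one possible answer (not verified here): complex types usually cannot be written as plain literals in a VALUES list, but insert ... select with the built-in map() function should work, assuming the mapCol column from the ALTER above exists. For loading, LOAD DATA can pull a delimited text file into the table; the path below is made up.
// insert a map value by selecting it rather than listing it in VALUES
spark.sql("insert into table test1 select 34434412, 1, map('k1', 'v1', 'k2', 'v2')")

// load a text file whose fields use the default Hive delimiters:
// \001 between columns, \002 between map entries, \003 between key and value
spark.sql("load data inpath '/tmp/test1_data.txt' into table test1")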
Add a column and assign it a value. Note the difference below: lit($"topic_id") copies the values of the topic_id column, while lit("topic_id") fills the new column with the literal string "topic_id":
scala> test1Df.withColumn("topic_id1",lit($"topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412| 1| 34434412|
|34434459| 2| 34434459|
+--------+----------+---------+
scala> test1Df.withColumn("topic_id1",lit("topic_id")).show
+--------+----------+---------+
|topic_id|teacher_id|topic_id1|
+--------+----------+---------+
|34434412| 1| topic_id|
|34434459| 2| topic_id|
+--------+----------+---------+
A case when equivalent in the DataFrame API:
scala> recordDf.show(10)
+--------+-------------+-----------+-----+
|topic_id| student_id|question_id|score|
+--------+-------------+-----------+-----+
|34434412|8195023659593| 8080| 1.0|
|34434459|8195023659592| 3762| 0.0|
| 3443449|8195023659591| 1736| 0.0|
|34434473|8195023659591| 7982| 0.0|
|34434444|8195023659595| 4753| 0.0|
|34434425|8195023659597| 7130| 1.0|
|34434497|8195023659594| 3365| 1.0|
|34434473|8195023659597| 7680| 1.0|
|34434417|8195023659593| 3727| 0.0|
|34434454|8195023659598| 6108| 1.0|
+--------+-------------+-----------+-----+
only showing top 10 rows
scala> recordDf.withColumn("grade",when(col("score")>0.0,lit("优秀")).otherwise(col("score"))).show(10)
+--------+-------------+-----------+-----+-----+
|topic_id| student_id|question_id|score|grade|
+--------+-------------+-----------+-----+-----+
|34434412|8195023659593| 8080| 1.0| 优秀|
|34434459|8195023659592| 3762| 0.0| 0.0|
| 3443449|8195023659591| 1736| 0.0| 0.0|
|34434473|8195023659591| 7982| 0.0| 0.0|
|34434444|8195023659595| 4753| 0.0| 0.0|
|34434425|8195023659597| 7130| 1.0| 优秀|
|34434497|8195023659594| 3365| 1.0| 优秀|
|34434473|8195023659597| 7680| 1.0| 优秀|
|34434417|8195023659593| 3727| 0.0| 0.0|
|34434454|8195023659598| 6108| 1.0| 优秀|
+--------+-------------+-----------+-----+-----+
only showing top 10 rows
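With more than two branches, when can be chained before the final otherwise; a small sketch with made-up thresholds and labels:
recordDf.withColumn("grade",
  when(col("score") >= 1.0, lit("优秀"))
    .when(col("score") >= 0.5, lit("及格"))
    .otherwise(lit("不及格"))
).show(10)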
To summarize, the ways to reference a column (a quick sketch follows this list):
col("score")
$"score"
recordDf("score")
The different join types:
The main one to note is left_anti, which keeps only the left table's rows that fail to match.
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id| student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593| 8080| 1.0|34434412| 1|
|34434459|8195023659592| 3762| 0.0|34434459| 2|
| 3443449|8195023659591| 1736| 0.0| null| null|
|34434473|8195023659591| 7982| 0.0| null| null|
|34434444|8195023659595| 4753| 0.0| null| null|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"left_anti").show(5)
+--------+-------------+-----------+-----+
|topic_id| student_id|question_id|score|
+--------+-------------+-----------+-----+
| 3443449|8195023659591| 1736| 0.0|
|34434473|8195023659591| 7982| 0.0|
|34434444|8195023659595| 4753| 0.0|
|34434425|8195023659597| 7130| 1.0|
|34434497|8195023659594| 3365| 1.0|
+--------+-------------+-----------+-----+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id")).show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id| student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593| 8080| 1.0|34434412| 1|
|34434459|8195023659592| 3762| 0.0|34434459| 2|
|34434459|8195023659591| 5657| 0.5|34434459| 2|
|34434412|8195023659591| 8775| 0.0|34434412| 1|
|34434459|8195023659596| 8248| 0.5|34434459| 2|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
scala> recordDf.join(test1Df,recordDf("topic_id")===test1Df("topic_id"),"right").show(5)
+--------+-------------+-----------+-----+--------+----------+
|topic_id| student_id|question_id|score|topic_id|teacher_id|
+--------+-------------+-----------+-----+--------+----------+
|34434412|8195023659593| 8080| 1.0|34434412| 1|
|34434412|8195023659591| 8775| 0.0|34434412| 1|
|34434412|8195023659594| 2439| 0.5|34434412| 1|
|34434412|8195023659593| 9496| 1.0|34434412| 1|
|34434412|8195023659598| 7854| 0.0|34434412| 1|
+--------+-------------+-----------+-----+--------+----------+
only showing top 5 rows
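One more variant worth noting (a sketch, not from the original session): when the join key has the same name on both sides, passing a Seq of column names keeps only one copy of the key in the result, avoiding the duplicated topic_id columns seen above.
// join on the shared column name; the output has a single topic_id column
recordDf.join(test1Df, Seq("topic_id"), "left").show(5)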
This :_* syntax shows up a lot; it expands all elements of an array as varargs.
val selectedFields = "student_id,question_id,score"
recordDf.selectExpr(selectedFields.split(","): _*).show(5)
+-------------+-----------+-----+
| student_id|question_id|score|
+-------------+-----------+-----+
|8195023659593| 8080| 1.0|
|8195023659592| 3762| 0.0|
|8195023659591| 1736| 0.0|
|8195023659591| 7982| 0.0|
|8195023659595| 4753| 0.0|
+-------------+-----------+-----+
only showing top 5 rows
Selecting all elements of the array.
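An equivalent sketch maps each name to a Column with col() and expands the array the same way:
recordDf.select(selectedFields.split(",").map(col): _*).show(5)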
scala> recordDf.selectExpr(selectedFields.split(",")(0) ).show(5)
+-------------+
| student_id|
+-------------+
|8195023659593|
|8195023659592|
|8195023659591|
|8195023659591|
|8195023659595|
+-------------+
only showing top 5 rows
Selecting the element at index 0 of the array.
scala> recordDf.select("topic_id","student_id").show(2)
+--------+-------------+
|topic_id| student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows
scala> recordDf.select($"topic_id",$"student_id").show(2)
+--------+-------------+
|topic_id| student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows
scala> val aaa = recordDf("topic_id")
aaa: org.apache.spark.sql.Column = topic_id
scala> recordDf.select(aaa,$"student_id").show(2)
+--------+-------------+
|topic_id| student_id|
+--------+-------------+
|34434412|8195023659593|
|34434459|8195023659592|
+--------+-------------+
only showing top 2 rows
scala> val aaa=$"topic_id"
aaa: org.apache.spark.sql.ColumnName = topic_id
scala> val aaa="topic_id"
aaa: String = topic_id
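Note: $"topic_id" yields a ColumnName (a subclass of Column), so it works anywhere a Column is expected, while a bare "topic_id" is just a String; select happens to accept column-name strings too, but expressions such as === need a real Column, so wrap the string with col() first.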