The Hive table used below has already been created; see: Hive example: Sogou user search logs
Configuration:
1. Copy core-site.xml and hive-site.xml into Spark's conf directory
core-site.xml lives in Hadoop's configuration directory, hive-site.xml in Hive's
2. Copy mysql-connector-java-5.1.47-bin.jar into Spark's jars directory
3. Edit Spark's copy of hive-site.xml
In /spark/conf/hive-site.xml, add the two properties sketched below
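A minimal sketch of the snippet (assuming the metastore runs on hadoop01 on the default port 9083; adjust the host to your own setup):
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01:9083</value>
</property>
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>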
Explanation:
hive.metastore.uris: the address Spark SQL connects to; this is Hive's metastore service, from which Spark reads the Hive table metadata
The other property disables the metastore schema version check
4. Start Hive's metastore service
For Spark SQL to use Hive tables, Hive's metastore service must be running:
hive --service metastore &
Once started it stays in the background, serving Spark SQL
Running hive --service metastore without the trailing & starts the same service, only in the foreground; the & is just more convenient
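One way to confirm the service is listening (the metastore's default port is 9083):
netstat -lnpt | grep 9083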
If starting the metastore fails with a version-mismatch error such as
MetaException(message:Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt)
edit Hive's own configuration file /hive/conf/hive-site.xml
and add the same hive.metastore.schema.verification property there to disable the version check
The common operations below can also be cross-checked against the official documentation:
http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
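Note that spark-shell creates the spark session automatically; in a standalone program you would build it yourself with Hive support enabled. A minimal sketch (the app name is arbitrary):
import org.apache.spark.sql.SparkSession

// enableHiveSupport() lets the session read table metadata from the Hive metastore
val spark = SparkSession.builder()
  .appName("HiveDemo")
  .enableHiveSupport()
  .getOrCreate()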
Usage:
Start the spark shell:
[root@hadoop01 ~]# spark-shell
scala>
Tests:
(1) List the databases in Hive's metastore
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| hive|
+------------+
scala>
(2) List the tables in Hive's metastore
scala> val priors = spark.sql("show tables")
priors: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> priors.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| sogou| false|
+--------+---------+-----------+
scala>
(3) Count the rows in the sogou table
scala> val x = spark.sql("select count(*) from sogou")
x: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> x.show
+--------+
|count(1)|
+--------+
| 10000|
+--------+
scala>
(4) Create a new table
scala> spark.sql("create table if not exists people(id int,name string)")
res4: org.apache.spark.sql.DataFrame = []
scala> spark.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| people| false|
| default| sogou| false|
+--------+---------+-----------+
scala>
(5) Insert data
scala> spark.sql("insert into people values (1,'a'),(2,'b'),(3,'c')")
19/05/28 17:29:48 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res7: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from people").show()
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
| 3| c|
+---+----+
scala>
(6) Load local data
The local file /test/age.txt contains:
a,18
b,24
c,22
Create the table with comma-delimited fields (the SQL syntax is the same as Hive's):
spark.sql("create table if not exists people_age(name string,age int) row format delimited fields terminated by ','")
Load the data (again, standard Hive SQL):
scala> spark.sql("load data local inpath '/test/age.txt' into table people_age")
19/05/28 23:20:51 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res10: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from people_age").show
+----+---+
|name|age|
+----+---+
| a| 18|
| b| 24|
| c| 22|
+----+---+
scala>
(7) Join the two tables, filter, and sort in descending order
scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc").show
+---+----+---+
| id|name|age|
+---+----+---+
| 2| b| 24|
| 3| c| 22|
+---+----+---+
scala>
To return only n rows, append limit n to the query
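For example, keeping only the top row of the query above:
spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc limit 1").show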
(8) Save the result as a persistent table
val resultDF = spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age")
Save it as a persistent table:
resultDF.write.saveAsTable("people_result")
scala> spark.sql("show tables").show
+--------+-------------+-----------+
|database| tableName|isTemporary|
+--------+-------------+-----------+
| default| people| false|
| default| people_age| false|
| default|people_result| false|
| default| sogou| false|
+--------+-------------+-----------+
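Note that by default saveAsTable fails if the target table already exists; a save mode can be set to replace it instead, e.g.:
resultDF.write.mode("overwrite").saveAsTable("people_result")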
A saved persistent table can be loaded directly with Spark's table method:
scala> spark.table("people_result").show
+---+----+---+
| id|name|age|
+---+----+---+
| 3| c| 22|
| 2| b| 24|
+---+----+---+
scala>
Everything above used SQL statements; DSL-style operations and the usual RDD operations are available as well
First create a DataFrame:
scala> val df = spark.sql("select * from people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
or:
scala> val df = spark.table("people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 3| c| 22|
| 2| b| 24|
+---+----+---+
DSL operations
(1) Selecting and filtering
scala> df.select("name").show //df.select("*").show
+----+
|name|
+----+
| c|
| b|
+----+
scala> df.where("id=2").show //df.select("name").where("id=2").show
+---+----+---+
| id|name|age|
+---+----+---+
| 2| b| 24|
+---+----+---+
scala> df.filter("age<23").show
+---+----+---+
| id|name|age|
+---+----+---+
| 3| c| 22|
+---+----+---+
(2) Sorting
scala> df.sort("id").show //df.sort($"id".desc).show
+---+----+---+
| id|name|age|
+---+----+---+
| 2| b| 24|
| 3| c| 22|
+---+----+---+
(3) Create a new DataFrame with toDF
scala> val df_new = df.toDF
df_new: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df_new.show
+---+----+---+
| id|name|age|
+---+----+---+
| 3| c| 22|
| 2| b| 24|
+---+----+---+
The toDF method can also assign new column names:
scala> val df_new = df.toDF("newTD","newName","newAge")
df_new: org.apache.spark.sql.DataFrame = [newTD: int, newName: string ... 1 more field]
scala> df_new.show
+-----+-------+------+
|newTD|newName|newAge|
+-----+-------+------+
| 3| c| 22|
| 2| b| 24|
+-----+-------+------+
scala>
RDD operations:
A DataFrame is built on top of an RDD; it is essentially an RDD plus a schema (the table header)
RDD-style operations such as collect, count, cache, filter, sort, join, and sample are all available
scala> df.collect
res35: Array[org.apache.spark.sql.Row] = Array([3,c,22], [2,b,24])
scala> df.take(2)
res36: Array[org.apache.spark.sql.Row] = Array([3,c,22], [2,b,24])
scala> df.first
res38: org.apache.spark.sql.Row = [3,c,22]
scala> df.count
res39: Long = 2
scala> df.cache
res40: df.type = [id: int, name: string ... 1 more field]
scala> df.sample(0.5).show //df.sample(false,0.5).show, where false means sampling without replacement
+---+----+---+
| id|name|age|
+---+----+---+
| 2| b| 24|
+---+----+---+
Converting a DataFrame to an RDD:
val rdd1 = df.rdd
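The elements of the resulting RDD are org.apache.spark.sql.Row objects, whose fields are read by position or by name. A small sketch using the data above:
// extract the name column from each Row (getAs is a standard Row method)
val names = rdd1.map(row => row.getAs[String]("name"))
names.collect    // expected: Array(c, b), given the rows shown above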
Spark's own Hive example ships with the distribution, at:
examples/src/main/scala/org/apache/spark/examples/sql/SparkHiveExample.scala