Spark can work with Hive in two ways: by operating on Hive tables from the spark-sql CLI, or by operating on Hive tables from IDEA through the Hive MetaStore. Both approaches are introduced below.
$SPARK_HOME/bin/spark-sql
spark-sql (default)> show databases;
databaseName
company
default
Time taken: 0.024 seconds, Fetched 11 row(s)
spark-sql (company)> select * from score;
id name subject
1 tom ["HuaXue","Physical","Math","Chinese"]
2 jack ["HuaXue","Animal","Computer","Java"]
3 john ["ZheXue","ZhengZhi","SiXiu","history"]
4 alice ["C++","Linux","Hadoop","Flink"]
If the queries above return the expected results, the setup is working.
Building on the configuration from the previous step, the next examples write Spark code, package it, and run it on the cluster; the build sketch below shows one way to pull in the required Spark modules.
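Because the code below calls enableHiveSupport(), the spark-hive module must be on the classpath when the jar is built. A minimal build.sbt sketch, assuming sbt is used; the version numbers are placeholders and should match the cluster's Spark and Scala versions:

// build.sbt (sketch); versions are placeholders, match your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.4.0" % "provided"
)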
Read a local file and write it into a Hive table
import org.apache.spark.sql.SparkSession

/**
 * Read a local file and write it into a Hive table.
 */
object SaveToHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      // Enable Hive support
      .enableHiveSupport()
      // Configure the Hive warehouse location
      .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
      .getOrCreate()
    // Read the local file
    val rawRdd = spark.sparkContext.textFile("file:///home/hadoop/data/spark/course.txt")
    import spark.implicits._
    val frame = rawRdd.map(line => {
      val splits = line.split(",")
      (splits(0), splits(1), splits(2))
    }).toDF("id", "cour", "score")
    // Register a temporary view
    frame.createOrReplaceTempView("score")
    // Using INSERT INTO avoids the FileFormat mismatch error
    spark.sql("insert into spark.course select id,cour,score from score")
    spark.stop()
  }
}
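If you prefer the DataFrameWriter API over going through a temporary view, insertInto writes into an existing Hive table with INSERT INTO semantics. A minimal sketch under the same assumptions as above (the table spark.course already exists, with columns in the order id, cour, score); the object name SaveToHiveTableApi is only illustrative:

import org.apache.spark.sql.SparkSession

object SaveToHiveTableApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    val frame = spark.sparkContext
      .textFile("file:///home/hadoop/data/spark/course.txt")
      .map(_.split(","))
      .map(a => (a(0), a(1), a(2)))
      .toDF("id", "cour", "score")
    // insertInto matches columns by position, so the DataFrame column order
    // must match the existing table definition
    frame.write.insertInto("spark.course")
    spark.stop()
  }
}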
Submit it to YARN to run:
./bin/spark-submit \
--class com.ruozedata.sparksql.hive.SaveToHiveTable \
--master yarn \
/home/hadoop/jars/spark/spark-sql-1.0.jar \
2
Read table data from Hive and display it
import org.apache.spark.sql.SparkSession

/**
 * Read a Hive table with Spark.
 */
object ReadHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      // Enable Hive support
      .enableHiveSupport()
      // Configure the Hive warehouse location (optional here)
      .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
      .getOrCreate()
    val df = spark.sql("select id,cour,score from spark.course")
    df.createOrReplaceTempView("terminal")
    spark.sql("select * from terminal where score > 90").show(false)
    spark.stop()
  }
}
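The same query can also be written without registering a temporary view, by filtering the table directly through the DataFrame API. A minimal sketch, assuming the same spark.course table; the object name ReadHiveTableApi is only illustrative:

import org.apache.spark.sql.SparkSession

object ReadHiveTableApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    // Load the Hive table as a DataFrame and filter it directly,
    // no intermediate temporary view required
    spark.table("spark.course")
      .select("id", "cour", "score")
      .where($"score" > 90)
      .show(false)
    spark.stop()
  }
}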
Submit it to the cluster to run:
./bin/spark-submit \
--class com.ruozedata.sparksql.hive.ReadHiveTable \
--master yarn \
/home/hadoop/jars/spark/spark-sql-1.0.jar \
2
Read data from the ODS layer, aggregate it and drop null values, then store the result in a DWD-layer Hive table
import org.apache.spark.sql.SparkSession

/**
 * Aggregate Hive ODS-layer data and write it into the DWD layer.
 * Cleansing: filter out null and empty values.
 */
object HiveToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      // Enable Hive support
      .enableHiveSupport()
      // Configure the Hive warehouse location
      .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
      .getOrCreate()
    val ods_sql =
      """
        |select cour,sum(score) total
        |from spark.ods_course
        |where cour is not null and length(cour) > 0
        |group by cour
        |""".stripMargin
    val ods_df = spark.sql(ods_sql)
    // Registering a temporary view makes it show up alongside the Hive tables
    ods_df.createOrReplaceTempView("dwd_tmp")
    // Using INSERT INTO avoids the FileFormat mismatch error
    spark.sql("insert into spark.dwd_course select cour,total from dwd_tmp")
    spark.stop()
  }
}
Note: with Spark SQL, a shuffle produces 200 tasks by default (spark.sql.shuffle.partitions defaults to 200), so this parameter usually needs to be tuned to the data volume; a sketch of how to set it follows.
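A minimal sketch of adjusting the shuffle parallelism; the value 50 is only a placeholder and should be sized to the actual data:

import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      // defaults to 200; 50 is just a placeholder value
      .config("spark.sql.shuffle.partitions", "50")
      .getOrCreate()
    // it can also be changed at runtime for an existing session
    spark.conf.set("spark.sql.shuffle.partitions", "50")
    spark.stop()
  }
}

The same setting can also be passed on the command line at submit time with --conf spark.sql.shuffle.partitions=50.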
Also: registering a temporary view makes it show up in the table list next to the Hive tables, so temporary views are not recommended here; the whole job can be expressed as a single SQL statement instead, as sketched below.
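A minimal sketch of the single-statement version of the ODS-to-DWD job above (same spark.ods_course and spark.dwd_course tables; the object name is only illustrative). The null filter, the aggregation and the insert are folded into one INSERT INTO ... SELECT, so no temporary view is registered:

import org.apache.spark.sql.SparkSession

object HiveToHiveSingleSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .getOrCreate()
    // One statement: filter nulls in ODS, aggregate, and write to DWD
    spark.sql(
      """
        |insert into spark.dwd_course
        |select cour, sum(score) total
        |from spark.ods_course
        |where cour is not null and length(cour) > 0
        |group by cour
        |""".stripMargin)
    spark.stop()
  }
}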