Integrating Spark with Hive
---------------------------
1. Create a symlink to hive-site.xml under spark/conf:
$>xcall.sh "ln -s /soft/hive/conf/hive-site.xml /soft/spark/conf/hive-site.xml"
2. Copy the MySQL JDBC driver jar into spark/jars (Spark needs it to reach the Hive metastore), e.g.:
...
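For example (a sketch; the jar name below is a placeholder for whatever MySQL connector ships with your Hive install, typically under /soft/hive/lib):
$>xcall.sh "cp /soft/hive/lib/mysql-connector-java-x.y.z.jar /soft/spark/jars/"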
3. Start spark-shell and run SQL against Hive:
$>spark.sql("use default").show()
$>spark.sql("show tables").show()
$>spark.sql("select * from custs").show(1000,false)
$>val rdd1 = sc.parallelize(Array((1,"tom1",12), (2,"tom2",13), (3,"tom3",14)))
$>val df = rdd1.toDF("id" , "name" , "age")
$>df.select("id").show() //select id from ..
$>df.createOrReplaceTempView("_cust")   //registerTempTable is deprecated in Spark 2.x
$>spark.sql("select * from _cust").show(1000, false)
Spark SQL programming in IDEA, operating on Hive data
-----------------------------------------------------
1. Add the spark-sql and spark-hive dependencies; see the sbt sketch below.
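A sketch of these dependencies in sbt form (standard Spark coordinates; the version is a placeholder, match it to the Spark on your cluster):
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.1.0"  //placeholder version
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.1.0"  //placeholder version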
2. Copy hive-site.xml, core-site.xml and hdfs-site.xml into the resources directory.
Note: edit hive-site.xml to turn off the metastore schema check (set hive.metastore.schema.verification to false).
3. Write the Scala program
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
  * Created by Administrator on 2018/8/2.
  */
object SQLScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("SQLScala")
    conf.setMaster("local[4]")
    //conf.set("spark.sql.warehouse.dir", "hdfs://mycluster/user/hive/warehouse")

    //create the SparkSession with Hive support
    val sess = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
    val df = sess.sql("show databases")
    df.show()
    sess.sql("use default").show()
    sess.sql("select dep, max(salary) from emp group by dep").show(10000, false)

    //build a DataFrame from an RDD and query it as a temp view
    val rdd1 = sess.sparkContext.parallelize(Array((1,"tom1",12), (2,"tom2",12), (3,"tom3",14)))
    import sess.implicits._
    val df1 = rdd1.toDF("id", "name", "age")
    df1.select("id", "age").show(1000)
    df1.createOrReplaceTempView("_cust")
    sess.sql("select id,name,age from _cust where age < 14").show()
  }
}
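Not in the original notes, but as a sketch of the write path: the same DataFrame can also be saved back into Hive through the session, assuming write access to the warehouse (cust_copy is a hypothetical table name):
df1.write.mode("overwrite").saveAsTable("default.cust_copy")  //creates/overwrites a managed Hive table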
4. Write the Java version of the SQL program
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SQLJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("SQLJava");
        conf.setMaster("local[*]");
        SparkSession sess = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();
        //this line was truncated in the notes; "custs" follows the spark-shell example above
        Dataset<Row> ds = sess.sql("select * from custs");
        ds.show();
        ds.select("id", "name").show();
        ds.createOrReplaceTempView("_cust");
        sess.sql("select * from _cust where age < 12").show();
    }
}
5. Dataset
A Dataset is a strongly typed collection of objects that can be computed in parallel; DataFrame is simply an alias for Dataset[Row].
It works much like an RDD: transformations produce a new Dataset, while actions execute the computation and return results.
Datasets are lazily evaluated: nothing runs until an action is called. Internally a Dataset is represented as a logical plan; when an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient distributed execution.
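A small Scala sketch of this laziness (reusing the sess and _cust view from the programs above; explain(true) prints the logical, optimized and physical plans):
val filtered = sess.sql("select id, name from _cust where age < 14")  //transformation only: builds a logical plan, nothing executes yet
filtered.explain(true)  //show the plans without running the query
filtered.show()         //action: triggers optimization and distributed execution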