My setup: the free community edition of CDH 5.7.6, with Spark running on YARN. Starting with CDH 5.5, the Spark distro that Cloudera ships no longer includes the Thrift Server (Spark's distributed SQL endpoint) or the spark-sql script. The Thrift Server is one of the key entry points to Spark's vision of fusing heterogeneous data, and the spark-sql script is a handy tool for testing SQL, but CDH pushes its own Impala first:
why and when to use which engine (Hive, Impala, and Spark)
Please add Spark Thrift Server to the CDH Spark distro
Cloudera positions Kudu against Spark SQL and Livy against SJS (Spark Job Server), and acquired FAST to move into machine learning; it is building its own version of a unified big-data platform, taking a different route from Spark. The following replies pretty much sum it up:
For interfacing with SQL BI tools like Tableau, Excel, etc, Impala is the best engine. It was specifically designed for BI tools like Tableau and is fully certified/supported with most major BI tools including Tableau. Please see the following blog for more details on why and when to use which engine (Hive, Impala, and Spark)
For those looking for a Spark server to develop applications against, the Thrift Server for Spark is architecturally limited to exposing just SQL (in addition to other architectural limitations around security, multi-tenancy, redundancy, concurrency, etc). As such Cloudera founded the Livy project which aims to enable an interface for applications to better interface with Spark broadly (available as community preview in Cloudera Labs for feedback and community participation): http://blog.cloudera.com/blog/2016/07/livy-the-open-source-rest-service-for-apache-spark-joins-cloudera-labs/
SQL is not Spark's main business, but it is the gateway to Hive and to relational databases, and Spark's SQL parser adds some syntax of its own, such as registering a temporary table backed by any relational store (an RDB, Hive, ...) for which a driver is available, all without writing a program, which is quite powerful. Since I don't want to use Impala or SJS for now, the only option left is to build Spark myself and swap it into the CDH distro:
How to upgrade Spark on CDH5.5
The most attractive thing about building upstream Spark is that "you can always run the latest version of Spark on CDH." Which raises the question: build upstream Spark, or build the CDH Spark distro? The command for building a Spark binary distribution is as follows (the parameters are the same as for an mvn build):
make-distribution.sh -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.7.6 -Dscala-2.10.5 -DskipTests -Phive -Phive-thriftserver
The -Phive and -Phive-thriftserver profiles tell the build to compile and package the Hive dependencies and the Thrift Server, but you cannot choose the Hive version; it defaults to Hive 1.2.1... In this CDH community thread:
CDH 5.5 does not have Spark Thrift Server
it is pointed out that: The thrift server in Spark is not tested, and might not be compatible, with the Hive version that is in CDH. Hive in CDH is 1.1 (patched) and Spark uses Hive 1.2.1. You might see API issues during compilation or run time failures due to that.
The CDH Spark distro carries fixes and improvements ahead of the community releases, and even the newest CDH at the time of writing (5.12) still ships this same Spark 1.6.0. Since this is going onto a production cluster, I decided to play it safe and build the CDH Spark distro. The build went through cleanly; the steps were:
1. Download the source package from archive.cloudera.com/cdh5/cdh/5/ and stop the CDH Spark service. There is actually not much to stop: for Spark there is only a single History Server role instance.
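For example, something along these lines (the exact tarball name on archive.cloudera.com is an assumption; check the archive listing for 5.7.6 and adjust):

# download and unpack the CDH Spark source (file name assumed, verify against the archive listing)
wget http://archive.cloudera.com/cdh5/cdh/5/spark-1.6.0-cdh5.7.6-src.tar.gz
tar -xzf spark-1.6.0-cdh5.7.6-src.tar.gz
cd spark-1.6.0-cdh5.7.6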
2. Build with the make-distribution command above. The three most important jars produced by the build are collected in dist/lib, and the easiest approach is to copy the assembly and examples jars plus spark-1.6.0-cdh5.7.6-yarn-shuffle.jar over the existing ones in /opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/jars on every cluster node. That directory is in fact on every node's Java classpath, as you can see from the environment information listed by the History Server on port 18088, and thanks to Java's on-demand class loading all the jars can simply be dropped into this one place.
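A minimal sketch of that copy step, assuming passwordless SSH and a hypothetical nodes.txt listing the cluster hostnames (the jar names are wildcards; use the exact files your build put in dist/lib):

# push the rebuilt jars into the parcel jar directory on every node
for host in $(cat nodes.txt); do
  scp dist/lib/spark-assembly-*.jar \
      dist/lib/spark-examples-*.jar \
      dist/lib/spark-*-yarn-shuffle.jar \
      ${host}:/opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/jars/
done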
3. Upload the assembly jar to HDFS, e.g. /user/spark/share/lib (this is just a directory to hold the assembly; name it however you like). Grant permissions on the jar and the spark directory, e.g. hdfs dfs -chmod 755 /user/spark, so the directory is visible through namenode:50070.
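The corresponding commands are roughly the following (the assembly file name is whatever your build produced; run them as a user allowed to write to /user/spark):

# create the directory, upload the assembly and open up the permissions
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put dist/lib/spark-assembly-*.jar /user/spark/share/lib/
hdfs dfs -chmod -R 755 /user/spark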
4. Point the CDH Spark configuration at the assembly:
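In Cloudera Manager this is the Spark service's HDFS jar location setting; expressed as a plain spark-defaults.conf property it would look something like the line below (the assembly file name is assumed from my build, adjust to yours):

# point YARN at the assembly uploaded in step 3 (file name assumed)
spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly-1.6.0-cdh5.7.6-hadoop2.6.0-cdh5.7.6.jar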
5. I run a separate Hive metastore dedicated to Spark. Locate the config file via the HIVE_CONF_DIR set in spark-env.sh, in my case /etc/hive/conf/hive-site.xml, and change the hive.metastore.uris property;
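The relevant block in /etc/hive/conf/hive-site.xml looks roughly like this (host and port are placeholders for my dedicated metastore):

<!-- dedicated Hive metastore for Spark; host/port are placeholders -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>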
6. Copy the Thrift Server start/stop scripts to the cluster nodes and start it:
./start-thriftserver.sh --master yarn-client --hiveconf hive.server2.thrift.port=10001
7. Test it: beeline -u jdbc:hive2://localhost:10001 -n hdfs
beeline options: -u is followed by a standard JDBC URL, -n is the user name, and -p is the password (with no security enabled on the cluster the password can be empty). Note that, whether this is a HiveServer2 or a beeline quirk, specifying a database name in the connection URL has no effect; you always land in the default database. For detailed beeline usage see:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample
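Once connected, a quick sanity check against the Hive table used later in this post might be (the table name is from my environment):

-- confirm the metastore is reachable through the Thrift Server
show databases;
select count(*) from idp_pub_ans.dim_aclineend;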
CDH calls Spark, Hive and the rest "services", and their components (Spark's History Server, Hive's Metastore Service, and so on) are "roles" belonging to a service; a role actually running on a node is a "role instance". The so-called gateways are simply their clients, i.e. the nodes where you can run CLIs such as spark-shell, hive, or beeline. On CDH, SPARK_HOME is /opt/cloudera/parcels/CDH-5.7.6-1.cdh5.7.6.p0.6/lib/spark, and every service keeps its configuration under /etc/ (for Spark that is /etc/spark/conf), which holds Spark's two key pieces: the global defaults in spark-defaults.conf and the environment setup script spark-env.sh. From CDH's point of view Spark is rather "loose": unlike HDFS or Hive, it has no particular daemon that has to live on a particular node.
With ST (the Spark Thrift Server) in place, the next step is to test joining MySQL and Hive through it: the gid column of the Hive table idp_pub_ans.dim_aclineend joins to the resource_id column of the MySQL table hive.test. First, Spark needs to know about the MySQL driver; taking the vanilla spark-defaults.conf as an example, configure extra classpaths as follows:
spark.driver.extraClassPath=/usr/appsoft/spark/lib/*
spark.executor.extraClassPath=/usr/appsoft/spark/lib/*
These two lines add an extra classpath for the driver and the executors respectively, pointing at every jar under one directory; the MySQL driver simply goes into that directory, which is convenient because any future extra jars can be dropped into the same place. On CDH you can set these in the CM UI and then deploy the updated configuration to the cluster. Don't forget to copy the MySQL driver into that directory on every node; the path itself is arbitrary, it just has to exist on each node's local disk.
Then restart ST and run a CREATE statement against it to register the MySQL table as a Spark SQL temporary table named mySQLtest:
CREATE TEMPORARY TABLE mySQLtest
USING org.apache.spark.sql.jdbc
OPTIONS(
  url "jdbc:mysql://192.11.1.1:3306/hive",
  dbtable "test",
  user 'hive',
  password 'hive'
);
Entered as a single line in beeline: CREATE TEMPORARY TABLE mySQLtest USING org.apache.spark.sql.jdbc OPTIONS(url "jdbc:mysql://192.11.1.1:3306/hive", dbtable "test", user 'hive', password 'hive');
Done. You can now join the two heterogeneous data sources on dim_aclineend.gid = mySQLtest.resource_id, and by the same token Spark can treat Oracle and other RDBs, or HDFS data files such as JSON and CSV, in much the same way; this is the DF in Data Fusion. HBase is slightly different (a separate post on shc can cover that later). ST does have one weakness, though: it is launched once through spark-submit as a single JVM holding one SparkContext, and every incoming JDBC connection gets its own session on top of it, so a temporary table only lives for the duration of the session that registered it. If you have heavy cross-source JOIN needs, the more dependable SJS is worth considering. This is also worth comparing with the HiveServer2 introduction:
http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/
which notes: "For each client connection, it creates a new execution context that serves Hive SQL requests from the client." So HS2, too, is designed to create a separate context for every client connection.
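For reference, the cross-source join described above, issued from the same beeline session in which mySQLtest was registered, is just ordinary SQL (only the two join-key columns are selected here, since those are the only columns named in this post):

-- join the Hive table with the registered MySQL temporary table
SELECT h.gid, m.resource_id
FROM idp_pub_ans.dim_aclineend h
JOIN mySQLtest m ON h.gid = m.resource_id
LIMIT 10;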
Alternatively, the same source can be registered as a temporary view (same OPTIONS as above):
CREATE TEMPORARY VIEW mysqlInfo
USING org.apache.spark.sql.jdbc
OPTIONS(
  url "jdbc:mysql://192.11.1.1:3306/hive",
  dbtable "test",
  user 'hive',
  password 'hive'
);
And the spark-sql script way:
bin/spark-sql --driver-class-path lib/mysql-connector-java.jar
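spark-sql can also take the query straight from the command line with -e, which is handy for quick smoke tests; for example, reusing the Hive table from above:

bin/spark-sql --driver-class-path lib/mysql-connector-java.jar \
  -e "select count(*) from idp_pub_ans.dim_aclineend"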
Appendix (follow-up topics):
Multi-user resource contention and allocation in the Spark Thrift Server
High availability (HA) for the Spark Thrift Server: implementation and configuration