Spark SQL can read data stored in Hive, and starting the Thrift JDBC/ODBC Server allows clients written in other languages to run queries through Spark SQL. The official documentation on Hive support is a little confusing: it sounds as if Hive is not compiled in at all and you have to build it yourself. The documentation says:
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark’s build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
In fact, the pre-built Spark distribution already ships with Hive 1.2.1; it is simply not compiled into the Spark assembly, and the Hive support jars sit under spark/lib. Further down, the official documentation notes:
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
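For an application that embeds Spark directly instead of going through the Thrift server, this bundled Hive support is enabled per session. A minimal sketch, assuming Spark 2.x (where SparkSession replaces the older HiveContext) and that hive-site.xml is on the application's classpath:

import org.apache.spark.sql.SparkSession;

public class HiveQueryExample {
    public static void main(String[] args) {
        // enableHiveSupport() wires in the bundled Hive classes (SerDes, UDFs, etc.)
        SparkSession spark = SparkSession.builder()
                .appName("HiveQueryExample")
                .enableHiveSupport()
                .getOrCreate();
        // queries run against the Hive metastore configured in hive-site.xml
        spark.sql("SHOW TABLES").show();
        spark.stop();
    }
}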
Spark SQL can both read from other databases over JDBC and, through the Spark SQL JDBC server, let other applications run queries against Spark SQL. Note the difference between the two:
Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python as it does not require the user to provide a ClassTag. (Note that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL).
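The data-source side of this distinction looks like the sketch below in application code; the server side is exactly what the Java client at the end of this post demonstrates. Assumptions here: Spark 2.x, a MySQL database named test with a hypothetical table people, and the MySQL connector jar on the driver classpath.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcSourceExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("JdbcSourceExample").getOrCreate();
        // read a MySQL table into a DataFrame through the JDBC data source
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://fang-ubuntu:3306/test")
                .option("dbtable", "people")
                .option("user", "hive")
                .option("password", "123456")
                .load();
        df.show();
        spark.stop();
    }
}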
Install MySQL. In a terminal, run:
sudo apt-get install mysql-server mysql-client libmysqlclient-dev
During installation a dialog prompts you to set the root password; enter the same password twice.
Check whether the installation succeeded:
sudo netstat -tap | grep mysql
Log in to MySQL as root:
mysql -u root -p
mysql> use mysql;
mysql> select host,user,password from user;
mysql> insert into user(host,user,password) values('localhost','hive',PASSWORD('123456'));
mysql> grant all on *.* to 'hive'@'%' identified by '123456';
mysql> flush privileges;
Enable remote access. By default, MySQL listens on port 3306 only on 127.0.0.1 and refuses connections from any other IP (netstat confirms this). To stop it from listening only locally, edit /etc/mysql/mysql.conf.d/mysqld.cnf:
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
bind-address = 127.0.0.1   # find this line and comment it out
Restart the MySQL service:
service mysql restart
Connect to MySQL remotely to verify that the login works:
mysql -u hive -h fang-ubuntu -p   # fang-ubuntu is my hostname
Download and unpack the Hive release (I installed hive-2.1.1), then configure the environment variables:
export HIVE_HOME=/home/hadoop/software/hive-2.1.1
export PATH=$PATH:$HIVE_HOME/bin
Create hive-site.xml in the hive-2.1.1/conf directory with the following content:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://fang-ubuntu:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
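Before moving on, it can help to confirm that these connection settings actually work. Below is a minimal Java check (a hypothetical helper, not part of the Hive installation) that uses the same driver class, URL, user, and password as the hive-site.xml above:

import java.sql.Connection;
import java.sql.DriverManager;

public class MetastoreDbCheck {
    public static void main(String[] args) throws Exception {
        // same driver class and JDBC URL as configured in hive-site.xml
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://fang-ubuntu:3306/hive?createDatabaseIfNotExist=true",
                "hive", "123456")) {
            System.out.println("metastore database reachable: " + conn.isValid(5));
        }
    }
}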
Copy the MySQL JDBC driver into HIVE_HOME/lib:
cp mysql-connector-java-6.0.5.jar ../hive-2.1.1/lib/
Initialize the Hive metastore schema (HDFS must be running first):
./bin/schematool -initSchema -dbType mysql
Start the Hive metastore service:
hive --service metastore &
Add a hive-site.xml under SPARK_HOME/conf (the Spark cluster should already be set up):
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://fang-ubuntu:9083</value>
    <description>Thrift URI for the remote metastore</description>
  </property>
</configuration>
Start the Spark cluster:
./sbin/start-all.sh
From the Spark directory, start the Thrift server (spark://fang-ubuntu:7077 is my Spark master URL):
./sbin/start-thriftserver.sh --master spark://fang-ubuntu:7077 --driver-class-path /home/hadoop/software/hive-2.1.1/lib/mysql-connector-java-6.0.5.jar
From the Spark directory, start beeline and connect to Hive. An access password can be configured; by default there is none.
./bin/beeline
beeline> !connect jdbc:hive2://fang-ubuntu:10000
package com.fang.spark.demo;

import java.sql.*;

/**
 * Created by fang on 17-1-8.
 * Access Spark SQL over JDBC.
 * This program lets other applications query data through Spark SQL, which in turn
 * can read from various sources (Hive, or other relational databases over JDBC), so
 * Spark SQL acts as a bridge. Here a Java client talks to the Spark SQL Thrift
 * server, and Spark SQL reads the data from Hive.
 */
public class ConnectSparkSQL {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:hive2://fang-ubuntu:10000/";
        try {
            // the Thrift server speaks the HiveServer2 protocol, so the Hive JDBC driver is used
            Class.forName("org.apache.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        Connection conn = DriverManager.getConnection(url, "hive", "123456");
        Statement stmt = conn.createStatement();
        String sql = "select * from info";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println("id: " + res.getInt(1) + "\tname: " + res.getString(2) + "\tage: " + res.getString(3));
        }
        res.close();
        stmt.close();
        conn.close();
    }
}
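To compile and run this client, the Hive JDBC driver needs to be on the classpath; with Maven, the org.apache.hive:hive-jdbc artifact (version 1.2.1, matching the Hive classes Spark compiles against) plus hadoop-common is usually sufficient. Since the Thrift server does no password checking by default, the user name and password passed to getConnection are accepted but not actually verified here.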
Error when starting the Hive service (hive --service metastore &): the metastore schema had not been initialized. Run ./bin/schematool -initSchema -dbType mysql first.
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:591)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:226)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:366)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:310)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:290)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:266)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:558)
... 9 more
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
Error: the client failed to connect to Spark SQL with Caused by: java.net.ConnectException: Connection refused. MySQL was not accepting remote connections; in /etc/mysql/mysql.conf.d/mysqld.cnf, comment out the bind-address line (#bind-address = 127.0.0.1).
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:591)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:226)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:366)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:310)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:290)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:266)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:558)
... 9 more
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1654)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3367)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3406)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3386)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3640)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:236)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:221)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1652)
... 23 more
The end