Note: in general the Hive version must correspond to the Spark version; the official documentation lists the supported pairings. The Hive, Spark, and Hadoop versions used here do not follow the official recommendation.
Download the Spark source code; spark-2.4.4 is used as the example.
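For example, the source release can be fetched from the Apache archive (the URL follows the standard archive layout; adjust for your mirror):
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4.tgz
tar -xzf spark-2.4.4.tgz && cd spark-2.4.4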
Build the Spark source. The available Hadoop profiles are declared in Spark's root pom.xml; pick the one that matches your cluster:
<profile>
  <id>hadoop-2.6</id>
</profile>
<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.3</hadoop.version>
    <curator.version>2.7.1</curator.version>
  </properties>
</profile>
<profile>
  <id>hadoop-2.8</id>
  <properties>
    <hadoop.version>2.8.5</hadoop.version>
    <curator.version>2.7.1</curator.version>
  </properties>
</profile>
<profile>
  <id>hadoop-3.1</id>
  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <curator.version>2.12.0</curator.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
</profile>
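With the matching Hadoop profile chosen, one plausible build command follows the Hive-on-Spark convention of packaging Spark without the Hive jars (the --name label is arbitrary):
./dev/make-distribution.sh --name "hadoop2.7-without-hive" --tgz \
  "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
This produces spark-2.4.4-bin-hadoop2.7-without-hive.tgz in the source root.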
Install the compiled package: copy the tarball produced above to the machine where Hive is installed and extract it. Then configure the spark-env.sh script by adding the following settings (this step is optional):
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/path/to/spark
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_CORES=2
export SPARK_EXECUTOR_MEMORY=2G
export SPARK_DRIVER_MEMORY=2G
Create a spark-jars directory on HDFS and upload every jar under the jars directory of the Spark installation above into it.
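For example (the target path must match the spark.yarn.jars value configured below):
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put ${SPARK_HOME}/jars/*.jar /spark-jars/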
Configure Hive. First, in hive-env.sh, point at Hadoop and put the Hive and Spark jars on the classpath:
HADOOP_HOME=/path/to/hadoop
export HIVE_CLASSPATH=/path/to/hive/lib/
export HIVE_CLASSPATH=$HIVE_CLASSPATH:/path/to/spark/jars/
Then add the following properties to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://${host}:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC URL of the Hive metastore database; note that ${host} is the host running the MySQL service</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
  <description>JDBC driver class for the metastore database; MySQL 8.0+ is used here, so the 8.0+ driver class applies; note it differs from the class used before MySQL 6.0 (com.mysql.jdbc.Driver)</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>user name for the metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password for the metastore database</description>
</property>
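The metastore also needs the MySQL JDBC driver on Hive's classpath and a database user matching the credentials above; a minimal sketch (the exact Connector/J jar name and root access are assumptions):
# copy the Connector/J jar into Hive's lib directory (jar name depends on the version you downloaded)
cp mysql-connector-java-8.0.*.jar /path/to/hive/lib/
# create the metastore user matching the ConnectionUserName/ConnectionPassword properties above
mysql -u root -p -e "CREATE USER 'hive'@'%' IDENTIFIED BY 'hive'; GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%'; FLUSH PRIVILEGES;"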
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://${host}:8020/hive</value>
  <description>Hive's directory on HDFS, used to store table data</description>
</property>
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>${host}</value>
  <description>host name that HiveServer2 binds to on startup</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://${host}:9083</value>
  <description>Hive's Thrift URI, analogous to a JDBC URL; note that ${host} is the host running the Hive Thrift (metastore) server</description>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>switch Hive's execution engine to Spark</description>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
  <description>Spark serializer class</description>
</property>
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://${host}:8020/spark-jars/*</value>
  <description>location of the Spark lib jars on HDFS; note that ${host} is the HDFS NameNode host, or the HA nameservice</description>
</property>
<property>
  <name>spark.master</name>
  <value>yarn</value>
  <description>run Spark on YARN</description>
</property>
<property>
  <name>spark.executor.extraClassPath</name>
  <value>/path/to/hive/lib</value>
  <description>Hive jars needed by the Spark executors</description>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>16g</value>
</property>
<property>
  <name>spark.yarn.executor.memoryOverhead</name>
  <value>3072m</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>16g</value>
</property>
<property>
  <name>spark.yarn.driver.memoryOverhead</name>
  <value>400m</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>6</value>
</property>
<property>
  <name>spark.shuffle.service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.dynamicAllocation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.dynamicAllocation.minExecutors</name>
  <value>0</value>
</property>
<property>
  <name>spark.dynamicAllocation.maxExecutors</name>
  <value>14</value>
</property>
<property>
  <name>spark.dynamicAllocation.initialExecutors</name>
  <value>4</value>
</property>
<property>
  <name>spark.dynamicAllocation.executorIdleTimeout</name>
  <value>60000</value>
</property>
<property>
  <name>spark.dynamicAllocation.schedulerBacklogTimeout</name>
  <value>1000</value>
</property>
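Note that spark.shuffle.service.enabled=true requires the Spark external shuffle service to run inside every NodeManager; a sketch of the usual setup (the paths are assumptions):
# copy the Spark YARN shuffle jar onto each NodeManager's classpath (the jar name carries the Spark version)
cp ${SPARK_HOME}/yarn/spark-2.4.4-yarn-shuffle.jar /path/to/hadoop/share/hadoop/yarn/lib/
# in yarn-site.xml, add spark_shuffle to yarn.nodemanager.aux-services and set
# yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService,
# then restart the NodeManagers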
Initialize the Hive metastore schema: ${HIVE_HOME}/bin/schematool -dbType mysql -initSchema
Start the metastore service (Thrift server): hive --service metastore &
Start the HiveServer2 service: hive --service hiveserver2 &
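To confirm HiveServer2 is up, a quick check with beeline works (the host and user are examples matching the configuration above):
beeline -u "jdbc:hive2://${host}:10000" -n hive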
Enter the Hive shell and run a query. You may hit the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable
at org.apache.hadoop.hive.ql.parse.spark.GenSparkProcContext.<init>(GenSparkProcContext.java:163)
at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.generateTaskTree(SparkCompiler.java:328)
at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:279)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11273)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: scala.collection.Iterable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 24 more
The fix is to edit the ${HIVE_HOME}/bin/hive launcher script: define SPARK_LIB, find the existing loop that appends the ${HIVE_LIB} jars to the classpath, and add a matching loop for ${SPARK_LIB}:
export SPARK_LIB=/path/to/spark/jars
for f in ${HIVE_LIB}/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
# add the following loop:
for f in ${SPARK_LIB}/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
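After editing the script, a quick smoke test forces a Spark job (the table name is a hypothetical placeholder):
hive -e "SELECT COUNT(*) FROM your_table;"
If the classpath fix worked, the query compiles into Spark stages and runs on YARN instead of failing with the NoClassDefFoundError above.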