Hive on Spark Setup Guide (hive-2.3.6, spark-2.4.4, hadoop-2.8.5)

Official Hive on Spark tutorial

Note: in general the Hive version should be paired with a matching Spark version; the official documentation lists the recommended pairings. The Hive, Spark, and Hadoop versions used here are not the officially recommended combination.

  1. Download the Spark source code, using spark-2.4.4 as the example (a download sketch is shown below).
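     A minimal sketch of this step, assuming the source release is fetched from the Apache archive (the URL follows the standard archive layout and should be double-checked):

     # fetch and unpack the Spark 2.4.4 source release
     wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4.tgz
     tar -zxvf spark-2.4.4.tgz
     cd spark-2.4.4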

  2. Build the Spark source code.

    1. Choose the Hadoop version at build time. The bundled profiles cover hadoop-2.6 and hadoop-2.7 (2.7.3); since we want 2.8.5, add a hadoop-2.8 profile to the pom.xml as follows:
    <profile>
      <id>hadoop-2.6</id>
    </profile>

    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <hadoop.version>2.7.3</hadoop.version>
        <curator.version>2.7.1</curator.version>
      </properties>
    </profile>

    <!-- new profile for hadoop 2.8.5 -->
    <profile>
      <id>hadoop-2.8</id>
      <properties>
        <hadoop.version>2.8.5</hadoop.version>
        <curator.version>2.7.1</curator.version>
      </properties>
    </profile>

    <profile>
      <id>hadoop-3.1</id>
      <properties>
        <hadoop.version>3.1.0</hadoop.version>
        <curator.version>2.12.0</curator.version>
        <zookeeper.version>3.4.9</zookeeper.version>
      </properties>
    </profile>
    2. Following the official instructions, run the following command to build:
       ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.8,parquet-provided,orc-provided"
       After a successful build, a package named spark-2.4.4-bin-hadoop2-without-hive.tgz is produced in the source root directory.
  3. Install the built package. Copy the package above to the machine where Hive is installed and extract it. Configure the spark-env.sh script by adding the following settings (this step is optional):

    export HADOOP_HOME=${hadoop installation directory}
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export SPARK_HOME=${spark installation directory}
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export SPARK_EXECUTOR_CORES=2
    export SPARK_EXECUTOR_MEMORY=2G
    export SPARK_DRIVER_MEMORY=2G
    
  4. Create a spark-jars directory in HDFS and upload all jars from the jars directory of the Spark installation above into it, for example as sketched below.
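     A minimal sketch, assuming HDFS is running and $SPARK_HOME points at the Spark installation (adjust paths to your environment):

     # create the directory in HDFS and upload every jar from the local Spark installation
     hdfs dfs -mkdir -p /spark-jars
     hdfs dfs -put $SPARK_HOME/jars/*.jar /spark-jars/
     # sanity check: list a few of the uploaded jars
     hdfs dfs -ls /spark-jars | head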

  5. Configure Hive

    1. Configure the hive-env.sh script as follows:
      HADOOP_HOME=${hadoop installation directory}
      export HIVE_CLASSPATH=${hive installation directory}/lib/
      export HIVE_CLASSPATH=$HIVE_CLASSPATH:${spark installation directory}/jars/
      
    2. Configure hive-site.xml as follows:
      <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://${host}:3306/hive?createDatabaseIfNotExist=true</value>
          <description>JDBC URL of the Hive metastore database; ${host} is the hostname of the MySQL server</description>
      </property>

      <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.cj.jdbc.Driver</value>
          <description>JDBC driver class for the metastore database; MySQL 8.0+ is used here, so the 8.0+ driver class is required (note the difference from the class used by pre-6.0 MySQL connectors)</description>
      </property>

      <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>hive</value>
          <description>username for the metastore database</description>
      </property>

      <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>hive</value>
          <description>password for the metastore database</description>
      </property>

      <property>
          <name>hive.metastore.warehouse.dir</name>
          <value>hdfs://${host}:8020/hive</value>
          <description>Hive warehouse directory in HDFS, used to store table data</description>
      </property>

      <property>
          <name>hive.server2.thrift.port</name>
          <value>10000</value>
      </property>

      <property>
          <name>hive.server2.thrift.bind.host</name>
          <value>${host}</value>
          <description>hostname that HiveServer2 binds to</description>
      </property>

      <property>
          <name>hive.metastore.uris</name>
          <value>thrift://${host}:9083</value>
          <description>Thrift URI of the Hive metastore (analogous to a JDBC URL); ${host} is the hostname where the metastore thrift server runs</description>
      </property>

      <property>
          <name>hive.execution.engine</name>
          <value>spark</value>
          <description>set the Hive execution engine to spark</description>
      </property>

      <property>
         <name>spark.serializer</name>
         <value>org.apache.spark.serializer.KryoSerializer</value>
         <description>serializer class used by Spark</description>
      </property>

      <property>
         <name>spark.yarn.jars</name>
         <value>hdfs://${host}:8020/spark-jars/*</value>
         <description>location of the Spark jars in HDFS; ${host} is the HDFS NameNode hostname or the HA nameservice</description>
      </property>

      <property>
         <name>spark.master</name>
         <value>yarn</value>
         <description>run Spark on YARN</description>
      </property>

      <property>
         <name>spark.executor.extraClassPath</name>
         <value>${hive installation directory}/lib</value>
         <description>Hive jars needed by the Spark executors</description>
      </property>

      <property>
          <name>spark.eventLog.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>spark.executor.memory</name>
          <value>16g</value>
      </property>

      <property>
          <name>spark.yarn.executor.memoryOverhead</name>
          <value>3072m</value>
      </property>

      <property>
          <name>spark.driver.memory</name>
          <value>16g</value>
      </property>

      <property>
          <name>spark.yarn.driver.memoryOverhead</name>
          <value>400m</value>
      </property>

      <property>
          <name>spark.executor.cores</name>
          <value>6</value>
      </property>

      <property>
          <name>spark.shuffle.service.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.enabled</name>
          <value>true</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.minExecutors</name>
          <value>0</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.maxExecutors</name>
          <value>14</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.initialExecutors</name>
          <value>4</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.executorIdleTimeout</name>
          <value>60000</value>
      </property>

      <property>
          <name>spark.dynamicAllocation.schedulerBacklogTimeout</name>
          <value>1000</value>
      </property>
      
  6. Initialize the Hive metastore schema: ${HIVE_HOME}/bin/schematool -dbType mysql -initSchema
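     To confirm the initialization worked, the metastore tables can be inspected in MySQL (the database name and user are the ones configured in hive-site.xml above):

     # the metastore database should now contain tables such as DBS, TBLS and SDS
     mysql -u hive -p -e "USE hive; SHOW TABLES;"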

  7. Start the metastore service (thrift server): hive --service metastore &
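     Optionally confirm that the metastore thrift server is listening on port 9083 (the port set in hive.metastore.uris above):

     # the metastore should be listening on 9083
     netstat -nltp | grep 9083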

  8. Start the HiveServer2 service: hive --service hiveserver2 &
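     A quick connectivity check with beeline once HiveServer2 is up (hostname and credentials below are placeholders matching the hive-site.xml above):

     # connect to HiveServer2 over JDBC; replace ${host} with the value of hive.server2.thrift.bind.host
     beeline -u "jdbc:hive2://${host}:10000" -n hive -p hive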

  9. Enter the Hive shell and run a query. The following error may occur:

    Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable
            at org.apache.hadoop.hive.ql.parse.spark.GenSparkProcContext.<init>(GenSparkProcContext.java:163)
            at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.generateTaskTree(SparkCompiler.java:328)
            at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:279)
            at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11273)
            at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
            at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
            at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
            at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
            at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
            at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
            at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
            at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
            at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
            at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
            at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
            at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
            at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
            at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
    Caused by: java.lang.ClassNotFoundException: scala.collection.Iterable
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 24 more   
    
  10. The fix is as follows:

    1. Add the following to hive-env.sh:
      export SPARK_LIB=${spark installation directory}/jars
      
    2. Edit bin/hive as follows:
      for f in ${HIVE_LIB}/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
      done
      
      # add the following loop so that the Spark jars are also put on Hive's classpath:
      for f in ${SPARK_LIB}/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
      done
      
    3. Re-enter Hive; the problem is resolved. A quick check that queries now run on Spark is sketched below.
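       A minimal way to confirm the engine switch from the shell (the table name is a placeholder; any existing table will do):

       # should print hive.execution.engine=spark; the count query should launch a Spark application on YARN
       hive -e "set hive.execution.engine; select count(*) from default.some_table;"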
