Hadoop 2.7.6 + Spark 2.4.4 + Scala 2.11.12 + Hudi 0.5.2: Single-Node Pseudo-Distributed Installation


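Notes

1. This document builds on the base Hadoop environment described in another article of mine and adds the installation and deployment of Spark and Hudi. The base environment deployment document is:
Hadoop2.7.6+Mysql5.7+Hive2.3.2+Hbase1.4.9+Kylin2.4 single-node pseudo-distributed installation document
2. The configuration throughout is fairly simple. I ran into some pitfalls along the way, but I left them out so that beginners like me who follow this document are not led down the wrong path. The article is written in detail, and following the whole process in order should not produce errors. If you need to build the base big data environment first, see my Hadoop environment deployment document mentioned above, which is also detailed.
3. Spark and Hudi themselves are not introduced here; there is plenty of material online and in the official documentation. For every installation package and official document used in this article, a path for direct download or navigation is given, so that you can freely download the same versions that I installed.
4. After this installation, the versions of the Hadoop-family components in my lab environment are as follows: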

Software          Version
Hadoop            2.7.6
MySQL             5.7
Hive              2.3.2
HBase             1.4.9
Spark             2.4.4
Hudi              0.5.2
JDK               1.8.0_151
Scala             2.11.12
OGG for Big Data  12.3
Kylin             2.4
Kafka             2.11-1.1.1
ZooKeeper         3.4.6
Oracle Linux      6.8 x64

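1. Install Scala, which Spark depends on

Spark 2.x releases are pre-built with Scala 2.11, except for version 2.4.2, which is pre-built with Scala 2.12. The Hudi documentation uses Spark 2.4.4, and Spark 2.4.4 itself reports "Using Scala version 2.11.12 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_151)", so we download Scala 2.11.12 here.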

1.1 Download and extract Scala

Download address:
click to enter
Download the Linux version (scala-2.11.12.tgz).
Create a folder named scala under /usr on the Linux server and upload the downloaded archive to it:

[root@hadoop opt]# cd /usr/
[root@hadoop usr]# mkdir scala
[root@hadoop usr]# cd scala/
[root@hadoop scala]# pwd
/usr/scala
[root@hadoop scala]# ls
scala-2.11.12.tgz
[root@hadoop scala]# tar -zxvf scala-2.11.12.tgz 
[root@hadoop scala]# ls
scala-2.11.12  scala-2.11.12.tgz
[root@hadoop scala]# rm -rf *tgz
[root@hadoop scala]# cd scala-2.11.12/
[root@hadoop scala-2.11.12]# pwd
/usr/scala/scala-2.11.12

1.2 Configure environment variables

Edit /etc/profile and add the following setting:

export SCALA_HOME=/usr/scala/scala-2.11.12
Then append the following to the PATH variable in the same file:
${SCALA_HOME}/bin
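
If /etc/profile is sourced more than once, a plain PATH=$PATH:... line adds the same directory again each time. The sketch below is one way to avoid that. The function name path_append is an assumption for illustration and is not part of the original setup.

```shell
# Hypothetical helper (not part of the original article): append a directory
# to PATH only when it is not already listed, so that sourcing /etc/profile
# repeatedly does not keep growing PATH.
path_append() {
  case ":${PATH}:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="${PATH}:$1" ;;      # not present, append at the end
  esac
}

SCALA_HOME=/usr/scala/scala-2.11.12
path_append "${SCALA_HOME}/bin"
path_append "${SCALA_HOME}/bin"   # second call is a no-op
export SCALA_HOME PATH
```

The same helper could be reused for the other bin directories added to PATH later in this article.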

After these additions, my /etc/profile looks like this:

export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib

Save and exit, then source the file so that the environment variables take effect:

[root@hadoop ~]# source /etc/profile

1.3 Verify Scala

[root@hadoop scala-2.11.12]# scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
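
Spark artifact names carry the Scala binary version (for example, the jar used later in this article is spark-examples_2.11-2.4.4.jar), so the installed Scala 2.11.12 should map to 2.11. A minimal sketch of that mapping follows. The function name scala_binary_version is an assumption for illustration.

```shell
# Hypothetical helper: reduce a full Scala version (e.g. 2.11.12) to its
# binary version (2.11) by dropping the last dot-separated component.
scala_binary_version() {
  printf '%s\n' "${1%.*}"
}

scala_binary_version 2.11.12   # prints 2.11
```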

2. Download and extract Spark

2.1 Download Spark

Download address:
click to enter

2.2 Extract Spark

Create a spark directory under /hadoop to hold Spark.

[root@hadoop scala-2.11.12]# cd /hadoop/
[root@hadoop hadoop]# mkdir spark
[root@hadoop hadoop]# cd spark/
# Upload the installation package to the spark directory with Xftp
[root@hadoop spark]# tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# ls
spark-2.4.4-bin-hadoop2.7  spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# rm -rf *tgz
[root@hadoop spark]# mv spark-2.4.4-bin-hadoop2.7/* .
[root@hadoop spark]# ls
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  yarn
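
The listing above still contains the now-empty spark-2.4.4-bin-hadoop2.7 directory, because mv only moved its contents. As an alternative, tar can drop the top-level directory while extracting. The sketch below assumes a tar that supports --strip-components (GNU tar and bsdtar do). The function name extract_flat is an assumption for illustration.

```shell
# Hypothetical helper: unpack an archive whose contents sit inside a single
# top-level directory, dropping that directory so the files land directly
# in the target directory.
extract_flat() {
  # $1 = path to the .tgz archive, $2 = target directory
  mkdir -p "$2" && tar -zxf "$1" -C "$2" --strip-components=1
}

# Example (assumes the archive has already been uploaded to /hadoop/spark):
# extract_flat /hadoop/spark/spark-2.4.4-bin-hadoop2.7.tgz /hadoop/spark
```

With this approach no separate mv step is needed, and no empty directory is left behind.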

3. Spark configuration

3.1 Configure environment variables

Edit /etc/profile and add:

export  SPARK_HOME=/hadoop/spark

After adding the variable above, edit the PATH variable in the same file and append:

${SPARK_HOME}/bin

After the changes, my /etc/profile contains:

export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export SPARK_HOME=/hadoop/spark
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib

After editing, run source /etc/profile so that the environment variables take effect.

3.2 Configure the parameter files

Go to the conf directory:

[root@hadoop conf]# pwd
/hadoop/spark/conf

Copy the template configuration file under a new name:

[root@hadoop conf]# cp spark-env.sh.template   spark-env.sh
[root@hadoop conf]# ls 
docker.properties.template  fairscheduler.xml.template  log4j.properties.template  metrics.properties.template  slaves.template  spark-defaults.conf.template  spark-env.sh  spark-env.sh.template

Edit spark-env.sh and add the following settings (adjust the paths to your own environment):

export SCALA_HOME=/usr/scala/scala-2.11.12
export JAVA_HOME=/usr/java/jdk1.8.0_151
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/hadoop/spark
export SPARK_MASTER_IP=192.168.1.66
export SPARK_EXECUTOR_MEMORY=1G

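spark-env.sh is read by Spark's launch scripts each time they run, so it needs no separate loading step. If the /etc/profile changes have not yet been applied in the current shell, run source /etc/profile.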
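
With the settings above, the master log later in this article prints the warning "SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST". One way to switch to the newer name while keeping the configured value is sketched below. The function name use_master_host is an assumption for illustration, and the sketch only handles lines of the form export SPARK_MASTER_IP=... as written above.

```shell
# Hypothetical helper: rewrite the deprecated SPARK_MASTER_IP setting to
# SPARK_MASTER_HOST in a spark-env.sh file, keeping the configured value.
use_master_host() {
  # $1 = path to spark-env.sh
  tmp="$1.tmp.$$"
  sed 's/^export SPARK_MASTER_IP=/export SPARK_MASTER_HOST=/' "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&          # write back in place, keeping file permissions
    rm -f "$tmp"
}

# Example: use_master_host /hadoop/spark/conf/spark-env.sh
```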

3.3 Create the slaves file

Create a slaves file from the template that Spark provides:

[root@hadoop conf]# pwd
/hadoop/spark/conf
[root@hadoop conf]# cp slaves.template slaves

4. Start Spark

In this setup Spark relies on the distributed file system provided by Hadoop, so make sure Hadoop is running normally before starting Spark.

[root@hadoop hadoop]# jps
23408 RunJar
23249 JobHistoryServer
23297 RunJar
24049 Jps
22404 DataNode
22774 ResourceManager
23670 Kafka
22264 NameNode
22889 NodeManager
23642 QuorumPeerMain
22589 SecondaryNameNode
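
Reading the jps listing by eye works for one node, but the check can also be scripted. The sketch below reads jps output on standard input and prints any of the five HDFS and YARN daemons that are missing. The function name missing_daemons and the choice of which daemons to require are assumptions for illustration.

```shell
# Hypothetical helper: read `jps` output on stdin and print each expected
# daemon name that does not appear in it. Prints nothing when all are present.
missing_daemons() {
  out=$(cat)
  for name in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>", so compare the name with the whole second field
    printf '%s\n' "$out" |
      awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' ||
      printf '%s\n' "$name"
  done
}

# Example: jps | missing_daemons
```

An empty result means all five daemons were found, as in the listing above.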

With Hadoop running normally, run the following on the master host (hadoop in this setup, which is both the Hadoop NameNode and the Spark master node):

[root@hadoop hadoop]# cd /hadoop/spark/sbin
[root@hadoop sbin]# ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
[root@hadoop sbin]# cat /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host hadoop --port 7077 --webui-port 8080
========================================
20/03/30 22:42:27 INFO master.Master: Started daemon with process name: 24079@hadoop
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:27 WARN master.MasterArguments: SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST
20/03/30 22:42:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:42:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:27 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.
20/03/30 22:42:27 INFO master.Master: Starting Spark master at spark://hadoop:7077
20/03/30 22:42:27 INFO master.Master: Running Spark version 2.4.4
20/03/30 22:42:28 INFO util.log: Logging initialized @1497ms
20/03/30 22:42:28 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:28 INFO server.Server: Started @1560ms
20/03/30 22:42:28 INFO server.AbstractConnector: Started ServerConnector@6182300a{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
20/03/30 22:42:28 INFO util.Utils: Successfully started service 'MasterUI' on port 8080.
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1f0276{/app,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1af444{/app/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@259b10d3{/,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6fc2f56f{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@37a28407{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e99fa57{/app/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66be5bb8{/driver/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO ui.MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://hadoop:8080
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b2c0980{/metrics/master/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4ac1749f{/metrics/applications/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO master.Master: I have been elected leader! New state: ALIVE
20/03/30 22:42:31 INFO master.Master: Registering worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
[root@hadoop sbin]# cat  /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://hadoop:7077
========================================
20/03/30 22:42:29 INFO worker.Worker: Started daemon with process name: 24173@hadoop
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:42:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:30 INFO util.Utils: Successfully started service 'sparkWorker' on port 39384.
20/03/30 22:42:30 INFO worker.Worker: Starting Spark worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
20/03/30 22:42:30 INFO worker.Worker: Running Spark version 2.4.4
20/03/30 22:42:30 INFO worker.Worker: Spark home: /hadoop/spark
20/03/30 22:42:31 INFO util.log: Logging initialized @1682ms
20/03/30 22:42:31 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:31 INFO server.Server: Started @1758ms
20/03/30 22:42:31 INFO server.AbstractConnector: Started ServerConnector@3d598dff{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
20/03/30 22:42:31 INFO util.Utils: Successfully started service 'WorkerUI' on port 8081.
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5099c1b0{/logPage,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64348087{/logPage/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46dcda1b{/,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1617f7cc{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@56e77d31{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@643123b6{/log,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO ui.WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://hadoop:8081
20/03/30 22:42:31 INFO worker.Worker: Connecting to master hadoop:7077...
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1cf30aaa{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:7077 after 36 ms (0 ms spent in bootstraps)
20/03/30 22:42:31 INFO worker.Worker: Successfully registered with master spark://hadoop:7077
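
The worker log above ends with "Successfully registered with master", which is the line to look for when confirming that the worker joined the master. A minimal sketch of checking for it without reading the whole log follows. The function name worker_registered is an assumption for illustration.

```shell
# Hypothetical helper: succeed if a worker log contains the line that the
# worker writes after registering with the master.
worker_registered() {
  # $1 = path to the worker log file
  grep -q 'Successfully registered with master' "$1"
}

# Example, using the log path printed by start-all.sh above:
# worker_registered /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out && echo "worker is up"
```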

Startup succeeded. Open the web UI: http://192.168.1.66:8080/

5. Run the Pi example program shipped with Spark

Here we simply run the Pi demo in local mode. Follow the steps below.

[root@hadoop sbin]# cd /hadoop/spark/
[root@hadoop spark]# ./bin/spark-submit  --class  org.apache.spark.examples.SparkPi  --master local   examples/jars/spark-examples_2.11-2.4.4.jar 
20/03/30 22:45:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:45:59 INFO spark.SparkContext: Running Spark version 2.4.4
20/03/30 22:45:59 INFO spark.SparkContext: Submitted application: Spark Pi
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:45:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:45:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 39352.
20/03/30 22:45:59 INFO spark.SparkEnv: Registering MapOutputTracker
20/03/30 22:45:59 INFO spark.SparkEnv: Registering BlockManagerMaster
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/30 22:45:59 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-63bf7c92-8908-4784-8e16-4c6ef0c93dc0
20/03/30 22:45:59 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/30 22:45:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
20/03/30 22:46:00 INFO util.log: Logging initialized @2066ms
20/03/30 22:46:00 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:46:00 INFO server.Server: Started @2179ms
20/03/30 22:46:00 INFO server.AbstractConnector: Started ServerConnector@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@36dce7ed{/jobs,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a1ebcff{/jobs/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19868320{/jobs/job,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c20be82{/jobs/job/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@13c612bd{/stages,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ef41c66{/stages/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b739528{/stages/stage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f577419{/stages/stage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28fa700e{/stages/pool,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3d526ad9{/stages/pool/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e041f0c{/storage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a175569{/storage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@11963225{/storage/rdd,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f3c966c{/storage/rdd/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@11ee02f8{/environment,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4102b1b1{/environment/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@61a5b4ae{/executors,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3a71c100{/executors/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b69fd74{/executors/threadDump,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f325091{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@437e951d{/static,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@467f77a5{/,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1bb9aa43{/api,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66b72664{/jobs/job/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a34b7b8{/stages/stage/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop:4040
20/03/30 22:46:00 INFO spark.SparkContext: Added JAR file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar at spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 1585579560287
20/03/30 22:46:00 INFO executor.Executor: Starting executor ID driver on host localhost
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38875.
20/03/30 22:46:00 INFO netty.NettyBlockTransferService: Server created on hadoop:38875
20/03/30 22:46:00 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMasterEndpoint: Registering block manager hadoop:38875 with 366.3 MB RAM, BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f8e0cee{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:46:01 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Missing parents: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop:38875 (size: 1256.0 B, free: 366.3 MB)
20/03/30 22:46:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/30 22:46:01 INFO executor.Executor: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 1585579560287
20/03/30 22:46:01 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:39352 after 45 ms (0 ms spent in bootstraps)
20/03/30 22:46:01 INFO util.Utils: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar to /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles-86767584-1e78-45f2-a9ed-8ac4360ab170/fetchFileTemp2974211155688432975.tmp
20/03/30 22:46:01 INFO executor.Executor: Adding file:/tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles-86767584-1e78-45f2-a9ed-8ac4360ab170/spark-examples_2.11-2.4.4.jar to class loader
20/03/30 22:46:01 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 308 ms on localhost (executor driver) (1/2)
20/03/30 22:46:01 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 31 ms on localhost (executor driver) (2/2)
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
20/03/30 22:46:01 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.606 s
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.703911 s
Pi is roughly 3.1386756933784667
20/03/30 22:46:01 INFO server.AbstractConnector: Stopped Spark@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:01 INFO ui.SparkUI: Stopped Spark web UI at http://hadoop:4040
20/03/30 22:46:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/30 22:46:01 INFO memory.MemoryStore: MemoryStore cleared
20/03/30 22:46:01 INFO storage.BlockManager: BlockManager stopped
20/03/30 22:46:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
20/03/30 22:46:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/30 22:46:01 INFO spark.SparkContext: Successfully stopped SparkContext
20/03/30 22:46:01 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e019897d-3160-4bb1-ab59-f391e32ec47a
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839

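The output contains the line: Pi is roughly 3.1386756933784667 (the estimate varies slightly between runs).
The value of Pi has been printed.
The run above used single-machine local mode only. To run the demo in cluster mode, read on.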
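
The result line is easy to miss among the INFO messages. The sketch below filters the output down to that one line. The function name pi_line is an assumption for illustration, and the usage example assumes Spark's default log4j configuration, which sends log messages to stderr while SparkPi prints its result to stdout.

```shell
# Hypothetical helper: keep only the result line of the SparkPi example
# from whatever is piped in.
pi_line() {
  grep '^Pi is roughly '
}

# Example (run from /hadoop/spark; log messages are discarded via stderr):
# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local \
#   examples/jars/spark-examples_2.11-2.4.4.jar 2>/dev/null | pi_line
```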

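6. Run the computation in yarn-cluster mode

Go to the Spark installation directory and run the Pi demo in yarn-cluster mode. The yarn-cluster master string is deprecated since Spark 2.0, as the warning in the output shows; the current equivalent is --master yarn --deploy-mode cluster: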

[root@hadoop spark]# ./bin/spark-submit  --class  org.apache.spark.examples.SparkPi  --master  yarn-cluster   examples/jars/spark-examples_2.11-2.4.4.jar 
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
20/03/30 22:47:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:47:48 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
20/03/30 22:47:48 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
20/03/30 22:47:48 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
20/03/30 22:47:48 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
20/03/30 22:47:48 INFO yarn.Client: Setting up container launch context for our AM
20/03/30 22:47:48 INFO yarn.Client: Setting up the launch environment for our AM container
20/03/30 22:47:48 INFO yarn.Client: Preparing resources for our AM container
20/03/30 22:47:48 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/03/30 22:47:51 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6/__spark_libs__3389017089811757919.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/__spark_libs__3389017089811757919.zip
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/spark-examples_2.11-2.4.4.jar
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6/__spark_conf__559264393694354636.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1585579247054_0001/__spark_conf__.zip
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls groups to: 
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls groups to: 
20/03/30 22:47:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:48:01 INFO yarn.Client: Submitting application application_1585579247054_0001 to ResourceManager
20/03/30 22:48:01 INFO impl.YarnClientImpl: Submitted application application_1585579247054_0001
20/03/30 22:48:02 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:02 INFO yarn.Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1585579681188
	 final status: UNDEFINED
	 tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
	 user: root
20/03/30 22:48:03 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:04 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:05 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:06 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:07 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:08 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:09 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:11 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:12 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:13 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:14 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:15 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:16 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:17 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:19 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:20 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:21 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:22 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:23 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:24 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:25 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:26 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:27 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:28 INFO yarn.Client: Application report for application_1585579247054_0001 (state: ACCEPTED)
20/03/30 22:48:29 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:29 INFO yarn.Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: hadoop
	 ApplicationMaster RPC port: 37844
	 queue: default
	 start time: 1585579681188
	 final status: UNDEFINED
	 tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
	 user: root
20/03/30 22:48:30 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:31 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:32 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:33 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:34 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:35 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:36 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:37 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:38 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:39 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:40 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:41 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:42 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:43 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:44 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:45 INFO yarn.Client: Application report for application_1585579247054_0001 (state: RUNNING)
20/03/30 22:48:46 INFO yarn.Client: Application report for application_1585579247054_0001 (state: FINISHED)
20/03/30 22:48:46 INFO yarn.Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: hadoop
	 ApplicationMaster RPC port: 37844
	 queue: default
	 start time: 1585579681188
	 final status: SUCCEEDED
	 tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
	 user: root
20/03/30 22:48:46 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c243c24-9489-4c8a-a1bc-a6a9780615d6
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d554f7cd-c7d4-4dfa-bc86-11a340925db6

Note that in yarn-cluster mode the result is not printed to the console; it is written to the Hadoop cluster's logs. To see it, use the address that appeared in the output above:
tracking URL: http://hadoop:8088/proxy/application_1585579247054_0001/
Open it in a browser:

[Figure 3]
Then click logs:
[Figure 4]
View the contents of stdout:
[Figure]
The computed value of pi has been printed.
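The same output can usually be fetched from the command line instead of the web UI. The following is a minimal sketch, not part of the original setup. It assumes YARN log aggregation is enabled (yarn.log-aggregation-enable=true) and that the job is the bundled SparkPi example, whose driver prints a line beginning with "Pi is roughly". The helper names app_id_from_url and show_pi_result are made up for illustration.

```shell
# Extract the YARN application ID from a tracking URL such as
# http://hadoop:8088/proxy/application_1585579247054_0001/
app_id_from_url() {
    printf '%s\n' "$1" | grep -o 'application_[0-9]*_[0-9]*' | head -n 1
}

# Hypothetical helper: print the driver's result line for a finished job.
# Assumption: log aggregation is on, otherwise `yarn logs` finds nothing.
show_pi_result() {
    yarn logs -applicationId "$(app_id_from_url "$1")" | grep "Pi is roughly"
}
```

Usage: show_pi_result http://hadoop:8088/proxy/application_1585579247054_0001/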
Here are a few more commonly used commands:

Start Spark:
./sbin/start-all.sh
Start Hadoop and Spark:
./starths.sh
To stop them, replace start with stop in the command.

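7. Configuring Spark to read Hive tables

Operating on tables inside Hive goes through MapReduce, which is relatively inefficient, so this section describes how to read Hive tables into memory with Spark and compute on them there.

First, copy $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf so that Spark can pick up the Hive configuration. The transcript below also copies the MySQL JDBC driver into Spark's jars directory, which Spark needs when the Hive metastore is stored in MySQL.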

[root@hadoop spark]# pwd
/hadoop/spark
[root@hadoop spark]# cp $HIVE_HOME/conf/hive-site.xml conf/
[root@hadoop spark]# chmod 777 conf/hive-site.xml
[root@hadoop spark]# cp /hadoop/hive/lib/mysql-connector-java-5.1.47.jar jars/
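
Before starting spark-shell, it can help to confirm that both files ended up where Spark looks for them. The function below is only a sketch; the name check_spark_hive_setup is hypothetical, and it checks nothing beyond the presence of the two files copied above.

```shell
# Hypothetical pre-flight check. $1 is the Spark install directory.
# Prints "ok" only if the Hive config and a MySQL JDBC driver jar are present.
check_spark_hive_setup() {
    spark_home="$1"
    if [ ! -f "$spark_home/conf/hive-site.xml" ]; then
        echo "missing conf/hive-site.xml"
        return 1
    fi
    # The glob stays unexpanded when no jar matches, so ls exits non-zero.
    if ! ls "$spark_home"/jars/mysql-connector-java-*.jar >/dev/null 2>&1; then
        echo "missing MySQL JDBC driver in jars/"
        return 1
    fi
    echo "ok"
}
```

Usage: check_spark_hive_setup /hadoop/spark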

Enter the interactive shell with spark-shell:

[root@hadoop spark]# /hadoop/spark/bin/spark-shell
20/03/31 10:31:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://hadoop:4042
Spark context available as 'sc' (master = local[*], app id = local-1585621962060).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala>  import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala>  val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@62966c9f

scala> hiveContext.sql("show databases").show()
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.client.capability.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.false.positive.probability does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.broker.address.default does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.orc.time.counters does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve-fraction.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.ppd.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.event.message.factory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.metrics.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.hs2.user.access does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.storage.storageDirectory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.connect.retry.limit does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.xmx.headroom does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.direct does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.stats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.client.consistent.splits does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.start does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.acl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.delegation.token.lifetime does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.guidKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.ats.hook.queue.capacity does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.large.query does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bigtable.minsize.semijoin.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.user does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.alloc.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.wait.queue.comparator.class.name does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.use.soft.references does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.listener.thread-count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.container.max.java.heap.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.column.autogather does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am.liveness.heartbeat.interval.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.decoding.metrics.percentiles.intervals does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.txn.store.impl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.groupby.shuffle does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.object.cache.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.parallel.ops.in.session does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.limit.extrastep does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.ssl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.location does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.delay.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.fileformat does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.file.cleaner.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.compaction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.use.blobstore.as.scratchdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.class does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap.path does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.download.permanent.fns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.historic.queries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.execution.reducesink.new.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.max.num.delta does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.attempted does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.initiator.failed.compacts.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.reporter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.max.pending.writes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.execution.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.enable.grace.join.in.llap does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.threadpool.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.scratchdir.lock does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.spnego does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.hs2.coordinator.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.timeout.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.filter.stats.reduction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.orc.base.delta.ratio does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fastpath does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.heartbeater does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.file.cleanup.delay.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.rpc.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.hybridgrace.bloomfilter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.tree does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.stats.ndv.tuner does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.query.length does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.failed does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.close.session.on.disconnect does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.ppd.windowing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.initial.metadata.count.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.host does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup.min does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.file.metadata.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.service.refresh.interval.sec does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.output.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.driver.parallel.compilation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.remote.token.requires.signing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.cache.allow.synthetic.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.hash.table.inflation.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.hbase.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.vectorized does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.writeset.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vector.serde.deserialize does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.order.columnalignment does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.send.buffer.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.schema.evolution does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.values.clause does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.llap.concurrent.queries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.allow.uber does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.partition.size.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.auth does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.include.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.communicator.num.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orderby.position.alias does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.sleep.between.retries.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.partitions does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.hadoop2.component does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.shuffle.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.in.clause does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.passiveWaitTimeMs does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.load.dynamic.partitions.thread does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.segments.granularity does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.response.header.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.internal.variable.list does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductionpercentage does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.limit does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.serialize.in.tasks does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.query.timeout.seconds does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.hadoop2.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.directory.batch.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.reader.wait does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.max.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.max.open.txns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.reduce.side does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.zookeeper.publish.configs does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.join.hashtable.max.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.init.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.authorization.storage.check.externaltable.drop does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.execution.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.cnf.maxnodes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.adaptor.usage.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.rewriting does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupMembershipKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.catalog.cache.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.show.warnings does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fshandler.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.max.bloom.filter.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.metadata.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.serde does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.wait.queue.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.cache.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.operational.properties does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.memory.ttl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.nonvector.wrapper.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.cache.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vectorized.input.format does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.cte.materialize.threshold does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.clean.until does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.semijoin.conversion does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.metrics.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.rootdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.limit.partition.request does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.async.log.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.logger does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.allow.udf.load.on.demand does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cli.tez.session.async does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bloom.filter.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am-reporter.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.file.size.for.mapjoin does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.bucketing does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning.compat does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.spnego.principal does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.preemption.metrics.intervals does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.shuffle.dir.watcher.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.arena.count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.use.SSL does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transpose.aggr.join does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.maxTries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning.max.data.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.base does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.invalidator.frequency does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.lrfu does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.coordinator.address.default does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.max.fetch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.hidden.list does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.io.sarg.cache.max.weight.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.sleep.time does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.row.serde.deserialize does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.compile.lock.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.variance does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.lrfu.lambda does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.db.type does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.stream.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transactional.events.mem does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.default.fetch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.retain does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.cardinality.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupClassKey does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.allow.permanent.fns does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.ssl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.manager.dump.lock.state.on.acquire.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.succeeded does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.fileid.path does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.row.count does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.optimized.hashtable.probe.percent does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.distribute does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.use.fqdn does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.min.timeout.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.validate.acls does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.support.special.characters.tablename does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.sleep.interval.between.start.attempts does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.container.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.read.timeout does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.optimizations.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.orc.gap.cache does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.dynamic.partition.hashjoin does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.copyfile.maxnumfiles does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.formats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.numConnection does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.enable.preemption does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.executors does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.full does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.connection.class does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.custom.queue.allowed does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.lrr does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.password does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.writer.wait does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.request.header.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductiontuples does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.rollbacktxn does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.num.schedulable.tasks.per.node does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.acl does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.type.safety does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.async.exec.async.compile does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.input.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.enable.memory.manager does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.repair.batch.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.supported.schemes does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.allow.synthetic.fileid does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.filter.in.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.op.stats does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.input.listing.max.threads does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime.jitter does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.port does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.cartesian.product does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.num.handlers does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.vcpus.per.instance does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.count.open.txns.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.min.bloom.filter.entries does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.partition.columns.separate does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.stripe.details.mem.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.heartbeat.threadpool.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.locality.delay does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cmrootdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.disable.backoff.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.sleep.between.retries.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.exec.inplace.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.working.directory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.memory.per.instance.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.path.validation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.nway.joins does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.strict.locking.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.async.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.input.generate.consistent.splits does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.in.place.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.memory.rownum.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.xsrf.filter.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.max does not exist
+------------+
|databaseName|
+------------+
|     default|
|      hadoop|
+------------+
scala>  hiveContext.sql("show tables").show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|                  aa|      false|
| default|                  bb|      false|
| default|                  dd|      false|
| default|       kylin_account|      false|
| default|        kylin_cal_dt|      false|
| default|kylin_category_gr...|      false|
| default|       kylin_country|      false|
| default|kylin_intermediat...|      false|
| default|kylin_intermediat...|      false|
| default|         kylin_sales|      false|
| default|                test|      false|
| default|           test_null|      false|
+--------+--------------------+-----------+

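The query returns results, but why were so many WARN messages printed above?
For example:
WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist

Each warning names a property in hive-site.xml that the Hive client bundled with Spark does not recognize, presumably because the file was written for Hive 2.3.2. The warnings are harmless; to silence one, delete the corresponding property from the hive-site configuration file: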

<property>
    <name>hive.llap.skip.compile.udf.check</name>
    <value>false</value>
    <description>
      Whether to skip the compile-time check for non-built-in UDFs when deciding whether to
      execute tasks in LLAP. Skipping the check allows executing UDFs from pre-localized
      jars in LLAP; if the jars are not pre-localized, the UDFs will simply fail to load.
    </description>
  </property>

After logging in again and rerunning the query, that warning no longer appears.
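
Removing properties one at a time is tedious when the log lists dozens of them. The following is a minimal sketch, assuming the spark-shell output has been saved to a file; the helper name unknown_hive_props and the sample file are hypothetical, and the pattern is taken from the WARN lines shown above. It prints each reported property name once so it can be looked up in hive-site.xml.

```shell
# Print the distinct property names reported as "does not exist" in a saved
# spark-shell log. The log format is taken from the WARN lines shown above.
unknown_hive_props() {
    # $1: path to the saved log file (hypothetical, adjust as needed)
    sed -n 's/.*HiveConf of name \([^ ]*\) does not exist.*/\1/p' "$1" | sort -u
}

# Usage sketch: build a small sample log and list the property names in it.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.size does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.path.validation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.size does not exist
EOF
unknown_hive_props "$sample_log"
rm -f "$sample_log"
```

Each name printed can then be searched for in the hive-site file under the Hive conf directory before deciding whether to delete it.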

8. Configuring Hudi

8.1 Key points from the official documentation

Start with the Quick Start page of the official documentation.
My existing Hadoop environment is version 2.7, and I installed Spark 2.4.4 because the official examples currently use Hadoop 2.7 with Spark 2.4.4. Hudi and Spark now support both Scala 2.11.x and 2.12.x, but the official site also uses 2.11, so to stay consistent with the Hudi documentation and with Spark 2.4.4 (Using Scala version 2.11.12 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_151)) I installed Scala 2.11.12.
Hudi 0.5.2 has already been released, but the official examples still use 0.5.1, so first switch to the Hudi 0.5.1 release notes.
In summary, the release notes say:

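Version upgrades: Spark from 2.1.0 to 2.4.4, Avro from 1.7.7 to 1.8.2, Parquet from 1.8.1 to 1.10.1, and Kafka from 0.8.2.1 to 2.0.0. The Kafka upgrade is an indirect result of moving the spark-streaming-kafka artifact from 0.8_2.11 to 0.10_2.11/2.12.

Important: Hudi 0.5.1 requires Spark 2.4 or later.

Hudi now supports both Scala 2.11 and 2.12; see the Scala 2.12 build instructions to build Hudi with Scala 2.12. The artifacts hudi-spark, hudi-utilities, hudi-spark-bundle and hudi-utilities-bundle have been renamed accordingly to hudi-spark_{scala_version}, hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and hudi-utilities-bundle_{scala_version}, where scala_version is 2.11 or 2.12.

In 0.5.1, operations on the timeline metadata no longer use renames. This is enabled by default for newly created Hudi tables and disabled by default for existing tables; before enabling it on an existing table, read this section (https://hudi.apache.org/docs/deployment.html#upgrading). To enable the new Hudi timeline layout, which avoids renames, set the write config hoodie.timeline.layout.version=1. Alternatively, use the CLI command repair overwrite-hoodie-props to add hoodie.timeline.layout.version=1 to the hoodie.properties file. Either way, upgrade the Hudi readers (query engines) to 0.5.1 before upgrading the writer.

The CLI supports repair overwrite-hoodie-props, which overwrites a table's hoodie.properties file with a specified file. Use it to update the table name or to switch to the new timeline layout. While hoodie.properties is being written (a matter of milliseconds), some queries may fail temporarily; simply rerun them.

The DeltaStreamer parameter that specifies the table type changed from --storage-type to --table-type; see the wiki for the latest terminology.

The values of the Kafka reset offset strategy changed: the enum value LARGEST became LATEST and SMALLEST became EARLIEST. The corresponding DeltaStreamer config is auto.offset.reset.

When using spark-shell to try out Hudi, you must additionally pass --packages org.apache.spark:spark-avro_2.11:2.4.4; see the quickstart for details.

The key generators moved to a separate package, org.apache.hudi.keygen. If you override the key generator class (config hoodie.datasource.write.keygenerator.class), make sure the fully qualified class name is updated accordingly.

The Hive sync tool now registers the RO table for MOR tables with a _ro suffix, so queries must use the _ro suffix as well. Use the --skip-ro-suffix option to keep the old table name, that is, to sync without adding the _ro suffix.

In 0.5.1, the hudi-hadoop-mr-bundle used by the Presto/Hive query engines shades the avro package in order to support real-time queries. Hudi supports pluggable record merge logic: users only need to provide a custom implementation of HoodieRecordPayload. If you use this feature, you need to relocate the avro dependency in your own code so that its behavior stays consistent with Hudi. The relocation maps the first pattern below to the second (shaded) pattern:
org.apache.avro.
org.apache.hudi.org.apache.avro.

DeltaStreamer has better support for deletes; see the blog for details.

DeltaStreamer supports AWS Database Migration Service (DMS); see the blog for details.

DynamicBloomFilter is supported. It is disabled by default and can be enabled with the index config hoodie.bloom.index.filter.type=DYNAMIC_V0.

HDFSParquetImporter supports bulkinsert, configured by setting --command to bulkinsert.

WASB and WASBS (Azure Blob Storage) cloud storage is supported.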

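8.2 A failed installation attempt

Having read the release notes and settled on the version combination, switch to the official documentation for the latest release, Hudi 0.5.2.
Because I had never used Spark or Hudi before, my first thought on seeing the Hudi site was to download a Hudi 0.5.1 application package, deploy it, and then run the commands given on the official site. The following is that mistaken attempt:

Since the official examples currently use 0.5.1, I downloaded that version:
https://downloads.apache.org/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz
Upload the downloaded package to the /hadoop/spark directory and extract it: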
[root@hadoop spark]# ls
bin  conf  data  examples  hudi-0.5.1-incubating.src.tgz  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  work  yarn
[root@hadoop spark]# tar -zxvf hudi-0.5.1-incubating.src.tgz
[root@hadoop spark]# ls
bin  conf  data  examples  hudi-0.5.1-incubating  hudi-0.5.1-incubating.src.tgz  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  spark-2.4.4-bin-hadoop2.7  work  yarn
[root@hadoop spark]# rm -rf *tgz
[root@hadoop ~]# /hadoop/spark/bin/spark-shell \
>     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
>     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/hadoop/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5717aa3e-7bfb-42c4-aadd-2a884f3521d5;1.0
	confs: [default]
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
:: resolution report :: resolve 454ms :: artifacts dl 1ms
	:: modules in use:
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   0   |   0   |
	---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
	Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.pom

	Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.jar
...

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found

		:: org.apache.spark#spark-avro_2.11;2.4.4: not found

		::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found, unresolved dependency: org.apache.spark#spark-avro_2.11;2.4.4: not found]
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

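8.3 The correct "installation"

What I had downloaded is actually a source package, not something that can be run directly.
Moreover, spark-shell --packages takes the Maven coordinates of the jars. Spark resolves them with its bundled Ivy, searching the local Maven repository first and then Maven Central (plus any repositories given with --repositories), so the two jars are meant to be downloaded automatically. My virtual machine has no external network access (and no Maven setup, although Spark does not need one), so resolution was bound to fail with "not found".
The option in the official example:
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
is, put simply, the same as declaring dependencies in a Maven project's pom file. Browsing the official documentation, I found the central repository address that Hudi gives and located there the two packages named in the official example.
Written as Maven dependencies, they are: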
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark-bundle_2.11</artifactId>
  <version>0.5.2-incubating</version>
</dependency>
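
Each coordinate maps to a fixed path in the repository, which is how the jars can be located by hand from a machine that does have Internet access. The following is a minimal sketch; the helper name maven_jar_url is hypothetical, and the URL layout is the one visible in the Ivy warnings above.

```shell
# Convert a Maven coordinate (groupId:artifactId:version) into the URL of the
# jar on Maven Central. The layout matches the URLs in the Ivy warnings above.
maven_jar_url() {
    coord=$1
    group=${coord%%:*}          # text before the first colon
    rest=${coord#*:}
    artifact=${rest%%:*}        # text between the two colons
    version=${rest#*:}          # text after the second colon
    group_path=$(printf '%s' "$group" | tr '.' '/')
    printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
        "$group_path" "$artifact" "$version" "$artifact" "$version"
}

# Print the download URLs for the two jars. Nothing is downloaded here; on a
# connected machine each URL could be fetched with a tool such as curl.
maven_jar_url org.apache.spark:spark-avro_2.11:2.4.4
maven_jar_url org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating
```

Printing the URLs rather than fetching them keeps the sketch usable on the offline server; the download itself happens elsewhere.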

So I downloaded these two jars directly, then went back to the official documentation.
It says that you can also get started quickly by building Hudi yourself and passing --jars /packaging/hudi-spark-bundle/target/hudi-spark-bundle- ..*-SNAPSHOT.jar to the spark-shell command instead of --packages org.apache.hudi:hudi-spark-bundle:0.5.2-incubating. Seeing this hint, I checked the spark-shell help on Linux:

[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --help
Usage: ./bin/spark-shell [options]

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.

So --jars takes jar files that already exist on the machine. Next, upload the two jars downloaded earlier to the server:

[root@hadoop spark]# mkdir external_jars
[root@hadoop spark]# cd external_jars/
[root@hadoop external_jars]# pwd
/hadoop/spark/external_jars
# upload the jars to this directory via Xftp
[root@hadoop external_jars]# ls
hudi-spark-bundle_2.11-0.5.2-incubating.jar  scala-library-2.11.12.jar  spark-avro_2.11-2.4.4.jar  spark-tags_2.11-2.4.4.jar  unused-1.0.0.jar
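
Since --jars expects a comma-separated list, the value can be assembled from the files in this directory instead of being typed by hand. The following is a minimal sketch; the helper name join_jars is hypothetical, and the glob patterns assume the jar names listed above.

```shell
# Join the given jar paths with commas, the separator that --jars expects.
join_jars() {
    out=""
    for jar in "$@"; do
        # Skip glob patterns that matched nothing and were left unexpanded.
        [ -e "$jar" ] || continue
        out="${out:+$out,}$jar"
    done
    printf '%s\n' "$out"
}

# Usage sketch: the directory is the one created above; only the two jars
# named in the official example are selected.
jar_dir=/hadoop/spark/external_jars
jars=$(join_jars "$jar_dir"/spark-avro_*.jar "$jar_dir"/hudi-spark-bundle_*.jar)
echo "--jars $jars"
```

Selecting the jars by pattern rather than passing the whole directory avoids putting the extra jars in that directory on the classpath unintentionally.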

Then change the official example command:

spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

to:

[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --jars /hadoop/spark/external_jars/spark-avro_2.11-2.4.4.jar,/hadoop/spark/external_jars/hudi-spark-bundle_2.11-0.5.2-incubating.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
20/03/31 15:19:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop:4040
Spark context available as 'sc' (master = local[*], app id = local-1585639157881).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

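OK! No more errors. Next, try out the CRUD operations.

8.4 Hudi CRUD operations

The following steps continue in the spark-shell session started above.

8.4.1 Set up the table name, base path, and a data generator to generate records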

scala> import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.QuickstartUtils._

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SaveMode._

scala> import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions._

scala> import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions._

scala> import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.config.HoodieWriteConfig._

scala> val tableName = "hudi_cow_table"
tableName: String = hudi_cow_table

scala> val basePath = "file:///tmp/hudi_cow_table"
basePath: String = file:///tmp/hudi_cow_table

scala> val dataGen = new DataGenerator
dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@4bf6bc2d

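The data generator can generate sample inserts and updates based on the sample trip schema.

8.4.2 Inserting data
Generate some new trip samples, load them into a DataFrame, and write the DataFrame into the Hudi dataset, as shown below.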

scala> val inserts = convertToStringList(dataGen.generateInserts(10))
inserts: java.util.List[String] = [{"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.4726905879569653, "begin_lon": 0.46157858450465483, "end_lat": 0.754803407008858, "end_lon": 0.9671159942018241, "fare": 34.158284716382845, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "0d612dd2-5f10-4296-a434-b34e6558e8f1", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.6100070562136587, "begin_lon": 0.8779402295427752, "end_lat": 0.3407870505929602, "end_lon": 0.5030798142293655, "fare": 43.4923811219014, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.5731835407930634, "begin_...

scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |     options(getQuickstartWriteConfigs).
     |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |     option(TABLE_NAME, tableName).
     |     mode(Overwrite).
     |     save(basePath);
20/03/31 15:28:11 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

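mode(Overwrite) overwrites and recreates the dataset if it already exists. You can check the data generated under /tmp/hudi_cow_table/<region>/<country>/<city>/. We provided a record key (uuid in the schema), a partition field (region/country/city), and precombine logic (ts in the schema) to ensure that trip records are unique within each partition. For more information, see Modeling data stored in Hudi; for ways to ingest data into Hudi, see Writing Hudi datasets. Here we use the default write operation, upsert. If your workload has no updates, you can also use the faster insert or bulk_insert operations. To learn more, see Write operations.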
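
To check what the write produced on disk, the partition directories can be listed from the shell. The following is a minimal sketch; the helper name list_partitions is hypothetical, and it assumes the three-level region/country/city layout described above and a find that supports -mindepth and -maxdepth (GNU and BSD find do). Hudi keeps its own metadata in a .hoodie directory under the base path, which the sketch skips.

```shell
# List the partition directories (three levels deep: region/country/city)
# under a Hudi base path, relative to that path.
list_partitions() {
    # $1: base path on the local file system
    (
        cd "$1" || exit 1
        # Keep only directories exactly three levels down and leave out
        # anything under the .hoodie metadata directory.
        find . -mindepth 3 -maxdepth 3 -type d ! -path './.hoodie/*' \
            | sed 's|^\./||' | sort
    )
}

# Usage sketch against the base path used in this walkthrough; it prints
# nothing if the table has not been written yet.
if [ -d /tmp/hudi_cow_table ]; then
    list_partitions /tmp/hudi_cow_table
fi
```

Each line printed corresponds to one value of the partitionpath field in the generated records.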

8.4.3 Querying data
Load the data files into a DataFrame.

scala> val roViewDF = spark.
     |     read.
     |     format("org.apache.hudi").
     |     load(basePath + "/*/*/*/*")
20/03/31 15:30:03 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
roViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> roViewDF.registerTempTable("hudi_ro_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_ro_table where fare > 20.0").show()

+------------------+-------------------+-------------------+---+
|              fare|          begin_lon|          begin_lat| ts|
+------------------+-------------------+-------------------+---+
| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+------------------+-------------------+-------------------+---+


scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_ro_table").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|     20200331152807|264170aa-dd3f-4a7...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|     20200331152807|0e170de4-7eda-4ab...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|     20200331152807|fb06d140-cd00-413...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|     20200331152807|eb1d495c-57b0-4b3...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|     20200331152807|2b3380b7-2216-4ca...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|     20200331152807|81a9b76c-655b-452...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|     20200331152807|d24e8cb8-69fd-4cc...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|     20200331152807|0d612dd2-5f10-429...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|     20200331152807|a6a7e7ed-3559-4ee...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
|     20200331152807|824ee8d5-6f1f-4d5...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+

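This query provides a read-optimized view of the ingested data. Since our partition path (region/country/city) is nested three levels below the base path, we used load(basePath + "/*/*/*/*"). For more information on all supported storage types and views, see Storage types and views.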

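8.4.4 Updating data
This is similar to inserting new data. Use the data generator to generate updates to existing trips, load them into a DataFrame, and write the DataFrame to the Hudi dataset.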

scala> val updates = convertToStringList(dataGen.generateUpdates(10))
updates: java.util.List[String] = [{"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.7340133901254792, "begin_lon": 0.5142184937933181, "end_lat": 0.7814655558162802, "end_lon": 0.6592596683641996, "fare": 49.527694252432056, "partitionpath": "americas/united_states/san_francisco"}, {"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.1593867607188556, "begin_lon": 0.010872312870502165, "end_lat": 0.9808530350038475, "end_lon": 0.7963756520507014, "fare": 29.47661370147079, "partitionpath": "americas/brazil/sao_paulo"}, {"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.71801964677...

scala> val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |     options(getQuickstartWriteConfigs).
     |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |     option(TABLE_NAME, tableName).
     |     mode(Append).
     |     save(basePath);
20/03/31 15:32:27 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

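Note that the save mode is now Append. In general, always use Append mode unless you are creating the dataset for the first time. Querying the data again will now show the updated trips. Each write operation generates a new commit denoted by a timestamp. Look for changes in the _hoodie_commit_time, rider, and driver fields for the same _hoodie_record_key values as in the previous commit.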
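
The commit timestamps are easier to compare with the log lines once they are formatted. The following is a minimal sketch, assuming the yyyyMMddHHmmss layout that the _hoodie_commit_time values in this walkthrough follow; the helper name format_commit_time is hypothetical.

```shell
# Reformat a commit time such as 20200331152807 (yyyyMMddHHmmss, as seen in
# the _hoodie_commit_time column) into "2020-03-31 15:28:07".
format_commit_time() {
    printf '%s\n' "$1" | sed -E \
        's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/'
}

format_commit_time 20200331152807   # commit time of the initial insert
format_commit_time 20200331153224   # commit time of the update
```

The two values printed fall a few seconds before the WARN lines that follow the corresponding write commands in the transcripts.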

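8.4.5 Incremental query
Hudi also provides the ability to obtain a stream of records that changed since a given commit timestamp. This is done by using Hudi's incremental view and providing the begin time from which changes are needed. If we want all changes after the given commit (the common case), there is no need to specify an end time.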

scala> // reload data

scala> spark.
     |     read.
     |     format("org.apache.hudi").
     |     load(basePath + "/*/*/*/*").
     |     createOrReplaceTempView("hudi_ro_table")
20/03/31 15:33:55 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

scala> 

scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(20200331152807, 20200331153224)

scala> val beginTime = commits(commits.length - 2) // commit time we are interested in
beginTime: String = 20200331152807

scala> // incrementally query data

scala> val incViewDF = spark.
     |     read.
     |     format("org.apache.hudi").
     |     option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |     load(basePath);
20/03/31 15:34:40 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+--------------------+-------------------+---+
|_hoodie_commit_time|              fare|           begin_lon|          begin_lat| ts|
+-------------------+------------------+--------------------+-------------------+---+
|     20200331153224|49.527694252432056|  0.5142184937933181| 0.7340133901254792|0.0|
|     20200331153224|  98.3428192817987|  0.3349917833248327| 0.4777395067707303|0.0|
|     20200331153224|  90.9053809533154| 0.19949323322922063|0.18294079059016366|0.0|
|     20200331153224| 90.25710109008239|  0.4006983139989222|0.08528650347654165|0.0|
|     20200331153224| 29.47661370147079|0.010872312870502165| 0.1593867607188556|0.0|
|     20200331153224| 63.72504913279929|   0.888493603696927| 0.6570857443423376|0.0|
+-------------------+------------------+--------------------+-------------------+---+

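This returns all changes that happened after the begin-time commit, filtered to fares greater than 20.0. What is unique about this feature is that it lets you author streaming pipelines on batch data.

8.4.6 Point-in-time query
Let's look at how to query data as of a specific time. This can be expressed by pointing the end time to a specific commit time and the begin time to "000" (denoting the earliest possible commit time).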

scala> val beginTime = "000" // Represents all commits > this time.
beginTime: String = 000

scala> val endTime = commits(commits.length - 2) // commit time we are interested in
endTime: String = 20200331152807

scala> 

scala> // incrementally query data

scala> val incViewDF = spark.read.format("org.apache.hudi").
     |     option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |     option(END_INSTANTTIME_OPT_KEY, endTime).
     |     load(basePath);
20/03/31 15:36:00 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
|     20200331152807| 93.56018115236618|0.14285051259466197|0.21624150367601136|0.0|
|     20200331152807| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|0.0|
|     20200331152807| 27.79478688582596| 0.6273212202489661|0.11488393157088261|0.0|
|     20200331152807| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|0.0|
|     20200331152807|34.158284716382845|0.46157858450465483| 0.4726905879569653|0.0|
|     20200331152807| 66.62084366450246|0.03844104444445928| 0.0750588760043035|0.0|
|     20200331152807|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|0.0|
|     20200331152807| 41.06290929046368| 0.8192868687714224|  0.651058505660742|0.0|
+-------------------+------------------+-------------------+-------------------+---+
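
The two result tables suggest how the begin and end times select commits: the incremental query with begin time 20200331152807 returned only rows from the later commit, while the point-in-time query with that value as end time returned rows from that commit itself. The following is a minimal sketch of that window logic, written as an illustration of the observed behavior rather than as Hudi's implementation; the helper name commits_in_window is hypothetical.

```shell
# Print the commit times that fall in the window (begin, end]: strictly after
# begin and up to and including end. An empty end means "no upper bound".
# Commit times are fixed-width digit strings, so string comparison is enough.
commits_in_window() {
    begin=$1
    end=$2
    shift 2
    for c in "$@"; do
        printf '%s\n' "$c"
    done | awk -v b="$begin" -v e="$end" \
        '($0 "") > (b "") && (e == "" || ($0 "") <= (e ""))'
}

# Incremental query case: begin at the first commit, no end time.
commits_in_window 20200331152807 "" 20200331152807 20200331153224

# Point-in-time case: begin at "000", end at the first commit.
commits_in_window 000 20200331152807 20200331152807 20200331153224
```

The first call prints 20200331153224 and the second prints 20200331152807, matching the _hoodie_commit_time values in the two result tables.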
