Hadoop currently ships in several distributions: the Apache community release, CDH, and HDP. The Apache release has some inconsistent package dependencies, while CDH is used by roughly 70%~80% of companies in China, so these notes install the CDH distribution throughout. Specific versions: cdh5.7.0, hadoop2.6.0, hive1.1.0.
Note: if PATH ever gets misconfigured during setup, reset it to the system defaults: export PATH=/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin
Following the Mac installation guide, set up passwordless ssh to localhost so that Hadoop can log in without a password.
In System Preferences -> Sharing, check Remote Login on the left and select All Users on the right.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
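A quick optional check that passwordless login works; this should open a shell without prompting for a password:

```bash
ssh localhost
```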
The Hadoop installation follows the official documentation.
Download the tar.gz file, unpack it, and add the HADOOP_HOME variable:
vi ~/.bash_profile
export HADOOP_HOME=/Users/xiafengsheng/app/hadoop-2.6.0-cdh5.7.0
export PATH=$HADOOP_HOME/bin:$PATH
source ~/.bash_profile
Following the official docs, edit core-site.xml, hadoop-env.sh, and hdfs-site.xml under $HADOOP_HOME/etc/hadoop; mapred-site.xml and yarn-site.xml are edited later when configuring YARN.
In hadoop-env.sh, add the JAVA_HOME variable:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home
In core-site.xml, add the fs.defaultFS and hadoop.tmp.dir properties. fs.defaultFS specifies the default filesystem; hadoop.tmp.dir specifies the temporary file directory, which must be set because the default temp directory is wiped when the system reboots. The port is set to 8020 because newer Hadoop versions use 8020 as the default port.
```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/xiafengsheng/app/tmp</value>
</property>
```
In hdfs-site.xml, add the dfs.replication property, which controls how many replicas Hadoop keeps of each file; since this is a single-node setup, set it to 1.
```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```
The first startup requires formatting HDFS first:
bin/hdfs namenode -format
Start HDFS:
sbin/start-dfs.sh
There are two ways to verify the startup succeeded:
The jps command:
xiafengshengdeMacBook-Pro:hadoop xiafengsheng$ jps
3329 Jps
627 RemoteMavenServer
1525 NameNode
1909 RunJar
1864 NodeManager
1690 SecondaryNameNode
634 Launcher
1789 ResourceManager
1597 DataNode
Open localhost:50070 in a browser.
Fix for the warning "Unable to load native-hadoop library for your platform… using builtin-java classes where applicable":
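Two common workarounds, offered here as hedged suggestions rather than the fix the original notes linked to: point the JVM at Hadoop's native library directory, or simply silence the warning, since the builtin-java classes work fine:

```bash
# Option 1: add the native library path (if the CDH tarball ships native libs)
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

# Option 2: suppress the warning in Hadoop's log4j configuration
echo "log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR" >> $HADOOP_HOME/etc/hadoop/log4j.properties
```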
From Hadoop 2 on, YARN replaces MapReduce for resource scheduling.
For YARN scheduling, see video 1-19.
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar contains sample jobs.
Following the official docs, configure mapred-site.xml and yarn-site.xml under $HADOOP_HOME/etc/hadoop/; if a file does not exist but a .template version does, copy the template and edit the copy (see the command below).
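For example, mapred-site.xml usually only ships as a template:

```bash
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
```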
In mapred-site.xml:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

In yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```
Start YARN:
sbin/start-yarn.sh
There are two ways to verify the startup succeeded:
The jps command:
xiafengshengdeMacBook-Pro:hadoop xiafengsheng$ jps
3329 Jps
627 RemoteMavenServer
1525 NameNode
1909 RunJar
1864 NodeManager
1690 SecondaryNameNode
634 Launcher
1789 ResourceManager
1597 DataNode
Open localhost:8088 in a browser.
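With YARN up, the examples jar mentioned earlier gives a quick smoke test; a sketch using the bundled pi estimator (2 map tasks, 3 samples each):

```bash
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 2 3
```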
The Hive installation follows the official documentation.
After downloading and unpacking the pre-built tar.gz, add the HIVE_HOME environment variable and add it to PATH; once that is done, go straight to step 2 without consulting the official docs further.
vi ~/.bash_profile
export HIVE_HOME=/Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH
source ~/.bash_profile
Why configure this? By default the metadata is stored in Hive's embedded Derby database; to share the metadata, MySQL is used to store it instead. Once the metadata is shared, Spark and other engines can also operate on Hive data. The steps follow.
Download the MySQL driver, choosing Platform Independent.
Copy the MySQL driver into the $HIVE_HOME/lib/ directory (a symlink also works):
ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar
Create a database for the metadata (this step can be skipped):
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
Create a user for the metadata (likewise optional; root can be used directly):
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';
mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
mysql> flush privileges;
Create a hive-site.xml file under $HIVE_HOME/conf and add the metadata configuration:
```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  <description>metadata is stored in a MySQL server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
  <description>user name for connecting to mysql server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
  <description>password for connecting to mysql server</description>
</property>
```
If the database was not created in step 3, createDatabaseIfNotExist must be set to true. MySQL 8 requires SSL to be configured explicitly, hence the added &useSSL=false; this point comes from a reference article.
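Before running the word count below, the input file must already exist on HDFS at /cihon/wc.txt. A minimal sketch, assuming wc.txt sits on the local desktop (the local source path is an assumption):

```bash
# create the target directory and upload the local file to HDFS
hdfs dfs -mkdir -p /cihon
hdfs dfs -put ~/Desktop/wc.txt /cihon/
```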
create table hive_wc(context string);
load data inpath '/cihon/wc.txt' into table hive_wc;
select word,count(1) from hive_wc lateral view explode(split(context,' ')) wc as word group by word;
Hive SQL is translated into MR jobs, and YARN schedules those MR jobs: YARN is the resource-scheduling framework, MR the compute framework.
Why Spark? "Run workloads 100x faster."
Spark can be used without compiling: download the official package with Hadoop bundled (pre-built for Apache Hadoop 2.7 and later) for single-machine testing and run it directly.
./bin/spark-shell
When using IDEA as the Spark IDE and creating the project as a Maven project, IDEA automatically downloads all the dependencies once pom.xml is configured.
If Scala is not installed on the machine, add the Scala plugin in IDEA's plugins settings.
Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
1. pom.xml
```xml
<properties>
  <java.version>1.8</java.version>
  <maven.compiler.source>${java.version}</maven.compiler.source>
  <maven.compiler.target>${java.version}</maven.compiler.target>
  <maven.version>3.3.9</maven.version>
  <sbt.project.name>spark</sbt.project.name>
  <hadoop.version>2.6.5</hadoop.version>
</properties>
```
Override a properties value with -Dhadoop.version=2.4.0.
```xml
<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.3</hadoop.version>
    <curator.version>2.7.1</curator.version>
  </properties>
</profile>
```
Select a profile with -Phadoop-2.4.
Final command (normal build):
mvn -Psparkr -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package
The first run reported an error:
Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in central (https://repo.maven.apache.org/maven2)
This is because the default Maven repository lacks 2.6.0-cdh5.7.0; add the Cloudera repository to pom.xml:
```xml
<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```
Final command (producing a tgz package):
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes -Dhadoop.version=2.6.0-cdh5.7.0
The build ran into an error:
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet, :
pdflatex is not available
Running Sys.which("pdflatex") in R confirmed it was missing, so LaTeX needs to be installed.
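On macOS, one option is installing MacTeX through Homebrew; this is a hedged suggestion, since any TeX distribution that provides pdflatex will do:

```bash
# Homebrew cask syntax of this era; newer Homebrew uses: brew install --cask mactex
brew cask install mactex
```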
After the build completes, the artifact is named spark-$VERSION-bin-$NAME.tgz.
Add the SPARK_HOME environment variable and update PATH:
export SPARK_HOME=/Users/xiafengsheng/app/spark-2.3.1-bin-custom-spark
export PATH=$SPARK_HOME/bin:$PATH
Local mode needs no extra setup.
Enter local mode:
spark-shell --master local[2]
Standalone deployment is likewise 1 master + n workers.
It requires configuring the spark-env.sh file:
You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by starting with the conf/spark-env.sh.template.
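For example:

```bash
cp conf/spark-env.sh.template conf/spark-env.sh
```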
# Options for the daemons used in the standalone deploy mode
SPARK_MASTER_HOST=localhost
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
| Environment Variable | Meaning |
|---|---|
| SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, for example a public one. |
| SPARK_WORKER_CORES | Total number of cores to allow Spark applications to use on the machine (default: all available cores). |
| SPARK_WORKER_MEMORY | Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property. |
Once configured, start it directly:
./sbin/start-all.sh
After it starts, the Spark master address can be found in the logs.
Editing conf/slaves
To launch a Spark standalone cluster with the launch scripts, first create a file called conf/slaves in the Spark directory, containing the hostnames of all machines on which you intend to start Spark workers, one hostname per line. If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful for testing. Note that the master machine accesses each worker over ssh. A minimal single-machine example follows.
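A minimal sketch for this single-machine setup:

```bash
# one worker hostname per line; here only the local machine
echo "localhost" > conf/slaves
```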
## 4. Connecting to the Spark cluster
See the official docs for the detailed parameter configuration.
spark-shell --master spark://localhost:7077
A simple wordcount:
val file = spark.sparkContext.textFile("file:///Users/xiafengsheng/Desktop/wc.txt")
// without the file:// prefix, the default base path is the Spark directory
val wordCounts = file.flatMap(line => line.split(" ").map((word => (word,1)))).reduceByKey(_ + _)
wordCounts.collect
res2: Array[(String, Int)] = Array((0,9), (entries,3), (to,3), (of,3), (Showing,3))
Connect to the Hive database from Spark in cluster mode or local mode; SparkR is used here:
sparkR --master local[*]
sparkR.session(enableHiveSupport = TRUE)
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
Executing a SQL statement raised an error:
spark Error creating transactional connection factory
The cause: Hive uses MySQL for its metadata, and Spark ships without a MySQL connector, so the connector has to be added with the --driver-class-path parameter. Is there a way to add it without the flag? (One configuration-based option is sketched after the command below.)
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath
sparkR --driver-class-path /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar
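One hedged alternative to repeating the flag on every launch (the jar path below matches this machine's layout) is setting spark.driver.extraClassPath in conf/spark-defaults.conf:

```bash
# persist the driver classpath so --driver-class-path is no longer needed
echo "spark.driver.extraClassPath /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar" >> $SPARK_HOME/conf/spark-defaults.conf
```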
WordCount against Hive using SparkR:
collect(sql("select * from hive_wc"))
collect(sql("select word,count(1) from hive_wc lateral view explode(split(context,' ')) wc as word group by word"))
word count(1)
1 0 9
2 entries 3
3 of 3
4 to 3
5 Showing 3
sparkR --master yarn --deploy-mode client --driver-class-path /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar
Reference: https://blog.csdn.net/aiu_iu/article/details/71844568?utm_source=itdadao&utm_medium=referral