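I. Hadoop overview:
Installation package: hadoop-3.0.1.tar.gz
Changes in hadoop-3.0.1:
3.0.1 builds on the Hadoop 3 changes with a number of fixes and improvements (49 bugs fixed).
The changes in Hadoop 3 relative to earlier Hadoop versions, together with the features of 3.0.1, are listed below.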
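1) The JDK must be at least Java 8.
2) The default NameNode RPC port was changed back to 8020 (it was 9820 in 3.0.0).
3) Hadoop Common
a. The Hadoop core was slimmed down: expired APIs and implementations were removed, and default component implementations were replaced with the most efficient ones
(for example, the default FileOutputCommitter was switched to the v2 version, hftp was retired in favor of webhdfs,
and Hadoop's own serialization library org.apache.hadoop.Records was removed).
b. Classpath isolation prevents conflicts between different versions of jar files. Google Guava, for example, easily causes conflicts when
Hadoop, HBase and Spark are used together. (https://issues.apache.org/jira/browse/HADOOP-11656)
c. Shell script rewrite. Hadoop 3.0 reworked the Hadoop management scripts, fixed a large number of bugs, and added new features
such as support for dynamic commands. (https://issues.apache.org/jira/browse/HADOOP-9902)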
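4) Hadoop HDFS
a. HDFS supports erasure coding, which cuts storage usage roughly in half compared with 3x replication without reducing reliability.
(https://issues.apache.org/jira/browse/HDFS-7285)
b. Multiple NameNode support: a cluster can be deployed with one active NameNode and several standby NameNodes. (https://issues.apache.org/jira/browse/HDFS-6440)
Note: support for multiple ResourceManagers was already available in Hadoop 2.0.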
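5) Hadoop MapReduce
a. Task-level native optimization. A C/C++ map output collector implementation (covering Spill, Sort, IFile and so on) was added to MapReduce;
a job can switch to it by adjusting job-level parameters. For shuffle-intensive applications, performance can improve by about 30%.
(https://issues.apache.org/jira/browse/MAPREDUCE-2841)
b. Automatic inference of MapReduce memory parameters. In Hadoop 2.0, setting the memory for a MapReduce job was tedious and involved two parameters,
mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts. An inconsistent setting wastes memory
badly: if the former is set to 4096 MB but the latter is "-Xmx2g", the remaining 2 GB can never be used by the Java heap. Hadoop 3 can infer one value from the other.
(https://issues.apache.org/jira/browse/MAPREDUCE-5785)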
6) Hadoop YARN
a. cgroup-based memory isolation and disk I/O isolation
b. RM leader election implemented with Curator
c. Container resizing
d. Timeline Server next generation
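The memory example in item 5) b can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this note, and the 80% heap-to-container ratio is an assumption (it is believed to match the default of mapreduce.job.heap.memory-mb.ratio, but verify it against your release).

```shell
# Heap size in MB that would be derived for a container of the given size.
# $1 = container memory in MB, $2 = heap ratio in percent (assumed default: 80)
heap_for_container() {
  container_mb="$1"
  ratio_percent="${2:-80}"
  echo $(( container_mb * ratio_percent / 100 ))
}

# Memory in MB that the Java heap can never use.
# $1 = container memory in MB, $2 = -Xmx value in MB
wasted_mb() {
  echo $(( $1 - $2 ))
}
```

With these helpers, wasted_mb 4096 2048 reproduces the 2 GB of unusable memory from the example, and heap_for_container 4096 shows the heap size a ratio-based inference would pick instead.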
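Notes:
After changing configuration files there is no need to reformat; a restart is enough.
The environment used here is Ubuntu 16.04.
Files modified in total: /etc/hosts, /etc/hostname, ~/.bashrc, /etc/profile,
hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, start-dfs.sh,
start-yarn.sh, stop-dfs.sh, stop-yarn.sh, workers, yarn-site.xml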
II. Installing and configuring hadoop-3.0.1
1) Configuring environment variables
First, make the host names resolvable on every server:
vim /etc/hosts
10.10.100.19 hadoop01
10.10.100.18 hadoop02
10.10.100.17 hadoop03
10.10.100.16 hadoop04
10.10.100.15 hadoop05
Try a ping or an ssh connection to confirm that the names resolve.
Then add the same names on Windows:
C:\Windows\System32\drivers\etc
Edit the hosts file:
10.10.100.19 hadoop01
10.10.100.18 hadoop02
10.10.100.17 hadoop03
10.10.100.16 hadoop04
10.10.100.15 hadoop05
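This makes remote operation from Windows easier.
Note: the line "127.0.0.1 localhost" in the hosts file provides local resolution and must not be commented out.
If it is commented out, components such as ZooKeeper and Tomcat run into all sorts of exasperating problems, so never do that.
If you are not happy with your host name, you can change it: vim /etc/hostname
Reboot the system after the change.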
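Before moving on, it can help to confirm that every name, including localhost, really has an entry in the hosts file. The following is a minimal sketch; missing_hosts is a hypothetical helper written for this note, and it only skips whole-line comments.

```shell
# Print every expected host name that has no active entry in the hosts file.
# $1 = hosts file, remaining arguments = host names to look for
missing_hosts() {
  hosts_file="$1"
  shift
  for name in "$@"; do
    # Ignore lines that start with '#', then search the columns after the IP.
    awk -v n="$name" '
      !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }
    ' "$hosts_file" || echo "$name"
  done
}
```

For example, missing_hosts /etc/hosts localhost hadoop01 hadoop02 hadoop03 hadoop04 hadoop05 prints nothing when all entries are present.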
vim ~/.bashrc
JAVA_HOME=/usr/software/java/jdk1.8.0_152
HADOOP_HOME=/usr/software/hadoop-3.0.1
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH HADOOP_HOME
The paths must match your actual installation.
vim /etc/profile
Add:
export HADOOP_HOME=/usr/software/hadoop-3.0.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
After saving, apply the profile: source /etc/profile
2) Installing hadoop-3.0.1
Move the hadoop-3.0.1.tar.gz package to /usr/software/
Unpack the archive:
tar -xzvf hadoop-3.0.1.tar.gz
After unpacking, the hadoop-3.0.1 directory appears.
3) Configuring hadoop-env.sh
Path of the Hadoop configuration files: /usr/software/hadoop-3.0.1/etc/hadoop
Go to that path and edit the file: vim hadoop-env.sh
Change line 52 of the file to: export JAVA_HOME=/usr/software/java/jdk1.8.0_152
export HADOOP_HOME=/usr/software/hadoop-3.0.1
#export HADOOP_CLASSPATH=${HADOOP_HOME}
Save and exit.
4) Configuring workers
Hadoop 2.x used a slaves file; this has changed, and the file to edit is now workers.
find / -name workers
cd /usr/software/hadoop-3.0.1/etc/hadoop
vim workers
hadoop01
hadoop02
hadoop03
hadoop04
hadoop05
Save and exit the editor.
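For a larger cluster, typing the workers file by hand gets tedious. A small sketch, assuming the hosts follow the hadoopNN naming used here (gen_workers is a hypothetical helper, not part of Hadoop):

```shell
# Print one worker host name per line: hadoop01 .. hadoopNN.
# $1 = number of worker nodes
gen_workers() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    printf 'hadoop%02d\n' "$i"
    i=$((i + 1))
  done
}
```

Running gen_workers 5 > /usr/software/hadoop-3.0.1/etc/hadoop/workers would produce the same five lines as above.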
5) Configuring core-site.xml
cd /usr/software/hadoop-3.0.1/etc/hadoop
vim core-site.xml
Save and exit.
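The contents of core-site.xml are not reproduced in these notes. As a rough sketch only, an HA setup with ZooKeeper typically needs at least fs.defaultFS and ha.zookeeper.quorum. The helper name gen_core_site, the nameservice ID mycluster and the ZooKeeper client port 2181 are assumptions, not the author's actual values.

```shell
# Print a minimal HA-oriented core-site.xml.
# $1 = nameservice ID (hypothetical), remaining arguments = ZooKeeper hosts
gen_core_site() {
  nameservice="$1"
  shift
  quorum=""
  for host in "$@"; do
    # Join the hosts as host:2181,host:2181,... (2181 is an assumed port)
    quorum="${quorum:+$quorum,}$host:2181"
  done
  cat <<EOF
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://$nameservice</value></property>
  <property><name>ha.zookeeper.quorum</name><value>$quorum</value></property>
</configuration>
EOF
}
```

For instance, gen_core_site mycluster hadoop01 hadoop02 hadoop03 prints a file body that can be compared with your own core-site.xml.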
6) Configuring hdfs-site.xml
vim /usr/software/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
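The hdfs-site.xml contents are likewise missing from these notes. The sketch below prints the properties an HA deployment with JournalNodes usually needs. Everything specific in it is an assumption: the nameservice ID mycluster, the choice of hadoop01 and hadoop04 as NameNodes, the JournalNode host list, the JournalNode port 8485 and the sshfence fencing method. Only the RPC port 8020 and the web port 9870 come from this document.

```shell
NAMESERVICE="mycluster"                                   # hypothetical nameservice ID
NN_HOSTS="hadoop01 hadoop04"                              # assumed NameNode hosts
JN_HOSTS="hadoop01 hadoop02 hadoop03 hadoop04 hadoop05"   # assumed JournalNode hosts

# Print one <property> element: $1 = name, $2 = value
prop() {
  printf '  <property><name>%s</name><value>%s</value></property>\n' "$1" "$2"
}

gen_hdfs_site() {
  ids=""; jn=""; n=0
  for h in $NN_HOSTS; do n=$((n + 1)); ids="${ids:+$ids,}nn$n"; done
  for h in $JN_HOSTS; do jn="${jn:+$jn;}$h:8485"; done
  echo '<configuration>'
  prop dfs.nameservices "$NAMESERVICE"
  prop "dfs.ha.namenodes.$NAMESERVICE" "$ids"
  n=0
  for h in $NN_HOSTS; do
    n=$((n + 1))
    prop "dfs.namenode.rpc-address.$NAMESERVICE.nn$n" "$h:8020"
    prop "dfs.namenode.http-address.$NAMESERVICE.nn$n" "$h:9870"
  done
  prop dfs.namenode.shared.edits.dir "qjournal://$jn/$NAMESERVICE"
  prop "dfs.client.failover.proxy.provider.$NAMESERVICE" \
       org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
  prop dfs.ha.automatic-failover.enabled true
  prop dfs.ha.fencing.methods sshfence
  echo '</configuration>'
}
```

Calling gen_hdfs_site prints the XML to standard output so it can be reviewed before anything is written to disk.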
7) Configuring mapred-site.xml
vim /usr/software/hadoop-3.0.1/etc/hadoop/mapred-site.xml
/usr/software/hadoop-3.0.1/etc/hadoop,
/usr/software/hadoop-3.0.1/share/hadoop/common/*,
/usr/software/hadoop-3.0.1/share/hadoop/common/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/lib/*
8) Configuring yarn-site.xml
The class path must be specified not only in mapred-site.xml but also in the yarn.application.classpath property of yarn-site.xml:
/usr/software/hadoop-3.0.1/etc/hadoop,
/usr/software/hadoop-3.0.1/share/hadoop/common/*,
/usr/software/hadoop-3.0.1/share/hadoop/common/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/lib/*
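The nine entries above follow a regular pattern, so the value can be generated from the installation directory instead of being typed out. A minimal sketch (build_classpath is a hypothetical helper written for this note):

```shell
# Print the comma-separated class path value for a given Hadoop home.
# $1 = Hadoop installation directory
build_classpath() {
  home="$1"
  cp="$home/etc/hadoop"
  for mod in common hdfs mapreduce yarn; do
    # The '*' stays literal because the string is quoted.
    cp="$cp,$home/share/hadoop/$mod/*,$home/share/hadoop/$mod/lib/*"
  done
  printf '%s\n' "$cp"
}
```

build_classpath /usr/software/hadoop-3.0.1 prints the list on a single line. The bin/hadoop classpath command also prints the class path Hadoop itself computes, which is a useful cross-check.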
9) Copying the configured hadoop-3.0.1 to every node
scp -r hadoop-3.0.1 hadoop02@hadoop02:/home/hadoop02
On the target node, move the directory from /home/hadoop02 to /usr/software.
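10) Formatting HDFS
HDFS has to be formatted the first time it is installed; this is not needed afterwards.
a. Make sure ZooKeeper is running (on every node)
/usr/software/zookeeper-3.4.11/bin/./zkServer.sh status
b. Start the JournalNode (on every node)
cd /usr/software/hadoop-3.0.1/sbin
./hadoop-daemon.sh start journalnode
or, from the installation directory: sbin/hadoop-daemon.sh start journalnode
A JournalNode process now appears in the process list.
c. Format HDFS (running the format on hadoop01 alone is enough)
cd /usr/software/hadoop-3.0.1/
bin/hdfs namenode -format
Check that the output reports a successful format, then run:
bin/hdfs zkfc -formatZK    # format the high-availability (HA) state in ZooKeeper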
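11) Starting the NameNode on hadoop01
Stop the JournalNode on all nodes:
cd /usr/software/hadoop-3.0.1/
sbin/hadoop-daemon.sh stop journalnode
cd /usr/software/hadoop-3.0.1/
bin/hdfs namenode
Once it is running, execute on the other nodes: bin/hdfs namenode -bootstrapStandby
After the other nodes have synchronized the metadata from the master, press Ctrl+C on the master node to stop the NameNode process.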
12) Starting HDFS (running it on hadoop01, which acts as the master, is enough)
cd /usr/software/hadoop-3.0.1
sbin/start-dfs.sh
To view HDFS in a web browser, open these URLs:
http://hadoop01:9870/
http://hadoop02:9870/
http://hadoop03:9870/
http://hadoop04:9870/
http://hadoop05:9870/
The master machine shows active and the other machines show standby.
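The same active/standby information can be read from the command line with hdfs haadmin -getServiceState instead of opening each web page. A small sketch, assuming the hdfs command is on the PATH; nn_states is a hypothetical wrapper, and the service IDs you pass (for example nn1, nn2) must be the ones defined in your own hdfs-site.xml.

```shell
# Print "<service-id> <state>" for each NameNode service ID given.
nn_states() {
  for id in "$@"; do
    printf '%s %s\n' "$id" "$(hdfs haadmin -getServiceState "$id")"
  done
}
```

A call such as nn_states nn1 nn2 would then list which NameNode is currently active.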
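13) Starting YARN
Note: the NameNode and the ResourceManager are kept apart for performance reasons: both
consume a lot of resources, so they should be started on different machines.
First start the ResourceManager on hadoop01 and hadoop02:
cd /usr/software/hadoop-3.0.1/bin
./yarn --daemon start resourcemanager
Then start a NodeManager on each of the five machines hadoop01 to hadoop05:
cd /usr/software/hadoop-3.0.1/bin
./yarn --daemon start nodemanager
Once the configuration files are set up, YARN can also be started and stopped directly with:
sbin/start-yarn.sh
sbin/stop-yarn.sh
Open the URL: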
http://hadoop01:8088/cluster
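14) Verifying HDFS HA
View HDFS: http://10.10.100.19:9870/
View the RM: http://10.10.100.19:8088/cluster
First upload a file to HDFS.
Go to the directory: cd /usr/software/hadoop-3.0.1
a. Create a directory: bin/hdfs dfs -mkdir /input
b. Create a file: cd /usr/software/hadoop-3.0.1
touch words.txt
Write the following content into the file:
one two hadoop
element dog cat go for cat
nick hello two
c. Upload the file. Because the command uses a relative path, run it from the directory that contains the file; words.txt is in /usr/software/hadoop-3.0.1
bin/hdfs dfs -put words.txt /input
d. View the content of the uploaded file: bin/hdfs dfs -cat /input/words.txt
Run the same command on the other machines; if the uploaded file is visible there, the upload succeeded.
e. Test failover and file persistence
Kill the NameNode process manually: kill -9 13212 (13212 was the PID of the NameNode here; use the PID on your own machine)
Open hadoop04 in a browser: its state has changed from standby to active.
View words.txt from another server: the file is still there.
Start the NameNode on hadoop01 manually: sbin/hadoop-daemon.sh start namenode
After the start, hadoop01 is in standby state and hadoop04 is active.
The file is still there: bin/hdfs dfs -cat /input/words.txt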
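The PID used in step e differs on every machine. One way to find it is to filter the output of the JDK's jps tool, which prints one "PID ClassName" pair per line. The helper below is a sketch; namenode_pid is a made-up name.

```shell
# Read jps output on standard input and print the PID of the NameNode.
# The exact match on the second column avoids hitting SecondaryNameNode.
namenode_pid() {
  awk '$2 == "NameNode" { print $1 }'
}
```

Typical use would be pid="$(jps | namenode_pid)" followed by kill -9 "$pid" once you have checked that the value is not empty.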
f. Run a job:
bin/hdfs dfs -rm -r /output
bin/hadoop jar /usr/software/hadoop-3.0.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.1.jar wordcount /input/words.txt /output/
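Explanation:
/input/: a path in HDFS, used as the input path
/output/: a path in HDFS, used as the output path (the word counts are written to this directory)
Remember that the output directory must not exist beforehand; otherwise the job fails with an error saying that the output folder already exists.
The job then completes successfully.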
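To check the job result, the same counts can be computed locally with standard tools and compared with the files under /output. This is only a local approximation of wordcount (splitting on whitespace); local_wordcount is a hypothetical helper.

```shell
# Print "word<TAB>count" for every whitespace-separated word in a file,
# sorted by word.
# $1 = input file
local_wordcount() {
  tr -s '[:space:]' '\n' < "$1" \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}
```

For the words.txt file above this lists ten distinct words, with cat and two counted twice and every other word once.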
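Troubleshooting summary:
Problem 1: the NameNode and DataNode lose their connection
Cause: when the file system is formatted (that is, after running $ bin/hadoop namenode -format), a current/VERSION file is saved in the NameNode data directory
(the local path configured as dfs.name.dir; mine is /usr/local/hadoop/tmp/dfs/name/current/VERSION,
where hadoop is the user name chosen when the virtual machine was created). The file records the namespaceID, which identifies the formatted NameNode. If the NameNode is formatted repeatedly,
the current/VERSION file kept by the DataNode (in the local path configured as dfs.data.dir; here /usr/local/hadoop/tmp/dfs/data/current/VERSION)
still holds the NameNode ID saved at the first format. The namespaceID of the NameNode then no longer matches that of the DataNode, and the two
lose their connection.
Solution: delete the temporary files in /tmp whose names begin with Hadoop, and empty the hadoop.tmp.dir directory.
Note: every run of hadoop namenode -format generates a new namespaceID for the NameNode, while the DataNode data under hadoop.tmp.dir keeps the previous namespaceID.
Because the IDs differ, the DataNode cannot start. It is therefore enough to delete the hadoop.tmp.dir directory (the tmp directory under
/usr/local/hadoop/) before each run of hadoop namenode -format, or to delete the data directory under /usr/local/hadoop/tmp/dfs and then restart dfs (run ./sbin/start-dfs.sh from the Hadoop installation path
/usr/local/hadoop/). Be sure to delete the local directory that hadoop.tmp.dir points to, that is, the tmp folder under /usr/local/hadoop/,
not a directory in HDFS.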
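Before deleting anything, the mismatch can be confirmed by comparing the ID in the two VERSION files, which use a simple key=value format. The helpers below are a sketch with made-up names; the key defaults to namespaceID as described above, and on newer releases the field to compare may be clusterID instead (an assumption to verify on your cluster).

```shell
# Print the value of a key from a VERSION file.
# $1 = VERSION file, $2 = key (for example namespaceID)
version_field() {
  sed -n "s/^$2=//p" "$1"
}

# Succeed only if both VERSION files hold the same value for the key.
# $1 = NameNode VERSION file, $2 = DataNode VERSION file, $3 = key (optional)
same_id() {
  key="${3:-namespaceID}"
  [ "$(version_field "$1" "$key")" = "$(version_field "$2" "$key")" ]
}
```

For example: same_id /usr/local/hadoop/tmp/dfs/name/current/VERSION /usr/local/hadoop/tmp/dfs/data/current/VERSION || echo "ID mismatch".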
Problem 2: errors when starting HDFS
Starting namenodes on [hadoop01 hadoop02 hadoop03 hadoop04 hadoop05]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting journal nodes [hadoop01 hadoop02 hadoop03 hadoop04 hadoop05]
ERROR: Attempting to operate on hdfs journalnode as root
ERROR: but there is no HDFS_JOURNALNODE_USER defined. Aborting operation.
Starting ZK Failover Controllers on NN hosts [hadoop01 hadoop02 hadoop03 hadoop04 hadoop05]
ERROR: Attempting to operate on hdfs zkfc as root
ERROR: but there is no HDFS_ZKFC_USER defined. Aborting operation.
Edit start-dfs.sh and stop-dfs.sh (vim start-dfs.sh, vim stop-dfs.sh)
and add:
HDFS_DATANODE_USER=root
HDFS_NAMENODE_USER=root
HDFS_ZKFC_USER=root
HDFS_JOURNALNODE_USER=root
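The four assignments above follow one naming pattern, and start-yarn.sh and stop-yarn.sh need the analogous YARN variables. The sketch below prints the whole set as export lines; daemon_user_exports is a hypothetical helper, and placing the lines in hadoop-env.sh instead of the individual start and stop scripts is an alternative I believe works, but treat it as an assumption and test it on your release.

```shell
# Print export lines that set the run-as user for each daemon.
# $1 = user name (root in these notes)
daemon_user_exports() {
  user="$1"
  for d in HDFS_NAMENODE HDFS_DATANODE HDFS_JOURNALNODE HDFS_ZKFC \
           YARN_RESOURCEMANAGER YARN_NODEMANAGER; do
    printf 'export %s_USER=%s\n' "$d" "$user"
  done
}
```

The output of daemon_user_exports root can be reviewed and then pasted into the file of your choice.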
Problem 3: when starting HDFS again
First allow the root user to log in remotely over ssh:
vim /etc/ssh/sshd_config
#PermitRootLogin prohibit-password
PermitRootLogin yes
sudo service ssh restart
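Passwordless ssh login must be set up between all machines for both the root user and the hadoop user.
The detailed steps are widely documented online.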
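As a rough outline of those steps, passwordless login usually means generating a key pair once and copying the public key to every node with ssh-copy-id. The sketch below only prints the commands so they can be reviewed first; ssh_setup_commands is a made-up helper, and the key type and key file location are assumptions.

```shell
# Print the commands that set up passwordless ssh for one user.
# $1 = remote user, remaining arguments = target hosts
ssh_setup_commands() {
  user="$1"
  shift
  # Generate a key pair without a passphrase (run once per user).
  echo "ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa"
  for host in "$@"; do
    # Append the public key to the remote authorized_keys file.
    echo "ssh-copy-id $user@$host"
  done
}
```

Running ssh_setup_commands root hadoop01 hadoop02 hadoop03 hadoop04 hadoop05 lists one ssh-keygen line followed by five ssh-copy-id lines; repeat for the hadoop user.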
Problem 5: when running MapReduce (wordcount)
bin/hadoop jar /usr/software/hadoop-3.0.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.1.jar wordcount /input/words.txt /output/
The class path entries below must be added to both mapred-site.xml and yarn-site.xml.
Add to mapred-site.xml:
/usr/software/hadoop-3.0.1/etc/hadoop,
/usr/software/hadoop-3.0.1/share/hadoop/common/*,
/usr/software/hadoop-3.0.1/share/hadoop/common/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/lib/*
Add to yarn-site.xml:
/usr/software/hadoop-3.0.1/etc/hadoop,
/usr/software/hadoop-3.0.1/share/hadoop/common/*,
/usr/software/hadoop-3.0.1/share/hadoop/common/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/*,
/usr/software/hadoop-3.0.1/share/hadoop/hdfs/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/*,
/usr/software/hadoop-3.0.1/share/hadoop/mapreduce/lib/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/*,
/usr/software/hadoop-3.0.1/share/hadoop/yarn/lib/*