Preface
Setting up a Hadoop cluster is an essential step in learning big data, but I don't have enough physical machines to practice on, so I built it on virtual machines. This is my first time doing it, so I'm recording the necessary steps here. Let's get started.
Downloading from the official sites is slow, so all the download links below point to my Baidu Cloud drive.
Install VMware
VMware 10.0 download link. Installation is a standard click-through wizard, so I won't go into detail.
Install Linux (I use CentOS 6.8)
CentOS 6.8 download link. If you don't need custom partitioning, installation is basically clicking Next until it finishes; there are plenty of tutorials on Google and Baidu for the details.
After installation, configure the network to use NAT mode: right-click the installed virtual machine, choose Settings, select Network Adapter, and pick NAT mode. Then start the virtual machine and log in.
Check the IP and verify Internet access by running the following:
ifconfig
The addr shown for the eth0 interface is the IP under NAT mode.
ping www.baidu.com
Baidu may do more harm than good, but it's still handy for checking connectivity, haha.
The system uses DHCP by default, so the IP may change every time it reboots; we therefore set a static IP manually.
Edit the NIC configuration file:
vim /etc/sysconfig/network-scripts/ifcfg-eth0
My settings are as follows:
DEVICE=eth0 ## NIC name
TYPE=Ethernet
ONBOOT=yes ## start on boot
BOOTPROTO=static
HWADDR=00:0C:29:65:9C:41
IPADDR=192.168.170.170 ## static IP under NAT mode
PREFIX=24 ## corresponds to the subnet mask
GATEWAY=192.168.170.2 ## gateway
DNS1=192.168.170.2 ## DNS
After saving, restart the network service:
service network restart
Then check the IP again and ping Baidu to confirm Internet access; if it doesn't work, carefully re-check the configuration.
Add a hadoop user
useradd hadoop
passwd hadoop
- Log in:
su hadoop
Install the JDK
Hadoop runs on Java, so we need to install a JDK first.
Uninstall OpenJDK
CentOS 6.8 ships with OpenJDK, so uninstall it before installing the Sun JDK.
- Check the currently installed OpenJDK:
rpm -qa | grep jdk
This returns something like:
java-1.7.0-openjdk.x86_64 1:1.7.0.99-2.6.5.1.el6
- Remove it with yum:
yum -y remove java-1.7.0-openjdk.x86_64
Install the Sun JDK
- JDK 1.7 download link
- Download it on Linux, or download it on Windows and upload it to Linux with an FTP tool.
- Extract the JDK:
tar -zxvf jdk-7u79-linux-x64.tar.gz
- Move the extracted jdk1.7.0_79 directory to /usr/local and rename it java1.7:
mv jdk1.7.0_79 /usr/local/java1.7
- Configure environment variables
In the current user's home directory, run: vim .bashrc
and add the following:
export JAVA_HOME=/usr/local/java1.7
export CLASSPATH=.:$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
- Apply the changes:
source .bashrc
- Verify the configuration:
java -version
If you see output like the following, the configuration succeeded:
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
Install Hadoop
- Hadoop 2.7.2 download link
- Extract it and move it to /usr/local:
tar -zxvf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 /usr/local/hadoop
- Configure environment variables:
vim .bashrc
The final Java and Hadoop settings look like this:
export JAVA_HOME=/usr/local/java1.7
export CLASSPATH=$JAVA_HOME/lib/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin/:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Then run: source .bashrc
- Check whether the configuration works:
hadoop version
The following output means it is configured correctly:
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar
Configure the Hadoop distributed cluster
Change the hostname to master
vim /etc/sysconfig/network
Change it to: HOSTNAME=master
Edit the hosts file so that hostnames map to IP addresses:
vim /etc/hosts
Add the following entries:
192.168.170.170 master ## this machine's IP
192.168.170.171 slave1 ## IP of the slave node we will set up later
192.168.170.172 slave2
On the other two slave machines, change the hostname to slave1 and slave2 accordingly. If the new hostname does not take effect, rebooting the system will fix it. The hostname and hosts file must be changed on all three machines; a quick check is shown below.
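A minimal sanity check, assuming the hosts file above is in place on each node:
hostname         ## should print this node's name, e.g. master
ping -c 1 slave1 ## should resolve to 192.168.170.171
ping -c 1 slave2 ## should resolve to 192.168.170.172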
Create the following four directories under the hadoop folder (/usr/local/hadoop):
- tmp, for storing temporary files
- hdfs, for storing cluster data
- hdfs/data, for storing the actual data
- hdfs/name, for storing file system metadata
cd /usr/local/hadoop
mkdir tmp hdfs hdfs/data hdfs/name
Configure the Hadoop files
In the hadoop/etc/hadoop directory, the following 7 files need to be configured:
- slaves
- hadoop-env.sh
- yarn-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
- yarn-site.xml
Edit the slaves file
Change its contents to:
slave1
slave2
Edit hadoop-env.sh
Change the line export JAVA_HOME=${JAVA_HOME} to export JAVA_HOME=/usr/local/java1.7
Edit yarn-env.sh
Likewise, set the JAVA_HOME line to export JAVA_HOME=/usr/local/java1.7
Edit core-site.xml
Configure it as:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>
Edit mapred-site.xml
First make a copy of mapred-site.xml.template named mapred-site.xml, then edit the copy.
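For example (assuming you are in the hadoop/etc/hadoop directory):
cp mapred-site.xml.template mapred-site.xml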
Configure it as:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
Edit hdfs-site.xml
Configure it as:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/hdfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
Edit yarn-site.xml
Configure it as:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
Disable the firewall and SELinux
Make sure both are disabled, otherwise Hadoop will fail to start.
- Disable the firewall
Check the firewall status: service iptables status
If it is running, stop it: service iptables stop
- Disable SELinux
Check the SELinux status: getenforce
Disable SELinux for the current session: setenforce 0
This only lasts until the next reboot, after which SELinux is enabled again.
To disable it permanently: vim /etc/selinux/config
and set the SELinux state to disabled, as shown below.
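To make both changes stick across reboots, the usual CentOS 6 settings are:
## in /etc/selinux/config
SELINUX=disabled
## stop iptables from starting at boot (service iptables stop alone does not survive a reboot)
chkconfig iptables off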
Set up and configure the slave nodes
- We could install two more virtual machines and repeat all of the steps above, but that is tedious and time-consuming. Instead, copy the master node's VM files twice, as slave1 and slave2, and open the copies in VMware; this saves a lot of repeated work.
- Because the slave nodes are copies, the NIC information was copied too, which will keep them from getting online. So in VMware select the slave1 VM, open Settings, select the Network Adapter and remove it, then add a new network adapter, again choosing NAT mode. Do the same for slave2, then start both slave1 and slave2.
- After booting, set slave1's IP to the address reserved for it earlier on master, 192.168.170.171, and slave2's IP to 192.168.170.172 (see the sketch after this list).
- Change the hostnames of slave1 and slave2 to slave1 and slave2 respectively, following the same procedure used for master.
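A minimal sketch of the IP change on slave1, assuming the newly added adapter still shows up as eth0 (check with ifconfig -a; the MAC address below is a placeholder for the new adapter's real address):
vim /etc/sysconfig/network-scripts/ifcfg-eth0
## keep the same settings as on master, changing only these lines:
IPADDR=192.168.170.171
HWADDR=00:0C:29:XX:XX:XX ## placeholder - use the MAC of slave1's new NIC
Then restart the network: service network restart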
Set up passwordless SSH login
The Hadoop nodes need to communicate with each other. To avoid having to type a password for every connection, we configure passwordless SSH login.
- Generate a key pair on each of the three virtual machines:
ssh-keygen -t rsa
Press Enter at every prompt; by default the keys are generated in the .ssh folder under the home directory.
- Append master's public key to authorized_keys:
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
- Append the public keys of slave1 and slave2 to master's authorized_keys file.
Run on both slave nodes: ssh-copy-id -i .ssh/id_rsa.pub master
- Copy master's authorized_keys to both slave nodes:
scp .ssh/authorized_keys slave1:~/.ssh/
scp .ssh/authorized_keys slave2:~/.ssh/
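If passwordless login still prompts for a password, the most common cause is file permissions, which sshd checks strictly; the usual fix, run on every node, is:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys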
- Test
Use ssh between the three virtual machines to test: ssh master, ssh slave1, ssh slave2.
Taking slave1 as an example:
[hadoop@slave1 ~]$ ssh master
Last login: Thur Jul 20 13:16:39 2017 from slave2
[hadoop@slave1 ~]$ ssh slave2
Last login: Thur Jul 20 13:17:21 2017 from slave1
No password needed; much more convenient, isn't it?
Format the HDFS file system
OK, at this point the configuration is done. Before we can use Hadoop, though, we still have to format the HDFS file system.
The format command has changed in Hadoop 2.7.2: the old hadoop namenode -format is now invoked through the hdfs command.
Run: hdfs namenode -format
If the output near the end contains the words successfully formatted, the HDFS format succeeded.
Start the cluster
All of the start/stop scripts are in the /usr/local/hadoop/sbin directory.
Earlier versions were started with the single start-all.sh script, but I wanted to see what it actually starts. The source contains this line: "This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh", so it is deprecated and we start the daemons with start-dfs.sh and start-yarn.sh instead.
- Start DFS and check the processes
[hadoop@master ~]$ start-dfs.sh
17/07/20 13:42:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-master.out
slave1: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-slave1.out
slave2: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-slave2.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-master.out
17/07/20 13:45:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@master ~]$ jps
27742 SecondaryNameNode
27885 Jps
27574 NameNode
The warning is caused by a glibc version mismatch (the system ships glibc 2.12 while Hadoop's native library expects 2.14). Workaround: append the line log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR to the end of /usr/local/hadoop/etc/hadoop/log4j.properties, which suppresses the warning.
- Start YARN and check the processes
[hadoop@master ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-resourcemanager-master.out
slave2: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-slave2.out
slave1: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hadoop-nodemanager-slave1.out
[hadoop@master ~]$ jps
28053 ResourceManager
27742 SecondaryNameNode
28122 Jps
27574 NameNode
- Check the processes on the slave nodes
[hadoop@slave1 ~]$ jps
4603 DataNode
4932 Jps
4785 NodeManager
[hadoop@slave2 ~]$ jps
5962 Jps
5674 DataNode
5830 NodeManager
All three virtual machines are in a healthy state.
Test
Open the web UIs in a browser:
go to 192.168.170.170:8088
or 192.168.170.170:50070
If the pages come up, the cluster started successfully and you can inspect the node information there.
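An optional command-line check of the DataNodes, using the standard HDFS admin tool:
hdfs dfsadmin -report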
Verify that Hadoop is actually usable
Processes and web pages alone don't fully prove that Hadoop works, so we run the wordcount example from the bundled hadoop-mapreduce-examples-2.7.2.jar to check that Hadoop can really analyze data.
Preparation
- Create an input folder in the HDFS file system:
hdfs dfs -mkdir /input
- Upload the data files to be analyzed (a quick way to confirm the upload is shown after this list):
hdfs dfs -put /usr/local/hadoop/*.txt /input
- Start the historyserver, otherwise the next step fails with the following error:
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 33 more
Start it: mr-jobhistory-daemon.sh start historyserver
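Before launching the job, you can also confirm that the input files made it into HDFS:
hdfs dfs -ls /input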
Run the MapReduce job
Run: hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output
The full run looks like this:
[hadoop@master ~]$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output
17/07/20 15:35:55 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.170.170:8032
17/07/20 15:36:05 INFO input.FileInputFormat: Total input paths to process : 3
17/07/20 15:36:06 INFO mapreduce.JobSubmitter: number of splits:3
17/07/20 15:36:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1502431075552_0003
17/07/20 15:36:13 INFO impl.YarnClientImpl: Submitted application application_1502431075552_0003
17/07/20 15:36:13 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1502431075552_0003/
17/07/20 15:36:13 INFO mapreduce.Job: Running job: job_1502431075552_0003
17/07/20 15:38:39 INFO mapreduce.Job: Job job_1502431075552_0003 running in uber mode : false
17/07/20 15:38:39 INFO mapreduce.Job: map 0% reduce 0%
17/07/20 15:42:34 INFO mapreduce.Job: map 44% reduce 0%
17/07/20 15:42:39 INFO mapreduce.Job: map 78% reduce 0%
17/07/20 15:42:40 INFO mapreduce.Job: map 89% reduce 0%
17/07/20 15:42:42 INFO mapreduce.Job: map 100% reduce 0%
17/07/20 15:44:08 INFO mapreduce.Job: map 100% reduce 100%
17/07/20 15:44:18 INFO mapreduce.Job: Job job_1502431075552_0003 completed successfully
17/07/20 15:44:20 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=12989
FILE: Number of bytes written=495819
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=17212
HDFS: Number of bytes written=8983
HDFS: Number of read operations=12
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=781665
Total time spent by all reduces in occupied slots (ms)=59106
Total time spent by all map tasks (ms)=781665
Total time spent by all reduce tasks (ms)=59106
Total vcore-milliseconds taken by all map tasks=781665
Total vcore-milliseconds taken by all reduce tasks=59106
Total megabyte-milliseconds taken by all map tasks=800424960
Total megabyte-milliseconds taken by all reduce tasks=60524544
Map-Reduce Framework
Map input records=322
Map output records=2347
Map output bytes=24935
Map output materialized bytes=13001
Input split bytes=316
Combine input records=2347
Combine output records=897
Reduce input groups=840
Reduce shuffle bytes=13001
Reduce input records=897
Reduce output records=840
Spilled Records=1794
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=31699
CPU time spent (ms)=34140
Physical memory (bytes) snapshot=413155328
Virtual memory (bytes) snapshot=3358351360
Total committed heap usage (bytes)=364929024
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=16896
File Output Format Counters
Bytes Written=8983
View the processed data:
hdfs dfs -cat /output/part-r-00000
The output looks roughly like this:
......
only 4
or 67
or, 1
org.apache.hadoop.util.bloom.* 1
origin 1
original 2
other 9
otherwise 3
otherwise, 3
our 2
out 1
outstanding 1
own 4
owner 4
owner. 1
owner] 1
ownership 2
page" 1
part 4
patent 5
patent, 1
percent 1
perform, 1
performing 1
permission 1
permission. 1
permissions 3
permitted 2
permitted. 1
perpetual, 2
......
There's quite a lot of data...
You can also open 192.168.170.170:50070 in a browser, click Browse the file system under Utilities, and inspect the files there, or download them and view them locally.
Afterword
With that, the Hadoop cluster is set up and the wordcount example has run successfully. Time to take a break -_-||