This post collects the configuration and steps needed to set up a Hadoop environment. Some of the settings apply only to a fully distributed cluster, so keep an eye on which is which.
Linux system setup
1. Use the root account for the whole configuration/cluster-setup process.
2. Changing the hostname
1) Temporary change: hostname
The hostname command changes the machine name temporarily, but it reverts to the old value after a reboot.
hostname        # show the current hostname
hostname -i     # show the IP address the hostname resolves to
2) Permanent change: edit the system configuration file
Only editing /etc/sysconfig/network makes the new hostname stick across reboots:
HOSTNAME=hadoop1
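A sketch of both methods (the sed edit targets the CentOS 6-style /etc/sysconfig/network shown above; on systemd distributions such as CentOS 7, hostnamectl does the same job):
hostname hadoop1                                                   # temporary; lost on reboot
sed -i 's/^HOSTNAME=.*/HOSTNAME=hadoop1/' /etc/sysconfig/network   # permanent (CentOS 6 style)
# hostnamectl set-hostname hadoop1                                 # systemd equivalent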
3. Hostname resolution on Windows and Linux
1) On Windows the hosts file is C:\Windows\System32\drivers\etc\hosts:
192.168.9.103 hadoop1
192.168.9.104 hadoop2
2) On Linux the hosts file is /etc/hosts:
192.168.9.103 hadoop1
192.168.9.104 hadoop2
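After editing the hosts files, a quick check on the Linux side that the names resolve as intended:
ping -c 1 hadoop1    # should answer from 192.168.9.103
ping -c 1 hadoop2    # should answer from 192.168.9.104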
4. Disable the firewall
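No commands are given above; on a CentOS 6-style system the usual ones are (a sketch; CentOS 7+ uses firewalld instead):
service iptables stop     # stop the firewall now
chkconfig iptables off    # keep it off after reboots
# systemctl stop firewalld && systemctl disable firewalld    # CentOS 7+ equivalent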
5. Configure passwordless SSH login
Generate a key pair on the client:
ssh-keygen -t rsa
Copy the public key to the target host, where it is appended to the authorized-keys list:
cat id_rsa.pub >> authorized_keys
The target host must also set the permissions of the authorized-keys file to 600:
chmod 600 authorized_keys
In the user's home directory, the whole procedure is these 5 steps:
rm -rf .ssh/
ssh-keygen -t rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
chmod 700 .ssh
chmod 600 .ssh/authorized_keys
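As an alternative, ssh-copy-id wraps the copy and the permission fixes into one command; a sketch, with hadoop2 standing in for the target host:
ssh-copy-id -i ~/.ssh/id_rsa.pub root@hadoop2    # appends the key to hadoop2's authorized_keys
ssh root@hadoop2                                 # should now log in without a password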
6. Configure the JDK environment
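No detail is given here; a minimal /etc/profile fragment, assuming the JDK path used later in this post:
export JAVA_HOME=/usr/local/jdk1.8.0_162
export PATH=$JAVA_HOME/bin:$PATH
# verify with: source /etc/profile && java -version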
7. Configure the NTP service on all Hadoop nodes for time synchronization
All hosts in the cluster must keep their clocks in sync; a large skew causes all kinds of problems. The master node acts as the NTP server, synchronizing with an external time source, and then provides time service to all DataNode nodes, which in turn synchronize against the master.
Install the package on every node:
yum install ntp
Then enable it at boot:
chkconfig ntpd on
Check that the setting took effect:
chkconfig --list ntpd
1) Master node configuration
Before configuring, synchronize once by hand with ntpdate, so that this machine is not too far off from the time source (otherwise ntpd cannot sync). Here 127.127.1.0, NTP's pseudo-address for the local clock, serves as the time source:
ntpdate -u 127.127.1.0
NTP has a single configuration file, /etc/ntp.conf.
The relevant settings:
driftfile /var/lib/ntp/drift
restrict 127.0.0.1
restrict -6 ::1
restrict default nomodify notrap
server 127.127.1.0 prefer
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys
2) Configure the NTP clients (all DataNode nodes)
The DataNodes' NTP configuration:
driftfile /var/lib/ntp/drift
restrict 127.0.0.1
restrict -6 ::1
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
# the master node's hostname or IP goes here
server hadoop1
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys
Start the NTP service:
service ntpd start
Log in to the master machine and run:
ntpdate -u 127.127.1.0
Log in to every slave machine and synchronize against the master:
ntpdate -u hadoop1
Check that it works with:
watch ntpq -p
The steps above are the basic environment setup for Hadoop. Some of them only matter for an actual distributed build; NTP time sync, for instance, is largely irrelevant on a single machine, where the clock is trivially consistent.
Now for the pseudo-distributed Hadoop setup itself:
1. Download Hadoop
2. Download the JDK and set up the JDK environment
3. Configure passwordless login
If ssh localhost prompts for a password, passwordless login is not configured yet; set it up as follows:
[root@izbp1dmlbagds9s70r8luxz ~]# cd ~/.ssh/
[root@izbp1dmlbagds9s70r8luxz .ssh]# ls
authorized_keys known_hosts
[root@izbp1dmlbagds9s70r8luxz .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:24Af7GQnrJ9ylyyhtzfkZWmomcJQtYWJHzav3oxV60g root@izbp1dmlbagds9s70r8luxz
The key's randomart image is:
+---[RSA 2048]----+
| . o |
| . B . |
| + * |
| .+o . . |
| .. S..o o |
| . *oBE * |
| o.o+#.B |
| =.X.@ . |
| =+= . |
+----[SHA256]-----+
[root@izbp1dmlbagds9s70r8luxz .ssh]# ls
authorized_keys id_rsa id_rsa.pub known_hosts
[root@izbp1dmlbagds9s70r8luxz .ssh]# cat id_rsa.pub >>authorized_keys
[root@izbp1dmlbagds9s70r8luxz .ssh]# ll
total 16
-rw------- 1 root root 410 Apr 5 21:47 authorized_keys
-rw------- 1 root root 1679 Apr 5 21:46 id_rsa
-rw-r--r-- 1 root root 410 Apr 5 21:46 id_rsa.pub
-rw-r--r-- 1 root root 171 Apr 5 21:41 known_hosts
[root@izbp1dmlbagds9s70r8luxz .ssh]# chmod 600 authorized_keys
[root@izbp1dmlbagds9s70r8luxz .ssh]#
That is the full transcript; it really boils down to a few steps:
cd ~/.ssh/
ssh-keygen -t rsa                    # press Enter at every prompt
cat id_rsa.pub >> authorized_keys
chmod 700 ~/.ssh                     # directory permissions
chmod 600 authorized_keys
After these steps, logging in again no longer requires a password:
[root@izbp1dmlbagds9s70r8luxz .ssh]# ssh localhost
Last login: Thu Apr 5 21:44:33 2018 from 127.0.0.1
Welcome to Alibaba Cloud Elastic Compute Service !
[root@izbp1dmlbagds9s70r8luxz ~]#
The command executed above is ssh localhost.
4. Extract Hadoop to a suitable directory
For example: /usr/local/hadoop2.7.5
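The extraction itself, assuming the downloaded tarball is named hadoop-2.7.5.tar.gz:
tar -zxf hadoop-2.7.5.tar.gz -C /usr/local/
mv /usr/local/hadoop-2.7.5 /usr/local/hadoop2.7.5    # match the directory name used below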
5. Configure the Hadoop environment variables
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]# vi /etc/profile
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]# source /etc/profile
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]#
The relevant part of the profile:
export HADOOP_HOME=/usr/local/hadoop2.7.5
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:/usr/local/mysql5.7.21/bin:$NGINX_HOME:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
6. Edit hadoop-env.sh and set JAVA_HOME
The file is under /usr/local/hadoop2.7.5/etc/hadoop:
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/jdk1.8.0_162
7. Run a test command
# bin/hadoop    # run the hadoop command as a smoke test
The result:
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]# bin/hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]#
8. Configure the slaves file to add slave nodes
The file is under /usr/local/hadoop2.7.5/etc/hadoop.
Its content:
localhost
With multiple nodes it would be configured like this:
hadoop1
hadoop2
hadoop3
hadoop1, hadoop2 and hadoop3 above are hostnames. Since this is a single-machine pseudo-distributed setup, the default localhost is enough and nothing else needs to change.
9. Configure core-site.xml
core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop2.7.5/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.16.2.243:9000</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>izbp1dmlbagds9s70r8luxz:2181,izbp1dmlbagds9s70r8luxz:2182,izbp1dmlbagds9s70r8luxz:2183</value>
    </property>
</configuration>
10. Configure hdfs-site.xml
hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop2.7.5/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop2.7.5/tmp/dfs/data</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>
Many other properties are commented out in the actual file; they are only needed for a real distributed build. Only a handful are used here, so just keep the rest in mind for later.
11. Configure mapred-site.xml
mapred-site.xml does not exist in the unpacked distribution; copy mapred-site.xml.template and modify it:
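The copy itself, run from the Hadoop configuration directory:
cd /usr/local/hadoop2.7.5/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
Then edit the new file to contain: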
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
12. Configure yarn-site.xml
yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
13. For a distributed cluster, copy the configured Hadoop directory to the slave machines
Command form: scp -r <directory> <remote user, e.g. root>@<ip or hostname>:<remote directory, e.g. /usr/local>
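For example, with the hostnames used earlier in this post:
scp -r /usr/local/hadoop2.7.5 root@hadoop2:/usr/local/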
A pseudo-distributed setup skips this step.
14. Start the journalnodes (run on hadoop1, hadoop2 and hadoop3; distributed environment only)
cd into the Hadoop install directory, then:
sbin/hadoop-daemon.sh start journalnode
Run jps and check for the JournalNode process.
This step is not needed yet; it is recorded here for later use.
15. Format HDFS
In Hadoop's bin directory, run:
hdfs namenode -format
Formatting generates files under the hadoop.tmp.dir configured in core-site.xml; in cluster mode, copy the generated files to the other machines with scp -r.
16. Format ZKFC
ZooKeeper formatting command: hdfs zkfc -formatZK
17. Start HDFS
Command to start HDFS: sbin/start-dfs.sh
18. Start YARN
sbin/start-yarn.sh
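At this point jps should list the pseudo-distributed daemons (PIDs will differ):
jps
# expected: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, Jps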
19. Verify HDFS (HA)
HDFS web UI:
http://ip:50070/
ResourceManager web UI:
http://hadoop:8088
20. Upload a file to HDFS
Upload: hadoop fs -put /etc/profile /
List the uploaded file:
hdfs dfs -ls /
Found 1 items
-rw-r--r-- 1 root supergroup 2096 2018-04-07 10:20 /profile
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]#
Command to start the NameNode on its own:
sbin/hadoop-daemon.sh start namenode
21. Verify YARN
Run the WordCount program from the demos that ship with Hadoop.
Command:
hadoop jar /usr/local/hadoop2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount /profile /out
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]# hadoop jar /usr/local/hadoop2.7.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar wordcount /profile /out
18/04/07 10:34:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/07 10:34:33 INFO input.FileInputFormat: Total input paths to process : 1
18/04/07 10:34:33 INFO mapreduce.JobSubmitter: number of splits:1
18/04/07 10:34:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523068042585_0001
18/04/07 10:34:33 INFO impl.YarnClientImpl: Submitted application application_1523068042585_0001
18/04/07 10:34:34 INFO mapreduce.Job: The url to track the job: http://izbp1dmlbagds9s70r8luxz:8088/proxy/application_1523068042585_0001/
18/04/07 10:34:34 INFO mapreduce.Job: Running job: job_1523068042585_0001
18/04/07 10:34:46 INFO mapreduce.Job: Job job_1523068042585_0001 running in uber mode : false
18/04/07 10:34:46 INFO mapreduce.Job: map 0% reduce 0%
18/04/07 10:34:53 INFO mapreduce.Job: map 100% reduce 0%
18/04/07 10:35:00 INFO mapreduce.Job: map 100% reduce 100%
18/04/07 10:35:01 INFO mapreduce.Job: Job job_1523068042585_0001 completed successfully
18/04/07 10:35:02 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=2366
FILE: Number of bytes written=248633
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2193
HDFS: Number of bytes written=1725
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4424
Total time spent by all reduces in occupied slots (ms)=5308
Total time spent by all map tasks (ms)=4424
Total time spent by all reduce tasks (ms)=5308
Total vcore-milliseconds taken by all map tasks=4424
Total vcore-milliseconds taken by all reduce tasks=5308
Total megabyte-milliseconds taken by all map tasks=4530176
Total megabyte-milliseconds taken by all reduce tasks=5435392
Map-Reduce Framework
Map input records=86
Map output records=262
Map output bytes=2891
Map output materialized bytes=2366
Input split bytes=97
Combine input records=262
Combine output records=159
Reduce input groups=159
Reduce shuffle bytes=2366
Reduce input records=159
Reduce output records=159
Spilled Records=318
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=197
CPU time spent (ms)=1370
Physical memory (bytes) snapshot=312586240
Virtual memory (bytes) snapshot=4208975872
Total committed heap usage (bytes)=165810176
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2096
File Output Format Counters
Bytes Written=1725
[root@izbp1dmlbagds9s70r8luxz hadoop2.7.5]#
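To inspect the WordCount result (MapReduce writes the reducer output as part files under /out):
hdfs dfs -ls /out                  # should show _SUCCESS and part-r-00000
hdfs dfs -cat /out/part-r-00000    # the word counts computed from /etc/profile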
Summary: it took quite a bit of time and a few pitfalls, but the single-machine pseudo-distributed Hadoop environment is up. This post does not explain every step in depth; it only outlines the main steps and configuration of the build. One issue that came up on a cloud server: mind the provider's port filtering (security group) rules, or the web UIs cannot be reached from outside. The first step is the hardest; deeper Hadoop notes will follow.
Feel free to join QQ group 331227121 to learn and discuss.