Our production environment still runs on traditional VMs, so in day-to-day development Docker is mostly used to spin up local middleware such as Redis and Kafka. I have to say Docker has held up really well so far, mainly because it is far lighter than VMware: with VMware, running four or five VMs is about the practical limit, while with Docker the constraints are much looser. On top of that, a configured environment can be exported as an image for reuse elsewhere, which is very convenient.
I am currently preparing to review and summarize Hadoop and Spark, so I wanted to set up a cluster. I am on a Mac with no dual boot and no desire to install a VM, so I built the cluster with Docker instead, which turned out to be very convenient. The setup process is recorded below.
There are two main steps: first, build the base environment and export it as an image; second, use that shared image to set up the Hadoop cluster. Step by step:
>docker pull centos:7
>docker run -dit --name DockerCentos centos:7 /bin/bash
>docker exec DockerCentos mkdir -p /usr/local/app/jdk
>docker cp ~/Downloads/jdk-8u231-linux-x64.tar.gz containerId:/usr/local/app/jdk
yum install -y ntpdate
yum install -y openssh-clients
yum install -y openssh-server
yum install -y net-tools
In /etc/hosts, comment out the `127.0.0.1 localhost` line, otherwise hadoop commands fail with an error I never pinned down.
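The edit can also be scripted; a sketch below works on a local demo copy (inside the container you would point `HOSTS_FILE` at /etc/hosts itself):

```shell
# Demo: comment out the "127.0.0.1 localhost" line in a hosts file.
# HOSTS_FILE is a local demo copy here; set it to /etc/hosts in the container.
HOSTS_FILE=./hosts.demo
printf '127.0.0.1 localhost\n172.17.0.2 master\n' > "$HOSTS_FILE"  # demo content
sed -i 's/^127\.0\.0\.1/#127.0.0.1/' "$HOSTS_FILE"
cat "$HOSTS_FILE"
```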
yum install -y which
>tar -xzvf jdk-8u231-linux-x64.tar.gz
>vi /etc/profile
export JAVA_HOME=/usr/local/app/jdk/jdk1.8.0_231
export PATH=$JAVA_HOME/bin:$PATH
>source /etc/profile
>java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
>docker cp hadoop-2.7.1/hadoop-2.7.1.tar.gz containerId:/usr/local/app/hadoop/
>tar -xzvf hadoop-2.7.1.tar.gz
>vi /etc/profile
export HADOOP_HOME=/usr/local/app/hadoop/hadoop-2.7.1
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH
>source /etc/profile
>hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/local/app/hadoop/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
Create the directories Hadoop will use for temporary data and for NameNode/DataNode storage:
>mkdir -p /usr/local/app/hadoop/tmp /usr/local/app/hadoop/namenode_data /usr/local/app/hadoop/datanode_data
– Edit core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
– Edit hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/app/hadoop/namenode_data</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/app/hadoop/datanode_data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
– Edit mapred-site.xml (copy it from the template first)
>cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
– Edit yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20480</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>
</configuration>
– Add the cluster nodes to /etc/hosts
172.17.0.2 master
172.17.0.3 slave01
172.17.0.4 slave02
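These entries can also be appended idempotently; a sketch (writing to a local demo file here — use /etc/hosts inside each container):

```shell
# Append each cluster entry only if it is not already present.
HOSTS_FILE=./hosts.cluster.demo   # set to /etc/hosts inside the containers
touch "$HOSTS_FILE"
for entry in "172.17.0.2 master" "172.17.0.3 slave01" "172.17.0.4 slave02"; do
  grep -qxF "$entry" "$HOSTS_FILE" || echo "$entry" >> "$HOSTS_FILE"
done
```

Running it a second time adds nothing, so it is safe to re-run after container restarts.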
# Set a root password (ssh-copy-id will need it later)
>passwd root
docker commit containerId centos/hadoopinstalled
docker run --privileged -itd -h master --name master imgId init
docker run --privileged -itd -h slave01 --name slave01 imgId init
docker run --privileged -itd -h slave02 --name slave02 imgId init
There are two pitfalls here. First, --privileged and the trailing init are both required, otherwise systemctl will not work inside the containers. Second, -h master/slave01/slave02 must be given to set each container's hostname; editing /etc/hostname with vi does not stick, because it gets overwritten on restart.
# Generate a key pair (no passphrase)
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# Copy the key to every other node: master -> slave01/slave02, slave01 -> master/slave02, slave02 -> master/slave01
ssh-copy-id -i ~/.ssh/id_dsa.pub root@slave01
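The pairwise copying can be expressed as a loop; the sketch below only prints the commands (drop the `echo` to execute — each one prompts for the root password set earlier):

```shell
# Print the ssh-copy-id command for every peer node, skipping the local one.
for h in master slave01 slave02; do
  [ "$h" = "$(hostname)" ] && continue   # don't copy to ourselves
  echo ssh-copy-id -i ~/.ssh/id_dsa.pub root@"$h"
done
```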
# List the workers in the slaves file
vi $HADOOP_HOME/etc/hadoop/slaves
slave01
slave02
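Equivalently, the file can be written non-interactively; a sketch (a local demo path is used here — on master it would be $HADOOP_HOME/etc/hadoop/slaves):

```shell
# Write the worker hostnames to the slaves file.
SLAVES_FILE=./slaves.demo   # use $HADOOP_HOME/etc/hadoop/slaves on master
cat > "$SLAVES_FILE" <<'EOF'
slave01
slave02
EOF
```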
bin/hdfs namenode -format
sbin/start-all.sh
#创建目录
hdfs dfs -mkdir -p /user/hadoop/input
#上传本地文件
hdfs dfs -put local.text /user/hadoop/input
#运行mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hadoop/input /user/hadoop/output01
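Not part of the cluster setup, but to sanity-check what the wordcount example computes, the same counting can be done with a plain shell pipeline on a small local file (the file name here is just a demo):

```shell
# Local equivalent of wordcount: one word per line, then count duplicates.
printf 'hello world\nhello hadoop\n' > wc.demo
tr -s ' ' '\n' < wc.demo | sort | uniq -c | sort -rn
```

The first line of output shows `hello` counted twice, matching what the MapReduce job would emit for that word.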
If you see output like the following, the job ran successfully:
19/11/29 13:28:06 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
19/11/29 13:28:08 INFO input.FileInputFormat: Total input paths to process : 1
19/11/29 13:28:08 INFO mapreduce.JobSubmitter: number of splits:1
19/11/29 13:28:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1575033976424_0001
19/11/29 13:28:09 INFO impl.YarnClientImpl: Submitted application application_1575033976424_0001
19/11/29 13:28:09 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1575033976424_0001/
19/11/29 13:28:09 INFO mapreduce.Job: Running job: job_1575033976424_0001
19/11/29 13:28:22 INFO mapreduce.Job: Job job_1575033976424_0001 running in uber mode : false
19/11/29 13:28:22 INFO mapreduce.Job: map 0% reduce 0%
19/11/29 13:28:31 INFO mapreduce.Job: map 100% reduce 0%
19/11/29 13:28:39 INFO mapreduce.Job: map 100% reduce 100%
19/11/29 13:28:40 INFO mapreduce.Job: Job job_1575033976424_0001 completed successfully
19/11/29 13:28:40 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=72
FILE: Number of bytes written=230895
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=147
HDFS: Number of bytes written=46
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6980
Total time spent by all reduces in occupied slots (ms)=4441
Total time spent by all map tasks (ms)=6980
Total time spent by all reduce tasks (ms)=4441
Total vcore-seconds taken by all map tasks=6980
Total vcore-seconds taken by all reduce tasks=4441
Total megabyte-seconds taken by all map tasks=7147520
Total megabyte-seconds taken by all reduce tasks=4547584
Map-Reduce Framework
Map input records=4
Map output records=5
Map output bytes=56
Map output materialized bytes=72
Input split bytes=110
Combine input records=5
Combine output records=5
Reduce input groups=5
Reduce shuffle bytes=72
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=103
CPU time spent (ms)=1650
Physical memory (bytes) snapshot=400334848
Virtual memory (bytes) snapshot=3895169024
Total committed heap usage (bytes)=275775488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=37
File Output Format Counters
Bytes Written=46