Setting up a Hadoop cluster with Docker

Our production systems still run on traditional VMs, so in day-to-day development Docker is mostly used to spin up local middleware such as Redis and Kafka. I have to say Docker has been great so far. Above all, it is far lighter than VMware: with VMware, running four or five machines is about the limit, whereas with Docker the constraints are much looser. On top of that, a configured environment can be exported as an image and reused elsewhere, which is very convenient.

I am currently reviewing and summarizing Hadoop and Spark, so I wanted a cluster to work with. I'm on a Mac with no dual boot and no desire to install VMs, so I built the cluster with Docker instead. It turned out to be very convenient; the whole process is recorded below.

There are two main phases: first build the base environment and export it as an image, then use that shared image to assemble the Hadoop cluster. Step by step:

  • Pull a fresh CentOS image from Docker Hub
>docker pull centos:7
  • Start the container
>docker run -dit --name DockerCentos centos:7 /bin/bash
  • Download the JDK and copy it into the container (download it locally in a separate shell, so that if the container dies you don't have to download it again)
>docker cp ~/Downloads/jdk-8u231-linux-x64.tar.gz containerId:/usr/local/app/jdk
  • Install the time-sync tool; it may come in handy
yum install -y ntpdate
  • Install SSH
yum install openssh-clients
yum install openssh-server
  • Install the network tools, otherwise ifconfig is unavailable
yum install net-tools -y
  • Fix local name resolution
Comment out the 127.0.0.1 localhost line; otherwise, for reasons I never pinned down, hadoop commands fail.
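A minimal sketch of that edit, run against a copy of /etc/hosts first so the result can be inspected before overwriting the real file:

```shell
# Comment out the "127.0.0.1 localhost" line; shown here on a copy of
# /etc/hosts so the change can be reviewed before applying it for real.
cp /etc/hosts /tmp/hosts.new
sed -i 's/^127\.0\.0\.1/# 127.0.0.1/' /tmp/hosts.new
```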
  • Install which; the image inexplicably lacks the which command, and hadoop commands fail without it
yum install -y which
  • Configure the JDK
>tar -xvf jdk-8u231-linux-x64.tar.gz
>vi /etc/profile
export JAVA_HOME=/usr/local/app/jdk/jdk1.8.0_231
export PATH=$JAVA_HOME/bin:$PATH

>source /etc/profile
>java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
  • Download the Hadoop tarball (in another shell, onto the local machine)
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
  • The official mirror is painfully slow; go watch a movie and come back.
  • Copy it into the container
>docker cp hadoop-2.7.1.tar.gz containerId:/usr/local/app/hadoop/
  • Configure Hadoop
>tar -xzvf hadoop-2.7.1.tar.gz
>vi /etc/profile
export HADOOP_HOME=/usr/local/app/hadoop/hadoop-2.7.1
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH

>source /etc/profile
>hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/local/app/hadoop/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
  • Create the working directories
/usr/local/app/hadoop/tmp
/usr/local/app/hadoop/namenode_data
/usr/local/app/hadoop/datanode_data
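The three directories can be created in one shot:

```shell
# create the hadoop working directories listed above
mkdir -p /usr/local/app/hadoop/tmp \
         /usr/local/app/hadoop/namenode_data \
         /usr/local/app/hadoop/datanode_data
```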
  • Edit core-site.xml
<configuration>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>file:/usr/local/app/hadoop/tmp</value>
		<description>A base for other temporary directories.</description>
	</property>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
</configuration>
  • Edit hdfs-site.xml
<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/usr/local/app/hadoop/namenode_data</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/usr/local/app/hadoop/datanode_data</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>3</value>
	</property>
</configuration>
  • Edit mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

  • Edit yarn-site.xml
<configuration>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>master</value>
	</property>
	<property>
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>20480</value>
	</property>
	<property>
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>2048</value>
	</property>
	<property>
		<name>yarn.nodemanager.vmem-pmem-ratio</name>
		<value>2.1</value>
	</property>
</configuration>
  • Configure /etc/hosts
172.17.0.2 master
172.17.0.3 slave01
172.17.0.4 slave02
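The entries can be appended in one go. The IPs assume the default docker bridge network hands out addresses in container start order, master first; check with ifconfig if yours differ.

```shell
# append the cluster name mappings to /etc/hosts
# (IPs assume the default docker bridge, containers started in order)
cat >> /etc/hosts <<'EOF'
172.17.0.2 master
172.17.0.3 slave01
172.17.0.4 slave02
EOF
```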
  • Change the root password
>passwd root
  • Export the image (from another shell; do not stop the current container, or everything configured so far may be lost)
docker commit containerId centos/hadoopinstalled
  • Start three instances: master, slave01, and slave02
docker run --privileged -itd -h master --name master imgId init
docker run --privileged -itd -h slave01 --name slave01 imgId init
docker run --privileged -itd -h slave02 --name slave02 imgId init

There are two pitfalls here. First, --privileged and the trailing init are mandatory;
otherwise systemctl will not work inside the container.

Second, -h master/slave01/slave02 must be specified; it sets the hostname inside the container.
Editing /etc/hostname with vi does not stick, because the file is overwritten on restart.
  • Set up passwordless SSH between the machines
# generate a DSA key pair
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# copy the key to every other node: master -> slave01/slave02, slave01 -> master/slave02, slave02 -> master/slave01
ssh-copy-id -i ~/.ssh/id_dsa.pub root@<target-host>
  • Edit the slaves file on master
vi etc/hadoop/slaves
slave01
slave02
  • Format HDFS
bin/hdfs namenode -format
  • Start the cluster (HDFS and YARN)
sbin/start-all.sh
  • Test by running wordcount
# create the input directory
hdfs dfs -mkdir -p /user/hadoop/input
# upload a local file
hdfs dfs -put local.text /user/hadoop/input
# run the mapreduce job
hadoop jar hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hadoop/input /user/hadoop/output01


If you see output like the following, the job ran successfully:
19/11/29 13:28:06 INFO client.RMProxy: Connecting to ResourceManager at master/172.17.0.2:8032
19/11/29 13:28:08 INFO input.FileInputFormat: Total input paths to process : 1
19/11/29 13:28:08 INFO mapreduce.JobSubmitter: number of splits:1
19/11/29 13:28:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1575033976424_0001
19/11/29 13:28:09 INFO impl.YarnClientImpl: Submitted application application_1575033976424_0001
19/11/29 13:28:09 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1575033976424_0001/
19/11/29 13:28:09 INFO mapreduce.Job: Running job: job_1575033976424_0001
19/11/29 13:28:22 INFO mapreduce.Job: Job job_1575033976424_0001 running in uber mode : false
19/11/29 13:28:22 INFO mapreduce.Job:  map 0% reduce 0%
19/11/29 13:28:31 INFO mapreduce.Job:  map 100% reduce 0%
19/11/29 13:28:39 INFO mapreduce.Job:  map 100% reduce 100%
19/11/29 13:28:40 INFO mapreduce.Job: Job job_1575033976424_0001 completed successfully
19/11/29 13:28:40 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=72
		FILE: Number of bytes written=230895
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=147
		HDFS: Number of bytes written=46
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=6980
		Total time spent by all reduces in occupied slots (ms)=4441
		Total time spent by all map tasks (ms)=6980
		Total time spent by all reduce tasks (ms)=4441
		Total vcore-seconds taken by all map tasks=6980
		Total vcore-seconds taken by all reduce tasks=4441
		Total megabyte-seconds taken by all map tasks=7147520
		Total megabyte-seconds taken by all reduce tasks=4547584
	Map-Reduce Framework
		Map input records=4
		Map output records=5
		Map output bytes=56
		Map output materialized bytes=72
		Input split bytes=110
		Combine input records=5
		Combine output records=5
		Reduce input groups=5
		Reduce shuffle bytes=72
		Reduce input records=5
		Reduce output records=5
		Spilled Records=10
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=103
		CPU time spent (ms)=1650
		Physical memory (bytes) snapshot=400334848
		Virtual memory (bytes) snapshot=3895169024
		Total committed heap usage (bytes)=275775488
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=37
	File Output Format Counters 
		Bytes Written=46
