This should count as the first article in my big data series. The road ahead is long; I hadn't planned to move toward big data this soon, but a project here requires it, so learn it I must. This article is decidedly not beginner-friendly: there are quite a few prerequisites. At the very least you need Docker installed, and you should know the common Docker commands and how to use Dockerfiles and docker-compose. My "not-so-universal Python programmer leveling guide" has the relevant learning material. The project lives at https://github.com/Niracler/bigdata-exercise
This article mainly covers the details of building a pseudo-distributed Hadoop cluster with Docker. If that sounds like too much trouble, there is a simpler route, which I will go through first, so you can taste success before digging into the details.
Clone this project first; all of the Dockerfiles below are adapted from it:
$git clone https://github.com/Niracler/bigdata-exercise.git
$cd bigdata-exercise/docker-hadoop/
Start it up:
$sudo su
$docker-compose -f docker-compose-local.yml up -d
Open http://localhost:50070/ and you should see the NameNode web UI.
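Before opening the browser you can also do a quick sanity check that all seven containers defined in the compose file (namenode, resourcemanager, historyserver, nodemanager1 and datanode1-3) actually came up:
$docker ps --format "table {{.Names}}\t{{.Status}}"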
Enter the namenode container:
$docker exec -it namenode bash
Upload a file as a test:
$touch test
$hdfs dfs -put test /
$hdfs dfs -ls /
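If you want a bit more confirmation, here are two optional checks using standard HDFS commands; the first prints the file back (empty here), the second should report the three datanodes defined in the compose file further down:
$hdfs dfs -cat /test
$hdfs dfsadmin -report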
And then? That is it for the quick version. Below are the details of the Dockerfiles and the other files.
Building the images
Directory structure
You need to be working inside the following directory structure first.
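It looks roughly like this (reconstructed from the build commands and the compose file further down; only files that actually appear in this article are shown):
docker-hadoop/
├── base/
│   ├── Dockerfile
│   └── entrypoint.sh
├── namenode/
│   ├── Dockerfile
│   └── run.sh
├── datanode/
│   ├── Dockerfile
│   └── run.sh
├── nodemanager/
│   ├── Dockerfile
│   └── run.sh
├── resourcemanager/
│   ├── Dockerfile
│   └── run.sh
├── historyserver/
│   ├── Dockerfile
│   └── run.sh
├── hadoop.env
└── docker-compose-local.yml
The data/ directory used by the volume mounts is created on the host when you first bring the cluster up.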
Building the hadoop-base image
The hadoop-base Dockerfile. All of the later Dockerfiles are built on top of this image; its main job is to provide the JDK (via the openjdk:8 base image) and to download Hadoop.
FROM openjdk:8
MAINTAINER Niracler
# Download and install Hadoop from http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
ENV HADOOP_VERSION 2.7.7
ENV HADOOP_URL http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
# Extract to the target location, then delete the tarball
RUN set -x \
&& curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \
&& tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
&& rm /tmp/hadoop.tar.gz*
RUN ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop
RUN cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml
RUN mkdir /opt/hadoop-$HADOOP_VERSION/logs
RUN mkdir /hadoop-data
# Set environment variables
ENV HADOOP_PREFIX=/opt/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=/etc/hadoop
ENV MULTIHOMED_NETWORK=1
ENV USER=root
ENV PATH $HADOOP_PREFIX/bin/:$PATH
# Add the entrypoint script
ADD entrypoint.sh /entrypoint.sh
RUN chmod a+x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
The startup script entrypoint.sh:
#!/bin/bash
# Set some sensible defaults
export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020}
function addProperty() {
local path=$1
local name=$2
local value=$3
local entry="<property><name>$name</name><value>${value}</value></property>"
local escapedEntry=$(echo $entry | sed 's/\//\\\//g')
sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path
}
function configure() {
local path=$1
local module=$2
local envPrefix=$3
local var
local value
echo "Configuring $module"
for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do
name=`echo ${c} | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'`
var="${envPrefix}_${c}"
value=${!var}
echo " - Setting $name=$value"
addProperty /etc/hadoop/$module-site.xml $name "$value"
done
}
configure /etc/hadoop/core-site.xml core CORE_CONF
configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF
configure /etc/hadoop/yarn-site.xml yarn YARN_CONF
configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF
configure /etc/hadoop/kms-site.xml kms KMS_CONF
if [ "$MULTIHOMED_NETWORK" = "1" ]; then
echo "Configuring for multihomed network"
# HDFS
addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0
addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0
addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0
addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0
addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true
addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true
# YARN
addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
# MAPRED
addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
fi
if [ -n "$GANGLIA_HOST" ]; then
mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig
mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig
for module in mapred jvm rpc ugi; do
echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31"
echo "$module.period=10"
echo "$module.servers=$GANGLIA_HOST:8649"
done > /etc/hadoop/hadoop-metrics.properties
for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do
echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31"
echo "$module.sink.ganglia.period=10"
echo "$module.sink.ganglia.supportsparse=true"
echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both"
echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40"
echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649"
done > /etc/hadoop/hadoop-metrics2.properties
fi
exec "$@"
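The only non-obvious part of this script is how an environment variable name becomes a Hadoop property name: the perl one-liner in configure() turns triple underscores into dashes, double underscores into literal underscores, and single underscores into dots, and addProperty() then splices the resulting <property> element in just before the closing </configuration> tag. A minimal sketch of the same name translation, runnable on its own (it only assumes perl, which the openjdk:8 image ships with):
# the part of the variable name after the YARN_CONF_ / HDFS_CONF_ / ... prefix
c="yarn_log___aggregation___enable"
# same substitutions as configure(): ___ -> -, __ -> _, _ -> .
echo "$c" | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'
# prints: yarn.log-aggregation-enable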
Build the hadoop-base image:
$docker build -t="hadoop-base" ./base
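As an optional smoke test of the base image, you can run hadoop version through the entrypoint; after the configuration chatter from entrypoint.sh it should print the Hadoop 2.7.7 version banner:
$docker run --rm hadoop-base hadoop version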
Building hadoop-namenode
The hadoop-namenode Dockerfile:
FROM hadoop-base
MAINTAINER niracler
HEALTHCHECK CMD curl -f http://localhost:50070/ || exit 1
ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/name
RUN mkdir -p /hadoop/dfs/name
VOLUME /hadoop/dfs/name
ADD run.sh /run.sh
RUN chmod a+x /run.sh
EXPOSE 50070
CMD ["/run.sh"]
The startup script run.sh:
#!/bin/bash
namedir=`echo $HDFS_CONF_dfs_namenode_name_dir | perl -pe 's#file://##'`
if [ ! -d $namedir ]; then
echo "Namenode name directory not found: $namedir"
exit 2
fi
if [ -z "$CLUSTER_NAME" ]; then
echo "Cluster name not specified"
exit 2
fi
if [ "`ls -A $namedir`" == "" ]; then
echo "Formatting namenode name directory: $namedir"
$HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode -format $CLUSTER_NAME
fi
$HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode
Build the hadoop-namenode image:
$docker build -t="hadoop-namenode" ./namenode
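The compose file at the end wires everything together, but if you just want to poke at this one image, a rough standalone run looks like this (CLUSTER_NAME is required by run.sh, and the port mapping matches the EXPOSE above):
$docker run -d --name namenode -e CLUSTER_NAME=test -p 50070:50070 hadoop-namenode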
Building hadoop-datanode
The hadoop-datanode Dockerfile:
FROM hadoop-base
MAINTAINER niracler
HEALTHCHECK CMD curl -f http://localhost:50075/ || exit 1
ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/data
RUN mkdir -p /hadoop/dfs/data
VOLUME /hadoop/dfs/data
ADD run.sh /run.sh
RUN chmod a+x /run.sh
EXPOSE 50075
CMD ["/run.sh"]
The startup script run.sh:
#!/bin/bash
datadir=`echo $HDFS_CONF_dfs_datanode_data_dir | perl -pe 's#file://##'`
if [ ! -d $datadir ]; then
echo "Datanode data directory not found: $datadir"
exit 2
fi
$HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR datanode
Build the hadoop-datanode image:
$docker build -t="hadoop-datanode" ./datanode
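A datanode only makes sense if it can reach a namenode. A rough standalone sketch, reusing the namenode container started in the previous section and the same CORE_CONF_fs_defaultFS override that hadoop.env sets later:
$docker run -d --name datanode1 --link namenode -e CORE_CONF_fs_defaultFS=hdfs://namenode:8020 hadoop-datanode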
Building hadoop-nodemanager
The hadoop-nodemanager Dockerfile:
FROM hadoop-base
MAINTAINER Niracler
HEALTHCHECK CMD curl -f http://localhost:8042/ || exit 1
ADD run.sh /run.sh
RUN chmod a+x /run.sh
EXPOSE 8042
CMD ["/run.sh"]
The startup script run.sh:
#!/bin/bash
$HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR nodemanager
Build the hadoop-nodemanager image:
$docker build -t="hadoop-nodemanager" ./nodemanager
Building hadoop-resourcemanager
The hadoop-resourcemanager Dockerfile:
FROM hadoop-base
MAINTAINER niracler
HEALTHCHECK CMD curl -f http://localhost:8088/ || exit 1
ADD run.sh /run.sh
RUN chmod a+x /run.sh
EXPOSE 8088
CMD ["/run.sh"]
The startup script run.sh:
#!/bin/bash
$HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR resourcemanager
Build the hadoop-resourcemanager image:
$docker build -t="hadoop-resourcemanager" ./resourcemanager
Building hadoop-historyserver
The hadoop-historyserver Dockerfile:
FROM hadoop-base
MAINTAINER niracler
HEALTHCHECK CMD curl -f http://localhost:8188/ || exit 1
ENV YARN_CONF_yarn_timeline___service_leveldb___timeline___store_path=/hadoop/yarn/timeline
RUN mkdir -p /hadoop/yarn/timeline
VOLUME /hadoop/yarn/timeline
ADD run.sh /run.sh
RUN chmod a+x /run.sh
EXPOSE 8188
CMD ["/run.sh"]
The startup script run.sh:
#!/bin/bash
$HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR historyserver
Build the hadoop-historyserver image:
$docker build -t="hadoop-historyserver" ./historyserver
The configuration file hadoop.env:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
Starting the cluster
Here I use docker-compose; the contents of docker-compose-local.yml are as follows:
version: "2"
services:
namenode:
image: hadoop-namenode
hostname: namenode
container_name: namenode
volumes:
- ./data/namenode:/hadoop/dfs/name
environment:
- CLUSTER_NAME=test
env_file:
- ./hadoop.env
ports:
- "50070:50070"
resourcemanager:
image: hadoop-resourcemanager
hostname: resourcemanager
container_name: resourcemanager
depends_on:
- "namenode"
links:
- "namenode"
env_file:
- ./hadoop.env
historyserver:
image: hadoop-historyserver
hostname: historyserver
container_name: historyserver
volumes:
- ./data/historyserver:/hadoop/yarn/timeline
depends_on:
- "namenode"
links:
- "namenode"
env_file:
- ./hadoop.env
nodemanager1:
image: hadoop-nodemanager
hostname: nodemanager1
container_name: nodemanager1
depends_on:
- "namenode"
- "resourcemanager"
links:
- "namenode"
- "resourcemanager"
env_file:
- ./hadoop.env
datanode1:
image: hadoop-datanode
hostname: datanode1
container_name: datanode1
depends_on:
- "namenode"
links:
- "namenode"
volumes:
- ./data/datanode1:/hadoop/dfs/data
env_file:
- ./hadoop.env
datanode2:
image: hadoop-datanode
hostname: datanode2
container_name: datanode2
depends_on:
- "namenode"
links:
- "namenode"
volumes:
- ./data/datanode2:/hadoop/dfs/data
env_file:
- ./hadoop.env
datanode3:
image: hadoop-datanode
hostname: datanode3
container_name: datanode3
depends_on:
- "namenode"
links:
- "namenode"
volumes:
- ./data/datanode3:/hadoop/dfs/data
env_file:
- ./hadoop.env
Start it:
$docker-compose -f docker-compose-local.yml up -d
Open http://localhost:50070/ and you should see the NameNode web UI.
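If the web UI does not come up, or you simply want to see the configuration machinery at work, these commands are handy (the container names are the ones from the compose file above):
$docker-compose -f docker-compose-local.yml ps
$docker logs -f namenode
$docker exec namenode grep defaultFS /etc/hadoop/core-site.xml
The last one should show the fs.defaultFS property that entrypoint.sh generated from hadoop.env.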
References
- Building a distributed Hadoop cluster with Docker
- Switching CentOS to the Aliyun YUM mirror
- docker-hadoop
- Switching apt-get from the official sources to domestic mirrors on Debian-based Linux
- How to install Java on Debian 8 with apt-get
- Painless installation of JDK 8 on Debian 8
- Hadoop installation
- HBase Python API
- Building a cross-host Hadoop cluster with Docker
- Common Docker commands
- Building a Swarm cluster with Docker Swarm mode
- Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation