The previous post installed Hadoop on each of the virtual machines (Big Data Series: Hadoop Cluster Setup (1)). This post introduces Hadoop itself and plans the deployment across the VMs.
Contents
1. Getting to Know Hadoop
1.1 Modules
1.2 Related Projects
1.3 Hadoop Architecture
1.3.1 HDFS Architecture
1.3.2 YARN Architecture
2. Hadoop Deployment Planning
2.1 Node Planning
2.2 Port Planning
3. Hadoop Cluster Configuration
3.1 Configuration Files
3.1.1 Custom Configuration Files
3.1.2 Default Configuration
3.2 Configuring Hadoop in Non-Secure Mode
3.2.1 Hadoop Configuration Notes
3.2.2 Cluster Configuration
3.3 Starting the Cluster & Verification
3.3.1 Starting HDFS
3.3.2 Starting YARN
3.3.3 Starting the JobHistory Service
3.3.4 Cluster Verification
3.4 Processes and Ports
3.4.1 leader node
3.4.2 follower1 node
3.4.3 follower2 node
The project includes these modules:
Hadoop Common: the common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
HDFS has a master/slave architecture, consisting of a single NameNode and multiple DataNodes.
NameNode: manages the filesystem namespace and regulates client access to files.
Secondary NameNode (2nn): periodically checkpoints the NameNode's metadata as a backup.
DataNodes: store the actual data blocks and serve read/write requests from clients.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
YARN consists of the ResourceManager (RM) and NodeManagers (NM).
ResourceManager: the ultimate authority that arbitrates resources among all the applications in the system.
The RM has two main components: the Scheduler and the ApplicationsManager.
NodeManager: the per-machine framework agent, responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
 | 192.168.56.101 (leader) | 192.168.56.102 (follower1) | 192.168.56.103 (follower2)
---|---|---|---
HDFS | NameNode, DataNode | SecondaryNameNode, DataNode | DataNode
YARN | NodeManager | NodeManager | ResourceManager, NodeManager
Hadoop has two kinds of configuration files: default configuration files and custom configuration files; the custom files take precedence.
The *-site.xml files are the custom configuration files.
The default configuration can be found in the files at the following locations:
The *-env files shown in the figure in 3.1.1 configure the corresponding environment variables.
Key configuration notes:
etc/hadoop/core-site.xml

Parameter | Value | Notes
---|---|---
fs.defaultFS | NameNode URI | hdfs://host:port/
io.file.buffer.size | 131072 (128KB) | Size of the read/write buffer used in SequenceFiles.
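These parameters live inside the file's `<configuration>` element. A minimal core-site.xml skeleton might look like the following (the values here are the placeholder examples from the table above, not this cluster's settings, which appear in 3.2.2):

```xml
<?xml version="1.0"?>
<!-- Illustrative skeleton only; replace host:port with your NameNode address -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:port/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
```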
etc/hadoop/hdfs-site.xml

Configurations for NameNode:

Parameter | Value | Notes
---|---|---
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.hosts / dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes.
dfs.blocksize | 268435456 (256MB) | HDFS blocksize of 256MB for large file-systems.
dfs.namenode.handler.count | 100 | More NameNode server threads to handle RPCs from a large number of DataNodes.

Configurations for DataNode:
Parameter | Value | Notes
---|---|---
dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
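As an illustration of the two storage-path properties (the `/data/hadoop/...` paths below are hypothetical, not part of this cluster's configuration in 3.2.2):

```xml
<!-- Illustrative sketch; the paths are hypothetical examples -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop/dfs/data1,/data/hadoop/dfs/data2</value>
</property>
```

Listing two data directories on different disks, as in the second property, spreads block storage and I/O across devices.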
etc/hadoop/yarn-site.xml

Configurations for ResourceManager and NodeManager:

Parameter | Value | Notes
---|---|---
yarn.acl.enable | true / false | Enable ACLs? Defaults to false.
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access.
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation. Disabled by default.

Configurations for ResourceManager:
Parameter | Value | Notes
---|---|---
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.hostname | ResourceManager host. | A single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components.
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g., org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the ResourceManager. | In MBs.
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the ResourceManager. | In MBs.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers.

Configurations for NodeManager:
Parameter | Value | Notes
---|---|---
yarn.nodemanager.resource.memory-mb | Resource, i.e. available physical memory in MB, for a given NodeManager. | Defines the total available resources on the NodeManager to be made available to running containers.
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk I/O.
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk I/O.
yarn.nodemanager.log.retain-seconds | 10800 (3 hours) | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled.
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled.
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications.
yarn.nodemanager.env-whitelist | Environment properties to be inherited by containers from NodeManagers. | For MapReduce applications, in addition to the default values, HADOOP_MAPRED_HOME should be added. Property value should be JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME
etc/hadoop/mapred-site.xml
Configurations for MapReduce Applications:
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory-limit while sorting data for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
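The memory-related rows above translate into mapred-site.xml like this. A sketch only: these values come from the table, are not used in this cluster's configuration in 3.2.2, and should be tuned to your VM sizes (the java.opts heap must stay below the corresponding memory.mb container limit):

```xml
<!-- Illustrative only: table values, not this cluster's settings -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024M</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2560M</value>
</property>
```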
Configurations for MapReduce JobHistory Server:

Parameter | Value | Notes
---|---|---
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
3.2.2 Cluster Configuration
Below is the configuration for my cluster:
/opt/module/hadoop-3.3.1/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.namenode.http-address</name>
  <value>leader:9870</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>follower1:9868</value>
</property>
/opt/module/hadoop-3.3.1/etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://leader:9820</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/module/hadoop-3.3.1/data</value>
</property>
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>hadoop</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
/opt/module/hadoop-3.3.1/etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>leader:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>leader:19888</value>
</property>
/opt/module/hadoop-3.3.1/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>follower2</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://leader:19888/jobhistory/logs</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
/opt/module/hadoop-3.3.1/etc/hadoop/workers
leader
follower1
follower2
*Important: sync all of the configuration above from leader to follower1 and follower2.
For example: scp /opt/module/hadoop-3.3.1/etc/hadoop/workers hadoop@follower1:/opt/module/hadoop-3.3.1/etc/hadoop/
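Rather than copying file by file, the whole conf directory can be pushed to each follower in a loop. A sketch, assuming passwordless SSH for the hadoop user (set up in part 1); it prints the commands as a dry run, so remove the echo to actually copy:

```shell
# Dry-run sync helper: prints one scp command per follower node.
CONF_DIR=/opt/module/hadoop-3.3.1/etc/hadoop
for host in follower1 follower2; do
  echo scp -r "$CONF_DIR"/ "hadoop@$host:$CONF_DIR/"
done
```

Tools like rsync would also work here and only transfer changed files.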
On the first startup, format the NameNode once, on the leader node:
$ hdfs namenode -format
# Then start HDFS
$ start-dfs.sh
Start YARN on follower2, the node where the ResourceManager is configured:
$ start-yarn.sh
Start the JobHistory server on the leader node:
$ mapred --daemon start historyserver
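After the three start commands, each node should be running the daemons from the node plan in 2.1; on a real node you would run `jps` and check its output. A sketch of that check, using a made-up sample of `jps` output for the leader node (the PIDs are illustrative):

```shell
# Check sample `jps` output for the daemons the node plan expects on leader.
# On a real node, replace the literal text with: jps_output=$(jps)
jps_output="2211 NameNode
2345 DataNode
2478 NodeManager
2602 JobHistoryServer
2733 Jps"
for daemon in NameNode DataNode NodeManager JobHistoryServer; do
  if echo "$jps_output" | grep -q "$daemon"; then
    echo "$daemon OK"
  else
    echo "$daemon MISSING"
  fi
done
```

The same loop works on follower1 (expect SecondaryNameNode, DataNode, NodeManager) and follower2 (expect ResourceManager, DataNode, NodeManager) by changing the daemon list.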
If the following warning appears during startup:
Java HotSpot(TM) Client VM warning: You have loaded library /opt/module/hadoop-3.3.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c
Solution:
Add the last 3 export lines to cust.sh, so that it ends up as follows:
[hadoop@leader hadoop]$ more /etc/profile.d/cust.sh
#JAVA_HOME
JAVA_HOME=/opt/module/jdk1.8.0_311
PATH=$PATH:$JAVA_HOME/bin
HADOOP_HOME=/opt/module/hadoop-3.3.1
PATH=$PATH:$HADOOP_HOME/bin
PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME
export HADOOP_HOME
export PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=/opt/module/hadoop-3.3.1/lib/native
Verification 1: upload files to HDFS
$ hadoop fs -mkdir /input
# Put the wordcount text file from earlier into /input on HDFS
$ hadoop fs -put $HADOOP_HOME/wcinput/wcinput_sample.txt /input
# Put the JDK tarball into /
$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
You can inspect the stored blocks under the configured data directory, e.g. on the leader node,
or browse the files in the NameNode UI.
(You can ignore the output and tmp directories; they are used in the verification below.)
Verification 2: run the wordcount job
# First remove any existing /output directory from the filesystem, to avoid a directory-already-exists error
$ hadoop fs -rm -r /output
# Run the wordcount example, writing the result to /output in HDFS
$ hadoop jar /opt/module/hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
Once the job finishes, it and its logs are visible in the JobHistory UI,
and the result can be seen under /output in HDFS.
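For intuition, what the wordcount job computes can be sketched locally with coreutils. This is not the MapReduce job itself, and the two-line sample text is made up; it just shows the split-group-count semantics:

```shell
# Local sketch of wordcount's semantics: split into words, group, count.
printf 'hello world\nhello hadoop\n' | tr -s ' ' '\n' | sort | uniq -c | sort -rn
```

In the pipeline, `tr` plays the role of the map phase (one word per line), `sort` the shuffle, and `uniq -c` the reduce.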
That completes the Hadoop cluster setup. Follow-up posts will use some worked examples to dig further into the Hadoop ecosystem.