Big Data Series - Hadoop Cluster Setup (2): Cluster Configuration

The previous post (Big Data Series - Hadoop Cluster Setup (1)) installed Hadoop on each virtual machine. This post takes a first look at Hadoop itself, then plans and configures the deployment across those VMs.

Contents

1. A First Look at Hadoop

1.1 Modules

1.2 Related Projects

1.3 Hadoop Architecture

1.3.1 HDFS Architecture

1.3.2 YARN Architecture

2. Hadoop Deployment Planning

2.1 Node Planning

2.2 Port Planning

3. Hadoop Cluster Configuration

3.1 Configuration Files

3.1.1 Custom Configuration Files

3.1.2 Default Configuration

3.2 Configuring Hadoop in Non-Secure Mode

3.2.1 Hadoop Configuration Reference

3.2.2 Cluster Configuration

3.3 Starting and Verifying the Cluster

3.3.1 Starting HDFS

3.3.2 Starting YARN

3.3.3 Starting the JobHistory Server

3.3.4 Cluster Verification

3.4 Processes and Ports

3.4.1 leader Node

3.4.2 follower1 Node

3.4.3 follower2 Node



1. A First Look at Hadoop

1.1 Modules

The Hadoop project includes these modules:

  • Hadoop Common: the common utilities that support the other three modules.
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets; this is where the classic "divide and conquer" of big data actually happens.

1.2 Related Projects

Other Hadoop-related projects at Apache include:

  • Ambari™: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: a data serialization system.
  • Cassandra™: a scalable multi-master database with no single points of failure.
  • Chukwa™: a data collection system for managing large distributed systems.
  • HBase™: a scalable, distributed database that supports structured data storage for large tables.
  • Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: a scalable machine learning and data mining library.
  • Ozone™: a scalable, redundant, and distributed object store for Hadoop.
  • Pig™: a high-level data-flow language and execution framework for parallel computation.
  • Spark™: a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Submarine: a unified AI platform which allows engineers and data scientists to run machine learning and deep learning workloads in a distributed cluster.
  • Tez™: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: a high-performance coordination service for distributed applications.

1.3 Hadoop Architecture

1.3.1 HDFS Architecture

HDFS has a master/slave architecture: a single NameNode plus multiple DataNodes.

NameNode

  • The master server that manages the filesystem namespace and regulates client access to files; namespace operations include opening, closing, and renaming files and directories.
  • Manages the metadata and maintains the mapping from data blocks (Blocks) to DataNodes, playing a role somewhat like a service registry.

Secondary NameNode (2nn): periodically checkpoints the NameNode's metadata as a backup.

DataNodes

  • The nodes that store the data; a file is split into one or more blocks, which are stored on DataNodes.
  • Serve read and write requests from filesystem clients.
  • Create, delete, and replicate blocks on instruction from the NameNode.

(Figure: HDFS architecture)
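Once the cluster is running (section 3.3), the block-to-DataNode mapping the NameNode maintains can be inspected directly; a quick check, assuming a file has already been uploaded to HDFS (the path below matches the file uploaded in 3.3.4):

# Show the blocks a file is split into and which DataNodes hold each replica
$ hdfs fsck /input/wcinput_sample.txt -files -blocks -locations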

1.3.2 YARN Architecture

The fundamental idea of YARN is to split the two functions of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

YARN consists of the ResourceManager (RM) and the NodeManagers (NM).

ResourceManager: the ultimate authority that arbitrates resources among all applications in the system.

The RM has two main components: the Scheduler and the ApplicationsManager.

  • Scheduler: responsible for allocating resources to the various running applications, subject to familiar constraints of capacities (as reported by the nodes), queues, and so on.
  • ApplicationsManager: responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and restarting the ApplicationMaster container on failure. The per-application ApplicationMaster then negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress.

NodeManager: the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

(Figure: YARN architecture)
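Once the cluster is up, both sides of this split are easy to observe from the command line:

# NodeManagers currently registered with the ResourceManager
$ yarn node -list
# Applications and the state of their ApplicationMasters
$ yarn application -list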

2. Hadoop Deployment Planning

2.1 Node Planning

|      | 192.168.56.101 (leader) | 192.168.56.102 (follower1) | 192.168.56.103 (follower2) |
|------|-------------------------|----------------------------|----------------------------|
| HDFS | NameNode, DataNode      | SecondaryNameNode, DataNode | DataNode                  |
| YARN | NodeManager             | NodeManager                | ResourceManager, NodeManager |

2.2 Port Planning

The ports this cluster uses (taken from the configuration in section 3.2.2; 8088 is the YARN web UI default, which is not overridden here):

| Service | Node | Port |
|---------|------|------|
| NameNode RPC (fs.defaultFS) | leader | 9820 |
| NameNode web UI | leader | 9870 |
| SecondaryNameNode web UI | follower1 | 9868 |
| ResourceManager web UI (default) | follower2 | 8088 |
| JobHistory server RPC | leader | 10020 |
| JobHistory web UI | leader | 19888 |

3. Hadoop Cluster Configuration

3.1 Configuration Files

Hadoop has two kinds of configuration files, default and custom; the custom files take priority.

3.1.1 Custom Configuration Files

The *-site.xml files under etc/hadoop (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) are the custom configuration files.

3.1.2 Default Configuration

The defaults ship inside the Hadoop jars and can be consulted there:

  • core-default.xml (in hadoop-common-3.3.1.jar)
  • hdfs-default.xml (in hadoop-hdfs-3.3.1.jar)
  • yarn-default.xml (in hadoop-yarn-common-3.3.1.jar)
  • mapred-default.xml (in hadoop-mapreduce-client-core-3.3.1.jar)
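To check which value is actually in effect for a given key (the default unless a *-site.xml overrides it), `hdfs getconf` queries the live configuration; the output shown assumes the stock 128 MB block-size default:

# Effective value of dfs.blocksize (134217728 bytes = 128 MB by default)
$ hdfs getconf -confKey dfs.blocksize
134217728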

3.2 Configuring Hadoop in Non-Secure Mode

3.2.1 Hadoop Configuration Reference

Alongside the *-site.xml files listed in 3.1.1, the *-env.sh files (hadoop-env.sh, yarn-env.sh, mapred-env.sh) configure the corresponding environment variables.

Key parameters:

  • etc/hadoop/core-site.xml

| Parameter | Value | Notes |
|-----------|-------|-------|
| fs.defaultFS | NameNode URI | hdfs://host:port/ |
| io.file.buffer.size | 131072 (128 KB) | Size of the read/write buffer used in SequenceFiles. |

  • etc/hadoop/hdfs-site.xml

Configurations for NameNode:

| Parameter | Value | Notes |
|-----------|-------|-------|
| dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
| dfs.hosts / dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes. |
| dfs.blocksize | 268435456 (256 MB) | HDFS blocksize of 256 MB for large file-systems. |
| dfs.namenode.handler.count | 100 | More NameNode server threads to handle RPCs from a large number of DataNodes. |

Configurations for DataNode:

| Parameter | Value | Notes |
|-----------|-------|-------|
| dfs.datanode.data.dir | Comma-separated list of paths on the DataNode's local filesystem where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |

  • etc/hadoop/yarn-site.xml

Configurations for ResourceManager and NodeManager:

| Parameter | Value | Notes |
|-----------|-------|-------|
| yarn.acl.enable | true / false | Enable ACLs? Defaults to false. |
| yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access. |
| yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation; disabled by default. |

Configurations for ResourceManager:

| Parameter | Value | Notes |
|-----------|-------|-------|
| yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.webapp.address | ResourceManager web UI host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.hostname | ResourceManager host. | A single hostname that can be set in place of setting all the yarn.resourcemanager*address resources; results in default ports for the ResourceManager components. |
| yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g., org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler. |
| yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the ResourceManager. | In MBs. |
| yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the ResourceManager. | In MBs. |
| yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |

Configurations for NodeManager:
| Parameter | Value | Notes |
|-----------|-------|-------|
| yarn.nodemanager.resource.memory-mb | Available physical memory, in MB, for the given NodeManager. | Defines the total resources on the NodeManager made available to running containers. |
| yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio; the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
| yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk I/O. |
| yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk I/O. |
| yarn.nodemanager.log.retain-seconds | 10800 (3 hours) | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log aggregation is disabled. |
| yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory to which application logs are moved on application completion. Appropriate permissions must be set. Only applicable if log aggregation is enabled. |
| yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log aggregation is enabled. |
| yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications. |
| yarn.nodemanager.env-whitelist | Environment properties to be inherited by containers from NodeManagers. | For MapReduce applications, HADOOP_MAPRED_HOME should be added to the default values. The property value should be JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME |
  • etc/hadoop/mapred-site.xml

Configurations for MapReduce applications:

| Parameter | Value | Notes |
|-----------|-------|-------|
| mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
| mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
| mapreduce.map.java.opts | -Xmx1024M | Larger heap size for child JVMs of maps. |
| mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
| mapreduce.reduce.java.opts | -Xmx2560M | Larger heap size for child JVMs of reduces. |
| mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency. |
| mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
| mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps. |

Configurations for the MapReduce JobHistory server:

| Parameter | Value | Notes |
|-----------|-------|-------|
| mapreduce.jobhistory.address | MapReduce JobHistory server host:port | Default port is 10020. |
| mapreduce.jobhistory.webapp.address | MapReduce JobHistory server web UI host:port | Default port is 19888. |
| mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
| mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory server. |

3.2.2 Cluster Configuration

The configuration for my cluster:

/opt/module/hadoop-3.3.1/etc/hadoop/hdfs-site.xml

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>leader:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>follower1:9868</value>
    </property>
</configuration>

/opt/module/hadoop-3.3.1/etc/hadoop/core-site.xml

<configuration>
    <!-- Default filesystem: the NameNode RPC endpoint -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://leader:9820</value>
    </property>
    <!-- Base directory for Hadoop data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.3.1/data</value>
    </property>
    <!-- Static user for the HDFS web UI -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop</value>
    </property>
    <!-- Hosts from which the hadoop proxy user may connect -->
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
</configuration>

/opt/module/hadoop-3.3.1/etc/hadoop/mapred-site.xml

<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server RPC address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>leader:10020</value>
    </property>
    <!-- JobHistory server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>leader:19888</value>
    </property>
</configuration>

/opt/module/hadoop-3.3.1/etc/hadoop/yarn-site.xml

<configuration>
    <!-- Shuffle service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>follower2</value>
    </property>
    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Per-container memory limits -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <!-- Physical memory made available by this NodeManager -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <!-- Disable physical/virtual memory checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Log server URL (JobHistory web UI) -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://leader:19888/jobhistory/logs</value>
    </property>
    <!-- Retain aggregated logs for 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>

/opt/module/hadoop-3.3.1/etc/hadoop/workers

leader
follower1
follower2

*Important: sync the configuration above from leader to follower1 and follower2, e.g.:

$ scp /opt/module/hadoop-3.3.1/etc/hadoop/workers hadoop@follower1:/opt/module/hadoop-3.3.1/etc/hadoop/
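To push the whole configuration directory in one go, a small loop works (this assumes passwordless SSH is set up for the hadoop user between the nodes, which start-dfs.sh needs anyway):

# Copy every config file from leader to both followers
for host in follower1 follower2; do
    scp -r /opt/module/hadoop-3.3.1/etc/hadoop/* "hadoop@${host}:/opt/module/hadoop-3.3.1/etc/hadoop/"
done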

3.3 Starting and Verifying the Cluster

3.3.1 Starting HDFS

On the very first start only, format the NameNode once, on the leader node:

$ hdfs namenode -format

# then start HDFS
$ start-dfs.sh
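A quick way to confirm that all three DataNodes registered with the NameNode:

# Summarize the live DataNodes as seen by the NameNode
$ hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'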

3.3.2 Starting YARN

Start YARN on follower2, the node where the ResourceManager is configured:

$ start-yarn.sh
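The ResourceManager web UI should now answer on its default port (8088, since yarn.resourcemanager.webapp.address is not overridden in this cluster):

# Expect an HTTP 200 (or a redirect to /cluster) from the RM web UI
$ curl -sI http://follower2:8088 | head -n 1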

3.3.3 Starting the JobHistory Server

On the leader node, start the history server:

$ mapred --daemon start historyserver
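If it started cleanly, the process shows up in jps on leader:

$ jps | grep JobHistoryServer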

Note: if the following warning appears during startup:

Java HotSpot(TM) Client VM warning: You have loaded library /opt/module/hadoop-3.3.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Fix: add the last three export lines below to cust.sh, so the file ends up as:

[hadoop@leader hadoop]$ more /etc/profile.d/cust.sh
#JAVA_HOME
JAVA_HOME=/opt/module/jdk1.8.0_311
PATH=$PATH:$JAVA_HOME/bin
HADOOP_HOME=/opt/module/hadoop-3.3.1

PATH=$PATH:$HADOOP_HOME/bin
PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME
export HADOOP_HOME
export PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=/opt/module/hadoop-3.3.1/lib/native
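After re-sourcing the profile, hadoop checknative reports whether the native library now loads:

$ source /etc/profile.d/cust.sh
# Check that libhadoop and its compression codecs load correctly
$ hadoop checknative -a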

3.3.4 Cluster Verification

  • Verification 1: create a directory in HDFS and upload files

$ hadoop fs -mkdir /input
# put the wordcount text file from earlier into /input on HDFS
$ hadoop fs -put $HADOOP_HOME/wcinput/wcinput_sample.txt /input
# put the JDK tarball into /
$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
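To confirm the uploads from the command line:

# Recursively list what is now in HDFS
$ hadoop fs -ls -R /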

The underlying blocks can also be seen on disk under the configured data directory (hadoop.tmp.dir), e.g. on the leader node:

(Figure: block files under the data directory on leader)

The files are also visible in the NameNode web UI (the output and tmp directories can be ignored for now; they are used in the verification below).

(Figure: NameNode web UI file browser)

  • Verification 2: run the wordcount job

# first remove any existing /output directory, to avoid a directory-already-exists error
$ hadoop fs -rm -r /output

# run the wordcount example, writing results to /output on HDFS
$ hadoop jar /opt/module/hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
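When the job completes, the result can be read straight from HDFS (part-r-00000 assumes the example's default single reducer):

$ hadoop fs -ls /output
$ hadoop fs -cat /output/part-r-00000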

Once the job finishes, it and its logs can be seen in the JobHistory web UI.

(Figures: JobHistory job list, job details, and task logs)

The result is visible under /output in HDFS.

(Figure: wordcount output under /output)

3.4 Processes and Ports
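Per the node plan in 2.1 (plus the JobHistory server), jps should show NameNode, DataNode and NodeManager plus JobHistoryServer on leader; SecondaryNameNode, DataNode and NodeManager on follower1; and ResourceManager, DataNode and NodeManager on follower2. To check any node:

# Java daemons running on this node
$ jps
# Their listening ports (ss needs iproute2; netstat -lntp also works)
$ ss -lntp | grep java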

3.4.1 leader Node

(Figure: jps output and listening ports on leader)

3.4.2 follower1 Node

3.4.3 follower2 Node

(Figure: jps output and listening ports on follower2)

That completes the Hadoop cluster setup. Future posts will work through some examples to get better acquainted with the Hadoop ecosystem.
