How Hadoop Works Internally – Inside Hadoop
Apache Hadoop is an open-source software framework that stores data in a distributed manner and processes that data in parallel. Hadoop provides the world's most reliable storage layer – HDFS, a batch processing engine – MapReduce, and a resource management layer – YARN. In this tutorial on 'How Hadoop Works Internally', we will learn what Hadoop is, how Hadoop works, the different components of Hadoop, the daemons in Hadoop, the roles of HDFS, MapReduce, and YARN in Hadoop, and the various steps to understand how Hadoop works.
What is Hadoop?
Before learning how Hadoop works, let's brush up on the basic Hadoop concepts. Apache Hadoop is a set of open-source software utilities. They facilitate using a network of many computers to solve problems involving massive amounts of data. Hadoop provides a software framework for distributed storage and distributed computing. It divides a file into a number of blocks and stores them across a cluster of machines. Hadoop also achieves fault tolerance by replicating the blocks on the cluster. It does distributed processing by dividing a job into a number of independent tasks. These tasks run in parallel over the computer cluster.
Hadoop Components and Daemons
You can't understand the working of Hadoop without knowing its core components. So, Hadoop consists of three layers (core components) and they are:
HDFS – Hadoop Distributed File System provides the storage layer of Hadoop. As the name suggests, it stores data in a distributed manner. A file gets divided into a number of blocks, which spread across the cluster of commodity hardware.
MapReduce – This is the processing engine of Hadoop. **MapReduce works on the principle of distributed processing.** It divides the task submitted by the user into a number of independent subtasks. These sub-tasks execute in parallel, thereby increasing the throughput.
Yarn – Yet Another Resource Negotiator provides resource management for Hadoop. There are two daemons running for Yarn. One is the NodeManager on the slave machines and the other is the ResourceManager on the master node. Yarn looks after the allocation of resources among the various slave nodes competing for them.
Daemons are the processes that run in the background. The Hadoop daemons are:
a) NameNode – It runs on the HDFS master node.
b) DataNode – It runs on the HDFS slave nodes.
c) ResourceManager – It runs on the YARN master node for MapReduce.
d) NodeManager – It runs on the YARN slave nodes for MapReduce.
These 4 daemons run for Hadoop to be functional.
How Hadoop Works?
Hadoop does distributed processing of huge data sets across a cluster of commodity servers and works on multiple machines simultaneously. To process any data, the client submits the data and the program to Hadoop. HDFS stores the data, while **MapReduce** processes the data and Yarn divides the tasks.
Let’s discuss in detail how Hadoop works –
i. HDFS
Hadoop Distributed File System has a master-slave topology. It has two daemons running: NameNode and DataNode.
NameNode
NameNode is the daemon running on the master machine. It is the centerpiece of an HDFS file system. NameNode stores the directory tree of all files in the file system. It tracks where across the cluster the file data resides. It does not store the data contained in these files.
When client applications want to add/copy/move/delete a file, they interact with the NameNode. The NameNode responds to the request from the client by returning a list of relevant DataNode servers where the data lives.
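To make this concrete, below is a minimal sketch (not code from this tutorial) of a client reading a file, assuming a reachable HDFS cluster and the Hadoop client libraries on the classpath; the NameNode address and file path are hypothetical. The client only names the path – the FileSystem API contacts the NameNode for metadata behind the scenes and then streams the bytes from DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; replace with your cluster's.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/sample.txt"); // hypothetical file

    // fs.open() asks the NameNode where the file's blocks live;
    // the returned stream then reads those blocks from the DataNodes.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```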
Recommended Reading – NameNode High Availability
DataNode
The DataNode daemon runs on the slave nodes. It stores data in the Hadoop File System. In a functional file system, data replicates across many DataNodes.
On startup, a DataNode connects to the NameNode. It keeps listening for requests from the NameNode to access data. Once the NameNode provides the location of the data, client applications can talk directly to a DataNode. While replicating data, DataNode instances can talk to each other.
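Since the NameNode only hands out block locations while the DataNodes serve the bytes, you can inspect which DataNodes hold a file's blocks. A minimal sketch, under the same assumptions (and hypothetical file path) as above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/data/sample.txt"); // hypothetical file

    // A pure metadata query answered by the NameNode; no file data moves.
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}
```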
Replica Placement
The placement of replicas decides HDFS reliability and performance. Optimization of replica placement sets HDFS apart from other distributed systems. Huge HDFS instances run on a cluster of computers spread across many racks. Communication between nodes on different racks has to go through switches. Mostly, the network bandwidth between nodes on the same rack is greater than that between machines on separate racks.
The rack awareness algorithm determines the rack id of each DataNode. Under a simple policy, the replicas get placed on unique racks. This prevents data loss in the event of rack failure. Also, it utilizes bandwidth from multiple racks while reading data. However, this method increases the cost of writes.
Let us assume that the replication factor is three. Suppose HDFS's placement policy places one replica on the local rack and the other two replicas on a remote but common rack. This policy cuts the inter-rack write traffic, thereby improving write performance. The chance of rack failure is less than that of node failure. Hence, this policy does not affect data reliability and availability. But it does reduce the aggregate network bandwidth used when reading data, because a block gets placed in only two unique racks rather than three.
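The replication factor itself is configurable. As a minimal sketch (same client-side assumptions and hypothetical path as earlier), you can set a default for files created by a client or change it per file; where the replicas actually land is then decided by the NameNode's rack-aware placement policy:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication factor for files this client creates.
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);

    // Change the replication factor of an existing (hypothetical) file;
    // the NameNode schedules the extra copies or deletions in the background.
    fs.setReplication(new Path("/data/sample.txt"), (short) 2);
  }
}
```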
ii. MapReduce
The general idea of the MapReduce algorithm is to process the data in parallel on your distributed cluster. It subsequently combines it into the desired result or output.
Hadoop MapReduce includes several stages:
In the first step, the program locates and reads the "input file" containing the raw data.
As the file format is arbitrary, the data needs to be converted into something the program can process. The "InputFormat" and "RecordReader" (RR) do this job.
InputFormat uses the InputSplit function to split the file into smaller pieces.
Then the RecordReader transforms the raw data for processing by the mapper. It outputs a list of key-value pairs.
Once the mapper processes these key-value pairs, the result goes to the "OutputCollector". There is another function called "Reporter" which notifies the user when the mapping task finishes.
In the next step, the Reduce function performs its task on each key-value pair from the mapper.
Finally, OutputFormat organizes the key-value pairs from the Reducer for writing them to HDFS.
Being the heart of the Hadoop system, MapReduce processes the data in a highly resilient, fault-tolerant manner.
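These stages are easiest to see in the canonical word-count job. The sketch below is a minimal example (not this tutorial's own code): the mapper emits key-value pairs, the reducer combines them, and the default InputFormat, RecordReader, and OutputFormat handle the reading and writing described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: the RecordReader hands us (byte offset, line); we emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // collected for the reduce phase
      }
    }
  }

  // Reduce phase: receives (word, [1, 1, ...]) and emits (word, total).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // OutputFormat writes to HDFS
    }
  }
}
```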
iii. Yarn
Yarn divides the work of resource management and job scheduling/monitoring into separate daemons. There is one ResourceManager and a per-application ApplicationMaster. An application can be either a single job or a DAG of jobs.
The ResourceManager has two components – the Scheduler and the ApplicationManager.
The Scheduler is a pure scheduler, i.e. it does not track the status of running applications. It only allocates resources to the various competing applications. Also, it does not restart a job after failure due to hardware or application faults. The Scheduler allocates resources based on an abstract notion of a container. A container is nothing but a fraction of resources like CPU, memory, disk, network, etc.
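For a feel of what a container request looks like, here is a minimal, hedged sketch using Yarn's client API, assuming an ApplicationMaster that has already registered with the ResourceManager (error handling and the registration call are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();

    // In a real ApplicationMaster you would call registerApplicationMaster(...) here.

    // A container is just a slice of resources: 1024 MB of memory and 1 vcore.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);

    // Ask the Scheduler for one container anywhere in the cluster
    // (nodes and racks left null, so there is no locality constraint).
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));

    rmClient.stop();
  }
}
```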
Following are the tasks of the ApplicationManager:
Accepts job submissions from the client.
Negotiates the first container for a specific ApplicationMaster.
Restarts the container after application failure.
Below are the responsibilities of the ApplicationMaster:
Negotiates containers from the Scheduler.
Tracks container status and monitors its progress.
Yarn supports the concept of resource reservation via the ReservationSystem. Here, a user can reserve a set of resources for the execution of a particular job over time, with temporal constraints. The ReservationSystem makes sure that the resources are available to the job until its completion. It also performs admission control for reservations.
Yarn can scale beyond a few thousand nodes via Yarn Federation. Federation allows wiring multiple sub-clusters into a single massive cluster. We can use many independent clusters together for a single large job, achieving a large-scale system.
Let us summarize how Hadoop works step by step:
Input data is broken into blocks of size **128 MB** and then the blocks are moved to different nodes.
Once all the blocks of the data are stored on data-nodes, the user can process the data.
Resource Manager then schedules the program (submitted by the user) on individual nodes.
Once all the nodes process the data, the output is written back to HDFS.
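Putting the steps together, a driver program points the job at input data already stored in HDFS, names the mapper and reducer, and submits the job for Yarn to schedule. A minimal sketch, reusing the hypothetical WordCount classes from the MapReduce section (the input and output paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input blocks already live on DataNodes; output is written back to HDFS.
    FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/data/output")); // must not exist

    // Submit to Yarn and wait; the ResourceManager schedules the tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```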
So, this was all on the 'How Hadoop Works' tutorial.
Conclusion
In conclusion to How Hadoop Works, we can say the client first submits the data and the program. HDFS stores that data and MapReduce processes it. Now that we have learned the Hadoop introduction and how Hadoop works, let us learn how to install Hadoop on a **single node** and **multi-node** setup to move ahead in the technology.
Drop a comment if you like the tutorial, or if you have any queries or feedback on 'How Hadoop Works', and we will get back to you.
https://data-flair.training/blogs/how-hadoop-works-internally