Hadoop Ecosystem and Their Components – A Complete Tutorial
1. Hadoop Ecosystem Components
The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful, and because of which several Hadoop job roles are available now. We will also learn about Hadoop ecosystem components like HDFS and the HDFS components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and the HBase components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, Zookeeper, and Apache Oozie, to dive deep into Big Data Hadoop and acquire master-level knowledge of the Hadoop ecosystem.
Hadoop Ecosystem and Their Components
2. Introduction to Hadoop Ecosystem
As we can see, the above figure of the Hadoop Ecosystem shows its different components. We are now going to discuss this list of Hadoop components one by one in detail in this section.
2.1. Hadoop Distributed File System
It is the most important component of the Hadoop ecosystem. HDFS is the primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data. HDFS is a distributed file system that runs on commodity hardware. HDFS ships with a default configuration that suits many installations; most of the time, additional configuration is needed only for large clusters. Users interact directly with HDFS through shell-like commands.
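To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (the FileSystem class). The file path is a placeholder, and the sketch assumes the client picks up the cluster address from core-site.xml on the classpath.

```java
// Minimal HDFS read/write sketch using the FileSystem Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // client handle talking to the NameNode
    Path file = new Path("/user/demo/hello.txt");    // hypothetical path

    try (FSDataOutputStream out = fs.create(file, true)) {  // blocks are written to DataNodes
      out.writeUTF("Hello HDFS");
    }
    try (FSDataInputStream in = fs.open(file)) {            // NameNode resolves block locations
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}
```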
HDFS Components:
There are two major components of Hadoop HDFS: NameNode and DataNode. Let's now discuss these Hadoop HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store actual data or datasets. The NameNode stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. It consists of files and directories.
Tasks of HDFS NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the clients' requests. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data and the second file records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, during which the namespace ID and software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
The DataNode manages the data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
Refer HDFS Comprehensive Guide to read Hadoop HDFS in detail and then proceed with the Hadoop Ecosystem tutorial.
2.2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides data processing. MapReduce is a software framework for easily writing applications that process the vast amount of structured and unstructured data stored in the Hadoop Distributed File system.
MapReduce programs are parallel in nature and are thus very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Hadoop MapReduce
Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two phases:
- Map phase
- Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer also specifies two functions: the **map function** and the **reduce function**.
The **map function** takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Read Mapper in detail.
The **reduce function** takes the output from the map as an input, combines those data tuples based on the key, and accordingly modifies the value of the key. Read Reducer in detail.
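As an illustration of these two functions, below is a minimal word-count sketch using the Hadoop MapReduce Java API. The input and output HDFS paths are supplied as command-line arguments and are placeholders.

```java
// Word count: the map function emits (word, 1); the reduce function sums counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```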
Features of MapReduce
**Simplicity** – MapReduce jobs are easy to run. Applications can be written in any language such as Java, C++, and Python.
**Scalability** – MapReduce can process petabytes of data.
**Speed** – By means of parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
**Fault Tolerance** – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key pair which can be used for solving the same subtask.
Refer MapReduce Comprehensive Guide for more details.
Hope this explanation of the Hadoop ecosystem is helpful to you. The next component we take up is YARN.
2.3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines such as real-time streaming and batch processing to handle data stored on a single platform.
Hadoop Yarn Diagram
YARN has been projected as the data operating system for Hadoop 2. The main features of YARN are:
**Flexibility** – Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Due to this feature of YARN, other applications can also be run along with MapReduce programs in Hadoop 2.
**Efficiency** – As many applications run on the same cluster, the efficiency of Hadoop increases without much effect on the quality of service.
**Shared** – Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Additional programming models such as graph processing and iterative modeling are now possible for data processing.
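As a small illustration of the managing and monitoring role described above, the sketch below uses the YarnClient Java API to list running cluster nodes and the applications YARN is currently tracking. It assumes a reachable ResourceManager configured in yarn-site.xml on the classpath.

```java
// List running NodeManagers and tracked applications through the YARN client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnStatus {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());   // picks up yarn-site.xml
    yarn.start();

    // Nodes currently running containers for the cluster's workloads.
    for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
    }
    // Applications (MapReduce, streaming, etc.) that YARN is managing.
    for (ApplicationReport app : yarn.getApplications()) {
      System.out.println(app.getApplicationId() + " " + app.getName()
          + " " + app.getYarnApplicationState());
    }
    yarn.stop();
  }
}
```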
Refer YARN Comprehensive Guide for more details.
2.4. Hive
The Hadoop ecosystem component **Apache Hive** is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
Hive Diagram
Main parts of Hive are:
- Metastore – It stores the metadata.
- Driver – Manages the lifecycle of a HiveQL statement.
- Query compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
- Hive server – Provides a Thrift interface and a JDBC/ODBC server.
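Since the Hive server exposes a JDBC/ODBC interface, a client can submit HiveQL as in the sketch below. The connection URL, credentials, and table name are placeholders, and the hive-jdbc driver is assumed to be on the classpath.

```java
// Run HiveQL against HiveServer2 over JDBC; Hive compiles the query into Hadoop jobs.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 typically listens on port 10000; "default" is the database.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT)");
      // This SQL-like query is translated by Hive into jobs that execute on the cluster.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, SUM(hits) AS total FROM pageviews GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```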
Refer Hive Comprehensive Guide for more details.
2.5. Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language. It is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.
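A minimal sketch of that load–filter–dump flow, embedding Pig Latin in Java through the PigServer class, might look like the following. The input file and its field layout are hypothetical.

```java
// Embed Pig Latin statements in Java with PigServer (local mode for illustration).
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);   // use ExecType.MAPREDUCE on a cluster
    // Load, filter, and store: the typical Pig Latin flow described above.
    pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
        + "AS (ip:chararray, url:chararray, status:int);");
    pig.registerQuery("errors = FILTER logs BY status >= 500;");
    pig.store("errors", "errors_out");               // writes the filtered rows to errors_out
    pig.shutdown();
  }
}
```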
Pig Diagram
Features of Apache Pig:
**Extensibility** – For carrying out special-purpose processing, users can create their own functions.
**Optimization opportunities** – Pig allows the system to optimize execution automatically. This allows the user to pay attention to semantics instead of efficiency.
**Handles all kinds of data** – Pig analyzes both structured and unstructured data.
Refer Pig – A Complete guide for more details.
2.6. HBase
Apache HBase is a Hadoop ecosystem component which is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database that is built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
HBase Diagram
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
Maintains and monitors the Hadoop cluster.
Performs administration (an interface for creating, updating, and deleting tables).
Controls the failover.
HMaster handles DDL operations.
ii. RegionServer
It is the worker node which handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster. The RegionServer runs on the HDFS DataNode.
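A small put/get sketch with the HBase Java client shows this read/write path served by RegionServers. The table name, column family, and row key below are placeholders, and the client is assumed to read the ZooKeeper quorum from hbase-site.xml.

```java
// Write one cell and read it back in real time through the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write: row key "u1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("u1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);                                    // served by a RegionServer

      // Read it back.
      Result r = table.get(new Get(Bytes.toBytes("u1")));
      System.out.println(
          Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}
```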
Refer HBase Tutorial for more details.
2.7. HCatalog
It is a table and storage management layer for Hadoop. HCatalog supports different components available in the Hadoop ecosystem, like MapReduce, Hive, and Pig, so they can easily read and write data from the cluster. HCatalog is a key component of Hive that enables the user to store their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Benefits of HCatalog:
Enables notifications of data availability.
With the table abstraction, HCatalog frees the user from the overhead of data storage.
Provides visibility for data cleaning and archiving tools.
2.8. Avro
Avro is a part of the Hadoop ecosystem and is the most popular data serialization system. Avro is an open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, big data programs written in different languages can exchange data.
Using the serialization service, programs can serialize data into files or messages. It stores the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.
Avro schema – Avro relies on schemas for serialization and deserialization. Avro requires the schema for data writes and reads. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
Dynamic typing – It refers to serialization and deserialization without code generation. It complements the code generation which is available in Avro for statically typed languages, as an optional optimization.
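A short sketch of both points, using Avro's generic records in Java: the schema is embedded in the container file when writing, so no schema has to be supplied when reading the file back. The schema and file name are illustrative.

```java
// Serialize a generic record into an Avro container file and read it back.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    File file = new File("users.avro");
    // The writer embeds the schema in the container file's header.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }
    // No schema has to be supplied for reading: it is taken from the file itself.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("name") + " " + rec.get("age"));
      }
    }
  }
}
```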
Features provided by Avro:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
Container file, to store persistent data.
2.9. Thrift
It is a software framework for scalable cross-language services development. Thrift is an interface definition language for RPC (Remote Procedure Call) communication. Hadoop makes a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.
Thrift Diagram
2.10. Apache Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. It is a low-latency distributed query engine that is designed to scale to several thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine that has a schema-free model.
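A hedged sketch of what a schema-free query can look like over Drill's JDBC interface follows. The embedded-mode connection string (jdbc:drill:zk=local), the dfs storage plugin path, and the JSON file are assumptions for illustration, and the Drill JDBC driver is assumed to be on the classpath.

```java
// Query a raw JSON file with SQL through Drill's JDBC driver -- no schema declared up front.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
  public static void main(String[] args) throws Exception {
    // "zk=local" targets a Drillbit running in embedded mode on this machine (assumed setup).
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement();
         // dfs is Drill's file-system storage plugin; the JSON file needs no schema.
         ResultSet rs = stmt.executeQuery(
             "SELECT name, age FROM dfs.`/data/users.json` WHERE age > 30")) {
      while (rs.next()) {
        System.out.println(rs.getString("name") + " " + rs.getInt("age"));
      }
    }
  }
}
```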
Application of Apache Drill
Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics is using Drill to quickly process trillions of records and execute queries.
Features of Apache Drill:
Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill plays well with Hive by allowing developers to reuse their existing Hive deployment.
**Extensibility** – Drill provides an extensible architecture at all layers, including the query layer, query optimization, and the client API. We can extend any layer for the specific needs of an organization.
**Flexibility** – Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allow efficient processing.
**Dynamic schema discovery** – Apache Drill does not require a schema or type specification for data in order to start the query execution process. Instead, Drill starts processing the data in units called record batches and discovers the schema on the fly during processing.
**Decentralized metadata** – Unlike other SQL-on-Hadoop technologies, Drill does not have a centralized metadata requirement. Drill users do not need to create and manage tables in metadata in order to query data.
2.11. Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
**Clustering** – It takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
**Collaborative filtering** – It mines user behavior and makes product recommendations (e.g. Amazon recommendations); see the sketch after this list.
**Classifications** – It learns from existing categorizations and then assigns unclassified items to the best category.
**Frequent pattern mining** – It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
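As referenced in the collaborative-filtering item above, here is a minimal sketch using Mahout's Taste recommender API. The ratings.csv input (one userID,itemID,preference triple per line) and the user ID are hypothetical.

```java
// User-based collaborative filtering with Mahout's Taste API.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommend {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));          // userID,itemID,pref
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 42, mined from the behaviour of similar users.
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " score=" + item.getValue());
    }
  }
}
```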
2.12. Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
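The sketch below shows one hedged way to trigger the same transfer that the `sqoop import` command line performs, via Sqoop's Java entry point. The JDBC URL, credentials, table, and target directory are placeholders for a hypothetical source database.

```java
// Kick off a Sqoop import (relational table -> HDFS directory) programmatically.
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImport {
  public static void main(String[] args) {
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales",   // source relational database (placeholder)
        "--username", "etl", "--password", "secret",
        "--table", "orders",                        // table to pull from
        "--target-dir", "/user/etl/orders",         // destination directory in HDFS
        "--num-mappers", "4"                        // parallel data transfer
    };
    int exitCode = Sqoop.runTool(importArgs, new Configuration());
    System.exit(exitCode);
  }
}
```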
Apache Sqoop Diagram
Features of Apache Sqoop:
**Import sequential datasets from mainframe** – Sqoop satisfies the growing need to move data from the mainframe to HDFS.
**Import direct to ORC files** – Improves compression and lightweight indexing, and improves query performance.
**Parallel data transfer** – For faster performance and optimal system utilization.
**Efficient data analysis** – Improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
**Fast data copies** – From an external system into Hadoop.
2.13. Apache Flume
Flume efficiently collects, aggregates, and moves large amounts of data from its origin and sends it back to HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows the data to flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can get data from multiple servers into Hadoop immediately.
Apache Flume
Refer Flume Comprehensive Guide for more details.
2.14. Ambari
Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring, and securing an Apache Hadoop cluster. Hadoop management gets simpler as Ambari provides a consistent, secure platform for operational control.
Ambari Diagram
Features of Ambari:
**Simplified installation, configuration, and management** – Ambari easily and efficiently creates and manages clusters at scale.
**Centralized security setup** – Ambari reduces the complexity of administering and configuring cluster security across the entire platform.
**Highly extensible and customizable** – Ambari is highly extensible for bringing custom services under management.
**Full visibility into cluster health** – Ambari ensures that the cluster is healthy and available with a holistic approach to monitoring.
2.15. Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper manages and coordinates a large cluster of machines.
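A minimal sketch of using a znode for shared configuration with the ZooKeeper Java client is shown below. The ensemble address and znode path are placeholders.

```java
// Store and read a piece of shared configuration as a ZooKeeper znode.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; the watcher callback is left empty here.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

    String path = "/app-config";                       // top-level znode (placeholder)
    byte[] value = "jdbc:mysql://dbhost/app".getBytes("UTF-8");
    if (zk.exists(path, false) == null) {
      zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, value, -1);                     // -1 = ignore the version check
    }
    // Every client in the cluster can now read the same value.
    System.out.println(new String(zk.getData(path, false, null), "UTF-8"));
    zk.close();
  }
}
```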
ZooKeeper Diagram
Features of Zookeeper:
**Fast** – Zookeeper is fast with workloads where reads of data are more common than writes. The ideal read/write ratio is 10:1.
**Ordered** – Zookeeper maintains a record of all transactions.
2.16. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architecture center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
Oozie Diagram
In Oozie, users can create a Directed Acyclic Graph of workflow, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is very flexible as well. One can easily start, stop, suspend, and rerun jobs. It is even possible to skip a specific failed node or rerun it in Oozie.
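A small sketch of submitting and polling such a workflow with the Oozie Java client follows. The Oozie URL, the HDFS application path (a directory containing workflow.xml), and the nameNode/jobTracker properties are placeholders for a hypothetical cluster.

```java
// Submit a workflow DAG to Oozie and poll until it finishes.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmit {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/wf-app"); // dir with workflow.xml
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");

    String jobId = oozie.run(conf);                      // start the workflow DAG
    while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10_000);                              // poll until the DAG finishes
    }
    System.out.println("Workflow finished: " + oozie.getJobInfo(jobId).getStatus());
  }
}
```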
There are two basic types of Oozie jobs:
**Oozie workflow** – It stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
**Oozie Coordinator** – It runs workflow jobs based on predefined schedules and the availability of data.
This was all about the components of the Hadoop ecosystem.
3. Conclusion: Components of Hadoop Ecosystem
We have covered all the Hadoop ecosystem components in detail. Hence, these Hadoop ecosystem components empower Hadoop's functionality. Now that you have learned the components of the Hadoop ecosystem, refer to the Hadoop installation guide to use Hadoop's functionality. If you like this blog or have any query, please feel free to share with us.
Reference for Hadoop
https://data-flair.training/blogs/hadoop-ecosystem-components
- How does Hadoop work?
- Limitations of Apache Hadoop