
003 Hadoop Ecosystem – 15 Must Know Hadoop Components

In this tutorial, we will have an overview of the various Hadoop Ecosystem components. These ecosystem components are actually different services deployed by various enterprises. We can integrate them to work with a variety of data. Each of the Hadoop Ecosystem components is developed to deliver an explicit function, and each has its own developer community and individual release cycle.


So, let’s explore Hadoop Ecosystem Components.



Hadoop Ecosystem Components


The Hadoop Ecosystem is a suite of services that work together to solve the Big Data problem. The different components of the Hadoop Ecosystem are as follows:-


1. Hadoop Distributed File System


HDFS is the foundation of Hadoop and hence is a very important component of the Hadoop ecosystem. It is Java-based software that provides many features like scalability, high availability, fault tolerance, cost effectiveness, etc. It also provides robust distributed data storage for Hadoop. We can deploy many other software frameworks over HDFS.


Components of HDFS:-

The three major components of Hadoop HDFS are as follows:-



a. DataNode

These are the nodes that store the actual data. HDFS stores the data in a distributed manner. It divides the input files of varied formats into blocks. The DataNodes store each of these blocks. Following are the functions of DataNodes:-


  • On startup, the DataNode performs a handshake with the NameNode, which verifies the namespace ID and the software version of the DataNode.

  • Also, it sends a block report to the NameNode and verifies the block replicas.

  • It sends a heartbeat to the NameNode every 3 seconds to indicate that it is alive.


b. NameNode

NameNode is nothing but the master node. The NameNode is responsible for managing the file system namespace and controlling the client’s access to files. Also, it executes tasks such as opening, closing and naming files and directories. NameNode has two major files – FSImage and Edits log.


FSImage – FSImage is a point-in-time snapshot of HDFS’s metadata. It contains information like file permission, disk quota, modification timestamp, access time etc.


Edits log – It contains modifications made on top of FSImage. It records incremental changes like renaming a file, appending data to a file, etc.


Whenever the NameNode starts, it applies the Edits log to FSImage, and the new FSImage gets loaded on the NameNode.


c. Secondary NameNode

If the NameNode has not restarted for months, the size of the Edits log increases. This, in turn, increases the downtime of the cluster on the next restart of the NameNode. In this case, the Secondary NameNode comes into the picture. The Secondary NameNode applies the Edits log to FSImage at regular intervals and updates the new FSImage on the primary NameNode.

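All of this bookkeeping stays hidden behind the HDFS client API. As a minimal sketch (the NameNode address hdfs://namenode:9000 and the file path are hypothetical), writing and then reading back a file from Java looks roughly like this:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real deployment this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // Write: the client asks the NameNode where to place blocks,
            // then streams the bytes to the DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read: metadata comes from the NameNode, data comes from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```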

2. MapReduce

MapReduce is the data processing component of Hadoop. It applies the computation on sets of data in parallel thereby improving the performance. MapReduce works in two phases –


Map Phase – This phase takes input as key-value pairs and produces output as key-value pairs. You can write custom business logic in this phase. The Map phase processes the data and gives it to the next phase.


Reduce Phase – The MapReduce framework sorts the key-value pairs before giving the data to this phase. This phase applies summary-type calculations to the key-value pairs.



  • Mapper reads the block of data and converts it into key-value pairs.

  • Now, these key-value pairs are input to the reducer.

  • The reducer receives data tuples from multiple mappers.

  • Reducer applies aggregation to these tuples based on the key.

  • The final output from reducer gets written to HDFS.


The MapReduce framework takes care of failures. If one node goes down, it recovers the data from another node.

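The two phases map directly onto the Mapper and Reducer classes of the Java MapReduce API. The classic word-count job, sketched below with hypothetical HDFS input and output paths passed as program arguments, shows the key-value flow described above:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: reads a line and emits (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework has already sorted and grouped by key;
    // here we aggregate the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```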

3. Yarn


Yarn is short for Yet Another Resource Negotiator. It is like the operating system of Hadoop, as it monitors and manages the resources. Yarn came into the picture with the launch of Hadoop 2.x in order to allow different workloads. It handles workloads like stream processing, interactive processing and batch processing over a single platform. Yarn has two main components – Node Manager and Resource Manager.


a. Node Manager

It is Yarn’s per-node agent and takes care of an individual compute node in a Hadoop cluster. It monitors resource usage like CPU, memory, etc. of the local node and reports the same to the Resource Manager.


b. Resource Manager

It is responsible for tracking the resources in the cluster and scheduling tasks like map-reduce jobs.


Also, we have the Application Master and Scheduler in Yarn. Let us take a look at them.



Application Master has two functions and they are:-


  • Negotiating resources from Resource Manager

  • Working with NodeManager to monitor and execute the sub-task.


Following are the functions of Resource Scheduler:-


  • It allocates resources to various running applications

  • But it does not monitor the status of the application. So in the event of failure of the task, it does not restart the same.


We have another concept called Container. It is nothing but a fraction of NodeManager capacity i.e. CPU, memory, disk, network etc.

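As a rough illustration of the Resource Manager's cluster-wide view, the sketch below uses the YARN client API to list the running NodeManagers and the capacity and usage each one reports (it assumes a yarn-site.xml pointing at your Resource Manager is on the classpath):

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // reads yarn-site.xml from the classpath
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();
        try {
            // Every NodeManager reports its capacity and current usage to the Resource Manager.
            List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.printf("%s capability=%s used=%s%n",
                        node.getNodeId(), node.getCapability(), node.getUsed());
            }
        } finally {
            client.stop();
        }
    }
}
```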

4. Hive


Hive is a data warehouse project built on top of Apache Hadoop which provides data query and analysis. It has a language of its own called HQL, or Hive Query Language. HQL automatically translates the queries into the corresponding map-reduce jobs.


The main parts of Hive are –

  • MetaStore – it stores metadata
  • Driver – Manages the lifecycle of HQL statement
  • Query compiler – Compiles HQL into DAG i.e. Directed Acyclic Graph
  • Hive server – Provides an interface for JDBC/ODBC clients.


Facebook designed Hive for people who are comfortable with SQL. It has two basic components – the Hive command line and JDBC/ODBC drivers. The Hive command line is an interface for the execution of HQL commands, while JDBC/ODBC establishes the connection with data storage. Hive is highly scalable. It can handle both types of workloads, i.e. batch processing and interactive processing. It supports the native data types of SQL. Hive provides many pre-defined functions for analysis, but you can also define your own custom functions called UDFs, or user-defined functions.

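Because HiveServer2 exposes a JDBC interface, HQL can be submitted from plain Java. A minimal sketch, where the HiveServer2 address, the credentials and the employees table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint; 10000 is the usual default port.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks like SQL; Hive translates it into the corresponding jobs.
            String hql = "SELECT department, COUNT(*) AS cnt "
                       + "FROM employees GROUP BY department";

            try (ResultSet rs = stmt.executeQuery(hql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("department") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}
```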

5. Pig


Pig is a SQL-like language used for querying and analyzing data stored in HDFS. Yahoo was the original creator of Pig. It uses the Pig Latin language. It loads the data, applies filters to it and dumps the data in the required format. Pig also consists of a JVM-based runtime called Pig Runtime. Various features of Pig are as follows:-


  • Extensibility – For carrying out special purpose processing, users can create their own custom function.
  • Optimization opportunities – Pig automatically optimizes the query allowing users to focus on semantics rather than efficiency.
  • Handles all kinds of data – Pig analyzes both structured as well as unstructured data.


a. How does Pig work?

  • First, the load command loads the data.

  • At the backend, the compiler converts Pig Latin into a sequence of map-reduce jobs.

  • Over this data, we perform various functions like joining, sorting, grouping, filtering etc.

  • Now, you can dump the output on the screen or store it in an HDFS file.

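The same load, filter, group and store flow can be driven from Java through Pig's PigServer class, which compiles the embedded Pig Latin into map-reduce jobs. In this sketch the input path, the field layout and the output path are hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs against the cluster; ExecType.LOCAL is handy for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        try {
            // Each registered statement is Pig Latin, not Java.
            pig.registerQuery("logs = LOAD '/data/access_log.txt' "
                    + "USING PigStorage(' ') AS (ip:chararray, url:chararray, status:int);");
            pig.registerQuery("errors = FILTER logs BY status >= 500;");
            pig.registerQuery("by_url = GROUP errors BY url;");
            pig.registerQuery("counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;");

            // store() triggers the actual execution as a sequence of map-reduce jobs.
            pig.store("counts", "/output/error_counts");
        } finally {
            pig.shutdown();
        }
    }
}
```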

6. HBase


HBase is a NoSQL database built on top of HDFS. The main features of HBase are that it is an open-source, non-relational, distributed database. It imitates Google’s Bigtable and is written in Java. It provides real-time read/write access to large datasets. Its various components are as follows:-

a. HBase Master

The HBase Master performs the following functions:


  • Maintains and monitors the Hadoop cluster.

  • Performs administration of the database.

  • Controls the failover.

  • HMaster handles DDL operations.


b. RegionServer

RegionServer is a process which handles read, write, update and delete requests from clients. It runs on every node of the Hadoop cluster, that is, on every HDFS DataNode.


HBase is a column-oriented database management system. It runs on top of HDFS. It suits sparse data sets, which are common in Big Data use cases. HBase supports writing applications in Apache Avro, REST and Thrift. Apache HBase has low-latency storage. Enterprises use this for real-time analysis.


HBase is designed to contain many tables. Each of these tables must have a primary key. Access attempts to HBase tables use this primary key.


As an example, let us consider an HBase table storing diagnostic logs from a server. In this case, a typical log row will contain columns such as the timestamp when the log was written and the server from which the log originated.

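Continuing the diagnostic-log example, the sketch below writes and reads back one such row through the HBase Java client API. The table name logs, the column family info and the row-key layout are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("logs"))) {

            // Row key = server + timestamp, so rows from one server sort together.
            String rowKey = "web01-" + System.currentTimeMillis();

            // The write goes to the RegionServer hosting this row's region.
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("level"), Bytes.toBytes("ERROR"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("message"), Bytes.toBytes("disk almost full"));
            table.put(put);

            // Random reads by primary key are HBase's sweet spot.
            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("message"))));
        }
    }
}
```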

7. Mahout

Mahout provides a platform for creating machine learning applications which are scalable.


a. What is Machine Learning?


Machine learning algorithms allow us to create self-evolving machines without being explicitly programmed. They make future decisions based on user behavior, past experiences and data patterns.


b. What does Mahout do?


It performs collaborative filtering, clustering, and classification.


  • Collaborative filtering – Mahout mines user behavior patterns and, based on these, makes recommendations to users.

  • Clustering – It groups together similar types of data like articles, blogs, research papers, news etc.

  • Classification – It means categorizing data into various sub-categories. For example, we can classify an article as a blog, essay, research paper and so on.

  • Frequent itemset mining – It looks for items generally bought together and gives suggestions based on that. For instance, we usually buy a cell phone and its cover together. So, when you buy a cell phone, it will also suggest buying a cover.


8. Zookeeper


Zookeeper coordinates between various services in the Hadoop ecosystem. It saves the time required for synchronization, configuration maintenance, grouping, and naming. Following are the features of Zookeeper:-


  • Speed – Zookeeper is fast for workloads where reads outnumber writes. A typical read:write ratio is 10:1.

  • Organized – Zookeeper maintains a record of all transactions.

  • Simple – It maintains a single hierarchical namespace, similar to directories and files.

  • Reliable – We can replicate Zookeeper over a set of hosts and they are aware of each other. There is no single point of failure. As long as a majority of the servers are available, Zookeeper is available.


Why do we need Zookeeper in Hadoop?

Hadoop faces many problems as it runs distributed applications. One of these problems is deadlock. A deadlock occurs when two or more tasks fight for the same resource. For instance, task T1 holds resource R1 and is waiting for resource R2 held by task T2, while task T2 is waiting for resource R1 held by task T1. In such a scenario a deadlock occurs: both tasks T1 and T2 would stay locked waiting for resources. Zookeeper resolves the deadlock condition via synchronization.


Another problem is a race condition. This occurs when the machine tries to perform two or more operations at a time. Zookeeper solves this problem through the property of serialization.

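As a small illustration of the coordination Zookeeper provides, the sketch below creates an ephemeral znode that acts as a simple lock marker for resource R1: only one process can hold the path at a time, and the node disappears automatically when that process's session ends. The ensemble address and the znode path are hypothetical:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical ensemble address; 2181 is the usual client port.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        try {
            // An ephemeral node lives only as long as this session; if the process dies,
            // the marker is released automatically, so tasks cannot stay blocked on it forever.
            zk.create("/lock-resource-R1", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Acquired the marker for resource R1");
            // ... do the protected work here ...
        } catch (KeeperException.NodeExistsException e) {
            // Another task already holds the resource; wait or retry instead of blocking blindly.
            System.out.println("Resource R1 is held by another task");
        } finally {
            zk.close();
        }
    }
}
```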

9. Oozie


It is a workflow scheduler system for managing Hadoop jobs. It supports Hadoop jobs for Map-Reduce, Pig, Hive, and Sqoop. Oozie combines multiple jobs into a single unit of work. It is scalable and can manage thousands of workflows in a Hadoop cluster. Oozie works by creating a DAG, i.e. a Directed Acyclic Graph, of the workflow. It is very flexible as it can start, stop, suspend and rerun failed jobs.


Oozie is an open-source web application written in Java. Oozie is scalable and can execute thousands of workflows containing dozens of Hadoop jobs.


There are three basic types of Oozie jobs and they are as follows:-


  • Workflow – It stores and runs a workflow composed of Hadoop jobs. It stores the job as Directed Acyclic Graph to determine the sequence of actions that will get executed.

  • Coordinator – It runs workflow jobs based on predefined schedules and availability of data.

  • Bundle – This is nothing but a package of many coordinators and workflow jobs.


How does Oozie work?


Oozie runs as a service in the Hadoop cluster. Clients submit workflows to run, either immediately or later.


There are two types of nodes in Oozie: action nodes and control flow nodes.


  • Action Node – It represents the task in the workflow like MapReduce job, shell script, pig or hive jobs etc.

  • Control flow Node – It controls the workflow between actions by employing conditional logic. In this, the previous action decides which branch to follow.


Start, End and Error Nodes fall under this category.



  • Start Node signals the start of the workflow job.

  • End Node designates the end of the job.

  • ErrorNode signals the error and gives an error message.

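A workflow definition stored on HDFS is normally submitted through the oozie CLI or the Oozie Java client. The sketch below uses the client API; the Oozie URL and the HDFS application path are hypothetical, and any ${...} parameters referenced by the workflow would be set as additional properties:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitOozieWorkflow {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL; 11000 is the usual default port.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS directory containing workflow.xml (hypothetical path).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/dataflair/wordcount-wf");

        // Submit and start the workflow; Oozie executes it as a DAG of actions.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow " + jobId);

        // Poll until the workflow leaves the RUNNING state.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```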

10. Sqoop


Sqoop imports data from external sources into compatible Hadoop Ecosystem components like HDFS, Hive, HBase etc. It also transfers data from Hadoop to other external sources. It works with RDBMSs like Teradata, Oracle, MySQL and so on. The major difference between Sqoop and Flume is that Flume does not work with structured data, but Sqoop can deal with structured as well as unstructured data.


Let us see how Sqoop works


When we submit a Sqoop command, at the back-end it gets divided into a number of sub-tasks. These sub-tasks are nothing but map tasks. Each map task imports a part of the data into Hadoop. Hence all the map tasks taken together import the whole data.


Sqoop export also works in a similar way. The only difference is that instead of importing, the map tasks export their part of the data from Hadoop to the destination database.

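The sqoop command line is the usual entry point. To keep the examples in one language, the sketch below simply launches it from Java; the JDBC URL, credentials file, table and target directory are hypothetical, and sqoop is assumed to be on the PATH:

```java
import java.io.IOException;

public class SqoopImportLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Standard sqoop-import flags; the connection details and table are hypothetical.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/sales",
                "--username", "etl",
                "--password-file", "/user/etl/.mysql.password",
                "--table", "orders",
                "--target-dir", "/warehouse/orders",
                "--num-mappers", "4");   // Sqoop splits the import into 4 map tasks

        pb.inheritIO();                  // stream Sqoop's console output to this process
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}
```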

11. Flume


It is a service which helps to ingest structured and semi-structured data into HDFS. Flume works on the principle of distributed processing. It aids in the collection, aggregation, and movement of huge amounts of data. Flume has three components: source, channel, and sink.


Source – It accepts the data from the incoming stream and stores the data in the channel


Channel – It is a medium of temporary storage between the source of the data and persistent storage of HDFS.


Sink – This component collects the data from the channel and writes it permanently to the HDFS.

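On the producing side, an application can hand events to a Flume agent's Avro source through the Flume client SDK; the agent's channel and sink then move them into HDFS. A sketch, assuming an agent with an Avro source listening on the hypothetical address flume-agent:41414:

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) {
        // Connects to the Avro source of a Flume agent (hypothetical host/port).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent", 41414);
        try {
            Event event = EventBuilder.withBody(
                    "user=42 action=checkout amount=19.99", StandardCharsets.UTF_8);
            // The source puts the event into the channel; the sink drains it to HDFS.
            client.append(event);
        } catch (EventDeliveryException e) {
            System.err.println("Flume agent did not accept the event: " + e.getMessage());
        } finally {
            client.close();
        }
    }
}
```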

12. Ambari

Ambari is another Hadoop ecosystem component. It is responsible for provisioning, managing, monitoring and securing the Hadoop cluster. Following are the different features of Ambari:

  • Simplified cluster configuration, management, and installation

  • Ambari reduces the complexity of configuring and administration of Hadoop cluster security.

  • It ensures that the cluster is healthy and available for monitoring.


Ambari gives:-

Hadoop cluster provisioning


  • It gives a step-by-step procedure for installing Hadoop services on the Hadoop cluster.

  • It also handles configuration of services across the Hadoop cluster.


Hadoop cluster management


  • It provides centralized service for starting, stopping and reconfiguring services on the network of machines.


Hadoop cluster monitoring


  • To monitor health and status, Ambari provides a dashboard.

  • Ambari alert framework alerts the user when the node goes down or has low disk space etc.


13. Apache Drill

Apache Drill is a schema-free SQL query engine. It works on top of Hadoop, NoSQL and cloud storage. Its main purpose is large-scale processing of data with low latency. It is a distributed query processing engine. We can query petabytes of data using Drill. It can scale to several thousands of nodes. It supports NoSQL databases and storage systems like Azure Blob Storage, Google Cloud Storage, Amazon S3, HBase, MongoDB and so on.


Let us look at some of the features of Drill:-


  • A variety of data sources can be the basis of a single query.

  • Drill follows ANSI SQL.

  • It can support millions of users and serve their queries over large data sets.

  • Drill gives faster insights without ETL overheads like loading, schema creation, maintenance, transformation etc.

  • It can analyze multi-structured and nested data without having to do transformations or filtering.

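Since Drill follows ANSI SQL and ships a JDBC driver, querying it from Java looks like querying any relational database; the schema-free part is that the FROM clause can point straight at a file. A sketch, assuming Drillbits registered in Zookeeper at the hypothetical address zk1:2181 and using the employee.json sample that ships with Drill:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");

        // Drillbits are discovered through Zookeeper; no schema is declared up front.
        String url = "jdbc:drill:zk=zk1:2181";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Query a JSON file directly -- no ETL, no table definition.
             ResultSet rs = stmt.executeQuery(
                     "SELECT full_name, salary FROM cp.`employee.json` LIMIT 5")) {

            while (rs.next()) {
                System.out.println(rs.getString("full_name") + "\t" + rs.getDouble("salary"));
            }
        }
    }
}
```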

14. Apache Spark


Apache Spark unifies all kinds of Big Data processing under one umbrella. It has built-in libraries for streaming, SQL, machine learning and graph processing. Apache Spark is lightning fast. It gives good performance for both batch and stream processing. It does this with the help of a DAG scheduler, a query optimizer, and a physical execution engine.


Spark offers 80 high-level operators which make it easy to build parallel applications. Spark has various libraries like MLlib for machine learning, GraphX for graph processing, SQL and DataFrames, and Spark Streaming. One can run Spark in standalone cluster mode, on Hadoop, on Mesos, or on Kubernetes. One can write Spark applications using SQL, R, Python, Scala, and Java. Scala is the native language of Spark. It was originally developed at the University of California, Berkeley. Spark does in-memory calculations. This makes Spark faster than Hadoop map-reduce.

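For comparison with the MapReduce example earlier, here is the same word count written against Spark's Java API as a sketch with hypothetical HDFS paths; the chained high-level operators replace the separate Mapper and Reducer classes:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Hypothetical HDFS input path.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt");

            // The same logic as the MapReduce word count, expressed with high-level operators.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Transformations are lazy; this action triggers the in-memory execution.
            counts.saveAsTextFile("hdfs://namenode:9000/data/word-counts");
        }
    }
}
```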

15. Solr And Lucene


Apache Solr and Apache Lucene are two services used for searching and indexing in the Hadoop ecosystem. Apache Solr is an application built around Apache Lucene. Apache Lucene is written in Java. It uses Java libraries for searching and indexing. Apache Solr is an open-source, blazing fast search platform.


Various features of Solr are as follows –


  • Solr is highly scalable, reliable and fault tolerant.

  • It provides distributed indexing, automated failover and recovery, load-balanced query, centralized configuration and much more.

  • You can query Solr using HTTP GET and receive the result in JSON, binary, CSV and XML.

  • Solr provides matching capabilities like phrases, wildcards, grouping, joining and much more.

  • It ships with a built-in administrative interface enabling management of Solr instances.

  • Solr takes advantage of Lucene’s near real-time indexing. It enables you to see your content when you want to see it.

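Applications usually talk to Solr over HTTP through the SolrJ client. The sketch below indexes one document and queries it back with a wildcard; the core name articles, the host and the id/title fields are illustrative assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core URL; 8983 is the usual default port.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr-host:8983/solr/articles").build()) {

            // Index a document; Lucene's near real-time indexing makes it searchable quickly.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "hadoop-ecosystem-1");
            doc.addField("title", "15 Must Know Hadoop Components");
            solr.add(doc);
            solr.commit();

            // Full-text query with a wildcard, one of the matching features listed above.
            QueryResponse response = solr.query(new SolrQuery("title:hadoop*"));
            for (SolrDocument hit : response.getResults()) {
                System.out.println(hit.getFieldValue("id") + " -> " + hit.getFieldValue("title"));
            }
        }
    }
}
```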

So, this was all about the Hadoop Ecosystem. Hope you liked this article.


Summary

The Hadoop ecosystem elements described above are all open-source Apache Hadoop projects. Many commercial applications use these ecosystem elements. Let us summarize the Hadoop ecosystem components. At the core, we have HDFS for data storage, map-reduce for data processing and Yarn as a resource manager. Then we have Hive – a data analysis tool, Pig – a SQL-like scripting language, HBase – a NoSQL database, Mahout – a machine learning tool, Zookeeper – a synchronization tool, Oozie – a workflow scheduler system, Sqoop – a structured data importing and exporting utility, Flume – a data transfer tool for unstructured and semi-structured data, Ambari – a tool for managing and securing Hadoop clusters, and lastly Avro – an RPC and data serialization framework.


Since you are familiar with the Hadoop ecosystem and its components, you are ready to learn more about Hadoop. Check out the Hadoop training by DataFlair.


Still, if you have any doubt regarding the Hadoop Ecosystem, ask in the comment section.


https://data-flair.training/blogs/hadoop-ecosystem
