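Hadoop Modules
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heatmaps), for viewing MapReduce, Pig, and Hive applications visually, and for diagnosing their performance characteristics in a user-friendly way.
Spark: A fast, general-purpose compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.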
Source: http://hadoop.apache.org/
Advantages and Features
1. Hadoop is Open Source
Hadoop is an open-source project: its source code is freely available for inspection, modification, and analysis, so enterprises can adapt the code to their own requirements.
2. Hadoop Cluster is Highly Scalable
A Hadoop cluster scales in two ways: horizontally, by adding more nodes, and vertically, by increasing the hardware capacity of existing nodes. Either approach increases the computing power available to the Hadoop framework.
3. Hadoop provides Fault Tolerance
Fault tolerance is one of the most important features of Hadoop. HDFS in Hadoop 2 uses a replication mechanism to provide fault tolerance.
HDFS creates replicas of each block on different machines according to the replication factor (3 by default). If any machine in the cluster goes down, the data can still be read from the other machines that hold a replica of the same block.
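The replication idea above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function names `place_replicas` and `readable_blocks` are hypothetical, and the round-robin placement stands in for the real HDFS placement policy, which is more involved.

```python
def place_replicas(block_ids, nodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct nodes.

    Round-robin placement is an assumption made for this sketch;
    it is not the placement policy HDFS actually uses.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            nodes[(i + k) % len(nodes)] for k in range(replication_factor)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)

# With a replication factor of 3, losing one or two machines
# still leaves every block readable from a surviving replica.
print(readable_blocks(placement, {"node1"}))
print(readable_blocks(placement, {"node1", "node2"}))
```

In this toy cluster a block becomes unreadable only when all three machines holding its replicas fail at the same time.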
Hadoop 3 adds erasure coding as an alternative to replication; replication remains available and is still the default. Erasure coding provides a comparable level of fault tolerance with less space: its storage overhead is no more than 50%, compared with 200% for three-way replication.
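The overhead figures can be checked with simple arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data units and 3 parity units, which is one common erasure coding configuration; the helper names are hypothetical.

```python
def replication_overhead(replication_factor):
    """Extra storage, in percent, beyond one copy of the data."""
    return (replication_factor - 1) * 100.0


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, in percent, for a data/parity erasure coding layout."""
    return parity_units / data_units * 100.0


# Three-way replication stores two extra copies of every block.
print(replication_overhead(3))          # 200.0
# Assumed layout: 6 data units + 3 parity units per stripe.
print(erasure_coding_overhead(6, 3))    # 50.0
```

Under this assumed layout, a stripe survives the loss of any 3 of its 9 units, while three-way replication survives the loss of 2 of its 3 copies.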
4. Hadoop provides High Availability
This feature of Hadoop keeps data available even under unfavorable conditions.
Because of Hadoop's fault tolerance, if any DataNode goes down, the data is still available to the user from other DataNodes that hold a copy of the same data.
In addition, a high-availability Hadoop cluster runs two or more NameNodes (one active, the others passive) in a hot standby configuration. The active NameNode serves client requests. A passive NameNode is a standby node that reads the edit log changes made by the active NameNode and applies them to its own namespace.
If the active node fails, a passive node takes over its responsibilities. Thus, even if a NameNode goes down, files remain available and accessible to users.
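The active/standby behavior described above can be sketched in a few lines. This is a toy model, not Hadoop's implementation: the `NameNode` class, the tuple format of the edit log, and the `failover` function are all assumptions made for illustration, and real HDFS high availability also involves shared edit storage and fencing, which are omitted here.

```python
class NameNode:
    """Toy NameNode holding a namespace built by replaying an edit log."""

    def __init__(self, name):
        self.name = name
        self.namespace = {}  # path -> list of block ids
        self.applied = 0     # number of edit-log entries applied so far

    def apply_edits(self, edit_log):
        """Apply only the edits this node has not seen yet."""
        for op, path, blocks in edit_log[self.applied:]:
            if op == "create":
                self.namespace[path] = blocks
            elif op == "delete":
                self.namespace.pop(path, None)
        self.applied = len(edit_log)


def failover(standby, edit_log):
    """The standby catches up on unapplied edits, then acts as the active node."""
    standby.apply_edits(edit_log)
    return standby


# Hypothetical edit log: (operation, path, blocks)
edit_log = [
    ("create", "/data/a.txt", ["blk_0"]),
    ("create", "/data/b.txt", ["blk_1"]),
    ("delete", "/data/a.txt", None),
]

active = NameNode("nn1")
standby = NameNode("nn2")
active.apply_edits(edit_log)
standby.apply_edits(edit_log[:2])  # the standby lags one edit behind

# Simulate the active node failing: the standby takes over.
new_active = failover(standby, edit_log)
print(new_active.name, new_active.namespace)
```

Because the standby replays the same edit log, its namespace matches the failed node's namespace once it has caught up, which is why clients can keep accessing files.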
5. Hadoop is very Cost-Effective
Because a Hadoop cluster is built from inexpensive commodity hardware, it provides a cost-effective solution for storing and processing big data. As an open-source product, Hadoop also requires no license.
6. Hadoop is Faster in Data Processing
Hadoop stores data in a distributed fashion, which allows the data to be processed in parallel across the nodes of a cluster. This parallelism gives the Hadoop framework its fast processing capability on large data sets.
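The idea of processing partitioned data in parallel and then combining the partial results can be sketched as a word count. This is a single-machine illustration using Python threads, not Hadoop MapReduce itself; the partitions stand in for data stored on different nodes, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_partition(lines):
    """Count words within one partition (the "map" side)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def word_count(partitions):
    """Process partitions concurrently, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the "reduce" side: sum counts per word
    return total


# Each inner list plays the role of the data held by one node.
partitions = [["big data big"], ["data node"], ["big node node"]]
result = word_count(partitions)
print(result)
```

Each partition is counted independently, so adding more workers (or, in a real cluster, more nodes) lets more partitions be processed at the same time.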
7. Hadoop is based on Data Locality concept
Hadoop is well known for its data locality feature: it moves the computation logic to the data rather than moving the data to the computation logic. This reduces bandwidth utilization in the system.
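A scheduler that prefers data locality might look roughly like the sketch below. The function `schedule_task`, its arguments, and the returned labels are hypothetical and only illustrate the preference order; they are not Hadoop's scheduler API.

```python
def schedule_task(block, block_locations, free_nodes):
    """Pick a node for a task that reads `block`.

    Prefer a free node that already stores the block, so no data has to
    cross the network; otherwise fall back to any free node.
    """
    local = [n for n in block_locations.get(block, []) if n in free_nodes]
    if local:
        return local[0], "node-local"
    if free_nodes:
        return sorted(free_nodes)[0], "remote"
    return None, "wait"


# Hypothetical cluster state: which nodes hold a replica of each block.
block_locations = {"blk_0": ["node2", "node3"], "blk_1": ["node4"]}

print(schedule_task("blk_0", block_locations, {"node1", "node3"}))
print(schedule_task("blk_1", block_locations, {"node1", "node3"}))
print(schedule_task("blk_1", block_locations, set()))
```

In the first call the task runs where the block already lives; in the second, no holder of the block is free, so the block would have to be read over the network.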
8. Hadoop provides Feasibility
Unlike traditional systems, Hadoop can process unstructured data. This makes it feasible for users to analyze data of any format and size.
9. Hadoop is Easy to use
Hadoop is easy to use because clients do not have to worry about distributed computing. The framework itself handles the distribution of the processing.
10. Hadoop ensures Data Reliability
Because data is replicated across the cluster, Hadoop stores data reliably on the cluster machines despite machine failures.
The framework itself provides mechanisms to ensure data reliability: the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. If a machine goes down or data gets corrupted, the data is still stored reliably in the cluster and is accessible from another machine that holds a copy.
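One way corruption can be detected and worked around is by comparing a stored checksum against each replica and reading from the first replica that still matches. The sketch below assumes this approach for illustration only; it uses `zlib.crc32` as a stand-in checksum, and `read_block` is a hypothetical helper, not a Hadoop API.

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum for this sketch (CRC-32 from the standard library)."""
    return zlib.crc32(data)


def read_block(replicas, expected_crc):
    """Return (node, data) from the first replica whose checksum matches.

    `replicas` is a list of (node, data) pairs; corrupt replicas are skipped.
    """
    for node, data in replicas:
        if checksum(data) == expected_crc:
            return node, data
    raise IOError("all replicas are corrupt")


original = b"hadoop block contents"
expected = checksum(original)

# The replica on node1 has been corrupted; node2 and node3 are intact.
replicas = [
    ("node1", b"hadoop blocc contents"),
    ("node2", original),
    ("node3", original),
]

node, data = read_block(replicas, expected)
print(node, data == original)
```

The corrupt copy is skipped and the read is served from a healthy replica, which is the behavior the paragraph above describes.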
Source: https://data-flair.training/blogs/features-of-hadoop-and-design-principles/
Hadoop use cases:
Simple numerical summaries (average, minimum, sum) were sufficient for the business problems of the 1980s and 1990s. Large amounts of complex data, though, require new techniques. Recognizing customer preferences requires analysis of purchase history, but also a close examination of browsing behavior and products viewed, comments and reviews logged on a web site, and even complaints and issues raised with customer support staff. Predicting behavior demands that customers be grouped by their preferences, so that the behavior of one individual in a group can be used to predict the behavior of the others. The algorithms involved include natural language processing, pattern recognition, machine learning, and more. These techniques run very well on Hadoop.
Source: http://chi.hadoop.hk/About/Hadoop-Executive-Summary