Hadoop Applications

Hadoop Modules and Related Projects

  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heatmaps), and it can display MapReduce, Pig, and Hive applications visually, along with features for diagnosing their performance characteristics in a user-friendly manner.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine for executing an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
  • ZooKeeper: A high-performance coordination service for distributed applications.

 

Source: http://hadoop.apache.org/
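To make the MapReduce entry in the list above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only illustrate the programming model that the framework runs in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster, each line (or block of lines) would be mapped on a
    # different node; here everything runs in one process for illustration.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data on Hadoop", "Hadoop stores big data"])
print(counts)
```

Because each call to `map_phase` depends only on its own input line, and each call to `reduce_phase` depends only on one key, both phases can be spread across many machines without changing the result.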

 

 

 

Advantages and Features

1. Hadoop is Open Source

Hadoop is an open-source project, which means its source code is freely available for inspection, modification, and analysis, so enterprises can adapt the code to their own requirements.

2. Hadoop cluster is Highly Scalable

A Hadoop cluster is scalable: we can add more nodes (horizontal scaling) or increase the hardware capacity of existing nodes (vertical scaling) to gain more computing power. The Hadoop framework therefore scales both horizontally and vertically.

 

3. Hadoop provides Fault Tolerance

Fault tolerance is one of the most important features of Hadoop. HDFS in Hadoop 2 uses a replication mechanism to provide fault tolerance.

It creates replicas of each block on different machines according to the replication factor (3 by default). So if any machine in the cluster goes down, the data can be read from the other machines that hold a replica of the same block.

Hadoop 3 adds erasure coding as an alternative to replication. Erasure coding provides a comparable level of fault tolerance with less space: with typical erasure coding policies, the storage overhead is no more than 50%, compared with 200% for three-way replication.
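The storage figures for replication and erasure coding can be checked with a little arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks, which is a common erasure coding configuration but is not stated in the text above; the function names are hypothetical.

```python
def replication_overhead(replication_factor):
    """Extra storage, as a fraction of the original data, for N-way replication."""
    return float(replication_factor - 1)


def erasure_coding_overhead(data_blocks, parity_blocks):
    """Extra storage, as a fraction of the original data, for a (data, parity) layout."""
    return parity_blocks / data_blocks


def raw_storage_tb(logical_tb, overhead):
    """Raw disk space needed to store logical_tb of data with the given overhead."""
    return logical_tb * (1 + overhead)


# Three-way replication: two extra copies of every block.
rep = replication_overhead(3)

# Assumed layout: 6 data blocks plus 3 parity blocks per stripe.
ec = erasure_coding_overhead(6, 3)

print(f"replication overhead:    {rep:.0%}")
print(f"erasure coding overhead: {ec:.0%}")
print(f"raw TB for 100 TB (replication):    {raw_storage_tb(100, rep)}")
print(f"raw TB for 100 TB (erasure coding): {raw_storage_tb(100, ec)}")
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk with three-way replication and 150 TB with the 6+3 layout. Replication survives the loss of two copies of a block, while the 6+3 layout survives the loss of any three blocks in a stripe.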

 

4. Hadoop provides High Availability

This feature of Hadoop keeps data highly available, even under unfavorable conditions.

Thanks to Hadoop's fault tolerance, if any DataNode goes down, the data is still available to the user from other DataNodes that hold a copy of the same data.

In addition, a high-availability Hadoop cluster runs two or more NameNodes (one active, the others standby) in a hot standby configuration. The active NameNode serves client requests. Each standby NameNode reads the edit-log changes made by the active NameNode and applies them to its own namespace.

If the active NameNode fails, a standby NameNode takes over its responsibilities. Thus, even if a NameNode goes down, files remain available and accessible to users.
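The active/standby behavior described above can be modeled with a toy example. This is only a sketch of the idea (a standby replays the active node's edit log so that it can take over), not Hadoop's implementation: the `NameNode` and `SharedEditLog` classes and the `write_file`, `tail_edits`, and `failover` functions are hypothetical, and real clusters rely on additional components for shared edit storage and automatic failover that are omitted here.

```python
class NameNode:
    def __init__(self, name, state):
        self.name = name
        self.state = state        # "active", "standby", or "down"
        self.namespace = {}       # file path -> list of block ids
        self.applied_txid = 0     # id of the last edit applied

    def apply(self, edit):
        txid, path, blocks = edit
        self.namespace[path] = blocks
        self.applied_txid = txid


class SharedEditLog:
    """Ordered log of namespace changes visible to every NameNode."""

    def __init__(self):
        self.edits = []

    def append(self, path, blocks):
        txid = len(self.edits) + 1
        edit = (txid, path, blocks)
        self.edits.append(edit)
        return edit

    def since(self, txid):
        # Edits are numbered from 1, so edits[txid:] are those newer than txid.
        return self.edits[txid:]


def write_file(active, log, path, blocks):
    if active.state != "active":
        raise RuntimeError(f"{active.name} is not active")
    active.apply(log.append(path, blocks))


def tail_edits(standby, log):
    """Replay any edits the standby has not applied yet."""
    for edit in log.since(standby.applied_txid):
        standby.apply(edit)


def failover(failed, standby, log):
    tail_edits(standby, log)      # catch up before taking over
    failed.state = "down"
    standby.state = "active"
    return standby


log = SharedEditLog()
nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

write_file(nn1, log, "/data/a.txt", ["blk_1", "blk_2"])
tail_edits(nn2, log)
write_file(nn1, log, "/data/b.txt", ["blk_3"])   # nn2 has not seen this yet

active = failover(nn1, nn2, log)
print(active.name, sorted(active.namespace))
```

After the failover, `nn2` holds the same namespace as `nn1` did, including the file written after its last catch-up, because it replays the remaining edits before becoming active.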

5. Hadoop is very Cost-Effective

A Hadoop cluster is built from inexpensive commodity hardware, so it provides a cost-effective solution for storing and processing big data. Being an open-source product, Hadoop requires no license fees.

6. Hadoop is Faster in Data Processing

Hadoop stores data in a distributed fashion, which allows the data to be processed in parallel across a cluster of nodes. This parallelism is what gives the Hadoop framework its fast processing capability on large data sets.

7. Hadoop is based on Data Locality concept

Hadoop is well known for its data locality feature, which means moving the computation logic to the data rather than moving the data to the computation logic. This feature of Hadoop reduces network bandwidth usage in the system.
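As a rough illustration of data locality, the sketch below assigns each task to a node that already stores the block it needs, and falls back to a remote node only when no such node has a free slot. The names (`pick_node`, `schedule`, `block_locations`, `free_slots`) and the node labels are hypothetical, and the logic is deliberately simpler than a real scheduler, which would also consider intermediate levels such as nodes in the same rack.

```python
def pick_node(block_id, block_locations, free_slots):
    """Prefer a node that already holds the block; otherwise use any free node."""
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            return node, "node-local"
    for node, slots in free_slots.items():
        if slots > 0:
            return node, "remote"
    return None, "unscheduled"


def schedule(blocks, block_locations, free_slots):
    slots = dict(free_slots)          # copy so the caller's dict is not modified
    plan = {}
    for block_id in blocks:
        node, locality = pick_node(block_id, block_locations, slots)
        if node is not None:
            slots[node] -= 1          # the chosen node now has one slot fewer
        plan[block_id] = (node, locality)
    return plan


# Which nodes hold a replica of each block (illustrative data).
block_locations = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-a", "node-c"],
}
# Free task slots per node; node-c is busy and node-d holds no replicas.
free_slots = {"node-a": 1, "node-b": 1, "node-d": 1}

plan = schedule(["blk_1", "blk_2", "blk_3"], block_locations, free_slots)
for block_id, (node, locality) in plan.items():
    print(block_id, node, locality)
```

In this run, the tasks for `blk_1` and `blk_2` read from local disk, and only the task for `blk_3` has to pull its block over the network, which is the bandwidth saving that data locality aims for.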

8. Hadoop provides Feasibility

Unlike traditional systems, Hadoop can process unstructured data. This makes it feasible for users to analyze data of any format and size.

9. Hadoop is Easy to use

Hadoop is easy to use because clients do not have to worry about the details of distributed computing. The processing is handled by the framework itself.

10. Hadoop ensures Data Reliability

Because Hadoop replicates data across the cluster, data is stored reliably on the cluster machines despite machine failures.

The framework itself provides mechanisms to ensure data reliability: the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. If a machine goes down or data gets corrupted, the data is still stored reliably in the cluster and remains accessible from another machine that holds a copy.

 

Source: https://data-flair.training/blogs/features-of-hadoop-and-design-principles/

 

 

 

 

 

Hadoop use cases:

Simple numerical summaries (average, minimum, sum) were sufficient for the business problems of the 1980s and 1990s. Large amounts of complex data, though, require new techniques. Recognizing customer preferences requires analysis of purchase history, but also a close examination of browsing behavior and products viewed, comments and reviews logged on a web site, and even complaints and issues raised with customer support staff. Predicting behavior demands that customers be grouped by their preferences, so that the behavior of one individual in the group can be used to predict the behavior of others. The algorithms involved include natural language processing, pattern recognition, machine learning, and more. These techniques run very well on Hadoop.

  • Archive Platform - Big image library, big document library
  • Natural Language Processing
  • Recommendation Engine - How can companies predict customer preferences? Click-stream analysis, log analysis at web scale
  • Customer Churn Analysis - How can companies win more customers and avoid losing existing ones? Sophisticated data mining
  • Ad Targeting - How can companies increase campaign efficiency? Marketing automation, business intelligence
  • Point-of-Sale Transaction Analysis - How do retailers target promotions guaranteed to make you buy?
  • Analyzing Network Data to Predict Failure - How can organizations use machine-generated data to identify potential trouble?
  • Threat Analysis - How can companies detect threats and fraudulent activity? Crawling, text processing
  • Trade Surveillance - How can a bank spot the rogue trader?
  • Search Quality - What's in your search?
  • Data Sandbox - What can you do with new data? Big data archiving and sandbox, including relational/tabular data
  • GIS - 3D maps, spatial applications
  • Real-time Customer Segmentation - Marketing analytics

Source: http://chi.hadoop.hk/About/Hadoop-Executive-Summary
