Java Code Geeks co-founder Byron Kiourtzoglou recently published an article dissecting the four Vs of Big Data, from theory to practice, and closing with thirteen mainstream open-source Big Data tools that Java engineers may need.
What is Big Data, you may ask; and more importantly, why is it the latest trend in nearly every business domain? Is it just hype, or is it here to stay?
As a matter of fact, “Big Data” is a pretty straightforward term – it is just what it says – a very large data-set. How large? The exact answer is “as large as you can imagine”!
How can this data-set be so massively big? Because the data may come from everywhere and at enormous rates: RFID sensors that gather traffic data, sensors used to gather weather information, GPRS packets from cell phones, posts to social media sites, digital pictures and videos, online purchase transaction records, you name it! Big Data is an enormous data-set that may contain information from every possible source that produces data we are interested in.
Nevertheless, Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make businesses more agile, and to answer questions that were previously considered beyond our reach. That is why Big Data is characterized by four main aspects: Volume, Variety, Velocity, and Veracity (Value), known as “the four Vs of Big Data”. Let’s briefly examine what each one of them stands for and what challenges it presents:
Volume
Volume references the amount of content a business must be able to capture, store and access. 90% of the world’s data has been generated in the past two years alone. Organizations today are overwhelmed with volumes of data, easily amassing terabytes—even petabytes—of information of all types, some of which needs to be organized, secured and analyzed.
Variety
80% of the world’s data is semi-structured. Sensors, smart devices and social media are generating this data through Web pages, weblog files, social-media forums, audio, video, click streams, e-mails, documents, sensor systems and so on. Traditional analytics solutions work very well with structured information, for example data in a relational database with a well-formed schema. Variety in data types represents a fundamental shift in the way data must be stored and analyzed to support today’s decision-making and insight process. Thus Variety represents the various types of data that can’t easily be captured and managed in a traditional relational database but can be easily stored and analyzed with Big Data technologies.
Velocity
Velocity requires analyzing data in near real time, aka “sometimes 2 minutes is too late!”. Gaining a competitive edge means identifying a trend or opportunity in minutes or even seconds before your competitor does. Another example is time-sensitive processes such as catching fraud, where information must be analyzed as it streams into your enterprise in order to maximize its value. Time-sensitive data has a very short shelf-life, compelling organizations to analyze it in near real time.
Veracity (Value)
Acting on data is how we create opportunities and derive value. Data is all about supporting decisions, so when you are looking at decisions that can have a major impact on your business, you are going to want as much information as possible to support your case. Nevertheless the volume of data alone does not provide enough trust for decision makers to act upon information. The truthfulness and quality of data is the most important frontier to fuel new insights and ideas. Thus establishing trust in Big Data solutions probably presents the biggest challenge one should overcome to introduce a solid foundation for successful decision making.
While the existing installed base of business intelligence and data warehouse solutions was not engineered to support the four Vs, big data solutions are being developed to address these challenges.
What follows is a brief presentation of the major open-source, Java-based tools available today that support Big Data:
HDFS
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. HDFS is specifically designed for storing vast amounts of data, so it is optimized for storing and accessing a relatively small number of very large files, in contrast to traditional file systems, which are optimized to handle large numbers of relatively small files.
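A minimal sketch of what HDFS client code looks like, reading a file through the Java FileSystem API; the path /data/input.txt is hypothetical, and the NameNode address is assumed to come from the standard core-site.xml on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path used purely for illustration.
        Path file = new Path("/data/input.txt");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```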
Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
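The programming model is easiest to see in the canonical word-count example. The mapper below, a sketch along the lines of the official Hadoop tutorial, emits a (word, 1) pair for every token; a companion reducer would then sum the counts that the framework shuffles to it:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: for each input line, emit (word, 1) per token.
// The framework groups the pairs by word for the reduce phase.
public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```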
Apache HBase
Apache HBase is the Hadoop database, a distributed, scalable, big data store. It provides random, realtime read/write access to Big Data and is optimized for hosting very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. At its core Apache HBase is a distributed, versioned, column-oriented store modeled after Google’s “Bigtable: A Distributed Storage System for Structured Data” by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
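A minimal sketch of the random read/write access described above, using the classic HBase Java client; the users table and its info column family are hypothetical and assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the cluster location.
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; it must be created beforehand (e.g. in the shell).
        HTable table = new HTable(conf, "users");

        // Write one cell: row key -> column family:qualifier -> value.
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Random, realtime read of the same row.
        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
    }
}
```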
Apache Cassandra
Apache Cassandra is a performant, linearly scalable, and highly available database that can run on commodity hardware or cloud infrastructure, making it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for users and the peace of mind of knowing that you can survive regional outages. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
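A minimal sketch of talking to Cassandra from Java via CQL, assuming the DataStax Java driver (2.x or later); the contact point, the demo keyspace and its users table are all hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Hypothetical contact point; in production you would list several.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        // "demo" is a hypothetical keyspace containing a users(id, name) table.
        Session session = cluster.connect("demo");

        session.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')");

        ResultSet rs = session.execute("SELECT name FROM users WHERE id = 1");
        for (Row row : rs) {
            System.out.println(row.getString("name"));
        }

        cluster.close();
    }
}
```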
Apache Hive
Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
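Since Hive speaks a SQL dialect, it can be queried from Java through a standard JDBC connection; the sketch below assumes a HiveServer2 instance with its JDBC driver on the classpath, and a hypothetical logs table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port and database are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");

        Statement stmt = conn.createStatement();
        // Plain HiveQL; Hive compiles this into map/reduce work under the hood.
        ResultSet rs = stmt.executeQuery(
                "SELECT level, COUNT(*) FROM logs GROUP BY level");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        conn.close();
    }
}
```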
Apache Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig’s language layer currently consists of a textual language called Pig Latin, which is developed with ease of programming, optimization opportunities and extensibility in mind.
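Pig Latin can also be driven from Java through the PigServer API; this sketch assumes local mode and a hypothetical tab-separated input file of (level, message) log lines:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; ExecType.MAPREDUCE targets a real cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin statements registered as strings; paths are hypothetical.
        pig.registerQuery("logs = LOAD 'input/logs.txt' AS (level:chararray, msg:chararray);");
        pig.registerQuery("by_level = GROUP logs BY level;");
        pig.registerQuery("counts = FOREACH by_level GENERATE group, COUNT(logs);");

        // Storing an alias triggers compilation and execution of the plan.
        pig.store("counts", "output/level_counts");
    }
}
```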
Apache Chukwa
Apache Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Apache Ambari
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heat maps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Apache ZooKeeper
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. In short, Apache ZooKeeper is a high-performance coordination service for distributed applications such as those that run on a Hadoop cluster.
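A minimal sketch of the ZooKeeper Java client: connect, wait for the session to be established, then store and read back a small piece of shared configuration in a znode (the connection string and the /app-config path are hypothetical):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Hypothetical connection string; 3000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Create a znode holding a small piece of shared configuration.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/app-config", false, null)));

        zk.close();
    }
}
```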
Apache Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases: it can import data from a relational database into HDFS, and export data from HDFS back into a relational database.
Apache Oozie
Apache Oozie is a scalable, reliable and extensible workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
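Besides its command-line tool, Oozie exposes a Java client API for submitting and tracking jobs; in this sketch the server URL, the workflow application path and the cluster addresses are all hypothetical placeholders:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        Properties conf = client.createConfiguration();
        // Hypothetical HDFS path containing workflow.xml, plus cluster addresses.
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://localhost:8020/user/demo/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8021");

        // Submit and start the workflow, then poll its status once.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println(jobId + " -> " + job.getStatus());
    }
}
```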
Apache Mahout
Apache Mahout is a scalable machine learning and data mining library. Currently Mahout mainly supports four use cases (a minimal recommender sketch follows the list):
- Recommendation mining: takes users’ behavior and from that tries to find items users might like.
- Clustering: takes e.g. text documents and groups them into groups of topically related documents.
- Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
- Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
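For the recommendation-mining use case, a user-based recommender can be wired up with Mahout’s Taste API as below; ratings.csv, a hypothetical file of userID,itemID,preference rows, stands in for real behavior data:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical CSV of userID,itemID,preference triples.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Users are "similar" if their ratings correlate; use the 10 nearest.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```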
Apache HCatalog
Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:
- Providing a shared schema and data type mechanism.
- Providing a table abstraction so that users need not be concerned with where or how their data is stored.
- Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
That’s it; Big Data, a short theoretical introduction and a compact matrix of implementation approaches focused on overcoming the problems of a new era – the era that forces us to ask bigger questions!
Happy Coding
Byron