UHP blog post URL: http://yuntai.1kapp.com/?p=854
Original article:
http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
· by Marcel Kornacker & Justin Erickson
· October 24, 2012
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision (and, we believe, a bit more), which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase, in real time, using SELECT, JOIN, and aggregate functions. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and Doug Cutting’s Trevni columnar format is planned for the production drop.
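To make the query shape concrete, here is a minimal sketch of a statement that combines SELECT, JOIN, and aggregate functions. It is an illustration only: the `customers` and `orders` tables are hypothetical, and Python's built-in `sqlite3` module stands in for a real cluster purely so the example can run locally. It does not use Impala or Hive, and their SQL dialects may differ in details.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10), (1, 30), (2, 25)],
    )
    return conn


def order_totals(conn):
    """Join the two tables and aggregate order counts and totals per customer."""
    query = """
        SELECT c.name, COUNT(*) AS num_orders, SUM(o.amount) AS total
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in order_totals(build_demo_db()):
        print(row)
```

The statement itself is ordinary SQL; the table and column names carry no meaning beyond this example.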
To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. (See FAQ below for more details.) Note that this performance improvement has been confirmed by several large companies that have tested Impala on real-world workloads for several months now.
A high-level architectural view is below:
There are many advantages to this approach over alternative approaches for querying Hadoop data, including:
· Thanks to local processing on data nodes, network bottlenecks are avoided.
· A single, open, and unified metadata store can be utilized.
· Costly data format conversion is unnecessary and thus no overhead is incurred.
· All data is immediately query-able, with no delays for ETL.
· All hardware is utilized for Impala queries as well as for MapReduce.
· Only a single machine pool is needed to scale.
We encourage you to read the documentation for further technical details.
Finally, we’d like to answer some questions that we anticipate will be popular:
Is Impala open source?
Yes, Impala is 100% open source (Apache License). You can review the code for yourself at Github today.
How is Impala different from Dremel?
The first and principal difference is that Impala is open source and available for everyone to use, whereas Dremel is proprietary to Google.
Technically, Dremel achieves interactive response times over very large data sets through the use of two techniques:
· A novel columnar storage format for nested relational data (data with nested structures).
· Distributed scalable aggregation algorithms, which allow the results of a query to be computed on thousands of machines in parallel.
The latter is borrowed from techniques developed for parallel DBMSs, which also inspired the creation of Impala. Unlike Dremel as described in the 2010 paper, which could only handle single-table queries, Impala already supports the full set of join operators that are one of the factors that make SQL so popular.
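The idea behind distributed aggregation can be sketched in a few lines: each worker aggregates its own partition of the data, and only the small partial results are merged afterwards. The sketch below is a simplified illustration, not the algorithm used by Dremel or Impala. The function names are hypothetical, and a thread pool stands in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Aggregate one partition locally: sum of values per key."""
    acc = Counter()
    for key, value in rows:
        acc[key] += value
    return acc


def merge(partials):
    """Combine the per-partition results into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)


def distributed_sum(partitions):
    """Run the local aggregation on every partition in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, partitions))
    return merge(partials)


if __name__ == "__main__":
    partitions = [
        [("a", 1), ("b", 2)],
        [("a", 3)],
        [("b", 4), ("c", 5)],
    ]
    print(distributed_sum(partitions))
```

Because only the per-partition summaries travel to the merge step, the amount of data exchanged depends on the number of distinct keys rather than on the number of input rows.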
In order to realize the full performance benefits demonstrated by Dremel, Hadoop will shortly have an efficient columnar binary storage format called Trevni. But in contrast to Dremel, Impala supports a range of popular file formats. This lets users run Impala on their existing data without having to “load” or transform it. It also lets users decide if they want to optimize for flexibility or just pure performance.
To sum it up, Impala plus Trevni will achieve the query performance described in the Dremel paper, but surpass what is described there in SQL functionality.
How much faster are Impala queries than Hive ones, really?
The precise amount of performance improvement is highly dependent on a number of factors:
· Hardware configuration: Impala is generally able to take full advantage of hardware resources and specifically generates less CPU load than Hive, which often translates into higher observed aggregate I/O bandwidth than with Hive. Impala of course cannot go faster than the hardware permits, so any hardware bottlenecks will limit the observed speedup. For purely I/O bound queries, we typically see performance gains in the range of 3-4x.
· Complexity of the query: Queries that require multiple MapReduce phases in Hive or require reduce-side joins will see a higher speedup than, say, simple single-table aggregation queries. For queries with at least one join, we have seen performance gains of 7-45x.
· Availability of main memory as a cache for table data: If the data accessed through the query comes out of the cache, the speedup will be more dramatic thanks to Impala’s superior efficiency. In those scenarios, we have seen speedups of 20x-90x over Hive even on simple aggregation queries.
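As a rough worked example of what the ranges above imply, the sketch below converts a Hive runtime into the corresponding range of Impala runtimes. The speedup figures are the ones quoted in the list; the function name, the category labels, and the sample runtimes are assumptions made for illustration, and real results depend on hardware and workload.

```python
# Speedup ranges over Hive quoted in the list above: (low, high).
SPEEDUP_RANGES = {
    "io_bound": (3, 4),   # purely I/O bound queries
    "join": (7, 45),      # queries with at least one join
    "cached": (20, 90),   # table data served from the main-memory cache
}


def estimated_runtime_range(hive_seconds, query_class):
    """Return (best_case, worst_case) Impala runtime in seconds."""
    low, high = SPEEDUP_RANGES[query_class]
    return hive_seconds / high, hive_seconds / low


if __name__ == "__main__":
    # A hypothetical query that takes 6 minutes in Hive.
    for query_class in SPEEDUP_RANGES:
        best, worst = estimated_runtime_range(360, query_class)
        print(f"{query_class}: {best:.0f}s to {worst:.0f}s")
```

For instance, a purely I/O bound query that takes 360 seconds in Hive would be expected to finish in roughly 90 to 120 seconds under these figures.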
Is Impala a replacement for MapReduce or Hive, or for traditional data warehouse infrastructure, for that matter?
No. There will continue to be many viable use cases for MapReduce and Hive (for example, for long-running data transformation workloads) as well as traditional data warehouse frameworks (for example, for complex analytics on limited, structured data sets). Impala is a complement to those approaches, supporting use cases where users need to interact with very large data sets, across all data silos, to get focused result sets quickly.
Does the Impala Beta Release have any technical limitations?
As mentioned previously, supported file formats in the first beta drop include text files and SequenceFiles, with many other formats to be supported in the upcoming production release. Furthermore, currently all joins are done in a memory space no larger than that of the smallest node in the cluster; in production, joins will be done in aggregate memory. Lastly, no UDFs are possible at this time.
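The join limitation can be expressed as a simple capacity check. The sketch below is a deliberately simplified model: it treats a join as needing a single amount of memory and compares it against either the smallest node (the beta behavior described above) or the sum over all nodes (the planned production behavior). The function names and the sample node sizes are hypothetical, and actual memory accounting in Impala is certainly more involved.

```python
def join_memory_budget(node_memory_gb, aggregate=False):
    """Memory available to a join under the simplified model.

    aggregate=False models the beta: limited by the smallest node.
    aggregate=True models the planned production behavior: total cluster memory.
    """
    if not node_memory_gb:
        raise ValueError("cluster must have at least one node")
    return sum(node_memory_gb) if aggregate else min(node_memory_gb)


def join_fits(join_size_gb, node_memory_gb, aggregate=False):
    """Return True if the join fits into the available memory budget."""
    return join_size_gb <= join_memory_budget(node_memory_gb, aggregate)


if __name__ == "__main__":
    cluster = [32, 48, 64]  # hypothetical per-node memory in GB
    print(join_fits(40, cluster))                  # beta model
    print(join_fits(40, cluster, aggregate=True))  # production model
```

Under this model, adding larger nodes to a beta cluster does not raise the limit as long as the smallest node stays the same.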
What are the technical requirements for the Impala Beta Release?
You will need to have CDH4.1 installed on RHEL/CentOS 6.2. We highly recommend the use of Cloudera Manager (Free or Enterprise Edition) to deploy and manage Impala because it takes care of distributed deployment and monitoring details automatically.
What is the support policy for the Impala Beta Release?
If you are an existing Cloudera customer with a bug, you may raise a Customer Support ticket and we will attempt to resolve it on a best-effort basis. If you are not an existing Cloudera customer, you may use our public JIRA instance or the impala-user mailing list, which will be monitored by Cloudera employees.
When will Impala be generally available for production use?
A production drop is planned for the first quarter of 2013. Customers may obtain commercial support in the form of a Cloudera Enterprise RTQ subscription at that time.
We hope that you take the opportunity to review the Impala source code, explore the beta release, download and install the VM, or any combination of the above. Your feedback in all cases is appreciated; we need your help to make Impala even better.
We will bring you further updates about Impala as we get closer to production availability.
Impala resources:
- Impala source code
- Impala downloads (Beta Release and VM)
- Impala documentation
- Public JIRA
- Impala mailing list
- Free Impala training (Screencast)
(Added 10/30/2012) Third-party articles about Impala:
- GigaOm: Real-time query for Hadoop democratizes access to big data analytics (Oct. 22, 2012)
- Wired: Man Busts Out of Google, Rebuilds Top-Secret Query Machine (Oct. 24, 2012)
- InformationWeek: Cloudera Debuts Real-Time Hadoop Query (Oct. 24, 2012)
- GigaOm: Cloudera Makes SQL a First-Class Citizen on Hadoop (Oct. 24, 2012)
- ZDNet: Cloudera’s Impala Brings Hadoop to SQL and BI (Oct. 25, 2012)
- Wired: Marcel Kornacker Profile (Oct. 29, 2012)
- Dr. Dobbs: Cloudera Impala – Processing Petabytes at The Speed Of Thought (Oct. 29, 2012)
Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project.
Justin Erickson is the product manager for Impala.