Q&A on the Impala Introduction Blog Post

UHP blog post URL: http://yuntai.1kapp.com/?p=875

Original article; please credit the source when reposting: http://blog.csdn.net/wind5shy/article/details/8433492

Original blog post URL:

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

 

·        SONAL / OCTOBER 25, 2012 / 11:44 AM

Very excited to see Impala. The Dremel paper outlines efficient columnar storage for nested data. How does Impala achieve its speeds if data is not to be loaded into the system?

Thanks
Sonal

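The Dremel paper describes using columnar storage to store nested data efficiently. If the data is not loaded into the system, how does Impala achieve its speed?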

·        MARCEL KORNACKER / OCTOBER 31, 2012 / 9:06 PM

To address Sonal’s question:

The performance advantage you will see with Impala will always depend on the storage format of the data, among other things. Impala tries hard to be fast on ASCII-encoded data (text files and sequencefile), but of course the parsing overhead will always show up as a performance penalty compared to something like ColumnIO or Trevni. Impala will also support Trevni in the GA release, as mentioned in the blog post.

Regarding data loading: we are working on background conversion into Trevni, in a way that enables a logical table to be backed by a mix of data formats. New data would show up in, say, sequencefile format and eventually get converted into the more efficient Trevni columnar format, but all of the data would be queryable at all times, regardless of format.

Marcel

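Impala's performance advantage always depends on the storage format of the data, among other factors. Impala works hard to process ASCII-encoded data quickly, but compared with ColumnIO or Trevni, the parsing overhead will inevitably cost performance. Impala will support Trevni in the GA release.

On data loading: we are working on background conversion of data into Trevni, in a way that lets one logical table be backed by a mix of data formats. New data arrives in, say, sequencefile format and is eventually converted into the more efficient Trevni columnar format, but all data remains queryable at all times, regardless of format.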
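The idea of a logical table backed by a mix of formats can be illustrated with a small Python sketch. This is not Impala's implementation: `DataFile`, `LogicalTable`, `read_rows`, `convert_to_columns`, and the format tags `"rows"` and `"columns"` are all hypothetical names, with `"rows"` standing in for a row-oriented format such as sequencefile and `"columns"` for a columnar format such as Trevni. The point it tries to show is only that a scan can dispatch on each file's format, so converting files one at a time in the background never changes query results.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    fmt: str         # "rows" or "columns" (hypothetical format tags)
    payload: object  # list of row dicts, or dict of column name -> list of values


def read_rows(f):
    """Yield row dicts from a file, whatever its physical format."""
    if f.fmt == "rows":
        yield from f.payload
    elif f.fmt == "columns":
        names = list(f.payload)
        # Re-assemble rows by walking all column lists in lockstep.
        for values in zip(*(f.payload[n] for n in names)):
            yield dict(zip(names, values))
    else:
        raise ValueError(f"unknown format: {f.fmt}")


def convert_to_columns(f):
    """Rewrite a row-oriented file as a column-oriented one."""
    rows = list(f.payload)
    names = list(rows[0]) if rows else []
    return DataFile("columns", {n: [r[n] for r in rows] for n in names})


@dataclass
class LogicalTable:
    files: list = field(default_factory=list)

    def scan(self):
        """Query path: reads every file, regardless of its format."""
        for f in self.files:
            yield from read_rows(f)

    def convert_one(self):
        """Background step: convert the first remaining row-format file."""
        for i, f in enumerate(self.files):
            if f.fmt == "rows":
                self.files[i] = convert_to_columns(f)
                return True
        return False
```

Calling `convert_one()` repeatedly moves the table from all-row to all-column storage, and `list(table.scan())` returns the same rows at every intermediate step.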

·        ALEX B / NOVEMBER 22, 2012 / 8:25 AM

Can you please comment on how Impala compares to Hadapt in terms of architecture? As far as I understand, in the case of Hadapt (and I could be wrong, of course), some transformation of the data to PostgreSQL is needed. That does not seem to be the case with Impala (at least in the current implementation)?

Thanks,
Alex

How does Impala compare with Hadapt architecturally? Hadapt requires some transformation of the data into PostgreSQL; Impala does not appear to need this.

·        MARCEL KORNACKER / DECEMBER 20, 2012 / 6:02 PM

Regarding Alex’s question:

That’s correct, Impala does read data directly from HDFS and HBase. Impala also relies on Apache Hive’s metastore for the mapping of files into tables, which means you can re-use your schema definitions if you’re already querying Hadoop through Hive.

Hadapt runs a PostgreSQL instance on each data node, and appears to require some form of data movement (and duplication of data storage) between Postgres and HDFS, but for the specifics of that architecture I would recommend consulting the Hadapt website.

Marcel

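Impala reads data directly from HDFS and HBase, and it relies on Hive's metastore to map files to tables, which means that if you already query Hadoop data through Hive, you can reuse your schema definitions.

Hadapt runs a PostgreSQL instance on each data node and appears to require some form of data movement (and duplicated storage) between PostgreSQL and HDFS, but for the details of that architecture, consult the Hadapt website.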

·        KANG XIAO / DECEMBER 03, 2012 / 6:59 AM

Great stuff! We have tried it and Impala shows about 2x speedup vs. Hive on our simple query on a test dataset.

Could Marcel explain more about the main reasons that make Impala faster?
1. about columnar storage: it seems that Hive can also benefit from columnar storage compared with text files.
2. about distributed scalable aggregation algorithms: are there some details and examples about the algorithms?
3. about join: if the dataset cannot fit into memory, how does Impala stay fast if it has to use disk?
4. about main memory as a cache for table data: is it a cache in Impala for recently accessed data?

Thanks!
Kang

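We have tried Impala; on a simple query over our test dataset, it was about 2x faster than Hive.

Could Marcel explain the main reasons Impala is fast?

1.    On columnar storage: compared with text files, Hive can also benefit from columnar storage.

2.    On distributed, scalable aggregation algorithms: are there details and examples of the algorithms?

3.    On joins: if the dataset cannot fit entirely into memory, how does Impala stay fast when it has to use disk?

4.    On main memory as a cache for table data: is it a cache of the data Impala accessed recently?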

·        MARCEL KORNACKER / DECEMBER 20, 2012 / 6:32 PM

Regarding Kang’s questions:

1. Yes, the Trevni columnar storage format will be an open and general purpose storage format that will be available for any of the Hadoop processing frameworks, including Hive, MapReduce, and Pig.

However, we expect to see greater performance gains from Trevni in Impala compared to what you’d see in Hive. The reason is that in a disk-based system, Impala is often I/O-bound, and a columnar format will reduce the total I/O volume, often by a substantial amount. Hive is often CPU-bound and will therefore benefit much less from a reduction in I/O volume.

2. At the moment, Impala does a simple 2-stage aggregation: pre-aggregation is done by all executing backends, followed by a single, central merge aggregation step in the coordinator. In an upcoming release Impala will also support repartitioning aggregation, where the result of the pre-aggregation step is hash-partitioned across all executing backends, so that the total merge aggregation work is also distributed.

3. Impala currently has the limitation that the right-hand side table of a join needs to fit into the memory of every executing backend. In the GA release, this will be relaxed, so that the right-hand side table will only have to fit into the *aggregate* memory of all executing backends. Disk-based join algorithms won’t be available until after the GA release.

4. Impala does not maintain its own cache; instead, it relies on the OS buffer cache in order to keep frequently-accessed data in memory.

Marcel
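The two aggregation strategies in point 2 can be contrasted with a minimal Python sketch of a `SUM ... GROUP BY`. It is a single-process simulation under stated assumptions, not Impala code: each "backend" is just a list of `(key, value)` pairs, and `pre_aggregate`, `merge`, `two_stage`, and `repartitioned` are hypothetical names chosen for illustration.

```python
def pre_aggregate(rows):
    """Stage 1, run on every backend: sum values per key locally."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial


def merge(partials):
    """Merge step: combine partial sums that share a key."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


def two_stage(backends):
    """Pre-aggregate everywhere, then merge once in a central coordinator."""
    return merge(pre_aggregate(rows) for rows in backends)


def repartitioned(backends):
    """Pre-aggregate everywhere, then hash-partition the partial results so
    that each backend merges only the keys that hash to it."""
    n = len(backends)
    inboxes = [[] for _ in range(n)]
    for rows in backends:
        shards = [{} for _ in range(n)]
        for key, value in pre_aggregate(rows).items():
            shards[hash(key) % n][key] = value
        for i, shard in enumerate(shards):
            inboxes[i].append(shard)
    result = {}
    for inbox in inboxes:
        # Each merge here stands in for work done on a separate backend;
        # the key sets are disjoint, so the results can simply be unioned.
        result.update(merge(inbox))
    return result
```

Both functions return the same totals. The difference is where the merge work happens: in `two_stage` a single coordinator sees every partial result, while in `repartitioned` each backend merges roughly `1/n` of the distinct keys.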

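1.    The Trevni columnar storage format will be an open, general-purpose storage format, available to all Hadoop processing frameworks, including Hive, MapReduce, and Pig.

However, we expect larger performance gains from Trevni in Impala than in Hive. The reason is that in a disk-based system Impala is often I/O-bound, and a columnar format reduces the total I/O volume, often substantially. Hive is often CPU-bound and therefore benefits much less from reduced I/O.

2.    At present, Impala performs a simple 2-stage aggregation: pre-aggregation runs on all executing backends, followed by a single, central merge aggregation step in the coordinator. In an upcoming release, Impala will also support repartitioning aggregation, in which the results of the pre-aggregation step are hash-partitioned across all executing backends, so that the merge aggregation work is distributed as well.

3.    Impala currently requires the right-hand side table of a join to fit into the memory of every executing backend. In the GA release this will be relaxed, so that the right-hand side table only needs to fit into the total memory of all executing backends. Disk-based join algorithms will not be available until after the GA release.

4.    Impala does not maintain its own cache; instead, it relies on the OS buffer cache to keep frequently accessed data in memory.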
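The memory constraint in point 3 can be made concrete with a hedged Python sketch. The answer does not say how the GA release relaxes the limit; the sketch assumes a hash-partitioned join, which is one common way to need only aggregate memory, and the names `hash_join`, `broadcast_join`, and `partitioned_join` are hypothetical. Tables are lists of `(key, value)` pairs, and backends are simulated in one process.

```python
def hash_join(left, right):
    """In-memory hash join: build a table on the right side, probe with the left."""
    table = {}
    for key, rval in right:
        table.setdefault(key, []).append(rval)
    return [(key, lval, rval)
            for key, lval in left
            for rval in table.get(key, [])]


def broadcast_join(left_parts, right):
    """Every backend joins its slice of the left table against the WHOLE
    right table, so the right table must fit in each backend's memory."""
    out = []
    for part in left_parts:
        out.extend(hash_join(part, right))
    return out


def partitioned_join(left_parts, right, n):
    """Assumed alternative: hash-partition both sides on the join key, so each
    of n backends holds only its share of the right table."""
    left_buckets = [[] for _ in range(n)]
    right_buckets = [[] for _ in range(n)]
    for part in left_parts:
        for key, value in part:
            left_buckets[hash(key) % n].append((key, value))
    for key, value in right:
        right_buckets[hash(key) % n].append((key, value))
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(hash_join(lb, rb))
    largest_right_share = max(len(rb) for rb in right_buckets)
    return out, largest_right_share
```

Both joins produce the same set of result rows. In `broadcast_join` each backend holds `len(right)` rows of the right table, whereas in `partitioned_join` the second return value reports the largest per-backend share, which is what has to fit in a single backend's memory under this assumption.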
