impact of total region numbers?

Over the past few days I tuned a few HBase parameters and got some interesting results. See my emails below for the details.

 

For example, for a fixed amount of data, I can tune hbase.hregion.max.filesize to increase or decrease the total number of regions, right?
I want to know whether the number of regions has a performance impact on random read tests. In my YCSB test, I observed that with a larger HFile size I got better throughput and lower latency.
Can anybody give me some hints? Thanks.

Tao
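
The relationship asked about above can be sketched with some rough arithmetic. This is only an estimate under stated assumptions: a single column family, evenly distributed keys, no pre-splitting, and a region that splits into two roughly equal halves once its store file exceeds hbase.hregion.max.filesize. The 100 GB data size and the function name are hypothetical, chosen for illustration.

```python
def estimate_region_count(total_bytes, max_filesize_bytes):
    """Return rough (low, high) bounds on the number of regions.

    Assumption: each region ends up holding between half of and all of
    max_filesize_bytes, because a region splits in two once it grows
    past the limit.
    """
    if total_bytes <= 0 or max_filesize_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    low = max(1, -(-total_bytes // max_filesize_bytes))
    high = max(1, -(-2 * total_bytes // max_filesize_bytes))
    return low, high


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical table size; not a figure from this thread.
for max_mb in (256, 1024):
    low, high = estimate_region_count(100 * GB, max_mb * MB)
    print(f"max.filesize={max_mb} MB -> roughly {low} to {high} regions")
```

Under these assumptions, quadrupling hbase.hregion.max.filesize cuts the estimated region count to about a quarter for the same data.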



Tatsuya Kawano

  to user
Jan 18
Hi Tao,

I think the number of regions won't have much impact on random read throughput and latency. But the number of generations (HFiles) per region will.

If this is the case, try running a major compaction on the table. This will merge the HFile generations, so the read throughput and latency should recover. You can do this from the hbase shell.

Also, you might want to increase hbase.hregion.memstore.flush.size to keep the number of HFile generations smaller.

Thanks,

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
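
To illustrate the flush-size suggestion above, here is a minimal sketch of how the memstore flush size bounds the number of HFile generations a region accumulates between compactions. It assumes every flush happens exactly when the memstore reaches the flush size and that no minor compaction runs in between; real clusters also flush for other reasons, so this is a simplification. The 1 GB write volume and the flush sizes are hypothetical.

```python
def hfiles_before_compaction(bytes_written, flush_size):
    """Estimate how many HFiles a region accumulates from flushes alone.

    Assumption: one HFile per full memstore, no compaction in between.
    """
    if bytes_written < 0 or flush_size <= 0:
        raise ValueError("bytes_written must be >= 0 and flush_size > 0")
    # Ceiling division: a partially filled memstore still yields one file
    # once it is eventually flushed.
    return -(-bytes_written // flush_size)


MB = 1024 ** 2

# Hypothetical write volume per region and flush sizes.
for flush_mb in (64, 256):
    n = hfiles_before_compaction(1024 * MB, flush_mb * MB)
    print(f"flush size {flush_mb} MB -> about {n} HFiles before compaction")
```

A random read may have to consult each of these files, which is why fewer generations (or a major compaction) tends to help read latency.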

Tao Xie

  to user
Jan 18
Thanks for the response.
I tuned the values of dfs.block.size and hbase.hregion.max.filesize for my tests (pure read tests) and got the results below:
Test    dfs.block.size    hbase.hregion.max.filesize    requests/sec    latency
1       32                1024                          ~4000           24
2       256               256                           ~4500           22
3       1024              1024                          ~5000           20

My understanding of the results is that with fewer HDFS blocks per HFile, the lookup for a random row is faster because it avoids jumping from one block to another (Test 1 vs. Test 2); and with fewer but bigger regions, performance is also better? (Test 2 vs. Test 3).
Sure, I believe the number of HFiles per region has an impact, but I did run a major compaction in every test, using the command:
major_compact 'mytable'
and checked that each region has only one storefile.

Is that correct?
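
For reference, the relative changes between the three tests can be worked out directly from the table. This sketch only restates the reported numbers; the throughput figures are approximate ("~") in the original, the latency unit is not stated, and the helper name is mine.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


# (requests/sec, latency) per test, copied from the table above.
results = {1: (4000, 24), 2: (4500, 22), 3: (5000, 20)}

for a, b in ((1, 2), (2, 3), (1, 3)):
    tput = pct_change(results[a][0], results[b][0])
    lat = pct_change(results[a][1], results[b][1])
    print(f"Test {a} -> Test {b}: throughput {tput:+.1f}%, latency {lat:+.1f}%")
```

Going from Test 1 to Test 3, throughput rises by about 25% while latency drops by about 17%.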



2011/1/18 Tatsuya Kawano  <[email protected]>

Tatsuya Kawano

  to user
Jan 18

Hi Tao,

Thanks for sharing the test result.

> but I did run a major compaction in every test,
> using the command:
> major_compact 'mytable'
> and checked that each region has only one storefile.

Yes, that's what I meant. So that isn't the issue in your case.


> My understanding of the results is that with fewer HDFS blocks per HFile,
> the lookup for a random row is faster because it avoids jumping from one
> block to another (Test 1 vs. Test 2)

I can't tell whether this is correct, because of my limited knowledge of HDFS. But I think having fewer HDFS blocks could let the hard drives seek to the data more quickly, because HDFS tries to store all bytes of a block in a contiguous location on disk. Fewer blocks (fewer fragments) on the hard drives should improve seek latency, especially when multiple threads are trying to access the same drives.

Thanks,

--
Tatsuya Kawano (Mr.)
Tokyo, Japan
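
The "fewer blocks per HFile" idea discussed above can be made concrete with a small calculation. It assumes that dfs.block.size and hbase.hregion.max.filesize in Tao's table are expressed in the same unit, and that a major-compacted region's single storefile has grown all the way to the maximum size, so the result is an upper bound rather than a measured value. The function name is hypothetical.

```python
def hdfs_blocks_per_file(file_size, block_size):
    """Upper bound on the HDFS blocks needed to store one file."""
    if file_size <= 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: the last block may be only partially filled.
    return -(-file_size // block_size)


# (dfs.block.size, hbase.hregion.max.filesize) per test, from the table,
# assumed to share the same unit.
settings = {1: (32, 1024), 2: (256, 256), 3: (1024, 1024)}

for test, (block_size, max_filesize) in settings.items():
    blocks = hdfs_blocks_per_file(max_filesize, block_size)
    print(f"Test {test}: up to {blocks} HDFS block(s) per HFile")
```

Under these assumptions, an HFile in Test 1 can span up to 32 HDFS blocks, while in Tests 2 and 3 it fits in a single block.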



Stack

  to user
Jan 18
Along with Tatsuya, I thank you for sharing this interesting result.

I too wonder why the bigger block makes a difference -- a 25%
improvement is a lot -- since we set up a socket on each random read
and seek to the block (we do not currently reuse the connection if the
correct block is already in the breach)?

Thanks for trying this experiment.
St.Ack