Continuing from the previous post.
Question (2):
Scan offers settings such as
scan.setCaching(10000)
scan.setCacheBlocks(true)
but setCaching by itself does not feel sufficient to me. By default HBase keeps a chunk of data in memory to serve reads, which is why reads are fairly fast, yet most of the data still sits on disk; the sizing and exact meaning of this in-memory block is something I still need to study more carefully.
For writes, people roughly estimate a single HBase node at 30,000-50,000 writes per second, which is far beyond a traditional database. Reads are usually faster than writes, yet on our earlier 3-machine cluster, fetching 10,000 rows with Get took 3 seconds. Part of that time went into processing a CSV file on HDFS, but it still feels like there is a lot of room for improvement.
So, if I want to improve HBase read performance, what should I do?
The HBase read flow (the diagram is omitted here).
My understanding of how HBase reads data:
(1) HBase has an in-memory BlockCache that holds a small portion of the on-disk data so it can be served without a disk read; its size is configurable. The block cache itself is on by default; what you can additionally turn on is prefetching a family's blocks into the cache when its regions are opened, for example:
create 'MyTable', {NAME => 'myCF', PREFETCH_BLOCKS_ON_OPEN => 'true'}
or, in Java:
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

HTableDescriptor tableDesc = new HTableDescriptor("myTable");
HColumnDescriptor cfDesc = new HColumnDescriptor("myCF");
cfDesc.setPrefetchBlocksOnOpen(true); // prefetch this family's blocks when its regions open
tableDesc.addFamily(cfDesc);
If the required blocks are not in the block cache, HBase first has to load them from disk into the cache;
(2) HBase clients talk to the cluster over sockets through the HBase API, so read efficiency depends on both memory and network IO;
(3) Each client request is served by a number of RPC calls, and the RPCs return the result rows to the client batch by batch.
The number of rows per batch is set with scan.setCaching(10000). I usually set it to 10000, but it depends on the scenario: if queries are frequent and the result sets are small, do not set it too high; if users scan several years of data for statistics and analysis, it can be raised. To be honest, there is still quite a bit of uncertainty in this setting for me.
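To make that trade-off concrete, here is a small sketch of my own (the column family name "myCF" and the caching values are placeholders, not recommendations), showing how the two scenarios might configure their scans:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCachingExamples {
    private static final byte[] CF = Bytes.toBytes("myCF"); // hypothetical column family

    // Frequent interactive queries returning few rows: a modest caching value is enough,
    // and leaving setCacheBlocks(true) lets the hot blocks stay in the block cache.
    public static Scan interactiveScan(byte[] startRow, byte[] stopRow) {
        Scan scan = new Scan(startRow, stopRow);
        scan.addFamily(CF);
        scan.setCaching(100);
        scan.setCacheBlocks(true);
        return scan;
    }

    // One-off scans over years of history: a larger caching value cuts RPC round trips,
    // and setCacheBlocks(false) avoids evicting other users' hot data from the cache.
    public static Scan historyScan(byte[] startRow, byte[] stopRow) {
        Scan scan = new Scan(startRow, stopRow);
        scan.addFamily(CF);
        scan.setCaching(5000);
        scan.setCacheBlocks(false);
        return scan;
    }
}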
It feels like a pity that HBase was not built as a fully in-memory store, and the tuning ranges of many of its parameters are quite uncertain. Should I raise my own doubts about some parts of HBase, or just follow the tuning advice from the HBase developers item by item?
But I digress.
Let me split this into two parts:
(1) code configuration and code design optimization;
(2) HBase's own tuning options.
First, optimization (1): code configuration and code design.
I gathered some material on this. A lot of it analyzes individual parameters in isolation, and it is not clear how those tunings fit together or how much they improve HBase overall. I did find one reasonably systematic source, though: Quora recommends the performance documentation published with Cloudera's HBase build. It is in English; use Google Translate if that helps. Some excerpts follow:
(I) Use constants for byte arrays.
When people get started with HBase they have a tendency to write code that looks like this:
Get get = new Get(rowkey);
Result r = table.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
But especially when inside loops (and MapReduce jobs), converting the columnFamily and column-names to byte-arrays repeatedly is surprisingly expensive. It’s better to use constants for the byte-arrays, like this:
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(rowkey);
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR); // returns current version of value
The point is to convert the column family and qualifier to byte arrays once, keep them as constants, and reuse them, rather than calling Bytes.toBytes("cf") again on every iteration of a loop. I have not tried this myself yet, but it sounds worth a try.
(II) Batch loading
98.1. Batch Loading
Use the bulk load tool if you can. See Bulk Loading. Otherwise, pay attention to the below.
98.2. Table Creation: Pre-Creating Regions
Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too-many regions can actually degrade performance.
There are two different approaches to pre-creating splits. The first approach is to rely on the default Admin strategy (which is implemented in Bytes.split)…
byte[] startKey = ...; // your lowest key
byte[] endKey = ...; // your highest key
int numberOfRegions = ...; // # of regions to create
admin.createTable(table, startKey, endKey, numberOfRegions);
And the other approach is to define the splits yourself…
byte[][] splits = ...; // create your own splits
admin.createTable(table, splits);
See Relationship Between RowKeys and Region Splits for issues related to understanding your keyspace and pre-creating regions. See manual region splitting decisions for discussion on manually pre-splitting regions.
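To make the two approaches above runnable, here is a minimal sketch against the HBase 1.x client API; the table name, column family, key range and region count are placeholders I made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("MyTable"));
            desc.addFamily(new HColumnDescriptor("myCF"));

            // Approach 1: let HBase compute evenly spaced split points between
            // the lowest and highest expected row keys (key range is made up here).
            admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("99999999"), 10);

            // Approach 2 (alternative): hand over explicit split points instead.
            // byte[][] splits = { Bytes.toBytes("2"), Bytes.toBytes("4"),
            //                     Bytes.toBytes("6"), Bytes.toBytes("8") };
            // admin.createTable(desc, splits);
        }
    }
}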
HBase is normally read and written either through MapReduce jobs or through the HBase API. Bulk loading suits MapReduce-style batch tasks: because it bypasses part of the normal client write path, it costs less memory and network IO than writing through the HBase API. My usual approach is to package the MapReduce batch job as a jar and drive it from a shell script. The Key-Value Store Indexer middleware (the Lily HBase Indexer) that I used before also offers a MapReduce-based batch mode.
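As a sketch of that jar-packaged approach (the table name, column family and two-column CSV layout are assumptions of mine, not the original setup), a bulk-load MapReduce job that writes HFiles via HFileOutputFormat2 might look roughly like this; the generated HFiles are then moved into the table with the completebulkload tool (LoadIncrementalHFiles):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvBulkLoadJob {

    // Turns one CSV line of the form "rowkey,value" into a Put keyed by the rowkey.
    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final byte[] CF = Bytes.toBytes("myCF");
        private static final byte[] ATTR = Bytes.toBytes("attr");

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            byte[] rowkey = Bytes.toBytes(fields[0]);
            Put put = new Put(rowkey);
            put.addColumn(CF, ATTR, Bytes.toBytes(fields[1]));
            ctx.write(new ImmutableBytesWritable(rowkey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-bulk-load");
        job.setJarByClass(CsvBulkLoadJob.class);
        job.setMapperClass(CsvToPutMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // CSV input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output dir on HDFS

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName table = TableName.valueOf("MyTable");
            // Wires in the reducer, partitioner and output format so the HFiles
            // produced by the job line up with the table's current region boundaries.
            HFileOutputFormat2.configureIncrementalLoad(
                    job, conn.getTable(table), conn.getRegionLocator(table));
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}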
(III) Scan caching
If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching set to something greater than the default (which is 1). Using the default value means that the map-task will make call back to the region-server for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to have the cache value be large because it costs more in memory for both client and RegionServer, so bigger isn’t always better.
This one is pretty much a must. It is usually set with
scan.setCaching(10000)
on the Scan instance.
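For the MapReduce input case described in the quoted paragraph, a minimal sketch might look like the following (the table name, the throwaway row-counting mapper and the value 500 are placeholders for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanCachingMRJob {

    // Trivial mapper that just counts rows; stands in for real per-row processing.
    public static class RowCountMapper
            extends TableMapper<ImmutableBytesWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
            ctx.getCounter("scan", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-caching-demo");
        job.setJarByClass(ScanCachingMRJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // 500 rows per RPC instead of the default of 1
        scan.setCacheBlocks(false);  // full scans should not churn the block cache

        TableMapReduceUtil.initTableMapperJob("MyTable", scan, RowCountMapper.class,
                ImmutableBytesWritable.class, LongWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}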
(IV) Close ResultScanners
This isn’t so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in try/catch blocks.
Scan scan = new Scan();
// set attrs...
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close(); // always close the ResultScanner!
}
table.close();
(V) Use Bloom filters
Enabling Bloom Filters can save your having to go to disk and can help improve read latencies.
Bloom filters were developed over in HBase-1200 Add bloomfilters. For description of the development process — why static blooms rather than dynamic — and for an overview of the unique properties that pertain to blooms in HBase, as well as possible future directions, see the Development Process section of the document BloomFilters in HBase attached to HBASE-1200. The bloom filters described here are actually version two of blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom option based on work done by the European Commission One-Lab Project 034819. The core of the HBase bloom work was later pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. Version 1 of HBase blooms never worked that well. Version 2 is a rewrite from scratch though again it starts with the one-lab work.
Bloom filters strike me as one of the better suggestions here and worth trying. I found a comparison that gives a quick sense of the effect: the ellicium.com article in the references below runs a Bloom filter experiment on airline traffic data, loading roughly 5 million records from that dataset into an HBase table and comparing query performance with and without Bloom filters.
Bloom filter behavior can be tuned with the following configuration properties:
io.hfile.bloom.enabled global kill switch
io.hfile.bloom.enabled in Configuration serves as the kill switch in case something goes wrong. Default = true.
io.hfile.bloom.error.rate
io.hfile.bloom.error.rate = average false positive rate. Default = 1%. Decrease rate by ½ (e.g. to .5%) == +1 bit per bloom entry.
io.hfile.bloom.max.fold
io.hfile.bloom.max.fold = guaranteed minimum fold rate. Most people should leave this alone. Default = 7, or can collapse to at least 1/128th of original size. See the Development Process section of the document BloomFilters in HBase for more on what this option means.
In short:
io.hfile.bloom.enabled: global kill switch, default true.
io.hfile.bloom.error.rate: average false-positive rate, default 1%; halving it (e.g. to 0.5%) costs about one extra bit per bloom entry.
io.hfile.bloom.max.fold: guaranteed minimum fold rate, default 7; most people should leave it alone.
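Whether a Bloom filter is used at all is decided per column family rather than by these global properties (and in recent versions a row-level Bloom filter is already the default). As a minimal sketch (table and family names are placeholders), setting it explicitly at table creation time looks roughly like this; the shell equivalent is BLOOMFILTER => 'ROW' in the column family definition:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomFilterTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("MyTable"));
            HColumnDescriptor cf = new HColumnDescriptor("myCF");
            // ROW blooms index row keys only; ROWCOL also covers column qualifiers and
            // mainly helps when the workload is Gets of specific row + column pairs.
            cf.setBloomFilterType(BloomType.ROW);
            desc.addFamily(cf);
            admin.createTable(desc);
        }
    }
}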
There are clearly a lot more details than this, and I am still learning them myself. That is it for this part.
First of all, compared with MySQL and similar databases, HBase runs in a more complex, usually distributed environment. In practice, if it is used badly it brings little benefit and can even make things worse. Below is a summary of my experience using HBase.
Problem 1: conditional retrieval based on the rowkey.
Problem 2: the set of columns to retrieve is not fixed; it may be small or large.
HBase retrieval normally goes through the rowkey, which behaves like a unique id. If you already know the rowkey, the query is as fast as it gets.
For example, to fetch one row:
hbase(main)> get 't1', 'rowkey001'
In real queries, however, you rarely know the rowkey up front, so this alone is not practical.
Most retrieval is conditional, or by related attributes.
For problem 1 we explored and settled on a few design choices (a small illustrative sketch follows this list):
Concatenate the important columns into the rowkey, e.g. partition number, record id, and time.
Normalize the values used for querying, e.g. ids zero-padded to 6 digits; gender encoded as 1 or 0; a condition with only 3 branches encoded as 1, 2, 3.
Weigh tall-table versus wide-table layouts and pick whichever fits the case better.
Keep the number of column families per table small.
Analyze what each cell stores and make the cell's data structure compact.
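As a toy illustration of the first two points (the field widths and layout here are invented for the example, not our actual schema):

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyBuilder {
    // Hypothetical layout: 2-digit partition number + 6-digit zero-padded id
    // + reversed timestamp, so rows in a partition are contiguous and the
    // newest records for an id sort first.
    public static byte[] buildRowKey(int partition, int id, long timestampMillis) {
        String key = String.format("%02d%06d%019d",
                partition, id, Long.MAX_VALUE - timestampMillis);
        return Bytes.toBytes(key);
    }

    public static void main(String[] args) {
        byte[] rowkey = buildRowKey(7, 123, System.currentTimeMillis());
        System.out.println(Bytes.toString(rowkey)); // e.g. "07000123" followed by the reversed timestamp
    }
}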
These optimizations mostly target the case where the total data volume is huge but the number of columns is modest. When there are many columns and the query conditions vary, one query may touch a few columns and the next a dozen or several dozen, so the IO per request grows accordingly; with HBase's distributed architecture, a client sitting on an external network may see very poor IO efficiency, and the network itself then needs tuning as well. HBase is also a poor fit for SQL-like queries.
For problem 2, sticking entirely to rowkey-based conditional retrieval looks fine on paper, but in practice we ran into limited query conditions, retrieval efficiency problems, and RPC timeouts.
These issues call for a more capable retrieval architecture. I see two plausible directions:
Pair HBase with a search engine, Elasticsearch being the typical choice.
Use Spark on Hive.
A search engine would greatly improve retrieval efficiency and easily absorb the daily data growth, but its core drawbacks are heavy memory usage and a non-trivial learning curve. Going through Hive requires strong schema design skills: Hive on MapReduce can handle both writes and retrieval but is relatively slow in practice and hits optimization bottlenecks, while Spark on Hive again uses a lot of memory and Spark tuning needs careful control. Spark SQL does now support the HiveQL parser. But the topic here is retrieval; Spark SQL and Hive really belong to BI/analytics, so I will not go into more detail.
References:
http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.7.0/book.html#perf.general
https://www.ellicium.com/hbase-best-performance/
https://www.cnblogs.com/cxzdy/p/5118545.html