When we run simple row counts or aggregations over data in an HBase table, pulling the data to the client with MapReduce or the native API incurs noticeable latency and heavy network IO. If the computation can instead run on the server side, the network IO drops and performance improves significantly. HBase coprocessors are a good fit for exactly this.
HBase coprocessors fall into two broad categories:
1. Observer: similar to the observer pattern, it provides hook methods around operations such as Get, Put, Delete, and Scan. Observers are further divided into RegionObserver, WALObserver, and MasterObserver (a minimal sketch follows this list).
2. Endpoint: invoked through RPC calls.
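For reference, here is a minimal Observer sketch; it is not used in the rest of this post, and the class name and its behavior are hypothetical. It assumes the HBase 0.98 API, where custom observers typically extend BaseRegionObserver:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Hypothetical example: runs on the RegionServer before every Put.
public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
            Put put, WALEdit edit, Durability durability) throws IOException {
        // Intercept the write here, e.g. to validate or audit it;
        // throwing an IOException rejects the Put.
    }
}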
Below we use an Endpoint to implement row counting.
Development environment:
Hadoop 2.6.0
HBase 0.98.4
Implementation:
1. Define the RPC protocol (ExampleProtos.proto)
option java_package = "com.iss.gbg.protobuf.proto.generated";
option java_outer_classname = "ExampleProtos";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;
message CountRequest {
}
message CountResponse {
required int64 count = 1 [default = 0];
}
service RowCountService {
rpc getRowCount(CountRequest)
returns (CountResponse);
}
The protocol definition is not explained line by line here; for what each option and declaration means, see the Protocol Buffers documentation.
2. Generate the Java classes with the Protocol Buffers compiler.
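A typical compilation command is shown below. This is a sketch, assuming protoc 2.5.0 (the protobuf version HBase 0.98 is built against) is on the PATH, that ExampleProtos.proto sits in the current directory, and that generated sources go under src/main/java (both paths are assumptions):

protoc --java_out=src/main/java ExampleProtos.proto

Given the java_package and java_outer_classname options above, this produces src/main/java/com/iss/gbg/protobuf/proto/generated/ExampleProtos.java.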
3. Server-side implementation:
package com.iss.gbg.protobuf.hbase;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.protobuf.ResponseConverter;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos;

public class RowCountEndpoint extends ExampleProtos.RowCountService
        implements Coprocessor, CoprocessorService {

    private RegionCoprocessorEnvironment env;

    public RowCountEndpoint() {
    }

    /**
     * Just returns a reference to this object, which implements the
     * RowCountService interface.
     */
    @Override
    public Service getService() {
        return this;
    }

    /**
     * Returns a count of the rows in the region where this coprocessor is loaded.
     */
    @Override
    public void getRowCount(RpcController controller, ExampleProtos.CountRequest request,
            RpcCallback<ExampleProtos.CountResponse> done) {
        Scan scan = new Scan();
        scan.setFilter(new FirstKeyOnlyFilter());
        ExampleProtos.CountResponse response = null;
        InternalScanner scanner = null;
        try {
            scanner = env.getRegion().getScanner(scan);
            List<Cell> results = new ArrayList<Cell>();
            boolean hasMore = false;
            byte[] lastRow = null;
            long count = 0;
            do {
                hasMore = scanner.next(results);
                for (Cell kv : results) {
                    byte[] currentRow = CellUtil.cloneRow(kv);
                    // Count each distinct row key exactly once.
                    if (lastRow == null || !Bytes.equals(lastRow, currentRow)) {
                        lastRow = currentRow;
                        count++;
                    }
                }
                results.clear();
            } while (hasMore);
            response = ExampleProtos.CountResponse.newBuilder().setCount(count).build();
        } catch (IOException ioe) {
            ResponseConverter.setControllerException(controller, ioe);
        } finally {
            if (scanner != null) {
                try {
                    scanner.close();
                } catch (IOException ignored) {
                }
            }
        }
        done.run(response);
    }

    /**
     * Stores a reference to the coprocessor environment provided by the
     * {@link org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost} from the region
     * where this coprocessor is loaded. Since this is a coprocessor endpoint, it always
     * expects to be loaded on a table region, so always expects this to be an instance of
     * {@link RegionCoprocessorEnvironment}.
     * @param env the environment provided by the coprocessor host
     * @throws IOException if the provided environment is not an instance of
     *         {@code RegionCoprocessorEnvironment}
     */
    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        if (env instanceof RegionCoprocessorEnvironment) {
            this.env = (RegionCoprocessorEnvironment) env;
        } else {
            throw new CoprocessorException("Must be loaded on a table region!");
        }
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        // nothing to do
    }
}
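Two details are worth noting. The FirstKeyOnlyFilter makes the scan return only the first cell of each row, so the region reads as little data as possible while still visiting every row. The lastRow comparison then guards against counting a row more than once should multiple cells come back for the same row key.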
4. Client-side invocation:
import java.io.IOException;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.protobuf.ServiceException;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos.RowCountService;

public class RowCountClient {

    private static final Logger LOG = LoggerFactory.getLogger(RowCountClient.class);

    public static void main(String[] args) {
        HTableInterface htable = null;
        try {
            // HBaseServer is the author's connection helper class (not shown in this post).
            htable = HBaseServer.getTable("test_crawler_data");
            LOG.info("Obtained htable successfully!");
            final ExampleProtos.CountRequest request = ExampleProtos.CountRequest.getDefaultInstance();
            // Invokes the endpoint on every region of the table (null start/end keys)
            // and collects one partial count per region.
            Map<byte[], Long> results = htable.coprocessorService(
                    ExampleProtos.RowCountService.class, null, null,
                    new Batch.Call<ExampleProtos.RowCountService, Long>() {
                        @Override
                        public Long call(RowCountService counter) throws IOException {
                            ServerRpcController controller = new ServerRpcController();
                            BlockingRpcCallback<ExampleProtos.CountResponse> rpcCallback =
                                    new BlockingRpcCallback<ExampleProtos.CountResponse>();
                            counter.getRowCount(controller, request, rpcCallback);
                            ExampleProtos.CountResponse response = rpcCallback.get();
                            if (controller.failedOnException()) {
                                throw controller.getFailedOn();
                            }
                            return (null != response && response.hasCount()) ? response.getCount() : 0;
                        }
                    });
            if (null != results && !results.isEmpty()) {
                long sum = 0;
                for (Entry<byte[], Long> entry : results.entrySet()) {
                    sum += entry.getValue().longValue();
                }
                System.out.println("sum=" + sum);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ServiceException e) {
            e.printStackTrace();
        } catch (Throwable e) {
            e.printStackTrace();
        } finally {
            if (null != htable) {
                try {
                    htable.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
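Because coprocessorService is invoked with null start and end row keys, the call fans out to every region of the table. The returned map holds one partial count per region, which the client then sums into the final total.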
Deployment:
1. Package the code into a jar, copy it into HBase's lib directory, and restart HBase.
2. Attach the endpoint to the target table:
alter 'test_crawler_data','coprocessor'=>'|com.iss.gbg.protobuf.hbase.RowCountEndpoint|1001|'
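The attribute value follows the form 'jar-path|class-name|priority|arguments'. The jar path is left empty here because the jar is already on HBase's classpath from step 1; 1001 is the coprocessor's load priority, and no arguments are passed.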
Describing test_crawler_data now shows:
hbase(main):022:0> describe 'test_crawler_data'
DESCRIPTION                                                                                          ENABLED
 'test_crawler_data', {TABLE_ATTRIBUTES => {coprocessor$1 =>                                         true
 '|org.apache.hadoop.hbase.coprocessor.example.RowCountEndpoint|1001|'},
 {NAME => 'extdata', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 COMPRESSION => 'NONE', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
 KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'srcdata', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 COMPRESSION => 'NONE', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
 KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
Invocation:
Run the client program to get the row count of the specified table.
Please credit the source when reposting: http://dujian-gu.iteye.com/blog/2225032