When we need a simple row count or aggregate over data in an HBase table, shipping the data to the client for computation via MapReduce or the native API incurs significant latency and heavy network I/O. If the computation is pushed to the server side instead, the network I/O largely disappears and performance improves substantially. HBase coprocessors are a good fit for exactly this.
HBase coprocessors fall into two broad categories:
1. Observer: analogous to the observer pattern; it provides hook methods around operations such as Get, Put, Delete, and Scan. Observers are further divided into RegionObserver, WALObserver, and MasterObserver (a minimal sketch follows this list).
2. Endpoint: invoked explicitly by the client via RPC.
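To make the hook idea concrete, here is a minimal Observer sketch. The class name PutLoggingObserver and the printed message are invented for illustration; the postPut signature follows the HBase 0.98 API. It runs inside the RegionServer and fires after every successful Put, with no client-visible RPC surface:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical example class: logs the row key of every Put applied to the region.
public class PutLoggingObserver extends BaseRegionObserver {
  // postPut is one of the Observer hooks; it runs after the Put has been applied.
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, Durability durability) throws IOException {
    System.out.println("postPut on row: " + Bytes.toString(put.getRow()));
  }
}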
Below, we use an Endpoint to implement a row count.
Development environment:
Hadoop 2.6.0
HBase 0.98.4
Implementation:
1. Define the RPC protocol (ExampleProtos.proto)
option java_package = "com.iss.gbg.protobuf.proto.generated";
option java_outer_classname = "ExampleProtos";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;
message CountRequest {
}
message CountResponse {
required int64 count = 1 [default = 0];
}
service RowCountService {
rpc getRowCount(CountRequest)
returns (CountResponse);
}
The protocol file is not explained line by line here; see the Protocol Buffers documentation for what each directive means.
Generate the Java classes with the Protocol Buffers compiler.
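For example, with protoc on the PATH (HBase 0.98 bundles protobuf 2.5.0, so the matching protoc version should be used; the output directory below is an assumption of this sketch), the following generates ExampleProtos.java into the package declared by java_package:

protoc --java_out=src/main/java ExampleProtos.proto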
Server-side implementation:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.protobuf.ResponseConverter;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos;

public class RowCountEndpoint extends ExampleProtos.RowCountService
    implements Coprocessor, CoprocessorService {
  private RegionCoprocessorEnvironment env;

  public RowCountEndpoint() {
  }

  /**
   * Just returns a reference to this object, which implements the RowCountService interface.
   */
  @Override
  public Service getService() {
    return this;
  }

  /**
   * Returns a count of the rows in the region where this coprocessor is loaded.
   */
  @Override
  public void getRowCount(RpcController controller, ExampleProtos.CountRequest request,
      RpcCallback<ExampleProtos.CountResponse> done) {
    // FirstKeyOnlyFilter returns only the first cell of each row, which is
    // enough for counting and avoids reading whole rows.
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    ExampleProtos.CountResponse response = null;
    InternalScanner scanner = null;
    try {
      scanner = env.getRegion().getScanner(scan);
      List<Cell> results = new ArrayList<Cell>();
      boolean hasMore = false;
      byte[] lastRow = null;
      long count = 0;
      do {
        hasMore = scanner.next(results);
        for (Cell kv : results) {
          byte[] currentRow = CellUtil.cloneRow(kv);
          // Count each row only once, even if the scanner returns several cells for it.
          if (lastRow == null || !Bytes.equals(lastRow, currentRow)) {
            lastRow = currentRow;
            count++;
          }
        }
        results.clear();
      } while (hasMore);
      response = ExampleProtos.CountResponse.newBuilder()
          .setCount(count).build();
    } catch (IOException ioe) {
      ResponseConverter.setControllerException(controller, ioe);
    } finally {
      if (scanner != null) {
        try {
          scanner.close();
        } catch (IOException ignored) {}
      }
    }
    done.run(response);
  }

  /**
   * Stores a reference to the coprocessor environment provided by the
   * {@link org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost} from the region where this
   * coprocessor is loaded. Since this is a coprocessor endpoint, it always expects to be loaded
   * on a table region, so always expects this to be an instance of
   * {@link RegionCoprocessorEnvironment}.
   * @param env the environment provided by the coprocessor host
   * @throws IOException if the provided environment is not an instance of
   * {@code RegionCoprocessorEnvironment}
   */
  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    if (env instanceof RegionCoprocessorEnvironment) {
      this.env = (RegionCoprocessorEnvironment) env;
    } else {
      throw new CoprocessorException("Must be loaded on a table region!");
    }
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    // nothing to do
  }
}
Client-side invocation code:
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.protobuf.ServiceException;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos;

public class RowCountClient {
  private static final Logger LOG = LoggerFactory.getLogger(RowCountClient.class);

  public static void main(String[] args) {
    HTableInterface htable = null;
    try {
      // HBaseServer is a connection helper from this project; any code that
      // yields an HTableInterface for the target table works here.
      htable = HBaseServer.getTable("test_crawler_data");
      LOG.info("Table handle acquired!");
      final ExampleProtos.CountRequest request = ExampleProtos.CountRequest.getDefaultInstance();
      // Invoke the endpoint on every region of the table (null start/end keys)
      // and collect one partial count per region.
      Map<byte[], Long> results = htable.coprocessorService(
          ExampleProtos.RowCountService.class, null, null,
          new Batch.Call<ExampleProtos.RowCountService, Long>() {
            @Override
            public Long call(ExampleProtos.RowCountService counter) throws IOException {
              ServerRpcController controller = new ServerRpcController();
              BlockingRpcCallback<ExampleProtos.CountResponse> rpcCallback =
                  new BlockingRpcCallback<ExampleProtos.CountResponse>();
              counter.getRowCount(controller, request, rpcCallback);
              ExampleProtos.CountResponse response = rpcCallback.get();
              if (controller.failedOnException()) {
                throw controller.getFailedOn();
              }
              return (response != null && response.hasCount()) ? response.getCount() : 0L;
            }
          });
      if (results != null && !results.isEmpty()) {
        // Sum the per-region counts to get the total for the table.
        long sum = 0;
        for (Map.Entry<byte[], Long> entry : results.entrySet()) {
          sum += entry.getValue().longValue();
        }
        System.out.println("sum=" + sum);
      }
    } catch (IOException e) {
      e.printStackTrace();
    } catch (ServiceException e) {
      e.printStackTrace();
    } catch (Throwable e) {
      e.printStackTrace();
    } finally {
      if (htable != null) {
        try {
          htable.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }
}
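Note how the aggregation is split between server and client: coprocessorService() invokes the endpoint once per region overlapping the given key range (null start and end keys cover the whole table) and returns a map of per-region results, so the client only sums one Long per region instead of pulling raw rows over the network. BlockingRpcCallback simply blocks until the RegionServer's response arrives, which is acceptable here since the client library dispatches the per-region calls from its own pool.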
Deployment:
1. Package the code into a jar, copy the jar into HBase's lib directory on every RegionServer, and restart HBase.
2. Attach the endpoint to the target table. The coprocessor value has four '|'-separated fields: jar path, class name, priority, and arguments; the jar path is left empty here because step 1 already put the class on the RegionServer classpath.
alter 'test_crawler_data','coprocessor'=>'|com.iss.gbg.protobuf.hbase.RowCountEndpoint|1001|'
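Depending on cluster settings (hbase.online.schema.update.enable), altering an enabled table may be rejected; if so, wrap the alter in a disable/enable pair, for example:

disable 'test_crawler_data'
alter 'test_crawler_data', 'coprocessor' => '|com.iss.gbg.protobuf.hbase.RowCountEndpoint|1001|'
enable 'test_crawler_data'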
Describing test_crawler_data now shows the coprocessor attached:
hbase(main):022:0> describe 'test_crawler_data'
DESCRIPTION                                                                 ENABLED
 'test_crawler_data', {TABLE_ATTRIBUTES => {coprocessor$1 =>                true
 '|com.iss.gbg.protobuf.hbase.RowCountEndpoint|1001|'},
 {NAME => 'extdata', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1',
 MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE',
 BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'srcdata', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1',
 MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE',
 BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
Invocation:
Run the client program; it prints the total row count of the table.
Please credit the source when reposting: http://dujian-gu.iteye.com/blog/2225032