[Original] Row Counting with an HBase 0.98 Coprocessor Endpoint

When we run simple row counts or aggregations over an HBase table, shipping the data to the client with MapReduce or the native API incurs high latency and heavy network I/O. If the computation can instead run on the server side, that I/O disappears and performance improves substantially. HBase coprocessors are designed for exactly this.

HBase coprocessors fall into two broad categories:

1. Observer: analogous to the observer pattern, it provides hook methods around operations such as Get, Put, Delete, and Scan. Observers are further divided into RegionObserver, WALObserver, and MasterObserver.

2. Endpoint: custom server-side logic invoked by the client through RPC.

Below is a row-count implementation using an Endpoint.

Development environment:

Hadoop 2.6.0

HBase 0.98.4

Implementation code:

1. Define the RPC protocol (ExampleProtos.proto)

option java_package = "com.iss.gbg.protobuf.proto.generated";
option java_outer_classname = "ExampleProtos";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;

message CountRequest {
}

message CountResponse {
  required int64 count = 1 [default = 0];
}

service RowCountService {
  rpc getRowCount(CountRequest)
    returns (CountResponse);
}

The protocol file is not explained line by line here; see the Protocol Buffers documentation for what each option and declaration means.

Generate the Java classes with the Protocol Buffers compiler.
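Assuming protoc 2.5.0 (the protobuf version HBase 0.98 depends on) is on the PATH, the generation step looks roughly like this; the output directory is an assumption, adjust it to your project layout:

```shell
# Generates com/iss/gbg/protobuf/proto/generated/ExampleProtos.java
# under src/main/java (hypothetical source root).
protoc --java_out=src/main/java ExampleProtos.proto
```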

Server-side implementation (imports included for completeness):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.protobuf.ResponseConverter;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;

import com.iss.gbg.protobuf.proto.generated.ExampleProtos;

public class RowCountEndpoint extends ExampleProtos.RowCountService
    implements Coprocessor, CoprocessorService {
  private RegionCoprocessorEnvironment env;

  public RowCountEndpoint() {
  }

  /**
   * Just returns a reference to this object, which implements the RowCountService interface.
   */
  @Override
  public Service getService() {
    return this;
  }

  /**
   * Returns a count of the rows in the region where this coprocessor is loaded.
   */
  @Override
  public void getRowCount(RpcController controller, ExampleProtos.CountRequest request,
                          RpcCallback<ExampleProtos.CountResponse> done) {
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    ExampleProtos.CountResponse response = null;
    InternalScanner scanner = null;
    try {
      scanner = env.getRegion().getScanner(scan);
      List<Cell> results = new ArrayList<Cell>();
      boolean hasMore = false;
      byte[] lastRow = null;
      long count = 0;
      do {
        hasMore = scanner.next(results);
        for (Cell kv : results) {
          byte[] currentRow = CellUtil.cloneRow(kv);
          if (lastRow == null || !Bytes.equals(lastRow, currentRow)) {
            lastRow = currentRow;
            count++;
          }
        }
        results.clear();
      } while (hasMore);

      response = ExampleProtos.CountResponse.newBuilder()
          .setCount(count).build();
    } catch (IOException ioe) {
      ResponseConverter.setControllerException(controller, ioe);
    } finally {
      if (scanner != null) {
        try {
          scanner.close();
        } catch (IOException ignored) {}
      }
    }
    done.run(response);
  }


  /**
   * Stores a reference to the coprocessor environment provided by the
   * {@link org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost} from the region where this
   * coprocessor is loaded.  Since this is a coprocessor endpoint, it always expects to be loaded
   * on a table region, so always expects this to be an instance of
   * {@link RegionCoprocessorEnvironment}.
   * @param env the environment provided by the coprocessor host
   * @throws IOException if the provided environment is not an instance of
   * {@code RegionCoprocessorEnvironment}
   */
  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    if (env instanceof RegionCoprocessorEnvironment) {
      this.env = (RegionCoprocessorEnvironment)env;
    } else {
      throw new CoprocessorException("Must be loaded on a table region!");
    }
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    // nothing to do
  }
}
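The heart of getRowCount() is the lastRow comparison: even with FirstKeyOnlyFilter applied, the loop defensively avoids counting a row twice by only incrementing when the row key changes. The same logic, sketched standalone with plain strings in place of HBase Cells (class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class DistinctRowCount {
    // Count distinct consecutive row keys, mirroring the lastRow
    // check in RowCountEndpoint.getRowCount().
    static long countRows(List<String> rowKeys) {
        long count = 0;
        String lastRow = null;
        for (String currentRow : rowKeys) {
            if (lastRow == null || !lastRow.equals(currentRow)) {
                lastRow = currentRow;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two cells belonging to row "r1" and one to "r2": two rows total.
        List<String> cellRowKeys = Arrays.asList("r1", "r1", "r2");
        System.out.println(countRows(cellRowKeys)); // prints 2
    }
}
```

A scan without FirstKeyOnlyFilter would return every cell of every row, which is why the duplicate-row guard matters.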

Client invocation code (imports included; HBaseServer.getTable is the author's own connection helper):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.protobuf.ServiceException;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos;
import com.iss.gbg.protobuf.proto.generated.ExampleProtos.RowCountService;

public class RowCountClient {
	private static final Logger LOG = LoggerFactory.getLogger(RowCountClient.class);

	public static void main(String[] args) {
		HTableInterface htable = null;
		try {
			htable = HBaseServer.getTable("test_crawler_data");
			LOG.info("Got the HTable instance.");
			final ExampleProtos.CountRequest request = ExampleProtos.CountRequest.getDefaultInstance();

			// Invoke the endpoint on every region of the table
			// (null start/end row means the whole key range).
			Map<byte[], Long> results = htable.coprocessorService(
					ExampleProtos.RowCountService.class, null, null,
					new Batch.Call<RowCountService, Long>() {
						@Override
						public Long call(RowCountService counter) throws IOException {
							ServerRpcController controller = new ServerRpcController();
							BlockingRpcCallback<ExampleProtos.CountResponse> rpcCallback =
									new BlockingRpcCallback<ExampleProtos.CountResponse>();
							counter.getRowCount(controller, request, rpcCallback);
							ExampleProtos.CountResponse response = rpcCallback.get();
							if (controller.failedOnException()) {
								throw controller.getFailedOn();
							}
							return (null != response && response.hasCount()) ? response.getCount() : 0;
						}
					});
			if (null != results && !results.isEmpty()) {
				long sum = 0;
				for (Long partial : results.values()) {
					sum += partial.longValue();
				}
				System.out.println("sum=" + sum);
			}
		} catch (ServiceException e) {
			e.printStackTrace();
		} catch (Throwable e) {
			e.printStackTrace();
		} finally {
			if (null != htable) {
				try {
					htable.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}
}
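coprocessorService() returns one partial count per region, keyed by region name, so the table-wide total is simply the sum of the map's values. That aggregation step, sketched standalone (region names and counts are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class SumRegions {
    // Sum per-region partial counts into the table-wide total,
    // mirroring the summation loop in RowCountClient.
    static long total(Map<String, Long> perRegion) {
        long sum = 0;
        for (long partial : perRegion.values()) {
            sum += partial;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> perRegion = new HashMap<String, Long>();
        perRegion.put("region-1", 120L); // hypothetical partial counts
        perRegion.put("region-2", 80L);
        System.out.println(total(perRegion)); // prints 200
    }
}
```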

Deployment:

1. Package the code into a jar, put it in HBase's lib directory, and restart HBase.

2. Attach the endpoint to the target table (in the HBase shell):

alter 'test_crawler_data','coprocessor'=>'|com.iss.gbg.protobuf.hbase.RowCountEndpoint|1001|'
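Alternatively, an endpoint can be loaded statically on every region of every table through hbase-site.xml via the hbase.coprocessor.region.classes property (a RegionServer restart is required); a sketch using the class from this article:

```xml
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>com.iss.gbg.protobuf.hbase.RowCountEndpoint</value>
</property>
```

The per-table alter is usually preferable, since a statically loaded endpoint applies cluster-wide.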

Describing test_crawler_data now shows:

hbase(main):022:0> describe 'test_crawler_data'
DESCRIPTION                                                                                                       ENABLED                                                      
 'test_crawler_data', {TABLE_ATTRIBUTES => {coprocessor$1 => '|org.apache.hadoop.hbase.coprocessor.example.RowCou true                                                         
 ntEndpoint|1001|'}, {NAME => 'extdata', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =                                                              
 > '0', COMPRESSION => 'NONE', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FAL                                                              
 SE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'srcdata', DATA_BLOCK_ENCODING                                                               
 => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', MIN_VERSIONS                                                               
 => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE                                                               
 => 'true'}                                                                                                     

Running it:

Executing the client program returns the row count of the specified table.

Please credit the source when reposting: http://dujian-gu.iteye.com/blog/2225032