Pre-splitting Regions When Creating an HBase Table

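If you know how the row keys of an HBase table are distributed, you can pre-split the table into regions when you create it. This avoids hotspotting when large volumes of data are inserted and makes inserts more efficient.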

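Background: by default HBase creates a table with a single region. That region has no rowkey boundaries, that is, no startkey and no endkey, so every write goes to it. As the data keeps growing, the region can no longer hold it and is split into 2 regions. This causes two problems: 1. All data is written to one region, which creates a write hotspot. 2. Region splits consume valuable cluster I/O resources.

To avoid this, we can create several empty regions when the table is created and fix the start and end rowkey of each one. As long as the rowkey design spreads writes evenly across those regions, there is no write hotspot, and the chance of a split drops sharply. Of course, as the data keeps growing, regions that need to split will still split. Creating the regions of an HBase table in advance like this is called pre-splitting, and there are usually three ways to do it.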

First, look at a table created without pre-splitting: its single region has an empty startkey and endkey.

1. Create the pre-split regions directly with create in the HBase shell:

create 'split01','cf1',SPLITS=>['1000000','2000000','3000000']

The three split keys produce 4 regions, and rows are written to different regions according to their row key.

2. Create the pre-split regions from a split file:

create 'split02','cf1',SPLITS_FILE=>'/data/hbase/split/split.txt'
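
The post does not show what split.txt contains. As far as I know, the shell reads the SPLITS_FILE path from the local filesystem and treats each line as one split key, so the file for the three split keys used earlier would just be three lines. A minimal sketch that prints such a file, assuming that one-key-per-line format (the class and method names are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

class SplitFileSketch {
    /** Renders split keys as split-file content: one key per line. */
    static String render(List<String> splitKeys) {
        StringBuilder sb = new StringBuilder();
        for (String key : splitKeys) {
            sb.append(key).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same split keys as the SPLITS example for table 'split01'.
        List<String> keys = Arrays.asList("1000000", "2000000", "3000000");
        // Print to stdout; redirect the output to the file named in SPLITS_FILE.
        System.out.print(render(keys));
    }
}
```

Redirect the output to the path passed to SPLITS_FILE before running the create command.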

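3. Create and pre-split the table with the Java API:

The Admin interface in the HBase client package provides four methods for creating a table (the first three are synchronous, the fourth is asynchronous):

- Create the table directly from a descriptor

This creates the table from the table descriptor alone, without specifying any split points.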

/**
 * Creates a new table. Synchronous operation.
 *
 * @param desc table descriptor for table
 * @throws IllegalArgumentException if the table name is reserved
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 * threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException if a remote or network exception occurs
   */
  void createTable(HTableDescriptor desc) throws IOException;

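- Create from a descriptor, a region count, a startKey and an endKey, with ranges assigned automatically

This creates the table from the table descriptor plus a startKey, an endKey and a region count. HBase creates that many regions and assigns a key range to each one automatically, but the ranges are always contiguous and evenly sized. If some key ranges of your business data hold far more rows than others, this leads to data skew. In that case you have to specify the split points yourself, using the third or fourth method.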

/**
 * Creates a new table with the specified number of regions.  The start key specified will become
 * the end key of the first region of the table, and the end key specified will become the start
 * key of the last region of the table (the first region has a null start key and the last region
 * has a null end key). BigInteger math will be used to divide the key range specified into enough
 * segments to make the required number of total regions. Synchronous operation.
 *
 * @param desc table descriptor for table
 * @param startKey beginning of key range
 * @param endKey end of key range
 * @param numRegions the total number of regions to create
 * @throws IllegalArgumentException if the table name is reserved
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 * threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException
   */
  void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
      throws IOException;
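
To make the "BigInteger math" in the Javadoc concrete, here is a rough sketch of how a key range can be divided evenly. It is not HBase's own code: it assumes startKey and endKey have the same length, reads them as unsigned big-endian integers, and returns numRegions - 1 boundary keys with startKey first and endKey last, matching the description above. The class and method names are hypothetical, and the minimum of three regions is an assumption that follows from startKey and endKey both being boundaries.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class EvenSplitSketch {
    /** Returns numRegions - 1 evenly spaced boundary keys from startKey to endKey inclusive. */
    static List<byte[]> evenSplitKeys(byte[] startKey, byte[] endKey, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        if (startKey.length != endKey.length) {
            throw new IllegalArgumentException("sketch assumes equal-length keys");
        }
        BigInteger start = new BigInteger(1, startKey); // signum 1: treat bytes as unsigned
        BigInteger end = new BigInteger(1, endKey);
        if (start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("startKey must sort before endKey");
        }
        int boundaries = numRegions - 1;
        BigInteger range = end.subtract(start);
        BigInteger steps = BigInteger.valueOf(boundaries - 1);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i < boundaries; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i)).divide(steps);
            keys.add(toFixedWidth(start.add(offset), startKey.length));
        }
        return keys;
    }

    /** Converts a non-negative value back to exactly width bytes, left-padded with zeros. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry an extra leading sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // One-byte keys 0x00..0x64 divided into 5 regions.
        for (byte[] key : evenSplitKeys(new byte[] {0x00}, new byte[] {0x64}, 5)) {
            System.out.println(key[0] & 0xff);
        }
    }
}
```

The boundaries depend only on the numeric range, not on where the data actually sits, which is why skewed keys need hand-picked split points.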

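- Create from a descriptor and custom split keys (synchronous)

This creates the table from the table descriptor and split keys that you define yourself, so you control the key range each region covers. For example: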

byte[][] splitKeys = new byte[][] { Bytes.toBytes("10000"),
                Bytes.toBytes("20000"), Bytes.toBytes("30000"),
                Bytes.toBytes("40000") };

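If you pass the splitKeys above to the method, it creates 5 regions and assigns a key range to each. The first region has no startKey and the last one has no endKey; each start key is inclusive and each end key is exclusive:
Region 1: " to 10000"
Region 2: "10000 to 20000"
Region 3: "20000 to 30000"
Region 4: "30000 to 40000"
Region 5: "40000 to "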
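
To see which region a given row key falls into, the ranges above can be checked with a small lookup. This is only an illustrative sketch, not HBase code: it compares keys byte by byte as unsigned values, which is how HBase orders row keys, and treats each start key as inclusive and each end key as exclusive. The class and helper names are made up.

```java
import java.nio.charset.StandardCharsets;

class RegionLookupSketch {
    /** Lexicographic comparison of two keys, byte by byte, as unsigned values. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns the 0-based region index for rowKey; splitKeys must be sorted ascending. */
    static int regionIndex(byte[][] splitKeys, byte[] rowKey) {
        int index = 0;
        for (byte[] split : splitKeys) {
            if (compare(rowKey, split) >= 0) {
                index++; // rowKey is at or past this boundary, so it belongs further right
            } else {
                break;
            }
        }
        return index;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[][] splitKeys = {ascii("10000"), ascii("20000"), ascii("30000"), ascii("40000")};
        for (String row : new String[] {"05000", "10000", "25000", "40000", "9"}) {
            System.out.println(row + " -> region " + (regionIndex(splitKeys, ascii(row)) + 1));
        }
    }
}
```

Note that the comparison is lexicographic, not numeric: the row key "9" lands in the fifth region because the byte '9' sorts after '4'. Fixed-width, zero-padded keys avoid that surprise.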

/**
 * Creates a new table with an initial set of empty regions defined by the specified split keys.
 * The total number of regions created will be the number of split keys plus one. Synchronous
 * operation. Note : Avoid passing empty split key.
 *
 * @param desc table descriptor for table
 * @param splitKeys array of split keys for the initial regions of the table
 * @throws IllegalArgumentException if the table name is reserved, if the split keys are repeated
 * and if the split key has empty byte array.
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 * threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException
   */
  void createTable(final HTableDescriptor desc, byte[][] splitKeys) throws IOException;
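
When the keys are skewed, one way to choose the splitKeys is to take them from a sample of real row keys, so that each region receives roughly the same number of rows. The sketch below is one possible approach, not something the HBase API provides: it assumes ASCII string keys (so Java string order matches byte order) and a sample that is representative of the data. Repeated candidates are dropped because, per the Javadoc above, repeated split keys are rejected. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SampleSplitSketch {
    /** Picks up to numRegions - 1 split keys at evenly spaced positions in the sorted sample. */
    static List<String> splitKeysFromSample(List<String> sampleKeys, int numRegions) {
        if (numRegions < 2 || sampleKeys.isEmpty()) {
            throw new IllegalArgumentException("need a non-empty sample and at least 2 regions");
        }
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int position = (int) ((long) i * sorted.size() / numRegions);
            String candidate = sorted.get(position);
            // Skip repeats so the result never contains the same split key twice.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Most sample keys start with "a", so most boundaries land inside the "a" range.
        List<String> sample = Arrays.asList("a1", "a2", "a3", "a4", "a5", "a6", "b1", "z9");
        System.out.println(splitKeysFromSample(sample, 4));
    }
}
```

Each resulting string would then be converted with Bytes.toBytes to build the byte[][] passed to createTable.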

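- Create from a descriptor and custom split keys (asynchronous)

This is the same as the third method above, except that it runs asynchronously: the call returns without waiting for the table to come online.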

/**
   * Creates a new table but does not block and wait for it to come online. Asynchronous operation.
   * To check if the table exists, use {@link #isTableAvailable} -- it is not safe to create an
   * HTable instance to this table before it is available. Note : Avoid passing empty split key.
   *
   * @param desc table descriptor for table
   * @throws IllegalArgumentException Bad table name, if the split keys are repeated and if the
   * split key has empty byte array.
   * @throws MasterNotRunningException if master is not running
   * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
   * threads, the table may have been created between test-for-existence and attempt-at-creation).
   * @throws IOException
   */
  void createTableAsync(final HTableDescriptor desc, final byte[][] splitKeys) throws IOException;
