I have spent quite a bit of time lately working on bulk loading data into HBase. What looks like a simple problem turns out to be riddled with pitfalls...
In this post I will share the problems I ran into and how I solved them, in the hope of saving newcomers from taking as many detours as I did.
The usual way to insert into HBase is through an HTable object: the data is wrapped in a Put, which is constructed with the rowkey and then has the column family, qualifier and value added to it. HTable.put() then submits the Put to the RegionServer via an RPC request.
Writes can be grouped into the following approaches:
single Put
batched Puts
MapReduce
bulkload
Before doing any bulk loading, the first step is to connect to HBase correctly:
static {
    conf = HBaseConfiguration.create();
    // connection settings for the HBase cluster
    // the port ZooKeeper exposes to clients
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    conf.set("hbase.zookeeper.quorum", "192.168.137.138,192.168.137.139");
    conf.set("hbase.master", "192.168.10.138:60000");
}
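Before going any further it is worth confirming that the client can actually reach the cluster with this configuration; a wrong quorum or port otherwise only shows up much later as a hung job. Here is just a minimal sanity-check sketch using the old-style HBaseAdmin API (this check is my own addition, not part of the original code):

try {
    // throws if ZooKeeper or the HMaster cannot be reached with this conf
    HBaseAdmin.checkHBaseAvailable(conf);
    System.out.println("HBase is reachable");
} catch (Exception e) {
    System.out.println("cannot connect to HBase: " + e);
}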
Next, create the table structure:
public static void createTable(String tableName) {
    try {
        ha = new HBaseAdmin(conf);
        // drop the table if it already exists
        if (ha.tableExists(tableName)) {
            ha.disableTable(tableName);
            ha.deleteTable(tableName);
        }
        // build the table descriptor
        HTableDescriptor hd = new HTableDescriptor(tableName);
        // add the column families
        hd.addFamily(new HColumnDescriptor("family1".getBytes()));
        hd.addFamily(new HColumnDescriptor("family2".getBytes()));
        ha.createTable(hd);
    } catch (Exception e) {
        System.out.println(e);
    }
}
With that in place, we can start inserting data.
// insert one value: rowkey, column family, qualifier, value, and target table name
public static void insertData(String rowkey, String cf,
        String column, String content, String tableName)
        throws IOException {
    htable = new HTable(conf, tableName);
    Put put = new Put(rowkey.getBytes());
    put.add(cf.getBytes(), column.getBytes(), content.getBytes());
    htable.put(put);
}
This is the slowest way to bulk insert data. It is better suited to online workloads where records arrive one at a time, such as message logs or processing records, with the HTable object released after each write. Every call is one RPC request.
Batched Puts simply collect the Put objects into a List and submit the whole list at once. Compared with single Puts this noticeably improves throughput because batching reduces the number of RPC requests; it suits scenarios with a somewhat larger data volume.
public static void insertData(String rowkey, String cf,
        String column, String content, String tableName)
        throws IOException {
    htable = new HTable(conf, tableName);
    List<Put> list = new ArrayList<Put>();
    Put put = new Put(rowkey.getBytes());
    put.add(cf.getBytes(), column.getBytes(), content.getBytes());
    list.add(put);
    htable.put(list);
}
Calling this method from main with the right arguments is all it takes to batch-insert data with Puts:
public static void main(String[] args) throws IOException {
    createTable("insertTest");
    try {
        for (int i = 0; i < 10; i++) {
            String rowkey = UUID.randomUUID().toString();
            // column families cannot be added dynamically, so only columns are added on the fly
            // (with a fixed qualifier like this, the writes keep overwriting the same cell)
            for (int j = 0; j <= 10; j++) {
                insertData(rowkey, "family1", "column",
                        new SimpleDateFormat("yyyy-MM-dd hh:mm:ss").format(new Date()),
                        "insertTest");
                insertData(rowkey, "family2", "column",
                        new SimpleDateFormat("yyyy-MM-dd hh:mm:ss").format(new Date()),
                        "insertTest");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
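One caveat about the "batch" version above: it still opens a new HTable and submits a list containing a single Put on every call, so in practice it behaves much like the single-put version. To really benefit from batching, reuse the HTable and accumulate many Puts before each submission. A rough sketch of that idea; the buffer sizes and the loop are purely illustrative, and it assumes the surrounding method declares throws IOException:

// reuse one HTable and let the client-side write buffer batch the RPCs
HTable table = new HTable(conf, tableName);
table.setAutoFlush(false);
table.setWriteBufferSize(5 * 1024 * 1024); // flush roughly every 5 MB of puts

List<Put> puts = new ArrayList<Put>();
for (int i = 0; i < 100000; i++) {
    Put put = new Put(("row_" + i).getBytes());
    put.add("family1".getBytes(), "column".getBytes(), ("value_" + i).getBytes());
    puts.add(put);
    if (puts.size() >= 1000) {   // submit in chunks instead of one RPC per row
        table.put(puts);
        puts.clear();
    }
}
table.put(puts);        // submit the remainder
table.flushCommits();   // make sure the write buffer is flushed
table.close();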
Because Put-based inserts are not suited to really large volumes, the tests above were run on my own small cluster; the next two approaches were tested on the company cluster.
HBase is a distributed, non-relational database that sits on top of a Hadoop cluster, and Hadoop is the classic tool for processing data at scale, so using MapReduce to bulk load HBase is a natural choice. Enough talk, here is the code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
public class Hbase_MapReduceTest {
    static class BatchMapper extends Mapper<LongWritable, Text, Text, Text> {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyyMMddHHmmssSS");
        Text text = new Text();

        protected void map(LongWritable key, Text value, Context context) {
            try {
                // each input line is expected to be "::"-delimited
                final String[] spliteds = value.toString().split("::");
                Date date = new Date();
                String dateFormat = simpleDateFormat.format(date);
                // rowkey = first field + "_" + timestamp
                final String rowKey = spliteds[0] + "_" + dateFormat;
                text.set(rowKey);
                context.write(text, value);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    static class BatchReducer extends TableReducer<Text, Text, NullWritable> {
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            for (Text tx : values) {
                try {
                    final String[] arrays = tx.toString().split("::");
                    // use toString() so we don't pick up stale bytes from Text's internal buffer
                    Put put = new Put(key.toString().getBytes());
                    put.addColumn("info".getBytes(), "name".getBytes(),
                            arrays[1].getBytes());
                    // write the Put straight to the table via TableOutputFormat
                    context.write(NullWritable.get(), put);
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        final Configuration configuration = new Configuration();
        configuration.set("hbase.zookeeper.quorum", "master");
        configuration.set("hbase.zookeeper.property.clientPort", "4180");
        // name of the target HBase table
        configuration.set(TableOutputFormat.OUTPUT_TABLE, "HBASE_INSERT");
        configuration.set("dfs.socket.timeout", "180000");
        final Job job = new Job(configuration, "HBaseBatchImport");
        // number of reduce tasks
        job.setNumReduceTasks(3);
        job.setMapperClass(BatchMapper.class);
        job.setReducerClass(BatchReducer.class);
        // set the map output types; the reduce output types are not set
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        // no output path is set; instead, set the output format class
        job.setOutputFormatClass(TableOutputFormat.class);
        // input path of the data, e.g. hdfs://master:9000/input
        FileInputFormat.setInputPaths(job, args[0]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
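One assumption worth making explicit: TableOutputFormat writes into an existing table and does not create it, so HBASE_INSERT with column family info has to exist before the job is submitted. A minimal sketch of creating it up front with the same old-style admin API used earlier in this post (the ZooKeeper settings are simply copied from the job configuration above):

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "master");
conf.set("hbase.zookeeper.property.clientPort", "4180");
HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.tableExists("HBASE_INSERT")) {
    // one column family, matching what the reducer writes to
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("HBASE_INSERT"));
    desc.addFamily(new HColumnDescriptor("info"));
    admin.createTable(desc);
}
admin.close();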
The biggest difference from an ordinary MapReduce job is the reduce output: instead of writing the reduce results to HDFS as usual, they go straight into the table, and the reduce output types are not set. These are the two key lines:
// set the target HBase table name
configuration.set(TableOutputFormat.OUTPUT_TABLE, "HBASE_INSERT");
// set the output format class
job.setOutputFormatClass(TableOutputFormat.class);
Under the hood this goes through the TableOutputFormat class, whose core is still HTable's put method, but because the puts are submitted in parallel from a distributed MapReduce job, it is much faster than a single-threaded client.
It is not a silver bullet, though. When puts arrive too fast they put serious pressure on HBase: garbage collection kicks in and region servers can drop out. This is especially likely when a table is first initialized and a large backlog of historical data has to be loaded; in that case the bulkload approach described below is the better choice, because it takes the write pressure off HBase. Also note that in the job above the Put objects are built and handed to TableOutputFormat with hardly any processing in between; if you need real business logic such as joins or aggregation, the usual pattern is to do that work on the map side and build the Put objects on the reduce side.
Finally, bulkload: the most widely used approach, and the one the official documentation recommends as the fastest way to bulk load HBase. The HBase documentation describes it roughly as follows (the original post quoted a Google-translated copy; the gist is given in English here):
Bulk Loading

1. Overview
HBase includes several methods of loading data into tables. The most straightforward is to use the TableOutputFormat class from a MapReduce job, or to use the normal client API; however, these are not always the most efficient methods. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Bulk loading uses less CPU and network than going through the HBase API.

2. Bulk load limitations
Because bulk loading bypasses the write path, the WAL is not written as part of the process. Replication works by reading the WAL files, so it will not see the bulk-loaded data (the same applies to edits that use Put.setDurability(SKIP_WAL)). One way to handle this is to ship the raw files or the HFiles to the other cluster and do the processing there as well.

3. Bulk load architecture
The HBase bulk load process consists of two main steps.

1. Preparing data via a MapReduce job
The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat2. This output format writes data in HBase's internal storage format so that it can later be loaded into the cluster very efficiently. In order to function efficiently, HFileOutputFormat2 must be configured so that each output HFile fits within a single region. To do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table. HFileOutputFormat2 includes a convenience method, configureIncrementalLoad(), which sets up a TotalOrderPartitioner automatically based on the current region boundaries of the table.

2. Completing the data load
After the data has been prepared, either with the importtsv tool and its "importtsv.bulk.output" option or with some other MapReduce job using HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moving it into its storage directory and making the data available to clients. If the region boundaries have changed during bulk load preparation, or between the preparation and completion steps, the completebulkload utility automatically splits the data files into pieces corresponding to the new boundaries. This process is not optimal, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially when other clients are simultaneously loading data through other means.

$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if it is not already supplied on the CLASSPATH (in addition, the CLASSPATH must contain the directory with the zookeeper configuration files if zookeeper is NOT managed by HBase). If the target table does not already exist in HBase, this tool will create it automatically.

4. See also
For more information about the referenced utilities, see ImportTsv and CompleteBulkLoad, and the "How-to: Use HBase Bulk Loading, and Why" blog post for an earlier account of the state of bulk loading.

5. Advanced usage
Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat. The import step of the bulk load can also be done programmatically.
For my test I bulk loaded the output of a WordCount job; the test data was a 2 GB text file. I will skip the WordCount code itself and go straight to the bulk load code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WorldCount_Hbase {

    public static class ConvertWordCountOutToHFileMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The WordCount output looks like "a 100", "b 20", one word per line;
            // the line offset is the key and the line content is the value.
            String wordCountStr = value.toString();
            String[] wordCountArray = wordCountStr.split("\t");
            String word = wordCountArray[0];
            int count = Integer.valueOf(wordCountArray[1]);
            // build the HBase rowkey from the word itself
            byte[] rowKey = Bytes.toBytes(word);
            ImmutableBytesWritable rowKeyWritable = new ImmutableBytesWritable(rowKey);
            byte[] family = Bytes.toBytes("cf");
            byte[] qualifier = Bytes.toBytes("count");
            byte[] hbaseValue = Bytes.toBytes(count);
            // Put can carry several columns of one family; for a single column a
            // KeyValue would also work:
            // KeyValue keyValue = new KeyValue(rowKey, family, qualifier, hbaseValue);
            Put put = new Put(rowKey);
            put.add(family, qualifier, hbaseValue);
            context.write(rowKeyWritable, put);
        }
    }
    public static void main(String[] args) throws Exception {
        Long starttime = System.currentTimeMillis();
        Configuration hadoopConfiguration = new Configuration();
        hadoopConfiguration.set("fs.defaultFS", "hdfs://192.168.1.31:9000");
        // hadoopConfiguration.set("mapreduce.map.memory.mb", "512");
        // hadoopConfiguration.set("mapreduce.reduce.memory.mb", "512");
        String[] dfsArgs = new GenericOptionsParser(hadoopConfiguration, args).getRemainingArgs();
        FileSystem fs = FileSystem.get(hadoopConfiguration);
        Path input = new Path("/tmp/xmr/hbase/test");
        Path output = new Path("/tmp/xmr/resultdata/test");
        // Path input = new Path(dfsArgs[0]);
        // Path output = new Path(dfsArgs[1]);
        Job convertWordCountJobOutputToHFileJob = new Job(hadoopConfiguration, "wordCount_bulkload");
        convertWordCountJobOutputToHFileJob.setJarByClass(WorldCount_Hbase.class);
        convertWordCountJobOutputToHFileJob.setMapperClass(ConvertWordCountOutToHFileMapper.class);
        convertWordCountJobOutputToHFileJob.setMapOutputKeyClass(ImmutableBytesWritable.class);
        convertWordCountJobOutputToHFileJob.setMapOutputValueClass(Put.class);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileInputFormat.addInputPath(convertWordCountJobOutputToHFileJob, input);
        FileOutputFormat.setOutputPath(convertWordCountJobOutputToHFileJob, output);
        // HBase configuration
        Configuration hbaseConfiguration = HBaseConfiguration.create();
        hbaseConfiguration.set("hbase.zookeeper.quorum", "master,node001,node002");
        hbaseConfiguration.set("hbase.zookeeper.property.clientPort", "4180");
        System.out.println(hbaseConfiguration.toString());
        // create the target table if it does not exist yet
        Admin admin = ConnectionFactory.createConnection(hbaseConfiguration).getAdmin();
        if (!admin.isTableAvailable(TableName.valueOf("word1count"))) {
            HTableDescriptor hbaseTable = new HTableDescriptor(TableName.valueOf("word1count"));
            hbaseTable.addFamily(new HColumnDescriptor("cf"));
            admin.createTable(hbaseTable);
        }
        HTable wordCountTable = new HTable(hbaseConfiguration, "word1count");
        // configure the job for HFile output against the target table; this also installs
        // HBase's own reducer and a TotalOrderPartitioner
        HFileOutputFormat.configureIncrementalLoad(convertWordCountJobOutputToHFileJob, wordCountTable);
        int convertWordCountJobOutputToHFileJobResult =
                convertWordCountJobOutputToHFileJob.waitForCompletion(true) ? 0 : 1;
        // bulk load the MapReduce output into HBase
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
        // first argument: the job's output directory holding the HFiles; second: the target table
        loader.doBulkLoad(output, wordCountTable);
        Long endtime = System.currentTimeMillis();
        System.out.println("Job execution time (ms): " + (endtime - starttime));
        System.exit(convertWordCountJobOutputToHFileJobResult);
    }
}
The part that needs attention here is the actual load step:
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
// first argument: the output directory of the HFile-producing job; second: the target table
loader.doBulkLoad(output, wordCountTable);
That covers the basics of the common bulk loading approaches; each of them was tested with a few GB of data without problems.
You can find plenty of near-identical code online. But HBase is a database built to store massive amounts of data, so the real question is: will it still behave once the data volume gets genuinely large in production?
Of course not. The real pitfalls of HBase bulk loading are only just beginning, and they nearly drove me crazy!
I will skip the assorted minor oddities and share the biggest pits I fell into when using BulkLoad.
In real use you will find that for moderately large data volumes the map phase runs at a perfectly acceptable speed, but the reduce phase stalls badly. My first confusion: my code has no reduce step at all, so where does this reduce phase that is dragging down my load come from?
So I tried setting the number of reducers on the job to 0, only to find the annoying reduce phase was still there after rerunning. Since 0 had no effect, I tried the opposite and raised the reducer count to get more parallelism, but the job still ran with a single reducer. Only later did I learn that in this setup you do not write your own reducer: HFileOutputFormat.configureIncrementalLoad() installs HBase's own reducer (plus a TotalOrderPartitioner) and sets the number of reduce tasks itself, so whatever reducer count you set is simply overridden. Knowing that, the behaviour at least made sense.
First I tested with 100 MB of data, and it took a full 30 seconds to load. With a few GB the throughput did not improve noticeably, which works out to fewer than 15,000 rows per second, slower even than loading into MySQL. That kind of throughput is completely unacceptable in production. So much for "the fastest way to load HBase"; I began to have my doubts. At bottom it is the same annoying single-reducer problem: no matter how powerful the cluster is, it is effectively running as a single machine, which is clearly wrong. So how do we fix it?
The answer is to pre-split the table sensibly at creation time. The number of pre-split regions determines the number of reduce tasks. Put simply, within reason, every extra region from a good pre-split buys you roughly one more reducer's worth of throughput.
Choosing a good pre-split scheme for HBase is actually a complex topic in its own right and not the focus of this post. If you are interested, the following post gave me a lot of inspiration:
Link to that blog post
Here I will just show the shell statement for pre-splitting a table at creation time and what it produces:
create 'XUE_BULKLOAD','info',{SPLITS => [ '1','2','3', '4','5','6','7','8','9']}
This creates a table named 'XUE_BULKLOAD' with one column family 'info', pre-split into 10 regions at creation time.
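The same pre-split can also be done from Java when the table is created programmatically; the old admin API accepts an array of split keys. A sketch under the same assumptions as the shell command above (9 split keys '1'..'9', giving 10 regions):

// create XUE_BULKLOAD pre-split into 10 regions: (-inf,'1'), ['1','2'), ... , ['9',+inf)
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("XUE_BULKLOAD"));
desc.addFamily(new HColumnDescriptor("info"));
byte[][] splitKeys = new byte[9][];
for (int i = 1; i <= 9; i++) {
    splitKeys[i - 1] = String.valueOf(i).getBytes();
}
admin.createTable(desc, splitKeys);
admin.close();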
Testing after the pre-split: the number of reducers is now the number of split keys plus one (i.e. the number of regions), and throughput improves dramatically, landing somewhere between 100,000 and 200,000 rows per second, which just about meets the requirements of real work.
After more tuning the throughput was acceptable, so I moved on to larger tests, only to find that with a few tens of GB the job would finish its MapReduce phase and then fail, leaving the table completely empty. The error was:
Trying to load more than 32 hfiles to one family of one region
18/01/18 23:20:36 ERROR mapreduce.LoadIncrementalHFiles: Trying to load
more than 32 hfiles to family info of region with start key
Exception in thread "main" java.io.IOException: Trying to load more than
32 hfiles to one family of one region
at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad
(LoadIncrementalHFiles.java:377)
at hbase_Insert.Hbase_Insert.main(Hbase_Insert.java:241)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The error roughly means the load failed because it tried to push more than 32 HFiles into a single column family of a single region. How do we fix that? It comes down to two parameters:
hbase.hregion.max.filesize
The maximum store size per column family in a region; under the ConstantSizeRegionSplitPolicy, a region is automatically split once it exceeds this value. The default here was 1 GB.
hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily
The maximum number of HFiles a single bulk load may push into one family of one region; the default is 32.
Taken together, the defaults mean one bulk load cannot write more than 1 GB x 32 = 32 GB into a single family of a single region; beyond that the load fails.
These two parameters can be set either in code or in hbase-site.xml under the conf directory of the HBase installation. To settle it once and for all I chose hbase-site.xml, with the following settings:
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>3200</value>
</property>
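For reference, the same two knobs can also be set in code. The HFile-count limit is checked on the client side by LoadIncrementalHFiles, so it belongs on the configuration the loader is built from; hbase.hregion.max.filesize is read from the job's configuration when the HFiles are written, so it would go there. A sketch under those assumptions, using the variable names from the bulkload code above; I have not verified this on every version, so treat it as a starting point:

// limit read by HFileOutputFormat when rolling HFiles during the MapReduce job
hadoopConfiguration.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024); // 10 GB
// limit checked by LoadIncrementalHFiles when pushing the HFiles into the regions
hbaseConfiguration.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 3200);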
With these values a single bulk load can handle up to 10 GB x 3200 = 32 TB per family per region, which is more than enough for our data volume. After restarting the cluster and rerunning the test, the error was gone and the job output looked like this:
18/01/20 12:17:31 INFO mapreduce.Job: map 1% reduce 0%
18/01/20 12:17:35 INFO mapreduce.Job: map 2% reduce 0%
18/01/20 12:17:42 INFO mapreduce.Job: map 3% reduce 0%
18/01/20 12:17:45 INFO mapreduce.Job: map 4% reduce 0%
18/01/20 12:17:51 INFO mapreduce.Job: map 5% reduce 0%
18/01/20 12:17:55 INFO mapreduce.Job: map 6% reduce 0%
18/01/20 12:17:59 INFO mapreduce.Job: map 7% reduce 0%
18/01/20 12:18:03 INFO mapreduce.Job: map 8% reduce 0%
18/01/20 12:18:06 INFO mapreduce.Job: map 9% reduce 0%
18/01/20 12:18:11 INFO mapreduce.Job: map 10% reduce 0%
18/01/20 12:18:16 INFO mapreduce.Job: map 11% reduce 0%
18/01/20 12:18:20 INFO mapreduce.Job: map 12% reduce 0%
18/01/20 12:18:27 INFO mapreduce.Job: map 13% reduce 0%
18/01/20 12:18:32 INFO mapreduce.Job: map 14% reduce 0%
18/01/20 12:18:37 INFO mapreduce.Job: map 15% reduce 0%
18/01/20 12:18:42 INFO mapreduce.Job: map 16% reduce 0%
18/01/20 12:18:47 INFO mapreduce.Job: map 17% reduce 0%
18/01/20 12:18:53 INFO mapreduce.Job: map 18% reduce 0%
18/01/20 12:18:58 INFO mapreduce.Job: map 19% reduce 0%
18/01/20 12:19:03 INFO mapreduce.Job: map 20% reduce 0%
18/01/20 12:19:08 INFO mapreduce.Job: map 21% reduce 0%
18/01/20 12:19:14 INFO mapreduce.Job: map 22% reduce 0%
18/01/20 12:19:18 INFO mapreduce.Job: map 23% reduce 0%
18/01/20 12:19:23 INFO mapreduce.Job: map 24% reduce 0%
18/01/20 12:19:29 INFO mapreduce.Job: map 25% reduce 0%
18/01/20 12:19:33 INFO mapreduce.Job: map 26% reduce 0%
18/01/20 12:19:38 INFO mapreduce.Job: map 27% reduce 0%
18/01/20 12:19:43 INFO mapreduce.Job: map 28% reduce 0%
18/01/20 12:19:48 INFO mapreduce.Job: map 29% reduce 0%
18/01/20 12:19:53 INFO mapreduce.Job: map 30% reduce 0%
18/01/20 12:19:58 INFO mapreduce.Job: map 31% reduce 0%
18/01/20 12:20:04 INFO mapreduce.Job: map 32% reduce 0%
18/01/20 12:20:08 INFO mapreduce.Job: map 33% reduce 0%
18/01/20 12:20:13 INFO mapreduce.Job: map 34% reduce 0%
18/01/20 12:20:17 INFO mapreduce.Job: map 35% reduce 0%
18/01/20 12:20:21 INFO mapreduce.Job: map 36% reduce 0%
18/01/20 12:20:25 INFO mapreduce.Job: map 37% reduce 0%
18/01/20 12:20:29 INFO mapreduce.Job: map 38% reduce 0%
18/01/20 12:20:33 INFO mapreduce.Job: map 39% reduce 0%
18/01/20 12:20:37 INFO mapreduce.Job: map 40% reduce 0%
18/01/20 12:20:41 INFO mapreduce.Job: map 41% reduce 0%
18/01/20 12:20:45 INFO mapreduce.Job: map 42% reduce 0%
18/01/20 12:20:50 INFO mapreduce.Job: map 43% reduce 0%
18/01/20 12:20:54 INFO mapreduce.Job: map 44% reduce 0%
18/01/20 12:20:58 INFO mapreduce.Job: map 45% reduce 0%
18/01/20 12:21:02 INFO mapreduce.Job: map 46% reduce 0%
18/01/20 12:21:06 INFO mapreduce.Job: map 47% reduce 0%
18/01/20 12:21:10 INFO mapreduce.Job: map 48% reduce 0%
18/01/20 12:21:14 INFO mapreduce.Job: map 49% reduce 0%
18/01/20 12:21:18 INFO mapreduce.Job: map 50% reduce 0%
18/01/20 12:21:22 INFO mapreduce.Job: map 51% reduce 0%
........
18/01/20 12:29:12 INFO mapreduce.Job: map 100% reduce 81%
18/01/20 12:29:24 INFO mapreduce.Job: map 100% reduce 82%
18/01/20 12:29:36 INFO mapreduce.Job: map 100% reduce 83%
18/01/20 12:29:48 INFO mapreduce.Job: map 100% reduce 84%
18/01/20 12:30:00 INFO mapreduce.Job: map 100% reduce 85%
18/01/20 12:30:12 INFO mapreduce.Job: map 100% reduce 86%
18/01/20 12:30:23 INFO mapreduce.Job: map 100% reduce 87%
18/01/20 12:30:33 INFO mapreduce.Job: map 100% reduce 88%
18/01/20 12:30:45 INFO mapreduce.Job: map 100% reduce 89%
18/01/20 12:30:59 INFO mapreduce.Job: map 100% reduce 90%
18/01/20 12:31:11 INFO mapreduce.Job: map 100% reduce 91%
18/01/20 12:31:21 INFO mapreduce.Job: map 100% reduce 92%
18/01/20 12:31:33 INFO mapreduce.Job: map 100% reduce 93%
18/01/20 12:31:45 INFO mapreduce.Job: map 100% reduce 94%
18/01/20 12:31:57 INFO mapreduce.Job: map 100% reduce 95%
18/01/20 12:32:10 INFO mapreduce.Job: map 100% reduce 96%
18/01/20 12:32:28 INFO mapreduce.Job: map 100% reduce 97%
18/01/20 12:32:57 INFO mapreduce.Job: map 100% reduce 98%
18/01/20 12:33:28 INFO mapreduce.Job: map 100% reduce 99%
18/01/20 12:34:43 INFO mapreduce.Job: map 100% reduce 100%
18/01/20 12:38:02 INFO mapreduce.Job: Job job_1516347580021_0001 completed successfully
18/01/20 12:38:02 INFO mapreduce.Job: Counters: 52
File System Counters
FILE: Number of bytes read=87576726096
FILE: Number of bytes written=142193600747
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=83582905128
HDFS: Number of bytes written=166475667426
HDFS: Number of read operations=5468
HDFS: Number of large read operations=0
HDFS: Number of write operations=39
Job Counters
Failed map tasks=6
Launched map tasks=1086
Launched reduce tasks=10
Other local map tasks=6
Data-local map tasks=465
Rack-local map tasks=615
Total time spent by all maps in occupied slots (ms)=82454392
Total time spent by all reduces in occupied slots (ms)=47463944
Total time spent by all map tasks (ms)=10306799
Total time spent by all reduce tasks (ms)=5932993
Total vcore-seconds taken by all map tasks=10306799
Total vcore-seconds taken by all reduce tasks=5932993
Total megabyte-seconds taken by all map tasks=84433297408
Total megabyte-seconds taken by all reduce tasks=48603078656
Map-Reduce Framework
Map input records=568152966
Map output records=568152966
Map output bytes=228099087448
Map output materialized bytes=54476960272
Input split bytes=186120
Combine input records=0
Combine output records=0
Reduce input groups=292435364
Reduce shuffle bytes=54476960272
Reduce input records=568152966
Reduce output records=2339482912
Spilled Records=1513624168
Shuffled Maps =10800
Failed Shuffles=0
Merged Map outputs=10800
GC time elapsed (ms)=794607
CPU time spent (ms)=21363440
Physical memory (bytes) snapshot=3038556569600
Virtual memory (bytes) snapshot=9401710268416
Total committed heap usage (bytes)=3512994889728
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=83582349648
File Output Format Counters
Bytes Written=166475667426
18/01/20 12:38:02 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x71f30c76 connecting to ZooKeeper ensemble=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180
18/01/20 12:38:02 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180 sessionTimeout=90000 watcher=hconnection-0x71f30c760x0, quorum=node003:4180,node002:4180,node001:4180,master:4180,node009:4180,node008:4180,node007:4180,node010:4180,node006:4180,node005:4180,node004:4180, baseZNode=/hbase
18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Opening socket connection to server node004/192.168.1.38:4180. Will not attempt to authenticate using SASL (unknown error)
18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Socket connection established to node004/192.168.1.38:4180, initiating session
18/01/20 12:38:02 INFO zookeeper.ClientCnxn: Session establishment complete on server node004/192.168.1.38:4180, sessionid = 0x26001af8d8190002, negotiated timeout = 40000
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://192.168.1.31:9000/test_demo/result/test/_SUCCESS
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/28e47c53edaf4616a3dfc349d0f0e02a with size: 10931823633 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/5e6501ccb7554b82a2e93024d61dbe0e with size: 10931820982 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/6245730468534f85a428ef7fb7acd499 with size: 10931829083 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/a3b7225320b24e838559d5a5772bdd87 with size: 10931823391 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/a8306bc4ef3941f5bd131d47f0b1c2c3 with size: 10931822321 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/c163c568b1c24d88ac0ed7599b81ecba with size: 10931824861 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/d069683ce064411793640f2a0ec6ca98 with size: 10931822990 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/e5df664e18c54da7b84370b72506923b with size: 10931821709 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 WARN mapreduce.LoadIncrementalHFiles: Trying to bulk load hfile hdfs://192.168.1.31:9000/test_demo/result/test/info/fba38b4d0bd34f6782b844b288780e7b with size: 10931826385 bytes can be problematic as it may lead to oversplitting.
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:02 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/11e458c2c1f0465
4ae1783ec4e6576e8 first=459096918168596876155 last=4999999888024945828
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load h
file=hdfs://192.168.1.31:9000/test_demo/result/test/info/78067c90799149db
b4a423ef556a4272 first=559078464243536377945 last=5999999888024945828
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/0049f16fd57b482
aa2e68ebe21a0cb72 first=15907887724999982915 last=19999999217611496331
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/28e47c53edaf461
6a3dfc349d0f0e02a first=80100000359202982424 last=859088818898462383266
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/6245730468534f8
5a428ef7fb7acd499 first=401000000531957283573 last=459096917941294955954
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/18068da4a3f5469a804eee9f6921617a first=959083192452571451003 last=99999998239977206078
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/5e6501ccb7554b8
2a2e93024d61dbe0e first=30100000359202982424 last=359081166786305137185
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load
hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/0cff66c092004d4
88db32c3bf549a1d1 first=0100000359202982424 last=0999998239977206078
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/c163c568b1c24d8
8ac0ed7599b81ecba first=10100000359202982424 last=15907887393454423668
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/e5df664e18c54da7b84370b72506923b first=501000000531957283573 last=559078458337340744586
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/88df957d66e84b758583c47c9e6aec95 first=25908421410455709356 last=29999998239977206078
18/01/20 12:38:03 INFO hfile.CacheConfig: CacheConfig:disabled
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/a8306bc4ef3941f5bd131d47f0b1c2c3 first=60100000359202982424 last=659079145929173333600
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/d28c5e918b784127a7faa8afee8b364d first=359081168652388606128 last=39999999217611496331
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/a3b7225320b24e838559d5a5772bdd87 first=701000000531957283573 last=759089489615157841144
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/d069683ce064411793640f2a0ec6ca98 first=20100000359202982424 last=25908421377193754247
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/7b148f47400d49d8aefd92b06a530dc5 first=659079146670017258500 last=69999999217611496331
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/e93f91b50972491d8c600a6bd115bab3 first=859088819882023983305 last=89999998239977206078
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/f73df3c94e9f4c659965dc11f66ddd7b first=759089490241357286269 last=7999999888024945828
18/01/20 12:38:03 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://192.168.1.31:9000/test_demo/result/test/info/fba38b4d0bd34f6782b844b288780e7b first=901000000531957283573 last=959083189020333412493
Job execution time (ms): 1256053
18/01/20 12:38:03 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x38001af93e730001
18/01/20 12:38:03 INFO zookeeper.ZooKeeper: Session: 0x38001af93e730001 closed
18/01/20 12:38:03 INFO zookeeper.ClientCnxn: EventThread shut down
Querying the data in HBase afterwards:
Current count: 100983000, row: 405415804076494331733
Current count: 100984000, row: 405418669611868961647
Current count: 100985000, row: 40542152486397650152
Current count: 100986000, row: 405424331577144238851
Current count: 100987000, row: 405427092734454272384
Current count: 100988000, row: 405429767351893163972
Current count: 100989000, row: 40543249871570790691
Current count: 100990000, row: 405435366049740236059
Even this incomplete count already shows more than 100 million rows. With this setup, HBase bulk loading meets the requirements in both throughput and data volume.
There are further HBase cluster parameters that can squeeze out more throughput, along with better pre-splitting and rowkey design, all of which have to be tuned in practice.
In the end, practice is the only test of truth!