MapReduce jobs usually write to HBase through HTableOutputFormat, building Put objects in the reducer and writing them straight into HBase. With large data volumes this is inefficient (HBase blocks the writes while it performs heavy IO for frequent flushes, splits, and compactions) and it also affects the stability of the HBase nodes (long GC pauses slow responses, nodes time out and drop out of the cluster, and a chain of follow-on failures can occur). HBase also supports a bulk load path: because HBase stores its data on HDFS in a specific file format, you can generate persistent HFile-format files directly on HDFS and then move them into the right place, which is how huge volumes of data can be loaded quickly. Done together with MapReduce it is efficient and convenient, does not consume region resources or add load to them, greatly improves write throughput for large data sets, and lowers the write pressure on the HBase nodes.
Generating HFiles first and then BulkLoading them into HBase, instead of calling HTableOutputFormat directly, has the following advantages:
(1) It removes the insert pressure on the HBase cluster.
(2) It speeds up the job and shortens its run time.
The bulk load approach needs two jobs working together:
(1) The first job runs the original processing logic as before, but instead of writing its results to HBase through HTableOutputFormat, it writes them to an intermediate directory on HDFS (e.g. middata).
(2) The second job takes the first job's output (middata) as input and formats it into HBase's underlying storage files, HFiles.
(3) BulkLoad is then called to import the HFiles generated by the second job into the corresponding HBase table.
The corresponding sample code is given below:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class GeneratePutHFileAndBulkLoadToHBase {
public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
private Text wordText=new Text();
private IntWritable one=new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
String line=value.toString();
String[] wordArray=line.split(" ");
for(String word:wordArray)
{
wordText.set(word);
context.write(wordText, one);
}
}
}
public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result=new IntWritable();
protected void reduce(Text key, Iterable<IntWritable> valueList,
Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
int sum=0;
for(IntWritable value:valueList)
{
sum+=value.get();
}
result.set(sum);
context.write(key, result);
}
}
public static class ConvertWordCountOutToHFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put>
{
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
String wordCountStr=value.toString();
String[] wordCountArray=wordCountStr.split("\t");
String word=wordCountArray[0];
int count=Integer.valueOf(wordCountArray[1]);
// Build the HBase row key
byte[] rowKey=Bytes.toBytes(word);
ImmutableBytesWritable rowKeyWritable=new ImmutableBytesWritable(rowKey);
byte[] family=Bytes.toBytes("cf");
byte[] qualifier=Bytes.toBytes("count");
byte[] hbaseValue=Bytes.toBytes(count);
// Put can carry multiple columns under a column family; for a single column, the KeyValue form can be used instead
// KeyValue keyValue = new KeyValue(rowKey, family, qualifier, hbaseValue);
Put put=new Put(rowKey);
put.add(family, qualifier, hbaseValue);
context.write(rowKeyWritable, put);
}
}
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
Configuration hadoopConfiguration=new Configuration();
String[] dfsArgs = new GenericOptionsParser(hadoopConfiguration, args).getRemainingArgs();
// The first job is a plain MapReduce job that writes to the given output directory
Job job=new Job(hadoopConfiguration, "wordCountJob");
job.setJarByClass(GeneratePutHFileAndBulkLoadToHBase.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(dfsArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(dfsArgs[1]));
// Submit the first job
int wordCountJobResult=job.waitForCompletion(true)?0:1;
// The second job takes the first job's output as input; only a Mapper is needed, which parses that output and converts it into the key/value form HBase expects
Job convertWordCountJobOutputToHFileJob=new Job(hadoopConfiguration, "wordCount_bulkload");
convertWordCountJobOutputToHFileJob.setJarByClass(GeneratePutHFileAndBulkLoadToHBase.class);
convertWordCountJobOutputToHFileJob.setMapperClass(ConvertWordCountOutToHFileMapper.class);
// No ReducerClass needs to be set; the framework picks KeyValueSortReducer or PutSortReducer based on the MapOutputValueClass
//convertWordCountJobOutputToHFileJob.setReducerClass(KeyValueSortReducer.class);
convertWordCountJobOutputToHFileJob.setMapOutputKeyClass(ImmutableBytesWritable.class);
convertWordCountJobOutputToHFileJob.setMapOutputValueClass(Put.class);
// Use the first job's output as the second job's input
FileInputFormat.addInputPath(convertWordCountJobOutputToHFileJob, new Path(dfsArgs[1]));
FileOutputFormat.setOutputPath(convertWordCountJobOutputToHFileJob, new Path(dfsArgs[2]));
// Create the HBase configuration
Configuration hbaseConfiguration=HBaseConfiguration.create();
// Create the target table object
HTable wordCountTable =new HTable(hbaseConfiguration, "word_count");
HFileOutputFormat.configureIncrementalLoad(convertWordCountJobOutputToHFileJob,wordCountTable);
// Submit the second job
int convertWordCountJobOutputToHFileJobResult=convertWordCountJobOutputToHFileJob.waitForCompletion(true)?0:1;
// After the second job finishes, bulk-load the MapReduce output into HBase
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
// First argument: the second job's output directory, i.e. where the HFiles are stored; second argument: the target table
loader.doBulkLoad(new Path(dfsArgs[2]), wordCountTable);
// Finally exit via System.exit
System.exit(convertWordCountJobOutputToHFileJobResult);
}
}
For example, suppose the raw input data directory is /rawdata/test/wordcount/20131212,
the intermediate results are saved under /middata/test/wordcount/20131212,
and the generated HFiles are saved under /resultdata/test/wordcount/20131212.
The jobs above are then run as: hadoop jar test.jar /rawdata/test/wordcount/20131212 /middata/test/wordcount/20131212 /resultdata/test/wordcount/20131212
(1) Among all loading approaches, the HFile route is the fastest, with one precondition: the data is being imported for the first time and the table is empty. If the table already contains data, importing HFiles into it can trigger split operations.
(2) In the final output, whether it comes from the map or the reduce side, the key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>. Otherwise you get an error like this:
java.lang.IllegalArgumentException: Can't read partitions file
...
Caused by: java.io.IOException: wrong key class: org.apache.hadoop.io.*** is not class org.apache.hadoop.hbase.io.ImmutableBytesWritable
(3) In the final output, when the value type is KeyValue or Put, the corresponding sorter is KeyValueSortReducer or PutSortReducer respectively. This sort reducer does not need to be specified explicitly, because the source code already makes the choice:
if (KeyValue.class.equals(job.getMapOutputValueClass())) {
job.setReducerClass(KeyValueSortReducer.class);
} else if (Put.class.equals(job.getMapOutputValueClass())) {
job.setReducerClass(PutSortReducer.class);
} else {
LOG.warn("Unknown map output value type:" + job.getMapOutputValueClass());
}
(4) The MR example uses job.setOutputFormatClass(HFileOutputFormat.class); HFileOutputFormat can only organize a single column family into HFiles per run, so multiple column families require multiple jobs, although newer HBase versions have reportedly removed this restriction (the original author wrote that four years ago; I have not looked into how far the latest releases have come).
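For reference, here is a minimal sketch of how the same step is wired up with the newer HFileOutputFormat2 API; it assumes an HBase 1.x-style client and reuses the word_count table and the second job from the example above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
// ... inside the driver, after the second job has been created:
Configuration hbaseConf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(hbaseConf)) {
    TableName tableName = TableName.valueOf("word_count");
    Table table = connection.getTable(tableName);
    RegionLocator regionLocator = connection.getRegionLocator(tableName);
    // configures the partitioner, sort reducer and output format from the table's region boundaries
    HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
}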
(5) In the MR example, the generated HFiles end up on HDFS, and the subdirectories under the output path are named after the column families. Loading the HFiles into HBase effectively moves them into HBase's regions, so afterwards the column-family subdirectories are left empty.
(6) The final reduce stage does not call setNumReduceTasks because the number of reduce tasks is configured automatically by the framework from the number of regions.
(7) In the configuration below, it does not really matter whether the commented-out lines are written or not: as the source code shows, the configureIncrementalLoad method already applies all the fixed settings, and only the non-fixed parts need to be configured by hand.
public class HFileOutput {
// Job configuration
public static Job configureJob(Configuration conf) throws IOException {
Job job = new Job(conf, "countUnite1");
job.setJarByClass(HFileOutput.class);
//job.setNumReduceTasks(2);
//job.setOutputKeyClass(ImmutableBytesWritable.class);
//job.setOutputValueClass(KeyValue.class);
//job.setOutputFormatClass(HFileOutputFormat.class);
Scan scan = new Scan();
scan.setCaching(10);
scan.addFamily(INPUT_FAMILY);
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
HFileOutputMapper.class, ImmutableBytesWritable.class, LongWritable.class, job);
// If no reducer is defined here, the framework automatically picks KeyValueSortReducer or PutSortReducer
job.setReducerClass(HFileOutputRedcuer.class);
//job.setOutputFormatClass(HFileOutputFormat.class);
HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, outputTable));
HFileOutputFormat.setOutputPath(job, new Path());
//FileOutputFormat.setOutputPath(job, new Path()); // equivalent to the line above
return job;
}
public static class HFileOutputMapper extends
TableMapper<ImmutableBytesWritable, LongWritable> {
public void map(ImmutableBytesWritable key, Result values,
Context context) throws IOException, InterruptedException {
// mapper logic goes here; e.g. emit the row key with a count of 1
context.write(key, new LongWritable(1));
}
}
public static class HFileOutputRedcuer extends
Reducer<ImmutableBytesWritable, LongWritable, ImmutableBytesWritable, KeyValue> {
public void reduce(ImmutableBytesWritable key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
// reducer logic goes here; e.g. sum the values and emit one KeyValue per row
// (OUTPUT_FAMILY and the qualifier are placeholders)
long count = 0;
for (LongWritable v : values) count += v.get();
KeyValue kv = new KeyValue(key.get(), OUTPUT_FAMILY, Bytes.toBytes("count"),
Bytes.toBytes(count));
context.write(key, kv);
}
}
}
The content above is adapted from the article: HBase 写优化之 BulkLoad 实现数据快速入库
[hadoop@h71 ~]$ vi he.txt
hello world
hello hadoop
hello hive
[hadoop@h71 ~]$ hadoop fs -mkdir /rawdata
[hadoop@h71 ~]$ hadoop fs -put he.txt /rawdata
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/javac GeneratePutHFileAndBulkLoadToHBase.java
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/jar cvf xx.jar GeneratePutHFileAndBulkLoadToHBase*class
[hadoop@h71 hui]$ hadoop jar xx.jar GeneratePutHFileAndBulkLoadToHBase /rawdata /middata /resultdata
This fails with:
Exception in thread "main" java.lang.IllegalArgumentException: No regions passed
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.writePartitions(HFileOutputFormat2.java:315)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configurePartitioner(HFileOutputFormat2.java:573)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:421)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2.configureIncrementalLoad(HFileOutputFormat2.java:386)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat.configureIncrementalLoad(HFileOutputFormat.java:90)
at TestHFileToHBase.main(TestHFileToHBase.java:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
It turns out the table has to be created first, and then the previous command run again:
hbase(main):031:0> create 'word_count','cf'
[hadoop@h71 ~]$ hadoop fs -lsr /middata
-rw-r--r-- 2 hadoop supergroup 0 2017-03-20 10:36 /middata/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 32 2017-03-20 10:36 /middata/part-r-00000
[hadoop@h71 ~]$ hadoop fs -cat /middata/part-r-00000
hadoop 1
hello 3
hive 1
world 1
[hadoop@h71 ~]$ hadoop fs -lsr /resultdata
-rw-r--r-- 2 hadoop supergroup 0 2017-03-20 10:36 /resultdata/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2017-03-20 10:36 /resultdata/cf
# The cf directory here is empty because bulkload moves the HFile-format files under the given directory into HBase. While the MR job is generating them, the cf directory does contain HFile-format files; they cannot be viewed with hadoop fs -cat (you would only see garbled bytes).
hbase(main):012:0> scan 'word_count'
ROW COLUMN+CELL
hadoop column=cf:count, timestamp=1489973703632, value=\x00\x00\x00\x01
hello column=cf:count, timestamp=1489973703632, value=\x00\x00\x00\x03
hive column=cf:count, timestamp=1489973703632, value=\x00\x00\x00\x01
world column=cf:count, timestamp=1489973703632, value=\x00\x00\x00\x01
# The inserted values turned out to be raw bytes, so I changed put.add(family, qualifier, hbaseValue); in the code to put.add(family, qualifier, Bytes.toBytes("5"));
# Running the same commands again then gives:
hbase(main):032:0> scan 'word_count'
ROW COLUMN+CELL
hadoop column=cf:count, timestamp=1489977438537, value=5
hello column=cf:count, timestamp=1489977438537, value=5
hive column=cf:count, timestamp=1489977438537, value=5
world column=cf:count, timestamp=1489977438537, value=5
# Later I changed int count=Integer.valueOf(wordCountArray[1]); to String count=wordCountArray[1];
# Running the same commands again then gives:
hbase(main):007:0> scan 'word_count'
ROW COLUMN+CELL
hadoop column=cf:count, timestamp=1489748145527, value=1
hello column=cf:count, timestamp=1489748145527, value=3
hive column=cf:count, timestamp=1489748145527, value=1
world column=cf:count, timestamp=1489748145527, value=1
Note: I do not see why the original author insisted on an int; once imported into HBase it just shows up as \x00\x00\x00\x01. I later found an article that explains this: 【hbase】——bulk load导入数据时value=\x00\x00\x00\x01问题解析
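The \x00\x00\x00\x01 shown by scan is simply the 4-byte big-endian encoding produced by Bytes.toBytes(int); the value is not corrupted, it just has to be decoded with the matching method. A small standalone sketch to illustrate (the demo class is hypothetical):
import org.apache.hadoop.hbase.util.Bytes;
public class BytesEncodingDemo {
    public static void main(String[] args) {
        byte[] asInt = Bytes.toBytes(1);      // 4 bytes, printed by scan as \x00\x00\x00\x01
        byte[] asString = Bytes.toBytes("1"); // 1 byte, the ASCII character '1'
        System.out.println(Bytes.toStringBinary(asInt));    // \x00\x00\x00\x01
        System.out.println(Bytes.toStringBinary(asString)); // 1
        // to read the int form back, decode it with the matching method
        System.out.println(Bytes.toInt(asInt));             // 1
    }
}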
In the final output, when the value type is KeyValue or Put, the corresponding sorter is KeyValueSortReducer or PutSortReducer respectively, and this sort reducer does not need to be specified because the source code already makes the choice. So I wanted to switch from Put to KeyValue for the HFile output and changed the ConvertWordCountOutToHFileMapper class to:
public static class ConvertWordCountOutToHFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue>
{
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
String wordCountStr=value.toString();
String[] wordCountArray=wordCountStr.split("\t");
String word=wordCountArray[0];
String count=wordCountArray[1];
// Build the HBase row key
byte[] rowKey=Bytes.toBytes(word);
ImmutableBytesWritable rowKeyWritable=new ImmutableBytesWritable(rowKey);
byte[] family=Bytes.toBytes("cf");
byte[] qualifier=Bytes.toBytes("count");
byte[] hbaseValue=Bytes.toBytes(count);
// Put can carry multiple columns under a column family; for a single column, the KeyValue form can be used instead
KeyValue keyValue = new KeyValue(rowKey, family, qualifier, hbaseValue);
// Put put=new Put(rowKey);
// put.add(family, qualifier, hbaseValue);
context.write(rowKeyWritable, keyValue);
}
}
Running the command again produces this error:
Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.hbase.client.Put, received org.apache.hadoop.hbase.KeyValue
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1078)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at GeneratePutHFileAndBulkLoadToHBase$ConvertWordCountOutToHFileMapper.map(GeneratePutHFileAndBulkLoadToHBase.java:89)
at GeneratePutHFileAndBulkLoadToHBase$ConvertWordCountOutToHFileMapper.map(GeneratePutHFileAndBulkLoadToHBase.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I then changed convertWordCountJobOutputToHFileJob.setMapOutputValueClass(Put.class);
in the main method to convertWordCountJobOutputToHFileJob.setMapOutputValueClass(KeyValue.class);
and after rerunning it finally worked. So the classes passed to these setters are not arbitrary after all; at first I thought you could put anything there...
job.setJarByClass(GeneratePutHFileAndBulkLoadToHBase.class); // the main class in the code
job.setMapperClass(WordCountMapper.class); // the map class of the first MR job
job.setReducerClass(WordCountReducer.class); // the reduce class of the first MR job
job.setOutputKeyClass(Text.class); // a framework class, not something you make up yourself
job.setOutputValueClass(IntWritable.class); // same as above
convertWordCountJobOutputToHFileJob.setJarByClass(GeneratePutHFileAndBulkLoadToHBase.class); // the main class in the code
convertWordCountJobOutputToHFileJob.setMapperClass(ConvertWordCountOutToHFileMapper.class); // the map class of the second MR job
convertWordCountJobOutputToHFileJob.setMapOutputKeyClass(ImmutableBytesWritable.class); // a framework class
For HBase's ImmutableBytesWritable type, printing it directly with System.out gives something like a hex-encoded byte[].
Given an ImmutableBytesWritable aa, the usual approach is to first get the raw bytes with byte[] bb = aa.get()
and then decode them with String cc = Bytes.toString(bb).
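A minimal standalone sketch of that conversion (the demo class is hypothetical):
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
public class ImmutableBytesWritableDemo {
    public static void main(String[] args) {
        ImmutableBytesWritable aa = new ImmutableBytesWritable(Bytes.toBytes("hello"));
        // printing the writable directly only shows the underlying bytes in a hex-like form
        System.out.println(aa);
        // pull out the byte[] and decode it explicitly
        byte[] bb = aa.get();
        String cc = Bytes.toString(bb);
        System.out.println(cc); // hello
    }
}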
[hadoop@h71 hui]$ vi TestHFileToHBase.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TestHFileToHBase {
public static class TestHFileToHBaseMapper extends Mapper {
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] values = value.toString().split(" ", 2);
byte[] row = Bytes.toBytes(values[0]);
ImmutableBytesWritable k = new ImmutableBytesWritable(row);
KeyValue kvProtocol = new KeyValue(row, "PROTOCOLID".getBytes(), "PROTOCOLID".getBytes(), values[1]
.getBytes());
context.write(k, kvProtocol);
}
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "TestHFileToHBase");
job.setJarByClass(TestHFileToHBase.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(KeyValue.class);
job.setMapperClass(TestHFileToHBaseMapper.class);
job.setReducerClass(KeyValueSortReducer.class);
// job.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.HFileOutputFormat.class);
job.setOutputFormatClass(HFileOutputFormat.class);
// job.setNumReduceTasks(4);
// job.setPartitionerClass(org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner.class);
// HBaseAdmin admin = new HBaseAdmin(conf);
HTable table = new HTable(conf, "hua");
HFileOutputFormat.configureIncrementalLoad(job, table);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
[hadoop@h71 ~]$ vi he.txt
hello world
hello hadoop
hello hive
[hadoop@h71 ~]$ hadoop fs -mkdir /rawdata
[hadoop@h71 ~]$ hadoop fs -put he.txt /rawdata
hbase(main):020:0> create 'hua','PROTOCOLID'
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/javac TestHFileToHBase.java
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/jar cvf xx.jar TestHFileToHBase*class
[hadoop@h71 hui]$ hadoop jar xx.jar TestHFileToHBase /rawdata /middata
This fails with:
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.hbase.io.ImmutableBytesWritable, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Solution:
# So I changed
public static class TestHFileToHBaseMapper extends Mapper {
# to
public static class TestHFileToHBaseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
# and then it worked...
[hadoop@h71 ~]$ hadoop fs -lsr /middata
drwxr-xr-x - hadoop supergroup 0 2017-03-17 20:50 /middata/PROTOCOLID
-rw-r--r-- 2 hadoop supergroup 1142 2017-03-17 20:50 /middata/PROTOCOLID/65493afaefac43528c554d0b8056f1e3
-rw-r--r-- 2 hadoop supergroup 0 2017-03-17 20:50 /middata/_SUCCESS
(/middata/PROTOCOLID/65493afaefac43528c554d0b8056f1e3 is an HFile-format file; it cannot be viewed with hadoop fs -cat, which would only show garbled output)
The original article's code had quite a few problems; after fixing them it becomes:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.util.GenericOptionsParser;
public class TestLoadIncrementalHFileToHBase {
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
String[] dfsArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
HTable table = new HTable(conf,"hua");
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path(dfsArgs[0]), table);
}
}
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/javac TestLoadIncrementalHFileToHBase.java
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/java TestLoadIncrementalHFileToHBase /middata/PROTOCOLID
Running this and then checking the hua table in the hbase shell showed no data; the path passed to doBulkLoad should be the directory that contains the column-family subdirectory, so the command should instead be:
[hadoop@h71 hui]$ /usr/jdk1.7.0_25/bin/java TestLoadIncrementalHFileToHBase /middata
hbase(main):073:0> scan 'hua'
ROW COLUMN+CELL
hello column=PROTOCOLID:PROTOCOLID, timestamp=1489758507378, value=hive
(Scanning the hua table showed only one row. At first I was puzzled, since he.txt has three lines; then I realized that HBase took the word hello from each of the three lines as the row key, so all three rows ended up with the same row key.)
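If each input line is meant to become its own row, the row key has to be made unique. A hypothetical variation of TestHFileToHBaseMapper (not from the original articles), for example appending the line's byte offset to the first token:
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] values = value.toString().split(" ", 2);
    // append the byte offset of the line so identical words still get distinct row keys,
    // e.g. hello_0, hello_12, hello_25 instead of three identical hello keys
    byte[] row = Bytes.toBytes(values[0] + "_" + key.get());
    ImmutableBytesWritable k = new ImmutableBytesWritable(row);
    KeyValue kvProtocol = new KeyValue(row, "PROTOCOLID".getBytes(), "PROTOCOLID".getBytes(), values[1].getBytes());
    context.write(k, kvProtocol);
}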
The content above is adapted from the article: 生成HFile以及入库到HBase
To keep all the data from being written into a single region and skewing the load on HBase, create a pre-split table on the node where the active HMaster is running:
create 'userprofile_labels', { NAME => "f", BLOCKCACHE => "true", BLOOMFILTER => "ROWCOL", COMPRESSION => 'snappy', IN_MEMORY => 'true' }, { NUMREGIONS => 10, SPLITALGO => 'HexStringSplit' }
Write the data to be synchronized into HFiles, where it is stored as key-value pairs, and then use BulkLoad to write the HFile data into the HBase cluster in bulk. The Scala script is run as follows:
import org.apache.hadoop.fs.{FileSystem, Path}
import
The content above is from the book 《用户画像方法论与工程化解决方案》.
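As a side note, the pre-split table from the shell command above can also be created programmatically. A minimal Java sketch, assuming an HBase 1.x-style client and the same 10-region HexStringSplit layout (the class name is made up for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.RegionSplitter;
public class CreatePreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("userprofile_labels"));
            desc.addFamily(new HColumnDescriptor("f"));
            // HexStringSplit with 10 regions yields 9 split keys
            byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
            admin.createTable(desc, splits);
        }
    }
}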