1. The traditional way is to type the data in manually through the HBase shell. Every write still has to go through the normal HBase client interface and write path, which makes the process slow.
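For illustration, here is a minimal sketch of what such a single write looks like through the Java client API (0.94-era API, matching the code later in this post); the table name "t1", column family "cf" and the cell values are placeholders. A put issued from the shell travels the same client-to-RegionServer path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
public class SinglePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t1");          // placeholder table name
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
        table.put(put);                                  // goes through the RegionServer write path
        table.close();
    }
}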
2. Use MapReduce for batch import. The writes still pass through the whole HMaster / HRegionServer pipeline, which adds to the cluster's resource consumption. For example:
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class BatchImport {

    // Input lines are tab-separated, e.g. "0\t20101223122329..."
    static class BatchImportMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyyMMddHHmmss");
        Text v2 = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            final String[] splited = value.toString().split("\t");
            try {
                // Field 0 is a millisecond timestamp; format it as yyyyMMddHHmmss
                final Date date = new Date(Long.parseLong(splited[0].trim()));
                final String dateFormat = dateformat1.format(date);
                // Row key = field1:formattedDate, followed by the whole original line
                v2.set(splited[1] + ":" + dateFormat + "\t" + value.toString());
                context.write(key, v2);
            } catch (NumberFormatException e) {
                // Count malformed records instead of failing the job
                final Counter counter = context.getCounter("BatchImport", "ErrorFormat");
                counter.increment(1L);
                System.out.println("parse error: " + splited[0] + " " + e.getMessage());
            }
        }
    }

    static class BatchImportReducer extends
            TableReducer<LongWritable, Text, NullWritable> {
        protected void reduce(LongWritable key, java.lang.Iterable<Text> values,
                Context context) throws java.io.IOException, InterruptedException {
            for (Text text : values) {
                final String[] splited = text.toString().split("\t");
                final Put put = new Put(Bytes.toBytes(splited[0]));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("date"), Bytes.toBytes(splited[1]));
                // The remaining fields are written the same way with put.add(...)
                context.write(NullWritable.get(), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration configuration = new Configuration();
        // ZooKeeper quorum
        configuration.set("hbase.zookeeper.quorum", "master");
        // Target HBase table
        configuration.set(TableOutputFormat.OUTPUT_TABLE, "wlan_log");
        // Raise this value so HBase does not time out during the import
        configuration.set("dfs.socket.timeout", "180000");

        final Job job = new Job(configuration, "HBaseBatchImport");

        job.setMapperClass(BatchImportMapper.class);
        job.setReducerClass(BatchImportReducer.class);
        // Only the map output types are set; the reduce output types are not needed
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        // No output path is set; instead the output format writes into the HBase table
        job.setOutputFormatClass(TableOutputFormat.class);

        FileInputFormat.setInputPaths(job, "hdfs://222.27.174.66:9000/input2");

        job.waitForCompletion(true);
    }
}
3. Skip the HBase write path altogether: generate HFiles directly on HDFS, then load the HFiles into the corresponding HRegionServers.
The conversion from an HDFS file to HFiles can be done from the command line with the importtsv tool.
First create a table: create 'datatsv', 'd'
Then create a file named inputfile with tab-separated fields:
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
hadoop fs -put inputfile /user/input/inputfile    # upload the file to HDFS
hadoop jar hbase-0.94.7.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=/user/output/outputfile datatsv /user/input/inputfile
This produces HFiles under /user/output/outputfile.
Load the HFiles into HBase:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/output/outputfile datatsv
1. Loading a huge volume of data into HBase in one go through the normal write path is not only slow, it also ties up a lot of Region resources. A much more efficient and convenient method is "Bulk Loading", i.e. the HFileOutputFormat class provided by HBase.
2. It relies on the fact that HBase stores its data on HDFS in a specific format: the job directly generates files in that storage format (HFiles) and then moves them to the right place, which completes the bulk import. Done with MapReduce, it is efficient and convenient, and it does not consume Region resources or add load to the RegionServers.
One limitation: it is only suitable for the initial data import, i.e. the table is empty, or the table contains no data before each load. The two programs below, HFileGenerator and HFileLoader, implement this approach.
package zl.hbase.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import zl.hbase.util.ConnectionUtil;

public class HFileGenerator {

    public static class HFileMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is: rowkey,family,qualifier,value
            String line = value.toString();
            String[] items = line.split(",", -1);
            ImmutableBytesWritable rowkey = new ImmutableBytesWritable(
                    items[0].getBytes());
            KeyValue kv = new KeyValue(Bytes.toBytes(items[0]),
                    Bytes.toBytes(items[1]), Bytes.toBytes(items[2]),
                    System.currentTimeMillis(), Bytes.toBytes(items[3]));
            if (null != kv) {
                context.write(rowkey, kv);
            }
        }
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] dfsArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = new Job(conf, "HFile bulk load test");
        job.setJarByClass(HFileGenerator.class);

        job.setMapperClass(HFileMapper.class);
        job.setReducerClass(KeyValueSortReducer.class);

        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        // The mapper emits KeyValue, so the map output value class must be KeyValue
        // (not Text); configureIncrementalLoad also uses it to pick the sorting reducer.
        job.setMapOutputValueClass(KeyValue.class);
        job.setPartitionerClass(SimpleTotalOrderPartitioner.class);

        // args[0] = HDFS input directory, args[1] = HFile output directory
        FileInputFormat.addInputPath(job, new Path(dfsArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(dfsArgs[1]));

        // Configures the output format, total-order partitioning and compression
        // according to the target table's region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job,
                ConnectionUtil.getTable());

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Notes on the HFile-generating program:
①. The final output key/value types, for both map and reduce, must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.
②. When the output value type is KeyValue the matching sorter is KeyValueSortReducer; when it is Put the matching sorter is PutSortReducer (a sketch of the Put variant follows this list).
③. The MR example uses job.setOutputFormatClass(HFileOutputFormat.class); HFileOutputFormat can only organize a single column family into HFiles in one pass.
④. In the MR example, HFileOutputFormat.configureIncrementalLoad(job, table); configures the job automatically. SimpleTotalOrderPartitioner first sorts the keys globally and then assigns them to the reducers, so the min/max key ranges handled by different reducers never overlap. This matters because when the files are loaded into HBase, the keys inside each Region must be in strict order.
⑤. The generated HFiles are stored on HDFS, with one subdirectory per column family under the output path. Loading the HFiles into HBase amounts to moving them into HBase's Regions, after which the column-family subdirectories under the output path are left empty.
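As a complement to ②, here is a minimal sketch of the Put-based variant, under the assumption that the input uses the same comma-separated rowkey,family,qualifier,value layout as HFileGenerator above (the class name PutHFileMapper is made up for illustration):
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <ImmutableBytesWritable, Put> instead of <ImmutableBytesWritable, KeyValue>.
public class PutHFileMapper extends
        Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: rowkey,family,qualifier,value
        String[] items = value.toString().split(",", -1);
        ImmutableBytesWritable rowkey = new ImmutableBytesWritable(Bytes.toBytes(items[0]));
        Put put = new Put(Bytes.toBytes(items[0]));
        put.add(Bytes.toBytes(items[1]), Bytes.toBytes(items[2]), Bytes.toBytes(items[3]));
        context.write(rowkey, put);
    }
}
In the driver, set job.setMapOutputValueClass(Put.class) before calling HFileOutputFormat.configureIncrementalLoad(job, table); configureIncrementalLoad then selects PutSortReducer automatically. The second program below, HFileLoader, loads the generated HFiles into the table.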
package zl.hbase.bulkload;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.GenericOptionsParser;

import zl.hbase.util.ConnectionUtil;

public class HFileLoader {

    public static void main(String[] args) throws Exception {
        String[] dfsArgs = new GenericOptionsParser(
                ConnectionUtil.getConfiguration(), args).getRemainingArgs();
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
                ConnectionUtil.getConfiguration());
        loader.doBulkLoad(new Path(dfsArgs[0]), ConnectionUtil.getTable());
    }
}
The generated HFiles are loaded into HBase through the doBulkLoad method of LoadIncrementalHFiles. The ConnectionUtil helper referenced by both programs above is:
package zl.hbase.util;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ConnectionUtil {

    // Name of the target table that receives the bulk-loaded HFiles
    private static final String TABLE_NAME = "yy";

    public static HTable getTable() {
        HTable hTable = null;
        try {
            hTable = new HTable(getConfiguration(), TABLE_NAME);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        return hTable;
    }

    public static Configuration getConfiguration() {
        Configuration conf = HBaseConfiguration.create();
        return conf;
    }
}