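This post shows how to read data from HBase and write it to files in HDFS. Reading the data is straightforward: we take the wordcount table produced in the previous post, [HBase Basics Tutorial] 6. HBase: Reading MapReduce Data and Writing It to HBase, as the input source. A Mapper reads each row of the wordcount table and emits it as a <key, value> pair, and the Reducer writes those pairs out unchanged.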
Hardware environment: 4 servers running CentOS 6.5 (one Master node and three Slave nodes).
Software environment: Java 1.7.0_45, Eclipse Juno Service Release 2, hadoop-1.2.1, hbase-0.94.20.
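1) Input data source:
The previous post, [HBase Basics Tutorial] 6. HBase: Reading MapReduce Data and Writing It to HBase, wrote MapReduce output into the HBase table wordcount. In this post, the wordcount table is the input data source.
2) Output target:
Files in the HDFS distributed file system.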
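The WordCountHbaseReaderMapper class extends the abstract class TableMapper<Text, Text>, which exists specifically to connect the Map phase of a MapReduce job to an HBase table. In map(ImmutableBytesWritable key, Result value, Context context), the first parameter key is the row key of the HBase row, and the second parameter value is the set of cells stored under that row key. The core of map is to iterate over the cells in value, combine them into a single record, and emit it as a <key, value> pair through context.write(key, value).
For the full source, see: WordCountHbaseReader\src\com\zonesion\hbase\WordCountHbaseReader.java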
    public static class WordCountHbaseReaderMapper extends
            TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            // Walk every qualifier/value pair in the "content" column family.
            for (Entry<byte[], byte[]> entry : value.getFamilyMap("content".getBytes()).entrySet()) {
                if (entry.getValue() != null) {
                    // Convert the byte arrays to Strings: "qualifier:value".
                    sb.append(new String(entry.getKey()));
                    sb.append(":");
                    sb.append(new String(entry.getValue()));
                }
            }
            // Emit one combined record per row, after all cells have been appended.
            context.write(new Text(key.get()), new Text(sb.toString()));
        }
    }
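To see what the Mapper emits without a running cluster, the flattening step can be sketched with the standard library alone. This is only an illustration: FamilyMapFlattener, BYTES_ORDER, flatten and singleColumn are hypothetical names that do not exist in the project, and a TreeMap stands in for the NavigableMap<byte[], byte[]> that Result.getFamilyMap returns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Stand-alone illustration; FamilyMapFlattener is a hypothetical name, not project code. */
final class FamilyMapFlattener {

    /** Orders byte arrays lexicographically, treating each byte as unsigned. */
    static final Comparator<byte[]> BYTES_ORDER = new Comparator<byte[]>() {
        @Override
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int diff = (a[i] & 0xff) - (b[i] & 0xff);
                if (diff != 0) {
                    return diff;
                }
            }
            return a.length - b.length;
        }
    };

    /** Joins every qualifier/value pair of one row as "qualifier:value". */
    static String flatten(NavigableMap<byte[], byte[]> familyMap) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
            if (entry.getValue() == null) {
                continue; // skip cells without a value
            }
            sb.append(new String(entry.getKey(), StandardCharsets.UTF_8));
            sb.append(":");
            sb.append(new String(entry.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Builds a one-column map resembling the "content" family of a wordcount row. */
    static NavigableMap<byte[], byte[]> singleColumn(String qualifier, String value) {
        NavigableMap<byte[], byte[]> map = new TreeMap<byte[], byte[]>(BYTES_ORDER);
        map.put(qualifier.getBytes(StandardCharsets.UTF_8),
                value.getBytes(StandardCharsets.UTF_8));
        return map;
    }
}
```

For a row whose content family holds a single count column with value 2, FamilyMapFlattener.flatten(FamilyMapFlattener.singleColumn("count", "2")) returns count:2. Note that no separator is inserted between columns, so a row with several columns would produce one run-together string.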
WordCountHbaseReaderReduce simply writes out the <key, value> pairs produced by the Map phase without processing them further. For the full source, see: WordCountHbaseReader\src\com\zonesion\hbase\WordCountHbaseReader.java
    public static class WordCountHbaseReaderReduce extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Pass every value through unchanged.
            for (Text val : values) {
                result.set(val);
                context.write(key, result);
            }
        }
    }
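Unlike the WordCount driver, this job does not call job.setMapperClass(). Instead, the Mapper is registered with: TableMapReduceUtil.initTableMapperJob(tablename, scan, WordCountHbaseReaderMapper.class, Text.class, Text.class, job);
This call states that the Map phase of the job reads its input from the HBase table tablename, that the scan object scans the whole table to feed the Map phase, that WordCountHbaseReaderMapper.class runs the Map phase, and that the Map output key and value types are Text.class and Text.class. The last argument is the job object. Note that scan here is the simplest possible Scan object and reads the table as-is. Scan accepts further configuration, which is left out to keep the example simple; readers can experiment with it on their own.
For the full source, see: WordCountHbaseReader\src\com\zonesion\hbase\WordCountHbaseReader.java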
    public static void main(String[] args) throws Exception {
        String tablename = "wordcount";
        Configuration conf = HBaseConfiguration.create();
        // Must match hbase.zookeeper.quorum in hbase-site.xml.
        conf.set("hbase.zookeeper.quorum", "Master");
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 1) {
            System.err.println("Usage: WordCountHbaseReader <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "WordCountHbaseReader");
        job.setJarByClass(WordCountHbaseReader.class);
        // Set the output path for the job's results.
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[0]));
        job.setReducerClass(WordCountHbaseReaderReduce.class);
        // A plain Scan reads the whole table.
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(tablename, scan, WordCountHbaseReaderMapper.class, Text.class, Text.class, job);
        // Run the job; exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
[hadoop@K-Master ~]$ start-dfs.sh #start Hadoop HDFS
[hadoop@K-Master ~]$ start-mapred.sh #start the Hadoop MapReduce service
[hadoop@K-Master ~]$ start-hbase.sh #start HBase
[hadoop@K-Master ~]$ jps #list the running processes
22003 HMaster
10611 SecondaryNameNode
22226 Jps
21938 HQuorumPeer
10709 JobTracker
22154 HRegionServer
20277 Main
10432 NameNode
#Set up the workspace
[hadoop@K-Master ~]$ mkdir -p /usr/hadoop/workspace/Hbase
#Deploy the source code
Copy the WordCountHbaseReader folder to /usr/hadoop/workspace/Hbase/.
… You can also download WordCountHbaseReader directly.
a) Check the hbase.zookeeper.quorum property in the core HBase configuration file hbase-site.xml
See "3. Deploy and run, 3) Modify the configuration file" in [HBase Basics Tutorial] 5. HBase API Access for how to look up the hbase.zookeeper.quorum property in hbase-site.xml.
b) Edit the project's properties file WordCountHbaseReader/src/config.properties
Set the hbase.zookeeper.quorum property in WordCountHbaseReader/src/config.properties to the value found in the previous step, so that config.properties and hbase-site.xml agree on hbase.zookeeper.quorum.
#Change to the project directory
[hadoop@K-Master ~]$ cd /usr/hadoop/workspace/Hbase/WordCountHbaseReader
#Edit the property value
[hadoop@K-Master WordCountHbaseReader]$ vim src/config.properties
hbase.zookeeper.quorum=K-Master
#Copy src/config.properties into the bin/ directory
[hadoop@K-Master WordCountHbaseReader]$ cp src/config.properties bin/
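Copying config.properties into bin/ puts it on the classpath of the packaged jar, where the job can read hbase.zookeeper.quorum at run time. The project's own helper for this is not shown in this post, so the following is only a minimal sketch of the idea using java.util.Properties. QuorumConfig, parseQuorum and fromClasspath are hypothetical names, and the fallback argument is an assumption added for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not the project's own class. */
final class QuorumConfig {
    static final String QUORUM_KEY = "hbase.zookeeper.quorum";

    /** Parses properties text and returns the quorum value, or the fallback if absent. */
    static String parseQuorum(String propertiesText, String fallback) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not parse properties text", e);
        }
        String value = props.getProperty(QUORUM_KEY, fallback);
        return value == null ? null : value.trim();
    }

    /** Reads the quorum from a classpath resource such as "/config.properties". */
    static String fromClasspath(String resource, String fallback) {
        try (InputStream in = QuorumConfig.class.getResourceAsStream(resource)) {
            if (in == null) {
                return fallback; // resource is not on the classpath
            }
            Properties props = new Properties();
            props.load(in);
            String value = props.getProperty(QUORUM_KEY, fallback);
            return value == null ? null : value.trim();
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + resource, e);
        }
    }
}
```

With the file content shown above, QuorumConfig.parseQuorum("hbase.zookeeper.quorum=K-Master", "localhost") returns K-Master, and a driver could pass that value to conf.set("hbase.zookeeper.quorum", ...) in place of a hard-coded host name.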
#Change to the project directory
[hadoop@K-Master ~]$ cd /usr/hadoop/workspace/Hbase/WordCountHbaseReader
#Compile
[hadoop@K-Master WordCountHbaseReader]$ javac -classpath /usr/hadoop/hadoop-core-1.2.1.jar:/usr/hadoop/lib/commons-cli-1.2.jar:lib/zookeeper-3.4.5.jar:lib/hbase-0.94.20.jar -d bin/ src/com/zonesion/hbase/WordCountHbaseReader.java
#List the compiled class files
[hadoop@K-Master WordCountHbaseReader]$ ls bin/com/zonesion/hbase/ -la
total 20
drwxrwxr-x 2 hadoop hadoop 4096 Dec 29 10:36 .
drwxrwxr-x 3 hadoop hadoop 4096 Dec 29 10:36 ..
-rw-rw-r-- 1 hadoop hadoop 2166 Dec 29 14:31 WordCountHbaseReader.class
-rw-rw-r-- 1 hadoop hadoop 2460 Dec 29 14:31 WordCountHbaseReader$WordCountHbaseReaderMapper.class
-rw-rw-r-- 1 hadoop hadoop 1738 Dec 29 14:31 WordCountHbaseReader$WordCountHbaseReaderReduce.class
#Copy the lib folder into the bin folder
[hadoop@K-Master WordCountHbaseReader]$ cp -r lib/ bin/
#Package the jar file
[hadoop@K-Master WordCountHbaseReader]$ jar -cvf WordCountHbaseReader.jar -C bin/ .
added manifest
adding: lib/(in = 0) (out= 0)(stored 0%)
adding: lib/zookeeper-3.4.5.jar(in = 779974) (out= 721150)(deflated 7%)
adding: lib/guava-11.0.2.jar(in = 1648200) (out= 1465342)(deflated 11%)
adding: lib/protobuf-java-2.4.0a.jar(in = 449818) (out= 420864)(deflated 6%)
adding: lib/hbase-0.94.20.jar(in = 5475284) (out= 5038635)(deflated 7%)
adding: com/(in = 0) (out= 0)(stored 0%)
adding: com/zonesion/(in = 0) (out= 0)(stored 0%)
adding: com/zonesion/hbase/(in = 0) (out= 0)(stored 0%)
adding: com/zonesion/hbase/PropertiesHelper.class(in = 4480) (out= 1926)(deflated 57%)
adding: com/zonesion/hbase/WordCountHbaseReader.class(in = 2702) (out= 1226)(deflated 54%)
adding: com/zonesion/hbase/WordCountHbaseReader$WordCountHbaseReaderMapper.class(in = 3250) (out= 1275)(deflated 60%)
adding: com/zonesion/hbase/WordCountHbaseReader$WordCountHbaseReaderReduce.class(in = 2308) (out= 872)(deflated 62%)
adding: config.properties(in = 32) (out= 34)(deflated -6%)
[hadoop@K-Master WordCountHbase]$ hadoop jar WordCountHbaseReader.jar WordCountHbaseReader /user/hadoop/WordCountHbaseReader/output/
................... (output omitted) .............
14/12/30 17:51:58 INFO mapred.JobClient: Running job: job_201412161748_0035
14/12/30 17:51:59 INFO mapred.JobClient: map 0% reduce 0%
14/12/30 17:52:13 INFO mapred.JobClient: map 100% reduce 0%
14/12/30 17:52:26 INFO mapred.JobClient: map 100% reduce 100%
14/12/30 17:52:27 INFO mapred.JobClient: Job complete: job_201412161748_0035
14/12/30 17:52:27 INFO mapred.JobClient: Counters: 39
14/12/30 17:52:27 INFO mapred.JobClient: Job Counters
14/12/30 17:52:27 INFO mapred.JobClient: Launched reduce tasks=1
14/12/30 17:52:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4913
14/12/30 17:52:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/30 17:52:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/12/30 17:52:27 INFO mapred.JobClient: Rack-local map tasks=1
14/12/30 17:52:27 INFO mapred.JobClient: Launched map tasks=1
14/12/30 17:52:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13035
14/12/30 17:52:27 INFO mapred.JobClient: HBase Counters
14/12/30 17:52:27 INFO mapred.JobClient: REMOTE_RPC_CALLS=8
14/12/30 17:52:27 INFO mapred.JobClient: RPC_CALLS=8
14/12/30 17:52:27 INFO mapred.JobClient: RPC_RETRIES=0
14/12/30 17:52:27 INFO mapred.JobClient: NOT_SERVING_REGION_EXCEPTION=0
14/12/30 17:52:27 INFO mapred.JobClient: NUM_SCANNER_RESTARTS=0
14/12/30 17:52:27 INFO mapred.JobClient: MILLIS_BETWEEN_NEXTS=9
14/12/30 17:52:27 INFO mapred.JobClient: BYTES_IN_RESULTS=216
14/12/30 17:52:27 INFO mapred.JobClient: BYTES_IN_REMOTE_RESULTS=216
14/12/30 17:52:27 INFO mapred.JobClient: REGIONS_SCANNED=1
14/12/30 17:52:27 INFO mapred.JobClient: REMOTE_RPC_RETRIES=0
14/12/30 17:52:27 INFO mapred.JobClient: File Output Format Counters
14/12/30 17:52:27 INFO mapred.JobClient: Bytes Written=76
14/12/30 17:52:27 INFO mapred.JobClient: FileSystemCounters
14/12/30 17:52:27 INFO mapred.JobClient: FILE_BYTES_READ=92
14/12/30 17:52:27 INFO mapred.JobClient: HDFS_BYTES_READ=68
14/12/30 17:52:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=159978
14/12/30 17:52:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=76
14/12/30 17:52:27 INFO mapred.JobClient: File Input Format Counters
14/12/30 17:52:27 INFO mapred.JobClient: Bytes Read=0
14/12/30 17:52:27 INFO mapred.JobClient: Map-Reduce Framework
14/12/30 17:52:27 INFO mapred.JobClient: Map output materialized bytes=92
14/12/30 17:52:27 INFO mapred.JobClient: Map input records=5
14/12/30 17:52:27 INFO mapred.JobClient: Reduce shuffle bytes=92
14/12/30 17:52:27 INFO mapred.JobClient: Spilled Records=10
14/12/30 17:52:27 INFO mapred.JobClient: Map output bytes=76
14/12/30 17:52:27 INFO mapred.JobClient: Total committed heap usage (bytes)=211025920
14/12/30 17:52:27 INFO mapred.JobClient: CPU time spent (ms)=2160
14/12/30 17:52:27 INFO mapred.JobClient: Combine input records=0
14/12/30 17:52:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=68
14/12/30 17:52:27 INFO mapred.JobClient: Reduce input records=5
14/12/30 17:52:27 INFO mapred.JobClient: Reduce input groups=5
14/12/30 17:52:27 INFO mapred.JobClient: Combine output records=0
14/12/30 17:52:27 INFO mapred.JobClient: Physical memory (bytes) snapshot=263798784
14/12/30 17:52:27 INFO mapred.JobClient: Reduce output records=5
14/12/30 17:52:27 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1491795968
14/12/30 17:52:27 INFO mapred.JobClient: Map output records=5
[hadoop@K-Master WordCountHbaseReader]$ hadoop fs -ls /user/hadoop/WordCountHbaseReader/output/
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2014-07-28 18:04 /user/hadoop/WordCountHbaseReader/output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2014-07-28 18:04 /user/hadoop/WordCountHbaseReader/output/_logs
-rw-r--r-- 1 hadoop supergroup 76 2014-07-28 18:04 /user/hadoop/WordCountHbaseReader/output/part-r-00000
[hadoop@K-Master WordCountHbaseReader]$ hadoop fs -cat /user/hadoop/WordCountHbaseReader/output/part-r-00000
Bye count:1
Goodbye count:1
Hadoope count:2
Hellope count:2
Worldpe count:2
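The keys Hadoope, Hellope and Worldpe carry trailing characters that look like leftovers from the longer key Goodbye. One plausible cause, which cannot be confirmed from this post because the table was written by the previous post's job, is that the row key was built from the full backing array of a reused buffer. In Hadoop, Text.getBytes() returns the backing array, which is only valid up to getLength(). The sketch below reproduces the effect with a simplified stand-in; ReusableKey is a hypothetical class and not Hadoop's Text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified stand-in for a reusable key buffer; not Hadoop's actual Text class. */
final class ReusableKey {
    private byte[] bytes = new byte[0];
    private int length;

    /** Copies the new content in, growing the backing array only when it is too small. */
    void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < utf8.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    /** Returns the whole backing array, including stale bytes beyond the valid length. */
    byte[] rawBytes() {
        return bytes;
    }

    /** Returns a copy that contains only the valid bytes. */
    byte[] validBytes() {
        return Arrays.copyOf(bytes, length);
    }

    /** Decodes a byte array as UTF-8 text. */
    static String asString(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

Setting the keys in sorted order (Bye, Goodbye, Hadoop) and then decoding rawBytes() gives Hadoope, while validBytes() gives Hadoop. If this is indeed the cause, the fix belongs in the job that writes the table: build the row key from only the first getLength() bytes of the key.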
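[HBase Basics Tutorial] 1. HBase: Standalone and Pseudo-Distributed Mode Installation
[HBase Basics Tutorial] 2. HBase: Fully Distributed Mode Installation
[HBase Basics Tutorial] 3. HBase Shell DDL Operations
[HBase Basics Tutorial] 4. HBase Shell DML Operations
[HBase Basics Tutorial] 5. HBase API Access
[HBase Basics Tutorial] 6. HBase: Reading MapReduce Data and Writing It to HBase
[HBase Basics Tutorial] 7. HBase: Reading HBase Data and Writing It to HDFS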