Writing a MapReduce application
HBase runs on top of Hadoop, specifically HDFS. Data in HBase is partitioned
and replicated like any other data in HDFS. That means running a MapReduce
program over data stored in HBase has all the same advantages as a regular
MapReduce program. This is why your MapReduce calculation can execute the same
HBase scan as the multithreaded example and attain far greater throughput: in the
MapReduce application, the scan executes simultaneously on multiple nodes, which
removes the bottleneck of all data moving through a single machine. If you’re running
MapReduce on the same cluster that’s running HBase, the job also takes advantage
of any data collocation that might be available. Putting it all together, the Shakespearean
counting example looks like the following listing.
package HBaseIA.TwitBase.mapreduce;

//...

public class CountShakespeare {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "TwitBase Shakespeare counter");
    job.setJarByClass(CountShakespeare.class);

    Scan scan = new Scan();
    scan.addColumn(TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL);

    TableMapReduceUtil.initTableMapperJob(
        Bytes.toString(TwitsDAO.TABLE_NAME),
        scan,
        Map.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);

    // No reduce phase and no output files; results are reported via counters.
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class Map extends TableMapper<Text, LongWritable> {

    private boolean containsShakespeare(String msg) {
      //...
    }

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result,
        Context context) {
      byte[] b = result.getColumnLatest(
          TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL).getValue();

      String msg = Bytes.toString(b);
      if ((msg != null) && !msg.isEmpty()) {
        context.getCounter(Counters.ROWS).increment(1);           // count rows seen
      }
      if (containsShakespeare(msg)) {
        context.getCounter(Counters.SHAKESPEAREAN).increment(1);  // count matches
      }
    }

    public static enum Counters { ROWS, SHAKESPEAREAN }
  }
}
CountShakespeare is pretty simple; it packages a Mapper implementation and a main
method. It also takes advantage of the HBase-specific MapReduce helper class
TableMapper and the TableMapReduceUtil utility class that we talked about earlier in
the chapter. Also notice the lack of a reducer: this example doesn’t need to perform
additional computation in the reduce phase. Instead, the map phase reports its results
through job counters rather than emitting any output.
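Because everything is reported through those counters, it can be handy to read them back once the job finishes. As a minimal sketch (not part of the original listing), the final System.exit line of main could be replaced with something like the following; the printed message is purely illustrative:

    // Sketch: read the counters back after the job completes, then exit.
    boolean success = job.waitForCompletion(true);
    if (success) {
      long rows = job.getCounters()
          .findCounter(Map.Counters.ROWS).getValue();
      long shakespearean = job.getCounters()
          .findCounter(Map.Counters.SHAKESPEAREAN).getValue();
      System.out.println(shakespearean + " of " + rows
          + " twits looked Shakespearean.");
    }
    System.exit(success ? 0 : 1);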
Counters are fun and all, but what about writing back to HBase? We’ve developed a
similar algorithm specifically for detecting references to Hamlet. The mapper is similar
to the Shakespearean example, except that its [k2,v2] output types are
[ImmutableBytesWritable,Put]: basically, an HBase rowkey and an instance of the Put command
you learned in the previous chapter (a sketch of what such a mapper might look like
follows the reducer discussion below). Here’s the reducer code:
public static class Reduce extends TableReducer<ImmutableBytesWritable, Put,
    ImmutableBytesWritable> {

  @Override
  protected void reduce(ImmutableBytesWritable rowkey, Iterable<Put> values,
      Context context) throws IOException, InterruptedException {
    Iterator<Put> i = values.iterator();
    if (i.hasNext()) {
      context.write(rowkey, i.next());
    }
  }
}
There’s not much to it. The reducer implementation accepts [k2,{v2}], the rowkey
and a list of Puts, as input. In this case, each Put sets the info:hamlet_tag column
to true. A Put need only be executed once per user, so only the first is emitted
to the output context object. The [k3,v3] tuples produced are also of type
[ImmutableBytesWritable,Put]. You let the Hadoop machinery handle execution of
the Puts, which keeps the reduce implementation idempotent.
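The Hamlet-tagging mapper and the driver that wires it to this reducer aren’t reproduced here, so what follows is a minimal sketch of both. The class name, the containsHamlet and userKey helpers, and the "users" output table name are assumptions made for illustration; only the info:hamlet_tag column and the [ImmutableBytesWritable,Put] types come from the discussion above.

package HBaseIA.TwitBase.mapreduce;

//...

public class TagHamlet {

  public static class Map extends TableMapper<ImmutableBytesWritable, Put> {

    private boolean containsHamlet(String msg) {
      //...
    }

    // Deriving the user's rowkey from the twit rowkey is schema-specific
    // and elided here.
    private byte[] userKey(byte[] twitRowkey) {
      //...
    }

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result,
        Context context) throws IOException, InterruptedException {
      byte[] b = result.getColumnLatest(
          TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL).getValue();
      String msg = Bytes.toString(b);
      if ((msg != null) && containsHamlet(msg)) {
        byte[] user = userKey(rowkey.get());
        Put p = new Put(user);
        p.add(Bytes.toBytes("info"),        // family from the text
              Bytes.toBytes("hamlet_tag"),  // qualifier from the text
              Bytes.toBytes(true));
        context.write(new ImmutableBytesWritable(user), p);
      }
    }
  }

  // The Reduce class shown earlier goes here.

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "TwitBase Hamlet tagger");
    job.setJarByClass(TagHamlet.class);

    Scan scan = new Scan();
    scan.addColumn(TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL);

    TableMapReduceUtil.initTableMapperJob(
        Bytes.toString(TwitsDAO.TABLE_NAME),
        scan,
        Map.class,
        ImmutableBytesWritable.class,
        Put.class,
        job);

    TableMapReduceUtil.initTableReducerJob(
        "users",          // output table name assumed for illustration
        Reduce.class,
        job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableMapReduceUtil.initTableReducerJob configures TableOutputFormat behind the scenes, and that output format is the Hadoop machinery that actually applies the Puts to the table.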