Learning HBase: Using a Multithreaded Mapper

   Hadoop itself ships a multithreaded mapper (org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper), so naturally HBase provides an equivalent: org.apache.hadoop.hbase.mapreduce.MultithreadedTableMapper.

   In essence, the class simply overrides Mapper's run() method:
    
/**
 * Run the application's maps using a thread pool.
 */
@Override
public void run(Context context) throws IOException, InterruptedException {
  outer = context;
  int numberOfThreads = getNumberOfThreads(context);
  mapClass = getMapperClass(context);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Configuring multithread runner to use " + numberOfThreads +
        " threads");
  }
  executor = Executors.newFixedThreadPool(numberOfThreads);
  for (int i = 0; i < numberOfThreads; ++i) {
    MapRunner thread = new MapRunner(context);
    executor.execute(thread);
  }
  executor.shutdown();
  while (!executor.isTerminated()) {
    // wait till all the threads are done
    Thread.sleep(1000);
  }
}

   The code above is quoted from hbase-0.94.1.
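The run() method follows a standard java.util.concurrent pattern: create a fixed-size pool, submit one runner per thread, shut the pool down, and poll until it terminates. Here is a minimal JDK-only sketch of that same pattern (the class and counter names are mine, not from HBase; the counter stands in for the per-thread map work each MapRunner performs):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolPattern {
  public static void main(String[] args) throws InterruptedException {
    int numberOfThreads = 4;                        // analogous to getNumberOfThreads(context)
    AtomicInteger processed = new AtomicInteger();  // stand-in for the work a MapRunner does

    ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
    for (int i = 0; i < numberOfThreads; ++i) {
      executor.execute(processed::incrementAndGet); // HBase submits one MapRunner per thread
    }
    executor.shutdown();                            // no new tasks; running ones finish
    while (!executor.isTerminated()) {              // HBase polls with Thread.sleep(1000)
      Thread.sleep(10);
    }
    System.out.println(processed.get());            // prints 4
  }
}
```

Note that shutdown() only stops the pool from accepting new tasks; the isTerminated() polling loop is what actually waits for all runners to finish.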

   The class also contains a private inner class, MapRunner. Each MapRunner holds a mapper field, and that mapper is the one we actually want to execute. So how does our mapper get set?

/**
 * Set the application's mapper class.
 * @param <K2> the map output key type
 * @param <V2> the map output value type
 * @param job the job to modify
 * @param cls the class to use as the mapper
 */
public static <K2, V2> void setMapperClass(Job job,
    Class<? extends Mapper<ImmutableBytesWritable, Result, K2, V2>> cls) {
  if (MultithreadedTableMapper.class.isAssignableFrom(cls)) {
    throw new IllegalArgumentException("Can't have recursive " +
        "MultithreadedTableMapper instances.");
  }
  job.getConfiguration().setClass(MAPPER_CLASS,
      cls, Mapper.class);
}

   The code above is also quoted from hbase-0.94.1.
   As the guard shows, the mapper we supply must not be a subclass of MultithreadedTableMapper (I ran into this IllegalArgumentException myself when I mistakenly extended MultithreadedTableMapper). Calling this static method before submitting the job registers our real mapper class.
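The guard relies on Class.isAssignableFrom, which answers "can an instance of cls be used where a MultithreadedTableMapper is expected?" A quick JDK-only illustration of that check (Parent and Child are made-up names standing in for MultithreadedTableMapper and a subclass of it):

```java
public class AssignableDemo {
  static class Parent {}
  static class Child extends Parent {}  // analogous to extending MultithreadedTableMapper

  public static void main(String[] args) {
    // true: a Child is-a Parent, which is exactly the condition
    // that makes setMapperClass throw IllegalArgumentException
    System.out.println(Parent.class.isAssignableFrom(Child.class));   // prints true

    // an unrelated class passes the guard
    System.out.println(Parent.class.isAssignableFrom(String.class));  // prints false
  }
}
```

The check prevents infinite recursion: if the registered mapper were itself a MultithreadedTableMapper, each of its threads would try to spin up yet another thread pool.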

   Similarly:
/**
 * Set the number of threads in the pool for running maps.
 * @param job the job to modify
 * @param threads the new number of threads
 */
public static void setNumberOfThreads(Job job, int threads) {
  job.getConfiguration().setInt(NUMBER_OF_THREADS,
      threads);
}

   We can call this method to set the number of worker threads; the default is 10.

   Also note that the mapper class we pass to TableMapReduceUtil.initTableMapperJob must be MultithreadedTableMapper itself.
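Putting the pieces together, a driver might look roughly like the sketch below. The table name, driver class, and MyRealMapper are placeholders of my own; the TableMapReduceUtil and MultithreadedTableMapper calls match the 0.94-era API quoted above. This is a sketch of the wiring, not a tested job:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.MultithreadedTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MultithreadedScanDriver {

  // The real per-row work; note it extends TableMapper, NOT MultithreadedTableMapper
  public static class MyRealMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // per-row logic goes here (placeholder)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "multithreaded-scan");   // 0.94-era Job constructor
    job.setJarByClass(MultithreadedScanDriver.class);

    // The class handed to initTableMapperJob must be MultithreadedTableMapper itself
    TableMapReduceUtil.initTableMapperJob(
        "my_table",                      // placeholder table name
        new Scan(),
        MultithreadedTableMapper.class,  // not MyRealMapper
        Text.class, Text.class,          // output key/value types of MyRealMapper
        job);

    // Register the real mapper and the pool size via the static setters shown above
    MultithreadedTableMapper.setMapperClass(job, MyRealMapper.class);
    MultithreadedTableMapper.setNumberOfThreads(job, 8);  // default is 10

    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```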

   Finally, the class also contains a few other inner classes and methods that keep the shared context consistent across threads; if you are interested, read the source yourself. I am only scratching the surface here.
