Learning Hadoop programming from Nutch

I just downloaded the latest Nutch 1.0.x release and noticed that Nutch's search is now delegated to Solr. Nutch contains plenty of Hadoop code that works well as case-study material: for someone just getting into Hadoop programming, seeing how an established project uses Hadoop is a good way to start. After all, Nutch is where Hadoop originated, and newer Nutch releases also track fairly recent Hadoop versions.


Take a look at Nutch's index module, which builds the index as a Hadoop MapReduce job:

map: parses the key and value out of the serialized input files.

  public void map(Text key, Writable value,
      OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException {
    output.collect(key, new NutchWritable(value));
  }
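
Note that map can pass through values of different concrete types (CrawlDatum, ParseData, ParseText, and so on) only because NutchWritable gives them a single common shuffle type. In Nutch's source it is a GenericWritable subclass; the sketch below shows the idea, with an illustrative subset of the real class list:

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;

public class NutchWritable extends GenericWritable {

  // Illustrative subset; the real list in Nutch covers all segment data types.
  @SuppressWarnings("unchecked")
  private static Class<? extends Writable>[] CLASSES = new Class[] {
      CrawlDatum.class, Inlinks.class, ParseData.class, ParseText.class };

  public NutchWritable() { }

  public NutchWritable(Writable instance) {
    set(instance); // wrap the concrete value for the shuffle
  }

  @Override
  protected Class<? extends Writable>[] getTypes() {
    return CLASSES;
  }
}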


reduce: the key is the document's primary key; the value is the document plus the action to apply to it.

  public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
    throws IOException {
.......
    reporter.incrCounter("IndexerStatus", "Documents added", 1);

    NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
    output.collect(key, action);
.......
  }


The map phase simply passes the input through, and reduce merely wraps up the document data, so a large number of input sources can be processed and packaged concurrently. The actual index submission happens in the FileOutputFormat: the key produced by reduce is the document's primary key, and the value bundles the document with the action to perform on it.
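
Before looking at that output format, it helps to see how the pieces are wired together. The sketch below follows the shape of Nutch's IndexerMapReduce.initMRJob() in the old org.apache.hadoop.mapred API; the exact code differs between Nutch versions, so treat it as illustrative:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.NutchWritable;
import org.apache.nutch.indexer.IndexerMapReduce;
import org.apache.nutch.indexer.IndexerOutputFormat;
import org.apache.nutch.util.NutchConfiguration;

public class IndexJobSketch {

  public static void runIndexJob(Path segment, Path out) throws IOException {
    JobConf job = new JobConf(NutchConfiguration.create());

    FileInputFormat.addInputPath(job, segment);       // crawl segment (SequenceFiles)
    FileOutputFormat.setOutputPath(job, out);

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(IndexerMapReduce.class);       // the map() shown above
    job.setReducerClass(IndexerMapReduce.class);      // the reduce() shown above

    job.setOutputFormat(IndexerOutputFormat.class);   // where the index is actually written
    job.setOutputKeyClass(Text.class);                // document key (the URL)
    job.setMapOutputValueClass(NutchWritable.class);
    job.setOutputValueClass(NutchWritable.class);     // ignored by the custom OutputFormat

    JobClient.runJob(job);
  }
}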


public class IndexerOutputFormat extends FileOutputFormat<Text, NutchIndexAction> {

  @Override
  public RecordWriter<Text, NutchIndexAction> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {

    // populate JobConf with field indexing options
    IndexingFilters filters = new IndexingFilters(job);

    final NutchIndexWriter[] writers =
      NutchIndexWriterFactory.getNutchIndexWriters(job);

    // initialize each writer
    for (final NutchIndexWriter writer : writers) {
      writer.open(job, name);
    }

    return new RecordWriter<Text, NutchIndexAction>() {

      public void close(Reporter reporter) throws IOException {
        for (final NutchIndexWriter writer : writers) {
          writer.close();
        }
      }

      // handle each record
      public void write(Text key, NutchIndexAction indexAction) throws IOException {
        for (final NutchIndexWriter writer : writers) {
          if (indexAction.action == NutchIndexAction.ADD) {
            writer.write(indexAction.doc);
          }
          if (indexAction.action == NutchIndexAction.DELETE) {
            writer.delete(key.toString());
          }
        }
      }
    };
  }
}


Each record produced by reduce is handled by a RecordWriter, through its public void write(Text key, NutchIndexAction indexAction) method. To submit the index, Nutch simply uses Solr's standard update mechanism, implemented in SolrWriter.
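
From those call sites (open, write, delete, close) you can read off the contract that SolrWriter has to fulfil. A reconstruction of that writer interface, derived from the calls above rather than copied from Nutch's source:

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.NutchDocument;

// Reconstructed from the call sites above; the real interface in Nutch
// may declare additional methods or slightly different signatures.
public interface NutchIndexWriter {
  void open(JobConf job, String name) throws IOException; // connect to the backend
  void write(NutchDocument doc) throws IOException;       // handle an ADD action
  void delete(String key) throws IOException;             // handle a DELETE action
  void close() throws IOException;                        // flush and release resources
}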


Initialization comes first:

  public void open(JobConf job, String name) throws IOException {
    SolrServer server = SolrUtils.getCommonsHttpSolrServer(job);
    init(server, job);
  }
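
SolrUtils.getCommonsHttpSolrServer evidently builds a SolrJ client from the job configuration. A plausible sketch, assuming the server address is stored under the solr.server.url property that the solrindex command sets; the real helper may also tune the underlying HTTP client:

import java.net.MalformedURLException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrUtils {

  // Assumption: the URL lives under "solr.server.url" in the JobConf.
  public static SolrServer getCommonsHttpSolrServer(JobConf job)
      throws MalformedURLException {
    return new CommonsHttpSolrServer(job.get("solr.server.url"));
  }
}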


Then each record is processed, starting with the delete operation:

  public void delete(String key) throws IOException {
    if (delete) {
      try {
        solr.deleteById(key);
        numDeletes++;
      } catch (final SolrServerException e) {
        throw makeIOException(e);
      }
    }
  }


The add operation: new documents are placed in a buffer and only submitted once a certain count is reached, or when the writer is closed:

    inputDocs.add(inputDoc);
    if (inputDocs.size() + numDeletes >= commitSize) {
      try {
        LOG.info("Indexing " + Integer.toString(inputDocs.size()) + " documents");
        LOG.info("Deleting " + Integer.toString(numDeletes) + " documents");
        numDeletes = 0;
        UpdateRequest req = new UpdateRequest();
        req.add(inputDocs);
        req.setParams(params);
        req.process(solr);
      } catch (final SolrServerException e) {
        throw makeIOException(e);
      }
      inputDocs.clear();
    }
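
The "or when the writer is closed" part implies a matching flush in close(). A hedged sketch of that behavior, reusing the fields from the snippets above (inputDocs, params, solr, LOG, makeIOException); the real SolrWriter may order these steps differently:

  public void close() throws IOException {
    try {
      if (!inputDocs.isEmpty()) {
        // flush whatever is still buffered
        LOG.info("Indexing " + Integer.toString(inputDocs.size()) + " documents");
        UpdateRequest req = new UpdateRequest();
        req.add(inputDocs);
        req.setParams(params);
        req.process(solr);
        inputDocs.clear();
      }
      solr.commit(); // make all batched adds/deletes visible to searchers
    } catch (final SolrServerException e) {
      throw makeIOException(e);
    }
  }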







