Nutch 1.3 Study Notes 9: SolrIndexer

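Newer versions of Nutch use Solr as the back-end indexing service, and the Nutch project is working to make integration with Solr more convenient. The two are loosely coupled: Nutch treats Solr as a service and only needs to call the Solr client to communicate with it.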


1. bin/nutch solrindex

   This command builds an index from the crawled content. Its usage message is:
   
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)

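   The first argument, <solr url>, is the address of the Solr service. The second is the crawl database (crawldb) that holds the crawled URLs, and the third is the inverted-link database (linkdb). The remaining arguments are either one or more segment directories, or -dir followed by a directory that contains the segments.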


  The command requires a running Solr service to index into.
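To make the argument layout concrete, here is a minimal sketch of how such a command line could be parsed. It uses only the Java standard library rather than Hadoop's Path and FileSystem, and the class name SolrIndexArgs and its fields are hypothetical names introduced for illustration; this is not the actual SolrIndexer code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical holder for the parsed arguments; this is not a Nutch class. */
final class SolrIndexArgs {
  static final String USAGE =
      "Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)";

  final String solrUrl;
  final Path crawlDb;
  final Path linkDb;
  final List<Path> segments;

  private SolrIndexArgs(String solrUrl, Path crawlDb, Path linkDb, List<Path> segments) {
    this.solrUrl = solrUrl;
    this.crawlDb = crawlDb;
    this.linkDb = linkDb;
    this.segments = segments;
  }

  /** Parses the four-part argument list described by the usage message. */
  static SolrIndexArgs parse(String[] args) {
    if (args.length < 4) {
      throw new IllegalArgumentException(USAGE);
    }
    List<Path> segments = new ArrayList<>();
    for (int i = 3; i < args.length; i++) {
      if (args[i].equals("-dir")) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException(USAGE);
        }
        // -dir names a parent directory; each subdirectory is treated as a segment.
        segments.addAll(listSubdirectories(Paths.get(args[++i])));
      } else {
        // Otherwise the argument is a single segment directory.
        segments.add(Paths.get(args[i]));
      }
    }
    return new SolrIndexArgs(args[0], Paths.get(args[1]), Paths.get(args[2]), segments);
  }

  private static List<Path> listSubdirectories(Path parent) {
    try (Stream<Path> children = Files.list(parent)) {
      return children.filter(Files::isDirectory).sorted().collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling SolrIndexArgs.parse with a Solr URL, a crawldb path, a linkdb path and two segment paths yields a segments list of size two; passing -dir with no directory after it is rejected with the usage message.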


2. What the SolrIndexer class does

The bin/nutch solrindex command ultimately calls the main method of SolrIndexer. The most important method there is indexSolr;
the core of that method is shown below.
    final JobConf job = new NutchJob(getConf());
    job.setJobName("index-solr " + solrUrl);
    // Initialize the job: set its input paths and its Mapper and Reducer
    IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);


    job.set(SolrConstants.SERVER_URL, solrUrl);


    // Register SolrWriter as the writer class used by the OutputFormat
    NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);


    job.setReduceSpeculativeExecution(false);


    final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
                         new Random().nextInt());
    // Set the output path
    FileOutputFormat.setOutputPath(job, tmp);
    try {
      // Submit the job and wait for it to finish
      JobClient.runJob(job);
      // do the commits once and for all the reducers in one go
      SolrServer solr =  new CommonsHttpSolrServer(solrUrl);
      solr.commit();
      long end = System.currentTimeMillis();
      LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
    }
    catch (Exception e){
      LOG.error(e);
    } finally {
      FileSystem.get(job).delete(tmp, true);
    }
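The control flow above (run the job, commit once after it finishes, and always delete the temporary output directory) can be sketched without Hadoop or Solr. Everything in this sketch is a stand-in: the Job interface, the commit callback and the class name CommitOnceSketch are hypothetical, and the temporary directory plays the role of the throwaway tmp_ output path, which the job presumably needs only because a file-based output format expects an output path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class CommitOnceSketch {
  /** Hypothetical stand-in for the MapReduce job. */
  interface Job {
    void run(Path tmpOutput) throws Exception;
  }

  /**
   * Runs the job, then commits exactly once. Failures are logged and reported
   * through the return value; the temporary directory is removed either way.
   */
  static boolean runThenCommit(Job job, Runnable commit) {
    Path tmp;
    try {
      tmp = Files.createTempDirectory("tmp_");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    try {
      job.run(tmp);   // corresponds to JobClient.runJob(job)
      commit.run();   // corresponds to the single solr.commit()
      return true;
    } catch (Exception e) {
      System.err.println("indexing failed: " + e);
      return false;
    } finally {
      deleteRecursively(tmp);
    }
  }

  /** Deletes a directory tree, children before parents. */
  static void deleteRecursively(Path root) {
    try (Stream<Path> walk = Files.walk(root)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

One design point worth noticing: because the commit happens only after the job call returns without an exception, a failed job never triggers a commit, while the cleanup in the finally block still runs.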


Next, let's look at what IndexerMapReduce.initMRJob does.
  public static void initMRJob(Path crawlDb, Path linkDb,
                           Collection<Path> segments,
                           JobConf job) {


    LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
    LOG.info("IndexerMapReduce: linkdb: " + linkDb);


    // Add the directories of each segment that are needed for indexing
    for (final Path segment : segments) {
      LOG.info("IndexerMapReduces: adding segment: " + segment);
      FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));  // crawl_fetch
      FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));  // crawl_parse
      FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));         // parse_data
      FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));         // parse_text
    }


    FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));         // crawldb/current
    FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));           // linkdb/current
    job.setInputFormat(SequenceFileInputFormat.class);                    // Input format: every input directory above is read with SequenceFileInputFormat


    // Set the Mapper and Reducer classes
    job.setMapperClass(IndexerMapReduce.class);
    job.setReducerClass(IndexerMapReduce.class);


    // Set the output format and types
    job.setOutputFormat(IndexerOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NutchWritable.class);   // Type of the map output value; the key type is the Text set above
    job.setOutputValueClass(NutchWritable.class);
  }
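As a quick illustration of which directories end up as job input, the following sketch builds the same list of paths from plain strings. The class and method names are hypothetical; the subdirectory names are the ones given in the comments of the excerpt, with crawl_parse assumed as the value of CrawlDatum.PARSE_DIR_NAME.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexInputPathsSketch {
  // Per-segment subdirectories read by the indexing job.
  private static final String[] SEGMENT_PARTS = {
    "crawl_fetch", "crawl_parse", "parse_data", "parse_text"
  };

  /** Returns every input path: four per segment, plus crawldb and linkdb. */
  static List<String> inputPaths(String crawlDb, String linkDb, List<String> segments) {
    List<String> paths = new ArrayList<>();
    for (String segment : segments) {
      for (String part : SEGMENT_PARTS) {
        paths.add(segment + "/" + part);
      }
    }
    paths.add(crawlDb + "/current");
    paths.add(linkDb + "/current");
    return paths;
  }
}
```

With two segments this gives ten input paths: eight from the segments and one each from the crawldb and the linkdb. Because all of them feed one job keyed by URL, the records for a single URL from every source meet in the same reduce call.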


  The map function in IndexerMapReduce only reads each <key,value> pair, wraps the value in a NutchWritable, and emits it. Next, let's look at what the reduce method of IndexerMapReduce does.
	 public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchDocument> output, Reporter reporter)
    throws IOException {
    Inlinks inlinks = null;
    CrawlDatum dbDatum = null;
    CrawlDatum fetchDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;
    // Inspect the type of each value that shares this key and assign it to
    // inlinks, dbDatum, fetchDatum, parseData or parseText accordingly
    while (values.hasNext()) {
      final Writable value = values.next().get(); // unwrap
      if (value instanceof Inlinks) {
        inlinks = (Inlinks)value;
      } else if (value instanceof CrawlDatum) {
        final CrawlDatum datum = (CrawlDatum)value;
        if (CrawlDatum.hasDbStatus(datum))
          dbDatum = datum;
        else if (CrawlDatum.hasFetchStatus(datum)) {
          // don't index unmodified (empty) pages
          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
            fetchDatum = datum;
        } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                   CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                   CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
          continue;
        } else {
          throw new RuntimeException("Unexpected status: "+datum.getStatus());
        }
      } else if (value instanceof ParseData) {
        parseData = (ParseData)value;
      } else if (value instanceof ParseText) {
        parseText = (ParseText)value;
      } else if (LOG.isWarnEnabled()) {
        LOG.warn("Unrecognized type: "+value.getClass());
      }
    }


    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }


    if (!parseData.getStatus().isSuccess() ||
        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
      return;
    }


    // Create an indexable document. In Lucene, a Document is an abstract document object made up of Fields, and each Field is in turn made up of Terms
    NutchDocument doc = new NutchDocument();
    final Metadata metadata = parseData.getContentMeta();


    // add segment, used to map from merged index back to segment files
    doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));


    // add digest, used by dedup
    doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));


    final Parse parse = new ParseImpl(parseText, parseData);
    try {
      // extract information from dbDatum and pass it to
      // fetchDatum so that indexing filters can use it
      final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
      if (url != null) {
        fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
      }
      // run indexing filters
      doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
    } catch (final IndexingException e) {
      if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
      return;
    }


    // skip documents discarded by indexing filters
    if (doc == null) return;


    float boost = 1.0f;
    // run scoring filters
    try {
      boost = this.scfilters.indexerScore(key, doc, dbDatum,
              fetchDatum, parse, inlinks, boost);
    } catch (final ScoringFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Error calculating score " + key + ": " + e);
      }
      return;
    }
    // apply boost to all indexed fields.
    doc.setWeight(boost);
    // store boost for use by explain and dedup
    doc.add("boost", Float.toString(boost));


    // Collect the output; the IndexerOutputFormat shown below writes it to Solr
    output.collect(key, doc);
  }
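The first half of reduce is a type-dispatch step: all values for one URL arrive mixed together, and each is sorted into its slot before the completeness checks. A stripped-down sketch of that idea follows. The nested classes are hypothetical stand-ins for the Nutch types (the real CrawlDatum distinguishes db and fetch records by status, which is modeled here with two separate classes), and the returned string stands in for a NutchDocument.

```java
import java.util.List;
import java.util.Optional;

final class ReduceDispatchSketch {
  // Hypothetical stand-ins for the Nutch value types; not the real classes.
  static final class DbDatum {}
  static final class Inlinks {}
  static final class FetchDatum {
    final boolean success;
    FetchDatum(boolean success) { this.success = success; }
  }
  static final class ParseData {
    final boolean success;
    ParseData(boolean success) { this.success = success; }
  }
  static final class ParseText {
    final String text;
    ParseText(String text) { this.text = text; }
  }

  /** Returns a document for the URL, or empty if the record set is unusable. */
  static Optional<String> reduce(String url, List<Object> values) {
    Inlinks inlinks = null;
    DbDatum dbDatum = null;
    FetchDatum fetchDatum = null;
    ParseData parseData = null;
    ParseText parseText = null;

    // Sort each value into its slot according to its runtime type.
    for (Object value : values) {
      if (value instanceof Inlinks) {
        inlinks = (Inlinks) value;
      } else if (value instanceof DbDatum) {
        dbDatum = (DbDatum) value;
      } else if (value instanceof FetchDatum) {
        fetchDatum = (FetchDatum) value;
      } else if (value instanceof ParseData) {
        parseData = (ParseData) value;
      } else if (value instanceof ParseText) {
        parseText = (ParseText) value;
      }
    }

    // Inlinks are optional; the other four records are all required.
    if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
      return Optional.empty();
    }
    // Skip pages that were not fetched and parsed successfully.
    if (!parseData.success || !fetchDatum.success) {
      return Optional.empty();
    }
    return Optional.of(url + " | " + parseText.text + " | hasInlinks=" + (inlinks != null));
  }
}
```

A URL that only has inlinks, or whose fetch failed, produces no document at all, which matches the early returns in the excerpt.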


  Next, let's look at how getRecordWriter is implemented in IndexerOutputFormat.
   @Override
  public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    
    // populate JobConf with field indexing options
    IndexingFilters filters = new IndexingFilters(job);
    
    // Documents can be written to more than one output destination here
    final NutchIndexWriter[] writers =
      NutchIndexWriterFactory.getNutchIndexWriters(job);


    for (final NutchIndexWriter writer : writers) {
      writer.open(job, name);
    }
    // Return the RecordWriter as an anonymous inner class; it writes out the <key,value> pairs collected by reduce
    return new RecordWriter<Text, NutchDocument>() {


      public void close(Reporter reporter) throws IOException {
        for (final NutchIndexWriter writer : writers) {
          writer.close();
        }
      }


      public void write(Text key, NutchDocument doc) throws IOException {
        for (final NutchIndexWriter writer : writers) {
          writer.write(doc);
        }
      }
    };
  }
}
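The RecordWriter returned above is essentially a fan-out: one write or close call is forwarded to every configured writer. Here is a small self-contained sketch of that pattern. DocWriter and ListWriter are hypothetical names, documents are plain strings, and the Hadoop and Nutch types are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class FanOutWriterSketch {
  /** Hypothetical stand-in for NutchIndexWriter. */
  interface DocWriter {
    void write(String doc);
    void close();
  }

  /** Test-friendly writer that just remembers what it received. */
  static final class ListWriter implements DocWriter {
    final List<String> written = new ArrayList<>();
    boolean closed = false;

    @Override public void write(String doc) { written.add(doc); }
    @Override public void close() { closed = true; }
  }

  /** Wraps several writers behind a single writer, as an anonymous inner class. */
  static DocWriter fanOut(final List<? extends DocWriter> writers) {
    return new DocWriter() {
      @Override public void write(String doc) {
        for (DocWriter writer : writers) {
          writer.write(doc);
        }
      }

      @Override public void close() {
        for (DocWriter writer : writers) {
          writer.close();
        }
      }
    };
  }
}
```

The caller sees a single writer, so adding another index back end would only mean registering another writer class in the configuration, not changing the reduce code.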

Several NutchIndexWriter instances can be configured here, but at present there is only one implementation, SolrWriter. Let's analyze what its write method does.
public void write(NutchDocument doc) throws IOException {
    final SolrInputDocument inputDoc = new SolrInputDocument();
    // Build the Solr SolrInputDocument object
    for(final Entry<String, NutchField> e : doc) {
      for (final Object val : e.getValue().getValues()) {
        // normalise the string representation for a Date
        Object val2 = val;
        if (val instanceof Date){
          val2 = DateUtil.getThreadLocalDateFormat().format(val);
        }
        inputDoc.addField(solrMapping.mapKey(e.getKey()), val2, e.getValue().getWeight());
        String sCopy = solrMapping.mapCopyKey(e.getKey());
        if (sCopy != e.getKey()) {
        	inputDoc.addField(sCopy, val);	
        }
      }
    }
    inputDoc.setDocumentBoost(doc.getWeight());
    inputDocs.add(inputDoc);   // add to the buffer
    if (inputDocs.size() >= commitSize) {   // once the buffer reaches commitSize, send it to the Solr server with the client's add method
      try {
        solr.add(inputDocs);
      } catch (final SolrServerException e) {
        throw makeIOException(e);
      }
      inputDocs.clear();
    }
  }
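The buffering at the end of write is a common batching pattern: collect documents until a threshold is reached, send them in one call, then clear the buffer. The sketch below isolates that pattern with strings instead of SolrInputDocument and a Consumer instead of the Solr client; both are stand-ins. The close method that flushes the remainder is my own addition to make the sketch complete, since the excerpt does not show SolrWriter's close.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchingWriterSketch {
  private final List<String> buffer = new ArrayList<>();
  private final int commitSize;
  private final Consumer<List<String>> sink;   // stands in for solr.add(inputDocs)

  BatchingWriterSketch(int commitSize, Consumer<List<String>> sink) {
    if (commitSize < 1) {
      throw new IllegalArgumentException("commitSize must be at least 1");
    }
    this.commitSize = commitSize;
    this.sink = sink;
  }

  /** Buffers the document and sends a batch once commitSize is reached. */
  void write(String doc) {
    buffer.add(doc);
    if (buffer.size() >= commitSize) {
      flush();
    }
  }

  /** Sends whatever is left in the buffer (assumed behaviour, see lead-in). */
  void close() {
    if (!buffer.isEmpty()) {
      flush();
    }
  }

  private void flush() {
    // Hand over a copy so that clearing the buffer cannot affect the sink.
    sink.accept(new ArrayList<>(buffer));
    buffer.clear();
  }
}
```

With a commitSize of 2, writing five documents produces two batches of two during the writes and one batch of one on close. Batching like this trades a little memory for far fewer round trips to the server.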


3. Summary

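This note has outlined how Nutch builds an index from crawled content. The work is again done by a MapReduce job. On the reduce side, the fields to be indexed are assembled into a NutchDocument, which SolrWriter then sends to the Solr server; SolrWriter wraps a Solr client object. The Nutch document has to be converted into a Solr document at this point because NutchDocument is a Writable type, which it must be so that Hadoop can serialize it, whereas SolrInputDocument does not implement Writable.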
