Notes

inner class fetcher:
323: metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);

/** Return the set of anchor texts.  Only a single anchor with a given text
 * is permitted from a given domain. */
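
A rough sketch of the rule in that comment, to make it concrete. It is not the Nutch code: the `AnchorDedup` class, the `Inlink` holder and the `anchors` method are hypothetical names, and the URL host is assumed to be a good enough stand-in for "domain". The same anchor text is kept once per host, so it can still appear again when it comes from a different host.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AnchorDedup {

    /** Hypothetical stand-in for an incoming link: source URL plus anchor text. */
    public static final class Inlink {
        final String fromUrl;
        final String anchor;

        public Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    /** Keep each anchor text at most once per source host (assumption: host ~ domain). */
    public static List<String> anchors(List<Inlink> inlinks) {
        Map<String, Set<String>> hostToAnchors = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (Inlink link : inlinks) {
            String anchor = link.anchor == null ? "" : link.anchor.trim();
            if (anchor.isEmpty()) {
                continue; // nothing to index for an empty anchor
            }
            String host;
            try {
                host = URI.create(link.fromUrl).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip malformed source URLs
            }
            if (host == null) {
                continue; // URL without a host part
            }
            Set<String> seen = hostToAnchors.computeIfAbsent(
                    host.toLowerCase(Locale.ROOT), k -> new HashSet<>());
            if (seen.add(anchor)) {
                result.add(anchor); // first time this text is seen from this host
            }
        }
        return result;
    }
}
```

Returning a list rather than a set is deliberate here: a text that arrives from several hosts stays in the result several times, which an index could use as a weak popularity signal.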


IndexerMapReduce.reduce:

else if (CrawlDatum.hasFetchStatus(datum)) {
          // don't index unmodified (empty) pages
          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
            fetchDatum = datum;


basicfilter////and

IndexerOutputFormat

createLuceneDoc

now p is in title

hadoop 0.19 is really nice to work with

Put the extra requirements into the html parser
