Nutch Study Notes 4: Indexing in Nutch 1.7 with ElasticSearch

The previous post walked through the fetch and parse flow. The key takeaway was:

During parsing, Nutch looks up the list of registered parsers that match the page's ContentType,

then calls each parser in turn, returning as soon as one of them succeeds; otherwise it moves on to the next parser.

Of course, before returning, the result also passes through every registered HtmlParseFilter, at least in the case of HtmlParser.
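
To make that recap concrete, here is a minimal sketch of the "try each parser until one succeeds" loop (illustrative only, not the actual org.apache.nutch.parse.ParseUtil source; the method name parseWith is made up):

// 'parsers' holds the parsers registered for the page's ContentType
public static ParseResult parseWith(Content content, Parser[] parsers) {
    for (Parser parser : parsers) {
        ParseResult result = parser.getParse(content);
        if (result != null && result.isSuccess()) {
            return result;   // the first successful parser wins
        }
    }
    return null;             // no registered parser could handle this content
}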

 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now let's look at the indexing process.

When we type

./bin/nutch index

what actually happens?

Looking at the nutch script shipped with 1.7:

elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingJob

So let's look at the IndexingJob class.

Start with the method

public int run(String[] args) throws Exception {

This method first parses the command-line arguments,

and then calls

try {
    index(crawlDb, linkDb, segments, noCommit, deleteGone, params,
            filter, normalize);
    return 0;
} catch (final Exception e) {
    LOG.error("Indexer: " + StringUtils.stringifyException(e));
    return -1;
}
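
As an aside, since IndexingJob is a Hadoop Tool, you can also launch it programmatically, mirroring what its main() does (a sketch; in practice you would normally just use ./bin/nutch index, and the paths below are only examples):

// run the indexing job from code via ToolRunner
int res = ToolRunner.run(NutchConfiguration.create(), new IndexingJob(),
        new String[] { "./data/crawldb", "-linkdb", "./data/linkdb",
                       "-dir", "./data/segments" });
System.exit(res);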

So we follow the call into

public void index(Path crawlDb, Path linkDb, List<Path> segments,
            boolean noCommit, boolean deleteGone, String params,
            boolean filter, boolean normalize) throws IOException {

At line 100 of this method you'll find:

IndexWriters writers = new IndexWriters(getConf());
LOG.info(writers.describe());

If you look at the IndexWriters constructor, you'll see that it, too, obtains all the available IndexWriter implementations through the plugin mechanism.
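
Roughly, the constructor does something like the following (a paraphrased sketch from memory of Nutch 1.7, not a verbatim copy of IndexWriters.java; error handling omitted):

// ask the plugin repository for every extension registered at the
// IndexWriter extension point, and instantiate each one
ExtensionPoint point = PluginRepository.get(conf)
        .getExtensionPoint(IndexWriter.X_POINT_ID);
Extension[] extensions = point.getExtensions();
List<IndexWriter> writers = new ArrayList<IndexWriter>();
for (Extension extension : extensions) {
    // each enabled indexing plugin (indexer-solr, indexer-elastic, ...) contributes one instance
    writers.add((IndexWriter) extension.getExtensionInstance());
}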

PS: Nutch only pins down the overall pipeline; every stage can be hooked into through a plugin.

That makes it very convenient to inject our own logic later, for example to implement customized requirements.

~~~~~~~~~~~~~

If you run the following at this point:

./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/  -dir ./data/segments/ -deleteGone -filter -normalize

the system reports an error:

Indexer: starting at 2014-06-26 14:49:35
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Indexer: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication

	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
	at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
	at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:100)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)

Why? If you have understood Nutch, you can find the cause yourself.

That's the nice thing about studying open source: once you read the code, you know where the problem is.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The reason Solr shows up here is that the indexer-solr plugin is enabled in the configuration file.

Before any modification, the plugin section of nutch-default.xml reads:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

See? indexer-solr is right there in the list.

Copy this property into nutch-site.xml, then remove indexer-solr and replace it with indexer-elastic, as shown in the sketch below.
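
The resulting override in nutch-site.xml would look roughly like this (the value is the default shown above, with only indexer-solr swapped for indexer-elastic):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>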

Then rerun the command:

./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/  -dir ./data/segments/ -deleteGone -filter -normalize

The system will complain again, this time about missing ElasticSearch configuration items; sorting those out is left to you.
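
For reference, if memory serves, the indexer-elastic plugin in 1.7 reads properties along the following lines; treat the names and values below as assumptions and verify them against the conf/nutch-default.xml that ships with your build:

<!-- assumed property names for indexer-elastic; verify in nutch-default.xml -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>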

~~~~~~~~~~~~~~~~~~~~~~~~~~

The official descriptions of IndexWriter and IndexingFilter:

IndexWriter -- Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
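
To make the IndexingFilter extension point concrete, here is a minimal, hypothetical filter that adds a single field to every document (a sketch against the 1.7 API; the class name and the field are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        doc.add("myfield", "myvalue");   // add custom metadata to the document
        return doc;                      // returning null would drop the document
    }

    public void setConf(Configuration conf) { this.conf = conf; }

    public Configuration getConf() { return conf; }
}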

 ~~~~~~~~~~~~~~~~~~~~~~~~~~

How IndexWriter and IndexingFilter fit together:

At line 272 of IndexerMapReduce.java:

// run indexing filters
doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);

This simply calls every IndexingFilter in turn; as long as the result is not null, it moves on to the next filter (a null result means the document is dropped and will not be indexed), roughly as sketched below.
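
A paraphrased sketch of what IndexingFilters.filter does (not the verbatim Nutch source):

public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    for (IndexingFilter indexingFilter : this.indexingFilters) {
        doc = indexingFilter.filter(doc, parse, url, datum, inlinks);
        if (doc == null) {
            return null;   // one filter vetoed the document; it will not be indexed
        }
    }
    return doc;            // document enriched by every filter
}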

The filtered final result is then handed to

public void open(JobConf job, String name) throws IOException {
		for (int i = 0; i < this.indexWriters.length; i++) {
			try {
				this.indexWriters[i].open(job, name);
			} catch (IOException ioe) {
				throw ioe;
			}
		}
	}

	public void write(NutchDocument doc) throws IOException {
		for (int i = 0; i < this.indexWriters.length; i++) {
			try {
				this.indexWriters[i].write(doc);
			} catch (IOException ioe) {
				throw ioe;
			}
		}
	}

which again loops over every configured IndexWriter, writing the document to each backend index, for example Solr or ElasticSearch.

 

At this point, we have walked through the entire parsing and indexing flow and the principles behind it.

