The previous article walked through the fetch and parse flow. The key takeaway: during parsing, Nutch looks up the parsers registered for the page's ContentType and calls them one by one, returning as soon as one of them parses successfully; otherwise it moves on to the next parser. Before returning, the result also passes through every registered HtmlParseFilter (at least that is how HtmlParser behaves).
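As a quick refresher, that dispatch looks roughly like this. This is a paraphrase of ParseUtil in the 1.7 source, with names and error handling simplified, not the verbatim code:

// Paraphrased from org.apache.nutch.parse.ParseUtil (Nutch 1.7), simplified:
// fetch all parsers registered for this ContentType, then try them in order.
Parser[] parsers = this.parserFactory.getParsers(
    content.getContentType(), content.getUrl());
ParseResult parseResult = null;
for (int i = 0; i < parsers.length; i++) {
  parseResult = parsers[i].getParse(content);
  if (parseResult != null && parseResult.isSuccess()) {
    // the first parser that succeeds wins; HtmlParser has already run
    // the registered HtmlParseFilters internally before returning
    return parseResult;
  }
}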
~~~~~~~~~~~~~~~~~~~~~~~~~~
Now let's look at the indexing process.
When we type

./bin/nutch index

what actually happens? A look at the 1.7 version of the bin/nutch script shows:
elif [ "$COMMAND" = "index" ] ; then CLASS=org.apache.nutch.indexer.IndexingJob
So let's look at the IndexingJob class. Start with the method
public int run(String[] args) throws Exception {
This method first parses the command-line arguments, then calls
try {
  index(crawlDb, linkDb, segments, noCommit, deleteGone, params, filter, normalize);
  return 0;
} catch (final Exception e) {
  LOG.error("Indexer: " + StringUtils.stringifyException(e));
  return -1;
}
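Putting the two halves together, here is a condensed paraphrase of the argument parsing in run(). Usage output and some details are trimmed; the flag names match the command line used further below:

// Condensed paraphrase of IndexingJob.run() argument parsing (Nutch 1.7):
Path crawlDb = new Path(args[0]);
Path linkDb = null;
List<Path> segments = new ArrayList<Path>();
boolean noCommit = false, deleteGone = false, filter = false, normalize = false;
String params = null;

for (int i = 1; i < args.length; i++) {
  if (args[i].equals("-linkdb")) {
    linkDb = new Path(args[++i]);
  } else if (args[i].equals("-dir")) {
    // every subdirectory under -dir is treated as one segment
    Path dir = new Path(args[++i]);
    FileSystem fs = dir.getFileSystem(getConf());
    for (FileStatus fstat : fs.listStatus(dir)) {
      segments.add(fstat.getPath());
    }
  } else if (args[i].equals("-noCommit")) {
    noCommit = true;
  } else if (args[i].equals("-deleteGone")) {
    deleteGone = true;
  } else if (args[i].equals("-filter")) {
    filter = true;
  } else if (args[i].equals("-normalize")) {
    normalize = true;
  } else if (args[i].equals("-params")) {
    params = args[++i];
  } else {
    segments.add(new Path(args[i]));  // a bare argument is a single segment
  }
}
// ... then the try/catch shown above calls index(...)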
So we follow it into
public void index(Path crawlDb, Path linkDb, List<Path> segments,
    boolean noCommit, boolean deleteGone, String params,
    boolean filter, boolean normalize) throws IOException {
Look at line 100 of the code:
IndexWriters writers = new IndexWriters(getConf());
LOG.info(writers.describe());
If you look at the IndexWriters constructor, you will find that it, too, obtains all available IndexWriter implementations through the plugin mechanism.
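A condensed sketch of what that constructor does, paraphrased from the 1.7 source with object caching and error handling trimmed:

// Paraphrased from the IndexWriters(Configuration) constructor in Nutch 1.7:
// collect every plugin that implements the IndexWriter extension point.
ExtensionPoint point = PluginRepository.get(conf)
    .getExtensionPoint(IndexWriter.X_POINT_ID);
Extension[] extensions = point.getExtensions();
List<IndexWriter> writers = new ArrayList<IndexWriter>();
for (Extension extension : extensions) {
  // getExtensionInstance() instantiates the writer and calls its setConf();
  // this is exactly where SolrIndexWriter will throw in the error shown below
  writers.add((IndexWriter) extension.getExtensionInstance());
}
this.indexWriters = writers.toArray(new IndexWriter[writers.size()]);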
PS: Nutch itself only fixes the overall workflow; each step can be hooked via a Plugin. This makes it very convenient to inject our own logic later, for example to meet customization needs.
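For instance, a minimal custom IndexWriter skeleton. This is only a sketch, assuming the 1.7 interface (the method set is inferred from the open()/write() fan-out quoted near the end of this article); the class name and behavior are hypothetical, and it would still need its own plugin.xml plus an entry in plugin.includes to be picked up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

// A hypothetical IndexWriter that just dumps documents to stdout.
public class ConsoleIndexWriter implements IndexWriter {
  private Configuration conf;

  public void open(JobConf job, String name) throws IOException {
    // nothing to open for stdout
  }

  public void write(NutchDocument doc) throws IOException {
    System.out.println(doc);  // print instead of sending to a backend
  }

  public void delete(String key) throws IOException { }
  public void update(NutchDocument doc) throws IOException { }
  public void commit() throws IOException { }
  public void close() throws IOException { }

  public String describe() {
    return "ConsoleIndexWriter: dumps indexed documents to stdout";
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}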
~~~~~~~~~~~~~
If you run the following at this point:
./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/ -dir ./data/segments/ -deleteGone -filter -normalize
the system reports an error:
Indexer: starting at 2014-06-26 14:49:35
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Indexer: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
    at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:100)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
Why? If you have come to know Nutch, you can work out the cause yourself. The stack trace already points at it: getExtensionInstance() calls SolrIndexWriter.setConf() (line 192), which throws because solr.server.url has not been set. That is the nice thing about studying open source: once you have read the code, you know where the problem is.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The reason the job ends up targeting Solr is that the configuration enables the indexer-solr plugin. Before any modification, the plugin section of nutch-default.xml reads:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
See? indexer-solr is right there in the value. Copy this property into nutch-site.xml, then remove indexer-solr and replace it with indexer-elastic, as sketched below.
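The override in nutch-site.xml would then look roughly like this. Note this is a sketch: the stock 1.7 distribution does not ship an indexer-elastic plugin, so it assumes your build provides one:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>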
Re-run the command:
./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/ -dir ./data/segments/ -deleteGone -filter -normalize
The system again reports an error, this time about missing configuration items. Try to solve it yourself.
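One hint: just as SolrIndexWriter demanded solr.server.url, the ElasticSearch writer has its own mandatory settings. In the indexer-elastic plugin as it appears in later Nutch 1.x sources, the keys are prefixed with "elastic."; the values below are hypothetical, so verify the authoritative list against the writer's describe() output in your own build:

<!-- Hypothetical values; key names taken from the indexer-elastic
     plugin in later 1.x sources. Verify against your build. -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>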
~~~~~~~~~~~~~~~~~~~~~~~~~~
The official descriptions of IndexWriter and IndexingFilter:
IndexWriter -- Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
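To make the IndexingFilter side concrete, here is a minimal sketch assuming the 1.7 interface (the filter() argument list matches the filters.filter(...) call quoted below); the class and field name "pagesize" are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// A hypothetical IndexingFilter that adds one metadata field to every
// document before it reaches the IndexWriters.
public class PageSizeIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // hypothetical field "pagesize": length of the extracted text
    doc.add("pagesize", String.valueOf(parse.getText().length()));
    return doc;  // returning null would drop the document entirely
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}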
~~~~~~~~~~~~~~~~~~~~~~~~~~
How IndexWriter and IndexingFilter are wired together: look at line 272 of IndexerMapReduce.java,
// run indexing filters
doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
This simply calls each IndexingFilter in turn; as long as the result is not null, the next filter is invoked (a null result means the document is dropped).
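A condensed look at that chain, paraphrasing IndexingFilters.filter() in 1.7:

// Paraphrased from IndexingFilters.filter(...) in Nutch 1.7:
public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  for (int i = 0; i < this.indexingFilters.length; i++) {
    doc = this.indexingFilters[i].filter(doc, parse, url, datum, inlinks);
    // stop the loop if an indexing filter discards the document
    if (doc == null) return null;
  }
  return doc;
}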
The filtered final document is then written out through
public void open(JobConf job, String name) throws IOException {
  for (int i = 0; i < this.indexWriters.length; i++) {
    try {
      this.indexWriters[i].open(job, name);
    } catch (IOException ioe) {
      throw ioe;
    }
  }
}

public void write(NutchDocument doc) throws IOException {
  for (int i = 0; i < this.indexWriters.length; i++) {
    try {
      this.indexWriters[i].write(doc);
    } catch (IOException ioe) {
      throw ioe;
    }
  }
}
which again iterates over every registered IndexWriter, writing the document into each enabled backend, such as Solr or ElasticSearch.
At this point we have analyzed the entire parsing and indexing process and the principles behind it.