Nutch 1.3 学习笔记 外传 扩展Nutch插件实现自定义索引字段

扩展Nutch插件实现自定义索引字段

1.Nutch与Solr的使用介绍

  1.1 一些基本的配置

  • 在conf/nutch-site.xml加入http.agent.name的属性
  •  生成一个种子文件夹,mkdir -p urls,在其中生成一个种子文件,在这个文件中写入一个url,如http://nutch.apache.org/
  •  编辑conf/regex-urlfilter.txt文件,配置url过滤器,一般用默认的好了,也可以加入如下配置,只抓取nutch.apache.org这个网址 +^http://([a-z0-9]*\.)*nutch.apache.org/
  •  使用如下命令来抓取网页
 bin/nutch crawl urls -dir crawl -depth 3 -topN 5 说明: -dir 抓取结果目录名 -depth 抓取的深度 -topN 最一层的最大抓取个数 一般抓取完成后会看到如下的目录 crawl/crawldb crawl/linkdb crawl/segments
  • 使用如下来建立索引
 bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/* 使用这个命令的前提是你已经开启了默认的solr服务 开启默认solr服务的命令如下 cd ${APACHE_SOLR_HOME}/example java -jar start.jar 这个时候服务就开启了 你可以在浏览器中输入如下地址进行测试 http://localhost:8983/solr/admin/ http://localhost:8983/solr/admin/stats.jsp
  但是要结合Nutch来使用solr,还要在solr中加一个相应的策略配置,在nutch的conf目录中有一个默认的配置,把它复制到solr的相应目录中就可以使用了 cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/ 这个时候要重新启动一下solr
 索引建立完成以后就你就可以用关键词进行查询,solr默认返回的是一个xml文件


2.Nutch的索引过滤插件介绍

 除了一些元数据,如segment,boost,digest,nutch的其它索引字段都是通过一个叫做索引过滤的插件来完成的,如index-basic,index-more,index-anchor,这些都是通过nutch的插件机制来完成索引文件的字段生成,如果你要自定义相应的索引字段,你就要实现IndexingFilter这个接口,其定义如下:

/** Extension point for indexing.  Permits one to add metadata to the indexed
 * fields.  All plugins found which implement this extension point are run
 * sequentially on the parse.
 */
public interface IndexingFilter extends Pluggable, Configurable {
  /** The name of the extension point. */
  final static String X_POINT_ID = IndexingFilter.class.getName();




  /**
   * Adds fields or otherwise modifies the document that will be indexed for a
   * parse. Unwanted documents can be removed from indexing by returning a null value.
   * 
   * @param doc document instance for collecting fields
   * @param parse parse data instance
   * @param url page url
   * @param datum crawl datum for the page
   * @param inlinks page inlinks
   * @return modified (or a new) document instance, or null (meaning the document
   * should be discarded)
   * @throws IndexingException
   */
  NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
    throws IndexingException;
}


主要是实现filter这个抽象接口.
对于NutchDocument的生成都在IndexerMapReduce.java中的reduce方法中,其中调用了indexing filters插件来进行索引字段的设置。


3.写一个自己的索引过滤插件


现在如果我们有这样的需求,要自定义索引文件的字段值,如要再生成一个metadata与fetchTime字段,并且在Solr中能查询显示出来,下面是相应插件的代码与一些说明
这个时候要首先要写一个索引过滤器,代码如下:


package org.apache.nutch.indexer.metadata;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

import java.util.Date;
import org.apache.hadoop.conf.Configuration;

/**
 * Add (or reset) a few metaData properties as respective fields (if they are
 * available), so that they can be displayed by more.jsp (called by search.jsp).
 * 
 * @author Lemo lu
 */


public class MetaDataIndexingFilter implements IndexingFilter {
 public static final Logger LOG = LoggerFactory
 .getLogger(MetaDataIndexingFilter.class);
 private Configuration conf;

 public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
 CrawlDatum datum, Inlinks inlinks) throws IndexingException {

 // add metadata field
 addMetaData(doc, parse.getData(), datum);
 // add fetch time field
 addFetchTime(doc, parse.getData(),datum);

 return doc;
 }

 private NutchDocument addFetchTime(NutchDocument doc, ParseData data,CrawlDatum datum)
 {
 long fetchTime = datum.getFetchTime();
 doc.add("fetchTime",new Date(fetchTime));
 return doc;
 }
 private NutchDocument addMetaData(NutchDocument doc, ParseData data,
                             CrawlDatum datum) {
     String metadata = data.getParseMeta().toString();
     doc.add("metadata", metadata);
     return doc;
   }
 public void setConf(Configuration conf) {
 this.conf = conf;
 }

 public Configuration getConf() {
 return this.conf;
 }
}


这个时候插件就写好了,打成jar包放到plugins目录下的index-metadata目录就行,这个index-metadata要自己建立的。然后再写一个相应的plugin.xml配置文件,这样可以让nutch的插件正确的动态加载这些模块,plugin.xml如下:


<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<plugin
   id="index-metadata"
   name="Metadata Indexing Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="index-metadata.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.indexer.more"
              name="Nutch MetaData Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="MetaDataIndexingFilter"
                      class="org.apache.nutch.indexer.metadata.MetaDataIndexingFilter"/>
   </extension>
</plugin>


这个时候自己定义的索引过滤器就完成了,
下面要再配置两个文件,一个是schema.xml,在其中的fields标签下加入如下代码
 <!-- metadata fields -->
 <field name="fetchTime" type="date" stored="true" indexed="true"/>
 <field name="metadata" type="string" stored="true" indexed="true"/>


说明:
其中的stored表示这个字段的值要存储在lucene的索引中
其中的indexed表示这个字段的值是不是要进行分词查询

还有一个是solrindex-mapping.xml文件,这个文件的作用是把索引过滤器中生成的字段名与schema.xml中的做一个对应关系,要在其fields标签中加入如下代码:
 <field dest="fetchTime" source="fetchTime"/>
 <field dest="metadata" source="metadata"/>

这样自定义索引过滤插件就算完成了,记得这里的schema.xml文件是在solr/conf目录下的,修改以后要重启一下,不知道solr支不支持修改了配置文件后不重启就可以生效。


4.建立索引


再使用上面的命令重建立一下索引

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*

这时候可以看到索引已经建立完成,对了,solr的索引文件在solr/data/index中,你可以用luke这个工具加开其索引文件,看一下其中的一些元信息,这个时候你就应该可以看到fetchTime与metadata这两个字段了

如图:



5.查询


这个时候你可以打开浏览器,在其中输入http://localhost:8983/solr/admin/,再输入一些查询条件,你就可以看到结果了,大概结果内容如下:


This XML file does not appear to have any style information associated with it. The document tree is shown below.
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">a</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<float name="boost">1.1090536</float>
<str name="digest">da3aefc69d8a5a7c1ea5447f9680d66d</str>
<date name="fetchTime">2012-04-11T03:19:33.088Z</date>
<str name="id">http://nutch.apache.org/</str>
<str name="metadata">
CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
</str>
<str name="segment">20120410231836</str>
<str name="title">Welcome to Apache Nutch®</str>
<date name="tstamp">2012-04-11T03:19:33.088Z</date>
<str name="url">http://nutch.apache.org/</str>
</doc>
</result>
</response>

有没有看到我们定义的两个字段的值


6.参考

http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search

你可能感兴趣的:(apache,filter,Solr,扩展,extension,permissions)