Nutch: writing an IndexingFilter plugin

Reference: http://blog.csdn.net/amuseme_lu/article/details/6780244


1 Create a package structure similar to that of urlfilter-regex

How to generate the code path: http://www.cnblogs.com/i80386/archive/2012/09/04/2670670.html


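2 Write the IndexingFilter implementation

package org.apache.nutch.indexer.myfield;

// Imports follow the Nutch 1.x package layout that ships LuceneWriter.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {

    public static final Log LOG = LogFactory.getLog(MyIndexingFilter.class);

    private Configuration conf;

    // Register the custom "mt" field with the Lucene backend: stored and tokenized.
    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions("mt", LuceneWriter.STORE.YES,
                LuceneWriter.INDEX.TOKENIZED, conf);
    }

    private NutchDocument addMyField(NutchDocument doc) {
        System.out.println("银河系"); // debug output ("Milky Way")
        String value = "银河系";
        // A fixed value is used here; in practice the target field
        // should be extracted from the HTML.
        doc.add("mt", value);
        return doc;
    }

    // Adds the "mt" field to every document that passes through the filter.
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        addMyField(doc);
        return doc;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}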
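The filter above writes a fixed value into "mt". As a rough idea of what "extract the target field from the HTML" could look like, here is a minimal sketch that pulls the content of a <meta> tag out of an HTML string using only the JDK. The class MetaFieldExtractor and the method extractMeta are hypothetical names that are not part of Nutch, and how the HTML (or already parsed metadata) reaches the filter through the Parse argument is not shown in this post, so that wiring is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds <meta name="..." content="...">.
class MetaFieldExtractor {
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Quoted attribute values; the backreference \1 requires matching quotes.
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("\\bcontent\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the content of the first meta tag whose name equals metaName.
    static Optional<String> extractMeta(String html, String metaName) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            Matcher name = NAME_ATTR.matcher(tag);
            Matcher content = CONTENT_ATTR.matcher(tag);
            if (name.find() && name.group(2).equalsIgnoreCase(metaName)
                    && content.find()) {
                return Optional.of(content.group(2));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"mt\" content=\"Milky Way\"></head></html>";
        // Fall back to an empty string when the page has no such tag.
        System.out.println(extractMeta(html, "mt").orElse(""));
    }
}
```

A regular expression is only good enough for simple, well-formed tags with quoted attributes; for real pages a proper HTML parser is the safer choice. Inside addMyField, the extracted value would take the place of the fixed string.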

3 Build the jar (Build Fat Jar); plugin.xml below expects it to be named myfield.jar

4 Create plugin.xml

<plugin
   id="index-myfield"
   name="my Indexing Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="myfield.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.indexer.myfield"
              name="Nutch My Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="MyIndexingFilter"
                      class="org.apache.nutch.indexer.myfield.MyIndexingFilter"/>
   </extension>

</plugin>
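A typo in plugin.xml (the plugin id, the jar name, or the class name) is easy to make, and the plugin will then not do anything useful. The following is a small sanity-check sketch, assuming only the JDK's built-in XML parser: it reads the descriptor and prints the values that have to line up with the jar, the Java class, and the plugin.includes setting. PluginXmlCheck and summarize are hypothetical names, and the descriptor is inlined as a string to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch: summarizes a plugin.xml descriptor.
class PluginXmlCheck {
    // The descriptor from step 4, inlined so the sketch runs on its own.
    static final String PLUGIN_XML =
            "<plugin id=\"index-myfield\" name=\"my Indexing Filter\""
          + " version=\"1.0.0\" provider-name=\"nutch.org\">"
          + "<runtime><library name=\"myfield.jar\"><export name=\"*\"/></library></runtime>"
          + "<requires><import plugin=\"nutch-extensionpoints\"/></requires>"
          + "<extension id=\"org.apache.nutch.indexer.myfield\""
          + " name=\"Nutch My Indexing Filter\""
          + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
          + "<implementation id=\"MyIndexingFilter\""
          + " class=\"org.apache.nutch.indexer.myfield.MyIndexingFilter\"/>"
          + "</extension></plugin>";

    // Reads one attribute from the first element with the given tag name.
    private static String attr(Document doc, String tag, String attribute) {
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        return element.getAttribute(attribute);
    }

    // Returns { plugin id, library jar, extension point, implementation class }.
    static String[] summarize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return new String[] {
                attr(doc, "plugin", "id"),
                attr(doc, "library", "name"),
                attr(doc, "extension", "point"),
                attr(doc, "implementation", "class")
            };
        } catch (Exception e) {
            throw new IllegalStateException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String[] s = summarize(PLUGIN_XML);
        System.out.println("plugin id:      " + s[0]);
        System.out.println("library jar:    " + s[1]);
        System.out.println("extension point: " + s[2]);
        System.out.println("implementation: " + s[3]);
    }
}
```

The plugin id should be the name used in plugin.includes, the library name should be the jar built in step 3, and the implementation class should be the fully qualified name of the filter from step 2.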

5 Finally, put the built jar and plugin.xml into the E:\nutch\src\plugin\index-myfield folder

6 Edit conf\nutch-site.xml

<configuration>
    <property>
      <name>searcher.dir</name>
      <value>E:/crawl_2</value>
    </property>
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor|myfield)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>
</configuration>
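Since plugin.includes is a regular expression, it is worth checking that the new plugin name really matches before starting a crawl. Below is a minimal sketch that tests names against the value configured above; it assumes the expression must match the whole name (not just a part of it), and PluginIncludesCheck and isIncluded are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: tests names against plugin.includes.
class PluginIncludesCheck {
    // The plugin.includes value from nutch-site.xml above.
    static final String PLUGIN_INCLUDES =
            "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
          + "|index-(basic|anchor|myfield)|scoring-opic"
          + "|urlnormalizer-(pass|regex|basic)";

    private static final Pattern INCLUDES = Pattern.compile(PLUGIN_INCLUDES);

    // Assumption: a plugin is included only if the whole name matches.
    static boolean isIncluded(String pluginName) {
        return INCLUDES.matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = { "index-myfield", "index-basic", "index-more", "protocol-httpclient" };
        for (String name : names) {
            System.out.println(name + " included: " + isIncluded(name));
        }
    }
}
```

With whole-name matching, index-myfield is accepted while a name such as protocol-httpclient is not, even though it starts with protocol-http.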

7 Start nutch

8 Search in solr

9 The field we need can now be found in the search results
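To check the new field from a browser or a script, one option is to query Solr for the "mt" field directly. The sketch below only builds such a query URL. The base address http://localhost:8983/solr and the /select handler are assumptions about the Solr setup, which this post does not show, and SolrFieldQuery and buildSelectUrl are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr select URL for a field:value query.
class SolrFieldQuery {
    static String buildSelectUrl(String solrBase, String field, String value) {
        try {
            // URL-encode the query so ':' and non-ASCII characters are safe.
            String query = URLEncoder.encode(field + ":" + value, "UTF-8");
            return solrBase + "/select?q=" + query;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u94F6\u6CB3\u7CFB" is the fixed value written by MyIndexingFilter.
        // The base URL is an assumed local Solr instance.
        System.out.println(
                buildSelectUrl("http://localhost:8983/solr", "mt", "\u94F6\u6CB3\u7CFB"));
    }
}
```

Opening the printed URL should list the crawled documents, provided the Solr schema accepts the "mt" field.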


Note: if I do not manually build the jar and put it into the index-myfield folder, and instead only edit nutch-site.xml to add index-(basic|anchor|myfield)

 

 
