Integrating the Open-Source Search Engine Nutch 1.0 with Paoding as a Plugin in Eclipse (Final Installment)

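     This article describes how to integrate the Paoding Chinese word segmenter into Nutch 1.0 as a plugin. Before integrating, first get Nutch 1.0 to compile in Eclipse. Then write a Chinese analyzer class, set up the plugin configuration files, and rebuild and repackage. If you have a Linux environment you can build directly; if not, you also need to download and configure a Linux emulation layer such as Cygwin.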

 

     I. Environment

         Tools: MyEclipse 6.5, JDK 1.6.0_14, Tomcat 6.0.20

         Software: Nutch 1.0

         Search for the downloads yourself (for example via Google) and install them.

 

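     II. Configuring Eclipse

        After you create the nutch project, compilation reports errors.

        1) Download the missing jars
    Download the MP3 and RTF jar files from http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/, and copy them into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib respectively.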

 

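         2) Fix the @Override errors
    org.apache.nutch.indexer.solr.SolrDeleteDuplicates;
    org.apache.nutch.util.domain.DomainStatistics;
        // Both classes report @Override errors; comment out the offending @Override annotations.
        // (These errors typically appear when the project's compiler compliance level is 1.5, which rejects @Override on interface method implementations.)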

 

        3) Fix the errors left over from licensing issues
    At this point the project is usually left with two errors. In the official Nutch 1.0 release these two problems were not fixed because of licensing issues. What follows is the most critical part.
    Edit RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf
    Add:    import org.apache.nutch.parse.ParseResult;
    Change: public Parse getParse(Content content) {
    to:     public ParseResult getParse(Content content) {
    Change: return new ParseStatus(ParseStatus.FAILED,
                       ParseStatus.FAILED_EXCEPTION,
                       e.toString()).getEmptyParse(conf);
    to:     return new ParseStatus(ParseStatus.FAILED,
            ParseStatus.FAILED_EXCEPTION,
              e.toString()).getEmptyParseResult(content.getUrl(), getConf());
    Change: return new ParseImpl(text,
                 new ParseData(ParseStatus.STATUS_SUCCESS,
                           title,
                           OutlinkExtractor.getOutlinks(text, this.conf),
                           content.getMetadata(),
                           metadata));
    to:     return ParseResult.createParseResult(content.getUrl(),
                     new ParseImpl(text,
                         new ParseData(ParseStatus.STATUS_SUCCESS,
                             title,
                             OutlinkExtractor.getOutlinks(text, this.conf),
                             content.getMetadata(),
                             metadata)));

    Edit TestRTFParser.java under src\plugin\parse-rtf\src\test\org\apache\nutch\parse\rtf
    Change: parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
    to:     parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
    After this step the project in Eclipse should show no more errors.

 

 

         III. Configuring the Paoding plugin

         1) Write the Chinese analyzer class, extending NutchAnalyzer

/**
* Paoding chinese analyzer
*/
package org.apache.nutch.analysis.zh;
// JDK imports
import java.io.Reader;
// Lucene imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
// Nutch imports
import org.apache.nutch.analysis.NutchAnalyzer;
/**
* A simple Chinese Analyzer that wraps the Lucene one.
* @author kevin tu
*/
public class ChineseAnalyzer extends NutchAnalyzer {   
    private final static Analyzer ANALYZER =
            new net.paoding.analysis.analyzer.PaodingAnalyzer();   
    /** Creates a new instance of ChineseAnalyzer */
    public ChineseAnalyzer() { }
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }

}

 

      2) Create the plugin directories analysis-zh and lib-paoding-analyzers under src/plugin

      Put the ChineseAnalyzer class written above under analysis-zh/src,

      then edit the plugin.xml file:

<plugin
   id="analysis-zh"
   name="Chinese Analysis Plug-in"
   version="1.0.0"
   provider-name="net.paoding.analysis">
   <runtime>
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-paoding-analyzers"/>
   </requires>
   <extension id="org.apache.nutch.analysis.zh"
              name="Chinese Analyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">
      <implementation id="ChineseAnalyzer"
                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
        <parameter name="lang" value="zh"/>
      </implementation>
   </extension>
</plugin>
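
A typo in the class attribute or the lang parameter of plugin.xml is easy to miss, because the descriptor is only read at runtime. The following is a minimal sketch, not part of Nutch, that reads a plugin.xml string with the JDK's DOM parser and reports the declared implementation class and lang value. The class name PluginXmlCheck and the method describeAnalyzer are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Nutch): inspects the analyzer declaration in plugin.xml text. */
final class PluginXmlCheck {

    /** Returns "implementationClass|lang" for the first implementation element, or null if there is none. */
    static String describeAnalyzer(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes("UTF-8")));
            NodeList impls = doc.getElementsByTagName("implementation");
            if (impls.getLength() == 0) {
                return null;
            }
            Element impl = (Element) impls.item(0);
            String lang = "";
            // Look for <parameter name="lang" value="..."/> inside the implementation element.
            NodeList params = impl.getElementsByTagName("parameter");
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                if ("lang".equals(p.getAttribute("name"))) {
                    lang = p.getAttribute("value");
                }
            }
            return impl.getAttribute("class") + "|" + lang;
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<plugin id=\"analysis-zh\"><extension point=\"org.apache.nutch.analysis.NutchAnalyzer\">"
                + "<implementation id=\"ChineseAnalyzer\" class=\"org.apache.nutch.analysis.zh.ChineseAnalyzer\">"
                + "<parameter name=\"lang\" value=\"zh\"/></implementation></extension></plugin>";
        System.out.println(describeAnalyzer(xml));
    }
}
```

The reported class name should be exactly the fully qualified name of the ChineseAnalyzer class written in step 1.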

 

         Edit build.xml:

<project name="analysis-zh" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-paoding-analyzers"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-paoding-analyzers/*.jar" />
    </fileset>
  </path>
</project>

       lib-paoding-analyzers is configured in the same way, so it is not repeated here.

 

 

      3) Edit build.xml under src\plugin
   <target name="deploy">
      <ant dir="analysis-zh" target="deploy"/>
      <ant dir="lib-paoding-analyzers" target="deploy"/>
      ...
   </target>

   <target name="clean">
      <ant dir="analysis-zh" target="clean"/>
      <ant dir="lib-paoding-analyzers" target="clean"/>
      ...
   </target>

      4) Edit nutch-default.xml and add |analysis-(zh)| to plugin.includes, so that the Paoding jar and your own analysis-zh jar are loaded
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>
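
To see which plugin ids the value above selects, here is a minimal sketch using only java.util.regex. It assumes that Nutch treats plugin.includes as a regular expression that must match a plugin id in full; that is an assumption about the loader, not something shown in this article. PluginIncludesCheck and isIncluded are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): tests plugin ids against a plugin.includes value. */
final class PluginIncludesCheck {

    /** Returns true when the whole plugin id matches the plugin.includes regular expression. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)"
                + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
                + "|urlnormalizer-(pass|regex|basic)";
        String[] ids = {"analysis-zh", "parse-html", "lib-paoding-analyzers"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

Note that lib-paoding-analyzers does not match the value, yet the startup log at the end of this article lists it as registered. Presumably it is pulled in because analysis-zh imports it in the requires section of its plugin.xml.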

 

 

          5) Edit the build.xml of the nutch project; in the war target, add:
      <lib dir="${build.dir}/analysis-zh">
              <include name="analysis-zh.jar"/>
      </lib>  
      <lib dir="${build.dir}/lib-paoding-analyzers">
              <include name="paoding-analysis.jar"/>
      </lib>

 

      IV. Rebuild

       ant package

     Note: Nutch 1.0 needs Ant 1.7.1, mainly because the touch task it uses requires Ant 1.7.1 support.

 

      V. Configure Tomcat: edit webapps/cse/WEB-INF/classes/nutch-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>

  <property><!-- points to the local index directory -->
    <name>searcher.dir</name>
    <value>/nutch/local/crawled</value>
  </property>
</configuration>

 

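     VI. Configure the runtime environment

     Paoding locates its dictionary files through PAODING_DIC_HOME, so set it in the environment that starts Tomcat:

     export PAODING_DIC_HOME=/nutch/dic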
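
If the variable is missing or points at the wrong place, the problem only shows up once the analyzer is first used. The following is a minimal pre-flight sketch using only the JDK; it checks that the variable is set and names an existing directory, and says nothing about whether the directory actually contains dictionary files. PaodingDicHomeCheck and isUsableDicHome are hypothetical names used only for this illustration.

```java
import java.io.File;

/** Hypothetical helper (not part of Paoding): sanity-checks a dictionary home path. */
final class PaodingDicHomeCheck {

    /** Returns true when the path is non-empty and names an existing directory. */
    static boolean isUsableDicHome(String path) {
        return path != null && !path.isEmpty() && new File(path).isDirectory();
    }

    public static void main(String[] args) {
        // Reads the same variable that the export line above sets.
        String dicHome = System.getenv("PAODING_DIC_HOME");
        System.out.println("PAODING_DIC_HOME=" + dicHome + " usable=" + isUsableDicHome(dicHome));
    }
}
```

Run it from the same shell that will start Tomcat, so that it sees the same environment.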

 

    VII. Run and test

     Open http://localhost:8080/ and check the log output:

 

2009-09-14 10:26:49,312 INFO  PluginRepository - Registered Plugins:
2009-09-14 10:26:49,312 INFO  PluginRepository -        the nutch core extension points (nutch-extensionpoints)
2009-09-14 10:26:49,312 INFO  PluginRepository -        Basic Query Filter (query-basic)
2009-09-14 10:26:49,312 INFO  PluginRepository -        Basic URL Normalizer (urlnormalizer-basic)
2009-09-14 10:26:49,312 INFO  PluginRepository -        Paoding Analysers (lib-paoding-analyzers)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Html Parse Plug-in (parse-html)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Basic Indexing Filter (index-basic)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Basic Summarizer Plug-in (summary-basic)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Site Query Filter (query-site)
2009-09-14 10:26:49,328 INFO  PluginRepository -        HTTP Framework (lib-http)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Text Parse Plug-in (parse-text)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Regex URL Filter (urlfilter-regex)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Http Protocol Plug-in (protocol-http)
2009-09-14 10:26:49,328 INFO  PluginRepository -        XML Response Writer Plug-in (response-xml)
2009-09-14 10:26:49,328 INFO  PluginRepository -        Regex URL Normalizer (urlnormalizer-regex)
2009-09-14 10:26:49,328 INFO  PluginRepository -        OPIC Scoring Plug-in (scoring-opic)
2009-09-14 10:26:49,343 INFO  PluginRepository -        CyberNeko HTML Parser (lib-nekohtml)
2009-09-14 10:26:49,343 INFO  PluginRepository -        Anchor Indexing Filter (index-anchor)
2009-09-14 10:26:49,343 INFO  PluginRepository -        JavaScript Parser (parse-js)
2009-09-14 10:26:49,343 INFO  PluginRepository -        URL Query Filter (query-url)
2009-09-14 10:26:49,343 INFO  PluginRepository -        Chinese Analysis Plug-in (analysis-zh)
2009-09-14 10:26:49,343 INFO  PluginRepository -        Regex URL Filter Framework (lib-regex-filter)
2009-09-14 10:26:49,343 INFO  PluginRepository -        JSON Response Writer Plug-in (response-json)
2009-09-14 10:26:49,343 INFO  PluginRepository - Registered Extension-Points:
2009-09-14 10:26:49,359 INFO  PluginRepository -        Nutch Summarizer (org.apache.nutch.searcher.Summarizer)

 

Here, Chinese Analysis Plug-in (analysis-zh) is the Chinese word segmentation plugin configured above.
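
Scanning the log by eye works, but the check can also be automated. The following is a minimal sketch that extracts the plugin ids from log text in the format shown above and lets you test for analysis-zh. It assumes the log layout stays exactly as in this article (a "Registered Plugins:" line, one plugin per line with the id in parentheses, then a "Registered Extension-Points:" line). RegisteredPluginsCheck and registeredIds are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): lists the plugin ids found in a PluginRepository log. */
final class RegisteredPluginsCheck {

    // Matches e.g. "... PluginRepository -        Chinese Analysis Plug-in (analysis-zh)"
    private static final Pattern PLUGIN_LINE =
            Pattern.compile("PluginRepository -\\s+.*\\(([^()\\s]+)\\)\\s*$");

    /** Returns the ids listed between "Registered Plugins:" and "Registered Extension-Points:". */
    static List<String> registeredIds(String log) {
        List<String> ids = new ArrayList<String>();
        boolean inPlugins = false;
        for (String line : log.split("\\r?\\n")) {
            if (line.contains("Registered Plugins:")) {
                inPlugins = true;
                continue;
            }
            if (line.contains("Registered Extension-Points:")) {
                break; // extension points follow; they are not plugin ids
            }
            if (inPlugins) {
                Matcher m = PLUGIN_LINE.matcher(line);
                if (m.find()) {
                    ids.add(m.group(1));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String log = "2009-09-14 10:26:49,312 INFO  PluginRepository - Registered Plugins:\n"
                + "2009-09-14 10:26:49,343 INFO  PluginRepository -        Chinese Analysis Plug-in (analysis-zh)\n"
                + "2009-09-14 10:26:49,343 INFO  PluginRepository - Registered Extension-Points:\n";
        System.out.println("analysis-zh registered: " + registeredIds(log).contains("analysis-zh"));
    }
}
```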

 

All done. Go have some fun with Paoding; the segmentation results are spot on.

 

 

 

 
