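This article describes how to integrate the Paoding Chinese word segmenter into Nutch 1.0 as a plugin. Before integrating it, first get Nutch 1.0 to compile in Eclipse. Then write a Chinese analyzer class, set up the plugin configuration files, and rebuild and repackage. If you have a Linux environment you can build directly; if not, you also need to download and configure a Linux emulation layer such as Cygwin.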
1. Environment
Tools: MyEclipse 6.5, JDK 1.6.0_14, Tomcat 6.0.20
Software: Nutch 1.0
Search for, download, and install these yourself.
2. Configuring Eclipse
After you create the Nutch project, the build reports errors.
1) Download the missing jars
Download the MP3 and RTF jar files from http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/, and copy them into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively.
2) Fix the @Override errors
In the following classes, comment out the @Override annotations that the compiler reports as errors:
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
org.apache.nutch.util.domain.DomainStatistics
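3) Fix the errors left over from licensing issues
At this point the project typically still has two errors. In the official Nutch 1.0 release these two problems were left unfixed because of licensing issues. What follows is the most important part.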
Edit RTFParseFactory.java under src\plugin\parse-rtf\src\java\org\apache\nutch\parse\rtf.
Add import org.apache.nutch.parse.ParseResult;
Change public Parse getParse(Content content) {
to public ParseResult getParse(Content content) {
Change return new ParseStatus(ParseStatus.FAILED,
    ParseStatus.FAILED_EXCEPTION,
    e.toString()).getEmptyParse(conf);
to return new ParseStatus(ParseStatus.FAILED,
    ParseStatus.FAILED_EXCEPTION,
    e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change return new ParseImpl(text,
    new ParseData(ParseStatus.STATUS_SUCCESS,
        title,
        OutlinkExtractor.getOutlinks(text, this.conf),
        content.getMetadata(),
        metadata));
to return ParseResult.createParseResult(content.getUrl(),
    new ParseImpl(text,
        new ParseData(ParseStatus.STATUS_SUCCESS,
            title,
            OutlinkExtractor.getOutlinks(text, this.conf),
            content.getMetadata(),
            metadata)));
Edit TestRTFParser.java under src\plugin\parse-rtf\src\test\org\apache\nutch\parse\rtf.
Change parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
At this point the project in Eclipse should no longer show any errors.
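The edits above all follow from one API change: in Nutch 1.0 a parser returns a ParseResult, which holds parses keyed by URL, rather than a single Parse. That is why the test now has to call .get(urlString). Below is a minimal stand-alone sketch of that idea using only the JDK. MiniParseResult and its methods are hypothetical stand-ins invented for illustration; they are not the Nutch classes and leave out everything except the URL-keyed lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a URL-keyed parse container.
// This is NOT org.apache.nutch.parse.ParseResult; it only mirrors the idea.
class MiniParseResult {
    private final Map<String, String> parsesByUrl = new HashMap<String, String>();

    // Wraps a single parse under the URL of the content it came from.
    static MiniParseResult create(String url, String parsedText) {
        MiniParseResult result = new MiniParseResult();
        result.parsesByUrl.put(url, parsedText);
        return result;
    }

    // Looks up the parse for one URL; returns null when there is none.
    String get(String url) {
        return parsesByUrl.get(url);
    }

    public static void main(String[] args) {
        String url = "http://example.com/doc.rtf"; // illustrative URL only
        MiniParseResult result = MiniParseResult.create(url, "hello rtf");
        // The caller has to say which URL it wants, as in .get(urlString).
        System.out.println(result.get(url));
    }
}
```

With a container like this, code that used to receive one parse directly has to pick it out by URL, which is the pattern the TestRTFParser change follows.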
3. Configuring the Paoding plugin
1) Write the Chinese analyzer class, extending NutchAnalyzer
/**
* Paoding chinese analyzer
*/
package org.apache.nutch.analysis.zh;
// JDK imports
import java.io.Reader;
// Lucene imports
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
// Nutch imports
import org.apache.nutch.analysis.NutchAnalyzer;
/**
* A simple Chinese Analyzer that wraps the Lucene one.
* @author kevin tu
*/
public class ChineseAnalyzer extends NutchAnalyzer {
    private final static Analyzer ANALYZER =
        new net.paoding.analysis.analyzer.PaodingAnalyzer();

    /** Creates a new instance of ChineseAnalyzer */
    public ChineseAnalyzer() { }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return ANALYZER.tokenStream(fieldName, reader);
    }
}
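2) Create the plugin directories analysis-zh and lib-paoding-analyzers under src/plugin
Put the ChineseAnalyzer class written above under analysis-zh/src (Nutch plugins conventionally keep sources under src/java, following the package path),
then edit the plugin.xml file: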
<plugin
    id="analysis-zh"
    name="Chinese Analysis Plug-in"
    version="1.0.0"
    provider-name="net.paoding.analysis">
  <runtime>
    <library name="analysis-zh.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-paoding-analyzers"/>
  </requires>
  <extension id="org.apache.nutch.analysis.zh"
      name="Chinese Analyzer"
      point="org.apache.nutch.analysis.NutchAnalyzer">
    <implementation id="ChineseAnalyzer"
        class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
      <parameter name="lang" value="zh"/>
    </implementation>
  </extension>
</plugin>
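A typo in the plugin id, a required plugin, or the implementation class name only shows up at runtime, so it can help to read the descriptor back before building. The following is a rough sketch using the JDK's DOM parser. The class PluginDescriptorCheck and its helper methods are hypothetical names made up for this example, and the XML string is a trimmed copy of the descriptor above, not the full file.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for eyeballing a plugin.xml; not part of Nutch.
class PluginDescriptorCheck {
    // Parses an XML string; wraps any parser failure in an unchecked exception.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + e, e);
        }
    }

    // Collects the plugin ids named by <import plugin="..."/> elements.
    static List<String> requiredPlugins(Document doc) {
        List<String> ids = new ArrayList<String>();
        NodeList imports = doc.getElementsByTagName("import");
        for (int i = 0; i < imports.getLength(); i++) {
            ids.add(((Element) imports.item(i)).getAttribute("plugin"));
        }
        return ids;
    }

    // Returns the class attribute of the first <implementation>, or null.
    static String implementationClass(Document doc) {
        NodeList impls = doc.getElementsByTagName("implementation");
        if (impls.getLength() == 0) {
            return null;
        }
        return ((Element) impls.item(0)).getAttribute("class");
    }

    public static void main(String[] args) {
        // Trimmed copy of the analysis-zh descriptor from this article.
        String xml =
            "<plugin id=\"analysis-zh\" name=\"Chinese Analysis Plug-in\">"
          + "<requires>"
          + "<import plugin=\"nutch-extensionpoints\"/>"
          + "<import plugin=\"lib-paoding-analyzers\"/>"
          + "</requires>"
          + "<extension point=\"org.apache.nutch.analysis.NutchAnalyzer\">"
          + "<implementation id=\"ChineseAnalyzer\""
          + " class=\"org.apache.nutch.analysis.zh.ChineseAnalyzer\"/>"
          + "</extension>"
          + "</plugin>";
        Document doc = parse(xml);
        System.out.println("plugin id: " + doc.getDocumentElement().getAttribute("id"));
        System.out.println("requires:  " + requiredPlugins(doc));
        System.out.println("class:     " + implementationClass(doc));
    }
}
```

The printed class name should match the package and class of the analyzer source file exactly, and each required id should correspond to a directory under src/plugin.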
Edit build.xml:
<project name="analysis-zh" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-paoding-analyzers"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-paoding-analyzers/*.jar" />
    </fileset>
  </path>
</project>
lib-paoding-analyzers is configured in the same way, so it is not repeated here.
3) Edit build.xml under src\plugin
<target name="deploy">
  <ant dir="analysis-zh" target="deploy"/>
  <ant dir="lib-paoding-analyzers" target="deploy"/>
  ...
</target>
<target name="clean">
  <ant dir="analysis-zh" target="clean"/>
  <ant dir="lib-paoding-analyzers" target="clean"/>
  ...
</target>
4) Edit nutch-default.xml: add |analysis-(zh)| to plugin.includes so that the Paoding jar and your own analysis-zh jar are loaded
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>
  </description>
</property>
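The plugin.includes value is a regular expression over plugin ids. As far as I understand Nutch's plugin loading, a plugin is kept only when the whole id matches, so a quick way to check an edit to this value is to test ids against it. The sketch below does that with java.util.regex. PluginIncludesCheck is a hypothetical helper, and the whole-id matching rule is my assumption about Nutch's behavior rather than something this article states.

```java
import java.util.regex.Pattern;

// Hypothetical helper: tests plugin ids against a plugin.includes value.
class PluginIncludesCheck {
    // Assumption: the entire plugin id has to match the expression.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The plugin.includes value from this article.
        String includes = "protocol-http|urlfilter-regex|parse-(text|html|js)"
            + "|analysis-(zh)|index-basic|query-(basic|site|url)"
            + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";
        String[] ids = {
            "analysis-zh", "parse-html", "lib-paoding-analyzers", "analysis-fr"
        };
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

Note that lib-paoding-analyzers does not match the expression itself. It still appears among the registered plugins in the startup log, presumably because analysis-zh names it under requires in its plugin.xml.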
5) Edit the Nutch project's build.xml and add the following to the war target
<lib dir="${build.dir}/analysis-zh">
  <include name="analysis-zh.jar"/>
</lib>
<lib dir="${build.dir}/lib-paoding-analyzers">
  <include name="paoding-analysis.jar"/>
</lib>
4. Rebuilding
ant package
Note: Nutch 1.0 requires Ant 1.7.1, mainly because the touch task used by the build needs Ant 1.7.1 support.
5. Configuring Tomcat: edit webapps/cse/WEB-INF/classes/nutch-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
  </property>
  <property><!-- points to the local index directory -->
    <name>searcher.dir</name>
    <value>/nutch/local/crawled</value>
  </property>
  <property>
  </property>
</configuration>
6. Configuring the runtime environment
export PAODING_DIC_HOME=/nutch/dic
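The variable has to be visible to the JVM that runs Tomcat, so export it in the shell or startup script that launches Tomcat. If in doubt, a small check like the one below can confirm that a Java process sees the variable and that it points at a real directory. This is only a sketch: PaodingDicHomeCheck is a hypothetical helper, and it checks nothing about the dictionary files themselves.

```java
import java.io.File;

// Hypothetical helper: reports whether PAODING_DIC_HOME looks usable.
class PaodingDicHomeCheck {
    // Describes the state of the given dictionary home path.
    static String describe(String dicHome) {
        if (dicHome == null || dicHome.trim().isEmpty()) {
            return "PAODING_DIC_HOME is not set";
        }
        File dir = new File(dicHome);
        if (!dir.isDirectory()) {
            return "not a directory: " + dicHome;
        }
        String[] entries = dir.list();
        int count = (entries == null) ? 0 : entries.length;
        return "ok: " + dicHome + " (" + count + " entries)";
    }

    public static void main(String[] args) {
        // Reads the variable from the environment of this JVM.
        System.out.println(describe(System.getenv("PAODING_DIC_HOME")));
    }
}
```

Run it from the same shell that starts Tomcat; a "not set" result there means the export has not reached that environment.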
7. Running and testing
http://localhost:8080/
2009-09-14 10:26:49,312 INFO PluginRepository - Registered Plugins:
2009-09-14 10:26:49,312 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-09-14 10:26:49,312 INFO PluginRepository - Basic Query Filter (query-basic)
2009-09-14 10:26:49,312 INFO PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-09-14 10:26:49,312 INFO PluginRepository - Paoding Analysers (lib-paoding-analyzers)
2009-09-14 10:26:49,328 INFO PluginRepository - Html Parse Plug-in (parse-html)
2009-09-14 10:26:49,328 INFO PluginRepository - Basic Indexing Filter (index-basic)
2009-09-14 10:26:49,328 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-09-14 10:26:49,328 INFO PluginRepository - Site Query Filter (query-site)
2009-09-14 10:26:49,328 INFO PluginRepository - HTTP Framework (lib-http)
2009-09-14 10:26:49,328 INFO PluginRepository - Text Parse Plug-in (parse-text)
2009-09-14 10:26:49,328 INFO PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-09-14 10:26:49,328 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2009-09-14 10:26:49,328 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2009-09-14 10:26:49,328 INFO PluginRepository - XML Response Writer Plug-in (response-xml)
2009-09-14 10:26:49,328 INFO PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-09-14 10:26:49,328 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-09-14 10:26:49,343 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-09-14 10:26:49,343 INFO PluginRepository - Anchor Indexing Filter (index-anchor)
2009-09-14 10:26:49,343 INFO PluginRepository - JavaScript Parser (parse-js)
2009-09-14 10:26:49,343 INFO PluginRepository - URL Query Filter (query-url)
2009-09-14 10:26:49,343 INFO PluginRepository - Chinese Analysis Plug-in (analysis-zh)
2009-09-14 10:26:49,343 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-09-14 10:26:49,343 INFO PluginRepository - JSON Response Writer Plug-in (response-json)
2009-09-14 10:26:49,343 INFO PluginRepository - Registered Extension-Points:
2009-09-14 10:26:49,359 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
In this output, Chinese Analysis Plug-in (analysis-zh) is the Chinese word-segmentation plugin configured above.
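Rather than scanning the startup log by eye, the check can be scripted. The sketch below pulls the plugin ids out of "Registered Plugins" log lines and tests whether analysis-zh is among them. RegisteredPluginsCheck is a hypothetical helper, and the line format it expects is simply the one seen in the log excerpt above; other logging configurations may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts registered plugin ids from startup log lines.
class RegisteredPluginsCheck {
    // Matches "... PluginRepository - Some Name (plugin-id)" and captures the id.
    private static final Pattern ID_IN_PARENS =
        Pattern.compile("PluginRepository - .*\\(([^()]+)\\)\\s*$");

    // Returns ids listed between "Registered Plugins:" and
    // "Registered Extension-Points:".
    static List<String> pluginIds(List<String> logLines) {
        List<String> ids = new ArrayList<String>();
        boolean inPluginSection = false;
        for (String line : logLines) {
            if (line.contains("Registered Plugins:")) {
                inPluginSection = true;
            } else if (line.contains("Registered Extension-Points:")) {
                inPluginSection = false;
            } else if (inPluginSection) {
                Matcher m = ID_IN_PARENS.matcher(line);
                if (m.find()) {
                    ids.add(m.group(1));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // A few lines copied from the log excerpt in this article.
        List<String> lines = Arrays.asList(
            "2009-09-14 10:26:49,312 INFO PluginRepository - Registered Plugins:",
            "2009-09-14 10:26:49,312 INFO PluginRepository - Paoding Analysers (lib-paoding-analyzers)",
            "2009-09-14 10:26:49,343 INFO PluginRepository - Chinese Analysis Plug-in (analysis-zh)",
            "2009-09-14 10:26:49,343 INFO PluginRepository - Registered Extension-Points:",
            "2009-09-14 10:26:49,359 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)");
        List<String> ids = pluginIds(lines);
        System.out.println("registered: " + ids);
        System.out.println("analysis-zh present: " + ids.contains("analysis-zh"));
    }
}
```

If analysis-zh is missing from the list, the usual suspects are the plugin.includes value and the plugin directories packaged into the web application.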
And that's it, all done. Have fun with Paoding; the segmentation quality is excellent.