Lucene Beginner Notes 5 -- Analyzers: Using a Chinese Analyzer, Extending the Dictionary, Stop Words

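1. Common Chinese analyzers include MMAnalyzer (from the "极易分词" je-analysis package), the "Paoding" analyzer (PaodingAnalyzer), IKAnalyzer, and others. Of these, MMAnalyzer and PaodingAnalyzer do not support Lucene 3.0 and later.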

They are all used in much the same way. When constructing the analyzer, instantiate the class you want:

Analyzer analyzer = new [My]Analyzer();

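2. Only IKAnalyzer is demonstrated here, because at the time of writing it is the only one of the three that supports Lucene 3.0 and later.

First, add IKAnalyzer3.2.0Stable.jar to the project's classpath.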

3. Sample code

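import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerTest
{
    @Test
    public void test() throws Exception
    {
        String text = "An IndexWriter creates and maintains an index.";
        // Standard analyzer: single-character segmentation (for Chinese text)
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        testAnalyzer(analyzer, text);

        String text2 = "测试中文环境下的信息检索";
        testAnalyzer(new IKAnalyzer(), text2); // IKAnalyzer: dictionary-based segmentation
    }

    /**
     * Tokenizes the given text with the given analyzer and prints each token.
     *
     * @param analyzer the analyzer to use
     * @param text the text to tokenize
     * @throws Exception if tokenization fails
     */
    private void testAnalyzer(Analyzer analyzer, String text) throws Exception
    {
        System.out.println("Analyzer in use: " + analyzer.getClass());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        // Register the term attribute once; it is updated in place on each incrementToken()
        TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken())
        {
            System.out.println(termAttribute.term());
        }
    }
}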

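3. Extending the dictionary: in many cases we need to customize the dictionary. For example, we may want a company name such as "XXX 公司" (XXX Company) to be recognized by the analyzer and emitted as a single term.

IKAnalyzer makes this easy.

Create a file named IKAnalyzer.cfg.xml: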

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="ext_dict">/mydict.dic</entry>
</properties>

Explanation:

<entry key="ext_dict">/mydict.dic</entry> registers a custom extension dictionary named mydict.dic.

So we need to create a text file named mydict.dic (the .dic extension is not required).

Put the following in that text file:

北京XXXX科技有限公司

This adds one entry to the dictionary.

To add several entries, put each one on its own line:

词汇一

词汇二

词汇三

Note that this file must be saved in UTF-8 encoding.
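Because the encoding requirement is easy to get wrong when the file is created in an editor with a different default, the dictionary can also be generated from code. The following is a minimal sketch using only the Java standard library. The class name DictFileWriter and the output location are assumptions made for illustration and are not part of IKAnalyzer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: writes an extension dictionary, one entry per line, as UTF-8. */
final class DictFileWriter {

    /** Writes the entries to the given path, one per line, encoded as UTF-8. */
    static void writeDictionary(Path path, List<String> entries) {
        try {
            // The charset is passed explicitly so the platform default is never used.
            Files.write(path, entries, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file back as UTF-8, mainly to check what was written. */
    static List<String> readDictionary(Path path) {
        try {
            return Files.readAllLines(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed location: the working directory. Move the file to wherever
        // the ext_dict entry in IKAnalyzer.cfg.xml points.
        Path dict = Paths.get("mydict.dic");
        writeDictionary(dict, Arrays.asList("北京XXXX科技有限公司", "词汇一", "词汇二"));
        System.out.println(readDictionary(dict));
    }
}
```

Reading the file back with the same explicit charset is a quick way to confirm that the entries survived the round trip unchanged.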

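4. Stop words:

Some words occur very frequently in text but contribute almost nothing to the information it carries, for example "a, an, the, of" in English or "的, 了, 着" in Chinese, as well as punctuation marks. Such words are called stop words.

After text has been tokenized, stop words are usually filtered out and are not indexed. At search time, any stop words in the user's query are filtered out as well (because the query string also goes through the analyzer).

Excluding stop words speeds up index building and reduces the size of the index files.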
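The filtering step described above can be pictured with a few lines of plain Java. This is only a conceptual sketch: it is not how Lucene or IKAnalyzer implement stop word removal internally, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: drops stop words from an already tokenized text. */
final class StopWordFilterSketch {

    /** Returns the tokens that are not in the stop word set, in their original order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the", "of"));
        List<String> tokens = Arrays.asList("the", "history", "of", "an", "index");
        System.out.println(removeStopWords(tokens, stopWords)); // prints [history, index]
    }
}
```

Because the same filter is applied to both documents and queries, a query such as "the index" ends up matching on "index" alone.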

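Customizing stop words in IKAnalyzer is just as easy and works much like configuring the extension dictionary. Just add the following entry to IKAnalyzer.cfg.xml:

<entry key="ext_stopwords">/ext_stopword.dic</entry>

This entry likewise points to a text file, /ext_stopword.dic (any file extension will do), in the following format: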

也

了

仍

从
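To make the one-word-per-line format concrete, here is a small sketch of how such a file could be parsed with the Java standard library. It is not IKAnalyzer's own loader; the class name WordListReader, and the choice to trim whitespace and skip blank lines, are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical reader for a word list with one entry per line, encoded as UTF-8. */
final class WordListReader {

    /** Decodes the stream as UTF-8 and returns the non-blank lines in file order. */
    static Set<String> load(InputStream in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // assumption: blank lines carry no entry
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the contents of ext_stopword.dic
        byte[] sample = "也\n了\n\n仍\n从\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(load(new ByteArrayInputStream(sample)));
    }
}
```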

This article comes from a CSDN blog. Please credit the source when reposting: http://blog.csdn.net/wenlin56/archive/2010/12/13/6074124.aspx
