Question:
I experimented with the stemming and lemmatization mentioned in the article:
- Reducing a word to its root form, e.g. "cars" to "car". This operation is called stemming.
- Converting a word to its base form, e.g. "drove" to "drive". This operation is called lemmatization.
The experiment did not work.
The code is as follows:
public class TestNorms {
    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}
Neither singular/plural differences nor other word-form changes were matched.
Could this be caused by the analyzer?
Answer:
It is indeed the analyzer: StandardAnalyzer does neither stemming nor lemmatization, so it cannot relate singular and plural forms, or different inflections of the same word.
The article describes the basic principles of full-text search. Understanding them helps in understanding Lucene better, but it does not mean that Lucene follows exactly this basic process.
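You can confirm this by printing the tokens that StandardAnalyzer actually produces. The snippet below is only a small sketch against the Lucene 3.0 API (TermAttribute; the class name ShowStandardTokens is made up for the example). For the sample sentence it prints "hello", "students" and "drive", with "was" dropped as a stop word, so a TermQuery for "drove" can never match.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Sketch: dump the tokens StandardAnalyzer produces for the indexed text.
public class ShowStandardTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("desc",
                new StringReader("Hello students was drive"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            // Prints the terms exactly as they will be written to the index:
            // hello, students, drive ("was" is removed as a stop word).
            System.out.println(term.term());
        }
        ts.close();
    }
}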
(1) About stemming
A well-known stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and its definition can be read at http://tartarus.org/~martin/PorterStemmer/def.txt.
You can run quick tests on the following page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]
cars -> car
driving -> drive
tokenization -> token
However:
drove -> drove
So stemming reduces words to a root form by applying rules; it cannot recognize irregular inflections.
The latest Lucene 3.0 already ships a PorterStemFilter class that implements this algorithm. Unfortunately there is no matching Analyzer, but that is easy to fix; we can implement one ourselves:
public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
Use this analyzer in your program, and plurals as well as regular inflections will be recognized:
public void createIndex() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
    IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
    Document doc = new Document();
    field.setValue("Hello students was driving cars professionally");
    doc.add(field);
    writer.addDocument(doc);
    writer.optimize();
    writer.close();
}

public void search() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
    IndexReader reader = IndexReader.open(d);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
    System.out.println(docs.totalHits);
}
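Note that TermQuery does not run the analyzer, which is why the query terms above are already in stemmed form ("car", "drive", "profession"). For real user input you would normally let QueryParser analyze the query text with the same analyzer used at index time. The fragment below is only a sketch against the Lucene 3.0 QueryParser API (the method name searchStemmed is made up) and is meant to sit next to the search() method above:

public void searchStemmed(IndexSearcher searcher) throws IOException, ParseException {
    // Run the raw query text through the same PorterStemAnalyzer used at index time,
    // so "cars" is stemmed to "car" and "driving" to "drive" before searching.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "desc", new PorterStemAnalyzer());
    Query query = parser.parse("cars driving");
    TopDocs docs = searcher.search(query, 10);
    System.out.println(docs.totalHits);
}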
(2) About lemmatization
For lemmatization a dictionary is generally required, so that "drove" can be mapped back to "drive".
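To make the idea concrete, here is a toy sketch of dictionary lookup written as a Lucene 3.0 TokenFilter. The class name and the two hard-coded entries are made up for illustration; a real lemmatizer loads a full dictionary such as the one described below.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Toy dictionary-based lemmatization filter (illustration only, not a real lemmatizer).
public final class SimpleLemmaFilter extends TokenFilter {
    private static final Map<String, String> LEMMAS = new HashMap<String, String>();
    static {
        LEMMAS.put("drove", "drive");
        LEMMAS.put("was", "be");
    }

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public SimpleLemmaFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Replace the token with its base form if the dictionary knows it.
        String lemma = LEMMAS.get(termAtt.term());
        if (lemma != null) {
            termAtt.setTermBuffer(lemma);
        }
        return true;
    }
}

It can be chained exactly like PorterStemFilter in the analyzer above, e.g. new SimpleLemmaFilter(new LowerCaseTokenizer(reader)).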
Searching the web, I found the European languages lemmatizer [http://lemmatizer.org/]. It is developed in C++ under Linux; if you are interested, give it a try.
First download, compile and install it following the instructions on the website:
libMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz
# cd libMAFSA-0.2/
# cmake .
# make
# sudo make install

After this you should install libturglem. You can download it at the same place.

# tar xzf libturglem-0.2.tar.gz
# cd libturglem-0.2
# cmake .
# make
# sudo make install

Next you should install english dictionaries with some additional features to work with.

# tar xzf turglem-english-0.2.tar.gz
# cd turglem-english-0.2
# cmake .
# make
# sudo make install
After installation:
- /usr/local/include/turglem contains the header files, used to compile your own code
- /usr/local/share/turglem/english contains the dictionary files; in lemmas.xml we can see the mapping between "drove" and "drive", and between "was" and "be".
- libMAFSA.a, libturglem.a, libturglem-english.a and libtxml.a under /usr/local/lib are the static libraries used to build applications
<l id="DRIVE" p="6" /> <l id="DROVE" p="6" /> <l id="DRIVING" p="6" /> |
Under the turglem-english-0.2 directory there is an example test program, test_utf8.cpp:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <turglem/lemmatizer.h>
#include <turglem/lemmatizer.hpp>
#include <turglem/english/charset_adapters.hpp>

int main(int argc, char **argv)
{
    char in_s_buf[1024];
    char *nl_ptr;
    tl::lemmatizer lem;

    if (argc != 4)
    {
        printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
        return -1;
    }

    lem.load_lemmatizer(argv[1], argv[3], argv[2]);

    while (!feof(stdin))
    {
        fgets(in_s_buf, 1024, stdin);
        nl_ptr = strchr(in_s_buf, '\n');
        if (nl_ptr) *nl_ptr = 0;
        nl_ptr = strchr(in_s_buf, '\r');
        if (nl_ptr) *nl_ptr = 0;

        if (in_s_buf[0])
        {
            printf("processing %s\n", in_s_buf);
            tl::lem_result pars;
            size_t pcnt = lem.lemmatize<english_utf8_adapter>(in_s_buf, pars);
            printf("%d\n", pcnt);
            for (size_t i = 0; i < pcnt; i++)
            {
                std::string s;
                u_int32_t src_form = lem.get_src_form(pars, i);
                s = lem.get_text<english_utf8_adapter>(pars, i, 0);
                printf("PARADIGM %d: normal form '%s'\n", (unsigned int)i, s.c_str());
                printf("\tpart of speech:%d\n", lem.get_part_of_speech(pars, (unsigned int)i, src_form));
            }
        }
    }

    return 0;
}
Compile this file and link against the static libraries. Note the link order, otherwise you may get errors:
g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml
Run the compiled program:
./output /usr/local/share/turglem/english/dict_english.auto /usr/local/share/turglem/english/prediction_english.auto /usr/local/share/turglem/english/paradigms_english.bin
Testing it out: although I do not yet fully understand its mechanism, the effect of lemmatization is clearly visible:
drove
processing drove
3
PARADIGM 0: normal form 'DROVE'
    part of speech:0
PARADIGM 1: normal form 'DROVE'
    part of speech:2
PARADIGM 2: normal form 'DRIVE'
    part of speech:2
was
processing was
3
PARADIGM 0: normal form 'BE'
    part of speech:3
PARADIGM 1: normal form 'BE'
    part of speech:3
PARADIGM 2: normal form 'BE'
    part of speech:3