Continuously updated
1 Document and Field
2 IndexWriter
3 IndexReader
4 Inverted index implementation in Lucene
5 IndexSearcher
6 Analyzer
7 Directory
8 Query, Sort, and Filter
9 Lucene's ranking algorithm and improvements
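1. Document and Field
Document and Field are indispensable when building an index. A Document is comparable to a record in a relational database, and a Field to a column. A Document can hold many Fields, which lets one index serve different kinds of queries; a Field might hold the file content, file name, creation time, or modification time. Each Field has three main attributes, chosen according to need: stored (this.isStored = store.isStored()), indexed (this.isIndexed = index.isIndexed()), and tokenized (this.isTokenized = index.isAnalyzed()). Document content, for example, usually needs to be indexed but not stored. The source code enforces some restrictions; for instance, a Field cannot be both unindexed and unstored.
The main methods of Document add, remove, and look up Fields. The main API in 3.0.2 is as follows: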
void add(Fieldable field)
Adds a field to a document.
String get(String name)
Returns the string value of the field with the given name if any exist in this document, or null.
Field getField(String name)
Returns a field with the given name if any exist in this document, or null.
List<Fieldable> getFields()
Returns a List of all the fields in a document.
Field[] getFields(String name)
Returns an array of Fields with the given name.
void removeField(String name)
Removes field with the specified name from the document.
void removeFields(String name)
Removes all fields with the given name from the document.
String toString()
Prints the fields of a document for human consumption.
...
Field has two main constructors, shown below; they help explain the Field attributes (see the source file for the rest).
/**
* Create a field by specifying its name, value and how it will
* be saved in the index.
*
* @param name The name of the field
* @param internName Whether to .intern() name or not
* @param value The string to process
* @param store Whether <code>value</code> should be stored in the index
* @param index Whether the field should be indexed, and if so, if it should
* be tokenized before indexing
* @param termVector Whether term vector should be stored
* @throws NullPointerException if name or value is <code>null</code>
* @throws IllegalArgumentException in any of the following situations:
* <ul>
* <li>the field is neither stored nor indexed</li>
* <li>the field is not indexed but termVector is <code>TermVector.YES</code></li>
* </ul>
*/
public Field(String name, boolean internName, String value, Store store, Index index, TermVector termVector) {
if (name == null)
throw new NullPointerException("name cannot be null");
if (value == null)
throw new NullPointerException("value cannot be null");
if (name.length() == 0 && value.length() == 0)
throw new IllegalArgumentException("name and value cannot both be empty");
if (index == Index.NO && store == Store.NO)
throw new IllegalArgumentException("it doesn't make sense to have a field that "
+ "is neither indexed nor stored");
if (index == Index.NO && termVector != TermVector.NO)
throw new IllegalArgumentException("cannot store term vector information "
+ "for a field that is not indexed");
if (internName) // field names are optionally interned
name = StringHelper.intern(name);
this.name = name;
this.fieldsData = value;
this.isStored = store.isStored();
this.isIndexed = index.isIndexed();
this.isTokenized = index.isAnalyzed();
this.omitNorms = index.omitNorms();
if (index == Index.NO) {
this.omitTermFreqAndPositions = false;
}
this.isBinary = false;
setStoreTermVector(termVector);
}
/**
* Create a tokenized and indexed field that is not stored, optionally with
* storing term vectors. The Reader is read only when the Document is added to the index,
* i.e. you may not close the Reader until {@link IndexWriter#addDocument(Document)}
* has been called.
*
* @param name The name of the field
* @param reader The reader with the content
* @param termVector Whether term vector should be stored
* @throws NullPointerException if name or reader is <code>null</code>
*/
public Field(String name, Reader reader, TermVector termVector) {
if (name == null)
throw new NullPointerException("name cannot be null");
if (reader == null)
throw new NullPointerException("reader cannot be null");
this.name = StringHelper.intern(name); // field names are interned
this.fieldsData = reader;
this.isStored = false;
this.isIndexed = true;
this.isTokenized = true;
this.isBinary = false;
setStoreTermVector(termVector);
}
The other constructors simply delegate to these two. A few commonly used ones:
public Field(String name, String value, Store store, Index index) {
this(name, value, store, index, TermVector.NO);
}
public Field(String name, Reader reader) {
this(name, reader, TermVector.NO);
}
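Reading the three static enums in Field's source (Store, Index, and TermVector) makes it clearer how each Field attribute gets set. (In earlier versions these were three static final inner classes rather than enums.)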
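To illustrate how a single enum constant can drive several Field flags, here is a simplified stand-in written for this post. It is not Lucene's source: the class ToyFieldFlags and its describe method are hypothetical, only three of the Index constants are modelled, and the two checks simply mirror the IllegalArgumentException cases in the constructor shown earlier.

```java
class ToyFieldFlags {
    enum Store { YES, NO }
    enum TermVector { NO, YES }

    // Simplified stand-in for Field.Index: each constant carries its own flags.
    enum Index {
        NO(false, false),
        ANALYZED(true, true),
        NOT_ANALYZED(true, false);

        final boolean indexed;
        final boolean analyzed;

        Index(boolean indexed, boolean analyzed) {
            this.indexed = indexed;
            this.analyzed = analyzed;
        }
    }

    /** Applies the two consistency checks, then reports the resulting flags. */
    static String describe(Store store, Index index, TermVector termVector) {
        if (index == Index.NO && store == Store.NO) {
            throw new IllegalArgumentException("field is neither indexed nor stored");
        }
        if (index == Index.NO && termVector != TermVector.NO) {
            throw new IllegalArgumentException("term vectors need an indexed field");
        }
        return "stored=" + (store == Store.YES)
                + ",indexed=" + index.indexed
                + ",tokenized=" + index.analyzed;
    }

    public static void main(String[] args) {
        // Content that is searchable but not retrievable.
        System.out.println(describe(Store.NO, Index.ANALYZED, TermVector.NO));
        // An identifier kept verbatim: stored, and indexed as a single term.
        System.out.println(describe(Store.YES, Index.NOT_ANALYZED, TermVector.NO));
    }
}
```

Putting the flags on the enum constants keeps the constructor free of per-constant branching, which is roughly the effect of the store.isStored() and index.isAnalyzed() calls in the real constructor.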
2. IndexWriter
See one of my earlier blog posts:
http://hanyuanbo.iteye.com/blog/812135
The following is taken from the first three paragraphs of the IndexWriter JavaDoc:
Quote:
An IndexWriter creates and maintains an index.
The create argument to the constructor determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also constructors with no create argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.
In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.
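(One point made here: when a constructor has no create argument, the writer creates a new index if none exists at the path and otherwise opens the existing one.)
Quote: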
Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified.
Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler.
Of the five constructors below, three are marked Expert; the other two are enough for normal use.
IndexWriter(Directory d, Analyzer a, boolean create, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl) |
Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d. |
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl) |
Expert: constructs an IndexWriter with a custom IndexDeletionPolicy, for the index in d, first creating it if it does not already exist. |
IndexWriter(Directory d, Analyzer a, IndexDeletionPolicy deletionPolicy, IndexWriter.MaxFieldLength mfl, IndexCommit commit) |
Expert: constructs an IndexWriter on specific commit point, with a custom IndexDeletionPolicy, for the index in d. |
IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl) |
Constructs an IndexWriter for the index in d, first creating it if it does not already exist. |
IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl) |
Constructs an IndexWriter for the index in d. |
In the source code, all of them call a private init method:
private void init(Directory d, Analyzer a, final boolean create,
IndexDeletionPolicy deletionPolicy, int maxFieldLength,
IndexingChain indexingChain, IndexCommit commit)
throws CorruptIndexException, LockObtainFailedException, IOException {
... // in earlier versions, a private constructor was called instead
}
The IndexWriter methods used to add documents to the index:
void addDocument(Document doc) |
Adds a document to this index. |
void addDocument(Document doc, Analyzer analyzer) |
Adds a document to this index, using the provided analyzer instead of the value of getAnalyzer(). |
3. IndexReader
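IndexReader works on an existing index, including operations such as updates and deletions. Instances are obtained through the static open methods below rather than through public constructors: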
static IndexReader open(Directory directory) |
Returns a IndexReader reading the index in the given Directory, with readOnly=true. |
static IndexReader open(Directory directory, boolean readOnly) |
Returns an IndexReader reading the index in the given Directory. |
static IndexReader open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly) |
Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy. |
static IndexReader open(Directory directory, IndexDeletionPolicy deletionPolicy, boolean readOnly, int termInfosIndexDivisor) |
Expert: returns an IndexReader reading the index in the given Directory, with a custom IndexDeletionPolicy. |
static IndexReader open(IndexCommit commit, boolean readOnly) |
Expert: returns an IndexReader reading the index in the given IndexCommit. |
static IndexReader open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly) |
Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a custom IndexDeletionPolicy. |
static IndexReader open(IndexCommit commit, IndexDeletionPolicy deletionPolicy, boolean readOnly, int termInfosIndexDivisor) |
Expert: returns an IndexReader reading the index in the given Directory, using a specific commit and with a custom IndexDeletionPolicy. |
These APIs involve the Term class, whose constructors are simple:
Term(String fld) |
Constructs a Term with the given field and empty text. |
Term(String fld, String txt) |
Constructs a Term with the given field and text. |
The commonly used and easy-to-understand IndexReader methods are:
Document document(int n) |
Returns the stored fields of the nth Document in this index. |
abstract int numDocs() |
Returns the number of documents in this index. |
abstract TermDocs termDocs() |
Returns an unpositioned TermDocs enumerator. |
TermDocs termDocs(Term term) |
Returns an enumeration of all the documents which contain term. |
abstract TermPositions termPositions() |
Returns an unpositioned TermPositions enumerator. |
TermPositions termPositions(Term term) |
Returns an enumeration of all the documents which contain term. |
abstract TermEnum terms() |
Returns an enumeration of all the terms in the index. |
abstract TermEnum terms(Term t) |
Returns an enumeration of all terms starting at a given term. |
void deleteDocument(int docNum) |
Deletes the document numbered docNum. |
int deleteDocuments(Term term) |
Deletes all documents that have a given term indexed. |
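The code below shows how to use IndexReader to access Terms and to delete documents (when deleting, be sure to close the reader, since closing is what saves the deletions to disk).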
package com.eric.lucene;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;
public class IndexReaderTest {
private File path ;
public IndexReaderTest(String path) {
this.path = new File(path);
}
public void createIndex(){
try {
IndexWriter writer = new IndexWriter(FSDirectory.open(this.path),new StandardAnalyzer(
Version.LUCENE_30), IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Document doc2 = new Document();
Document doc3 = new Document();
doc1.add(new Field("bookname", "thinking in java -- java 4", Field.Store.YES, Field.Index.ANALYZED));
doc2.add(new Field("bookname", "java core 2", Field.Store.YES, Field.Index.ANALYZED));
doc3.add(new Field("bookname", "thinking in c++", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);
writer.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (LockObtainFailedException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
public void test1(){
try {
IndexReader reader = IndexReader.open(FSDirectory.open(this.path));
System.out.println("version:\t" + reader.getVersion());
int num = reader.numDocs();
for(int i=0;i<num;i++){
Document doc = reader.document(i);
System.out.println(doc);
}
Term term = new Term("bookname","java");
TermDocs docs = reader.termDocs(term);
while(docs.next()){
System.out.print("doc num:\t" + docs.doc() + "\t\t");
System.out.println("frequency:\t" + docs.freq());
}
reader.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
// version: 1289906350314
// Document<stored,indexed,tokenized<bookname:thinking in java -- java 4>>
// Document<stored,indexed,tokenized<bookname:java core 2>>
// Document<stored,indexed,tokenized<bookname:thinking in c++>>
// doc num: 0 frequency: 2
// doc num: 1 frequency: 1
public void test2(){
try {
IndexReader reader = IndexReader.open(FSDirectory.open(this.path));
System.out.println("version:\t" + reader.getVersion());
Term term = new Term("bookname","java");
TermPositions pos = reader.termPositions(term);
while(pos.next()){
System.out.print("frequency: " + pos.freq() + "\t");
for(int i=0;i<pos.freq();i++){
System.out.print("pos: " + pos.nextPosition() + "\t");
}
System.out.println();
}
reader.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
// version: 1289906350314
// frequency: 2 pos: 2 pos: 3
// frequency: 1 pos: 0
// createIndex() was not called for this second run, so the version number is unchanged
public void delete1(){
try {
IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false to allow deletions
System.out.println("version:\t" + reader.getVersion());
System.out.println("num:\t" + reader.numDocs());
reader.deleteDocument(2); // delete the "thinking in c++" Document
reader.close();
reader = IndexReader.open(FSDirectory.open(this.path), false);
System.out.println("version:\t" + reader.getVersion());
System.out.println("num:\t" + reader.numDocs());
reader.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
// version: 1289906350314
// num: 3
// version: 1289906350315
// num: 2
public void delete2(){
try {
IndexReader reader = IndexReader.open(FSDirectory.open(this.path), false); // readOnly must be false to allow deletions
System.out.println("version:\t" + reader.getVersion());
System.out.println("num:\t" + reader.numDocs());
Term term = new Term("bookname","java");
reader.deleteDocuments(term); // delete every Document whose bookname contains "java"
reader.close();
reader = IndexReader.open(FSDirectory.open(this.path), false);
System.out.println("version:\t" + reader.getVersion());
System.out.println("num:\t" + reader.numDocs());
reader.close();
} catch (CorruptIndexException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
// version: 1289906350315
// num: 2
// version: 1289906350316
// num: 0
public static void main(String[] args) {
String path = "E:\\indexReaderTest";
IndexReaderTest test = new IndexReaderTest(path);
// test.createIndex();
// test.test1();
// test.test2();
// test.delete1();
test.delete2();
}
}
Notes:
The outputs shown in the comments come from running the methods in this order. First run:
String path = "E:\\indexReaderTest";
IndexReaderTest test = new IndexReaderTest(path);
test.createIndex();
test.test1();
Then run:
String path = "E:\\indexReaderTest";
IndexReaderTest test = new IndexReaderTest(path);
test.test2();
Then run:
String path = "E:\\indexReaderTest";
IndexReaderTest test = new IndexReaderTest(path);
test.delete1();
Then run:
String path = "E:\\indexReaderTest";
IndexReaderTest test = new IndexReaderTest(path);
test.delete2();
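4. Inverted index implementation in Lucene
The following blog post gives a simple explanation of how an inverted index works:
http://jackyrong.iteye.com/blog/238940
The attached 《Lucene 3.0 原理与代码分析完整版.pdf》 (a complete analysis of Lucene 3.0 principles and code) opens with a few easy pages on the basics of information retrieval. Lucene is one implementation of those principles, so that introduction helps a lot in understanding how Lucene builds its inverted index.
Reading the source shows that IndexWriter has a static constant, static final IndexingChain DefaultIndexingChain, defined as follows: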
static final IndexingChain DefaultIndexingChain = new IndexingChain() {
@Override
DocConsumer getChain(DocumentsWriter documentsWriter) {
/*
This is the current indexing chain:
DocConsumer / DocConsumerPerThread
--> code: DocFieldProcessor / DocFieldProcessorPerThread
--> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
--> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
--> code: DocInverter / DocInverterPerThread / DocInverterPerField
--> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
--> code: TermsHash / TermsHashPerThread / TermsHashPerField
--> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
--> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
--> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
--> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
--> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
--> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
*/
// Build up indexing chain:
final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter);
final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();
final InvertedDocConsumer termsHash = new TermsHash(documentsWriter, true, freqProxWriter,
new TermsHash(documentsWriter, false, termVectorsWriter, null));
final NormsWriter normsWriter = new NormsWriter();
final DocInverter docInverter = new DocInverter(termsHash, normsWriter);
return new DocFieldProcessor(documentsWriter, docInverter);
}
};
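The comment lays out clearly how the whole indexing chain proceeds. The API documentation does not describe these inverter classes (DocInverter, InvertedDocConsumer, and so on), so you have to read the source files.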
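For readers who want the underlying idea before diving into that chain, here is a minimal toy inverted index in plain Java. It is only a sketch of the principle (term -> document -> positions), not Lucene's implementation: ToyInvertedIndex and its methods are hypothetical names, tokenization is a naive split on non-alphanumeric characters, and no stop words are removed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ToyInvertedIndex {
    // term -> (docId -> positions); TreeMap keeps terms and doc ids sorted.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes the text and records the position of every token. */
    int addDocument(String text) {
        int docId = nextDocId++;
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue; // split can yield a leading empty string
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
        return docId;
    }

    /** Ids of the documents containing the term, in ascending order. */
    List<Integer> docs(String term) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) return Collections.<Integer>emptyList();
        return new ArrayList<Integer>(byDoc.keySet());
    }

    /** Positions of the term inside one document; the list size is the frequency. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) return Collections.<Integer>emptyList();
        return byDoc.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("thinking in java -- java 4");
        index.addDocument("java core 2");
        index.addDocument("thinking in c++");
        for (int doc : index.docs("java")) {
            List<Integer> pos = index.positions("java", doc);
            System.out.println("doc " + doc + " freq " + pos.size() + " positions " + pos);
        }
    }
}
```

The lookups correspond loosely to what termDocs(term) and termPositions(term) expose on IndexReader: which documents contain a term, how often, and where.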
5. IndexSearcher
The interfaces Searcher implements and its class hierarchy are as follows (from the API documentation; for basic usage see my earlier blog post
http://hanyuanbo.iteye.com/blog/812135)
Quote:
org.apache.lucene.search
Class Searcher
java.lang.Object
org.apache.lucene.search.Searcher
All Implemented Interfaces:
Closeable, Searchable
Direct Known Subclasses:
IndexSearcher, MultiSearcher
search has many overloads; the following are taken from the API documentation.
void search(Query query, Collector results) |
Lower-level search API. |
void search(Query query, Filter filter, Collector results) |
Lower-level search API. |
TopDocs search(Query query, Filter filter, int n) |
Finds the top n hits for query, applying filter if non-null. |
TopFieldDocs search(Query query, Filter filter, int n, Sort sort) |
Search implementation with arbitrary sorting. |
TopDocs search(Query query, int n) |
Finds the top n hits for query. |
abstract void search(Weight weight, Filter filter, Collector results) |
Lower-level search API. |
abstract TopDocs search(Weight weight, Filter filter, int n) |
Expert: Low-level search implementation. |
abstract TopFieldDocs search(Weight weight, Filter filter, int n, Sort sort) |
Expert: Low-level search implementation with arbitrary sorting. |
Two more very useful methods (abstract in Searcher, with the concrete implementations in the subclasses):
abstract Document doc(int i) |
Returns the stored fields of document i. |
Explanation explain(Weight weight, int doc) |
Expert: low-level implementation method Returns an Explanation that describes how doc scored against weight. |
The search overloads in the abstract Searcher class look like this in the source:
/** Search implementation with arbitrary sorting. Finds
* the top <code>n</code> hits for <code>query</code>, applying
* <code>filter</code> if non-null, and sorting the hits by the criteria in
* <code>sort</code>.
*
* <p>NOTE: this does not compute scores by default; use
* {@link IndexSearcher#setDefaultFieldSortScoring} to
* enable scoring.
*
* @throws BooleanQuery.TooManyClauses
*/
public TopFieldDocs search(Query query, Filter filter, int n,
Sort sort) throws IOException {
return search(createWeight(query), filter, n, sort);
}
/** Lower-level search API.
*
* <p>{@link Collector#collect(int)} is called for every matching document.
*
* <p>Applications should only use this if they need <i>all</i> of the
* matching documents. The high-level search API ({@link
* Searcher#search(Query, int)}) is usually more efficient, as it skips
* non-high-scoring hits.
* <p>Note: The <code>score</code> passed to this method is a raw score.
* In other words, the score will not necessarily be a float whose value is
* between 0 and 1.
* @throws BooleanQuery.TooManyClauses
*/
public void search(Query query, Collector results)
throws IOException {
search(createWeight(query), null, results);
}
/** Lower-level search API.
*
* <p>{@link Collector#collect(int)} is called for every matching
* document.
* <br>Collector-based access to remote indexes is discouraged.
*
* <p>Applications should only use this if they need <i>all</i> of the
* matching documents. The high-level search API ({@link
* Searcher#search(Query, Filter, int)}) is usually more efficient, as it skips
* non-high-scoring hits.
*
* @param query to match documents
* @param filter if non-null, used to permit documents to be collected.
* @param results to receive hits
* @throws BooleanQuery.TooManyClauses
*/
public void search(Query query, Filter filter, Collector results)
throws IOException {
search(createWeight(query), filter, results);
}
/** Finds the top <code>n</code>
* hits for <code>query</code>, applying <code>filter</code> if non-null.
*
* @throws BooleanQuery.TooManyClauses
*/
public TopDocs search(Query query, Filter filter, int n)
throws IOException {
return search(createWeight(query), filter, n);
}
/** Finds the top <code>n</code>
* hits for <code>query</code>.
*
* @throws BooleanQuery.TooManyClauses
*/
public TopDocs search(Query query, int n)
throws IOException {
return search(query, null, n);
}
...
abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;
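Searcher itself does not implement the actual search; that is left to the subclasses, and every call eventually ends up in the
search(Weight weight, Filter filter, Collector results)
overload. The search methods that take a Query all call createWeight(query) implicitly.
In IndexSearcher there are two main search methods (the other overloads call one of them):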
@Override
public void search(Weight weight, Filter filter, Collector collector)
throws IOException {
if (filter == null) {
for (int i = 0; i < subReaders.length; i++) { // search each subreader
collector.setNextReader(subReaders[i], docStarts[i]);
Scorer scorer = weight.scorer(subReaders[i], !collector.acceptsDocsOutOfOrder(), true);
if (scorer != null) {
scorer.score(collector);
}
}
} else {
for (int i = 0; i < subReaders.length; i++) { // search each subreader
collector.setNextReader(subReaders[i], docStarts[i]);
searchWithFilter(subReaders[i], weight, filter, collector);
}
}
}
...
private void searchWithFilter(IndexReader reader, Weight weight,
final Filter filter, final Collector collector) throws IOException {
...
}
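As you can see, the main difference between the two is whether a Filter is applied. The search methods that return a value also call one of the two methods above and simply end by returning the collected results, for example:
return (TopFieldDocs) collector.topDocs();
For simple use, it is enough to call the methods declared in the abstract parent class Searcher.
Searching also relies on several helper classes: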
QueryParser |
Query |
TopScoreDocCollector |
TopDocs |
ScoreDoc |
Document |
Note the TopScoreDocCollector class, which collects the search results. Its class hierarchy is as follows (from the API documentation):
Quote:
org.apache.lucene.search
Class TopScoreDocCollector
java.lang.Object
org.apache.lucene.search.Collector
org.apache.lucene.search.TopDocsCollector<ScoreDoc>
org.apache.lucene.search.TopScoreDocCollector
Its commonly used methods include (from the API documentation):
int getTotalHits() |
The total number of documents that matched this query. |
TopDocs topDocs() |
Returns the top docs that were collected by this collector. |
TopDocs topDocs(int start) |
Returns the documents in the range [start .. |
TopDocs topDocs(int start, int howMany) |
Returns the documents in the range [start .. |
TopDocs, the return type of topDocs(), has the following two fields:
ScoreDoc[] scoreDocs |
The top hits for the query. |
int totalHits |
The total number of hits for the query. |
ScoreDoc in turn has two fields:
int doc |
Expert: A hit document's number. |
float score |
Expert: The score of this document for the query. |
This gives you doc (the document number) and score (the relevance score) for each hit.
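To make the roles of totalHits and scoreDocs more concrete, here is a rough sketch of a top-n collector in plain Java. It is not Lucene's TopScoreDocCollector: ToyTopDocs, Hit, and Collector are hypothetical names, and the tie-breaking rule (lower document number wins on equal scores) is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ToyTopDocs {
    /** Plays the role of ScoreDoc: a document number plus its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Orders the weakest hit first: lower score, then higher doc number.
    static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    static final class Collector {
        private final int n;
        private int totalHits = 0;
        // Min-heap holding at most n hits, with the weakest one on top.
        private final PriorityQueue<Hit> queue = new PriorityQueue<>(WEAKEST_FIRST);

        Collector(int n) {
            this.n = n;
        }

        /** Called once per matching document. */
        void collect(int doc, float score) {
            totalHits++;
            queue.add(new Hit(doc, score));
            if (queue.size() > n) {
                queue.poll(); // evict the weakest hit
            }
        }

        int getTotalHits() {
            return totalHits;
        }

        /** Returns the retained hits, best first. */
        List<Hit> topDocs() {
            List<Hit> hits = new ArrayList<>(queue);
            hits.sort(WEAKEST_FIRST.reversed());
            return hits;
        }
    }

    public static void main(String[] args) {
        Collector collector = new Collector(2);
        collector.collect(0, 0.5f);
        collector.collect(1, 0.9f);
        collector.collect(2, 0.1f);
        System.out.println("total hits: " + collector.getTotalHits());
        for (Hit hit : collector.topDocs()) {
            System.out.println("doc " + hit.doc + " score " + hit.score);
        }
    }
}
```

The point of the bounded heap is that every match is counted, but only n hits are ever held in memory, which is why the total hit count can be much larger than the number of returned hits.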
6. Analyzer
In Lucene 3.0.2 the Analyzer inheritance hierarchy is as follows (from the API documentation):
org.apache.lucene.analysis
Class Analyzer
java.lang.Object org.apache.lucene.analysis.Analyzer
All Implemented Interfaces: Closeable
Direct Known Subclasses:ArabicAnalyzer, BrazilianAnalyzer, ChineseAnalyzer, CJKAnalyzer, CollationKeyAnalyzer, CzechAnalyzer, DutchAnalyzer, FrenchAnalyzer, GermanAnalyzer, GreekAnalyzer, ICUCollationKeyAnalyzer, KeywordAnalyzer, PatternAnalyzer, PerFieldAnalyzerWrapper, PersianAnalyzer, QueryAutoStopWordAnalyzer, RussianAnalyzer, ShingleAnalyzerWrapper, SimpleAnalyzer, SmartChineseAnalyzer, SnowballAnalyzer,
StandardAnalyzer, StopAnalyzer, ThaiAnalyzer, WhitespaceAnalyzer
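Many of these are language-specific implementations, such as ChineseAnalyzer and RussianAnalyzer. So far I have only used StandardAnalyzer and IKAnalyzer (IKAnalyzer is a separate project with its own implementation, not part of Lucene). The others are worth trying if you need to analyze a particular language.
StandardAnalyzer has two main fields: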
private Set<?> stopSet;
private final Version matchVersion;
It has three more fields:
/**
* Specifies whether deprecated acronyms should be replaced with HOST type.
* See {@linkplain https://issues.apache.org/jira/browse/LUCENE-1068}
*/
private final boolean replaceInvalidAcronym,enableStopPositionIncrements;
/** An unmodifiable set containing some common English words that are usually not
useful for searching. */
public static final Set<?> STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
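I almost always pass Version.LUCENE_30, so I leave the first two fields (replaceInvalidAcronym and enableStopPositionIncrements) aside and have not studied them further. STOP_WORDS_SET is used to filter out stop words, and StopAnalyzer.ENGLISH_STOP_WORDS_SET is defined as follows: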
//StopAnalyzer.java
public static final Set<?> ENGLISH_STOP_WORDS_SET;
static {
final List<String> stopWords = Arrays.asList(
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
);
final CharArraySet stopSet = new CharArraySet(stopWords.size(), false);
stopSet.addAll(stopWords);
ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet);
}
StandardAnalyzer has four constructors:
StandardAnalyzer(Version matchVersion) |
Builds an analyzer with the default stop words (STOP_WORDS_SET). |
StandardAnalyzer(Version matchVersion, File stopwords) |
Builds an analyzer with the stop words from the given file. |
StandardAnalyzer(Version matchVersion, Reader stopwords) |
Builds an analyzer with the stop words from the given reader. |
StandardAnalyzer(Version matchVersion, Set<?> stopWords) |
Builds an analyzer with the given stop words. |
The main constructor is shown below (the other constructors all delegate to it):
/** Builds an analyzer with the given stop words.
* @param matchVersion Lucene version to match See {@link
* <a href="#version">above</a>}
* @param stopWords stop words */
public StandardAnalyzer(Version matchVersion, Set<?> stopWords) {
stopSet = stopWords;
setOverridesTokenStreamMethod(StandardAnalyzer.class);
enableStopPositionIncrements = StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion);
replaceInvalidAcronym = matchVersion.onOrAfter(Version.LUCENE_24);
this.matchVersion = matchVersion;
}
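I leave replaceInvalidAcronym and enableStopPositionIncrements aside for now (I try to stick to Version.LUCENE_30); both are just boolean values derived from a compareTo comparison in the Version enum, and I may come back to them later. Apart from these two, the constructor only assigns the version and the stop words. The version is chosen from the Version enum. The stop words default to a built-in set, but Lucene makes this easy to extend: you can supply your own set or a file. Assigning a set is covered by the constructor above; the constructor that reads the stop words from a file is: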
/** Builds an analyzer with the stop words from the given file.
* @see WordlistLoader#getWordSet(File)
* @param matchVersion Lucene version to match See {@link
* <a href="#version">above</a>}
* @param stopwords File to read stop words from */
public StandardAnalyzer(Version matchVersion, File stopwords) throws IOException {
this(matchVersion, WordlistLoader.getWordSet(stopwords));
}
The corresponding methods in WordlistLoader are defined as follows:
/**
* Loads a text file and adds every line as an entry to a HashSet (omitting
* leading and trailing whitespace). Every line of the file should contain only
* one word. The words need to be in lowercase if you make use of an
* Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
*
* @param wordfile File containing the wordlist
* @return A HashSet with the file's words
*/
public static HashSet<String> getWordSet(File wordfile) throws IOException {
HashSet<String> result = new HashSet<String>();
FileReader reader = null;
try {
reader = new FileReader(wordfile);
result = getWordSet(reader);
}
finally {
if (reader != null)
reader.close();
}
return result;
}
/**
* Reads lines from a Reader and adds every line as an entry to a HashSet (omitting
* leading and trailing whitespace). Every line of the Reader should contain only
* one word. The words need to be in lowercase if you make use of an
* Analyzer which uses LowerCaseFilter (like StandardAnalyzer).
*
* @param reader Reader containing the wordlist
* @return A HashSet with the reader's words
*/
public static HashSet<String> getWordSet(Reader reader) throws IOException {
HashSet<String> result = new HashSet<String>();
BufferedReader br = null;
try {
if (reader instanceof BufferedReader) {
br = (BufferedReader) reader;
} else {
br = new BufferedReader(reader);
}
String word = null;
while ((word = br.readLine()) != null) {
result.add(word.trim());
}
}
finally {
if (br != null)
br.close();
}
return result;
}
So all it takes is a file with one stop word per line, and StandardAnalyzer can load its stop words from it.
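The following plain-Java sketch shows the same one-word-per-line idea end to end, together with a naive stop-word filter that keeps counting positions for removed words. It is an illustration only: ToyStopWords, load, and filter are hypothetical names, the tokenizer is a whitespace split, and unlike the WordlistLoader code above this version skips blank lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyStopWords {
    /** Reads one stop word per line, trimming whitespace and skipping blank lines. */
    static Set<String> load(Reader reader) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    /** Drops stop words but still advances the position, so gaps stay visible. */
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) continue; // leading whitespace yields an empty token
            if (!stopWords.contains(token)) {
                kept.add(token + "@" + position);
            }
            position++;
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the stop-word file.
        Set<String> stopWords = load(new StringReader("in\n the \n\nof\n"));
        System.out.println(filter("Thinking in Java", stopWords));
    }
}
```

The stop words are stored in lowercase because the tokens are lowercased before the lookup, which matches the advice in the WordlistLoader Javadoc about analyzers that use LowerCaseFilter.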
7. Directory
The Directory class hierarchy is as follows (from the API documentation):
org.apache.lucene.store Class Directory
java.lang.Object org.apache.lucene.store.Directory
All Implemented Interfaces: Closeable
Direct Known Subclasses:DbDirectory, FileSwitchDirectory,
FSDirectory, JEDirectory,
RAMDirectory
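RAMDirectory and FSDirectory are used most often. RAMDirectory keeps the index in memory (with a large data set this is dangerous and can end in java.lang.OutOfMemoryError: Java heap space), while FSDirectory stores the index files on the local disk. That is the general idea. In practice, make sure the IndexWriter and the IndexReader operate on the same Directory; otherwise you get an error like the following (this one came from pointing them at two different RAMDirectory instances): no segments* file found in org.apache.lucene.store.RAMDirectory@765291: files: []
Most Directory methods operate on files; they are essentially a further wrapper around Java I/O and are fairly easy to understand. Building an index requires a Directory instance, so the commonly used methods of FSDirectory and RAMDirectory are listed below.
The following are the factory methods of FSDirectory (from the API documentation). Because its constructor is protected, it cannot be instantiated directly; call a static method to obtain a reference to a concrete implementation.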
static FSDirectory open(File path) |
Creates an FSDirectory instance, trying to pick the best implementation given the current environment. |
static FSDirectory open(File path, LockFactory lockFactory) |
Just like open(File), but allows you to also specify a custom LockFactory. |
The implementation is as follows:
/** Creates an FSDirectory instance, trying to pick the
* best implementation given the current environment.
* The directory returned uses the {@link NativeFSLockFactory}.
*
* <p>Currently this returns {@link NIOFSDirectory}
* on non-Windows JREs and {@link SimpleFSDirectory}
* on Windows.
*
* <p><b>NOTE</b>: this method may suddenly change which
* implementation is returned from release to release, in
* the event that higher performance defaults become
* possible; if the precise implementation is important to
* your application, please instantiate it directly,
* instead. On 64 bit systems, it may also good to
* return {@link MMapDirectory}, but this is disabled
* because of officially missing unmap support in Java.
* For optimal performance you should consider using
* this implementation on 64 bit JVMs.
*
* <p>See <a href="#subclasses">above</a> */
public static FSDirectory open(File path) throws IOException {
return open(path, null);
}
/** Just like {@link #open(File)}, but allows you to
* also specify a custom {@link LockFactory}. */
public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
/* For testing:
MMapDirectory dir=new MMapDirectory(path, lockFactory);
dir.setUseUnmap(true);
return dir;
*/
if (Constants.WINDOWS) {
return new SimpleFSDirectory(path, lockFactory);
} else {
return new NIOFSDirectory(path, lockFactory);
}
}
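As the code shows, open(File) passes null as the LockFactory, and according to the Javadoc above the returned directory then uses NativeFSLockFactory. What depends on the operating system is the FSDirectory implementation that is returned (SimpleFSDirectory on Windows, NIOFSDirectory elsewhere), not the LockFactory; an explicitly supplied LockFactory is simply handed to that implementation. I have not used a custom LockFactory yet and will add more here if I do.
8. Query, Sort, and Filter
Query is a class every search needs, and it deserves a thorough understanding. Its class hierarchy is as follows (from the API documentation). Query often uses the Term class mentioned above.
Quote: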
org.apache.lucene.search
Class Query
java.lang.Object
org.apache.lucene.search.Query
All Implemented Interfaces:
Serializable, Cloneable
Direct Known Subclasses:
BooleanQuery, BoostingQuery, ConstantScoreQuery, CustomScoreQuery, DisjunctionMaxQuery, FilteredQuery, FuzzyLikeThisQuery, MatchAllDocsQuery, MoreLikeThisQuery, MultiPhraseQuery, MultiTermQuery, PhraseQuery, SpanQuery, TermQuery, ValueSourceQuery
Each is introduced in turn below.
TermQuery
There is only one constructor, and it is simple to use:
TermQuery(Term t) |
Constructs a query for the term t. |
TermQuery query = new TermQuery(new Term("bookname","java"));
BooleanQuery
(from the API documentation)
Quote:
A Query that matches documents matching boolean combinations of other queries, e.g. TermQuerys, PhraseQuerys or other BooleanQuerys.
BooleanQuery has two constructors (from the API documentation):
BooleanQuery() |
Constructs an empty boolean query. |
BooleanQuery(boolean disableCoord) |
Constructs an empty boolean query. |
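The attributes to note include the following (I have probably not used all of them; these are the ones I know):
private static int maxClauseCount = 1024; // maximum number of clauses; the default is 1024
this.disableCoord = disableCoord; // set in the second constructor; used by the Similarity class during search
protected int minNrShouldMatch = 0; // set by setMinimumNumberShouldMatch(int)
private ArrayList<BooleanClause> clauses = new ArrayList<BooleanClause>(); // container holding the BooleanClause objects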
Commonly used methods include:
void add(BooleanClause clause) |
Adds a clause to a boolean query. |
void add(Query query, BooleanClause.Occur occur) |
Adds a clause to a boolean query. |
The BooleanClause class is simple but very useful to BooleanQuery. There is not much code in it; what matters is its static enum and its two fields.
public static enum Occur {
/** Use this operator for clauses that <i>must</i> appear in the matching documents. */
MUST { @Override public String toString() { return "+"; } },
/** Use this operator for clauses that <i>should</i> appear in the
* matching documents. For a BooleanQuery with no <code>MUST</code>
* clauses one or more <code>SHOULD</code> clauses must match a document
* for the BooleanQuery to match.
* @see BooleanQuery#setMinimumNumberShouldMatch
*/
SHOULD { @Override public String toString() { return ""; } },
/** Use this operator for clauses that <i>must not</i> appear in the matching documents.
* Note that it is not possible to search for queries that only consist
* of a <code>MUST_NOT</code> clause. */
MUST_NOT { @Override public String toString() { return "-"; } };
}
/** The query whose matching documents are combined by the boolean query.
*/
private Query query;
private Occur occur;
/** Constructs a BooleanClause.
*/
public BooleanClause(Query query, Occur occur) {
this.query = query;
this.occur = occur;
}
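The Javadoc on Occur describes how the three operators combine. A minimal sketch of those rules follows, assuming a document can be modelled as a plain set of terms. OccurSketch, Clause, and matches are hypothetical names for illustration, not Lucene classes, and the sketch ignores scoring and minNrShouldMatch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for BooleanQuery matching; this is not Lucene code.
class OccurSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // One clause: a single term plus how it must occur.
    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // Returns true if the document (a set of terms) satisfies all clauses.
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean hasShould = false;
        boolean shouldHit = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;    // a prohibited term is present
                    break;
                case SHOULD:
                    hasShould = true;
                    if (present) shouldHit = true;
                    break;
            }
        }
        // With a MUST clause, SHOULD clauses are optional.
        if (hasMust) return true;
        // Without one, at least one SHOULD clause has to match;
        // a query made only of MUST_NOT clauses matches nothing.
        return hasShould && shouldHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<String>(Arrays.asList("thinking", "in", "c++"));
        List<Clause> query = Arrays.asList(
                new Clause("java", Occur.SHOULD),
                new Clause("thinking", Occur.SHOULD));
        System.out.println(matches(doc, query));
    }
}
```

This mirrors why a query with two SHOULD clauses on "java" and "thinking" can still return a title that contains only "thinking".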
PhraseQuery
PhraseQuery has only one constructor:
PhraseQuery() |
Constructs an empty phrase query. |
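The constructor builds an empty PhraseQuery because the query is configured mainly through its fields, which include:
private String field; // all terms in one PhraseQuery must belong to the same field
private ArrayList<Term> terms = new ArrayList<Term>(4); // the terms of the phrase
private ArrayList<Integer> positions = new ArrayList<Integer>(4); // the relative position of each term
private int maxPosition = 0; // the largest position added so far
private int slop = 0; // the allowed distance between terms; 0 means an exact phrase
The main methods are: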
public void setSlop(int s) { slop = s; }
/**
* Adds a term to the end of the query phrase.
* The relative position of the term is the one immediately after the last term added.
*/
public void add(Term term) {
int position = 0;
if(positions.size() > 0)
position = positions.get(positions.size()-1).intValue() + 1;
add(term, position);
}
/**
* Adds a term to the end of the query phrase.
* The relative position of the term within the phrase is specified explicitly.
* This allows e.g. phrases with more than one term at the same position
* or phrases with gaps (e.g. in connection with stopwords).
*
* @param term
* @param position
*/
public void add(Term term, int position) {
if (terms.size() == 0)
field = term.field();
else if (term.field() != field)
throw new IllegalArgumentException("All phrase terms must be in the same field: " + term); // all terms must share one field
terms.add(term);
positions.add(Integer.valueOf(position));
if (position > maxPosition) maxPosition = position;
}
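To make the role of the terms and positions lists concrete, here is a minimal sketch of exact phrase matching (the slop = 0 case only), assuming a document is just an ordered list of tokens. PhraseSketch and matchesExact are hypothetical names; Lucene's real phrase scorers work on the index's posting lists and also handle slop, which this sketch does not attempt.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of slop = 0 phrase matching; this is not Lucene code.
class PhraseSketch {
    // terms[i] must appear at (start + positions[i]) for some start offset.
    // positions are assumed to be non-negative, as produced by add(Term).
    static boolean matchesExact(List<String> docTokens, String[] terms, int[] positions) {
        if (terms.length == 0) return false;
        for (int start = 0; start < docTokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < terms.length; i++) {
                int at = start + positions[i];
                if (at >= docTokens.size() || !docTokens.get(at).equals(terms[i])) {
                    all = false;
                    break;
                }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("thinking", "in", "java");
        // Adjacent positions 0 and 1, as add(Term) would assign them.
        System.out.println(matchesExact(doc, new String[] {"thinking", "java"}, new int[] {0, 1}));
        // An explicit gap, as add(Term, int) allows.
        System.out.println(matchesExact(doc, new String[] {"thinking", "java"}, new int[] {0, 2}));
    }
}
```

With adjacent positions the phrase "thinking java" is not found in "thinking in java", while positions 0 and 2 describe a phrase with a one-word gap and do match.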
WildcardQuery (extends MultiTermQuery)
WildcardQuery is very simple to use and has only one constructor. The relevant source is as follows:
/** Implements the wildcard search query. Supported wildcards are <code>*</code>, which
* matches any character sequence (including the empty one), and <code>?</code>,
* which matches any single character. Note this query can be slow, as it
* needs to iterate over many terms. In order to prevent extremely slow WildcardQueries,
* a Wildcard term should not start with one of the wildcards <code>*</code> or
* <code>?</code>.
*
* <p>This query uses the {@link
* MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT}
* rewrite method.
*
* @see WildcardTermEnum */
public class WildcardQuery extends MultiTermQuery {
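private boolean termContainsWildcard; // true if the term contains * or ?
private boolean termIsPrefix; // true if the only wildcard is a single trailing *; this case is handled specially to speed up the search, because WildcardQuery is otherwise much slower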
protected Term term;
public WildcardQuery(Term term) {
this.term = term;
String text = term.text();
this.termContainsWildcard = (text.indexOf('*') != -1)
|| (text.indexOf('?') != -1);
this.termIsPrefix = termContainsWildcard
&& (text.indexOf('?') == -1)
&& (text.indexOf('*') == text.length() - 1);
}
...
}
As the code shows, WildcardQuery supports only two wildcards: * and ?.
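The meaning of the two wildcards can be sketched in a few lines of plain Java. This is only an illustration of the matching rules described in the Javadoc above (* matches any character sequence, including the empty one, and ? matches exactly one character). WildcardSketch is a hypothetical name, and Lucene itself enumerates index terms through WildcardTermEnum rather than calling anything like this.

```java
// Hypothetical illustration of * and ? matching; this is not Lucene code.
class WildcardSketch {
    static boolean matches(String pattern, String text) {
        int p = 0, t = 0;
        int starP = -1;  // index of the most recent * in the pattern
        int starT = -1;  // text index where that * started matching
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;          // remember the *, first try matching it to ""
                starT = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;                  // single-character match
                t++;
            } else if (starP != -1) {
                p = starP + 1;        // backtrack: let the last * absorb one more char
                t = ++starT;
            } else {
                return false;
            }
        }
        // Only trailing * characters may remain in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') p++;
        return p == pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("think*", "thinking"));
        System.out.println(matches("ja?a", "java"));
        System.out.println(matches("ja?a", "javac"));
    }
}
```

A pattern such as think* has a fixed prefix, which is the case the termIsPrefix flag singles out; a leading wildcard gives no such prefix to narrow the terms that must be examined.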
PrefixQuery (extends MultiTermQuery)
It also has only one constructor (from the API documentation):
PrefixQuery(Term prefix) |
Constructs a query for terms starting with prefix. |
FuzzyQuery (extends MultiTermQuery)
FuzzyQuery implements similarity (fuzzy) search. Its constructors are as follows (from the API documentation):
FuzzyQuery(Term term) |
Calls FuzzyQuery(term, 0.5f, 0). |
FuzzyQuery(Term term, float minimumSimilarity) |
Calls FuzzyQuery(term, minimumSimilarity, 0). |
FuzzyQuery(Term term, float minimumSimilarity, int prefixLength) |
Create a new FuzzyQuery that will match terms with a similarity of at least minimumSimilarity to term. |
The implementation is as follows:
public final static float defaultMinSimilarity = 0.5f;
public final static int defaultPrefixLength = 0;
private float minimumSimilarity;
private int prefixLength;
private boolean termLongEnough = false;
protected Term term;
/**
* Create a new FuzzyQuery that will match terms with a similarity
* of at least <code>minimumSimilarity</code> to <code>term</code>.
* If a <code>prefixLength</code> > 0 is specified, a common prefix
* of that length is also required.
*
* @param term the term to search for
* @param minimumSimilarity a value between 0 and 1 to set the required similarity
* between the query term and the matching terms. For example, for a
* <code>minimumSimilarity</code> of <code>0.5</code> a term of the same length
* as the query term is considered similar to the query term if the edit distance
* between both terms is less than <code>length(term)*0.5</code>
* @param prefixLength length of common (non-fuzzy) prefix
* @throws IllegalArgumentException if minimumSimilarity is >= 1 or < 0
* or if prefixLength < 0
*/
public FuzzyQuery(Term term, float minimumSimilarity, int prefixLength) throws IllegalArgumentException {
this.term = term;
if (minimumSimilarity >= 1.0f)
throw new IllegalArgumentException("minimumSimilarity >= 1");
else if (minimumSimilarity < 0.0f)
throw new IllegalArgumentException("minimumSimilarity < 0");
if (prefixLength < 0)
throw new IllegalArgumentException("prefixLength < 0");
if (term.text().length() > 1.0f / (1.0f - minimumSimilarity)) {
this.termLongEnough = true;
}
this.minimumSimilarity = minimumSimilarity;
this.prefixLength = prefixLength;
rewriteMethod = SCORING_BOOLEAN_QUERY_REWRITE;
}
/**
* Calls {@link #FuzzyQuery(Term, float) FuzzyQuery(term, minimumSimilarity, 0)}.
*/
public FuzzyQuery(Term term, float minimumSimilarity) throws IllegalArgumentException {
this(term, minimumSimilarity, defaultPrefixLength);
}
/**
* Calls {@link #FuzzyQuery(Term, float) FuzzyQuery(term, 0.5f, 0)}.
*/
public FuzzyQuery(Term term) {
this(term, defaultMinSimilarity, defaultPrefixLength);
}
...
}
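As the code shows, minimumSimilarity must satisfy 0 <= minimumSimilarity < 1, and prefixLength must be >= 0. The similarity is based on the Levenshtein algorithm. The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to turn one string into another. The allowed operations are replacing one character with another, inserting a character, and deleting a character. For example, turning kitten into sitting takes three edits:
sitten (k→s) sittin (e→i) sitting (→g)
The plain Levenshtein distance gives every operation (replacement, insertion, and deletion) the same weight. Some implementations, such as PHP's levenshtein() function, accept optional insert, replace, and delete cost parameters; the FuzzyQuery constructors shown above expose no such parameters.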
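The edit distance described above can be computed with the classic dynamic-programming table. The sketch below is a generic implementation, not the code Lucene uses. The similarity method is an assumption: it normalizes the distance by the length of the shorter string, which is roughly how the minimumSimilarity threshold appears to be applied, but the exact formula in Lucene's term enumeration is not shown in this article.

```java
// Generic Levenshtein distance; LevenshteinSketch is a hypothetical name.
class LevenshteinSketch {
    // Minimum number of single-character replacements, insertions, and deletions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;  // "" -> b[0..j) needs j inserts
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                                     // a[0..i) -> "" needs i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // replacement or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;       // reuse the two rows
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / length of the shorter string.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) return a.length() == b.length() ? 1.0f : 0.0f;
        return 1.0f - (float) distance(a, b) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(similarity("jama", "java"));
    }
}
```

Under that assumed formula, a higher minimumSimilarity leaves room for fewer edits: "jama" is one edit away from "java", so it passes a threshold of 0.5 but not one of 0.9.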
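TermRangeQuery (extends MultiTermQuery)
TermRangeQuery has two constructors (from the API documentation). By default, terms are compared as Strings, but you can pass your own Collator to change the ordering. (Possibly for good reason, the field is not supplied through a Term here; it is passed in directly as a String.)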
TermRangeQuery(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper) |
Constructs a query selecting all terms greater/equal than lowerTerm but less/equal than upperTerm. |
TermRangeQuery(String field, String lowerTerm, String upperTerm, boolean includeLower, boolean includeUpper, Collator collator) |
Constructs a query selecting all terms greater/equal than lowerTerm but less/equal than upperTerm. |
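Since the default comparison is a plain String comparison, the range check can be sketched with String.compareTo. TermRangeSketch and inRange are hypothetical names for illustration; the real query walks the term dictionary of the index instead of testing terms one by one like this.

```java
// Hypothetical illustration of the default (non-Collator) range test.
class TermRangeSketch {
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);   // > 0 if term sorts after lower
        int hi = term.compareTo(upper);   // < 0 if term sorts before upper
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(inRange("java", "jama", "jaza", true, true));
        System.out.println(inRange("java", "jama", "jana", true, true));
    }
}
```

The term "java" falls inside [jama, jaza] because 'v' sorts between 'm' and 'z', but outside [jama, jana] because 'v' sorts after 'n'.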
NumericRangeQuery (extends MultiTermQuery)
(From the API documentation. I left precisionStep at its default, NumericUtils.PRECISION_STEP_DEFAULT (4), and have not studied it in depth.)
Quote
A Query that matches numeric values within a specified range. To use this, you must first index the numeric values using NumericField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query.
So if you want to use NumericRangeQuery, you need to build the index with NumericField. The API documentation also warns that NumericRangeQuery and NumericField may change incompatibly in a future release:
Quote
NOTE: This API is experimental and might change in incompatible ways in the next release.
This query can be created with the following eight static factory methods:
static NumericRangeQuery<Double> newDoubleRange(String field, Double min, Double max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a double range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
static NumericRangeQuery<Double> newDoubleRange(String field, int precisionStep, Double min, Double max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a double range using the given precisionStep. |
static NumericRangeQuery<Float> newFloatRange(String field, Float min, Float max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a float range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
static NumericRangeQuery<Float> newFloatRange(String field, int precisionStep, Float min, Float max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a float range using the given precisionStep. |
static NumericRangeQuery<Integer> newIntRange(String field, Integer min, Integer max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a int range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
static NumericRangeQuery<Integer> newIntRange(String field, int precisionStep, Integer min, Integer max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a int range using the given precisionStep. |
static NumericRangeQuery<Long> newLongRange(String field, int precisionStep, Long min, Long max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a long range using the given precisionStep. |
static NumericRangeQuery<Long> newLongRange(String field, Long min, Long max, boolean minInclusive, boolean maxInclusive) |
Factory that creates a NumericRangeQuery, that queries a long range using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
NumericField has four constructors (from the API documentation):
NumericField(String name) |
Creates a field for numeric values using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
NumericField(String name, Field.Store store, boolean index) |
Creates a field for numeric values using the default precisionStep NumericUtils.PRECISION_STEP_DEFAULT (4). |
NumericField(String name, int precisionStep) |
Creates a field for numeric values with the specified precisionStep. |
NumericField(String name, int precisionStep, Field.Store store, boolean index) |
Creates a field for numeric values with the specified precisionStep. |
It provides the following setter methods:
NumericField setDoubleValue(double value) |
Initializes the field with the supplied double value. |
NumericField setFloatValue(float value) |
Initializes the field with the supplied float value. |
NumericField setIntValue(int value) |
Initializes the field with the supplied int value. |
NumericField setLongValue(long value) |
Initializes the field with the supplied long value. |
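This makes it possible to index and search not only the primitive numeric types but also other data types whose values can be converted to them, such as Date and Calendar.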
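As a small illustration of the conversion step, a Date can be reduced to a long (milliseconds since the epoch) and restored afterwards. DateAsLong and its two methods are hypothetical helper names; feeding the resulting long to setLongValue and querying it with newLongRange is the assumed usage, not something this sketch does.

```java
import java.util.Date;

// Hypothetical helper: convert a Date to a long suitable for a numeric field, and back.
class DateAsLong {
    static long toIndexValue(Date date) {
        return date.getTime();        // milliseconds since 1970-01-01T00:00:00Z
    }

    static Date fromIndexValue(long value) {
        return new Date(value);
    }

    public static void main(String[] args) {
        Date now = new Date();
        long stored = toIndexValue(now);
        System.out.println(fromIndexValue(stored).equals(now));
    }
}
```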
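RegexQuery (extends MultiTermQuery)
The API documentation describes this class, but it is not in the Lucene 3.0.2 core jar. I am not sure why. My first guess was that its performance was not satisfactory enough to ship with 3.0.2; it may also simply be packaged in a separate contrib module rather than in core.
The queries described above should help in understanding the Query concept. The code below exercises each of them. (The comments show the output of each run.)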
package com.eric.lucene;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
/**
* The comments show the output of each run.
* @author Yuanbo Han
*
*/
public class QueryTest {
public static Query getTermQuery(){
TermQuery query = new TermQuery(new Term("bookname","java"));
return query;
// thinking in java 0.625
// thinking in java IV(Java Classic) 0.61871845
}
public static Query getBooleanQuery(){
TermQuery termQuery2 = new TermQuery(new Term("bookname", "thinking"));
TermQuery termQuery1 = new TermQuery(new Term("bookname", "java"));
BooleanQuery query = new BooleanQuery();
query.add(termQuery1, BooleanClause.Occur.SHOULD);
query.add(termQuery2, BooleanClause.Occur.SHOULD);
return query;
// thinking in java 0.76735055
// thinking in java IV(Java Classic) 0.68474615
// thinking in c++ 0.12914689
}
public static Query getPhraseQuery(){
PhraseQuery query = new PhraseQuery();
//query.setSlop(1);
//thinking in java 0.75674474
//thinking in java IV(Java Classic) 0.5297213
query.setSlop(0);// no result: the exact phrase "thinking java" does not exist
query.add(new Term("bookname", "thinking"));
query.add(new Term("bookname", "java"));
return query;
}
public static Query getWildcardQuery(){
//WildcardQuery query = new WildcardQuery(new Term("bookname","think*"));
//thinking in java 1.0
//thinking in java IV(Java Classic) 1.0
//thinking in c++ 1.0
//WildcardQuery query = new WildcardQuery(new Term("bookname","ja?a"));
//thinking in java 1.0
//thinking in java IV(Java Classic) 1.0
WildcardQuery query = new WildcardQuery(new Term("bookname","ja?a*"));
//thinking in java 1.0
//thinking in java IV(Java Classic) 1.0
return query;
}
public static Query getPrefixQuery(){
PrefixQuery query = new PrefixQuery(new Term("bookname","java"));// matches terms that start with "java"
//thinking in java 1.0
//thinking in java IV(Java Classic) 1.0
return query;
}
public static Query getFuzzyQuery(){
//FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"));// default: similarity = 0.5, prefixLength = 0.
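/* I was not sure exactly how the edit distance is calculated, and at first the Javadoc looked wrong to me.
Higher similarity should mean fewer edits are allowed, yet the Javadoc says: "For example, for a minimumSimilarity of 0.5, a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term)*0.5", which I read as allowing more edits as the similarity grows.
That reading only arises because 0.5 happens to equal 1 - 0.5. Assuming similarity is computed roughly as 1 - editDistance/length, the allowed distance is length*(1 - minimumSimilarity), so it shrinks as minimumSimilarity grows.
This agrees with the test: "jama" is one edit away from "java" (similarity about 0.75), so 0.5 matches and 0.9 returns no result. */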
//thinking in java 0.625
//thinking in java IV(Java Classic) 0.61871845
// FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.9f);//no result
// FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.5f,3);//no result
FuzzyQuery query = new FuzzyQuery(new Term("bookname","jama"),0.5f,2);
//thinking in java 0.625
//thinking in java IV(Java Classic) 0.61871845
return query;
}
public static Query getTermRangeQuery(){
// TermRangeQuery query = new TermRangeQuery("bookname", "jama", "jaza", true, true);
//thinking in java 1.0
//thinking in java IV(Java Classic) 1.0
TermRangeQuery query = new TermRangeQuery("bookname", "jama", "jana", true, true);// no result
return query;
}
public static Query getNumericRangeQuery(){
// Query query = NumericRangeQuery.newFloatRange("bookname", 0.3f, 0.10f, true, true);// no result
/* if let the document add the fields below,(if you want to use NumericRangeQuery, you should create the index using the NumericField)
doc1.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.1f));
doc2.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.5f));
doc3.add(new NumericField("value", Field.Store.YES, true).setFloatValue(0.1f));
and change the print statement in the output loop to System.out.print(doc.get("value") + "\t\t");
Output:
0.1 1.0
0.5 1.0
0.1 1.0
*/
Query query = NumericRangeQuery.newFloatRange("value", null, null, true, true);// null bounds leave the range open; no result unless the "value" NumericFields above are indexed
return query;
}
/**
* RegexQuery and its related interfaces appear in the API documentation,
* but the class is not in the Lucene 3.0.2 core jar.
* Possibly it ships in a separate contrib module,
* or its performance was not considered satisfactory.
* @return
*/
public static Query getRegexQuery(){
return null;
}
public static void main(String[] args) throws Exception {
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(
dir, new StandardAnalyzer(Version.LUCENE_30), true,
IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Document doc2 = new Document();
Document doc3 = new Document();
doc1.add(new Field("bookname","thinking in java", Field.Store.YES, Field.Index.ANALYZED));
doc2.add(new Field("bookname","thinking in java IV(Java Classic)", Field.Store.YES, Field.Index.ANALYZED));
doc3.add(new Field("bookname","thinking in c++", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);
writer.optimize();
writer.close();
IndexSearcher searcher = new IndexSearcher(dir);
Query query = QueryTest.getNumericRangeQuery();
TopScoreDocCollector collector = TopScoreDocCollector.create(100, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for(int i=0; i<hits.length;i++){
Document doc = searcher.doc(hits[i].doc);
System.out.print(doc.get("bookname") + "\t\t");
System.out.println(hits[i].score);
}
}
}
9. Ranking Algorithms in Lucene and Their Improvement