For background on Lucene payloads, the following two articles give very detailed introductions and are well worth reading:
http://www.ibm.com/developerworks/cn/opensource/os-cn-lucene-pl/
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
As an example, consider the following requirement.
We have two documents with very similar content:

Document 1: egg tomato potato bread
Document 2: egg book potato bread

Suppose I want to search for foods with the keyword egg. How can I tell which of the two documents is the better match?
Looking at the two documents, everything described in document 1 is a food, while book in document 2 is not. Given the requirement, document 1 is more relevant than document 2 and should rank higher in the query results.
Using the approach described in http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ above, we can record, per document, a weight for how strongly a given term counts in that document (measured against the foods query, egg should count for more in document 1 than in document 2). Before indexing, we preprocess the documents and attach payload information to egg, for example:

Document 1: egg|0.984 tomato potato bread
Document 2: egg|0.356 book potato bread

We then index these documents, and Lucene's PayloadTermQuery can tell the two occurrences of the term egg apart. Internally, Lucene takes the payload value we stored (the number after the "|" separator above) and multiplies it into tf before the rest of the weight computation proceeds.
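The score explanations at the end of this article make this concrete: the payload-aware term factor (btq in the explain output) is exactly tf × payload, e.g. 0.70710677 × 0.984 = 0.6957931 for the first document versus 0.70710677 × 0.356 = 0.25173002 for the second, with every other factor identical.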
Next we go one step further: instead of modifying the source text, we add a separate Field to carry the payload data, so the original documents stay untouched. For instance, we can run a classification step over each document before indexing and, for each category, compute a payload value expressing how strongly the document belongs to it; a sketch of that preprocessing follows.
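A minimal sketch of that preprocessing step, assuming a hypothetical classifier object with a classify method (my assumption for illustration, not a Lucene API), could serialize the per-category scores into the field value format used in the next step:

// Hypothetical sketch: serialize per-category classifier scores into the
// "category" field value. "classifier" and classify(...) are assumptions,
// not Lucene APIs.
Map<String, Float> scores = classifier.classify(text); // e.g. {foods=0.984, shopping=0.503}
StringBuilder value = new StringBuilder();
for (Map.Entry<String, Float> e : scores.entrySet()) {
    if (value.length() > 0) value.append(' ');
    value.append(e.getKey()).append('|').append(e.getValue());
}
doc.add(new Field("category", value.toString(), Field.Store.YES, Field.Index.ANALYZED)); // "foods|0.984 shopping|0.503"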
To make use of the stored payload data in this scenario, we need to go through the following steps.
Step 1: Prepare the data to be indexed
For example, add a category Field to store the category information, and a content Field to store the content shown above:
Document 1:

new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED)

Document 2:

new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED)
Step 2: Implement an Analyzer that parses the payload data
The payload information lives in the category Field: categories are separated by spaces, and each category is separated from its weight by "|", so our Analyzer has to parse that format. Lucene provides DelimitedPayloadTokenFilter for exactly this delimiter-separated case. Our implementation looks like this:
package org.shirdrn.lucene.query.payloadquery;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

public class PayloadAnalyzer extends Analyzer {

    private PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    @SuppressWarnings("deprecation")
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // tokenize the space-separated categories
        result = new DelimitedPayloadTokenFilter(result, '|', encoder); // then parse the payload off each token
        return result;
    }
}
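Before wiring the Analyzer into indexing, it helps to check what it actually emits. The small test class below is my own addition (placed in the same package, since the constructor above is package-private); it prints each category token together with its decoded payload:

package org.shirdrn.lucene.query.payloadquery;

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public class PayloadAnalyzerTest {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new PayloadAnalyzer(new FloatEncoder())
                .tokenStream("category", new StringReader("foods|0.984 shopping|0.503"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PayloadAttribute payload = ts.addAttribute(PayloadAttribute.class);
        while (ts.incrementToken()) {
            // DelimitedPayloadTokenFilter strips "|<weight>" from each token
            // and stores the encoded weight as the token's payload
            System.out.println(term.toString() + " -> "
                    + PayloadHelper.decodeFloat(payload.getPayload().getData(), 0));
            // prints: foods -> 0.984, shopping -> 0.503
        }
    }
}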
Step 3: Implement a Similarity to compute the score
Lucene's Similarity class provides a scorePayload method through which the payload value contributes to a document's score. We override that method as follows:
package org.shirdrn.lucene.query.payloadquery;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {

    private static final long serialVersionUID = 1L;

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        return PayloadHelper.decodeFloat(payload, offset); // decode the float stored at index time
    }
}
The PayloadHelper utility class decodes the stored payload bytes back into the float value, which then takes effect when the document score is computed.
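For reference, PayloadHelper also provides the matching encoder; the FloatEncoder we use at index time below performs the same encoding, so the round trip can be sketched in two lines:

byte[] bytes = PayloadHelper.encodeFloat(0.984f); // what gets stored as the payload at index time
float weight = PayloadHelper.decodeFloat(bytes, 0); // what scorePayload reads back: 0.984f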
Step 4: Create the index
When creating the index we plug in the Analyzer and Similarity implemented above. The code is as follows:
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class PayloadIndexing {

    private IndexWriter indexWriter = null;
    private final Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder()); // PayloadAnalyzer with a float payload encoder
    private final Similarity similarity = new PayloadSimilarity(); // instantiate our PayloadSimilarity
    private IndexWriterConfig config = null;

    public PayloadIndexing(String indexPath) throws CorruptIndexException, LockObtainFailedException, IOException {
        File indexFile = new File(indexPath);
        config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        config.setOpenMode(OpenMode.CREATE).setSimilarity(similarity); // set the Similarity used for scoring
        indexWriter = new IndexWriter(FSDirectory.open(indexFile), config);
    }

    public void index() throws CorruptIndexException, IOException {
        Document doc1 = new Document();
        doc1.add(new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED));
        doc1.add(new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED));
        doc2.add(new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc2);

        indexWriter.close();
    }

    public static void main(String[] args) throws CorruptIndexException, IOException {
        new PayloadIndexing("E:\\index").index();
    }
}
Step 5: Query
At query time we construct a PayloadTermQuery to search, as shown below:
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.NIOFSDirectory;

public class PayloadSearching {

    private IndexReader indexReader;
    private IndexSearcher searcher;

    public PayloadSearching(String indexPath) throws CorruptIndexException, IOException {
        indexReader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)), true);
        searcher = new IndexSearcher(indexReader);
        searcher.setSimilarity(new PayloadSimilarity()); // set our custom PayloadSimilarity
    }

    public ScoreDoc[] search(String qsr) throws ParseException, IOException {
        int hitsPerPage = 10;
        BooleanQuery bq = new BooleanQuery();
        for (String q : qsr.split(" ")) {
            bq.add(createPayloadTermQuery(q), Occur.MUST);
        }
        TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, true);
        searcher.search(bq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc; // document id
            Explanation explanation = searcher.explain(bq, docId);
            System.out.println(explanation.toString());
        }
        return hits;
    }

    public void display(ScoreDoc[] hits, int start, int end) throws CorruptIndexException, IOException {
        end = Math.min(hits.length, end);
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            int docId = hits[i].doc; // document id
            float score = hits[i].score; // document score
            System.out.println(docId + "\t" + score + "\t" + doc + "\t");
        }
    }

    public void close() throws IOException {
        searcher.close();
        indexReader.close();
    }

    private PayloadTermQuery createPayloadTermQuery(String item) {
        PayloadTermQuery ptq = null;
        if (item.indexOf("^") != -1) { // query clause of the form field:token^boost
            String[] a = item.split("\\^");
            String field = a[0].split(":")[0];
            String token = a[0].split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
            ptq.setBoost(Float.parseFloat(a[1].trim()));
        } else { // query clause of the form field:token
            String field = item.split(":")[0];
            String token = item.split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
        }
        return ptq;
    }

    public static void main(String[] args) throws ParseException, IOException {
        int start = 0, end = 10;
        // String queries = "category:foods^123.0 content:bread^987.0";
        String queries = "category:foods content:egg";
        PayloadSearching payloadSearcher = new PayloadSearching("E:\\index");
        payloadSearcher.display(payloadSearcher.search(queries), start, end);
        payloadSearcher.close();
    }
}
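A note on the AveragePayloadFunction passed to PayloadTermQuery in createPayloadTermQuery: it averages the payloads of all matching occurrences of the term in the field. Lucene also ships MinPayloadFunction and MaxPayloadFunction in the same package, so if, say, only the strongest occurrence should count, the query could be built instead as:

ptq = new PayloadTermQuery(new Term(field, token), new MaxPayloadFunction());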
In the query results we can see how the two documents are ranked by relevance:
0 0.3314532 Document<stored,indexed,tokenized<category:foods|0.984 shopping|0.503> stored,indexed,tokenized<content:egg tomato potato bread>>
1 0.21477573 Document<stored,indexed,tokenized<category:foods|0.356 shopping|0.791> stored,indexed,tokenized<content:egg book potato bread>>
The explanation of the score computation is printed as follows:
0.3314532 = (MATCH) sum of:
  0.18281947 = (MATCH) weight(category:foods in 0), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category:foods=2)
      1.1893445 = queryNorm
    0.2585458 = (MATCH) fieldWeight(category:foods in 0), product of:
      0.6957931 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.984 = scorePayload(...)
      0.5945349 = idf(category:foods=2)
      0.625 = fieldNorm(field=category, doc=0)
  0.14863372 = (MATCH) weight(content:egg in 0), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content:egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 0), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content:egg=2)
      0.5 = fieldNorm(field=content, doc=0)

0.21477571 = (MATCH) sum of:
  0.066142 = (MATCH) weight(category:foods in 1), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category:foods=2)
      1.1893445 = queryNorm
    0.09353892 = (MATCH) fieldWeight(category:foods in 1), product of:
      0.25173002 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.356 = scorePayload(...)
      0.5945349 = idf(category:foods=2)
      0.625 = fieldNorm(field=category, doc=1)
  0.14863372 = (MATCH) weight(content:egg in 1), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content:egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 1), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content:egg=2)
      0.5 = fieldNorm(field=content, doc=1)
Comparing the two explanations, everything is identical except for the payload value multiplied into tf. In other words, the payload we stored contributed extra score to document 0 exactly as intended and moved it up in the ranking. Without payloads, the two documents would receive identical scores for this query (you can simulate this by giving both documents the same payload values and rerunning the test).
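For example, a quick way to run that check is to change the two category fields in PayloadIndexing.index() to the lines below (0.5 is an arbitrary value; any number works as long as both documents use the same one), then re-index and re-run the query; both documents should then come back with equal scores:

doc1.add(new Field("category", "foods|0.5 shopping|0.5", Field.Store.YES, Field.Index.ANALYZED));
doc2.add(new Field("category", "foods|0.5 shopping|0.5", Field.Store.YES, Field.Index.ANALYZED));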