IndexAnalysis is the segmentation mode that the ansj tokenizer provides for search engines. It performs the finest-grained segmentation. Take the following sentence as an example:
看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~
This sentence is segmented into: [看热闹/v, :/w, 2014/m, 年度/n, 足坛/n, 主教练/n, 收入/n, 榜/n, 公布/v, ,/w, 温格/nr, 是/v, 真/d, ·/w, 阿森纳/nr, 代言人/n, 啊/y, ~, , 热闹, 主教, 教练]
In other words, the two words "看热闹" and "主教练" are each split further, producing three extra words: 热闹, 主教, 教练.
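To reproduce the segmentation, here is a minimal sketch. It assumes the ansj 2.x-era API, in which org.ansj.splitWord.analysis.IndexAnalysis.parse(String) returns a List<Term>, and it uses the Term methods getName()/getOffe() that also appear in the source quoted below; adjust to your ansj version if needed.

import java.util.List;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.IndexAnalysis;

public class SegmentDemo {
    public static void main(String[] args) {
        List<Term> terms = IndexAnalysis.parse("看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~");
        for (Term term : terms) {
            // Prints every term together with its character offset in the original text;
            // the sub-words 热闹 / 主教 / 教练 show up at the tail of the list even though
            // their offsets point back into the middle of the sentence.
            System.out.println(term.getName() + "\toffset=" + term.getOffe());
        }
    }
}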
The splitting itself is not the problem; the problem is where these three words are placed: the finer-grained words are appended to the very end of the result! The source code reads:
/**
 * Segmentation for indexing (search)
 *
 * @return
 */
private List<Term> result() {
    String temp = null;
    List<Term> result = new LinkedList<Term>();
    int length = graph.terms.length - 1;
    for (int i = 0; i < length; i++) {
        if (graph.terms[i] != null) {
            result.add(graph.terms[i]);
        }
    }
    LinkedList<Term> last = new LinkedList<Term>();
    for (Term term : result) {
        if (term.getName().length() >= 3) {
            GetWordsImpl gwi = new GetWordsImpl(term.getName());
            while ((temp = gwi.allWords()) != null) {
                if (temp.length() < term.getName().length() && temp.length() > 1) {
                    last.add(new Term(temp, gwi.offe + term.getOffe(), TermNatures.NULL));
                }
            }
        }
    }
    result.addAll(last);
    setRealName(graph, result);
    return result;
}
It first iterates over all terms and adds them to the result list, then checks each term in result to see whether it can be segmented further; if it can, the sub-words are produced and finally appended as a batch to the end of result.
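The effect on the Lucene token stream can be made visible with a small dump. This is only a sketch: it assumes the AnsjIndexAnalysis constructor used in the demo below, and that the tokenizer emits each term with a position increment of 1, in which case the sub-words appended at the tail get the largest positions even though their start offsets point back into the middle of the text.

import java.util.HashSet;

import org.ansj.lucene4.AnsjIndexAnalysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TokenDump {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new AnsjIndexAnalysis(new HashSet<String>(), false);
        TokenStream ts = analyzer.tokenStream("content",
                "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~");
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int pos = -1;
        while (ts.incrementToken()) {
            pos += posIncrAtt.getPositionIncrement();
            // "教练" ends up with a larger position than "阿森纳" although its
            // offsets lie before those of "阿森纳" in the original text.
            System.out.println(termAtt.toString() + "\tpos=" + pos
                    + "\toffsets=[" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + "]");
        }
        ts.end();
        ts.close();
    }
}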
When we search for "阿森纳教练" and also need fast vector highlighting, an exception is thrown. Let's write a demo to test it:
package org.ansj.ansj_lucene4_plug;

import org.ansj.lucene4.AnsjAnalysis;
import org.ansj.lucene4.AnsjIndexAnalysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.elasticsearch.search.highlight.vectorhighlight.SourceSimpleFragmentsBuilder;

import java.util.HashSet;

/**
 * Created by xiaojun on 2016/1/11.
 */
public class Test3 {
    public static void main(String[] args) throws Exception {
        HashSet hs = new HashSet();
        hs.add("的");
        Analyzer analyzer = new AnsjIndexAnalysis(hs, false);
        Directory directory = null;
        IndexWriter iwriter = null;
        // String text = "我发现帖子系统有个BUG啊,怎么解决?怎么破";
        String text = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
        // String text = "我去过白杨树林!";
        // UserDefineLibrary.insertWord("阿森纳", "n", 1000);
        // UserDefineLibrary.insertWord("系统", "n", 1000);
        IndexWriterConfig ic = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        directory = new RAMDirectory();
        iwriter = new IndexWriter(directory, ic);
        Document document = new Document();
        document.add(new TextField("_id", "1", Field.Store.YES));
        document.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        iwriter.addDocument(document);
        // iwriter.commit();
        // iwriter.close();
        IndexReader reader = DirectoryReader.open(iwriter, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        FastVectorHighlighter highlighter = new FastVectorHighlighter();
        TopDocs topDocs = searcher.search(new TermQuery(new Term("_id", "1")), 1);
        // assertThat(topDocs.totalHits, equalTo(1));
        // String fragment = highlighter.getBestFragment(highlighter.getFieldQuery(new TermQuery(new Term("content", "阿森纳"))),
        //         reader, topDocs.scoreDocs[0].doc, "content", 30);
        // assertThat(fragment, notNullValue());
        // assertThat(fragment, equalTo("the big bad dog"));
        // System.out.println(fragment);
        Analyzer queryAnalyzer = new AnsjAnalysis(hs, false);
        QueryParser tq = new QueryParser("content", queryAnalyzer);
        String queryStr = "阿森纳教练";
        Query query = tq.createBooleanQuery("content", queryStr);
        System.out.println("query:" + query);
        TopDocs hits = searcher.search(query, 5);
        System.out.println(queryStr + ":共找到" + hits.totalHits + "条记录!");
        String fragment2 = highlighter.getBestFragment(highlighter.getFieldQuery(query),
                reader, topDocs.scoreDocs[0].doc, "content", 30);
        // assertThat(fragment, notNullValue());
        // assertThat(fragment, equalTo("the big bad dog"));
        System.out.println(fragment2);
    }
}
Running this demo produces the following exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -16
at java.lang.String.substring(String.java:1911)
at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:178)
at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:144)
at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragment(BaseFragmentsBuilder.java:111)
at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragment(BaseFragmentsBuilder.java:95)
at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragment(FastVectorHighlighter.java:116)
at org.ansj.ansj_lucene4_plug.Test3.main(Test3.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
The exception is thrown from BaseFragmentsBuilder.makeFragment, the top frame of the stack trace:
protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
    String[] preTags, String[] postTags, Encoder encoder ){
  StringBuilder fragment = new StringBuilder();
  final int s = fragInfo.getStartOffset();
  int[] modifiedStartOffset = { s };
  String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
  int srcIndex = 0;
  for( SubInfo subInfo : fragInfo.getSubInfos() ){
    for( Toffs to : subInfo.getTermsOffsets() ){
      fragment
        .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
        .append( getPreTag( preTags, subInfo.getSeqnum() ) )
        .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
        .append( getPostTag( postTags, subInfo.getSeqnum() ) );
      srcIndex = to.getEndOffset() - modifiedStartOffset[0];
    }
  }
  fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
  return fragment.toString();
}
The logic for marking the highlights works as follows:
At the beginning srcIndex = 0, and the first matched term found is "阿森纳". Why is "阿森纳" the first term rather than "教练", even though "教练" appears before "阿森纳" in the original sentence? As analyzed in the IndexAnalysis source above, "教练" is a sub-word of "主教练" and was appended to the end of the term vector, so the position of "阿森纳" is smaller than the position of "教练".
The plain fragment preceding the first highlighted term therefore runs from 0 to that term's start offset. "阿森纳" has offsets [26, 29], so the first piece is 0 to 26, and after this iteration srcIndex becomes 29, i.e. the next piece is looked for starting at 29. But the second term, "教练", has offsets [13, 15], so src.substring(29, 13) is attempted. The two arguments of substring are the start and end of the slice, and the end must not be smaller than the start, otherwise an index-out-of-range exception is thrown.
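Here is a worked example of the failing call, using the offsets above and assuming modifiedStartOffset[0] is 0, i.e. the fragment starts at the beginning of the text:

public class SubstringDemo {
    public static void main(String[] args) {
        String src = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
        // 1st SubInfo, "阿森纳" at [26, 29]: append src.substring(0, 26), then the
        // highlighted src.substring(26, 29); srcIndex becomes 29.
        // 2nd SubInfo, "教练" at [13, 15]: src.substring(29, 13) is attempted,
        // and 13 - 29 = -16 is exactly the "-16" reported in the stack trace above.
        System.out.println(src.substring(29, 13)); // StringIndexOutOfBoundsException: -16
    }
}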
All of this boils down to one thing: if the loop handled "教练" before "阿森纳", there would be no problem. The order depends on the order of terms in the fragInfo.getSubInfos() list, and that order is ultimately determined by each Term's position. See the constructor of FieldTermStack:
public FieldTermStack( IndexReader reader, int docId, String fieldName, final FieldQuery fieldQuery ) throws IOException {
  this.fieldName = fieldName;

  Set<String> termSet = fieldQuery.getTermSet( fieldName );
  // just return to make null snippet if un-matched fieldName specified when fieldMatch == true
  if( termSet == null ) return;

  final Fields vectors = reader.getTermVectors(docId);
  if (vectors == null) {
    // null snippet
    return;
  }

  final Terms vector = vectors.terms(fieldName);
  if (vector == null) {
    // null snippet
    return;
  }

  final CharsRefBuilder spare = new CharsRefBuilder();
  final TermsEnum termsEnum = vector.iterator(null);
  DocsAndPositionsEnum dpEnum = null;
  BytesRef text;

  int numDocs = reader.maxDoc();

  while ((text = termsEnum.next()) != null) {
    spare.copyUTF8Bytes(text);
    final String term = spare.toString();
    if (!termSet.contains(term)) {
      continue;
    }
    dpEnum = termsEnum.docsAndPositions(null, dpEnum);
    if (dpEnum == null) {
      // null snippet
      return;
    }

    dpEnum.nextDoc();

    // For weight look here: http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/DefaultSimilarity.html
    final float weight = ( float ) ( Math.log( numDocs / ( double ) ( reader.docFreq( new Term(fieldName, text) ) + 1 ) ) + 1.0 );

    final int freq = dpEnum.freq();

    for(int i = 0;i < freq;i++) {
      int pos = dpEnum.nextPosition();
      if (dpEnum.startOffset() < 0) {
        return; // no offsets, null snippet
      }
      termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), pos, weight ) );
    }
  }

  // sort by position
  Collections.sort(termList);

  // now look for dups at the same position, linking them together
  int currentPos = -1;
  TermInfo previous = null;
  TermInfo first = null;
  Iterator<TermInfo> iterator = termList.iterator();
  while (iterator.hasNext()) {
    TermInfo current = iterator.next();
    if (current.position == currentPos) {
      assert previous != null;
      previous.setNext(current);
      previous = current;
      iterator.remove();
    } else {
      if (previous != null) {
        previous.setNext(first);
      }
      previous = first = current;
      currentPos = current.position;
    }
  }
  if (previous != null) {
    previous.setNext(first);
  }
}
// sort by position
Collections.sort(termList);
In other words, termList is sorted by position in ascending order.
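The following sketch shows what that sort means for the two matched terms here. The positions are made-up values (only their relative order matters), and the class below is just a stand-in for Lucene's TermInfo: sorting by position puts "阿森纳" ahead of "教练", so makeFragment later walks the offsets in the order 26 then 13 instead of 13 then 26.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortByPositionDemo {
    // Stand-in for FieldTermStack.TermInfo: term text, offsets, and token position.
    static class Info {
        final String term; final int start; final int end; final int position;
        Info(String term, int start, int end, int position) {
            this.term = term; this.start = start; this.end = end; this.position = position;
        }
    }

    public static void main(String[] args) {
        List<Info> termList = new ArrayList<Info>();
        // "教练" was appended at the tail of the token stream, so its (assumed)
        // position 22 is larger than the (assumed) position 18 of "阿森纳",
        // even though its offsets come first in the text.
        termList.add(new Info("教练", 13, 15, 22));
        termList.add(new Info("阿森纳", 26, 29, 18));
        Collections.sort(termList, new Comparator<Info>() {
            public int compare(Info a, Info b) { return a.position - b.position; }
        });
        for (Info info : termList) {
            // Prints 阿森纳 [26,29] first, then 教练 [13,15]: the offsets run backwards.
            System.out.println(info.term + " [" + info.start + "," + info.end + "] pos=" + info.position);
        }
    }
}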
At this point the problem should be clear. Our goal is to change where "教练" appears: if every sub-word immediately follows the original word it was split from, the problem goes away.
So we can modify IndexAnalysis's result() method as follows:
/**
 * Segmentation for indexing (search)
 *
 * @return
 */
private List<Term> result() {
    String temp = null;
    List<Term> result = new LinkedList<Term>();
    int length = graph.terms.length - 1;
    for (int i = 0; i < length; i++) {
        if (graph.terms[i] != null) {
            Term term = graph.terms[i];
            result.add(term);
            if (term.getName().length() >= 3 && !Arrays.asList(new String[]{"nr", "nt", "nrf", "nnt", "nsf", "adv", "nz"}).contains(term.getNatureStr())) {
                GetWordsImpl gwi = new GetWordsImpl(term.getName());
                while ((temp = gwi.allWords()) != null) {
                    if (temp.length() < term.getName().length() && temp.length() > 1) {
                        result.add(new Term(temp, gwi.offe + term.getOffe(), TermNatures.NULL));
                    }
                }
            }
        }
    }
    setRealName(graph, result);
    return result;
}
That is, while iterating over the terms we immediately check whether each term needs further splitting, so the sub-words are inserted right behind the original word. I also added a restriction on part of speech here: the original code only checks that the term is at least 3 characters long, but in my opinion proper nouns do not need to be split further; splitting them actually degrades search quality.
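One small refinement of the patch above (my own suggestion, not part of the original fix): the nature list is rebuilt with Arrays.asList on every term, so it can be hoisted into a constant set that is built once. NO_SPLIT_NATURES is a hypothetical field name; the imports needed are java.util.Arrays, java.util.HashSet and java.util.Set.

// Declared once as a field of IndexAnalysis instead of inside the loop:
private static final Set<String> NO_SPLIT_NATURES = new HashSet<String>(
        Arrays.asList("nr", "nt", "nrf", "nnt", "nsf", "adv", "nz"));

// ...and inside result():
if (term.getName().length() >= 3 && !NO_SPLIT_NATURES.contains(term.getNatureStr())) {
    // split into sub-words as before
}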
After this change, running the segmentation again gives the following result:
[看热闹/v, 热闹, :/w, 2014/m, 年度/n, 足坛/n, 主教练/n, 主教, 教练, 收入/n, 榜/n, 公布/v, ,/w, 温格/nrf, 是/v, 真/d, ·/w, 阿森纳/nz, 代言人/n, 啊/y, ~, ]
You can see that "主教" and "教练" now appear immediately after "主教练".
Running the highlighting demo above again now gives the correct result:
看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~