An analysis of a bug: ansj's IndexAnalysis tokenization breaks elasticsearch's fast vector highlighting

IndexAnalysis is the tokenization mode that the ansj segmenter provides for search engines; it performs the finest-grained segmentation. Take the following sentence as an example:

看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~


This sentence is tokenized into: [看热闹/v, :/w, 2014/m, 年度/n, 足坛/n, 主教练/n, 收入/n, 榜/n, 公布/v, ,/w, 温格/nr, 是/v, 真/d, ·/w, 阿森纳/nr, 代言人/n, 啊/y, ~, , 热闹, 主教, 教练]

That is, the words "看热闹" and "主教练" are further split into three sub-words: 热闹, 主教, 教练.
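If you want to reproduce this term list yourself, a minimal sketch along the following lines should do. It assumes the static IndexAnalysis.parse entry point of ansj_seg; the exact return type varies slightly between ansj versions, but it is iterable over Term either way:

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.IndexAnalysis;

public class IndexAnalysisDemo {
    public static void main(String[] args) {
        String text = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
        // print every term together with its offset in the original sentence;
        // with the unpatched IndexAnalysis, 热闹 / 主教 / 教练 come out at the very end
        for (Term term : IndexAnalysis.parse(text)) {
            System.out.println(term.getName() + "  offe=" + term.getOffe());
        }
    }
}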

The segmentation itself is fine; the problem is where those three sub-words are placed: they are appended to the end of the term list! The source code looks like this:

/**
 * Segmentation for indexing / retrieval
 *
 * @return
 */
private List<Term> result() {

    String temp = null;

    List<Term> result = new LinkedList<Term>();
    int length = graph.terms.length - 1;
    for (int i = 0; i < length; i++) {
        if (graph.terms[i] != null) {
            result.add(graph.terms[i]);
        }
    }

    LinkedList<Term> last = new LinkedList<Term>();
    for (Term term : result) {
        if (term.getName().length() >= 3) {
            GetWordsImpl gwi = new GetWordsImpl(term.getName());
            while ((temp = gwi.allWords()) != null) {
                if (temp.length() < term.getName().length() && temp.length() > 1) {
                    last.add(new Term(temp, gwi.offe + term.getOffe(), TermNatures.NULL));
                }
            }
        }
    }

    result.addAll(last);

    setRealName(graph, result);
    return result;
}
It first walks over all the terms and adds them to the result list; then, for each term in result, it checks whether the term can be segmented further, and if so, the sub-words are collected and finally appended to the end of result.
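Before digging into the highlighter, it helps to look at the same problem from Lucene's side. The sketch below (it assumes the AnsjIndexAnalysis analyzer wrapper that the demo further down also uses) prints every token emitted for the sentence together with its position and offsets; with the unpatched analyzer, 教练 is emitted after 阿森纳 and therefore gets a larger position, even though its offsets point back to an earlier part of the text:

import org.ansj.lucene4.AnsjIndexAnalysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import java.util.HashSet;

public class TokenOrderDemo {
    public static void main(String[] args) throws Exception {
        String text = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
        Analyzer analyzer = new AnsjIndexAnalysis(new HashSet(), false);
        TokenStream ts = analyzer.tokenStream("content", text);
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttr = ts.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posIncAttr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int pos = -1;
        while (ts.incrementToken()) {
            pos += posIncAttr.getPositionIncrement();
            // 教练 shows up here with a larger position than 阿森纳,
            // although its offsets point to an earlier place in the sentence
            System.out.println(termAttr + " pos=" + pos
                    + " offsets=[" + offsetAttr.startOffset() + "," + offsetAttr.endOffset() + "]");
        }
        ts.end();
        ts.close();
    }
}

This ordering is exactly what FastVectorHighlighter trips over below.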


When we search for "阿森纳教练" and ask for fast vector highlighting, an exception is thrown. Here is a demo that reproduces it:

package org.ansj.ansj_lucene4_plug;

import org.ansj.lucene4.AnsjAnalysis;
import org.ansj.lucene4.AnsjIndexAnalysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.elasticsearch.search.highlight.vectorhighlight.SourceSimpleFragmentsBuilder;

import java.util.HashSet;

/**
 * Created by xiaojun on 2016/1/11.
 */
public class Test3 {
    public static void main(String[] args) throws Exception {
        HashSet hs = new HashSet();
        hs.add("的");
        Analyzer analyzer = new AnsjIndexAnalysis(hs, false);
        Directory directory = null;
        IndexWriter iwriter = null;
//        String text = "我发现帖子系统有个BUG啊,怎么解决?怎么破";
        String text = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
//        String text = "我去过白杨树林!";

//        UserDefineLibrary.insertWord("阿森纳", "n", 1000);
//        UserDefineLibrary.insertWord("系统", "n", 1000);

        IndexWriterConfig ic = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);


        directory = new RAMDirectory();
        iwriter = new IndexWriter(directory, ic);

        Document document = new Document();
        document.add(new TextField("_id", "1", Field.Store.YES));
        document.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        iwriter.addDocument(document);
//        iwriter.commit();
//        iwriter.close();
        IndexReader reader = DirectoryReader.open(iwriter, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        FastVectorHighlighter highlighter = new FastVectorHighlighter();
        TopDocs topDocs = searcher.search(new TermQuery(new Term("_id", "1")), 1);
//
////        assertThat(topDocs.totalHits, equalTo(1));
//
//
//        String fragment = highlighter.getBestFragment(highlighter.getFieldQuery(new TermQuery(new Term("content", "阿森纳"))),
//                reader, topDocs.scoreDocs[0].doc, "content", 30);
////        assertThat(fragment, notNullValue());
////        assertThat(fragment, equalTo("the big bad dog"));
//        System.out.println(fragment);


        Analyzer queryAnalyzer = new AnsjAnalysis(hs, false);

        QueryParser tq = new QueryParser("content", queryAnalyzer);
        String queryStr = "阿森纳教练";

        Query query = tq.createBooleanQuery("content", queryStr);
        System.out.println("query:" + query);
        TopDocs hits = searcher.search(query, 5);
        System.out.println(queryStr + ":共找到" + hits.totalHits + "条记录!");

        String fragment2 = highlighter.getBestFragment(highlighter.getFieldQuery(query),
                reader, topDocs.scoreDocs[0].doc, "content", 30);
//        assertThat(fragment, notNullValue());
//        assertThat(fragment, equalTo("the big bad dog"));
        System.out.println(fragment2);
    }
}

Running it produces the following error:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -16
	at java.lang.String.substring(String.java:1911)
	at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:178)
	at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:144)
	at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragment(BaseFragmentsBuilder.java:111)
	at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragment(BaseFragmentsBuilder.java:95)
	at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragment(FastVectorHighlighter.java:116)
	at org.ansj.ansj_lucene4_plug.Test3.main(Test3.java:78)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

The method where the exception is thrown:

  protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
      String[] preTags, String[] postTags, Encoder encoder ){
    StringBuilder fragment = new StringBuilder();
    final int s = fragInfo.getStartOffset();
    int[] modifiedStartOffset = { s };
    String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
    int srcIndex = 0;
    for( SubInfo subInfo : fragInfo.getSubInfos() ){
      for( Toffs to : subInfo.getTermsOffsets() ){
        fragment
          .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
          .append( getPreTag( preTags, subInfo.getSeqnum() ) )
          .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
          .append( getPostTag( postTags, subInfo.getSeqnum() ) );
        srcIndex = to.getEndOffset() - modifiedStartOffset[0];
      }
    }
    fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
    return fragment.toString();
  }
The highlighting logic works like this:

When execution starts, srcIndex = 0 and the first matched term found is "阿森纳". Why is "阿森纳" the first term and not "教练", even though "教练" appears before "阿森纳" in the original sentence? As analyzed in the IndexAnalysis source above, "教练" is a sub-word of "主教练" and was appended to the end of the term vector, so the position of "阿森纳" is earlier than that of "教练".

The fragment in front of the first highlighted term runs from 0 to that term's start offset. "阿森纳" occupies offsets [26,29], so the prefix is the range 0~26, and at the end of that iteration srcIndex becomes 29, meaning the search for the next highlight fragment continues from 29. The second term, "教练", however, has offsets [13,16], so the code ends up calling src.substring(29, 13), which inevitably fails: substring takes the start and end positions of the slice, and the end index must not be smaller than the start index, otherwise a StringIndexOutOfBoundsException is thrown.
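A tiny snippet makes the failure concrete (the offsets are the ones quoted above, and the -16 in the stack trace is exactly 13 - 29):

public class SubstringCrash {
    public static void main(String[] args) {
        String src = "看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~";
        // after highlighting 阿森纳 ([26,29]) srcIndex has advanced to 29;
        // the next SubInfo is 教练 starting at offset 13, so the builder effectively calls:
        src.substring(29, 13); // StringIndexOutOfBoundsException: String index out of range: -16
    }
}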


After all this explanation, the fix is simply to guarantee that the loop encounters "教练" before "阿森纳". That order comes from the list returned by fragInfo.getSubInfos(), and the order of that list is ultimately determined by each term's position, as you can see in the constructor of FieldTermStack:

public FieldTermStack( IndexReader reader, int docId, String fieldName, final FieldQuery fieldQuery ) throws IOException {
    this.fieldName = fieldName;
    
    Set<String> termSet = fieldQuery.getTermSet( fieldName );
    // just return to make null snippet if un-matched fieldName specified when fieldMatch == true
    if( termSet == null ) return;

    final Fields vectors = reader.getTermVectors(docId);
    if (vectors == null) {
      // null snippet
      return;
    }

    final Terms vector = vectors.terms(fieldName);
    if (vector == null) {
      // null snippet
      return;
    }

    final CharsRefBuilder spare = new CharsRefBuilder();
    final TermsEnum termsEnum = vector.iterator(null);
    DocsAndPositionsEnum dpEnum = null;
    BytesRef text;
    
    int numDocs = reader.maxDoc();
    
    while ((text = termsEnum.next()) != null) {
      spare.copyUTF8Bytes(text);
      final String term = spare.toString();
      if (!termSet.contains(term)) {
        continue;
      }
      dpEnum = termsEnum.docsAndPositions(null, dpEnum);
      if (dpEnum == null) {
        // null snippet
        return;
      }

      dpEnum.nextDoc();
      
      // For weight look here: http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/DefaultSimilarity.html
      final float weight = ( float ) ( Math.log( numDocs / ( double ) ( reader.docFreq( new Term(fieldName, text) ) + 1 ) ) + 1.0 );

      final int freq = dpEnum.freq();
      
      for(int i = 0;i < freq;i++) {
        int pos = dpEnum.nextPosition();
        if (dpEnum.startOffset() < 0) {
          return; // no offsets, null snippet
        }
        termList.add( new TermInfo( term, dpEnum.startOffset(), dpEnum.endOffset(), pos, weight ) );
      }
    }
    
    // sort by position
    Collections.sort(termList);
    
    // now look for dups at the same position, linking them together
    int currentPos = -1;
    TermInfo previous = null;
    TermInfo first = null;
    Iterator<TermInfo> iterator = termList.iterator();
    while (iterator.hasNext()) {
      TermInfo current = iterator.next();
      if (current.position == currentPos) {
        assert previous != null;
        previous.setNext(current);
        previous = current;
        iterator.remove();
      } else {
        if (previous != null) {
          previous.setNext(first);
        }
        previous = first = current;
        currentPos = current.position;
      }
    }
    if (previous != null) {
      previous.setNext(first);
    }
  }

The key line here is:

    // sort by position
    Collections.sort(termList);
In other words, termList is sorted by position in ascending order.

At this point the problem should be clear: the goal is to change where a sub-word such as "教练" appears in the term list. If every sub-word immediately follows the original word it was carved out of, the problem goes away.

So the result() method of IndexAnalysis can be modified as follows:

/**
 * Segmentation for indexing / retrieval
 *
 * @return
 */
private List<Term> result() {

    String temp = null;
    List<Term> result = new LinkedList<Term>();
    int length = graph.terms.length - 1;
    for (int i = 0; i < length; i++) {
        if (graph.terms[i] != null) {
            Term term = graph.terms[i];
            result.add(term);
            // only refine terms of at least 3 characters, and skip proper nouns etc.
            if (term.getName().length() >= 3
                    && !Arrays.asList(new String[]{"nr", "nt", "nrf", "nnt", "nsf", "adv", "nz"}).contains(term.getNatureStr())) {
                GetWordsImpl gwi = new GetWordsImpl(term.getName());
                while ((temp = gwi.allWords()) != null) {
                    if (temp.length() < term.getName().length() && temp.length() > 1) {
                        // append the sub-word right after its original term
                        result.add(new Term(temp, gwi.offe + term.getOffe(), TermNatures.NULL));
                    }
                }
            }
        }
    }

    setRealName(graph, result);
    return result;
}
That is, while iterating over the terms we immediately check whether the current term needs further segmentation, and if so its sub-words are added right behind it. I also added a part-of-speech restriction here: the original source only checked that the term is at least 3 characters long, but in my opinion proper nouns do not need to be segmented further; splitting them actually degrades search quality.

After this modification, running the tokenization again gives the following result:

[看热闹/v, 热闹, :/w, 2014/m, 年度/n, 足坛/n, 主教练/n, 主教, 教练, 收入/n, 榜/n, 公布/v, ,/w, 温格/nrf, 是/v, 真/d, ·/w, 阿森纳/nz, 代言人/n, 啊/y, ~, ]


You can see that "主教" and "教练" now appear immediately after "主教练".

Running the highlighting demo from above again now produces the correct result:

看热闹:2014年度足坛主教练收入榜公布,温格是真·阿森纳代言人啊~
