Setting up the ANSJ tokenizer for ElasticSearch and fixing the ansj stop-word bug (QQ discussion group 189040279)

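Recently, because of specific business requirements, our company switched its search engine from Solr to ElasticSearch. The colleague who previously handled search on our team was using ElasticSearch 2.2.1. I did not use that version; for this project I decided to adopt the latest 2.x release (ElasticSearch-2.4.5). There was no particular reason beyond my feeling that a newer release should have fixed some problems, so using it might avoid many issues found in older versions (even though I had not run into those issues myself). In practice I still hit plenty of problems, so I am recording them here to share. To get better search results, we built the search cluster on ElasticSearch plus the Ansj tokenizer. To improve ansj's segmentation, we ran keyword recognition over several million records of our own data and collected several million proper nouns, which we added to the dictionary. Before using ElasticSearch I also looked through a few relevant references and skimmed the ElasticSearch usage and configuration once or twice.
After a few hours of work, the data model to be indexed in ElasticSearch was finished. I put the model into the indexing program and configured the database connection, the dictionary connection and so on. On the ElasticSearch side, once the downloaded Ansj plugin and the Ansj tokenizer were configured, I started loading data. The whole load went smoothly: within a few hours, tens of millions of records had been loaded from MySQL into ElasticSearch.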
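While testing the search results in ElasticSearch, we found the following problems:
(1) Keywords containing punctuation return nothing, for example searching 万科,王石
(2) Keywords containing a space return nothing, for example searching 万科 王石
(3) Keywords wrapped in quotes return nothing, for example searching "万科王石"
(4) The search engine throws an error when highlighting; the exception is as follows: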

RemoteTransportException[[13.35][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]]]; nested: FetchPhaseExecutionException[Fetch Failed [Failed to highlight field [contents]]]; nested: StringIndexOutOfBoundsException[String index out of range: -5];
Caused by: FetchPhaseExecutionException[Fetch Failed [Failed to highlight field [contents]]]; nested: StringIndexOutOfBoundsException[String index out of range: -5];
    at org.elasticsearch.search.highlight.FastVectorHighlighter.highlight(FastVectorHighlighter.java:169)
    at org.elasticsearch.search.highlight.HighlightPhase.hitExecute(HighlightPhase.java:140)
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:188)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:605)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:408)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:405)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -5
    at java.lang.String.substring(String.java:1967)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:179)
    at org.elasticsearch.search.highlight.vectorhighlight.SimpleFragmentsBuilder.makeFragment(SimpleFragmentsBuilder.java:43)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:144)
    at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:186)
    at org.elasticsearch.search.highlight.FastVectorHighlighter.highlight(FastVectorHighlighter.java:146)
    ... 12 more
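We analysed these problems. Problems 1, 2 and 3 all have the same cause: the stop words were not taking effect. After reading the source of the ansj tokenizer and of the ansj-plugin plugin, we found a bug in each component: ansj cannot use a space as a stop word, and ansj-plugin does not apply stop words at all.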


The changes to ansj-plugin and ansj that fix the stop-word bug are as follows:

[Image 1]

[Image 2]

[Image 3]


After these changes the stop words work correctly, and searching for 万科 王石 or for 万科,王石 both return results.
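To make the effect of the stop words concrete, here is a minimal sketch in plain Java. It is not the ansj or ansj-plugin API: the class name, the method name and the contents of the stop-word set are all assumptions chosen for illustration. It only shows that once the space and the comma are treated as stop words, both queries reduce to the same two terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; this is not the ansj or ansj-plugin API.
class StopWordFilterSketch {

    // Assumed stop-word set: a space, an ASCII comma and a full-width comma.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(" ", ",", ","));

    // Drops every token listed as a stop word and keeps the rest in order.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Token lists are written by hand here; a real tokenizer would produce them.
        System.out.println(removeStopWords(Arrays.asList("万科", " ", "王石")));
        System.out.println(removeStopWords(Arrays.asList("万科", ",", "王石")));
    }
}
```

If the space or the comma survived as a token, the query would require a term that was never indexed, which is one plausible way for such searches to come back empty.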

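Bug: searching for a long term in quotes ("") returns no data
When a keyword is searched in quotes, Lucene first tokenizes the text inside the quotes and then looks up the resulting terms. When a document contains all of the terms, Lucene checks whether the distance between the positions of the matched terms is 1; documents that satisfy this become candidate results. So when this problem appeared, my first thought was that a Lucene upgrade had broken this feature. I tested different Lucene versions, and also tested the same version with different tokenizers. The results showed that the problem came from the ansj tokenizer. Reading the ansj code again, I found that this version added sorting to the index-type segmentation mode, ordering tokens by their start offset. When Lucene builds the index, however, it derives each token's position from the position increment in the order the tokens are emitted (ansj sets the increment to 1).
For example, the sentence 对交易者而言波段为王 is tokenized as follows:
ansj native segmentation: [对/p, 交易者/nz, 而言/u, 波段/n, 为王/n, 交/v, 交易/vn, 易/ad, 易者/n, 者/nr, 而/cc, 言/vg, 波/n, 段/q, 为/p, 王/nr]
ansj segmentation after sorting: [对/p, 交易者/nz, 交易/vn, 交/v, 易者/n, 易/ad, 者/nr, 而言/u, 而/cc, 言/vg, 波段/n, 波/n, 段/q, 为王/n, 为/p, 王/nr]
So the positions Lucene obtains in each case are:
ansj native segmentation: [对 1, 交易者 2, 而言 3, 波段 4, 为王 5, 交 6, 交易 7, 易 8, 易者 9, 者 10, 而 11, 言 12, 波 13, 段 14, 为 15, 王 16]
ansj segmentation after sorting: [对 1, 交易者 2, 交易 3, 交 4, 易者 5, 易 6, 者 7, 而言 8, 而 9, 言 10, 波段 11, 波 12, 段 13, 为王 14, 为 15, 王 16]
As you can see, after sorting, a search for “对交易者” still finds the document, but a search for “波段为王” does not, because the distance between the term 为王 and the term 波段 is 3 (in query mode). This is why quoted searches for long terms easily return no data.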
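The position arithmetic above can be replayed with a small sketch. It is plain Java rather than Lucene code, and the class and method names are hypothetical. It assumes, as the text describes, that each emitted token advances the position by 1, and that a two-term quoted phrase matches only when the second term sits exactly one position after the first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of position assignment; not Lucene code.
class PositionSketch {

    static final List<String> NATIVE_ORDER = Arrays.asList(
        "对", "交易者", "而言", "波段", "为王", "交", "交易", "易",
        "易者", "者", "而", "言", "波", "段", "为", "王");

    static final List<String> SORTED_ORDER = Arrays.asList(
        "对", "交易者", "交易", "交", "易者", "易", "者", "而言",
        "而", "言", "波段", "波", "段", "为王", "为", "王");

    // Every emitted token advances the position by a fixed increment of 1.
    static Map<String, Integer> assignPositions(List<String> tokensInEmitOrder) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = 0;
        for (String token : tokensInEmitOrder) {
            position += 1;
            positions.put(token, position);
        }
        return positions;
    }

    // Exact two-term phrase: the second term must be one position after the first.
    static boolean phraseMatches(Map<String, Integer> positions, String first, String second) {
        return positions.get(second) - positions.get(first) == 1;
    }

    public static void main(String[] args) {
        Map<String, Integer> nativePos = assignPositions(NATIVE_ORDER);
        Map<String, Integer> sortedPos = assignPositions(SORTED_ORDER);
        System.out.println("native 波段+为王: " + phraseMatches(nativePos, "波段", "为王"));
        System.out.println("sorted 波段+为王: " + phraseMatches(sortedPos, "波段", "为王"));
        System.out.println("sorted 对+交易者: " + phraseMatches(sortedPos, "对", "交易者"));
    }
}
```

With the native order, 波段 and 为王 land on positions 4 and 5 and the phrase matches; with the sorted order they land on 11 and 14, so the phrase check fails even though both terms are present.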
The fix is fairly simple: comment out the sorting in the index segmentation mode, as follows:
[Image 4]

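ES search highlighting error: StringIndexOutOfBoundsException[String index out of range: -5]
When ElasticSearch runs a search, the default highlighter is chosen according to the storage format configured in the mapping. The vector-based highlighter (fvh, FastVectorHighlighter) has one requirement: the segmentation mode used at query time must be consistent with the one used for the term vectors (using the index mode causes overlapping highlights and exceptions). In practice we found that multi-valued fields are especially prone to out-of-bounds errors and misplaced highlight tags. I read the source of Lucene's highlight module, FastVectorHighlighter, and lucene-core. Lucene highlighting goes through the following steps.
1. Parse the highlight query and extract the target keywords
2. Read the term vectors of the target docid and extract the vectors of the target keywords
3. Compute the combinations of matched term vectors and take the top N groups (5 by default)
4. Extract the field content from the document (multi-valued data is joined with spaces into a single string)
5. Iterate over the extracted highlight groups and cut out a passage based on the position offsets of the terms in each group (the passage extends forward and backward until a length limit is exceeded or an English punctuation mark is found)
So highlighting depends on the term-vector positions recorded at index time. In the Lucene source code, the term-vector offset is recorded as follows:
term-vector offset = sum of offsets of the preceding values + offset of the term within the current value
sum of offsets of the preceding values = sum of offsets up to the value before the previous one + end offset of the last term in the previous value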


   /** Inverts one field for one document; first is true
     *  if this is the first time we are seeing this field
     *  name in this document. */
    public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
      if (first) {
        // First time we're seeing this field (indexed) in
        // this document:
        invertState.reset();
      }

      IndexableFieldType fieldType = field.fieldType();

      IndexOptions indexOptions = fieldType.indexOptions();
      fieldInfo.setIndexOptions(indexOptions);

      if (fieldType.omitNorms()) {
        fieldInfo.setOmitsNorms();
      }

      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
        
      // only bother checking offsets if something will consume them.
      // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
      final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

      /*
       * To assist people in tracking down problems in analysis components, we wish to write the field name to the infostream
       * when we fail. We expect some caller to eventually deal with the real exception, so we don't want any 'catch' clauses,
       * but rather a finally that takes note of the problem.
       */
      boolean succeededInProcessingField = false;
     
      try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
        // reset the TokenStream to the first token
        stream.reset();
        invertState.setAttributeSource(stream);
        termsHashPerField.start(field, first);

        while (stream.incrementToken()) {
        
          int posIncr = invertState.posIncrAttribute.getPositionIncrement();
          invertState.position += posIncr;
          if (invertState.position < invertState.lastPosition) {
            if (posIncr == 0) {
              throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '" + field.name() + "'");
            } else {
              throw new IllegalArgumentException("position increments (and gaps) must be >= 0 (got " + posIncr + ") for field '" + field.name() + "'");
            }
          } else if (invertState.position > IndexWriter.MAX_POSITION) {
            throw new IllegalArgumentException("position " + invertState.position + " is too large for field '" + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
          }
          invertState.lastPosition = invertState.position;
          if (posIncr == 0) {
            invertState.numOverlap++;
          }
              
          if (checkOffsets) {
            int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
            int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
            if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
              throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
                                                 + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset=" + invertState.lastStartOffset + " for field '" + field.name() + "'");
            }
            invertState.lastStartOffset = startOffset;
          }

          invertState.length++;
          if (invertState.length < 0) {
            throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
          }
          //System.out.println("  term=" + invertState.termAttribute);

          // If we hit an exception in here, we abort
          // all buffered documents since the last
          // flush, on the likelihood that the
          // internal state of the terms hash is now
          // corrupt and should not be flushed to a
          // new segment:
          try {
            termsHashPerField.add();
          } catch (MaxBytesLengthExceededException e) {
            byte[] prefix = new byte[30];
            BytesRef bigTerm = invertState.termAttribute.getBytesRef();
            System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
            String msg = "Document contains at least one immense term in field=\"" + fieldInfo.name + "\" (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '" + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
            if (docState.infoStream.isEnabled("IW")) {
              docState.infoStream.message("IW", "ERROR: " + msg);
            }
            // Document will be deleted above:
            throw new IllegalArgumentException(msg, e);
          } catch (Throwable th) {
            throw AbortingException.wrap(th);
          }
        }

        stream.end();
        // TODO: maybe add some safety? then again, it's already checked
        // when we come back around to the field...
        // added by jkuang: advance the offset by the full length of the value
        String value = field.stringValue();
        invertState.offset += value==null?0:value.length();
        invertState.position += invertState.posIncrAttribute.getPositionIncrement();
        /* if there is an exception coming through, we won't set this to true here:*/
        succeededInProcessingField = true;
      } finally {
        if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
          docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
        }
      }

      if (analyzed) {
        invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
        invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
      }

      invertState.boost *= field.boost();
    }
  }

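With the approach described above, however, stream.end() is executed first, so the value of invertState.offsetAttribute.endOffset() has already been reset to zero. In addition, with this approach, if any of the first n-1 values of a multi-valued field ends with a stop character, the offsets go wrong.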
[Image 5]
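The offset drift caused by a trailing stop character can be reproduced with a small sketch. This is plain Java, not Lucene code; the class and method names are hypothetical. It assumes that the trailing comma is dropped as a stop word, that the offset gap between values is 1, and that the highlighter joins values with a single space. It compares two ways of advancing the running offset between values: by the end offset of the last emitted token, and by the full length of the value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from Lucene.
class MultiValueOffsetSketch {

    // Base offset of each value when the running sum advances by the end
    // offset of the last token emitted for the previous value, plus a gap.
    static int[] basesFromLastTokenEnd(int[] lastTokenEndOffsets, int offsetGap) {
        int[] bases = new int[lastTokenEndOffsets.length];
        int running = 0;
        for (int i = 0; i < lastTokenEndOffsets.length; i++) {
            bases[i] = running;
            running += lastTokenEndOffsets[i] + offsetGap;
        }
        return bases;
    }

    // Base offset of each value when the running sum advances by the full
    // length of the previous value, plus the same gap.
    static int[] basesFromValueLength(List<String> values, int offsetGap) {
        int[] bases = new int[values.size()];
        int running = 0;
        for (int i = 0; i < values.size(); i++) {
            bases[i] = running;
            running += values.get(i).length() + offsetGap;
        }
        return bases;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("万科,", "王石");
        // Assumed: the trailing comma is dropped as a stop word, so the last
        // token of the first value (万科) ends at offset 2, not 3.
        int[] lastTokenEnds = {2, 2};
        String joined = String.join(" ", values); // values joined with one space
        System.out.println("actual start of 王石: " + joined.indexOf("王石"));
        System.out.println("from last token end:  " + basesFromLastTokenEnd(lastTokenEnds, 1)[1]);
        System.out.println("from value length:    " + basesFromValueLength(values, 1)[1]);
    }
}
```

In the joined string 王石 starts at offset 4. Advancing by the last token's end offset gives 3, one character short, while advancing by the value length gives 4. Every further value that ends in a stop character adds to the drift.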

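After the change, I ran ElasticSearch and tried a highlighted search, and highlighting still threw StringIndexOutOfBoundsException[String index out of range: -5]. This is caused by the step mentioned above, "compute the combinations of matched term vectors and take the top N groups (5 by default)": the terms within each group are not sorted, and the groups themselves are not sorted either. When highlighting, Lucene always continues from the position where the previous highlight ended. For example, when highlighting 万科,王石, after 万科 is highlighted the recorded relative offset is 10, so highlighting 王石 can only start from 10. If the start offset of 王石 is less than 10, say 5, the following occurs: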

StringIndexOutOfBoundsException[String index out of range: -7]
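The failure mode can be imitated outside Lucene with a few lines of Java. The sketch below is not the FastVectorHighlighter implementation; the class, the methods and the sample sentence are assumptions made for illustration. It wraps each matched range in tags while always continuing from the end of the previous range, which is the behaviour described above, and shows that sorting the ranges by start offset avoids the exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; this is not the Lucene highlighter.
class HighlightOrderSketch {

    // Wraps each [start, end) range in <em> tags, always continuing from
    // the end of the previously highlighted range.
    static String highlight(String text, List<int[]> ranges) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] range : ranges) {
            // Throws StringIndexOutOfBoundsException when range[0] < cursor.
            out.append(text.substring(cursor, range[0]));
            out.append("<em>").append(text.substring(range[0], range[1])).append("</em>");
            cursor = range[1];
        }
        out.append(text.substring(cursor));
        return out.toString();
    }

    // Returns a copy of the ranges ordered by start offset.
    static List<int[]> sortedByStart(List<int[]> ranges) {
        List<int[]> copy = new ArrayList<>(ranges);
        copy.sort(Comparator.comparingInt((int[] r) -> r[0]));
        return copy;
    }

    public static void main(String[] args) {
        String text = "王石是万科的创始人";
        // 万科 occupies [3, 5) and 王石 occupies [0, 2); listed out of order.
        List<int[]> unsorted = Arrays.asList(new int[]{3, 5}, new int[]{0, 2});
        try {
            highlight(text, unsorted);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unsorted ranges failed: " + e.getMessage());
        }
        System.out.println(highlight(text, sortedByStart(unsorted)));
    }
}
```

With the unsorted ranges the second substring call asks for the span from 5 back to 0, which is what a negative length in the exception message corresponds to; the exact wording of the message depends on the JDK version. With sorted ranges the output is `<em>王石</em>是<em>万科</em>的创始人`.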

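This can be fixed by modifying Lucene's highlight module to sort the groups, specifically by changing the constructors of org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo and its nested SubInfo class.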

 public WeightedFragInfo( int startOffset, int endOffset, List<SubInfo> subInfos, float totalBoost ){
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.totalBoost = totalBoost;
      this.subInfos = subInfos;
      // Sort the SubInfos by the start offset of their first term; empty ones sort first.
      Collections.sort(this.subInfos, new Comparator<SubInfo>() {
        @Override
        public int compare(SubInfo o1, SubInfo o2) {
            if(o1.getTermsOffsets().size()==0){
                return -1;
            }else if(o2.getTermsOffsets().size()==0){
                return 1;
            }else{
                return o1.getTermsOffsets().get(0).getStartOffset()-o2.getTermsOffsets().get(0).getStartOffset();
            }
        }
    });
    // Widen the fragment so that it covers every term offset it contains.
    for (SubInfo info:subInfos) {
        for (Toffs ot:info.getTermsOffsets()) {
            if(this.startOffset>ot.getStartOffset()){
                this.startOffset = ot.getStartOffset();
            }
            if(this.endOffset<ot.getEndOffset()){
                this.endOffset = ot.getEndOffset();
            }
        }
    }

    }
    
    public List<SubInfo> getSubInfos(){
      return subInfos;
    }
    
    public float getTotalBoost(){
      return totalBoost;
    }
    
    public int getStartOffset(){
      return startOffset;
    }
    
    public int getEndOffset(){
      return endOffset;
    }
    
    @Override
    public String toString(){
      StringBuilder sb = new StringBuilder();
      sb.append( "subInfos=(" );
      for( SubInfo si : subInfos )
        sb.append( si.toString() );
      sb.append( ")/" ).append( totalBoost ).append( '(' ).append( startOffset ).append( ',' ).append( endOffset ).append( ')' );
      return sb.toString();
    }
    
    /**
     * Represents the list of term offsets for some text
     */
    public static class SubInfo {
      private final String text;  // unnecessary member, just exists for debugging purpose
      private final List<Toffs> termsOffsets;   // usually termsOffsets.size() == 1,
                              // but if position-gap > 1 and slop > 0 then size() could be greater than 1
      private final int seqnum;
      private final float boost; // used for scoring split WeightedPhraseInfos.

      public SubInfo( String text, List<Toffs> termsOffsets, int seqnum, float boost ){
        this.text = text;
        this.termsOffsets = termsOffsets;
        this.seqnum = seqnum;
        this.boost = boost;
        // Sort the term offsets by start offset so they are consumed left to right.
        Collections.sort(  this.termsOffsets,new Comparator<Toffs>() {

            @Override
            public int compare(Toffs o1, Toffs o2) {
                return o1.getStartOffset()-o2.getStartOffset();
            }
        });

      }
      
      public List<Toffs> getTermsOffsets(){
        return termsOffsets;
      }
      
      public int getSeqnum(){
        return seqnum;
      }

      public String getText(){
        return text;
      }

      public float getBoost(){
        return boost;
      }

      @Override
      public String toString(){
        StringBuilder sb = new StringBuilder();
        sb.append( text ).append( '(' );
        for( Toffs to : termsOffsets )
          sb.append( to.toString() );
        sb.append( ')' );
        return sb.toString();
      }
    }
  }
}

Sorting is also added to the code that generates the different highlight groups (BaseFragmentsBuilder).

  protected List<WeightedFragInfo> discreteMultiValueHighlighting(List<WeightedFragInfo> fragInfos, Field[] fields) {
    Map<String, List<WeightedFragInfo>> fieldNameToFragInfos = new HashMap<>();
    for (Field field : fields) {
      fieldNameToFragInfos.put(field.name(), new ArrayList<WeightedFragInfo>());
    }

    fragInfos: for (WeightedFragInfo fragInfo : fragInfos) {
      int fieldStart;
      int fieldEnd = 0;
      for (Field field : fields) {
        if (field.stringValue().isEmpty()) {
          fieldEnd++;
          continue;
        }
        fieldStart = fieldEnd;
        fieldEnd += field.stringValue().length() + 1; // + 1 for going to next field with same name.

        if (fragInfo.getStartOffset() >= fieldStart && fragInfo.getEndOffset() >= fieldStart &&
            fragInfo.getStartOffset() <= fieldEnd && fragInfo.getEndOffset() <= fieldEnd) {
          fieldNameToFragInfos.get(field.name()).add(fragInfo);
          continue fragInfos;
        }

        if (fragInfo.getSubInfos().isEmpty()) {
          continue fragInfos;
        }

        Toffs firstToffs = fragInfo.getSubInfos().get(0).getTermsOffsets().get(0);
        if (fragInfo.getStartOffset() >= fieldEnd || firstToffs.getStartOffset() >= fieldEnd) {
          continue;
        }

        int fragStart = fieldStart;
        if (fragInfo.getStartOffset() > fieldStart && fragInfo.getStartOffset() < fieldEnd) {
          fragStart = fragInfo.getStartOffset();
        }

        int fragEnd = fieldEnd;
        if (fragInfo.getEndOffset() > fieldStart && fragInfo.getEndOffset() < fieldEnd) {
          fragEnd = fragInfo.getEndOffset();
        }


        List<SubInfo> subInfos = new ArrayList<>();
        Iterator<SubInfo> subInfoIterator = fragInfo.getSubInfos().iterator();
        float boost = 0.0f;  //  The boost of the new info will be the sum of the boosts of its SubInfos
        while (subInfoIterator.hasNext()) {
          SubInfo subInfo = subInfoIterator.next();
          List<Toffs> toffsList = new ArrayList<>();
          Iterator<Toffs> toffsIterator = subInfo.getTermsOffsets().iterator();
          while (toffsIterator.hasNext()) {
            Toffs toffs = toffsIterator.next();
            if (toffs.getStartOffset() >= fieldEnd) {
              // We've gone past this value so its not worth iterating any more.
              break;
            }
            boolean startsAfterField = toffs.getStartOffset() >= fieldStart;
            boolean endsBeforeField = toffs.getEndOffset() < fieldEnd;
            if (startsAfterField && endsBeforeField) {
              // The Toff is entirely within this value.
              toffsList.add(toffs);
              toffsIterator.remove();
            } else if (startsAfterField) {
              /*
               * The Toffs starts within this value but ends after this value
               * so we clamp the returned Toffs to this value and leave the
               * Toffs in the iterator for the next value of this field.
               */
              toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd - 1));
            } else if (endsBeforeField) {
              /*
               * The Toffs starts before this value but ends in this value
               * which means we're really continuing from where we left off
               * above. Since we use the remainder of the offset we can remove
               * it from the iterator.
               */
              toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
              toffsIterator.remove();
            } else {
              /*
               * The Toffs spans the whole value so we clamp on both sides.
               * This is basically a combination of both arms of the loop
               * above.
               */
              toffsList.add(new Toffs(fieldStart, fieldEnd - 1));
            }
          }
          if (!toffsList.isEmpty()) {
            subInfos.add(new SubInfo(subInfo.getText(), toffsList, subInfo.getSeqnum(), subInfo.getBoost()));
            boost += subInfo.getBoost();
          }

          if (subInfo.getTermsOffsets().isEmpty()) {
            subInfoIterator.remove();
          }
        }
        WeightedFragInfo weightedFragInfo = new WeightedFragInfo(fragStart, fragEnd, subInfos, boost);
        fieldNameToFragInfos.get(field.name()).add(weightedFragInfo);
      }
    }

    List<WeightedFragInfo> result = new ArrayList<>();
    for (List<WeightedFragInfo> weightedFragInfos : fieldNameToFragInfos.values()) {
      result.addAll(weightedFragInfos);
    }
    Collections.sort(result, new Comparator<WeightedFragInfo>() {

      @Override
      public int compare(FieldFragList.WeightedFragInfo info1, FieldFragList.WeightedFragInfo info2) {
        return info1.getStartOffset() - info2.getStartOffset();
      }

    });


    return result;
  }
With that, the bugs in Ansj + ElasticSearch + fvh highlighting have all been fixed.
