Once the Ansj analyzer was integrated, Ansj could be used for segmentation and search; after a few hours of preparation, the models, data, and so on inside ElasticSearch were ready.
With ElasticSearch 2.4.5 and the corresponding ansj analyzer plugin, however, search suffers from the following problems:
(1) Keywords that contain punctuation return no results, e.g. searching for 万科,王石
(2) Keywords that contain a space return no results, e.g. searching for 万科 王石
(3) Keywords wrapped in quotes return no results, e.g. searching for "万科王石"
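For context, the three failing searches correspond to the following kinds of queries. This is only a sketch against the ES 2.x Java API; how the keyword actually reaches ES depends on the application, the index name "news" and field "content" are made-up placeholders, and the quoted keyword is issued as a phrase query.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class FailingSearches {
    public static void run(Client client) {
        // (1) keyword containing punctuation
        SearchResponse r1 = client.prepareSearch("news")
                .setQuery(QueryBuilders.matchQuery("content", "万科,王石")).get();
        // (2) keyword containing a space
        SearchResponse r2 = client.prepareSearch("news")
                .setQuery(QueryBuilders.matchQuery("content", "万科 王石")).get();
        // (3) quoted keyword, i.e. a phrase query
        SearchResponse r3 = client.prepareSearch("news")
                .setQuery(QueryBuilders.matchPhraseQuery("content", "万科王石")).get();
        System.out.println(r1.getHits().getTotalHits() + " / "
                + r2.getHits().getTotalHits() + " / " + r3.getHits().getTotalHits());
    }
}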
After going through the Ansj source and the Ansj plugin source and analyzing the failures, I found that although the ansj analyzer loads the stop-word list, it never actually applies any stop words during analysis. Fixing the stop-word bug requires changes in both ansj-plugin and ansj.
Once the fix is in place, stop words work as expected, and searching for 万科 王石 or 万科,王石 returns results normally.
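The patch itself is not reproduced in this post; the idea is simply to make the already-loaded stop-word set take effect when the token stream is built. At the Lucene layer that amounts to something like the following sketch (class name and stop-word values are illustrative, not the plugin's real code):

import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.util.CharArraySet;

// Illustration only: wrap the analyzer's TokenStream so the loaded stop words are really applied.
public final class AnsjStopWordWrapper {

    // In the real plugin these come from the stop-word files it already loads.
    private static final CharArraySet STOP_WORDS =
            new CharArraySet(Arrays.asList(",", "，", " ", "的"), false);

    public static TokenStream applyStopWords(TokenStream ansjStream) {
        return new StopFilter(ansjStream, STOP_WORDS);
    }
}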
The bug: searching for a long keyword wrapped in quotes ("") returns no data
When a keyword is searched with quotes around it, lucene first segments the quoted text, then looks up documents that contain all of the resulting terms, and finally checks whether the matched terms sit at adjacent positions (distance 1); documents that pass this check become result candidates. So when the problem appeared, my first thought was that a lucene version upgrade had broken this feature. I tested with different lucene versions, and with different analyzers on the same version, and the tests pointed to the ansj analyzer. Reading the ansj code again showed that this version added a sort to the index-mode segmentation result, ordered by each term's start offset, while lucene assigns positions at index time from the position increments in the order the terms are iterated (ansj sets every increment to 1).
For example, segmenting the sentence 对交易者而言波段为王 yields:
ansj native segmentation: [对/p, 交易者/nz, 而言/u, 波段/n, 为王/n, 交/v, 交易/vn, 易/ad, 易者/n, 者/nr, 而/cc, 言/vg, 波/n, 段/q, 为/p, 王/nr]
ansj segmentation after sorting: [对/p, 交易者/nz, 交易/vn, 交/v, 易者/n, 易/ad, 者/nr, 而言/u, 而/cc, 言/vg, 波段/n, 波/n, 段/q, 为王/n, 为/p, 王/nr]
Positions with native segmentation: [对 1, 交易者 2, 而言 3, 波段 4, 为王 5, 交 6, 交易 7, 易 8, 易者 9, 者 10, 而 11, 言 12, 波 13, 段 14, 为 15, 王 16]
Positions after sorting: [对 1, 交易者 2, 交易 3, 交 4, 易者 5, 易 6, 者 7, 而言 8, 而 9, 言 10, 波段 11, 波 12, 段 13, 为王 14, 为 15, 王 16]
As the positions show, after sorting, a quoted search for "对交易者" still matches, but a quoted search for "波段为王" does not: the distance between 波段 and 为王 is now 3 (when the phrase is executed as a query), so long quoted keywords easily return no results.
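A short standalone demo (not ansj code) of how the sorted token order translates into lucene positions and why the quoted phrase then misses:

import java.util.Arrays;
import java.util.List;

public class PositionDemo {
    public static void main(String[] args) {
        // Token order produced by the sorted index-mode segmentation shown above.
        List<String> sorted = Arrays.asList(
                "对", "交易者", "交易", "交", "易者", "易", "者",
                "而言", "而", "言", "波段", "波", "段", "为王", "为", "王");
        // Every token carries a position increment of 1, so position == index + 1.
        int posBoDuan = sorted.indexOf("波段") + 1;   // 11
        int posWeiWang = sorted.indexOf("为王") + 1;  // 14
        // A phrase query for "波段为王" with slop 0 needs the two terms at adjacent
        // positions (distance 1); here the distance is 3, so no document matches.
        System.out.println("波段=" + posBoDuan + ", 为王=" + posWeiWang
                + ", distance=" + (posWeiWang - posBoDuan));
    }
}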
The fix is straightforward: comment out the sort applied to the index-mode segmentation result in ansj.
ES highlighting error: StringIndexOutOfBoundsException[String index out of range: -5]
When ElasticSearch highlights results, the highlighter is chosen by default from how the field is stored in the mapping. The term-vector based highlighter (fvh, FastVectorHighlighter) comes with a requirement: the query-time segmentation must also work off the stored term vectors (using the index-mode segmentation here leads to overlapping highlights and exceptions). In practice, out-of-bounds errors and misplaced highlight tags are especially common when the field holds multiple values. After reading the source of the lucene highlighting module (FastVectorHighlighter) and of lucene-core, highlighting in lucene goes through the following steps:
1. Parse the query and collect the keywords to highlight.
2. Read the term vectors for the target docid and pull out the vectors of those keywords.
3. Compute the combinations of matching term-vector hits and keep the top N groups (5 by default).
4. Load the document's field content (multiple values are joined with a space into a single string).
5. Walk the extracted highlight groups and cut a snippet around each one using the term offsets inside the group (the snippet is extended forwards and backwards until it exceeds a certain length or hits an ASCII punctuation mark).
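For reference, this is roughly how the FastVectorHighlighter is driven at the Lucene level. It is only a sketch: the field name "content" and the fragment size of 100 characters are illustrative, and the 5 matches the default number of groups mentioned above.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldQuery;

public class FvhSketch {
    public static String[] highlight(IndexReader reader, Query query, int docId) throws IOException {
        FastVectorHighlighter highlighter = new FastVectorHighlighter();
        // Step 1: derive the terms/phrases to highlight from the query.
        FieldQuery fieldQuery = highlighter.getFieldQuery(query, reader);
        // Steps 2-5: read the doc's term vectors, pick the best fragments
        // (at most 5 of ~100 chars each) and cut them out of the stored field value.
        return highlighter.getBestFragments(fieldQuery, reader, docId, "content", 100, 5);
    }
}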
As these steps show, highlighting relies on the term-vector offsets recorded at index time. In the lucene source, the offset recorded for a term is:
term offset = cumulative offset of the previous values + the term's offset within the current value
cumulative offset of the previous values = cumulative offset before the previous value + end offset of the last term of the previous value
For example, if the previous values accumulate to an offset of 10, a term that starts at character 3 of the current value is recorded with start offset 13. This bookkeeping lives in the invert method of lucene-core's DefaultIndexingChain.PerField; the lines marked "add by jkuang" below are the modification:
/** Inverts one field for one document; first is true
 *  if this is the first time we are seeing this field
 *  name in this document. */
public void invert(IndexableField field, boolean first) throws IOException, AbortingException {
  if (first) {
    // First time we're seeing this field (indexed) in
    // this document:
    invertState.reset();
  }

  IndexableFieldType fieldType = field.fieldType();
  IndexOptions indexOptions = fieldType.indexOptions();
  fieldInfo.setIndexOptions(indexOptions);

  if (fieldType.omitNorms()) {
    fieldInfo.setOmitsNorms();
  }

  final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;

  // only bother checking offsets if something will consume them.
  // TODO: after we fix analyzers, also check if termVectorOffsets will be indexed.
  final boolean checkOffsets = indexOptions == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;

  /*
   * To assist people in tracking down problems in analysis components, we wish to write the field name to the infostream
   * when we fail. We expect some caller to eventually deal with the real exception, so we don't want any 'catch' clauses,
   * but rather a finally that takes note of the problem.
   */
  boolean succeededInProcessingField = false;
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    // reset the TokenStream to the first token
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);

    while (stream.incrementToken()) {
      int posIncr = invertState.posIncrAttribute.getPositionIncrement();
      invertState.position += posIncr;
      if (invertState.position < invertState.lastPosition) {
        if (posIncr == 0) {
          throw new IllegalArgumentException("first position increment must be > 0 (got 0) for field '" + field.name() + "'");
        } else {
          throw new IllegalArgumentException("position increments (and gaps) must be >= 0 (got " + posIncr + ") for field '" + field.name() + "'");
        }
      } else if (invertState.position > IndexWriter.MAX_POSITION) {
        throw new IllegalArgumentException("position " + invertState.position + " is too large for field '" + field.name() + "': max allowed position is " + IndexWriter.MAX_POSITION);
      }
      invertState.lastPosition = invertState.position;
      if (posIncr == 0) {
        invertState.numOverlap++;
      }

      if (checkOffsets) {
        int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
        int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
        if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
          throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
              + "startOffset=" + startOffset + ",endOffset=" + endOffset + ",lastStartOffset=" + invertState.lastStartOffset + " for field '" + field.name() + "'");
        }
        invertState.lastStartOffset = startOffset;
      }

      invertState.length++;
      if (invertState.length < 0) {
        throw new IllegalArgumentException("too many tokens in field '" + field.name() + "'");
      }
      //System.out.println(" term=" + invertState.termAttribute);

      // If we hit an exception in here, we abort
      // all buffered documents since the last
      // flush, on the likelihood that the
      // internal state of the terms hash is now
      // corrupt and should not be flushed to a
      // new segment:
      try {
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        byte[] prefix = new byte[30];
        BytesRef bigTerm = invertState.termAttribute.getBytesRef();
        System.arraycopy(bigTerm.bytes, bigTerm.offset, prefix, 0, 30);
        String msg = "Document contains at least one immense term in field=\"" + fieldInfo.name + "\" (whose UTF8 encoding is longer than the max length " + DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8 + "), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '" + Arrays.toString(prefix) + "...', original message: " + e.getMessage();
        if (docState.infoStream.isEnabled("IW")) {
          docState.infoStream.message("IW", "ERROR: " + msg);
        }
        // Document will be deleted above:
        throw new IllegalArgumentException(msg, e);
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }

    stream.end();

    // TODO: maybe add some safety? then again, it's already checked
    // when we come back around to the field...

    // add by jkuang
    String value = field.stringValue();
    invertState.offset += value == null ? 0 : value.length();
    invertState.position += invertState.posIncrAttribute.getPositionIncrement();

    /* if there is an exception coming through, we won't set this to true here:*/
    succeededInProcessingField = true;
  } finally {
    if (!succeededInProcessingField && docState.infoStream.isEnabled("DW")) {
      docState.infoStream.message("DW", "An exception was thrown while processing field " + fieldInfo.name);
    }
  }

  if (analyzed) {
    invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
    invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
  }

  invertState.boost *= field.boost();
}
With lucene's original bookkeeping, however, stream.end() has already been called at this point, so invertState.offsetAttribute.endOffset() has already been reset. Moreover, with that approach, in a multi-valued field the offsets go wrong as soon as any of the first n-1 values ends with a stop character, because the last token's end offset then falls short of the value's real length; that is why the modification above advances the offset by the raw value length instead.
After this change, running ElasticSearch with highlighting still threw StringIndexOutOfBoundsException[String index out of range: -5]. The cause lies in the step described above ("compute the combinations of matching term-vector hits and keep the top N groups, 5 by default"): the terms inside each group are not sorted, and neither are the groups themselves, while lucene highlights by always advancing from the position of the previous highlight. For example, when highlighting 万科,王石: after highlighting 万科 the recorded relative offset is 10, so 王石 can only be highlighted starting from 10; if 王石 actually starts before 10, say at 5, the result is
StringIndexOutOfBoundsException[String index out of range: -7]
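The error message is exactly what an out-of-order offset produces: the builder asks for a substring whose end lies before its start, and the reported index is the negative length. A minimal standalone reproduction of that shape (illustrative field value, not the actual lucene call site):

public class NegativeSubstringDemo {
    public static void main(String[] args) {
        String fieldValue = "万科的董事长是王石,王石谈万科的发展";
        // Suppose highlighting has already consumed the value up to offset 10,
        // but the next (unsorted) hit starts at offset 5: the effective request
        // becomes substring(10, 5), whose length is -5.
        String broken = fieldValue.substring(10, 5);
        // -> java.lang.StringIndexOutOfBoundsException: String index out of range: -5
        System.out.println(broken);
    }
}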
The lucene highlighting module can be patched to sort these groups; concretely, modify the constructors of org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo and of its nested SubInfo class:
public static class WeightedFragInfo {

  private List<SubInfo> subInfos;
  private float totalBoost;
  private int startOffset;
  private int endOffset;

  public WeightedFragInfo( int startOffset, int endOffset, List<SubInfo> subInfos, float totalBoost ){
    this.startOffset = startOffset;
    this.endOffset = endOffset;
    this.totalBoost = totalBoost;
    this.subInfos = subInfos;
    // Sort the sub infos by the start offset of their first term so that
    // highlighting never has to step backwards inside a fragment.
    Collections.sort(this.subInfos, new Comparator<SubInfo>() {
      @Override
      public int compare(SubInfo o1, SubInfo o2) {
        if (o1.getTermsOffsets().size() == 0) {
          return -1;
        } else if (o2.getTermsOffsets().size() == 0) {
          return 1;
        } else {
          return o1.getTermsOffsets().get(0).getStartOffset() - o2.getTermsOffsets().get(0).getStartOffset();
        }
      }
    });
    // Widen the fragment boundaries so they cover every term offset.
    for (SubInfo info : subInfos) {
      for (Toffs ot : info.getTermsOffsets()) {
        if (this.startOffset > ot.getStartOffset()) {
          this.startOffset = ot.getStartOffset();
        }
        if (this.endOffset < ot.getEndOffset()) {
          this.endOffset = ot.getEndOffset();
        }
      }
    }
  }

  public List<SubInfo> getSubInfos(){
    return subInfos;
  }

  public float getTotalBoost(){
    return totalBoost;
  }

  public int getStartOffset(){
    return startOffset;
  }

  public int getEndOffset(){
    return endOffset;
  }

  @Override
  public String toString(){
    StringBuilder sb = new StringBuilder();
    sb.append( "subInfos=(" );
    for( SubInfo si : subInfos )
      sb.append( si.toString() );
    sb.append( ")/" ).append( totalBoost ).append( '(' ).append( startOffset ).append( ',' ).append( endOffset ).append( ')' );
    return sb.toString();
  }

  /**
   * Represents the list of term offsets for some text
   */
  public static class SubInfo {
    private final String text;               // unnecessary member, just exists for debugging purpose
    private final List<Toffs> termsOffsets;  // usually termsOffsets.size() == 1,
                                             // but if position-gap > 1 and slop > 0 then size() could be greater than 1
    private final int seqnum;
    private final float boost;               // used for scoring split WeightedPhraseInfos.

    public SubInfo( String text, List<Toffs> termsOffsets, int seqnum, float boost ){
      this.text = text;
      this.termsOffsets = termsOffsets;
      this.seqnum = seqnum;
      this.boost = boost;
      // Keep the term offsets in ascending start-offset order.
      Collections.sort( this.termsOffsets, new Comparator<Toffs>() {
        @Override
        public int compare(Toffs o1, Toffs o2) {
          return o1.getStartOffset() - o2.getStartOffset();
        }
      });
    }

    public List<Toffs> getTermsOffsets(){
      return termsOffsets;
    }

    public int getSeqnum(){
      return seqnum;
    }

    public String getText(){
      return text;
    }

    public float getBoost(){
      return boost;
    }

    @Override
    public String toString(){
      StringBuilder sb = new StringBuilder();
      sb.append( text ).append( '(' );
      for( Toffs to : termsOffsets )
        sb.append( to.toString() );
      sb.append( ')' );
      return sb.toString();
    }
  }
}
At the same time, add sorting to the code that produces the different highlight groups (BaseFragmentsBuilder.discreteMultiValueHighlighting):
protected List<WeightedFragInfo> discreteMultiValueHighlighting(List<WeightedFragInfo> fragInfos, Field[] fields) {
  Map<String, List<WeightedFragInfo>> fieldNameToFragInfos = new HashMap<>();
  for (Field field : fields) {
    fieldNameToFragInfos.put(field.name(), new ArrayList<WeightedFragInfo>());
  }

  fragInfos: for (WeightedFragInfo fragInfo : fragInfos) {
    int fieldStart;
    int fieldEnd = 0;
    for (Field field : fields) {
      if (field.stringValue().isEmpty()) {
        fieldEnd++;
        continue;
      }
      fieldStart = fieldEnd;
      fieldEnd += field.stringValue().length() + 1; // + 1 for going to next field with same name.
      if (fragInfo.getStartOffset() >= fieldStart && fragInfo.getEndOffset() >= fieldStart &&
          fragInfo.getStartOffset() <= fieldEnd && fragInfo.getEndOffset() <= fieldEnd) {
        fieldNameToFragInfos.get(field.name()).add(fragInfo);
        continue fragInfos;
      }

      if (fragInfo.getSubInfos().isEmpty()) {
        continue fragInfos;
      }

      Toffs firstToffs = fragInfo.getSubInfos().get(0).getTermsOffsets().get(0);
      if (fragInfo.getStartOffset() >= fieldEnd || firstToffs.getStartOffset() >= fieldEnd) {
        continue;
      }

      int fragStart = fieldStart;
      if (fragInfo.getStartOffset() > fieldStart && fragInfo.getStartOffset() < fieldEnd) {
        fragStart = fragInfo.getStartOffset();
      }

      int fragEnd = fieldEnd;
      if (fragInfo.getEndOffset() > fieldStart && fragInfo.getEndOffset() < fieldEnd) {
        fragEnd = fragInfo.getEndOffset();
      }

      List<SubInfo> subInfos = new ArrayList<>();
      Iterator<SubInfo> subInfoIterator = fragInfo.getSubInfos().iterator();
      float boost = 0.0f; // The boost of the new info will be the sum of the boosts of its SubInfos
      while (subInfoIterator.hasNext()) {
        SubInfo subInfo = subInfoIterator.next();
        List<Toffs> toffsList = new ArrayList<>();
        Iterator<Toffs> toffsIterator = subInfo.getTermsOffsets().iterator();
        while (toffsIterator.hasNext()) {
          Toffs toffs = toffsIterator.next();
          if (toffs.getStartOffset() >= fieldEnd) {
            // We've gone past this value so its not worth iterating any more.
            break;
          }
          boolean startsAfterField = toffs.getStartOffset() >= fieldStart;
          boolean endsBeforeField = toffs.getEndOffset() < fieldEnd;
          if (startsAfterField && endsBeforeField) {
            // The Toff is entirely within this value.
            toffsList.add(toffs);
            toffsIterator.remove();
          } else if (startsAfterField) {
            /*
             * The Toffs starts within this value but ends after this value
             * so we clamp the returned Toffs to this value and leave the
             * Toffs in the iterator for the next value of this field.
             */
            toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd - 1));
          } else if (endsBeforeField) {
            /*
             * The Toffs starts before this value but ends in this value
             * which means we're really continuing from where we left off
             * above. Since we use the remainder of the offset we can remove
             * it from the iterator.
             */
            toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
            toffsIterator.remove();
          } else {
            /*
             * The Toffs spans the whole value so we clamp on both sides.
             * This is basically a combination of both arms of the loop
             * above.
             */
            toffsList.add(new Toffs(fieldStart, fieldEnd - 1));
          }
        }
        if (!toffsList.isEmpty()) {
          subInfos.add(new SubInfo(subInfo.getText(), toffsList, subInfo.getSeqnum(), subInfo.getBoost()));
          boost += subInfo.getBoost();
        }

        if (subInfo.getTermsOffsets().isEmpty()) {
          subInfoIterator.remove();
        }
      }
      WeightedFragInfo weightedFragInfo = new WeightedFragInfo(fragStart, fragEnd, subInfos, boost);
      fieldNameToFragInfos.get(field.name()).add(weightedFragInfo);
    }
  }

  List<WeightedFragInfo> result = new ArrayList<>();
  for (List<WeightedFragInfo> weightedFragInfos : fieldNameToFragInfos.values()) {
    result.addAll(weightedFragInfos);
  }
  // Added: sort all fragments by start offset so later highlighting never steps backwards.
  Collections.sort(result, new Comparator<WeightedFragInfo>() {
    @Override
    public int compare(FieldFragList.WeightedFragInfo info1, FieldFragList.WeightedFragInfo info2) {
      return info1.getStartOffset() - info2.getStartOffset();
    }
  });

  return result;
}