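This post summarizes a requirement I ran into at work: implementing fuzzy search with Elasticsearch (ES). The concrete requirement is to search financial data such as funds, A-shares, and Hong Kong stocks, with suggestions driven by the field value itself, by pinyin initials, or by part of the full pinyin. Note that names of financial instruments may contain not only Chinese characters but also English letters, digits, and special characters.
The official guidance is to use ES's usual fuzzy queries with caution for performance reasons, but that warning is mostly relative to other ES queries; compared with other search tools, ES fuzzy search still performs reasonably well. The common fuzzy-style queries, such as wildcard, fuzzy, and query_string, do not fully fit this business requirement, so the problem is approached from another angle: using more flexible analyzers to support fuzzy search across several kinds of input.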
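Compared with the traditional standard analyzer or the ik analyzer, the ngram tokenizer has the advantage that it also emits special characters as tokens. For queries against the field value itself, the ngram tokenizer can therefore be used; for queries by full pinyin or by pinyin initials, a custom analyzer that combines the keyword tokenizer with a pinyin filter can be used.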
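To make the ngram behaviour concrete, here is a minimal Java sketch that imitates what a character n-gram tokenizer emits. It is a simplified stand-in written for illustration, not the Elasticsearch implementation; the class name `NGramSketch` and the sample name `*ST恒生A1` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a character n-gram tokenizer (illustration only,
// not the Elasticsearch implementation).
class NGramSketch {

    // Emits every substring whose length in code points lies in [minGram, maxGram].
    // No character is dropped, so punctuation and digits become tokens too.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < codePoints.length; start++) {
            for (int len = minGram; len <= maxGram && start + len <= codePoints.length; len++) {
                tokens.add(new String(codePoints, start, len));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical security name mixing a symbol, letters, Chinese characters, and a digit.
        System.out.println(ngrams("*ST恒生A1", 1, 1)); // prints [*, S, T, 恒, 生, A, 1]
    }
}
```

With both bounds set to 1, every character becomes its own token, which is why a phrase query over such tokens behaves much like a substring search.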
Based on the design above, an index like the following can be defined in ES:
settings:
{
"analysis":{
"analyzer":{
"my_ngram_analyzer":{
"tokenizer":"my_ngram_tokenizer"
},
"my_pinyin_analyzer":{
"tokenizer":"keyword",
"filter":"py"
}
},
"tokenizer":{
"my_ngram_tokenizer":{
"type":"ngram",
"min_gram":1,
"max_gram":1
}
},
"filter":{
"py":{
"type":"pinyin",
"first_letter":"prefix",
"keep_separate_first_letter":true,
"keep_full_pinyin":true,
"keep_joined_full_pinyin":true,
"keep_original":true,
"limit_first_letter_length":16,
"lowercase":true,
"remove_duplicated_term":true
}
}
}
}
mapping:
{
"properties":{
"name":{
"type":"text",
"analyzer":"my_ngram_analyzer",
"fields":{
"PY":{
"type":"text",
"analyzer":"my_pinyin_analyzer",
"term_vector":"with_positions_offsets",
"boost":10.0
}
}
}
}
}
Taking text = "恒生电子" as an example, the custom pinyin analyzer my_pinyin_analyzer produces the following tokens:
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "heng",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "恒生电子",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "hengshengdianzi",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "hsdz",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "s",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "sheng",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "dian",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "z",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 3
},
{
"token": "zi",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 3
}
]
}
At the code level, considering both how closely results should relate to the input and the order of the terms, match_phrase is chosen over match and match_phrase_prefix:
// A direct match on the field ranks higher than a pinyin match
BoolQueryBuilder boolQueryBuilderKeyWord = QueryBuilders.boolQuery()
    .should(QueryBuilders.matchPhraseQuery("name", imageStr).boost(2.0f))
    .should(QueryBuilders.matchPhraseQuery("name.PY", imageStr).boost(1.0f));
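The difference between match and match_phrase over single-character tokens can be sketched in plain Java. This is a rough model under simplifying assumptions (no scoring, no slop, default OR semantics for match), not how Elasticsearch evaluates queries internally; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Rough model of match vs. match_phrase over unigram tokens (illustration only).
class PhraseMatchSketch {

    // Splits text into one token per code point, like an n-gram tokenizer with both bounds at 1.
    static List<String> unigrams(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    // Models `match` with OR semantics: one shared token is enough for a hit.
    static boolean matchAny(String doc, String query) {
        List<String> docTokens = unigrams(doc);
        return unigrams(query).stream().anyMatch(docTokens::contains);
    }

    // Models `match_phrase`: all query tokens must appear consecutively and in order.
    static boolean matchPhrase(String doc, String query) {
        List<String> queryTokens = unigrams(query);
        return !queryTokens.isEmpty()
                && Collections.indexOfSubList(unigrams(doc), queryTokens) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matchAny("华商电子", "恒生电子"));    // true: "电" and "子" are shared
        System.out.println(matchPhrase("华商电子", "恒生电子")); // false: the sequence differs
    }
}
```

Under this model, match would return any name sharing a single character with the input, while match_phrase keeps only names containing the input as a contiguous run.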
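After the two-clause query was applied in the project, testing revealed a problem: entering Chinese characters also returns records that are unrelated to those characters but share the same pinyin abbreviation. For example, entering the keyword "恒生电子" makes the API return: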
{
"error_code": "0",
"error_info": "success",
"data": [
{
"en_prod_code": "600570.SH",
"secu_code": "600570",
"secu_abbr": "恒生电子",
"type": "A_stock",
"modification_time": "2022-04-22T19:47:37.000+00:00",
"en_abbr": "HSDZ"
},
{
"en_prod_code": "007685.OF",
"secu_code": "007685.OF",
"secu_abbr": "华商电子",
"type": "fund",
"modification_time": "2022-04-22T19:41:38.000+00:00",
"en_abbr": "HSDZ"
}
]
}
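The collision in this result can be reproduced with a small Java sketch that builds a first-letter abbreviation from pinyin syllables. The syllables are hard-coded here rather than produced by the pinyin plugin, and the class name `FirstLetterSketch` is hypothetical.

```java
// Illustration of why two different names collide on their pinyin initials.
// Syllables are hard-coded; in the index they would come from the pinyin filter.
class FirstLetterSketch {

    // Joins the lowercase first letter of each pinyin syllable.
    static String firstLetters(String... syllables) {
        StringBuilder abbreviation = new StringBuilder();
        for (String syllable : syllables) {
            if (!syllable.isEmpty()) {
                abbreviation.append(Character.toLowerCase(syllable.charAt(0)));
            }
        }
        return abbreviation.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstLetters("heng", "sheng", "dian", "zi")); // 恒生电子 -> hsdz
        System.out.println(firstLetters("hua", "shang", "dian", "zi"));  // 华商电子 -> hsdz
    }
}
```

Both names reduce to the same abbreviation, so once Chinese input is also analyzed into pinyin on the name.PY side, a record with different characters can still match.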
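Inspection shows that the query itself is at fault. The fix is to keep the field query and the pinyin query separate: if the input contains Chinese characters, query only the name field; otherwise, query only name.PY. The Java code is changed as follows:
// Declare the builder outside the branches so it stays in scope afterwards
BoolQueryBuilder boolQueryBuilderKeyWord = QueryBuilders.boolQuery();
if (!imageStr.matches("(.*)[\u4e00-\u9fa5](.*)")) {
    // No Chinese characters: query only the pinyin sub-field
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name.PY", imageStr));
} else {
    // Contains Chinese characters: query only the name field
    boolQueryBuilderKeyWord.must(QueryBuilders.matchPhraseQuery("name", imageStr));
}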
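The routing decision can be isolated into a small helper and checked without a running cluster. The sketch below is one possible way to do it; the class `QueryFieldRouter` and its method names are hypothetical, and only the character range and the field names name and name.PY come from the code above.

```java
import java.util.regex.Pattern;

// Hypothetical helper that decides which field a keyword should be searched against.
class QueryFieldRouter {

    // Same character range as the check above: common CJK ideographs.
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]");

    // True if the input contains at least one Chinese character.
    static boolean containsChinese(String input) {
        return CHINESE.matcher(input).find();
    }

    // Chinese input goes to the n-gram field, everything else to the pinyin sub-field.
    static String targetField(String input) {
        return containsChinese(input) ? "name" : "name.PY";
    }

    public static void main(String[] args) {
        System.out.println(targetField("恒生电子")); // name
        System.out.println(targetField("hsdz"));     // name.PY
        System.out.println(targetField("600570"));   // name.PY
    }
}
```

Using find() on a precompiled pattern avoids the leading and trailing (.*) groups and recompiling the regex on every request. Mixed input such as "恒生A" is routed to name, matching the behaviour of the original condition.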
Finally, querying the keyword "恒生电子" again returns:
{
"error_code": "0",
"error_info": "success",
"data": [
{
"en_prod_code": "600570.SH",
"secu_code": "600570",
"secu_abbr": "恒生电子",
"type": "A_stock",
"modification_time": "2022-04-22T19:47:37.000+00:00",
"en_abbr": "HSDZ"
}
]
}