前一章讲了搜索中的拼写纠错功能,里面一个很重要的概念就是莱文斯坦距离。这章会讲解搜索中提升用户体验的另一项功能-自动补全。本章直接介绍ES中的实现方式以及真正的搜索引擎对自动补全功能的优化。
大家对上面的这个应该都不陌生,搜索引擎会根据你输入的关键字进行一些提示,这样用户只需要输入部分内容就可以进行选择了。尤其在移动端会比较方便。淘宝、京东的搜索也有类似的功能,只不过行业不同,提示出来的内容也不同罢了。
InputIteratior中几个方法的作用:
weight():此方法设置某个term的权重,设置的越高suggest的优先级越高;
payload():每个suggestion对应的元数据的二进制表示,我们在传输对象的时候需要转换对象或对象的某个属性为BytesRef类型,相应的suggester调用lookup的时候会返回payloads信息;
hasPayload():判断iterator是否有payloads;
contexts():获取某个term的contexts,用来过滤suggest的内容,如果suggest的列表为空,返回null
hasContexts():获取iterator是否有contexts;
Lucene 使用AnalyzingInfixSuggester类中的lookup方法去联想数据来源进行查询,其实就是一个普通的search,所以我们的关键是要维护好这个联想数据来源,各行各业都应该有自己单独的语料库。
应该从这几个方面入手:怎么优化Suggest词库、提升Suggest词准确率、怎么提高响应速度
当使用completion suggester的时候, 不是用于完成 类似于 "关键词"这样的模糊匹配场景,而是用于完成关键词前缀匹配的。 对于汉字的处理,无需使用ik/ HanLP一类的分词器,直接使用keyword analyzer,配合去除一些不需要的stop word即可。
代码参考:
LinkedHashSet returnSet = new LinkedHashSet<>();
Client client = elasticsearchTemplate.getClient();
SuggestRequestBuilder suggestRequestBuilder = client.prepareSuggest(elasticsearchTemplate.getPersistentEntityFor(SuggestEntity.class).getIndexName());
//全拼前缀匹配
CompletionSuggestionBuilder fullPinyinSuggest = new CompletionSuggestionBuilder("full_pinyin_suggest")
.field("full_pinyin").text(input).size(10);
//汉字前缀匹配
CompletionSuggestionBuilder suggestText = new CompletionSuggestionBuilder("suggestText")
.field("suggestText").text(input).size(size);
//拼音搜字母前缀匹配
CompletionSuggestionBuilder prefixPinyinSuggest = new CompletionSuggestionBuilder("prefix_pinyin_text")
.field("prefix_pinyin").text(input).size(size);
suggestRequestBuilder = suggestRequestBuilder.addSuggestion(fullPinyinSuggest).addSuggestion(suggestText).addSuggestion(prefixPinyinSuggest);
SuggestResponse suggestResponse = suggestRequestBuilder.execute().actionGet();
Suggest.Suggestion prefixPinyinSuggestion = suggestResponse.getSuggest().getSuggestion("prefix_pinyin_text");
Suggest.Suggestion fullPinyinSuggestion = suggestResponse.getSuggest().getSuggestion("full_pinyin_suggest");
Suggest.Suggestion suggestTextsuggestion = suggestResponse.getSuggest().getSuggestion("suggestText");
List entries = suggestTextsuggestion.getEntries();
//汉字前缀匹配
for (Suggest.Suggestion.Entry entry : entries) {
List options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
returnSet.add(option.getText().toString());
}
}
//全拼suggest补充
if (returnSet.size() < 10) {
List fullPinyinEntries = fullPinyinSuggestion.getEntries();
for (Suggest.Suggestion.Entry entry : fullPinyinEntries) {
List options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
if (returnSet.size() < 10) {
returnSet.add(option.getText().toString());
}
}
}
}
//首字母拼音suggest补充
if (returnSet.size() == 0) {
List prefixPinyinEntries = prefixPinyinSuggestion.getEntries();
for (Suggest.Suggestion.Entry entry : prefixPinyinEntries) {
List options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
returnSet.add(option.getText().toString());
}
}
}
return new ArrayList<>(returnSet);
搜索引擎的优化,需要更智能,每个人输入相同的关键字,提示出来的内容可能是完全不相同的,这就是所谓的“千人千面”。这就用到了数据分析的知识,可以根据用户一段时间内的搜索历史,分析用户的搜索习惯,结合语料库实现对用户的精准提示。跟输入法的提升功能类似,会根据你过往的输入文本进行自动提示。所以,你付出了隐私,得到的是更大的便捷。这也是没有办法的事情。
如你所见,各大搜索引擎都提供了智能提示的API供广大用户调用,如果你司没有自研的能力,可以直接js中跨域调用,先把系统跑起来再说,给大家提供主流搜索引擎的调用地址,包含电商的哦。
提示:URL中的 #content# 为搜索的 关键字
谷歌(Google)
http://suggestqueries.google.com/complete/search?client=youtube&q=#content#&jsonp=window.google.ac.h
callback:window.google.ac.h
window.google.ac.h(["关键字",[["关键字",0],["关键字 歌词",0],["关键字参数",0],["关键字 lyrics",0],["关键字过滤",0],["关键字排名",0],["关键字查询",0],["关键字提取算法",0],["关键字规划师可通过以下哪种方式帮助您制作新的搜索网络广告系列",0],["关键字优化",0]],{"k":1,"q":"uhaB8ZMjzJay-BACee_C0eVdUCA"}])
必应(Bing)
http://api.bing.com/qsonhs.aspx?type=cb&q=#content#&cb=window.bing.sug
callback:window.bing.sug
if(typeof window.bing.sug == 'function') window.bing.sug({"AS":{"Query":"关键字","FullResults":0}} /* pageview_candidate */);
百度(Baidu)
http://suggestion.baidu.com/su?wd=#content#&cb=window.baidu.sug
callback:window.baidu.sug
window.baidu.sug({q:"关键字",p:false,s:["关键字搜索排名","关键字怎么优化","关键字查询工具","关键字推广","关键词优化","关键词排名","关键字 英文","关键词挖掘","关键词查询","关键词搜索"]});
好搜(So)
https://sug.so.360.cn/suggest?encodein=utf-8&encodeout=utf-8&format=json&word=#content#&callback=window.so.sug
callback:window.so.sug
window.so.sug({"query":"关键字","result":[{"word":"关键字查询"},{"word":"关键字工具"},{"word":"关键字查询工具"},{"word":"关键字挖掘"},{"word":"关键字搜索"},{"word":"关键字英文"},{"word":"关键字是什么"},{"word":"关键字广告"},{"word":"关键字分析"},{"word":"关键字规划师"}],"version":"b","rec":""});
搜狗(Sogou)
https://www.sogou.com/suggnew/ajajjson?type=web&key=#content#
callback:window.sogou.sug
window.sogou.sug(["关键字",["关键字查询","关键字搜索","关键字优化","关键字规划师","关键字查询lol","关键字是什么意思","关键字搜索工具","关键字广告图片","关键字排名查询","关键字生成器"],["0;0;0;0","1;0;0;0","2;0;0;0","3;0;0;0","4;0;0;0","5;0;0;0","6;0;0;0","7;0;0;0","8;0;0;0","9;0;0;0"],["","","","","","","","","",""],["0"],"","suglabId_1"],-1);
淘宝(Taobao)
https://suggest.taobao.com/sug?code=utf-8&q=#content#&callback=window.taobao.sug
callback:window.taobao.sug
window.taobao.sug({"result":[["关键字推广","204"],["关键字seo","198"],["关键字 网站","182"],["关键字搜索","119"],["关键字软件","44"],["关键字首页","50"],["关键字收录","35"],["关键字采集","16"],["关键字采集器","10"],["网站关键字","180"]]})
以百度为例,API返回的是JSONP数据,JSONP是跨域访问的一种方式。由于服务器返回的JavaScript代码可以直接引用,通过回调函数的方式就可以间接的获取服务器的数据。
下面是一个回调搜索建议的例子,window.baidu.sug 返回的是一个json对象:
控制台打印的结果:如果要将结果保存在一个字符串数组中,只需要 var arr = json.s 即可。
我的微信公众号:架构真经(id:gentoo666),分享Java干货,高并发编程,热门技术教程,微服务及分布式技术,架构设计,区块链技术,人工智能,大数据,Java面试题,以及前沿热门资讯等。每日更新哦!
参考文章