Internet slang evolves by the day. How can we get newly coined hot words (or other domain-specific terms) into our search in near real time?
First, test with the ik analyzer:
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '
成龙原名陈港生
'
# response
{
  "tokens" : [ {
    "token" : "成龙",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "原名",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "陈",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "港",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "生",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "CN_CHAR",
    "position" : 4
  } ]
}
The ik main dictionary does not contain the word "陈港生", so it was split into single characters.
Edit IK's configuration file (IKAnalyzer.cfg.xml) as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- local extension stop-word dictionary -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- remote extension dictionary -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
</properties>
Here I use the remote extension dictionary, because it can be updated by another program without restarting ES, which is very convenient; extending the vocabulary with a local file requires an ES restart. That said, the local custom mydict.dic dictionary is also easy to use: one word per line, just append to it.
Contents of hotWords.php:
<?php
$s = <<<'EOF'
陈港生
元楼
蓝瘦
EOF;
// ik decides whether to re-fetch based on these two headers
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"'); // in practice this should change whenever the word list changes
echo $s;
ik reads two response headers, Last-Modified and ETag; if either one changes, an update is triggered. ik polls the remote dictionary once a minute.
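The trigger described above can be sketched as a small decision function (an illustration of the protocol, not IK's actual source): reload whenever either cached header value differs from the one in the latest response.

```java
import java.util.Objects;

public class DictReloadCheck {

    /**
     * Returns true when the remote dictionary should be re-fetched:
     * a change in either Last-Modified or ETag is enough to trigger a reload.
     * Objects.equals also handles the first poll, where the cache is null.
     */
    static boolean shouldReload(String cachedLastModified, String cachedETag,
                                String lastModified, String eTag) {
        return !Objects.equals(cachedLastModified, lastModified)
                || !Objects.equals(cachedETag, eTag);
    }
}
```

This is why the PHP script above regenerates Last-Modified on every request: any edit to the word list is picked up within one polling interval.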
Restart Elasticsearch and check the startup log; the three words have been loaded:
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 陈港生
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 元楼
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 蓝瘦
Now run the same analyze request again; the response contains:
...
}, {
  "token" : "陈港生",
  "start_offset" : 5,
  "end_offset" : 8,
  "type" : "CN_WORD",
  "position" : 2
}, {
...
The ik analyzer now matches "陈港生" as a single token.
Java server-side implementation: endpoints to load the extension dictionary, add extension words, and trigger a refresh. The remote dictionary URL configured for IK is then:
http://ip:port/es/dic/loadExtDict
@RestController
@RequestMapping("/es/dic")
public class DicController {

    private static final Logger logger = LoggerFactory.getLogger(DicController.class);

    @Autowired
    private DictRedis dictRedis;

    private static final String EXT_DICT_PATH = "E:\\ext_dict.txt";

    /**
     * Load the extension dictionary. This is the endpoint IK polls.
     * @param response servlet response the word list is written to
     */
    @RequestMapping(value = "/loadExtDict")
    public void loadExtDict(HttpServletResponse response) {
        logger.info("extDict get start");
        long count = dictRedis.incr(RedisKeyConstants.ES_EXT_DICT_FLUSH);
        // Keep serving the list until every node has fetched it, so that
        // each ES node in the cluster receives the new words.
        if (count > getEsClusterNodesNum()) {
            return;
        }
        String result = FileUtil.read(EXT_DICT_PATH);
        if (StringUtils.isEmpty(result)) {
            return;
        }
        try {
            // A changed Last-Modified or ETag triggers IK to reload.
            response.setHeader("Last-Modified", TimeUtil.currentTimeHllDT().toString());
            response.setHeader("ETag", TimeUtil.currentTimeHllDT().toString());
            response.setContentType("text/plain; charset=UTF-8");
            PrintWriter out = response.getWriter();
            out.write(result);
            out.flush();
        } catch (IOException e) {
            logger.error("DicController loadExtDict exception", e);
        }
        logger.info("extDict get end, result:{}", result);
    }

    /**
     * Refresh the extension dictionary: reset the fetch counter so the
     * word list is served to all nodes again.
     */
    @RequestMapping(value = "/extDictFlush")
    public String extDictFlush() {
        String result = "ok";
        try {
            dictRedis.del(RedisKeyConstants.ES_EXT_DICT_FLUSH);
        } catch (Exception e) {
            result = e.getMessage();
        }
        return result;
    }

    /**
     * Add extension words; separate multiple words with a comma ",".
     * @param dict comma-separated list of words
     */
    @RequestMapping(value = "/addExtDict")
    public String addExtDict(String dict) {
        String result = "ok";
        if (StringUtils.isEmpty(dict)) {
            return "words to add must not be empty";
        }
        StringBuilder sb = new StringBuilder();
        String[] dicts = dict.split(",");
        for (String str : dicts) {
            sb.append("\n").append(str);
        }
        boolean flag = FileUtil.write(EXT_DICT_PATH, sb.toString());
        if (flag) {
            extDictFlush();
        } else {
            result = "fail";
        }
        return result;
    }

    /**
     * Number of nodes in the cluster; defaults to 10 if it cannot be determined.
     */
    private int getEsClusterNodesNum() {
        int num = 10;
        String esAddress = PropertyConfigurer.getString("es.address",
                "http://172.16.32.69:9300,http://172.16.32.48:9300");
        List<String> clusterNodes = Arrays.asList(esAddress.split(","));
        if (!clusterNodes.isEmpty()) {
            num = clusterNodes.size();
        }
        return num;
    }
}
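The Redis counter in loadExtDict gates delivery so that each ES node gets the update exactly once per flush. The behavior can be simulated in memory with an AtomicLong standing in for the Redis incr/del operations (DictGate is a hypothetical name for this sketch):

```java
import java.util.concurrent.atomic.AtomicLong;

/** In-memory stand-in for the Redis counter used by loadExtDict/extDictFlush. */
public class DictGate {

    private final AtomicLong counter = new AtomicLong(); // Redis incr stand-in
    private final int nodeCount;

    DictGate(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    /** Mirrors loadExtDict: serve the word list until every node has fetched it. */
    String fetch(String dict) {
        long count = counter.incrementAndGet();
        return count > nodeCount ? null : dict;
    }

    /** Mirrors extDictFlush: reset the counter so the next change is served again. */
    void flush() {
        counter.set(0);
    }
}
```

With 3 nodes, the first three fetches return the word list, the fourth returns nothing, and after flush() (called by addExtDict on a successful write) the list is served again.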
File read/write utility class:
public class FileUtil {

    private static final Logger logger = LoggerFactory.getLogger(FileUtil.class);

    /**
     * Read a file as UTF-8 text.
     * @param path file path
     * @return file contents, or "" if reading fails
     */
    public static String read(String path) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even when reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(new FileInputStream(new File(path))),
                        "utf-8"), 512)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } catch (Exception e) {
            logger.error("FileUtil read exception", e);
        }
        return sb.toString();
    }

    /**
     * Append content to a file as UTF-8 text.
     * @return true on success, false otherwise
     */
    public static boolean write(String path, String content) {
        boolean flag = true;
        // second FileOutputStream argument = append mode; write with the
        // same charset used when reading
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(new File(path), true), "utf-8"))) {
            out.write(content);
        } catch (IOException e) {
            flag = false;
            logger.error("FileUtil write exception", e);
        }
        return flag;
    }
}
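On Java 7+, the same two operations can be written more compactly with java.nio.file.Files, which handles buffering and resource cleanup internally. A minimal sketch (NioFileUtil is a hypothetical name, not part of the original code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioFileUtil {

    /** Read the whole dictionary file as UTF-8, or return "" on failure. */
    static String read(String path) {
        try {
            return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return "";
        }
    }

    /** Append words to the dictionary file as UTF-8, creating it if absent. */
    static boolean write(String path, String content) {
        try {
            Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Using an explicit charset on both read and write avoids garbled Chinese words when the server's platform default encoding is not UTF-8.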