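Hot-updating a remote custom dictionary in analysis-ik requires two things:
1. A GET request returns the word list.
2. A HEAD request returns Last-Modified and/or ETag in the response headers.
Given these two conditions, there are two ways to serve the dictionary: exposing a resource file directly, or exposing an HTTP interface.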
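To make the role of the two headers concrete, here is a simplified sketch of the kind of check a polling client could perform on each HEAD response. It is not the plugin's actual code; the class name `RemoteDictChangeDetector` and its method are hypothetical, and the sketch assumes the validators are compared as plain strings.

```java
/** Hypothetical helper: decides whether a polling client should re-download the dictionary. */
final class RemoteDictChangeDetector {
    // Validator values remembered from the last response that triggered a reload
    private String lastModified;
    private String eTag;

    /**
     * @param status          HTTP status of the HEAD response
     * @param newLastModified value of the Last-Modified response header, or null if absent
     * @param newETag         value of the ETag response header, or null if absent
     * @return true when the dictionary should be fetched again with GET
     */
    boolean shouldReload(int status, String newLastModified, String newETag) {
        if (status != 200) {
            return false; // 304 Not Modified or an error: keep the current dictionary
        }
        // A header only counts as a change when it is present and differs from the stored value
        boolean modifiedChanged = newLastModified != null && !newLastModified.equals(lastModified);
        boolean eTagChanged = newETag != null && !newETag.equals(eTag);
        if (modifiedChanged || eTagChanged) {
            lastModified = newLastModified;
            eTag = newETag;
            return true;
        }
        return false;
    }
}
```

In this sketch a server that returns neither header never triggers a reload, which is why at least one of the two validators has to be present.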
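With the file approach, put the words in a UTF-8 encoded file and place that file under nginx or another HTTP server. When the file is modified, the HTTP server automatically returns the corresponding Last-Modified and ETag values the next time a client requests it, and analysis-ik reloads the dictionary once it detects the change.
Take nginx as an example and assume the dictionary file is named mydic.dic. The prerequisites (nginx configuration, IKAnalyzer.cfg.xml configuration, restart, and so on) are omitted here. Start with an empty dictionary file.
Analyzer test endpoint: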
http:///_analyze?analyzer=ik_smart&text=远程自定义词典热更新
The tokenization result is:
{
"tokens": [
{
"token": "远程",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "自定义",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
},
{
"token": "词典",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "热",
"start_offset": 7,
"end_offset": 8,
"type": "CN_CHAR",
"position": 3
},
{
"token": "更新",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 4
}
]
}
Add a new word to the dictionary mydic.dic: 热更新 ("hot update"). Wait one minute for the dictionary to be reloaded.
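If the dictionary file is maintained by a program rather than by hand, adding the word could look like the following minimal sketch. The class `DictFileUpdater` is hypothetical; it assumes one word per line and writes a temporary sibling file before renaming it, so the HTTP server is less likely to serve a half-written file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that adds a word to a one-word-per-line UTF-8 dictionary file. */
final class DictFileUpdater {

    /** Returns true if the word was added, false if it was already in the file. */
    static boolean addWord(Path dict, String word) throws IOException {
        List<String> words = new ArrayList<>();
        if (Files.exists(dict)) {
            words.addAll(Files.readAllLines(dict, StandardCharsets.UTF_8));
        }
        if (words.contains(word)) {
            return false; // nothing to do, leave the file and its modification time untouched
        }
        words.add(word);

        // Write the full word list to a sibling file, always encoded as UTF-8
        Path tmp = dict.resolveSibling(dict.getFileName().toString() + ".tmp");
        Files.write(tmp, words, StandardCharsets.UTF_8);

        // Rename over the old file; whether an existing target is replaced by an
        // atomic move is platform-specific (it is on common Linux file systems)
        Files.move(tmp, dict, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }
}
```

A call such as `DictFileUpdater.addWord(Paths.get("mydic.dic"), "热更新")` changes the file's modification time, which is what lets the HTTP server report a new Last-Modified value.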
After the reload, the test result is:
{
"tokens": [
{
"token": "远程",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "自定义",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
},
{
"token": "词典",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "热更新",
"start_offset": 7,
"end_offset": 10,
"type": "CN_WORD",
"position": 3
}
]
}
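The dictionary was updated successfully: "热更新" is now recognized as a single word.
With the interface approach, take Spring Boot as an example and implement the GET and HEAD requests for /dict. Only Last-Modified is actually used here.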
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.util.StringUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class DictionaryController {

    private static final Logger LOGGER = LoggerFactory.getLogger(DictionaryController.class);
    // Minimum interval between updates: 5 minutes, in seconds
    private static final long MIN_UPDATE_INTERVAL = 300;
    private static final String CHARSET = "UTF-8";
    // Last-Modified validator: request header and response header
    private static final String REQUEST_MODIFIED_KEY = "If-Modified-Since";
    private static final String RESPONSE_MODIFIED_KEY = "Last-Modified";
    // ETag validator: request header and response header
    private static final String REQUEST_ETAG_KEY = "If-None-Match";
    private static final String RESPONSE_ETAG_KEY = "ETag";

    @Autowired
    DictionaryService dictionaryService;

    // HEAD /dict: tells the client whether the dictionary changed since its last check
    @RequestMapping(value = "/dict", method = RequestMethod.HEAD)
    public void needUpdate(HttpServletRequest request, HttpServletResponse response) throws IOException {
        // All timestamps here are epoch seconds
        long current = System.currentTimeMillis() / 1000;
        String lastModifiedStr = request.getHeader(REQUEST_MODIFIED_KEY);
        String eTag = request.getHeader(REQUEST_ETAG_KEY);
        if (!StringUtils.hasLength(lastModifiedStr)) {
            // First load: no validator was sent, so always report the dictionary as new
            response.setStatus(HttpStatus.OK.value());
            if (eTag != null) {
                response.setHeader(RESPONSE_ETAG_KEY, eTag);
            }
            response.setHeader(RESPONSE_MODIFIED_KEY, String.valueOf(current));
            return;
        }
        long lastModified;
        try {
            lastModified = Long.parseLong(lastModifiedStr);
        } catch (NumberFormatException e) {
            LOGGER.error("invalid header info {}", lastModifiedStr, e);
            response.sendError(HttpStatus.BAD_REQUEST.value(), "invalid header info");
            return;
        }
        // The last update time must be earlier than the current time
        if (lastModified >= current) {
            LOGGER.error("illegal header info {}", lastModifiedStr);
            response.sendError(HttpStatus.BAD_REQUEST.value(), "illegal header info");
            return;
        }
        // Throttle: never report a change within MIN_UPDATE_INTERVAL of the last one
        if (current <= lastModified + MIN_UPDATE_INTERVAL) {
            response.setStatus(HttpStatus.NOT_MODIFIED.value());
            return;
        }
        long lastDictionaryUpdateTime = dictionaryService.getLastUpdateTime();
        // If the database has not changed since the client's last update, do not sync
        if (lastModified >= lastDictionaryUpdateTime) {
            response.setStatus(HttpStatus.NOT_MODIFIED.value());
        } else {
            response.setStatus(HttpStatus.OK.value());
            if (eTag != null) {
                response.setHeader(RESPONSE_ETAG_KEY, eTag);
            }
            response.setHeader(RESPONSE_MODIFIED_KEY, String.valueOf(current));
        }
    }

    // GET /dict: returns the word list as plain text, one word per line
    @RequestMapping(value = "/dict", method = RequestMethod.GET)
    public void sendDict(HttpServletRequest request, HttpServletResponse response) {
        response.setStatus(HttpStatus.OK.value());
        response.setContentType("text/plain; charset=" + CHARSET);
        List<String> words = dictionaryService.getWords();
        if (words != null) {
            try (OutputStream out = response.getOutputStream()) {
                for (String word : words) {
                    out.write((word + "\n").getBytes(CHARSET));
                }
            } catch (IOException e) {
                LOGGER.error("dict update failed!", e);
            }
        }
    }
}
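Because the code enforces a minimum update interval of 5 minutes, you need to wait at least 5 minutes for a change to take effect.
The tokenization results before and after the update are the same as with the file-based approach.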
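The waiting time follows from the order of the checks in the HEAD handler. The sketch below restates that decision as a pure function so the timeline can be tried without a servlet container; the class `HeadDecision` and its method are hypothetical names, and all times are epoch seconds as in the controller.

```java
/** Hypothetical restatement of the HEAD handler's decision as a pure function. */
final class HeadDecision {
    static final long MIN_UPDATE_INTERVAL = 300; // seconds, same value as in the controller
    static final int OK = 200;
    static final int NOT_MODIFIED = 304;
    static final int BAD_REQUEST = 400;

    /**
     * @param ifModifiedSince      raw If-Modified-Since header value, or null on first load
     * @param now                  current time in epoch seconds
     * @param lastDictionaryUpdate time the dictionary source last changed, in epoch seconds
     * @return the HTTP status the handler would answer with
     */
    static int decide(String ifModifiedSince, long now, long lastDictionaryUpdate) {
        if (ifModifiedSince == null || ifModifiedSince.isEmpty()) {
            return OK; // first load
        }
        long lastModified;
        try {
            lastModified = Long.parseLong(ifModifiedSince);
        } catch (NumberFormatException e) {
            return BAD_REQUEST; // header is not an epoch-seconds number
        }
        if (lastModified >= now) {
            return BAD_REQUEST; // a timestamp from the present or future is rejected
        }
        if (now <= lastModified + MIN_UPDATE_INTERVAL) {
            return NOT_MODIFIED; // throttled, even if the dictionary has changed
        }
        return lastModified >= lastDictionaryUpdate ? NOT_MODIFIED : OK;
    }
}
```

For example, if the client last synced at second 1000 and the dictionary changed at second 1100, `decide("1000", 1200, 1100)` still answers 304 because of the throttle, while `decide("1000", 1400, 1100)` answers 200.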
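The advantage of the resource-file approach is simplicity: you only need to maintain the dictionary file. The drawback is inflexibility: frequent dictionary updates become tedious, the file must be UTF-8 encoded, and the update frequency cannot be controlled, because any change to the file triggers a reload. The default one-minute update interval is also a poor fit for large dictionaries.
The advantage of the interface approach is flexibility: you can control the update frequency, choose the encoding, and switch between dictionary sources by swapping the service implementation, all without restarting Elasticsearch. The drawback is that it needs a web service behind it. It is a better fit when the dictionary is large or changes frequently.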
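The controller relies on a `DictionaryService` that is not shown. As an illustration of switching dictionary sources through different implementations, the sketch below guesses at its shape from the two calls the controller makes (`getWords()` and `getLastUpdateTime()`) and adds a hypothetical in-memory implementation. A database-backed or file-backed class could implement the same interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Assumed shape of the service, inferred from how the controller uses it. */
interface DictionaryService {
    /** All dictionary words, one entry per word. */
    List<String> getWords();

    /** Time of the last dictionary change, in epoch seconds. */
    long getLastUpdateTime();
}

/** Hypothetical in-memory dictionary source, useful for local testing. */
final class InMemoryDictionaryService implements DictionaryService {
    private final CopyOnWriteArrayList<String> words = new CopyOnWriteArrayList<>();
    private volatile long lastUpdateTime = System.currentTimeMillis() / 1000;

    /** Adds a word and records the change time; duplicates are ignored. */
    public void addWord(String word) {
        if (words.addIfAbsent(word)) {
            lastUpdateTime = System.currentTimeMillis() / 1000;
        }
    }

    @Override
    public List<String> getWords() {
        return new ArrayList<>(words); // defensive copy, callers cannot mutate the source
    }

    @Override
    public long getLastUpdateTime() {
        return lastUpdateTime;
    }
}
```

In a Spring Boot application, whichever implementation is registered as a bean would be injected into the controller, so changing the dictionary source would not require touching the request handling code.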