RedisJson: Chinese Full-Text Search
RedisJson
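- RedisJson has been getting a lot of attention online lately, so most readers will have heard of it. One performance write-up even claimed that RedisJson had burst onto the scene and crushed ES and Mongo on performance. Those several-hundred-fold gains may well be real, but what interests me more is how well RedisJson supports JSON, its full-text search capability, and which Chinese word segmentation it supports.
Installation
1. The official site offers a 30-day free trial with 30 MB of memory; creating one instance is enough for testing
- You can test the connection with redis-cli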
[root@server bin]# ./redis-cli -h redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com -p 17137 -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137>
2. Alternatively, install the ReJSON module yourself
Download: https://redis.com/redis-enter...
Installation guide: https://oss.redis.com/redisjs...
[root@server bin]# ./redis-server --loadmodule /opt/thunisoft/redis/redisjson/rejson.so
82538:C 29 Dec 2021 18:41:09.585 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
82538:C 29 Dec 2021 18:41:09.585 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=82538, just started
82538:C 29 Dec 2021 18:41:09.585 # Configuration loaded
82538:M 29 Dec 2021 18:41:09.587 * monotonic clock: POSIX clock_gettime
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 6.2.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 82538
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           https://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'
82538:M 29 Dec 2021 18:41:09.589 # Server initialized
82538:M 29 Dec 2021 18:41:09.589 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
82538:M 29 Dec 2021 18:41:09.591 * version: 20006 git sha: db3329c branch: HEAD
82538:M 29 Dec 2021 18:41:09.591 * Exported RedisJSON_V1 API
82538:M 29 Dec 2021 18:41:09.591 * Enabled diskless replication
82538:M 29 Dec 2021 18:41:09.591 * Created new data type 'ReJSON-RL'
82538:M 29 Dec 2021 18:41:09.591 * Module 'ReJSON' loaded from /opt/thunisoft/redis/redisjson/rejson.so
82538:M 29 Dec 2021 18:41:09.602 * Loading RDB produced by version 6.2.6
82538:M 29 Dec 2021 18:41:09.602 * RDB age 98297 seconds
82538:M 29 Dec 2021 18:41:09.603 * RDB memory usage when created 0.77 Mb
82538:M 29 Dec 2021 18:41:09.603 # Done loading RDB, keys loaded: 2, keys expired: 0.
82538:M 29 Dec 2021 18:41:09.603 * DB loaded from disk: 0.011 seconds
82538:M 29 Dec 2021 18:41:09.603 * Ready to accept connections
Edit redis.conf
/opt/thunisoft/redis/bin/redis.conf
-- add
loadmodule /opt/thunisoft/redis/redisjson/rejson.so
Then restart Redis; JSON.SET is now available
[root@server bin]# sh start.sh
[root@server bin]# ./redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
127.0.0.1:6379> JSON.SET jsonkey . '{"a":"b","c":["1","2","3"]}'
OK
127.0.0.1:6379> JSON.GET jsonkey
"{\"a\":\"b\",\"c\":[\"1\",\"2\",\"3\"]}"
127.0.0.1:6379> JSON.GET jsonkey .a
"\"b\""
Using JSON
JSON.SET
127.0.0.1:6379> JSON.SET doc . '{"a":2, "b": 3}'
OK
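- JSON.SET is the command that sets a JSON value
- doc is the key
- . is the root of the JSON document; the string that follows is the JSON value itself
- With RedisJson 2.0+, . can be replaced with $: JSON.SET doc $ '{"a":2, "b": 3}'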
JSON.GET
- JSON.GET retrieves a JSON value
127.0.0.1:6379> JSON.GET doc
"{\"a\":2,\"b\":3}"
127.0.0.1:6379> JSON.GET doc a
"2"
- Retrieving values from a nested structure
127.0.0.1:6379> JSON.SET doc $ '{"a":2, "b": 3, "nested": {"a": 4, "b": null},"c":{"b":4}}'
OK
127.0.0.1:6379> JSON.GET doc b
"3"
-- $..b returns all values of b
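To picture what the recursive descent in `$..b` does, here is a small Python sketch. It is not RedisJSON's implementation, and `find_all` is a hypothetical helper: it walks an already-parsed document and collects every value stored under a given key, at any depth. For the document set above it produces the same values as the reply to the JSON.GET command that follows.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: nested objects may contain the same key again.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Same document as in the JSON.SET command above (JSON null becomes None).
doc = {"a": 2, "b": 3, "nested": {"a": 4, "b": None}, "c": {"b": 4}}
print(find_all(doc, "b"))  # [3, None, 4]
```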
127.0.0.1:6379> JSON.GET doc $..b
"[3,null,4]"
JSON.STRAPPEND
- JSON.STRAPPEND <key> [path] <json-string>: appends the json-string value to the string at path. path defaults to root if not provided.
127.0.0.1:6379> JSON.SET doc $ '{"a":"foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}'
OK
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foo\",\"nested\":{\"a\":\"hello\"},\"nested2\":{\"a\":31}}]"
127.0.0.1:6379>
127.0.0.1:6379> JSON.STRAPPEND doc $..a '"baz"'
1) (integer) 6
2) (integer) 8
3) (nil)
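The reply has one entry per match of `$..a`: the new string length where the matched value is a string, and nil where it is not (here the number 31). A rough Python model of that behavior, assuming a hypothetical helper `strappend` that works on an already-parsed document rather than on Redis:

```python
def strappend(node, key, suffix):
    """Append `suffix` to each string stored under `key`; return one result per match."""
    results = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                if isinstance(v, str):
                    node[k] = v + suffix
                    results.append(len(node[k]))  # new length of the string
                else:
                    results.append(None)          # non-string match, shown as (nil)
            results.extend(strappend(v, key, suffix))
    elif isinstance(node, list):
        for item in node:
            results.extend(strappend(item, key, suffix))
    return results


doc = {"a": "foo", "nested": {"a": "hello"}, "nested2": {"a": 31}}
print(strappend(doc, "a", "baz"))  # [6, 8, None]
print(doc["a"], doc["nested"]["a"], doc["nested2"]["a"])  # foobaz hellobaz 31
```

The number under nested2 is left untouched, which is why the third entry of the reply is nil.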
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":\"foobaz\",\"nested\":{\"a\":\"hellobaz\"},\"nested2\":{\"a\":31}}]"
JSON.DEL
127.0.0.1:6379> JSON.SET doc $ '{"a": 1, "nested": {"a": 2, "b": 3}}'
OK
127.0.0.1:6379> JSON.get doc
"{\"a\":1,\"nested\":{\"a\":2,\"b\":3}}"
127.0.0.1:6379>
127.0.0.1:6379>
-- delete
127.0.0.1:6379> JSON.DEL doc $..a
(integer) 2
127.0.0.1:6379>
127.0.0.1:6379> JSON.get doc
"{\"nested\":{\"b\":3}}"
JSON.ARRAPPEND
Syntax: JSON.ARRAPPEND <key> <path> <json> [json ...]
Appends the json value(s) to the array at path, after its last element.
127.0.0.1:6379> JSON.SET doc $ '{"a":[1], "nested": {"a": [1,2]}, "nested2": {"a": 42}}'
OK
127.0.0.1:6379> JSON.ARRAPPEND doc $..a 3 4
1) (integer) 3
2) (integer) 4
3) (nil)
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,3,4],\"nested\":{\"a\":[1,2,3,4]},\"nested2\":{\"a\":42}}]"
A JSON document can nest an array that holds multiple records, much like a table
127.0.0.1:6379> JSON.SET testarray . '{"employees":[ {"name":"Alpha", "email":"[email protected]", "age":23}, {"name":"Beta", "email":"[email protected]", "age":28}, {"name":"Gamma", "email":"[email protected]", "age":33}, {"name":"Theta", "email":"[email protected]", "age":41} ]} '
OK
127.0.0.1:6379> JSON.get testarray
"{\"employees\":[{\"name\":\"Alpha\",\"email\":\"[email protected]\",\"age\":23},{\"name\":\"Beta\",\"email\":\"[email protected]\",\"age\":28},{\"name\":\"Gamma\",\"email\":\"[email protected]\",\"age\":33},{\"name\":\"Theta\",\"email\":\"[email protected]\",\"age\":41}]}"
JSON.ARRINSERT
Syntax: JSON.ARRINSERT <key> <path> <index> <json> [json ...]
Inserts the json value(s) into the array at path, before index.
127.0.0.1:6379> JSON.SET doc $ '{"a":[3], "nested": {"a": [3,4]}}'
OK
127.0.0.1:6379> JSON.ARRINSERT doc $..a 0 1 2 5
1) (integer) 4
2) (integer) 5
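Each entry in the reply above is the new length of one matched array. In Python terms, inserting several values before an index is a slice assignment. The sketch below uses a hypothetical helper `arrinsert` and covers only non-negative indexes:

```python
def arrinsert(arr, index, *values):
    """Insert `values` before position `index` and return the new array length."""
    arr[index:index] = values  # shifts the existing elements to the right
    return len(arr)


a = [3]
nested_a = [3, 4]
print(arrinsert(a, 0, 1, 2, 5))         # 4
print(arrinsert(nested_a, 0, 1, 2, 5))  # 5
print(a, nested_a)                      # [1, 2, 5, 3] [1, 2, 5, 3, 4]
```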
127.0.0.1:6379> JSON.GET doc $
"[{\"a\":[1,2,5,3],\"nested\":{\"a\":[1,2,5,3,4]}}]"
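Many more JSON commands are available; see: https://oss.redis.com/redisjs...
JSON Full-Text Search
Documentation: https://developer.redis.com/h...
By default, Chinese text is not segmented into words; it is only split on commas. English full-text search works out of the box.
According to the documentation, the language used for segmentation can be specified when the index is created: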
FT.CREATE {index}
[ON {data_type}]
[PREFIX {count} {prefix} [{prefix} ...]
[FILTER {filter}]
[LANGUAGE {default_lang}]
[LANGUAGE_FIELD {lang_attribute}]
[SCORE {default_score}]
[SCORE_FIELD {score_attribute}]
[PAYLOAD_FIELD {payload_attribute}]
[MAXTEXTFIELDS] [TEMPORARY {seconds}] [NOOFFSETS] [NOHL] [NOFIELDS] [NOFREQS] [SKIPINITIALSCAN]
[STOPWORDS {num} {stopword} ...]
SCHEMA {identifier} [AS {attribute}]
[TEXT [NOSTEM] [WEIGHT {weight}] [PHONETIC {matcher}] | NUMERIC | GEO | TAG [SEPARATOR {sep}] [CASESENSITIVE] [SORTABLE [UNF]] [NOINDEX]] |
[VECTOR {algorithm} {count} [{attribute_name} {attribute_value} ...]] ...
- Creating an index on JSON documents
- Use ON JSON; for a text field, specify TEXT
-- create a new index: i_index1
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.CREATE i_index1 ON JSON LANGUAGE chinese SCHEMA $.title TEXT
OK
-- insert data
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title": "云南省昆明市盘龙区", "content": "bar1"}'
OK
-- searching for 昆明市 returns a result
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明市" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
- How the text is segmented
As the results below show, searching for 云南省, 昆明市, or 盘龙区 finds the document, but searching for 昆明, 云南, 昆盘, and the like finds nothing.
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "云南省" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "区" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "云南省" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明市" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "昆明" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "盘龙区" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x9b\x9b\xe5\xb7\x9d\xe7\x9c\x81\xe6\x88\x90\xe9\x83\xbd\xe5\xb8\x82\xe6\x88\x90\xe5\x8d\x8e\xe5\x8c\xba\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "盘龙" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "区" LANGUAGE chinese
1) (integer) 0
- Test: 南京市长江大桥
For 南京市长江大桥, searches for 南京, 长江, and 大桥 return nothing, while searches for 南京市 and 长江大桥 return the document. My guess is that the text was segmented into 南京市 and 长江大桥.
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> JSON.SET myDoc $ '{"title": "南京市长江大桥", "content": "bar1"}'
OK
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京市" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "长江" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "大桥" LANGUAGE chinese
1) (integer) 0
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "长江大桥" LANGUAGE chinese
1) (integer) 1
2) "myDoc"
3) 1) "$"
2) "{\"title\":\"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5\",\"content\":\"bar1\"}"
redis-17137.c245.us-east-1-3.ec2.cloud.redislabs.com:17137> FT.SEARCH i_index1 "南京" LANGUAGE chinese
1) (integer) 0
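The search replies show the title as `\xNN` escapes because redis-cli prints non-ASCII bytes in escaped form by default (its `--raw` option prints replies unescaped). To read such a value after the fact, the escapes can be turned back into UTF-8 text. A small sketch, with `decode_cli_escapes` as a hypothetical helper name:

```python
import re


def decode_cli_escapes(text):
    """Turn redis-cli style \\xNN escapes back into the UTF-8 text they encode."""
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # one escape -> one raw byte
        text.encode("ascii"),
    )
    return raw.decode("utf-8")


# Title bytes copied from the reply to the 南京市 search above.
shown = r"\xe5\x8d\x97\xe4\xba\xac\xe5\xb8\x82\xe9\x95\xbf\xe6\xb1\x9f\xe5\xa4\xa7\xe6\xa1\xa5"
print(decode_cli_escapes(shown))  # 南京市长江大桥
```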
LANGUAGE chinese must be specified when the index is created
redisjson:https://oss.redis.com/redisea...
- Languages supported by full-text search:
arabic
armenian
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
serbian
spanish
swedish
tamil
turkish
yiddish
chinese (see below)
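RediSearch uses Friso for Chinese word segmentation by default
Friso: Friso is an open-source Chinese word segmenter written in ANSI C that implements the popular mmseg algorithm. Its design and implementation are fully modular, so it can easily be embedded in other programs such as MySQL or PHP. The source compiles on a variety of platforms without modification, and it supports segmenting both UTF-8 and GBK encoded text.
Friso Segmentation
- After installing Friso and testing it, the segmentation does indeed match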
[root@server friso-1.6.1-release]# ./src/friso -init ./friso.ini
Initialized in 0.340000sec
Mode: Complex
+-Version: 1.6.1 (UTF-8)
+-----------------------------------------------------------+
| friso - a chinese word segmentation writen by c. |
| bug report email - [email protected]. |
| or: visit http://code.google.com/p/friso. |
| java edition for http://code.google.com/p/jcseg |
| type 'quit' to exit the program. |
+-----------------------------------------------------------+
friso>> 南京市长江大桥
分词结果:
南京市 长江大桥
Done, cost < 0.000000sec
friso>> 云南省昆明市盘龙区
分词结果:
云南省 昆明市 盘龙区
Done, cost < 0.000000sec
friso>>
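Friso is built on the mmseg algorithm: forward maximum matching at its core, supplemented by several disambiguation rules
mmseg segmentation: http://technology.chtsai.org/...
At each step, scanning a complete sentence from left to right, the algorithm identifies the different possible combinations of 3 words, then uses the 4 disambiguation rules below to choose the best candidate combination.
The first word of that combination becomes the output of the iteration; the remaining 2 words go back into the next round of segmentation.
The benefit of this approach is that it adds context to traditional forward maximum matching, which considers only the word itself each time it picks a word and ignores the neighboring words.
The 4 disambiguation rules, applied in order until a single candidate remains, are:
1) The total length of the candidate combination is the largest.
2) The average word length of the candidate combination is the largest.
3) The variation in word length within the candidate combination is the smallest.
4) The frequency statistic of the single-character words in the candidate combination is the highest.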
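A minimal sketch of the chunk-based selection described above, under stated assumptions: the dictionary `WORDS` is a tiny hypothetical one made up for this example, any single character counts as a word, and only rules 1 to 3 are implemented (rule 4 needs character frequency data, so remaining ties fall back to the first candidate found). It is not Friso's code.

```python
from statistics import pvariance

# Hypothetical mini dictionary, just large enough for the two test sentences.
WORDS = {
    "南京", "南京市", "市长", "长江", "大桥", "长江大桥",
    "云南", "云南省", "昆明", "昆明市", "盘龙", "盘龙区",
}


def chunks(text, start, words, max_len, depth=3):
    """Yield every combination of up to `depth` consecutive words from `start`."""
    if depth == 0 or start >= len(text):
        yield []
        return
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        word = text[start:end]
        if len(word) == 1 or word in words:  # single characters are always allowed
            for rest in chunks(text, end, words, max_len, depth - 1):
                yield [word] + rest


def segment(text, words):
    """Segment `text` by repeatedly picking the first word of the best chunk."""
    max_len = max(map(len, words), default=1)
    out, pos = [], 0
    while pos < len(text):
        best = max(
            chunks(text, pos, words, max_len),
            key=lambda c: (
                sum(map(len, c)),                 # rule 1: largest total length
                sum(map(len, c)) / len(c),        # rule 2: largest average word length
                -pvariance([len(w) for w in c]),  # rule 3: smallest length variance
            ),
        )
        out.append(best[0])
        pos += len(best[0])
    return out


print(segment("南京市长江大桥", WORDS))      # ['南京市', '长江大桥']
print(segment("云南省昆明市盘龙区", WORDS))  # ['云南省', '昆明市', '盘龙区']
```

With this toy dictionary, the candidate 南京 / 市长 / 江 loses on rule 1 because it covers only five characters, and 南京市 / 长江 / 大桥 loses to 南京市 / 长江大桥 on rule 2.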
Comparison with segmentation in the abase database (SCWS)
- SCWS segments very finely, covering essentially every possible word split
postgres=# select to_tsvector('testzhcfg','南京市长江大桥');
to_tsvector
----------------------------------------------------------------------------------------
'南京':2 '南京市':1 '大':9 '大桥':6 '市':3 '桥':10 '江':8 '长':7 '长江':5 '长江大桥':4
(1 row)
postgres=# select to_tsvector('testzhcfg','云南省昆明市盘龙区');
to_tsvector
-----------------------------------------------------------------------------------------------------
'区':12 '龙':11 '云南':2 '云南省':1 '市':7 '昆':6,10 '盘龙':9 '盘龙区':8 '昆明':5 '昆明市':4 '省':3
(1 row)
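Why the granularity matters for search can be shown with a toy inverted index. This is a simplification that assumes a query term is looked up as one exact token; real engines also process the query itself. The coarse token list is the Friso output shown earlier, and the fine one is the SCWS output above.

```python
def build_index(doc_tokens):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, term):
    """Exact-token lookup: a term matches only if it was indexed as a token."""
    return sorted(index.get(term, ()))


coarse = build_index({"myDoc": ["云南省", "昆明市", "盘龙区"]})
fine = build_index({"myDoc": ["区", "龙", "云南", "云南省", "市", "昆",
                              "盘龙", "盘龙区", "昆明", "昆明市", "省"]})

print(search(coarse, "昆明市"))  # ['myDoc']
print(search(coarse, "昆明"))    # []
print(search(fine, "昆明"))      # ['myDoc']
```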
ES
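ES has a dedicated analysis engine and supports multiple analyzers; the IK analyzer is a common choice
Summary
1. RedisJson supports full-text search over JSON using Friso segmentation. The segmentation is not very fine-grained, so some two-character words cannot be found
2. By comparison, its JSON manipulation features are fairly complete. RedisJson has not been out for long, so there are few documented use cases online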