PostgreSQL pg_trgm usage,字符串拆分,相似匹配,模糊查询
pg_trgm是一个比较有意思的模块, 主要功能是比较两个字符串的相似度, 可以使用GIN或者GIST索引。
原理解析:
要比较字符串的相似度,首先要对原来的字符串进行拆分 :
下面是trigram or trigraph 拆分规则(不区分大小写) :
字符串被前置两个空格,后置一个空格,然后按3个连续的字符为分组进行拆分。
例如 :
mysite=> select show_trgm('bwhite'); show_trgm ------------------------------------- {" b"," bw",bwh,hit,ite,"te ",whi} (1 row)
拆分完后, 比较两个字符的相似度,算法是(两个字符串相同trigram的个数)除以(总共被拆成多少个trigram),
pg_trgm这个extension新增的几个常用的函数和操作符 :
similarity(text, text)
计算两个字符串的相似度。
show_trgm(text)
返回字符串拆分后的trigram。
show_limit()
返回目前系统中设置的相似度阈值,大于或者等于这个阈值的两个字符串,similarity_op(text, text)或 字符串1 % 字符串2 返回true.
set_limit(real)
这个函数用来设置相似度阈值。
例如:
mysite=> select similarity('bwtigerte','bwhite'); similarity ------------ 0.214286 (1 row) mysite=> select set_limit(0.4); set_limit ----------- 0.4 (1 row)
测试环境:
os: ubuntu 14.04
postgresql: 9.4
1. 建库, 建表
shell>>>: sudo adduser mysite sudo su - postgres psql CREATE USER mysite WITH PASSWORD 'tiger'; CREATE DATABASE mysite OWNER mysite; GRANT ALL PRIVILEGES ON DATABASE mysite to mysite;
超级用户连接 mysite 数据库并安装 pg_trgm 扩展:
\c mysite CREATE EXTENSION pg_trgm;
测试扩展模块是否安装成功:
mysite=> select show_trgm('bwhite'); show_trgm ------------------------------------- {" b"," bw",bwh,hit,ite,"te ",whi} (1 row)
2. 创建表添加记录:
mysite=> select * from polls_textlike; id | title | content | pub_date ----+--------+------------------------------------------+------------------------------- 1 | tiger | tiger are cute | 2015-10-16 20:58:13.247366+08 2 | bwhite | when tiger gets angry, you just run away | 2015-10-17 13:55:46.163738+08 (2 rows)
为需要相似匹配的字段创建索引:
CREATE INDEX trgm_idx ON mysite USING gist (content gist_trgm_ops);
重点来了,计算 content 字段的相似度:
mysite=> SELECT content, similarity(content, 'tiger'), id FROM polls_textlike; content | similarity | id ------------------------------------------+------------+---- tiger are cute | 0.4 | 1 when tiger gets angry, you just run away | 0.153846 | 2 (2 rows)
更多:
别人的测试:
In the following example we’ll see the use of each one. Operator % will return true if the similarity of the strings is greater than similarity threshold returned by show_limit function. In this example, both string are equal, in consequence, the operation will return true:
palominodb=# select similarity(‘Palomino’,’Palomino’) AS Similarity, ‘Palomino'<->’Palomino’ AS distance, ‘Palomino’ % ‘Palomino’ AS SimilarOrNot; -[ RECORD 1 ]+– similarity | 1 distance | 0 similarornot | t
参考链接:
http://www.postgresql.org/docs/9.4/static/pgtrgm.html
http://www.pythian.com/blog/indexing-text-columns-with-gist-or-gin-to-optimize-like-ilike-using-pg_trgm-in-postgres-9-1-part-1-2/
http://blog.163.com/digoal@126/blog/static/163877040201191882553803/