parser切词&全文检索

全文检索:

http://www.postgresql.org/docs/9.3/interactive/textsearch.html

postgres 切词:

http://blog.163.com/digoal@126/blog/static/163877040201422410175698/
ts_parse: 获取切词后的所有词组
SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围');
SELECT * FROM ts_parse('english', 'hello world! 2010年保障房建设在全国范围');
to_tsvector,to_tsquery:获取一个text的所有切词
select to_tsvector('english', ' example is a concatenation of title and body');
select ' before, full text search functionality includes the ability to do many m' @@ 'ability';
select to_tsvector(' before, full text search functionality includes the ability to do many m') @@ to_tsquery('ability');
在tsvector字段上加Gin索引可以加快检索速度:
http://www.postgresql.org/docs/9.3/interactive/textsearch-tables.html#TEXTSEARCH-TABLES-SEARCH
建立一个索引需要格式如CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));#body==>text字段
建索引的时候需注意 to_tsvector函数有两个版本 That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot.
 
    CREATE INDEX name ON table USING gist(column); 
    

Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type.

CREATE INDEX name ON table USING gin(column);

Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.


使用zhparser插件:
一般postgresql的切词是english的,对于中文切词不支持。
http://blog.163.com/digoal@126/blog/static/163877040201422410175698/

相似度pg_trgm :

是一个比较字符串相似度中间件。 http://blog.163.com/digoal@126/blog/static/163877040201191882553803/
此插件暂时对中文不支持。

1:安装中间件:

postgres=# create extension pg_trgm;
CREATE EXTENSION
查看pg_trgm的所有方法:
postgres=# \dx+ pg_trgm 
                                  Objects in extension "pg_trgm"
                                        Object Description                                        
--------------------------------------------------------------------------------------------------
 function gin_extract_query_trgm(text,internal,smallint,internal,internal,internal,internal)
 function gin_extract_value_trgm(text,internal)
 function gin_trgm_consistent(internal,smallint,text,integer,internal,internal,internal,internal)
 function gtrgm_compress(internal)
 function gtrgm_consistent(internal,text,integer,oid,internal)
 function gtrgm_decompress(internal)
 function gtrgm_distance(internal,text,integer,oid)
 function gtrgm_in(cstring)
 function gtrgm_out(gtrgm)
 function gtrgm_penalty(internal,internal,internal)
 function gtrgm_picksplit(internal,internal)
 function gtrgm_same(gtrgm,gtrgm,internal)
 function gtrgm_union(bytea,internal)
 function set_limit(real)
 function show_limit()
 function show_trgm(text)
 function similarity(text,text)
 function similarity_dist(text,text)
 function similarity_op(text,text)
 operator %(text,text)
 operator <->(text,text)
 operator class gin_trgm_ops for access method gin
 operator class gist_trgm_ops for access method gist
 operator family gin_trgm_ops for access method gin
 operator family gist_trgm_ops for access method gist
 type gtrgm
(26 rows)

2:函数介绍:

1:similarity 计算两个字符串的相似度

postgres=# select similarity('entries','entry');
 similarity 
------------
        0.4

2:show_limit、set_limit查看和修改系统的相似度阀值:

postgres=# select show_limit();
 show_limit 
------------
        0.3
postgres=# select set_limit(0.3);
 set_limit 
-----------
       0.3

3:similarity_op,和操作符“%”一样都是判断两个字符串的相似度是否大于等于系统相似度阀值:

postgres=# select similarity_op('entries','ries');
 similarity_op 
---------------
 t
(1 row)
postgres=# select similarity_op('entries','ri');  
 similarity_op 
---------------
 f
postgres=# select 'entries' % 'ries';
 ?column? 
----------
 t
postgres=# select 'entries' % 'ri';  
 ?column? 
----------
 f

4:操作符“<->”  返回两个相似度的距离(1减去其两个字符串的相似度)

postgres=# select  similarity('entries' ,'entri');
 similarity 
------------
   0.555556
postgres=# select 'entries' <-> 'entri';
 ?column? 
----------
 0.444444
(1 row)

3:比较like/ilike,%,<->走索引和不走索引的时间比

like是区分大小写的。
ilike是不区分大小写的。
%和<->是pg_trgm的操作符。
postgres=# create table test_pg_trgm (id serial primary key,info text);
CREATE TABLE
postgres=#  insert into test_pg_trgm (info) select generate_series(1,1000000)||'digoal';
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
INSERT 0 1000000

建立gist索引:
postgres=# create index idx_test_pg_trgm_1 on test_pg_trgm using gist (info gist_trgm_ops);
LOG:  checkpoints are occurring too frequently (6 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 35253.859 ms
建立gin索引:
postgres=# create index idx_test_pg_trgm_1 on test_pg_trgm using gin (info gin_trgm_ops); 
LOG:  checkpoints are occurring too frequently (3 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (3 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 11286.458 ms

1:like

不走索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
 count 
-------
     1
(1 row)

Time: 373.475 ms
走gist索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
 count 
-------
     1
(1 row)

Time: 16.443 ms
走gin索引:
postgres=# select count(*) from test_pg_trgm where info ilike '%123444%al%';
 count 
-------
     1
(1 row)

Time: 3.065 ms

2:%

不走索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
 count 
-------
  1132
(1 row)

Time: 1532.612 ms
走gist索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
 count 
-------
  1132
(1 row)

Time: 166.685 ms
走gin索引:
postgres=# select count(*) from test_pg_trgm where info % '%123%oal%';
 count 
-------
  1132
(1 row)

Time: 2478.780 ms

3:<->

不走索引:
postgres=# select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
  id  |    info    
------+------------
 1234 | 1234digoal
(1 row)

Time: 1759.215 ms
走gist索引:
postgres=# select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
  id  |    info    
------+------------
 1234 | 1234digoal
(1 row)

Time: 126.471 ms
走gin索引:
postgres=#  select * from test_pg_trgm order by info <-> '1234digoal' limit 1;
  id  |    info    
------+------------
 1234 | 1234digoal
(1 row)

Time: 1690.692 ms















你可能感兴趣的:(postgresql)