普通索引,如B+树索引,hash索引是对表中某一字段数据的精确查找,例如字段 name ='dxt00',属于短文本查找。
例如,对于下面的查询B+树是支持的:(查找以‘xxx'开头的blog)
select * from blog where content like 'xxx%'
显然,这并不符合用户查询博客中是否含有某一关键词的需求,即:
select * from blog where content like '%xxx%'
这时候,我们就可以使用全文检索了。
全文检索是将存储于整本书或整篇文章中的任意内容查找出来的技术。
全文检索使用倒排索引来实现,用到了两种关联数组:
1)inverted file index
{单词,单词所在文档的ID}
2) full inverted index
{单词,(单词所在文档的ID,单词在具体文件中的位置)}
例如,书上给出的全文检索表,以及对应的inverted file index 和 full inverted index
Innodb中采用 full inverted index 方式。
结构为(word,ilist)
其中,ilist 为(DocumentId,position) 为了支持全文检索,必须有一个列与word进行映射,Innodb中这个列被命名为FTS_DOC_ID,类型必须是 BIGINT UNSIGNED NOT NULL,用户也可以在建表时自动添加FTS_DOC_ID
mysql> create table fts_a(
-> FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
-> body TEXT,
-> PRIMARY KEY(FTS_DOC_ID)
-> );
Query OK, 0 rows affected (0.12 sec)
插入数据:
mysql> insert into fts_a select null,'please porridge in the pot';
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into fts_a select null,'please porridge hot,please porridge cold';
Query OK, 1 row affected (0.03 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into fts_a select null,'Nine days old';
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> insert into fts_a select null,'Some like it hot,some like it cold';
Query OK, 1 row affected (0.02 sec)
Records: 1 Duplicates: 0 Warnings: 0
为body字段创建一个类为fulltext 的索引
mysql> alter table fts_a add fulltext key(body);
Query OK, 0 rows affected (0.39 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> show index from fts_a;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| fts_a | 0 | PRIMARY | 1 | FTS_DOC_ID | A | 4 | NULL | NULL | | BTREE | | | YES | NULL |
| fts_a | 1 | body | 1 | body | NULL | 4 | NULL | NULL | YES | FULLTEXT | | | YES | NULL |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.03 sec)
通过设置参数 innodb_ft_aux_table 来查看分词对应的信息:
mysql> set global innodb_ft_aux_table = 'index_study/fts_a';
Query OK, 0 rows affected (0.00 sec)
mysql> select * from information_schema.innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| cold | 2 | 4 | 2 | 2 | 36 |
| cold | 2 | 4 | 2 | 4 | 30 |
| days | 3 | 3 | 1 | 3 | 5 |
| hot | 2 | 4 | 2 | 2 | 16 |
| hot | 2 | 4 | 2 | 4 | 13 |
| like | 4 | 4 | 1 | 4 | 5 |
| like | 4 | 4 | 1 | 4 | 17 |
| nine | 3 | 3 | 1 | 3 | 0 |
| old | 3 | 3 | 1 | 3 | 10 |
| please | 1 | 2 | 2 | 1 | 0 |
| please | 1 | 2 | 2 | 2 | 0 |
| please | 1 | 2 | 2 | 2 | 20 |
| porridge | 1 | 2 | 2 | 1 | 7 |
| porridge | 1 | 2 | 2 | 2 | 7 |
| porridge | 1 | 2 | 2 | 2 | 20 |
| pot | 1 | 1 | 1 | 1 | 23 |
| some | 4 | 4 | 1 | 4 | 0 |
| some | 4 | 4 | 1 | 4 | 17 |
+----------+--------------+-------------+-----------+--------+----------+
18 rows in set (0.01 sec)
FIRST_DOC_ID :word第一次出现的文档ID
LAST_DOC_ID : word最后一次出现的文档ID
DOC_COUNT :含有word的文档个数
DOC_ID :当前文档ID
POSITION : word 当在前文档ID的位置
Innodb不会直接删除索引中对应的记录,而是将删除的文档ID插入到 DELETED表。
因此用户可以进行以下查询:
(我们试着删除第二行文档)
mysql> delete from index_study.fts_a where FTS_DOC_ID =2;
Query OK, 1 row affected (0.01 sec)
mysql> select * from information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
| 2 |
+--------+
1 row in set (0.00 sec)
可以看到删除的文档ID插入到了表 information_schema.INNODB_FT_DELETED 中,并且innodb_ft_aux_table并没有改变:
mysql> select * from information_schema.innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| cold | 2 | 4 | 2 | 2 | 36 |
| cold | 2 | 4 | 2 | 4 | 30 |
| days | 3 | 3 | 1 | 3 | 5 |
| hot | 2 | 4 | 2 | 2 | 16 |
| hot | 2 | 4 | 2 | 4 | 13 |
| like | 4 | 4 | 1 | 4 | 5 |
| like | 4 | 4 | 1 | 4 | 17 |
| nine | 3 | 3 | 1 | 3 | 0 |
| old | 3 | 3 | 1 | 3 | 10 |
| please | 1 | 2 | 2 | 1 | 0 |
| please | 1 | 2 | 2 | 2 | 0 |
| please | 1 | 2 | 2 | 2 | 20 |
| porridge | 1 | 2 | 2 | 1 | 7 |
| porridge | 1 | 2 | 2 | 2 | 7 |
| porridge | 1 | 2 | 2 | 2 | 20 |
| pot | 1 | 1 | 1 | 1 | 23 |
| some | 4 | 4 | 1 | 4 | 0 |
| some | 4 | 4 | 1 | 4 | 17 |
+----------+--------------+-------------+-----------+--------+----------+
18 rows in set (0.01 sec)
如果用户想彻底删除倒排索引中该文档的分词信息,可以运行以下SQL语句:
mysql> set global innodb_optimize_fulltext_only = 1;
Query OK, 0 rows affected (0.00 sec)
再次查看deleted表和索引表:
mysql> select * from information_schema.INNODB_FT_DELETED;
+--------+
| DOC_ID |
+--------+
| 2 |
+--------+
1 row in set (0.00 sec)
mysql> select * from information_schema.innodb_ft_index_table;
+----------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+----------+--------------+-------------+-----------+--------+----------+
| cold | 4 | 4 | 1 | 4 | 30 |
| days | 3 | 3 | 1 | 3 | 5 |
| hot | 4 | 4 | 1 | 4 | 13 |
| like | 4 | 4 | 1 | 4 | 5 |
| like | 4 | 4 | 1 | 4 | 17 |
| nine | 3 | 3 | 1 | 3 | 0 |
| old | 3 | 3 | 1 | 3 | 10 |
| please | 1 | 1 | 1 | 1 | 0 |
| porridge | 1 | 1 | 1 | 1 | 7 |
| pot | 1 | 1 | 1 | 1 | 23 |
| some | 4 | 4 | 1 | 4 | 0 |
| some | 4 | 4 | 1 | 4 | 17 |
+----------+--------------+-------------+-----------+--------+----------+
12 rows in set (0.00 sec)
文档二已经彻底删除。
Mysql语句:
match (col1,col2,...) against (expr)
举例:
mysql> select * from fts_a
-> where match (body)
-> against ('porridge');
+------------+----------------------------+
| FTS_DOC_ID | body |
+------------+----------------------------+
| 1 | please porridge in the pot |
+------------+----------------------------+
1 row in set (0.01 sec)
mysql> select * from fts_a
-> where match (body)
-> against ('like');
+------------+------------------------------------+
| FTS_DOC_ID | body |
+------------+------------------------------------+
| 4 | Some like it hot,some like it cold |
+------------+------------------------------------+
1 row in set (0.00 sec)
explain看一下:
mysql> explain select * from fts_a
-> where match (body)
-> against ('like');
+----+-------------+-------+------------+----------+---------------+------+---------+-------+------+----------+-------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+----------+---------------+------+---------+-------+------+----------+-------------------------------+
| 1 | SIMPLE | fts_a | NULL | fulltext | body | body | 0 | const | 1 | 100.00 | Using where; Ft_hints: sorted |
+----+-------------+-------+------------+----------+---------------+------+---------+-------+------+----------+-------------------------------+
1 row in set, 1 warning (0.01 sec)
type = fulltext,即用了全文检索的倒排索引。