Java for Web Study Notes (122): Search (4) MySQL Full-Text Index (Part 1)

Full-Text Index

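A full-text index stores, for each word, how often it occurs and which records contain it. The more frequent a word, the lower its weight, and an algorithm turns these values into a relevance score. Both MyISAM and InnoDB in MySQL support full-text search, but note: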

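  • InnoDB has provided full-text indexes only since version 5.6.4.
  • Although the syntax is the same, MyISAM and InnoDB differ in implementation and algorithm, so their relevance scores are not comparable: do not compare a relevance value from an InnoDB table with one from a MyISAM table.
  • MyISAM has 543 stopwords (words so common that they are not indexed), while InnoDB has 36. The MySQL documentation gives the commands for adding or removing stopwords in each engine, as well as how to tune the full-text search configuration parameters.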

SQL

See the official MySQL documentation for details.

Adding a FULLTEXT KEY to a table

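-- Full-text index on a single column
ALTER TABLE TicketComment ADD FULLTEXT INDEX TicketComment_Search (Body);
-- Composite full-text index: both columns are searched together, and a match in either column counts toward the relevance score
ALTER TABLE Ticket ADD FULLTEXT INDEX Ticket_Search (Subject, Body);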

Search SQL statements

Single-word search

mysql> SELECT * FROM `TicketComment` WHERE MATCH(`Body`) AGAINST('test');
+-----------+----------+--------+-------------------+----------------------------+
| CommentId | TicketId | UserId | Body              | DateCreated                |
+-----------+----------+--------+-------------------+----------------------------+
|         2 |        1 |      4 | Comment Two: test | 2018-03-07 15:40:22.631000 |
|        12 |        1 |      4 | Test              | 2018-03-07 15:50:35.068000 |
+-----------+----------+--------+-------------------+----------------------------+
2 rows in set (0.04 sec)

Viewing the relevance score

mysql> SELECT *, MATCH(`Body`) AGAINST('test') AS score From TicketComment;
+-----------+----------+--------+-------------------+----------------------------+--------------------+
| CommentId | TicketId | UserId | Body              | DateCreated                | score              |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
|         1 |        1 |      4 | my comment: Hello | 2018-03-07 15:40:00.719000 |                  0 |
|         2 |        1 |      4 | Comment Two: test | 2018-03-07 15:40:22.631000 | 0.6055193543434143 |
|         3 |        1 |      4 | Comment Three : 3 | 2018-03-07 15:40:51.588000 |                  0 |
|         4 |        1 |      4 | Comment Four: 4   | 2018-03-07 15:41:00.622000 |                  0 |
|         5 |        1 |      4 | Comment Five: 5   | 2018-03-07 15:41:09.777000 |                  0 |
|         6 |        1 |      4 | Comment Six: 6    | 2018-03-07 15:41:16.899000 |                  0 |
|         7 |        1 |      4 | Comment Serven: 7 | 2018-03-07 15:41:28.665000 |                  0 |
|         8 |        1 |      4 | Comment  8        | 2018-03-07 15:41:37.733000 |                  0 |
|         9 |        1 |      4 | Comment 9         | 2018-03-07 15:41:43.515000 |                  0 |
|        10 |        1 |      4 | Comment 10        | 2018-03-07 15:41:51.349000 |                  0 |
|        11 |        1 |      4 | Comment 11        | 2018-03-07 15:42:01.263000 |                  0 |
|        12 |        1 |      4 | Test              | 2018-03-07 15:50:35.068000 | 0.6055193543434143 |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
12 rows in set (0.01 sec)
SELECT *, MATCH(`Body`) AGAINST('test Hello') AS score FROM TicketComment;
# The query below is equivalent to the one above, but in this form the relevance of different columns, or of different tables (via a join), can be added together
SELECT *, (MATCH(`Body`) AGAINST('test') + MATCH(`Body`) AGAINST('Hello')) AS score FROM TicketComment;
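From Java, the summed-score query can be assembled with one placeholder per search term and then executed through a JDBC PreparedStatement. The following is only a sketch: the class RelevanceSql, its methods, and the added ORDER BY clause are assumptions made for illustration, not part of the original notes.

```java
import java.util.Collections;

final class RelevanceSql {
    /** Quote an identifier with backticks, doubling any embedded backtick. */
    static String quote(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }

    /**
     * Build a SELECT that adds up one MATCH ... AGAINST(?) score per search term.
     * The ORDER BY is an assumption: it simply lists the best matches first.
     */
    static String summedScore(String table, String column, int termCount) {
        if (termCount < 1) {
            throw new IllegalArgumentException("termCount must be >= 1");
        }
        String match = "MATCH(" + quote(column) + ") AGAINST(?)";
        String sum = String.join(" + ", Collections.nCopies(termCount, match));
        return "SELECT *, (" + sum + ") AS score FROM " + quote(table)
                + " ORDER BY score DESC";
    }

    public static void main(String[] args) {
        // Two terms, e.g. 'test' and 'Hello', would be bound to the two placeholders.
        System.out.println(summedScore("TicketComment", "Body", 2));
    }
}
```

Each `?` would then be bound with `PreparedStatement.setString`, which keeps the search words out of the SQL text itself.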

Multi-word search

mysql> select * from TicketComment where match(`Body`) against('Five Six');
+-----------+----------+--------+-----------------+----------------------------+
| CommentId | TicketId | UserId | Body            | DateCreated                |
+-----------+----------+--------+-----------------+----------------------------+
|         5 |        1 |      4 | Comment Five: 5 | 2018-03-07 15:41:09.777000 |
|         6 |        1 |      4 | Comment Six: 6  | 2018-03-07 15:41:16.899000 |
+-----------+----------+--------+-----------------+----------------------------+
2 rows in set (0.00 sec)

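Testing shows that single digits and "my" are never matched, as if they were stopwords. For such short words the minimum indexed word length is the more likely cause: MyISAM skips words shorter than ft_min_word_len (default 4) and InnoDB skips words shorter than innodb_ft_min_token_size (default 3).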
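Because words that are too short or that appear on the stopword list can never match, an application may want to drop them before building a query. Below is a minimal sketch under stated assumptions: the class SearchTerms, the whitespace-only tokenization, and the caller-supplied minimum length and stopword set are all hypothetical, and MySQL's own tokenizer handles punctuation differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SearchTerms {
    /**
     * Keep only the words that could be found in the index: at least
     * minLength characters long and not in the given stopword set.
     * The stopword set is expected to contain lower-case words.
     */
    static List<String> searchable(String input, int minLength, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (lower.length() >= minLength && !stopwords.contains(lower)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A tiny illustrative stopword set, not the real MySQL list.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "from", "this"));
        // With a minimum length of 4, "my" and "5" are dropped.
        System.out.println(searchable("my Comment Five 5", 4, stopwords));
    }
}
```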

Composite index

mysql> select * from Ticket where Match(`subject`,`Body`) against('hello');
+----------+--------+---------+-----------------------------------+----------------------------+
| TicketId | UserId | Subject | Body                              | DateCreated                |
+----------+--------+---------+-----------------------------------+----------------------------+
|        1 |      3 | hello   | This is the frist ticket created! | 2018-01-15 16:09:13.016000 |
+----------+--------+---------+-----------------------------------+----------------------------+
1 row in set (0.01 sec)

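Using boolean mode

Specifying boolean mode inside AGAINST allows logical combinations: a word that must be present, a word that must be absent, a prefix match, and AND/OR relationships. The available operator characters can be looked up in ft_boolean_syntax, where ft stands for fulltext.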

mysql> SHOW VARIABLES LIKE 'ft%';
+--------------------------+----------------+
| Variable_name            | Value          |
+--------------------------+----------------+
| ft_boolean_syntax        | + -><()~*:""&| |
| ft_max_word_len          | 84             |
| ft_min_word_len          | 4              |
| ft_query_expansion_limit | 20             |
| ft_stopword_file         | (built-in)     |
+--------------------------+----------------+
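  • + the word must be present. For example, +apple matches only rows that contain the word apple; to also match words that begin with apple, such as apple123, add the wildcard: +apple*.
  • (no operator) the word is optional. For example, apple banana matches rows that contain apple or banana.
  • - the word must not be present. For example, +apple -banana matches rows that contain apple but not banana.
  • > raises the word's contribution to relevance, so rows containing it rank higher.
  • < lowers the word's contribution to relevance.
  • ( ) groups words into a subexpression. For example, +aaa +(>bbb
  • ~ turns the word's contribution negative: rows containing it rank lower but, unlike with -, are not excluded.
  • * wildcard (prefix match); it can only be appended to the end of a word.
  • " " exact phrase: text enclosed in double quotes must match as a whole and is not split into separate words.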

Example:

select * from TicketComment where match(`Body`) against('Test -two' in boolean mode);
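When the search words come from user input, the boolean-mode expression is better assembled in code and bound as a statement parameter. The sketch below is an assumption for illustration only: the class BooleanQuery and its methods are hypothetical, and it simply strips the operator characters listed in ft_boolean_syntax (plus whitespace) from each word, which also removes hyphens inside words.

```java
import java.util.ArrayList;
import java.util.List;

final class BooleanQuery {
    /** Statement to run with the built expression bound to the single placeholder. */
    static final String SQL =
            "SELECT * FROM TicketComment WHERE MATCH(`Body`) AGAINST(? IN BOOLEAN MODE)";

    private final List<String> parts = new ArrayList<>();

    /** Remove boolean-mode operator characters and whitespace from a word. */
    static String clean(String word) {
        return word.replaceAll("[+\\-><()~*:\"&|\\s]", "");
    }

    private BooleanQuery add(String prefix, String word, String suffix) {
        String cleaned = clean(word);
        if (!cleaned.isEmpty()) {          // skip words that were nothing but operators
            parts.add(prefix + cleaned + suffix);
        }
        return this;
    }

    BooleanQuery must(String word)    { return add("+", word, ""); }  // +word
    BooleanQuery mustNot(String word) { return add("-", word, ""); }  // -word
    BooleanQuery should(String word)  { return add("", word, ""); }   // word (optional)
    BooleanQuery prefix(String word)  { return add("", word, "*"); }  // word*

    String build() {
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Same expression as the example above: Test -two
        System.out.println(new BooleanQuery().should("Test").mustNot("two").build());
    }
}
```

The string returned by `build()` would be passed to `PreparedStatement.setString(1, ...)` for the statement in `SQL`.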
[Reference] https://blog.csdn.net/u011734144/article/details/52817766


Related links: my articles on Professional Java for Web Applications
