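SQLite FTS3 and FTS4 Extensions

Overview

FTS3 and FTS4 are SQLite virtual table modules that let users run full-text searches over a set of documents. The user supplies a term, or a series of terms, and the system finds the set of documents that best matches those terms. This article describes how to deploy and use FTS3 and FTS4.

FTS1 and FTS2 are obsolete full-text search modules with known issues. Portions of the original FTS3 code were contributed to the SQLite project, and FTS3 is now developed and maintained as part of SQLite.

1. Introduction to FTS3 and FTS4

The FTS3 and FTS4 extension modules let users create special tables with a built-in full-text index ("FTS tables"). The full-text index lets the user efficiently query the database for all rows that contain one or more words ("tokens"), even if the table contains a large number of documents.

For example, suppose each of 517430 documents is inserted into both an FTS table and an ordinary SQLite table created with the following SQL script: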

CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */

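Either of the two queries below can then be run to count the documents in the database that contain the word "linux". On one desktop PC hardware configuration, the query on the FTS3 table returned in approximately 0.03 seconds, versus 22.5 seconds for the query on the ordinary table.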

SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */

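Of course, the two queries above are not entirely equivalent. For example, the LIKE query also matches rows that contain terms such as "linuxphobe" or "EnterpriseLinux", whereas the MATCH query on the FTS3 table selects only rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes around 2000 MB on disk, compared with about 1453 MB for the ordinary table. On the same hardware used for the queries above, the FTS3 table took about 31 minutes to populate, versus 25 minutes for the ordinary table.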
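The difference between MATCH and LIKE can be reproduced on a tiny data set. The sketch below uses Python's standard sqlite3 module and assumes it was built with FTS3 enabled (this is common, but not guaranteed); the table names fts_docs and plain_docs and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS3 enabled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE fts_docs USING fts3(content)")
conn.execute("CREATE TABLE plain_docs(content TEXT)")

# Hypothetical sample rows: only the first contains "linux" as a separate token.
samples = [
    "I run Linux at home",
    "a linuxphobe wrote this",
    "EnterpriseLinux release notes",
]
for text in samples:
    conn.execute("INSERT INTO fts_docs(content) VALUES (?)", (text,))
    conn.execute("INSERT INTO plain_docs(content) VALUES (?)", (text,))

# MATCH compares whole tokens; LIKE '%linux%' matches any substring.
match_count = conn.execute(
    "SELECT count(*) FROM fts_docs WHERE content MATCH 'linux'"
).fetchone()[0]
like_count = conn.execute(
    "SELECT count(*) FROM plain_docs WHERE content LIKE '%linux%'"
).fetchone()[0]

print(match_count)  # 1
print(like_count)   # 3
```

Both queries ignore case here, which is why the row containing "Linux" is counted by each of them.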

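1.1. Differences between FTS3 and FTS4

FTS3 and FTS4 are nearly identical. They share most of their code, and their interfaces are the same. The differences are:

  • FTS4 contains query performance optimizations that may significantly improve the performance of full-text queries containing very common terms (terms present in a large percentage of table rows).
  • FTS4 supports some additional options that may be used with the matchinfo() function.
  • Because it stores extra information on disk in two new shadow tables, in order to support the performance optimizations and the extra matchinfo() options, an FTS4 table may consume more disk space than the equivalent FTS3 table. The overhead is usually 1-2% or less, but may be as high as 10% when the documents stored in the FTS table are very small.
  • FTS4 provides hooks (the compress and uncompress options) that allow data to be stored in compressed form, reducing disk usage and I/O.

FTS4 is an enhancement to FTS3.

Which one should you use in your application? FTS4 is sometimes significantly faster than FTS3.

For newer applications, FTS4 is recommended; if compatibility is important, FTS3 will serve just as well.

1.2. Creating and Destroying FTS Tables

Like other virtual table types, FTS tables are created with the CREATE VIRTUAL TABLE statement.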

-- Create an FTS table named "data" with one column - "content":
CREATE VIRTUAL TABLE data USING fts3();

-- Create an FTS table named "pages" with three columns:
CREATE VIRTUAL TABLE pages USING fts4(title, keywords, body);

-- Create an FTS table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS and SQLite. 
CREATE VIRTUAL TABLE mail USING fts3(
  subject VARCHAR(256) NOT NULL,
  body TEXT CHECK(length(body)<10240)
);

Dropping an FTS table works the same way as for an ordinary table, using DROP TABLE:

-- Create, then immediately drop, an FTS4 table.
CREATE VIRTUAL TABLE data USING fts4();
DROP TABLE data;

1.3. Populating FTS Tables

FTS tables are populated using INSERT, UPDATE, and DELETE statements.

-- Create an FTS table
CREATE VIRTUAL TABLE pages USING fts4(title, body);

-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');

-- Insert a row and allow FTS to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');

-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;

-- Delete the entire table contents.
DELETE FROM pages;

-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');

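To support full-text queries, FTS maintains an inverted index that maps each unique term in the data set to the locations where it appears in the table contents.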
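To make the idea concrete, here is a toy inverted index in Python. It is only a sketch of the concept, not how FTS stores its index on disk: the function name build_inverted_index, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {docid: [token positions]}; docs is {docid: text}."""
    index = defaultdict(dict)
    for docid, text in docs.items():
        # Simplified tokenizer: lowercase, split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(docid, []).append(position)
    return index


# Hypothetical documents keyed by docid.
docs = {
    53: "SQLite is a software library",
    54: "All SQLite source code is public",
}
index = build_inverted_index(docs)

print(sorted(index["sqlite"]))  # [53, 54]
print(index["is"])              # {53: [1], 54: [4]}
```

Looking up a term is a single dictionary access, regardless of how many documents exist, which is the property that makes MATCH queries fast compared with scanning every row.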

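1.4. Simple FTS Queries

FTS tables can be queried efficiently in two ways:

  • Queries by rowid.
  • Full-text queries. If the WHERE clause contains a term of the form "<column> MATCH ?", the FTS table can use its built-in full-text index.

A query that fits neither form falls back to a linear scan of the entire table, which is very slow on large tables.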

-- The examples in this block assume the following FTS table:
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

SELECT * FROM mail WHERE rowid = 15;                -- Fast. Rowid lookup.
SELECT * FROM mail WHERE body MATCH 'sqlite';       -- Fast. Full-text query.
SELECT * FROM mail WHERE mail MATCH 'search';       -- Fast. Full-text query.
SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20;   -- Fast. Rowid lookup.
SELECT * FROM mail WHERE subject = 'database';      -- Slow. Linear scan.
SELECT * FROM mail WHERE subject MATCH 'database';  -- Fast. Full-text query.

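More complex full-text queries are also supported, including phrase searches and term-prefix searches. The example below shows how the column named on the left of MATCH determines which columns are searched: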

-- Example schema
CREATE VIRTUAL TABLE mail USING fts3(subject, body);

-- Example table population
INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order',  'was a software problem');

-- Example queries
SELECT * FROM mail WHERE subject MATCH 'software';    -- Selects rows 1 and 2
SELECT * FROM mail WHERE body    MATCH 'feedback';    -- Selects row 2
SELECT * FROM mail WHERE mail    MATCH 'software';    -- Selects rows 1, 2 and 3
SELECT * FROM mail WHERE mail    MATCH 'slow';        -- Selects rows 1 and 3

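1.5. Summary

From the user's point of view, an FTS table behaves much like an ordinary SQLite table: data is added, modified, removed, and queried with INSERT, UPDATE, DELETE, and SELECT. The main differences are:

1. As with all virtual table types, it is not possible to create indexes or triggers on an FTS table. Nor is it possible to use ALTER TABLE to add columns (although ALTER TABLE can be used to rename the table).
2. Data types are completely ignored. All inserted values are converted to TEXT before being stored.
3. FTS tables allow the special alias "docid" to be used to refer to the rowid column.
4. The FTS MATCH operator supports full-text search.
5. FTS provides the auxiliary functions snippet(), offsets(), and matchinfo().
6. Every FTS table has a hidden column with the same name as the table itself. It is used only as the left-hand operand of MATCH.

2. Compiling and Enabling FTS3 and FTS4

Although the SQLite source code includes FTS3 and FTS4, they are not enabled by default. To enable them, define the preprocessor macro SQLITE_ENABLE_FTS3 when compiling. New applications should also define the SQLITE_ENABLE_FTS3_PARENTHESIS macro to enable the enhanced query syntax. This is usually done by adding the following two switches to the compiler command line: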

-DSQLITE_ENABLE_FTS3
-DSQLITE_ENABLE_FTS3_PARENTHESIS

Enabling FTS3 also makes FTS4 available.
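Whether a particular SQLite build has the modules enabled can be checked at runtime by trying to create a throwaway FTS table. The sketch below does this from Python; fts_available and the table name probe are names chosen here for illustration.

```python
import sqlite3


def fts_available(module="fts4"):
    """Return True if a virtual table using `module` can be created in this SQLite build."""
    conn = sqlite3.connect(":memory:")
    try:
        # `module` is interpolated into the SQL text, so pass only trusted names.
        conn.execute(f"CREATE VIRTUAL TABLE probe USING {module}(content)")
        return True
    except sqlite3.OperationalError:
        # SQLite reports an unknown module as an OperationalError ("no such module: ...").
        return False
    finally:
        conn.close()


fts3_ok = fts_available("fts3")
fts4_ok = fts_available("fts4")
print(fts3_ok, fts4_ok)
```

The output depends on how the underlying SQLite library was compiled, so no expected value is shown. Running "PRAGMA compile_options" on a connection is another way to inspect the build flags.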

3. Full-Text Index Queries

FTS tables support three basic query types:

  • Token or token prefix queries. Append "*" to a term to turn it into a term-prefix query.
-- Virtual table declaration
CREATE VIRTUAL TABLE docs USING fts3(title, body);

-- Query for all documents containing the term "linux":
SELECT * FROM docs WHERE docs MATCH 'linux';

-- Query for all documents containing a term with the prefix "lin". This will match
-- all documents that contain "linux", but also those that contain terms "linear",
--"linker", "linguistic" and so on.
SELECT * FROM docs WHERE docs MATCH 'lin*';

A term can be restricted to a specific column by prefixing it with the column name and a colon:

-- Query the database for documents for which the term "linux" appears in
-- the document title, and the term "problems" appears in either the title
-- or body of the document.
SELECT * FROM docs WHERE docs MATCH 'title:linux problems';

-- Query the database for documents for which the term "linux" appears in
-- the document title, and the term "driver" appears in the body of the document
-- ("driver" may also appear in the title, but this alone will not satisfy the
-- query criteria).
SELECT * FROM docs WHERE body MATCH 'title:linux driver';

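On FTS4 tables, a term may also be prefixed with "^", in which case it matches only when it is the first token of a column: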

-- All documents for which "linux" is the first token of at least one
-- column.
SELECT * FROM docs WHERE docs MATCH '^linux';

-- All documents for which the first token in column "title" begins with "lin".
SELECT * FROM docs WHERE body MATCH 'title: ^lin*';
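  • Phrase queries. A phrase query retrieves all documents that contain the given sequence of terms or term prefixes in the specified order, with no intervening tokens. The phrase is written inside double quotes.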
-- Query for all documents that contain the phrase "linux applications".
SELECT * FROM docs WHERE docs MATCH '"linux applications"';

-- Query for all documents that contain a phrase that matches "lin* app*". As well as
-- "linux applications", this will match common phrases such as "linoleum appliances"
-- or "link apprentice".
SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
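  • NEAR queries. A NEAR query returns documents in which the given terms or phrases are separated by no more than a set number of intervening terms. The default limit is 10; writing NEAR/N sets it to N.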
-- Virtual table declaration.
CREATE VIRTUAL TABLE docs USING fts4();

-- Virtual table data.
INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');

-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 10 intervening terms. This matches the only document in
-- table docs (since there are only six terms between "SQLite" and "database" 
-- in the document).
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';

-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 6 intervening terms. This also matches the only document in
-- table docs. Note that the order in which the terms appear in the document
-- does not have to be the same as the order in which they appear in the query.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';

-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 5 intervening terms. This query matches no documents.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';

-- Search for a document that contains the phrase "ACID compliant" and the term
-- "database" with not more than 2 terms separating the two. This matches the
-- document stored in table docs.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';

-- Search for a document that contains the phrase "ACID compliant" and the term
-- "sqlite" with not more than 2 terms separating the two. This also matches
-- the only document stored in table docs.
SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
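The three query types can be tried side by side from Python. This sketch assumes the sqlite3 module was built with FTS4 enabled; the rows and the search helper are invented for illustration and are not part of the original examples.

```python
import sqlite3

# Assumption: FTS4 is available in this build of SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(title, body)")
conn.executemany(
    "INSERT INTO docs(docid, title, body) VALUES (?, ?, ?)",
    [
        (1, "linux applications", "a guide to desktop software"),
        (2, "linear algebra", "notes on matrix applications"),
        (3, "storage engines",
         "SQLite is an ACID compliant embedded relational database"),
    ],
)


def search(query):
    """Return the docids of rows whose columns match the full-text query."""
    rows = conn.execute(
        "SELECT docid FROM docs WHERE docs MATCH ? ORDER BY docid", (query,)
    )
    return [row[0] for row in rows]


token_hits = search("linux")                  # token query
prefix_hits = search("lin*")                  # token prefix query
phrase_hits = search('"linux applications"')  # phrase query
near_hits = search("sqlite NEAR/6 database")  # 6 intervening terms allowed
near_miss = search("sqlite NEAR/5 database")  # only 5 allowed

print(token_hits)   # [1]
print(prefix_hits)  # [1, 2]
print(phrase_hits)  # [1]
print(near_hits)    # [3]
print(near_miss)    # []
```

Row 2 matches the prefix query through "linear" but not the phrase query, because "linear algebra" and "matrix applications" never place "linux" directly before "applications".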

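NOTE: It is worth thinking about how NEAR might work internally. Take this query as an example: SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
The inverted index can cheaply find the documents that contain both "database" and "sqlite". The positions of those terms within each candidate document are then checked, and documents that violate the distance limit are filtered out.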
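The position check described in this note can be sketched in a few lines of Python. This is a guess at the general idea, not the actual FTS implementation: positions and near are hypothetical helpers, and the tokenizer is a plain whitespace split.

```python
def positions(text, term):
    """Return the token positions at which `term` occurs in `text`."""
    return [i for i, token in enumerate(text.lower().split()) if token == term]


def near(text, term_a, term_b, max_gap=10):
    """True if some pair of occurrences has at most `max_gap` tokens between them."""
    pos_a = positions(text, term_a)
    pos_b = positions(text, term_b)
    # abs(a - b) - 1 is the number of tokens strictly between the two occurrences.
    return any(
        abs(a - b) - 1 <= max_gap
        for a in pos_a
        for b in pos_b
        if a != b
    )


doc = "SQLite is an ACID compliant embedded relational database management system"

print(near(doc, "database", "sqlite", 6))  # True
print(near(doc, "database", "sqlite", 5))  # False
```

In a real engine the positions would come from the inverted index rather than from re-tokenizing the document, so only candidates that already contain both terms need to be examined.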

Several NEAR operators can be chained in a single query, as the following examples show:

-- The following query selects documents that contains an instance of the term 
-- "sqlite" separated by two or fewer terms from an instance of the term "acid",
-- which is in turn separated by two or fewer terms from an instance of the term
-- "relational".
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';

-- This query matches no documents. There is an instance of the term "sqlite" with
-- sufficient proximity to an instance of "acid" but it is not sufficiently close
-- to an instance of the term "relational".
SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';

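Queries can also be combined with set operators:

  • AND computes the intersection of two document sets.
  • OR computes the union of two document sets.
  • NOT computes the relative complement (documents in the first set that are not in the second).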
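These three operators correspond directly to set operations on the docid lists that the index returns for each term. The sketch below illustrates that with Python sets; the docids are made up, and this is a conceptual model rather than the way FTS evaluates queries internally.

```python
# Hypothetical docid sets returned by the index for two terms.
sqlite_docs = {1, 2, 3}
database_docs = {2, 3, 4}

and_result = sqlite_docs & database_docs  # intersection: both terms present
or_result = sqlite_docs | database_docs   # union: either term present
not_result = sqlite_docs - database_docs  # relative complement: first term only

print(sorted(and_result))  # [2, 3]
print(sorted(or_result))   # [1, 2, 3, 4]
print(sorted(not_result))  # [1]
```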

The official documentation contains many more examples; refer to it when you actually need them.
