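FTS3 and FTS4 are SQLite virtual table modules that let users perform full-text searches over a set of documents. The user enters a term, or a series of terms, and the system finds the set of documents that best matches those terms. This article describes the deployment and usage of FTS3 and FTS4.
FTS1 and FTS2 are obsolete full-text search modules with known issues. Portions of the FTS3 code were contributed to the SQLite project, and it is now developed and maintained as part of SQLite.
The FTS3 and FTS4 extension modules let users create special tables with a built-in full-text index (FTS tables). The full-text index lets users efficiently query the database for all rows that contain one or more words (tokens), even if the table contains many documents.
For example, suppose each of 517430 documents is inserted into both an FTS table and an ordinary SQLite table created with the following SQL script: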
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT); /* Ordinary table */
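Then either of the following queries can be run to count the documents in the database that contain the word "linux". On a desktop PC hardware configuration, the query on the FTS3 table returns in about 0.03 seconds, while the query on the ordinary table takes 22.5 seconds.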
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
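Of course, the two queries above are not entirely equivalent. For example, the LIKE query also matches rows containing "linuxphobe" or "EnterpriseLinux", while the MATCH query on the FTS3 table selects only rows that contain "linux" as a discrete token. Both searches are case-insensitive. The FTS3 table consumes about 2000 MB on disk, compared with about 1453 MB for the ordinary table. On the same hardware used for the queries above, the FTS3 table took 31 minutes to populate, versus 25 minutes for the ordinary table.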
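The difference between MATCH and LIKE can be reproduced on a tiny data set. The following is a minimal sketch using Python's standard sqlite3 module; it assumes the underlying SQLite library was built with FTS3/FTS4 enabled, and the table names docs_fts and docs_plain and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS4 compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts4(content)")  # FTS table
conn.execute("CREATE TABLE docs_plain(content TEXT)")              # ordinary table

# Hypothetical sample documents, inserted into both tables.
samples = [
    "I run linux on my laptop",
    "a linuxphobe wrote this mail",
    "EnterpriseLinux is mentioned here",
    "LINUX in capital letters",
]
for text in samples:
    conn.execute("INSERT INTO docs_fts(content) VALUES (?)", (text,))
    conn.execute("INSERT INTO docs_plain(content) VALUES (?)", (text,))

# MATCH looks for the discrete token "linux"; LIKE looks for the substring.
match_count = conn.execute(
    "SELECT count(*) FROM docs_fts WHERE content MATCH 'linux'"
).fetchone()[0]
like_count = conn.execute(
    "SELECT count(*) FROM docs_plain WHERE content LIKE '%linux%'"
).fetchone()[0]

print("MATCH rows:", match_count)
print("LIKE rows:", like_count)
```

With the default tokenizer, MATCH should select only the rows where "linux" stands alone as a word (in either letter case), while LIKE also picks up "linuxphobe" and "EnterpriseLinux".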
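FTS3 and FTS4 are nearly identical. They share most of their code, and their interfaces are the same. The difference is that FTS4 is an enhanced version of FTS3.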
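Which one should you use in your application? FTS4 is sometimes significantly faster than FTS3.
For newer applications FTS4 is recommended; if compatibility with older versions of SQLite matters, FTS3 works well too.
Like other virtual table types, an FTS table is created with a CREATE VIRTUAL TABLE statement.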
-- Create an FTS table named "data" with one column - "content":
CREATE VIRTUAL TABLE data USING fts3();
-- Create an FTS table named "pages" with three columns:
CREATE VIRTUAL TABLE pages USING fts4(title, keywords, body);
-- Create an FTS table named "mail" with two columns. Datatypes
-- and column constraints are specified along with each column. These
-- are completely ignored by FTS and SQLite.
CREATE VIRTUAL TABLE mail USING fts3(
subject VARCHAR(256) NOT NULL,
body TEXT CHECK(length(body)<10240)
);
Dropping an FTS table works the same way as dropping an ordinary table:
-- Create, then immediately drop, an FTS4 table.
CREATE VIRTUAL TABLE data USING fts4();
DROP TABLE data;
FTS tables are populated using INSERT, UPDATE and DELETE statements.
-- Create an FTS table
CREATE VIRTUAL TABLE pages USING fts4(title, body);
-- Insert a row with a specific docid value.
INSERT INTO pages(docid, title, body) VALUES(53, 'Home Page', 'SQLite is a software...');
-- Insert a row and allow FTS to assign a docid value using the same algorithm as
-- SQLite uses for ordinary tables. In this case the new docid will be 54,
-- one greater than the largest docid currently present in the table.
INSERT INTO pages(title, body) VALUES('Download', 'All SQLite source code...');
-- Change the title of the row just inserted.
UPDATE pages SET title = 'Download SQLite' WHERE rowid = 54;
-- Delete the entire table contents.
DELETE FROM pages;
-- The following is an error. It is not possible to assign non-NULL values to both
-- the rowid and docid columns of an FTS table.
INSERT INTO pages(rowid, docid, title, body) VALUES(1, 2, 'A title', 'A document body');
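The docid behaviour described in the comments above can be checked from a script. This is a small sketch with Python's sqlite3 module, assuming an FTS4-enabled SQLite build; the variable names are illustrative only.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS4 compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts4(title, body)")

# First row gets an explicit docid, the second lets FTS pick one.
conn.execute(
    "INSERT INTO pages(docid, title, body) "
    "VALUES (53, 'Home Page', 'SQLite is a software...')"
)
conn.execute(
    "INSERT INTO pages(title, body) VALUES ('Download', 'All SQLite source code...')"
)

# docid and rowid are two names for the same value.
docids = [row[0] for row in conn.execute("SELECT docid FROM pages ORDER BY docid")]
rowids = [row[0] for row in conn.execute("SELECT rowid FROM pages ORDER BY rowid")]
print(docids, rowids)

# Supplying non-NULL values for both rowid and docid should be rejected.
both_rejected = False
try:
    conn.execute(
        "INSERT INTO pages(rowid, docid, title, body) "
        "VALUES (1, 2, 'A title', 'A document body')"
    )
except sqlite3.Error as exc:
    both_rejected = True
    print("rejected:", exc)
```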
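To support full-text queries, FTS maintains an inverted index.
Two kinds of query are efficient: full-text queries using the MATCH operator, and lookups by rowid.
A query that is neither of these can only be answered by a linear scan of the whole table, which is very slow.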
-- The examples in this block assume the following FTS table:
CREATE VIRTUAL TABLE mail USING fts3(subject, body);
SELECT * FROM mail WHERE rowid = 15; -- Fast. Rowid lookup.
SELECT * FROM mail WHERE body MATCH 'sqlite'; -- Fast. Full-text query.
SELECT * FROM mail WHERE mail MATCH 'search'; -- Fast. Full-text query.
SELECT * FROM mail WHERE rowid BETWEEN 15 AND 20; -- Fast. Rowid lookup.
SELECT * FROM mail WHERE subject = 'database'; -- Slow. Linear scan.
SELECT * FROM mail WHERE subject MATCH 'database'; -- Fast. Full-text query.
More complex queries are also supported, including phrase searches and term-prefix searches.
-- Example schema
CREATE VIRTUAL TABLE mail USING fts3(subject, body);
-- Example table population
INSERT INTO mail(docid, subject, body) VALUES(1, 'software feedback', 'found it too slow');
INSERT INTO mail(docid, subject, body) VALUES(2, 'software feedback', 'no feedback');
INSERT INTO mail(docid, subject, body) VALUES(3, 'slow lunch order', 'was a software problem');
-- Example queries
SELECT * FROM mail WHERE subject MATCH 'software'; -- Selects rows 1 and 2
SELECT * FROM mail WHERE body MATCH 'feedback'; -- Selects row 2
SELECT * FROM mail WHERE mail MATCH 'software'; -- Selects rows 1, 2 and 3
SELECT * FROM mail WHERE mail MATCH 'slow'; -- Selects rows 1 and 3
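From the user's point of view, an FTS table looks much like an ordinary table: it supports the usual insert, delete, update and select operations. The main differences are:
1. As with all virtual table types, it is not possible to create indexes or triggers on an FTS table, nor to add columns with ALTER TABLE (although renaming the table is supported).
2. Data types are completely ignored. All values are converted to TEXT before being stored.
3. "docid" is allowed as an alias for "rowid".
4. The FTS MATCH operator is supported for full-text queries.
5. The FTS auxiliary functions snippet(), offsets() and matchinfo() are available.
6. Every FTS table has a hidden column with the same name as the table itself. It is only useful as the left-hand operand of the MATCH operator.
Although the source code includes FTS3 and FTS4, they are not enabled by default. To enable them, define the SQLITE_ENABLE_FTS3 macro at compile time. New applications should also define the SQLITE_ENABLE_FTS3_PARENTHESIS macro to enable the enhanced query syntax. This is usually done by adding the following two switches to the compiler command line: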
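Several of these points (the docid alias, the hidden table-named column, and the auxiliary functions) can be seen together in one short query. The sketch below uses Python's sqlite3 module and assumes an FTS4-enabled SQLite build; the single-column mail table is a simplified, made-up variant of the examples above.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS4 compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE mail USING fts4(body)")
conn.execute("INSERT INTO mail(docid, body) VALUES (1, 'found it too slow')")

# "mail MATCH ..." uses the hidden column named after the table;
# snippet() and offsets() take that same hidden column as their argument.
docid, rowid, snippet_text, offsets_text = conn.execute(
    "SELECT docid, rowid, snippet(mail), offsets(mail) "
    "FROM mail WHERE mail MATCH 'slow'"
).fetchone()

print("docid / rowid:", docid, rowid)
print("snippet:", snippet_text)   # matched term wrapped in highlight markers
print("offsets:", offsets_text)   # column, term, byte offset and size of each hit
```

offsets() reports each hit as four integers: the column number, the query term number, the byte offset of the match and its size in bytes.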
-DSQLITE_ENABLE_FTS3
-DSQLITE_ENABLE_FTS3_PARENTHESIS
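Enabling FTS3 also makes FTS4 available.
FTS tables support three basic query types: token (or token-prefix) queries, phrase queries, and NEAR queries.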
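When SQLite is not compiled by hand, it can be useful to check whether the library at hand already has these modules. The following sketch probes for a module by trying to create a table with it in an in-memory database; fts_module_available is a hypothetical helper written for this note, not part of SQLite or of Python's sqlite3 module.

```python
import sqlite3


def fts_module_available(module_name):
    """Return True if a virtual table can be created with the given module.

    Hypothetical helper: module_name is interpolated into the SQL text,
    so only pass trusted, hard-coded names such as "fts3" or "fts4".
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE VIRTUAL TABLE probe USING {module_name}(content)")
        return True
    except sqlite3.OperationalError:
        # Raised, for example, as "no such module: ..." when the module is missing.
        return False
    finally:
        conn.close()


for name in ("fts3", "fts4"):
    print(name, "available:", fts_module_available(name))

# The compile-time options of the linked library can also be listed directly.
conn = sqlite3.connect(":memory:")
options = [row[0] for row in conn.execute("PRAGMA compile_options")]
print([opt for opt in options if "FTS" in opt])
conn.close()
```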
-- Virtual table declaration
CREATE VIRTUAL TABLE docs USING fts3(title, body);
-- Query for all documents containing the term "linux":
SELECT * FROM docs WHERE docs MATCH 'linux';
-- Query for all documents containing a term with the prefix "lin". This will match
-- all documents that contain "linux", but also those that contain terms "linear",
-- "linker", "linguistic" and so on.
SELECT * FROM docs WHERE docs MATCH 'lin*';
A term can be restricted to a specific column:
-- Query the database for documents for which the term "linux" appears in
-- the document title, and the term "problems" appears in either the title
-- or body of the document.
SELECT * FROM docs WHERE docs MATCH 'title:linux problems';
-- Query the database for documents for which the term "linux" appears in
-- the document title, and the term "driver" appears in the body of the document
-- ("driver" may also appear in the title, but this alone will not satisfy the
-- query criteria).
SELECT * FROM docs WHERE body MATCH 'title:linux driver';
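A term prefixed with "^" matches only where it is the first token of a column (the original documentation describes this for FTS4 tables):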
-- All documents for which "linux" is the first token of at least one
-- column.
SELECT * FROM docs WHERE docs MATCH '^linux';
-- All documents for which the first token in column "title" begins with "lin".
SELECT * FROM docs WHERE body MATCH 'title: ^lin*';
-- Query for all documents that contain the phrase "linux applications".
SELECT * FROM docs WHERE docs MATCH '"linux applications"';
-- Query for all documents that contain a phrase that matches "lin* app*". As well as
-- "linux applications", this will match common phrases such as "linoleum appliances"
-- or "link apprentice".
SELECT * FROM docs WHERE docs MATCH '"lin* app*"';
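The token, prefix, column-filter and phrase queries above can be tried against a few rows to see exactly which documents each one selects. This is a sketch with Python's sqlite3 module, assuming an FTS4-enabled SQLite build; the sample rows and the matching_docids helper are invented for illustration.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS4 compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(title, body)")
conn.executemany(
    "INSERT INTO docs(docid, title, body) VALUES (?, ?, ?)",
    [
        (1, "linux kernel", "driver problems"),
        (2, "windows", "linux problems"),
        (3, "linux problems", "nothing to report"),
        (4, "linker errors", "static linking"),
    ],
)


def matching_docids(left_operand, query):
    """Hypothetical helper: run 'left_operand MATCH query' and return sorted docids.

    left_operand is a trusted column or table name ("docs", "title", "body").
    """
    sql = f"SELECT docid FROM docs WHERE {left_operand} MATCH ? ORDER BY docid"
    return [row[0] for row in conn.execute(sql, (query,))]


token_hits = matching_docids("docs", "linux")                  # plain token
prefix_hits = matching_docids("docs", "lin*")                  # token prefix
column_hits = matching_docids("docs", "title:linux problems")  # column filter
body_hits = matching_docids("body", "title:linux driver")      # filter overrides left operand
phrase_hits = matching_docids("docs", '"linux problems"')      # phrase query

print(token_hits, prefix_hits, column_hits, body_hits, phrase_hits)
```

Note that the phrase query does not select row 1: "linux" and "problems" appear there in different columns, and a phrase has to match within a single column.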
-- Virtual table declaration.
CREATE VIRTUAL TABLE docs USING fts4();
-- Virtual table data.
INSERT INTO docs VALUES('SQLite is an ACID compliant embedded relational database management system');
-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 10 intervening terms. This matches the only document in
-- table docs (since there are only six terms between "SQLite" and "database"
-- in the document).
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR database';
-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 6 intervening terms. This also matches the only document in
-- table docs. Note that the order in which the terms appear in the document
-- does not have to be the same as the order in which they appear in the query.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
-- Search for a document that contains the terms "sqlite" and "database" with
-- not more than 5 intervening terms. This query matches no documents.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/5 sqlite';
-- Search for a document that contains the phrase "ACID compliant" and the term
-- "database" with not more than 2 terms separating the two. This matches the
-- document stored in table docs.
SELECT * FROM docs WHERE docs MATCH 'database NEAR/2 "ACID compliant"';
-- Search for a document that contains the phrase "ACID compliant" and the term
-- "sqlite" with not more than 2 terms separating the two. This also matches
-- the only document stored in table docs.
SELECT * FROM docs WHERE docs MATCH '"ACID compliant" NEAR/2 sqlite';
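NOTE: Think about how search works: how might NEAR be implemented? Take this query as an example: SELECT * FROM docs WHERE docs MATCH 'database NEAR/6 sqlite';
An inverted index can easily find the documents that contain both "database" and "sqlite"; after that it is enough to look at the positions of those terms within each document and filter out the documents that do not satisfy the distance limit.
Several NEAR operators can also be chained; see the examples: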
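The idea in this note can be sketched in a few lines of plain Python. This is only an illustration of the reasoning above (a positional inverted index plus a distance filter), not how FTS3 is actually implemented; build_index and near are hypothetical names, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to {docid: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for docid, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][docid].append(position)
    return index


def near(index, term_a, term_b, max_gap=10):
    """Return docids where term_a and term_b have at most max_gap terms between them."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = set()
    # Step 1: only documents containing both terms are candidates.
    for docid in postings_a.keys() & postings_b.keys():
        # Step 2: keep the document if any pair of positions is close enough.
        # The order of the two terms in the document does not matter.
        if any(
            abs(pos_a - pos_b) - 1 <= max_gap
            for pos_a in postings_a[docid]
            for pos_b in postings_b[docid]
        ):
            hits.add(docid)
    return hits


documents = {
    1: "SQLite is an ACID compliant embedded relational database management system",
}
index = build_index(documents)

print(near(index, "database", "sqlite", 6))  # six intervening terms are allowed
print(near(index, "database", "sqlite", 5))  # five are not enough
```

In the sample document six terms separate "SQLite" from "database", so the first call finds document 1 and the second finds nothing, mirroring the NEAR/6 and NEAR/5 queries above.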
-- The following query selects documents that contains an instance of the term
-- "sqlite" separated by two or fewer terms from an instance of the term "acid",
-- which is in turn separated by two or fewer terms from an instance of the term
-- "relational".
SELECT * FROM docs WHERE docs MATCH 'sqlite NEAR/2 acid NEAR/2 relational';
-- This query matches no documents. There is an instance of the term "sqlite" with
-- sufficient proximity to an instance of "acid" but it is not sufficiently close
-- to an instance of the term "relational".
SELECT * FROM docs WHERE docs MATCH 'acid NEAR/2 sqlite NEAR/2 relational';
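The NEAR examples can also be run directly to confirm which queries match the single sample document. A small sketch with Python's sqlite3 module, assuming an FTS4-enabled SQLite build; count_matches is a hypothetical helper.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 module has FTS4 compiled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts4()")  # single column named "content"
conn.execute(
    "INSERT INTO docs(content) VALUES (?)",
    ("SQLite is an ACID compliant embedded relational database management system",),
)


def count_matches(query):
    """Hypothetical helper: number of rows in docs that satisfy the MATCH query."""
    return conn.execute(
        "SELECT count(*) FROM docs WHERE docs MATCH ?", (query,)
    ).fetchone()[0]


results = {
    query: count_matches(query)
    for query in (
        "sqlite NEAR database",                  # default limit of 10 terms
        "database NEAR/6 sqlite",
        "database NEAR/5 sqlite",
        "sqlite NEAR/2 acid NEAR/2 relational",
        "acid NEAR/2 sqlite NEAR/2 relational",
    )
}
for query, count in results.items():
    print(f"{query!r}: {count} row(s)")
```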
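There are a few more operators as well.
The original documentation has many more examples; come back to them when you actually need them.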