liangbinny

Research on Oracle Full-Text Search (Complete)

http://chaoji-liangbin.blog.163.com/blog/static/252392122010915101351354/

Reference (Baidu Wenku document):

http://wenku.baidu.com/view/c53e9e36a32d7375a417801a.html

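1. Preparation

1.1 Check and configure database roles

First check whether the database has the CTXSYS user and the CTXAPP role. If they are missing, the database was created without the interMedia feature (now Oracle Text), and you must modify the database to install it. In a default installation the ctxsys user is locked, so it has to be enabled first.

Because the ctxsys account is locked by default and its password is expired, log in to Enterprise Manager (EM) as sys, then change the account status and password of ctxsys.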
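The same check and unlock can be sketched in SQL*Plus instead of EM. This is a minimal sketch, assuming a session connected as sys with SYSDBA privileges; the password shown is a placeholder, not a value from the original post.

```sql
-- Does the CTXSYS user exist, and is it locked or expired?
SELECT username, account_status FROM dba_users WHERE username = 'CTXSYS';

-- Does the CTXAPP role exist?
SELECT role FROM dba_roles WHERE role = 'CTXAPP';

-- Unlock the account and set a new password (placeholder value; choose your own)
ALTER USER ctxsys IDENTIFIED BY "Change_me_1" ACCOUNT UNLOCK;
```

If the first two queries return no rows, Oracle Text is not installed and unlocking the account is not enough.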

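1.2 Grant privileges

The examples use a previously created test user, foo, and that user's table T_DOCNEWS.

First log in as sys with DBA privileges and grant resource and connect to foo: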

GRANT resource, connect to foo;

Then log in as ctxsys and grant the Oracle Text privileges to foo:

GRANT ctxapp TO foo;

GRANT execute ON ctxsys.ctx_cls TO foo;

GRANT execute ON ctxsys.ctx_ddl TO foo;

GRANT execute ON ctxsys.ctx_doc TO foo;

GRANT execute ON ctxsys.ctx_output TO foo;

GRANT execute ON ctxsys.ctx_query TO foo;

GRANT execute ON ctxsys.ctx_report TO foo;

GRANT execute ON ctxsys.ctx_thes TO foo;

GRANT execute ON ctxsys.ctx_ulexer TO foo;

View the system's default Oracle Text preferences:

Select pre_name, pre_object from ctx_preferences;

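2. How Oracle Text indexes work

An Oracle Text index converts the text into tokens. For example, www.taobao.com is split into the tokens www, taobao and com.

Oracle 10g supports four index types: context, ctxcat, ctxrule and ctxxpath.

2.1 Context index

A context index converts every word into a token and is organized as an inverted index: each token maps to the documents that contain it. The word dog, for example, might have the following entry:

dog -> doc1, doc3, doc5

This means that dog appears in documents doc1, doc3 and doc5. Once the index is built, Oracle automatically creates the auxiliary objects DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R and DR$MYINDEX$X next to the base table MYTABLE (assuming the table is mytable and the index is myindex). A context index is not synchronized automatically after DML; you have to synchronize it manually with ctx_ddl.sync_index.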

Example:

Create table docs (id number primary key, text varchar2(200));

Insert into docs values(1, 'california is a state in the us.');

Insert into docs values(2, 'paris is a city in france.');

Insert into docs values(3, 'france is in europe.');

Commit;

-- Create the context index

Create index idx_docs on docs(text)

indextype is ctxsys.context parameters

('filter ctxsys.null_filter section group ctxsys.html_section_group');

-- Query (show the text column 40 characters wide)

Column text format a40;

Select id, text from docs where contains(text, 'france') > 0;

id text

---------- -------------------------------

3 france is in europe.

2 paris is a city in france.

-- Insert more data

Insert into docs values(4, 'los angeles is a city in california.');

Insert into docs values(5, 'mexico city is big.');

commit;

Select id, text from docs where contains(text, 'city') > 0; -- the newly inserted rows are not found

id text

--------------------------------------------

2 paris is a city in france.

-- Synchronize the index

begin

ctx_ddl.sync_index('idx_docs', '2m'); -- use 2 MB of memory for the synchronization

end;

-- Query

Column text format a50;

Select id, text from docs where contains(text, 'city') > 0; -- the rows are now found

id text

-----------------------------------------------

5 mexico city is big.

4 los angeles is a city in california.

2 paris is a city in france.

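-- OR operator

Select id, text from docs where contains(text, 'city or state ') > 0;

-- AND operator

Select id, text from docs where contains(text, 'city and state ') > 0;

-- Note: without an operator, 'city state' is searched as a phrase (the two words next to each other), so it is not equivalent to AND

Select id, text from docs where contains(text, 'city state ') > 0;

-- SCORE returns the relevance score; a higher score means a more relevant match

SELECT SCORE(1), id, text FROM docs WHERE CONTAINS(text, 'oracle', 1) > 0;

To summarize: a context index is not synchronized automatically, so it must be synchronized manually after DML. The query operator that goes with a context index is contains.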
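To see what the inverted index holds and which rows still wait for synchronization, the following sketch may help. It assumes the idx_docs index from the example above; DR$IDX_DOCS$I is the token table Oracle creates for that index, and CTX_USER_PENDING is the view that lists rows not yet synchronized. The last statements show an alternative, available from Oracle 10g, that asks Oracle to synchronize at commit time instead of calling sync_index by hand.

```sql
-- Tokens stored in the inverted index, with the document count kept in each row
SELECT token_text, token_count
FROM   dr$idx_docs$i
ORDER  BY token_text;

-- Rows changed by DML that have not been synchronized yet
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
FROM   ctx_user_pending;

-- Alternative: recreate the index so that it is synchronized on commit
DROP INDEX idx_docs;

CREATE INDEX idx_docs ON docs(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('sync (on commit)');
```

Synchronizing on every commit trades index fragmentation and commit latency for freshness, so whether it suits a workload is a judgment call.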

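2.2 Ctxcat index

Used for mixed queries across several columns.

With ctxcat you can use an index set to group the structured columns that are frequently queried together with the text. For example, when searching for a product name you may also need to filter on production date, price or description; you can add those columns to the index set. Oracle wraps these queries in the catsearch operator, which makes such mixed full-text queries more efficient. For transactional applications with near-real-time requirements, the fact that a context index is not synchronized automatically is a problem; a ctxcat index is synchronized automatically.

Example: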

Create table auction(Item_id number,Title varchar2(100),Category_id number,Price number,Bid_close date);

Insert into auction values(1, 'nikon camera', 1, 400, '24-oct-2002');

Insert into auction values(2, 'olympus camera', 1, 300, '25-oct-2002');

Insert into auction values(3, 'pentax camera', 1, 200, '26-oct-2002');

Insert into auction values(4, 'canon camera', 1, 250, '27-oct-2002');

Commit;

-- Decide on your query criteria first (important)

-- Determine that all queries search the title column for item descriptions

-- Create the index set

begin

ctx_ddl.create_index_set('auction_iset');

ctx_ddl.add_index('auction_iset','price'); /* sub-index a*/

end;

-- Create the index

Create index auction_titlex on auction(title) indextype is ctxsys.ctxcat

parameters ('index set auction_iset');

Column title format a40;

Select title, price from auction where catsearch(title, 'camera', 'order by price')> 0;

Title price

--------------- ----------

Pentax camera 200

Canon camera 250

Olympus camera 300

Nikon camera 400

Insert into auction values(5, 'aigo camera', 1, 10, '27-oct-2002');

Insert into auction values(6, 'len camera', 1, 23, '27-oct-2002');

commit;

-- Check whether the index was synchronized automatically

Select title, price from auction where catsearch(title, 'camera',

'price <= 100')>0;

Title price

--------------- ----------

aigo camera 10

len camera 23

Add several sub-indexes to the index set:

begin

ctx_ddl.drop_index_set('auction_iset');

ctx_ddl.create_index_set('auction_iset');

ctx_ddl.add_index('auction_iset','price'); /* sub-index A */

ctx_ddl.add_index('auction_iset','price, bid_close'); /* sub-index B */

end;

drop index auction_titlex;

Create index auction_titlex on auction(title) indextype is ctxsys.ctxcat

parameters ('index set auction_iset');

SELECT * FROM auction WHERE CATSEARCH(title, 'camera','price = 200 order by bid_close')>0;

SELECT * FROM auction WHERE CATSEARCH(title, 'camera','order by price, bid_close')>0;

After any DML, a ctxcat index is synchronized automatically; no manual step is needed. The query operator that goes with a ctxcat index is catsearch.

Syntax:

Catsearch(

[schema.]column,

Text_query varchar2,

Structured_query varchar2)

Return number;

Examples:

catsearch(text, 'dog', 'foo > 15')

catsearch(text, 'dog', 'bar = ''SMITH''')

catsearch(text, 'dog', 'foo between 1 and 15')

catsearch(text, 'dog', 'foo = 1 and abc = 123')

2.3 Ctxrule index

The function of a classification application is to perform some action based on document content.

These actions can include assigning a category id to a document or sending the document to a user.

The result is classification of a document.

Example:

Create table queries (query_id number,query_string varchar2(80));

insert into queries values (1, 'oracle');

insert into queries values (2, 'larry or ellison');

insert into queries values (3, 'oracle and text');

insert into queries values (4, 'market share');

commit;

Create index queryx on queries(query_string) indextype is ctxsys.ctxrule;

Column query_string format a35;

Select query_id,query_string from queries

where matches(query_string,

'oracle announced that its market share in databases

increased over the last year.')>0;

query_id query_string

---------- -----------------------------------

1 oracle

4 market share

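That is, the index is built on the stored queries, and matches returns the queries that match a given piece of text.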
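As a rough sketch of document classification with this index, the loop below runs every row of the docs table from section 2.1 through the rules stored in queries. It assumes both tables exist in the current schema and that server output is enabled in the client; the message text is made up for illustration.

```sql
-- In SQL*Plus, run SET SERVEROUTPUT ON first to see the messages
BEGIN
  FOR d IN (SELECT id, text FROM docs) LOOP
    -- MATCHES returns the stored queries that the document text satisfies
    FOR q IN (SELECT query_id
              FROM   queries
              WHERE  MATCHES(query_string, d.text) > 0) LOOP
      DBMS_OUTPUT.PUT_LINE('document ' || d.id || ' matches rule ' || q.query_id);
    END LOOP;
  END LOOP;
END;
/
```

In a real application the inner loop would assign a category or route the document rather than print a line.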

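2.4 Ctxxpath index

Create this index when you need to speed up existsNode() queries on an XMLType column.

3. Internal indexing pipeline

3.1 Datastore preference

The datastore stage retrieves the data from wherever it is stored (for example web pages, database large objects, or the local file system) and passes it as a data stream to the next stage. The datastore types are Direct datastore, Multi_column_datastore, Detail_datastore, File_datastore, Url_datastore, User_datastore and Nested_datastore.

3.1.1 Direct datastore

Indexes data stored in the database, one column per index. It has no attributes.

Supported column types: char, varchar, varchar2, blob, clob, bfile, or xmltype.

Example: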

Create table mytable(id number primary key, docs clob);

Insert into mytable values(111555,'this text will be indexed');

Insert into mytable values(111556,'this is a direct_datastore example');

Commit;

-- Create the index with the direct datastore

Create index myindex on mytable(docs)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore');

Select * from mytable where contains(docs, 'text') > 0;

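3.1.2 Multi_column_datastore

Use when the text to index is spread across several columns.

The column list is limited to 500 bytes.

NUMBER and DATE columns are supported; they are converted to text before indexing.

RAW and BLOB columns are directly concatenated as binary data.

Not supported: long, long raw, nchar, nclob and nested table columns.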

Create table mytable1(id number primary key, doc1 varchar2(400),doc2 clob,doc3

clob);

Insert into mytable1 values(1,'this text will be indexed','following example creates a multi-column ','denotes that the bar column ');

Insert into mytable1 values(2,'this is a direct_datastore example','use this datastore when your text is stored in more than one column','the system concatenates the text columns');

Commit;

-- Create the multi_column_datastore preference

Begin

Ctx_ddl.create_preference('my_multi', 'multi_column_datastore');

Ctx_ddl.set_attribute('my_multi', 'columns', 'doc1, doc2, doc3');

End;

-- Create the index

Create index idx_mytable on mytable1(doc1)indextype is ctxsys.context

parameters('datastore my_multi');

Select * from mytable1 where contains(doc1,'direct datastore')>0;

Select * from mytable1 where contains(doc1,'example creates')>0;

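Note: for English text the search term has to be a meaningful word. For example,

Select * from mytable1 where contains(doc1,' more than one column ')>0;

returns the second row, but searching for more on its own returns nothing, because more is not a meaningful word in that sentence (probably an effect of the stoplist, which keeps very common words out of the index).

-- Update only a column other than the indexed one and see whether the change can be found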

Update mytable1 set doc2='adladlhadad this datastore when your text is stored test' where

id=2;

Begin

Ctx_ddl.sync_index('idx_mytable');

End;

Select * from mytable1 where contains(doc1,'adladlhadad')>0; -- no rows

Update mytable1 set doc1='this is a direct_datastore example' where id=2; -- update the indexed column

Begin

Ctx_ddl.sync_index('idx_mytable'); -- synchronize the index

End;

Select * from mytable1 where contains(doc1,'adladlhadad')>0; -- the change to doc2 is now found

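A multi-column full-text index can be created on any one of the columns, but the column named in the query must be the column the index was created on. Oracle only considers the indexed data changed when that indexed column is modified; if you change only the other columns, synchronizing the index does not pick up the change.

In other words, synchronization only takes effect after the indexed column has been updated. When you change one of the other columns, also update the indexed column (writing its current value again is enough, as in the example above) and synchronize once; the change then becomes searchable.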

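3.1.3 Detail_datastore

Use for master-detail tables (from the Oracle documentation: "Use the DETAIL_DATASTORE type for text stored directly in the database in detail tables, with the indexed text column located in the master table").

Because the text that is actually indexed lives in the detail table, it does not matter which master-table column the index is created on; once chosen, however, queries must name that column.

The content of the indexed master-table column itself is not included in the index.

DETAIL_DATASTORE attributes used in the example: binary, detail_table, detail_key, detail_lineno and detail_text.

Example: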

create table my_master -- master table

(article_id number primary key,author varchar2(30),title varchar2(50),body varchar2(1));

create table my_detail -- detail table

(article_id number, seq number, text varchar2(4000),

constraint fr_id foreign key (ARTICLE_ID) references my_master (ARTICLE_ID));

-- Sample data

insert into my_master values(1,'Tom','expert on and on',1);

insert into my_master values(2,'Tom','Expert Oracle Database Architecture',2);

commit;

insert into my_detail values(1,1,'Oracle will find the undo information for this transaction

either in the cached

undo segment blocks (most likely) or on disk ');

insert into my_detail values(1,2,'if they have been flushed (more likely for very large

transactions).');

insert into my_detail values(1,3,'LGWR is writing to a different device, then there is no

contention for

redo logs');

insert into my_detail values(2,1,'Many other databases treat the log files as');

insert into my_detail values(2,2,'For those systems, the act of rolling back can be

disastrous');

commit;

-- Create the detail_datastore preference

begin

ctx_ddl.create_preference('my_detail_pref', 'DETAIL_DATASTORE');

ctx_ddl.set_attribute('my_detail_pref', 'binary', 'true');

ctx_ddl.set_attribute('my_detail_pref', 'detail_table', 'my_detail');

ctx_ddl.set_attribute('my_detail_pref', 'detail_key', 'article_id');

ctx_ddl.set_attribute('my_detail_pref', 'detail_lineno', 'seq');

ctx_ddl.set_attribute('my_detail_pref', 'detail_text', 'text');

end;

-- Create the index

CREATE INDEX myindex123 on my_master(body) indextype is ctxsys.context

parameters('datastore my_detail_pref');

select * from my_master where contains(body,'databases')>0;

-- Update only the detail table and see whether the change can be found

update my_detail set text='undo is generated as a result of the DELETE, blocks are modified,

and redo is sent over to

the redo log buffer' where article_id=2 and seq=1;

begin

ctx_ddl.sync_index('myindex123','2m'); -- synchronize the index

end;

select * from my_master where contains(body,'result of the DELETE')>0; -- the update is not found

-- After updating the detail table, update the master table as well

update my_master set body=3 where body=2;

begin

ctx_ddl.sync_index('myindex123','2m');

end;

select * from my_master where contains(body,'result of the DELETE')>0; -- the data is found

If you update the indexed column in the detail table, you must also update the indexed column in the master table so that Oracle notices that the indexed data has changed (this can be done with a trigger).
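One way such a trigger might look is sketched below. The trigger name is hypothetical, and the sketch assumes that rewriting body with its current value is enough for Oracle to queue the master row for re-indexing, which is what the manual test above suggests.

```sql
-- Hypothetical trigger: whenever a detail row is inserted or its text changes,
-- touch the indexed master column so the master row is queued for re-indexing.
CREATE OR REPLACE TRIGGER trg_my_detail_touch
AFTER INSERT OR UPDATE OF text ON my_detail
FOR EACH ROW
BEGIN
  UPDATE my_master
  SET    body = body          -- same value; only marks the indexed column as updated
  WHERE  article_id = :NEW.article_id;
END;
/

-- The change becomes searchable after the next synchronization
BEGIN
  ctx_ddl.sync_index('myindex123', '2m');
END;
/
```

Deletes of detail rows are not covered here; a fuller version would handle them with :OLD.article_id.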

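3.1.4 File_datastore

Use for indexing files on the database server's local file system (from the Oracle documentation: "The FILE_DATASTORE type is used for text stored in files accessed through the local file system.").

Multiple paths: on Unix separate them with colons, for example path1:path2:pathn; on Windows separate them with semicolons.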

create table mytable3(id number primary key, docs varchar2(2000));

insert into mytable3 values(111555,'1.txt');

insert into mytable3 values(111556,'1.doc');

commit;

-- Create the file_datastore preference

begin

ctx_ddl.create_preference('COMMON_DIR2','FILE_DATASTORE');

ctx_ddl.set_attribute('COMMON_DIR2','PATH','D:/search');

end;

-- Create the index

create index myindex3 on mytable3(docs) indextype is ctxsys.context parameters ('datastore COMMON_DIR2');

select * from mytable3 where contains(docs,'word')>0; -- query

-- So far tested with .doc and .txt files

3.1.5 Url_datastore

Use for indexing content on the internet; the database only needs to store the corresponding URLs.

Example:

create table urls(id number primary key, docs varchar2(2000));

insert into urls values(111555,'http://context.us.oracle.com');

insert into urls values(111556,'http://www.sun.com');

insert into urls values(111557,'http://www.itpub.net');

insert into urls values(111558,'http://www.ixdba.com');

commit;

-- Create the url_datastore preference

begin

ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');

ctx_ddl.set_attribute('URL_PREF','Timeout','300');

end;

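-- Create the index

create index datastores_text on urls (docs) indextype is ctxsys.context parameters

( 'Datastore URL_PREF' );

select * from urls where contains(docs,'Aix')>0;

If a URL does not exist, Oracle does not raise an error; the query simply finds no data for it.

Oracle stores only the URL of the indexed document. If the document itself changes, you have to tell Oracle that the indexed data has changed by modifying the indexed column (the URL column).

3.1.6 User_datastore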

Use the USER_DATASTORE type to define stored procedures that synthesize documents during

indexing. For example, a user procedure might synthesize author, date, and text columns into one

document to have the author and date information be part of the indexed text.
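A minimal sketch of that idea follows. The articles table, the articles_doc procedure and the my_user_ds preference are all hypothetical names introduced for illustration. The procedure signature (a ROWID in, a CLOB in/out) is the form Oracle Text expects for a CLOB output; depending on the release, the procedure may need to be owned by, or granted to, a particular user.

```sql
-- Hypothetical table; the index is created on body, but the indexed
-- document is whatever the procedure writes into tlob
CREATE TABLE articles (
  id       NUMBER PRIMARY KEY,
  author   VARCHAR2(30),
  pub_date DATE,
  body     VARCHAR2(2000)
);

-- Called once per row during indexing; synthesizes author, date and text
CREATE OR REPLACE PROCEDURE articles_doc (rid IN ROWID, tlob IN OUT NOCOPY CLOB) IS
  v_doc VARCHAR2(4000);
BEGIN
  SELECT author || ' ' || TO_CHAR(pub_date, 'YYYY-MM-DD') || ' ' || body
  INTO   v_doc
  FROM   articles
  WHERE  rowid = rid;

  DBMS_LOB.WRITEAPPEND(tlob, LENGTH(v_doc), v_doc);
END;
/

BEGIN
  ctx_ddl.create_preference('my_user_ds', 'user_datastore');
  ctx_ddl.set_attribute('my_user_ds', 'procedure', 'articles_doc');
END;
/

CREATE INDEX articles_idx ON articles(body)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore my_user_ds');
```

With this in place, a query such as contains(body, 'Tom') would also match on the author name, because the author is part of the synthesized document.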

3.1.7 Nested_datastore

Use this datastore to index text stored in a nested table.

3.1.8 Reference scripts

-- Create a direct datastore index

Create index myindex on mytable(docs)

indextype is ctxsys.context

parameters ('datastore ctxsys.default_datastore');

-- Create a multi_column_datastore

Begin

Ctx_ddl.create_preference('my_multi', 'multi_column_datastore');

Ctx_ddl.set_attribute('my_multi', 'columns', 'doc1, doc2, doc3');

End;

Create index idx_mytable on mytable1(doc1)indextype is ctxsys.context

parameters('datastore my_multi');

-- Create a file_datastore

begin

ctx_ddl.create_preference('COMMON_DIR','FILE_DATASTORE');

ctx_ddl.set_attribute('COMMON_DIR','PATH','/opt/tmp');

end;

create index myindex3 on mytable3(docs) indextype is ctxsys.context parameters ('datastore COMMON_DIR');

-- Create a url_datastore

begin

ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');

ctx_ddl.set_attribute('URL_PREF','Timeout','300');

end;

create index datastores_text on urls (docs) indextype is ctxsys.context parameters

( 'Datastore URL_PREF' );

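3.2 Filter preference

The filter converts documents in various file formats into plain text. The other components of the indexing pipeline can only process plain text and do not understand formats such as Microsoft Word or Excel. The filter types are charset_filter, inso_filter (named auto_filter in later releases), null_filter, user_filter and procedure_filter; in short, a filter turns a document format into text the database can index.

3.2.1 CHARSET_FILTER

Converts documents from a non-database character set to the database character set (from the Oracle documentation: "Use the CHARSET_FILTER to convert documents from a non-database character set to the character set used by the database").

Example: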

create table hdocs ( id number primary key, fmt varchar2(10), cset varchar2(20),

text varchar2(80)

);

begin

ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');

ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');

end;

insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt');

insert into hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt');

commit;

create index hdocsx on hdocs(text) indextype is ctxsys.context

parameters ('datastore ctxsys.file_datastore

filter cs_filter

format column fmt

charset column cset');

3.2.2 NULL_FILTER

The default; performs no filtering.

Oracle does not recommend auto_filter for HTML, XML and plain text; it recommends null_filter together with a section group type instead.

-- Create an index with null_filter

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group ctxsys.html_section_group');

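The default filter depends on the type of the indexed column and on the datastore type. For data stored in the database in varchar2, char and clob columns, Oracle chooses null_filter automatically; if the datastore is file_datastore, Oracle chooses auto_filter as the default.

3.2.3 AUTO_FILTER

A general-purpose filter that handles most document formats, including PDF and MS Word. It also recognizes plain-text, HTML, XHTML, SGML and XML documents automatically.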

Create table my_filter (id number, docs varchar2(1000));

Insert into my_filter values (1, 'Expert Oracle Database Architecture.pdf');

Insert into my_filter values (2, '1.txt');

Insert into my_filter values (3, '2.doc');

commit;

-- Create the file_datastore preference

Begin

ctx_ddl.create_preference('test_filter', 'file_datastore');

ctx_ddl.set_attribute('test_filter', 'path', '/opt/tmp');

End;

-- Indexing errors are recorded in this view

select * from CTX_USER_INDEX_ERRORS;

-- Create the index with auto_filter

Create index idx_m_filter on my_filter (docs) indextype is ctxsys.context

parameters ('datastore test_filter filter ctxsys.auto_filter');

select * from my_filter where contains(docs,'oracle')>0;

AUTO_FILTER recognizes most document formats automatically. You can also state the document type explicitly through a format column, whose values are text, binary or ignore: documents marked binary are filtered with auto_filter, documents marked text are handled as with null_filter, and documents marked ignore are not indexed.

create table hdocs (id number primary key,fmt varchar2(10),text varchar2(80));

insert into hdocs values(1, 'binary', '/docs/myword.doc');

insert into hdocs values (2, 'text', '/docs/index.html');

insert into hdocs values (3, 'ignore', '/docs/1.txt');

commit;

create index hdocsx on hdocs(text) indextype is ctxsys.context

parameters ('datastore ctxsys.file_datastore filter ctxsys.auto_filter format column

fmt');

3.2.4 MAIL_FILTER

mail_filter converts RFC-822 and RFC-2045 messages into indexable text.

Restrictions:

The document must be US-ASCII.

Lines must not be longer than 1024 bytes.

The document must be syntactically valid with regard to RFC-822.

3.2.5 USER_FILTER

Use the USER_FILTER type to specify an external filter for filtering documents in a column.

3.2.6 PROCEDURE_FILTER

Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called

each time a document needs to be filtered.

3.2.7 Reference scripts

-- Create an index with null_filter

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group ctxsys.html_section_group');

-- Create an index with auto_filter

Create index idx_m_filter on my_filter (docs) indextype is ctxsys.context

parameters ('datastore test_filter filter ctxsys.auto_filter');

Filter errors are recorded in the view CTX_USER_INDEX_ERRORS.
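A short sketch of how the view might be used after building idx_m_filter. It assumes the my_filter table from section 3.2.3 and that err_textkey holds the rowid of the failed document as text; check the column definitions of the view in your release before relying on the join.

```sql
-- Most recent indexing errors first
SELECT err_index_name, err_timestamp, err_textkey, err_text
FROM   ctx_user_index_errors
ORDER  BY err_timestamp DESC;

-- Which file names failed for a given index (index names are stored in upper case)
SELECT f.id, f.docs, e.err_text
FROM   my_filter f
JOIN   ctx_user_index_errors e
ON     f.rowid = CHARTOROWID(e.err_textkey)
WHERE  e.err_index_name = 'IDX_M_FILTER';
```

An index that builds without raising an error can still have skipped documents, so checking this view after every build or synchronization is a cheap safeguard.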

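3.3 Lexer preference

The lexer preference handles the different languages. Basic English uses basic_lexer; for Chinese you can use chinese_vgram_lexer or chinese_lexer.

3.3.1 Basic_lexer

basic_lexer supports languages that use whitespace as the word delimiter, such as English, German, Dutch, Norwegian and Swedish (from the Oracle documentation: "Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other supported whitespace-delimited languages.").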

Create table my_lex (id number, docs varchar2(1000));

Insert into my_lex values (1, 'this is a example for the basic_lexer');

Insert into my_lex values (2, 'he following example sets Printjoin characters ');

Insert into my_lex values (3, 'To create the INDEX with no_theme indexing and with printjoins characters');

Insert into my_lex values (4, '中华人民共和国');

Insert into my_lex values (5, '中国淘宝软件');

Insert into my_lex values (6, '测试basic_lexer 是否支持中文');

Commit;

-- Create the basic_lexer preference

begin

ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep _ and - as part of tokens

ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO');

ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES');

ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive

end;

create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer

mylex');

Select id from my_lex where contains(docs, 'no_theme') > 0;

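select docs from my_lex where contains(docs,'中国')>0;

3.3.2 Multi_lexer

Supports documents in several languages; for example, you can use this lexer for a column holding English, German and Japanese documents (from the Oracle documentation: "Use MULTI_LEXER to index text columns that contain documents of different languages. For example, you can use this lexer to index a text column that stores English, German, and Japanese documents."). Create an index with a multi_lexer preference and name a language column that tells Oracle the language of each row. Oracle matches the content of the language column against the language identifiers given in add_sub_lexer; if one matches, that sub-lexer is used to index the row, otherwise the default sub-lexer is used. Note that the client's nls_language setting may affect which lexer is chosen.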

Select * from v$nls_parameters where parameter = 'NLS_LANGUAGE';

alter session set nls_language='simplified chinese';

alter session set nls_language='american';

Example:

create table globaldoc ( doc_id number primary key,lang varchar2(3),text clob);

-- Create the multi_lexer preference

begin

ctx_ddl.create_preference('english_lexer','basic_lexer');

ctx_ddl.set_attribute('english_lexer','index_themes','yes');

ctx_ddl.set_attribute('english_lexer','theme_language','english');

ctx_ddl.create_preference('german_lexer','basic_lexer');

ctx_ddl.set_attribute('german_lexer','composite','german');

ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');

ctx_ddl.create_preference('global_lexer', 'multi_lexer');

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');

ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');

end;

create index globalx on globaldoc(text) indextype is ctxsys.context

parameters ('lexer global_lexer language column lang');

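3.3.3 chinese_vgram_lexer and chinese_lexer

basic_lexer only recognizes tokens separated by spaces, punctuation and line breaks, so to index Chinese content you must use chinese_vgram_lexer or chinese_lexer.

Compared with chinese_vgram_lexer, chinese_lexer has the following advantages:

It generates a smaller index

It gives better query response time

It generates tokens closer to real words, which improves query precision

It supports stop words

Because chinese_lexer uses a different algorithm to generate tokens, building the index takes longer than with chinese_vgram_lexer.

Supported character sets: al32utf8, zhs16cgb231280, zhs16gbk, zhs32gb18030, zht32euc, zht16big5, zht32tris, zht16mswin950, zht16hkscs, utf8

-- Create the Chinese lexer preferences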

Begin

ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');

ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');

End;

-- chinese_vgram_lexer

Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context Parameters ('lexer foo.my_chinese_vgram_lexer');

Select * from my_lex t where contains(docs, '中国') > 0;

-- chinese_lexer

drop index ind_m_lex1 force;

Create index ind_m_lex2 on my_lex(docs) indextype is ctxsys.context

Parameters ('lexer foo.my_chinese_lexer');

Select * from my_lex t where contains(docs, '中国') > 0;

3.3.4 User_lexer

Use USER_LEXER to plug in your own language-specific lexing solution. This enables you to

define lexers for languages that are not supported by Oracle Text. It also enables you to define a

new lexer for a language that is supported but whose lexer is inappropriate for your application.

3.3.5 Default_lexer

If the database was created with Chinese as its language, default_lexer is chinese_vgram_lexer; if it was created with English, default_lexer is basic_lexer.

3.3.6 Query_procedure

This callback stored procedure is called by Oracle Text as needed to tokenize words in the query.

A space-delimited group of characters (excluding the query operators) in the query will be

identified by Oracle Text as a word.

3.3.7 Reference scripts

-- Create the basic_lexer preference

begin

ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep _ and - as part of tokens

ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive

end;

create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer

mylex');

-- Create chinese_vgram_lexer or chinese_lexer preferences

Begin

ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');

ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');

End;

-- chinese_vgram_lexer

Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context

Parameters ('lexer foo.my_chinese_vgram_lexer');

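3.4 Section Group preference

Section groups support queries on documents with internal structure (such as HTML and XML documents). You can restrict a query to one part of the document, for example to the head title. Besides the content meant for display, HTML, XML and similar documents contain many tags that control structure and that you may not want indexed; handling them is one of the main jobs of a section group (from the Oracle documentation: "In order to issue WITHIN queries on document sections, you must create a section group before you define your sections").

3.4.1 Null_section_group

The system default; no section processing is performed.

Example: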

Create table my_sec (id number, docs varchar2(100));

Insert into my_sec values (1, 'a simple section group, test null_section_group attribute.');

Insert into my_sec values (2, 'this record one, can be query in nornal');

Insert into my_sec values (4, 'this record

are tested for

the query in paragraph');

Commit;

-- Create the index with null_section_group

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group ctxsys.null_section_group');

Select * from my_sec where contains(docs, 'record and query') > 0;

-- sentence and paragraph sections must be defined first, otherwise the query fails

Select * from my_sec where contains(docs, '(record and query) within sentence') > 0;

Begin

ctx_ddl.create_section_group('test_null', 'null_section_group');

ctx_ddl.add_special_section('test_null', 'sentence');

ctx_ddl.add_special_section('test_null', 'paragraph');

End;

drop index ind_m_sec;

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group test_null');

Select * from my_sec where contains(docs, '(record and query) within sentence') > 0;

Select * from my_sec where contains(docs, '(record and query) within paragraph') > 0;

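3.4.2 Basic_section_group

basic_section_group is the most basic type that supports section searching, but it only supports documents whose sections start with <A> and end with </A>.

Create table my_sec1 (id number, docs varchar2(1000));

Insert into my_sec1 values (1, '<heading>title</heading>

<context>this is the contents of the example.

Use this example to test the basic_section_group.</context>');

Insert into my_sec1 values (2, '<heading>example</heading>

<context>this line including the word title too.</context>');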

Commit;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context;

Select * from my_sec1 where contains (docs, 'heading') > 0;

-- Define a basic_section_group

Begin

Ctx_ddl.create_section_group('test_basic', 'basic_section_group');

End;

drop index ind_my_sec1;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

Select * from my_sec1 where contains (docs, 'heading') > 0;

Select * from my_sec1 where contains (docs, 'context') > 0;

Select * from my_sec1 where contains (docs, 'use') > 0;

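Another main use of section searching is limiting the scope of a query. The documents above have two parts, a title and a body: the title is wrapped in the <heading> tag and the body in the <context> tag. We can add a zone section to the basic_section_group so that a query can be restricted to one part of the document.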

Drop index ind_my_sec1;

Begin

ctx_ddl.add_zone_section('test_basic', 'head', 'heading');

End;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

Select * from my_sec1 where contains (docs, 'title') > 0;

-- Search within the head section

Select * from my_sec1 where contains (docs, 'title within head') > 0;

3.4.3 Html_section_group

HTML documents are often written in non-standard ways; Oracle recommends html_section_group so that they are recognized better.

-- Define an html_section_group

begin

ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');

end;

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group htmgroup');

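For both field sections and zone sections, the tags that mark up the document are case-sensitive; their case has to match the original text.

3.4.4 Xml_section_group

XML documents are stricter and more regular than HTML documents, which lets xml_section_group offer more features than html_section_group.

Example: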

Create table my_sec2 (id number, docs varchar2(1000));

Insert into my_sec2 values (1, 'context.xml');

commit;

-- Define an xml_section_group; sections are added before the index is created so that the index picks them up

Begin

ctx_ddl.create_preference('test_file', 'file_datastore');

ctx_ddl.set_attribute('test_file', 'path', '/opt/tmp');

ctx_ddl.create_section_group('test_html', 'html_section_group');

ctx_ddl.create_section_group('test_xml', 'xml_section_group');

ctx_ddl.add_attr_section('test_xml', 'name', 'const@name');

End;

Create index ind_t_docs on my_sec2 (docs) indextype is ctxsys.context

parameters('datastore test_file filter ctxsys.null_filter section group test_xml');

Select * from my_sec2 where contains (docs, 'complete within name') > 0;

3.4.5 Auto_section_group

An enhanced xml_section_group. With xml_section_group you have to add the sections you need yourself; with auto_section_group Oracle adds the sections and attribute information automatically.

3.4.6 Path_section_group

Very similar to auto_section_group. path_section_group adds the haspath and inpath operators, but it does not support add_stop_section.
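A sketch of how the two types might be used, reusing the my_sec2 table and the test_file datastore from the Xml_section_group example. The section group names my_auto and my_path and the tag path /root/name are made up for illustration; replace them with the tags of your own documents. Only one context index can exist on the column at a time, so each index is dropped before the next is created.

```sql
BEGIN
  ctx_ddl.create_section_group('my_auto', 'auto_section_group');
  ctx_ddl.create_section_group('my_path', 'path_section_group');
END;
/

-- Remove the index from the xml_section_group example, if it exists
DROP INDEX ind_t_docs;

-- auto_section_group: tags become sections without add_zone_section calls
CREATE INDEX ind_auto_docs ON my_sec2(docs)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore test_file filter ctxsys.null_filter section group my_auto');

SELECT * FROM my_sec2 WHERE CONTAINS(docs, 'complete WITHIN name') > 0;

-- path_section_group: additionally supports INPATH and HASPATH
DROP INDEX ind_auto_docs;

CREATE INDEX ind_path_docs ON my_sec2(docs)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore test_file filter ctxsys.null_filter section group my_path');

SELECT * FROM my_sec2 WHERE CONTAINS(docs, 'complete INPATH(/root/name)') > 0;
SELECT * FROM my_sec2 WHERE CONTAINS(docs, 'HASPATH(/root/name)') > 0;
```

INPATH restricts a text search to the element at the given path, while HASPATH only tests whether the path exists in the document.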

3.4.7 Reference scripts

-- Create an index with null_section_group

Create index ind_m_sec on my_sec(docs) indextype is ctxsys.context

parameters ('section group ctxsys.null_section_group');

-- Create a basic_section_group

Begin

Ctx_ddl.create_section_group('test_basic', 'basic_section_group');

End;

Begin

ctx_ddl.add_zone_section('test_basic', 'head', 'heading'); -- define the zone section used for section queries

End;

Create index ind_my_sec1 on my_sec1(docs) indextype is ctxsys.context

parameters ('section group test_basic');

-- Create an html_section_group

begin

ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');

end;

create index myindex on docs(htmlfile) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group htmgroup');

-- Create an xml_section_group

Begin

ctx_ddl.create_section_group('test_xml', 'xml_section_group');

End;

Create index ind_t_docs on my_sec2 (docs) indextype is ctxsys.context

parameters('filter ctxsys.null_filter section group test_xml');

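3.5 Storage preference

An Oracle Text index normally generates a set of auxiliary tables named dr$ + index name + $ + a suffix identifying the table's purpose. Because Oracle creates these tables automatically, you cannot normally specify storage for them directly. To set the tablespace and storage parameters of the auxiliary tables of a text index (from the Oracle documentation: "Use the storage preference to specify tablespace and creation parameters for tables associated with a Text index"), Oracle provides a single storage type, basic_storage.

After a full-text index myindex is created on table mytable1, the following objects exist alongside the base table MYTABLE1:

DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R, DR$MYINDEX$X

Reference script

-- Create a basic_storage preference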

Begin

Ctx_ddl.create_preference('mystore', 'basic_storage'); -- create the storage preference

Ctx_ddl.set_attribute('mystore', -- set the attribute

'i_table_clause',

'tablespace foo storage (initial 1k)');