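3.3 Lexer Attributes
The lexer preference in Oracle Text controls how text in different languages is split into tokens. For English, the basic choice is basic_lexer;
for Chinese, use chinese_vgram_lexer or chinese_lexer.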
3.3.1 Basic_lexer
basic_lexer supports whitespace-delimited languages such as English, German, Dutch, Norwegian, and Swedish (original
text: Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all
other supported whitespace-delimited languages.)
Create table my_lex (id number, docs varchar2(1000));
Insert into my_lex values (1, 'this is an example for the basic_lexer');
Insert into my_lex values (2, 'The following example sets Printjoin characters ');
Insert into my_lex values (3, 'To create the INDEX with no_theme indexing and with printjoins characters');
Insert into my_lex values (4, '中华人民共和国');
Insert into my_lex values (5, '中国淘宝软件');
Insert into my_lex values (6, '测试basic_lexer 是否支持中文');
Commit;
-- Create the basic_lexer preference
begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep _ and - as part of a token
ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO');
ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES');
ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive tokens
end;
/
create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer mylex');
Select id from my_lex where contains(docs, 'no_theme') > 0;
-- basic_lexer does not segment Chinese text, so this query is not expected to match row 5
select docs from my_lex where contains(docs,'中国')>0;
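To see how basic_lexer tokenized the sample rows, one option is to look at the token table that Oracle Text maintains for a CONTEXT index. The sketch below assumes the index name indx_m_lex used above and the usual DR$<index name>$I naming of the token table; it is an illustration and was not part of the original script.

```sql
-- List the tokens stored for index indx_m_lex
-- (token table name assumed to be DR$INDX_M_LEX$I, owned by the index owner)
select token_text, token_count
from dr$indx_m_lex$i
order by token_text;
```

With printjoins set to '_-', a token such as no_theme should be listed whole. A Chinese string without spaces is likely to be listed as one long token, which would explain why a query for part of it finds no rows.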
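3.3.2 Multi_lexer
multi_lexer supports documents in multiple languages; for example, you can use this lexer to index a column that contains English, German, and Japanese
documents (original text: Use MULTI_LEXER to index text columns that contain documents of different
languages. For example, you can use this lexer to index a text column that stores English, German,
and Japanese documents.) Create an index with a multi_lexer preference and use a language column to specify the language of each row.
Oracle matches the content of the language column against the language identifiers given in the add_sub_lexer calls. If a match is found, that sub_lexer is used to index the row; if no match is found, the sub_lexer registered for the default language is used. Note that the client's nls_language setting can affect which lexer is chosen: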
Select * from v$nls_parameters where parameter = 'NLS_LANGUAGE';
alter session set nls_language='simplified chinese';
alter session set nls_language='american';
Example:
create table globaldoc ( doc_id number primary key,lang varchar2(3),text clob);
-- Create the multi_lexer preference
begin
ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','theme_language','english');
ctx_ddl.create_preference('german_lexer','basic_lexer');
ctx_ddl.set_attribute('german_lexer','composite','german');
ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
ctx_ddl.create_preference('global_lexer', 'multi_lexer');
ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
end;
/
create index globalx on globaldoc(text) indextype is ctxsys.context
parameters ('lexer global_lexer language column lang');
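The following sketch shows how the language column might drive sub-lexer selection for the globalx index. The sample rows and the lang value 'eng' are assumptions made for illustration; 'ger' and 'jpn' are the alternate values passed to add_sub_lexer above, and a value with no matching sub-lexer falls back to the default one.

```sql
-- Sample rows; lang is matched against the identifiers passed to add_sub_lexer
insert into globaldoc values (1, 'eng', 'Oracle Text indexes documents in several languages');
insert into globaldoc values (2, 'ger', 'Das Haus steht am Bahnhof');
insert into globaldoc values (3, 'jpn', '東京は日本の首都です');
commit;

-- A CONTEXT index is not updated at commit by default, so synchronize it explicitly
begin
  ctx_ddl.sync_index('globalx');
end;
/

-- At query time the sub-lexer is chosen from the session language
alter session set nls_language = 'german';
select doc_id from globaldoc where contains(text, 'Haus') > 0;
```

Because the query string is tokenized with the lexer that matches the session language, the same query can behave differently under another nls_language setting.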
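3.3.3 chinese_vgram_lexer and chinese_lexer
basic_lexer only recognizes tokens separated by whitespace, punctuation, and line breaks. To index Chinese content, you must use chinese_vgram_lexer or chinese_lexer.
Compared with chinese_vgram_lexer, chinese_lexer has the following advantages:
It produces a smaller index
It gives better query response time
It generates tokens closer to real words, which improves query precision
It supports stop words
Because chinese_lexer uses a different algorithm to generate tokens, building the index takes longer than with
chinese_vgram_lexer.
Character sets: al32utf8, zhs16cgb231280, zhs16gbk, zhs32gb18030, zht32euc, zht16big5,
zht32tris, zht16mswin950, zht16hkscs, and utf8 are supported.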
-- Create the Chinese lexer preferences
Begin
ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');
ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');
End;
/
-- chinese_vgram_lexer
-- Drop the basic_lexer index from 3.3.1 first; a column can carry only one index of a given indextype
drop index indx_m_lex force;
-- The preference name is resolved in the current schema; write owner.preference if another user created it
Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context Parameters ('lexer my_chinese_vgram_lexer');
Select * from my_lex t where contains(docs, '中国') > 0;
-- chinese_lexer
drop index ind_m_lex1 force;
Create index ind_m_lex2 on my_lex(docs) indextype is ctxsys.context
Parameters ('lexer my_chinese_lexer');
Select * from my_lex t where contains(docs, '中国') > 0;
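Since stop word support is listed as one advantage of chinese_lexer, the sketch below shows one way a stoplist could be attached to the chinese_lexer index. The stoplist name my_chinese_stoplist and the stop word '的' are hypothetical choices for illustration, and the preference my_chinese_lexer is assumed to exist in the current schema.

```sql
-- Hypothetical stoplist; '的' is only an example stop word
begin
  ctx_ddl.create_stoplist('my_chinese_stoplist', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('my_chinese_stoplist', '的');
end;
/

-- Recreate the chinese_lexer index so that it uses the stoplist
drop index ind_m_lex2 force;
create index ind_m_lex2 on my_lex(docs) indextype is ctxsys.context
parameters ('lexer my_chinese_lexer stoplist my_chinese_stoplist');
```

Stop words are left out of the index, so a very frequent word added this way no longer takes up index space or matches on its own.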
3.3.4 User_lexer
Use USER_LEXER to plug in your own language-specific lexing solution. This enables you to
define lexers for languages that are not supported by Oracle Text. It also enables you to define a
new lexer for a language that is supported but whose lexer is inappropriate for your application.
3.3.5 Default_lexer
If the database was created with Chinese as its language, default_lexer is chinese_vgram_lexer; if it was created with English, default_lexer is basic_lexer.
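To check which lexer a given database actually uses by default, the preference can be looked up in the Oracle Text data dictionary. This is a minimal sketch that assumes the current user can query the CTX_PREFERENCES view.

```sql
-- Show which lexer object backs the DEFAULT_LEXER preference in this database
select pre_owner, pre_name, pre_object
from ctx_preferences
where pre_name = 'DEFAULT_LEXER';
```

The pre_object column should name the lexer type, for example BASIC_LEXER or CHINESE_VGRAM_LEXER.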
3.3.6 Query_procedure
This callback stored procedure is called by Oracle Text as needed to tokenize words in the query.
A space-delimited group of characters (excluding the query operators) in the query will be
identified by Oracle Text as a word.
3.3.7 Reference scripts
-- Create the basic_lexer preference
begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
ctx_ddl.set_attribute ('mylex', 'printjoins', '_-'); -- keep _ and - as part of a token
ctx_ddl.set_attribute ('mylex','mixed_case','yes'); -- case-sensitive tokens
end;
/
create index indx_m_lex on my_lex(docs) indextype is ctxsys.context parameters('lexer mylex');
-- Create the chinese_vgram_lexer or chinese_lexer preference
Begin
ctx_ddl.create_preference('my_chinese_vgram_lexer', 'chinese_vgram_lexer');
ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');
End;
/
-- chinese_vgram_lexer (drop indx_m_lex first if it exists on the same column)
Create index ind_m_lex1 on my_lex(docs) indextype is ctxsys.context
Parameters ('lexer my_chinese_vgram_lexer');