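I. The main logical steps Oracle Text follows when indexing a document are:
(1) The datastore scans every row of the table and reads the data in the column. Usually the column data is the document itself, but some datastores treat the column data as a pointer to the document. For example, URL_DATASTORE uses the column data as a URL.
(2) The filter takes the document data and converts it to a text representation. This is needed when binary documents (such as Word or Acrobat files) are stored. The filter output does not have to be plain text; it can be a text format such as XML or HTML.
(3) The sectioner takes the filter output and converts it to plain text. Different text formats, including XML and HTML, have different sectioners. Conversion to plain text involves detecting important section tags, removing invisible information, and reformatting the text.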
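The three stages above correspond to preferences that are named when the index is created. The following is a minimal sketch only, assuming a hypothetical table docs whose column body already holds HTML text; every object name in it is illustrative.

```sql
-- Hypothetical names: docs, body, docs_idx, my_datastore, my_filter, my_sections.
BEGIN
  -- Stage 1 (datastore): DIRECT_DATASTORE reads the text straight from the column.
  ctx_ddl.create_preference('my_datastore', 'DIRECT_DATASTORE');
  -- Stage 2 (filter): NULL_FILTER does no conversion, because the column
  -- is assumed to hold HTML text rather than binary documents.
  ctx_ddl.create_preference('my_filter', 'NULL_FILTER');
  -- Stage 3 (sectioner): HTML_SECTION_GROUP handles the HTML markup.
  ctx_ddl.create_section_group('my_sections', 'HTML_SECTION_GROUP');
END;
/

CREATE INDEX docs_idx ON docs(body)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('DATASTORE my_datastore FILTER my_filter SECTION GROUP my_sections');
```

For binary documents, a filtering preference would replace NULL_FILTER in the same position.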
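The Storage class specifies the tablespace and creation parameters for the database tables and indexes that make up an Oracle Text index. It has a single basic object, BASIC_STORAGE, whose attributes include I_Index_Clause, I_Table_Clause, K_Table_Clause, N_Table_Clause, P_Table_Clause, and R_Table_Clause.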
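As an illustration of the storage attributes, the sketch below places the token table and its index in separate tablespaces. The preference name my_storage and the tablespaces text_data and text_idx are assumptions and must exist in your database for the clauses to be accepted.

```sql
-- Hypothetical names: my_storage, text_data, text_idx.
BEGIN
  ctx_ddl.create_preference('my_storage', 'BASIC_STORAGE');
  -- Clause appended when the $I token table is created.
  ctx_ddl.set_attribute('my_storage', 'I_TABLE_CLAUSE',
                        'tablespace text_data');
  -- Clause appended when the index on the $I table is created.
  ctx_ddl.set_attribute('my_storage', 'I_INDEX_CLAUSE',
                        'tablespace text_idx compress 2');
END;
/
```

The preference takes effect once it is named in the index parameters, for example PARAMETERS ('STORAGE my_storage').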
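Datastore: describes where the text in the column is stored, along with related information. By default the text is stored directly in the column, and each row of the table represents one complete document. Other datastore locations include separate files and Web pages identified by their URLs. The seven basic objects are Default_Datastore, Detail_Datastore, Direct_Datastore, File_Datastore, Multi_Column_Datastore, URL_Datastore, and User_Datastore.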
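To show how a non-default datastore is configured, here is a short sketch using MULTI_COLUMN_DATASTORE, which concatenates several columns of a row into one indexed document. The table articles and its columns title, summary, and body are hypothetical.

```sql
-- Hypothetical names: articles, title, summary, body, my_multi, articles_idx.
BEGIN
  ctx_ddl.create_preference('my_multi', 'MULTI_COLUMN_DATASTORE');
  -- Columns whose contents are joined to form each document.
  ctx_ddl.set_attribute('my_multi', 'columns', 'title, summary, body');
END;
/

-- The index is created on one column, but the text of all listed columns is indexed.
CREATE INDEX articles_idx ON articles(title)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('DATASTORE my_multi');
```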
(3) Section Group class
(5) Index Set
An index set is a collection of one or more Oracle indexes (not Oracle Text indexes) used to create an Oracle Text index of type CTXCAT. It has a single basic object, BASIC_INDEX_SET.
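A brief sketch of how an index set is built and attached to a CTXCAT index follows. It assumes a hypothetical table auction with a text column title and a numeric column price.

```sql
-- Hypothetical names: auction, title, price, auction_iset, auction_titlex.
BEGIN
  ctx_ddl.create_index_set('auction_iset');
  -- Sub-index on the structured column to be combined with the text search.
  ctx_ddl.add_index('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction(title)
  INDEXTYPE IS ctxsys.ctxcat
  PARAMETERS ('INDEX SET auction_iset');
```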
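The Lexer class identifies the language of the text and determines how tokens are recognized in it. The default lexer targets English and other Western European languages; it separates tokens on whitespace, standard punctuation, and non-alphanumeric characters, and it is case-insensitive. There are 8 basic objects: BASIC_LEXER, CHINESE_LEXER, CHINESE_VGRAM_LEXER, JAPANESE_LEXER, JAPANESE_VGRAM_LEXER, KOREAN_LEXER, KOREAN_MORPH_LEXER, and MULTI_LEXER.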
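The sketch below shows two lexer preferences: a BASIC_LEXER with case-sensitive tokens switched on, and a lexer for Chinese text. The preference names are hypothetical, and which attributes you set depends on your data.

```sql
-- Hypothetical names: my_lexer, my_cn_lexer.
BEGIN
  ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER');
  -- Keep upper and lower case distinct instead of the case-insensitive default.
  ctx_ddl.set_attribute('my_lexer', 'mixed_case', 'YES');
  -- Treat '_' and '-' as part of a token rather than as separators.
  ctx_ddl.set_attribute('my_lexer', 'printjoins', '_-');

  -- Lexer for Chinese text; no attributes are set in this sketch.
  ctx_ddl.create_preference('my_cn_lexer', 'CHINESE_VGRAM_LEXER');
END;
/
```

Either preference is then referenced at index creation with PARAMETERS ('LEXER my_lexer').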
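The Filter class determines how text is filtered for indexing. Filters make it possible to index word-processor documents, formatted documents, plain text, and HTML documents. There are 5 basic objects: CHARSET_FILTER, INSO_FILTER, NULL_FILTER, PROCEDURE_FILTER, and USER_FILTER.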
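The Stoplist class specifies a set of words that are not indexed (called stopwords). There are two basic objects: BASIC_STOPLIST (all stopwords belong to one language) and MULTI_STOPLIST (a multilingual stoplist containing stopwords from several languages).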
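A minimal sketch of a custom single-language stoplist follows. The stoplist name and the chosen words are illustrative only.

```sql
-- Hypothetical name: my_stoplist.
BEGIN
  ctx_ddl.create_stoplist('my_stoplist', 'BASIC_STOPLIST');
  -- Words listed here are skipped at indexing time.
  ctx_ddl.add_stopword('my_stoplist', 'the');
  ctx_ddl.add_stopword('my_stoplist', 'because');
END;
/
```

The stoplist is applied by naming it at index creation, for example PARAMETERS ('STOPLIST my_stoplist').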
II. The complete steps for building a full-text index with Oracle Text can be summarized as follows:
CONTEXT index: the basic type of Oracle Text index, built on a text column. A CONTEXT index is useful when your source text consists of many large, coherent documents. Query this index with the CONTAINS operator in the WHERE clause of a SELECT statement. By default, this index requires manual synchronization after DML. See Syntax for CONTEXT Index Type.
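As a sketch of the query and synchronization pattern just described, assume a hypothetical table docs with columns id and body and a CONTEXT index named docs_idx on body.

```sql
-- Hypothetical names: docs, id, body, docs_idx.
-- The label 1 ties SCORE(1) to the CONTAINS call that carries the same label.
SELECT SCORE(1) AS relevance, id
  FROM docs
 WHERE CONTAINS(body, 'oracle AND search', 1) > 0
 ORDER BY SCORE(1) DESC;

-- After inserts or updates, bring the index up to date manually.
BEGIN
  ctx_ddl.sync_index('docs_idx');
END;
/
```

Rows changed since the last synchronization are not visible to CONTAINS until sync_index has run.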
CTXCAT index: a combined index on a text column and one or more other columns. CTXCAT is typically used to index small documents or text fragments, such as item names, prices, and descriptions found in catalogs. Query this index with the CATSEARCH operator in the WHERE clause of a SELECT statement. This type of index is optimized for mixed queries. It is transactional and updates itself automatically with DML on the base table. See Syntax for CTXCAT Index Type.
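A mixed query against a CTXCAT index might look like the sketch below. It assumes a hypothetical table auction with columns item_id, title, and price, a CTXCAT index on title, and an index set that contains a sub-index on price.

```sql
-- Hypothetical names: auction, item_id, title, price.
-- Second argument: the text query. Third argument: the structured condition,
-- which can only reference columns covered by the index set.
SELECT item_id, title, price
  FROM auction
 WHERE CATSEARCH(title, 'camera', 'price < 500') > 0;
```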
CTXRULE index: used to build a document classification application. The CTXRULE index is created on a table of queries or a column containing a set of queries, where the queries serve as rules that define the classification criteria. Query this index with the MATCHES operator in the WHERE clause of a SELECT statement. See Syntax for CTXRULE Index Type.
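The classification pattern can be sketched as follows. The table news_rules, its columns, and the rule texts are all hypothetical.

```sql
-- Hypothetical names: news_rules, rule_id, category, query, news_rules_idx.
CREATE TABLE news_rules (
  rule_id  NUMBER PRIMARY KEY,
  category VARCHAR2(30),
  query    VARCHAR2(2000)   -- each row holds one rule written as a text query
);

INSERT INTO news_rules VALUES (1, 'Sports',  'soccer OR football OR tennis');
INSERT INTO news_rules VALUES (2, 'Finance', 'stocks OR bonds');
COMMIT;

CREATE INDEX news_rules_idx ON news_rules(query)
  INDEXTYPE IS ctxsys.ctxrule;

-- Classify an incoming document: return every category whose rule it satisfies.
SELECT category
  FROM news_rules
 WHERE MATCHES(query, 'The football final drew a record crowd') > 0;
```

Note the reversed roles compared with CONTAINS: the indexed column holds queries, and the argument passed to MATCHES is the document.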
CTXXPATH index: create this index when you need to speed up existsNode() queries on an XMLType column. See Syntax for CTXXPATH Index Type.
Type | Description
AUTO_LEXER | Lexer for indexing columns that contain documents of different languages.
BASIC_LEXER | Lexer for extracting tokens from text in languages, such as English and most western European languages, that use white space delimited words.
MULTI_LEXER | Lexer for indexing tables containing documents of different languages such as English, German, and Japanese.
CHINESE_VGRAM_LEXER | Lexer for extracting tokens from Chinese text.
CHINESE_LEXER | Lexer for extracting tokens from Chinese text. This lexer offers benefits over the CHINESE_VGRAM_LEXER: generates a smaller index; better query response time; generates real world tokens resulting in better query precision; supports stop words.
JAPANESE_VGRAM_LEXER | Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER | Lexer for extracting tokens from Japanese text. This lexer offers the following advantages over the JAPANESE_VGRAM_LEXER: generates a smaller index; better query response time; generates real world tokens resulting in better precision.
KOREAN_MORPH_LEXER | Lexer for extracting tokens from Korean text.
USER_LEXER | Lexer you create to index a particular language.
WORLD_LEXER | Lexer for indexing tables containing documents of different languages; autodetects languages in a document.
Use the WORLD_LEXER to index text columns that contain documents of different languages. For example, use this lexer to index a text column that stores English, Japanese, and German documents.
WORLD_LEXER differs from MULTI_LEXER in that WORLD_LEXER automatically detects the language(s) of a document. Unlike MULTI_LEXER, WORLD_LEXER does not require you to have a language column in your base table nor to specify the language column when you create the index. Moreover, it is not necessary to use sub-lexers, as with MULTI_LEXER.
WORLD_LEXER supports all database character sets, and for languages whose character sets are Unicode-based, it supports the Unicode 5.0 standard. For a list of languages that WORLD_LEXER can work with, see "World Lexer Features".
WORLD_LEXER has the following attribute:
Attribute | Attribute Value
mixed_case | Enable mixed-case (upper- and lower-case) searches of text (for example, cat and Cat). Allowable values are YES and NO (default).
Here is an example of creating an index using WORLD_LEXER:
exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');
create index doc_idx on doc(data)
  indextype is CONTEXT
  parameters ('lexer MYLEXER');
CREATE INDEX [schema.]index ON [schema.]table(txt_column)
  INDEXTYPE IS ctxsys.context [ONLINE]
  [FILTER BY filter_column[, filter_column]...]
  [ORDER BY oby_column[desc|asc][, oby_column[desc|asc]]...]
  [LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]
  [, PARTITION [partition] [PARAMETERS('paramstring')]])]
  [PARAMETERS(paramstring)] [PARALLEL n]
Here, PARALLEL n builds the index in parallel, with n as the degree of parallelism.
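For comparison with the syntax template, a concrete statement might look like the sketch below. The table docs, the column body, and the index name docs_idx are hypothetical, and the degree of parallelism is an arbitrary choice.

```sql
-- Hypothetical names: docs, body, docs_idx.
CREATE INDEX docs_idx ON docs(body)
  INDEXTYPE IS ctxsys.context
  -- Synchronize the index automatically whenever a transaction commits.
  PARAMETERS ('SYNC (ON COMMIT)')
  -- Build the index with two parallel processes.
  PARALLEL 2;
```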
Optionally specify indexing parameters in paramstring. You can specify preferences owned by another user using the user.preference notation.
The syntax for paramstring is as follows:
paramstring =
'[DATASTORE datastore_pref]
[FILTER filter_pref]
[CHARSET COLUMN charset_column_name]
[FORMAT COLUMN format_column_name]
[LEXER lexer_pref]
[LANGUAGE COLUMN language_column_name]
[WORDLIST wordlist_pref]
[STORAGE storage_pref]
[STOPLIST stoplist]
[SECTION GROUP section_group]
[MEMORY memsize]
[SYNC (MANUAL | EVERY "interval-string" | ON COMMIT)]