The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
schema.xml位于solr/conf/目录下,类似于数据表配置文件,定义了加入索引的数据的数据类型,主要包括type、fields和其他的一些缺省设置。
The <types> section allows you to define a list of <fieldtype> declarations you wish to use in your schema, along with the underlying Solr class that should be used for that type, as well as the default options you want for fields that use that type.
types节点,这里面定义FieldType子节点,包括name,class,positionIncrementGap等一些参数。
<schema name="example" version="1.2"> <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/> <fieldtype name="binary" class="solr.BinaryField"/> <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> ... </types> ... </schema>
必要的时候fieldType还需要自己定义这个类型的数据在建立索引和进行查询的时候要使用的分析器analyzer,包括分词和过滤,如下:
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <!--这个分词包是空格分词,在向索引库添加text类型的索引时,Solr会首先用空格进行分词 然后把分词结果依次使用指定的过滤器进行过滤,最后剩下的结果,才会加入到索引库中以备查询。 注意:Solr的analysis包并没有带支持中文的包,需要自己添加中文分词器,google下。 --> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- in this example, we will only use synonyms at query time <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> --> <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. --> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> </analyzer> </fieldType>
The <fields> section is where you list the individual <field> declarations you wish to use in your documents. Each <field> has a name that you will use to reference it when adding documents or executing searches, and an associated type which identifies the name of the fieldtype you wish to use for this field. There are various field options that apply to a field. These can be set in the field type declarations, and can also be overridden at an individual field's declaration.
fields节点内定义具体的字段(类似数据库的字段),含有以下属性:
<fields> <field name="id" type="integer" indexed="true" stored="true" required="true" /> <field name="name" type="text" indexed="true" stored="true" /> <field name="summary" type="text" indexed="true" stored="true" /> <field name="author" type="string" indexed="true" stored="true" /> <field name="date" type="date" indexed="false" stored="true" /> <field name="content" type="text" indexed="true" stored="false" /> <field name="keywords" type="keyword_text" indexed="true" stored="false" multiValued="true" /> <!--拷贝字段--> <field name="all" type="text" indexed="true" stored="false" multiValued="true"/> </fields>
建议建立一个拷贝字段,将所有的 全文本 字段复制到一个字段中,以便进行统一的检索:
以下是拷贝设置:
<copyField source="name" dest="all"/> <copyField source="summary" dest="all"/>
以下为一个完整的copyfield定义:
<schema name="eshequn.post.db_post.0" version="1.1" xmlns:xi="http://www.w3.org/2001/XInclude"> <fields> <!-- for title --> <field name="t" type="text" indexed="true" stored="false" /> <!-- for abstract --> <field name="a" type="text" indexed="true" stored="false" /> <!-- for title and abstract --> <field name="ta" type="text" indexed="true" stored="false" multiValued="true"/> </fields> <copyField source="t" dest="ta" /> <copyField source="a" dest="ta" /> </schema>Copy Fields
字段t是文章的标题,字段a是文章的摘要,字段ta是文章标题和摘要的联合。添加索引文档时,只需要传入t和a字段的内容,solr会自动索引ta字段。
One of the powerful features of Lucene is that you don't have to pre-define every field when you first create your index. Even though Solr provides strong datatyping for fields, it still preserves that flexibility using "Dynamic Fields". Using <dynamicField> declarations, you can create field rules that Solr will use to understand what datatype should be used whenever it is given a field name that is not explicitly defined, but matches a prefix or suffix used in a dynamicField.
For example the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" which is not an explicitly defined field, then it should dynamically create an integer field with that name...
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
The <uniqueKey> declaration can be used to inform Solr that there is a field in your index which should be unique for all documents. If a document is added that contains the same value for this field as an existing document, the old document will be deleted. It is not mandatory for a schema to have a uniqueKey field.
schema.xml文档注释中的信息:
1、为了改进性能,可以采取以下几种措施:
2、< schema name =" example " version =" 1.2 " >
3、filedType
< fieldType name =" string " class =" solr.StrField " sortMissingLast =" true " omitNorms =" true " />
可选的属性:
StrField类型不被分析,而是被逐字地索引/存储。
StrField和TextField都有一个可选的属性“compressThreshold”,保证压缩到不小于一个大小(单位:char)
< fieldType name =" text " class =" solr.TextField " positionIncrementGap =" 100 " >
solr.TextField 允许用户通过分析器来定制索引和查询,分析器包括 一个分词器(tokenizer)和多个过滤器(filter)
positionIncrementGap:可选属性,定义在同一个文档中此类型数据的空白间隔,避免短语匹配错误。
< tokenizer class =" solr.WhitespaceTokenizerFactory " />
空格分词,精确匹配。
< filter class =" solr.WordDelimiterFilterFactory " generateWordParts =" 1 " generateNumberParts =" 1 " catenateWords =" 1 " catenateNumbers =" 1 " catenateAll =" 0 " splitOnCaseChange =" 1 " />
在分词和匹配时,考虑 "-"连字符,字母数字的界限,非字母数字字符,这样 "wifi"或"wi fi"都能匹配"Wi-Fi"。
< filter class =" solr.SynonymFilterFactory " synonyms =" synonyms.txt " ignoreCase =" true " expand =" true " />
同义词
< filter class =" solr.StopFilterFactory " ignoreCase =" true " words =" stopwords.txt " enablePositionIncrements =" true " />
在禁用字(stopword)删除后,在短语间增加间隔
stopword:即在建立索引过程中(建立索引和搜索)被忽略的词,比如is this等常用词。在conf/stopwords.txt维护。
4、fields
< field name =" id " type =" string " indexed =" true " stored =" true " required =" true " />
< field name =" text " type =" text " indexed =" true " stored =" false " multiValued =" true " />
包罗万象(有点夸张)的field,包含所有可搜索的text fields,通过copyField实现。
< copyField source =" cat " dest =" text " />
在添加索引时,将所有被拷贝field(如cat)中的数据拷贝到text field中
作用:
< dynamicField name =" *_i " type =" int " indexed =" true " stored =" true " />
如果一个field的名字没有匹配到,那么就会用动态field试图匹配定义的各种模式。
< dynamicField name =" * " type =" ignored " multiValued=" true " />
如果通过上面的匹配都没找到,可以定义这个,然后定义个type,当String处理。(一般不会发生)但若不定义,找不到匹配会报错。
5、其他一些标签
< uniqueKey > id </ uniqueKey >
文档的唯一标识, 必须填写这个field(除非该field被标记required="false"),否则solr建立索引报错。(跟wiki里说的It is not mandatory for a schema to have a uniqueKey field. 相反)
< defaultSearchField > text </ defaultSearchField >
如果搜索参数中没有指定具体的field,那么这是默认的域。
< solrQueryParser defaultOperator =" OR " />
配置搜索参数短语间的逻辑,可以是"AND|OR"。