Text fields are typically indexed by breaking the text into words and applying various transformations such as lowercasing, removing plurals, or stemming to increase relevancy. The same text transformations are normally applied to any queries in order to match what is indexed.
The schema defines the fields in the index and what type of analysis is applied to them. The current schema your collection is using may be viewed directly via the Schema tab in the Admin UI, or explored dynamically using the Schema Browser tab.
The best analysis components (tokenization and filtering) for your textual content depend heavily on the language. As you can see in the Schema Browser, many of the fields in the example schema use a fieldType named text_general, which has defaults appropriate for most languages.
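To get a feel for what such a fieldType definition looks like, here is an abbreviated sketch of text_general along the lines of the example schema.xml. The shipped copy includes additional options, so treat this as illustrative rather than exact:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- index-time chain: split on word boundaries, drop stop words, lowercase -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- query-time chain: same steps, plus synonym expansion -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Note that the index and query chains need not be identical; here synonym expansion happens only at query time, so the index stays compact while queries still match all equivalent forms.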
If you know your textual content is English, as is the case for the example documents in this tutorial, and you'd like to apply English-specific stemming and stop word removal, as well as split compound words, you can use the text_en_splitting fieldType instead. Go ahead and edit schema.xml in the solr/example/solr/collection1/conf directory to use the text_en_splitting fieldType for the text and features fields, like so:
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
...
<field name="text" type="text_en_splitting" indexed="true" stored="false" multiValued="true"/>
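For reference, the index-time analyzer of text_en_splitting chains a whitespace tokenizer with stop word removal, word-delimiter splitting, lowercasing, and Porter stemming. The sketch below is abbreviated from the example schema (the shipped definition also configures a matching query-time analyzer and protected-word handling), so consult your schema.xml for the authoritative version:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop common English stop words -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <!-- split on case changes, hyphens, and letter/number boundaries; also index catenated forms -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- English stemming -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>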
Stop and restart Solr after making these changes, then re-post all of the example documents using java -jar post.jar *.xml.
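A typical sequence, assuming the default example layout used throughout this tutorial (adjust the paths if your installation differs):

# from the directory where you started Solr (solr/example),
# stop the running instance with Ctrl-C, then:
java -jar start.jar

# in a second terminal, re-index the example documents
cd exampledocs
java -jar post.jar *.xml

Now queries like the ones listed below will demonstrate English-specific transformations: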
A search for power-shot can match PowerShot, and adata can match A-DATA by using the WordDelimiterFilter and LowerCaseFilter.
A search for features:recharging can match Rechargeable using the stemming features of PorterStemFilter.
A search for "1 gigabyte" can match 1GB, and the commonly misspelled pixima can matches Pixma using the SynonymFilter.
A full description of the available analysis components, Analyzers, Tokenizers, and TokenFilters, can be found in the Solr wiki.
There is a handy Analysis tab where you can see how a text value is broken down into words by both the Index-time and Query-time analysis chains for a field or field type. This page shows the resulting tokens after they pass through each filter in the chains.
For example, analyzing "Canon Power-Shot SD500" with the text_en_splitting type (see the example URL below) shows the tokens created at each stage. Each section of the table shows the resulting tokens after having passed through the next TokenFilter in the (Index) analyzer. Notice how both powershot and power, shot are indexed, using tokens that have the same "position". (Compare this output with the tokens produced using the text_general field type.)
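With the example running locally, the Analysis screen for this input can be reached at a URL along the following lines; the parameter names here follow the Solr 4 Admin UI and may differ in other versions:

http://localhost:8983/solr/#/collection1/analysis?analysis.fieldvalue=Canon+Power-Shot+SD500&analysis.fieldtype=text_en_splitting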
Mousing over the section label to the left of the section will display the full name of the analyzer component at that stage of the chain. Toggling the "Verbose Output" checkbox will show/hide the detailed token attributes.
When both Index and Query values are provided, two tables will be displayed side by side showing the results of each chain. Terms in the Index chain results that are equivalent to the final terms produced by the Query chain will be highlighted.
Other interesting examples:
English stemming and stop-words using the text_en field type
Half-width katakana normalization with bigramming using the text_cjk field type (a sketch of this type appears after this list)
Japanese morphological decomposition with part-of-speech filtering using the text_ja field type
Arabic stop-words, normalization, and stemming using the text_ar field type
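To show how these types are assembled, here is an abbreviated sketch of the text_cjk definition as it appears in the example schema: the CJKWidthFilter normalizes half-width katakana to its full-width form before the CJKBigramFilter forms bigrams. Verify the details against your own schema.xml:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width differences, e.g., half-width katakana to full-width -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- form overlapping bigrams of CJK characters -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>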