Quick Starting Solr
Welcome to Solr! You've made an excellent choice in picking a technology to power
your searching needs. In this chapter, we're going to cover the following topics:
An overview of what Solr and Lucene are all about
What makes Solr different from other database technologies
How to get Solr, what's included, and what is where
Running Solr and importing sample data
A quick tour of the interface and key configuration files
An introduction to Solr
Solr is an open source enterprise search server. It is a mature product powering search for public sites like CNet, Zappos, and Netflix, as well as intranet sites. It is written in Java, and that language is used to further extend/modify Solr. However, being a server that communicates using standards such as HTTP and XML, knowledge of Java is very useful but not strictly a requirement. In addition to the standard ability to return a list of search results for some query, it has numerous other features such as result highlighting, faceted navigation (for example, the ones found on most e-commerce sites), query spell correction, auto-suggest queries, and "more like this" for finding similar documents.
[Figure: Common Solr Usage; diagram labels: Web, DB, Data, Solr]
Lucene, the underlying engine
Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files. In order to use Lucene directly, one writes code to store and query an index stored on a disk. The major features found in Lucene are as follows:
A text-based inverted index persistent storage for efficient retrieval of documents by indexed terms
A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched
A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches
A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring
A highlighter feature to show words found in context
A query spellchecker based on indexed content
For even more information on the query spellchecker, check out the Lucene In Action book (LINA for short) by Erik Hatcher and Otis Gospodnetić.
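To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not how Lucene is implemented; the analyze and build_inverted_index functions and the two sample documents are made up for illustration, and the "analyzer" merely lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (an assumption, not Lucene's): lowercase and split into word terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by id.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
}
index = build_inverted_index(docs)
print(sorted(index["lucene"]))  # [1, 2]
print(sorted(index["server"]))  # [2]
```

Looking up a term is then a single dictionary access, which is why retrieval by indexed terms is efficient compared with scanning every document.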
Solr, the Server-ization of Lucene
With the definition of Lucene behind us, Solr can be described succinctly as the server-ization of Lucene. However, it is definitely not a thin wrapper around the Lucene libraries. Most of Solr's features, such as faceting, are distinct from Lucene, although they are not far from it in the implementation. The line is often blurred as to what is Solr and what is Lucene. Without further ado, here is the major feature-set in Solr:
HTTP request processing for indexing and querying documents.
Several caches for faster query responses.
A web-based administrative interface including:
Runtime performance statistics including cache
hit/miss rates.
A query form to search the index.
A schema browser with histograms of popular terms along
with some statistics.
Detailed breakdown of scoring mathematics and text
analysis phases.
Configuration files for the schema and the server itself (in XML).
Solr adds to Lucene's text analysis library and makes it
configurable through XML.
Introduces the notion of a field type (this is important yet
surprisingly not in Lucene). Types are present for dates and
special sorting concerns.
The disjunction-max query handler is more usable by end user queries and
applications than Lucene's underlying raw queries.
Faceting of query results.
A spell check plugin used for making alternative query suggestions (that is,
"did you mean ___")
A more like this plugin to list documents that are similar to a
chosen document.
A distributed Solr server model with supporting scripts to support larger
scale deployments.
These features will be covered in more detail in later chapters.
Comparison to database technology
Knowledge of relational databases (often abbreviated RDBMS or just database for
short) is an increasingly common skill that developers possess. A database and a
[Lucene] search index aren't dramatically different conceptually. So let's start off
by assuming that you know database basics, and I'll describe how a search index
is different.
This comparison puts aside the possibility that your database has built-in
text indexing features. The point here is only to help you understand Solr.
The biggest difference is that a Lucene index is like a single-table database without
any support for relational queries (JOINs). Yes, it sounds crazy, but remember that
an index is usually only there to support search and not to be the primary source
of the data. So your database may be in "third normal form" but the index will be
completely de-normalized and contain mostly just the data needed to be searched.
One redeeming aspect of the single table schema is that fields can be multi-valued.
Other notable differences are as follows:
Updates: Entire documents can be deleted and added again but not updated.
Substring Search versus Text Search: Using a database, the poor man's
search would be a substring search such as SELECT * FROM mytable WHERE
name LIKE '%Books%'. That would match "CookBooks" as well as "My
Books". Lucene instead fundamentally searches on terms (words). Depending
on analysis configuration, this can mean that various forms of the word
(example: book, singular) are found too, even phonetic (sounds-like) matches
are possible. Using advanced ngram analysis techniques, it can do partial
words too, although this is uncommon.
Scored Results and Boosting: Much of the power of Lucene is in its ability to
score each matched document according to how well the search matched it.
For example, if multiple words are searched for and are optional (a boolean
OR search), then Lucene scores documents that matched more terms higher
than those that just matched one. There are a variety of other factors too,
and it's possible to adjust weightings of different fields. By comparison, a
database has no concept of this, a record either matched or not. Of course,
Lucene can sort on field values if that is needed.
Slow commits: Solr is highly optimized for search speed, and that speed is
largely attributable to caches. When a commit is done to finalize documents
that were just added, all of the caches need to be rebuilt, which could take
between seconds and a minute, depending on various factors.
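The difference between substring search and term search, and the idea of ranking by how many optional terms matched, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names and sample names are hypothetical, the "analysis" is plain lowercasing and splitting, and the score is just a count of matched terms, whereas Lucene's real scoring considers other factors too.

```python
import re


def terms(text):
    """Toy analysis (an assumption): lowercase and split into a set of word terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def substring_match(names, needle):
    """Rough equivalent of SQL's name LIKE '%needle%' (case-sensitive here)."""
    return [name for name in names if needle in name]


def term_match(names, word):
    """Match only names that contain the word as a whole term."""
    return [name for name in names if word.lower() in terms(name)]


def rank_by_matched_terms(names, query):
    """Boolean OR search; the score is the number of distinct query terms matched."""
    query_terms = terms(query)
    scored = [(len(query_terms & terms(name)), name) for name in names]
    hits = [pair for pair in scored if pair[0] > 0]
    return [name for score, name in sorted(hits, key=lambda pair: -pair[0])]


names = ["CookBooks", "My Books", "Books on Tape", "Old Maps"]
print(substring_match(names, "Books"))         # ['CookBooks', 'My Books', 'Books on Tape']
print(term_match(names, "books"))              # ['My Books', 'Books on Tape']
print(rank_by_matched_terms(names, "my books"))  # ['My Books', 'Books on Tape']
```

Note how "CookBooks" is found by the substring search but not by the term search, because "cookbooks" is a single term.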
Getting started
Solr is a Java based web application, but you don't need to be particularly familiar
with Java in order to use it. With most topics, this book assumes little to no such
knowledge on your part. However, if you wish to extend Solr, then you will
definitely need to know Java. I also assume a basic familiarity with the command
line, whether it is DOS or any Unix shell.
Before truly getting started with Solr, let's get the prerequisites out of the way. Note
that if you are using Mac OS X, then you should have the needed pieces already
(though you may need the developer tools add-on). If any of the -version test
commands mentioned as follows fail, then you don't have it. URLs are provided
for convenience, but it is up to you to install the software according to instructions
provided at the relevant sites.
A Java Development Kit (JDK) v1.5 or later: You can download the JDK
from http://java.sun.com/javase/. Typing java -version will tell
you which version of Java you are using if any, and you should type
javac -version to ensure that you have the development kit too. You
only need the JRE to run Solr, but you will need the JDK to compile it
from source and to extend it.
Apache Ant: Any recent version should do and is available at
http://ant.apache.org/. If you never modify Solr and just stick to a recent official
release, then you can skip this. Note that the software provided with this
book uses Ant as well. Therefore, you'll want Ant if you wish to follow
along. Typing ant -version should demonstrate that you have it installed.
Subversion or Git for source control of Solr:
http://subversion.tigris.org/getting.html or http://git-scm.com/. This isn't strictly necessary,
but it's recommended for working with Solr's source code. If you choose to
use a command line based distribution of either, then svn --version or
git --version should work. Further instructions in this book are based
on the command line, because it is a universal access method.
Any Java EE servlet engine app-server: This is a Java web server. Solr
includes one already, Jetty, and we'll be using this throughout the book.
In a later chapter, "Solr in the real world", deploying to an alternative
is discussed.
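If you would rather check all of the command line prerequisites above in one go, a small script can look for each program on your PATH. This is a minimal sketch, not part of Solr; the find_tools function is hypothetical, and finding a program on the PATH does not verify its version, so the -version checks described above are still worth running.

```python
import shutil


def find_tools(names=("java", "javac", "ant", "svn", "git")):
    """Return a dict mapping each tool name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```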
The last official release or fresh code from source control
Let's finally get started and get Solr running. The official site for Solr is at
http://lucene.apache.org/solr, where you can download the latest official
release. Solr 1.3 was released on September 15th, 2008. Solr 1.4 is expected around
the same time a year later and thus is probably available as you read this. This book
was written in-between these releases and so it contains many but not all of 1.4's
features. An alternative to downloading an official release is getting the latest code
from source control (that is version control). In either case, the directory structure
is conveniently identical and both include the source code. For many open source
projects, the choice is almost always the last official release and not the latest source.
However, Solr's committers have made unit and integration testing a priority,
evident by the testing infrastructure and test code-coverage of over 70 percent
(http://hudson.zones.apache.org/hudson/view/Solr/job/Solr-trunk/clover/),
which is very good. Many projects have none at all. As a result, the latest
source release is very stable, and it also makes changes to Solr easier, given that so
many tests are in place to give confidence that Solr is working properly—so far as
the tests test it, of course. And unlike a database, which is almost never modified
to suit the needs of a project, Solr is modified often. Also note that there are a good
many feature additions provided as source code patches within Solr's JIRA (its issue
tracking system). The decision is of course up to you. If you are satisfied with the
feature-set in the latest release and/or you don't think you'll be modifying Solr at all,
then the latest release is fine. One way to gauge what (completed) features are not
yet in the latest official release is to visit Solr's JIRA at
http://issues.apache.org/jira/browse/SOLR, and then click on Roadmap. Also, the Wiki at
http://wiki.apache.org/solr/ should have features that are not yet in the latest release version
marked as such.
Choose to get Solr through source control even if you are going to stick
with the last official release. When/if you make changes to Solr, it will
then be easier to see what those differences are. Switching to a different
release becomes much easier too.
We're going to get the code through Subversion and check out the trunk (a source
control term for the latest code). If you are using an IDE or some GUI tool for
subversion, then feel free to use that. The command line will suffice too. You should
be able to successfully execute the following:
svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr_svn
That will result in Solr being checked out into the solr_svn directory. If you prefer
one of the official releases, then use one of the following URLs, instead of the one
above: http://svn.apache.org/repos/asf/lucene/solr/tags/ (put that into
your web browser to see the choices). So called nightlies are also available if you
don't want to use Subversion but want recent code.
Testing and building Solr
If you prefer a downloadable pre-built Solr, instead of using Subversion, then you
can skip this section.
Ant basics
Apache Ant is a cross-platform build scripting tool specified with
XML. It is largely Java oriented. An Ant script is assumed to be
named build.xml in the root of a project. It contains a set of named
ant targets that you can run. In order to list them while including
description, type ant -p to get a nice report. In order to run a target,
simply supply it to ant as the first argument such as ant compile.
Targets often internally invoke other targets, and you'll see this in
the output. In the end, ant should report BUILD SUCCESSFUL if
successful and BUILD FAILED if not. Note that ant's use of the term
'build' is universal in ant, even if 'build' is not an apt description of
what a target performed.
Testing and building Solr is easy. Before we build Solr, we're going to test it first
to ensure that there are no failing tests. Simply execute the test target in Solr's
installation directory like ant test. That should have executed without any errors.
On my old machine, it took about ten minutes to run. If there were errors (extremely
rare), then you'll have to switch to a different version or wait shortly for it to be fixed.
Now to build a ready-to-install Solr, just type ant dist. This is going to fill the dist
directory with some JAR files and a WAR file. If you are not familiar with Java, these
files are a packaging mechanism for compiled code and related resources. These files
are technically ZIP files but with a different file extension, and so you can use any
ZIP file tools to view their contents. The most important one is the WAR file which
we'll be using next.
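Because JAR and WAR files are ZIP files under a different extension, any ZIP library can read them. The following sketch uses Python's zipfile module to list an archive's contents. To keep it runnable without a Solr build, it first creates a tiny stand-in file named demo.war in a temporary directory; that file and the list_archive function are made up for illustration, and you would pass the path of the real WAR in the dist directory instead.

```python
import os
import tempfile
import zipfile


def list_archive(path):
    """Return the entry names in a JAR/WAR/ZIP file."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Build a stand-in WAR (hypothetical content) so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    war_path = os.path.join(tmp, "demo.war")
    with zipfile.ZipFile(war_path, "w") as war:
        war.writestr("WEB-INF/web.xml", "<web-app/>")
    print(list_archive(war_path))  # ['WEB-INF/web.xml']
```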
Solr's installation directory structure
In this section, we'll orient you to Solr's directory structure. This is not Solr's home
directory, but a different place that we'll mention after this.
build: Only appears after Solr is built to house compiled code before being
packaged. You won't need to look in here.
client: Contains convenient language-specific APIs for talking to Solr
as an alternative to using your own code to send XML over HTTP. As of
this writing, this only contains a couple of Ruby choices. The Java client
called SolrJ is actually in src/solrj. More information on using clients to
communicate with Solr is in Chapter 8.
dist: The built Solr JAR files and WAR file are here, as well as the
dependencies. This directory is created and filled when Solr is built.
example: This is an installation of the Jetty servlet engine (a Java web server)
including some sample data and Solr configuration. The interesting child
directories are:
example/etc: Jetty's configuration. Among other things, here
you can change the web port used from the pre-supplied 8983
to 80 (HTTP default).
example/multicore: Houses multiple Solr home directories in
a Solr multicore setup. This will be discussed in Chapter 7.
example/solr: A Solr home directory for the default setup
that we'll be using.
example/webapps: Solr's WAR file is deployed here.
lib: All of Solr's API dependencies. The larger pieces are Lucene, some
Apache commons utilities, and Stax for efficient XML processing.
site: This is for managing what is published on the Solr web site. You won't
need to go in here.
src: Various source code. It's broken down into a few notable directories:
src/java: Solr's source code, written in Java.
src/scripts: Unix bash shell scripts, particularly useful
in larger production deployments employing multiple
Solr servers.
src/solrj: Solr's Java client.
src/test: Solr's test source code and test files.
src/webapp: Solr's web administration interface, including
Java Servlets (source code form) and JSPs. This is mostly what
constitutes the WAR file. The JSPs for the admin interface
are under here in web/admin/, if you care to tweak any to
your needs.
If you are a Java developer, you may have noticed that the Java source in Solr
is not located in one place. It's in src/java for the majority of Solr, src/common
for the parts of Solr that are common to both the server side and Solrj client side,
src/test for the test code, and src/webapp/src for the servlet-specific code.
I am merely pointing this out to help you find code, not to be critical. Solr's files
are well organized.
Solr's home directory
A Solr home directory contains Solr's configuration and data (a Lucene index) for a running Solr instance. Solr includes a sample one at example/solr, which we'll be using in-place throughout most of the book. Technically, example/multicore is also a valid Solr home but for a multi-core setup, which will be discussed much later. You know you're looking at a Solr home directory when it contains either a solr.xml file (formerly multicore.xml in Solr 1.3), or if it contains both a conf and a data directory, though strictly speaking these might not be the actual requirements.
data might not yet be present because you haven't started Solr yet, which will create it if it's not present and assuming it's not configured to be named differently.
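The rule of thumb above for recognizing a Solr home can be written down as a short check. This is only a sketch of that heuristic, not Solr's own logic; the looks_like_solr_home function is hypothetical, and as noted, a home that has never been started may lack the data directory and would be reported as False here.

```python
from pathlib import Path


def looks_like_solr_home(path):
    """Heuristic from the text: a solr.xml file, or both conf and data directories."""
    home = Path(path)
    if (home / "solr.xml").is_file():
        return True
    return (home / "conf").is_dir() and (home / "data").is_dir()


# Path taken from the text; the result depends on where you run this.
print(looks_like_solr_home("example/solr"))
```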
Solr's home directory is laid out like this:
bin: Suggested directory to place Solr replication scripts, if you have a more advanced setup.
conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files, which are referenced by these two files for different things such as special text analysis steps.
conf/schema.xml: This is the schema for the index including field type definitions with associated analyzer chains.
conf/solrconfig.xml: This is the primary Solr configuration file.
conf/xslt: This directory contains various XSLT files that can be used to transform Solr's XML query responses into formats such as Atom/RSS.
data: Contains the actual Lucene index data. It's binary data, so you won't be doing anything with it except perhaps deleting it occasionally.
lib: Optional placement of extra Java JAR files that Solr will load on startup, allowing you to externalize plugins from the Solr distribution (the WAR file) for convenience. If you extend Solr without modifying Solr itself, then those modifications can be deployed in a JAR file here.
It's really important to know how Solr finds its home directory. This is covered next.
How Solr finds its home
In the next section, you'll start Solr. When Solr starts up, about the first thing it does is load its configuration from its home directory. Where that is exactly can be specified in several different ways.
Solr first checks for a Java system property named solr.solr.home. There are a few ways to set a Java system property, but a universal one, no matter which servlet engine you use, is through the command line where Java is invoked. You could explicitly set Solr's home like so when you start Jetty: java -Dsolr.solr.home=solr/ -jar start.jar
Alternatively, you could use Java Naming and Directory Interface (JNDI) to bind the directory path to java:comp/env/solr/home. As with Java system properties, there are multiple ways to do this. Some are app-server dependent, but a universal one is to add the following to the WAR file's web.xml located in src/webapp/web/WEB-INF (you'll find this there already but commented out):
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>solr/</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
As this is a change to web.xml, you'll need to re-run ant dist-war to repackage it, and only then you'll redeploy it. Doing this with Jetty supplied with Solr is insufficient because JNDI itself isn't set up. I'm not going to get into this further, because if you know what JNDI is and want to use it, then you'll surely figure out how to do it for your particular app-server.
Finally, if Solr's home isn't configured as a Java system property or through JNDI, then it defaults to solr/. In the examples above, I used that particular path too. We're going to simply stick with this path for the rest of this book, because this is a development, not production, setting.
In a production environment, you will almost certainly configure Solr's home rather than let it fall back to the default solr/. You will also probably use an absolute path instead of a relative one, which wouldn't work if you accidentally start your app-server from a different directory.
When troubleshooting setting Solr's home, be sure to look at the very first Solr log messages when Solr starts:
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir
INFO: Solr home defaulted to 'null' (could not find system property or JNDI)
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir
INFO: Solr home set to 'solr/'
This shows that Solr was left to default to solr/. You'll see this output when you start Solr, as described in the next section.
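When troubleshooting, it can be handy to pull the effective home out of captured startup output automatically. The sketch below scans log lines for the two messages shown above. It assumes the message wording is exactly as in that sample, which may differ between Solr versions; the solr_home_from_log function is hypothetical.

```python
import re

# Patterns based on the sample log messages in the text (wording may vary by version).
HOME_SET = re.compile(r"Solr home set to '([^']*)'")
HOME_DEFAULTED = re.compile(r"Solr home defaulted to")


def solr_home_from_log(lines):
    """Return (home, defaulted): the reported home path and whether Solr fell back to a default."""
    home = None
    defaulted = False
    for line in lines:
        if HOME_DEFAULTED.search(line):
            defaulted = True
        match = HOME_SET.search(line)
        if match:
            home = match.group(1)
    return home, defaulted


log = """\
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir
INFO: Solr home defaulted to 'null' (could not find system property or JNDI)
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir
INFO: Solr home set to 'solr/'
"""
print(solr_home_from_log(log.splitlines()))  # ('solr/', True)
```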
Deploying and running Solr
The file we're going to deploy is the file ending in .war in the dist directory (dist/apache-solr-1.4.war). The WAR file in particular is important, because this single file represents an entire Java web application. It includes Solr's JAR file, all of Solr's dependencies (which amount to other JAR files), Java Server Pages (JSPs) (which are rendered to a web browser when the WAR is deployed), and various configuration files and other web resources. It does not include Solr's home directory, however.
How one deploys a WAR file to a Java servlet engine depends on that servlet engine, but it is common for there to be a directory named something like webapps, which contains WAR files optionally in an expanded form. By expanded, I mean that the WAR file may be uncompressed and thus a directory by the same name. This can be a convenient deployed form in order to make changes in-place (such as to JSP files and static web files) without requiring rebuilding a WAR file and replacing an existing one. The disadvantage is that changes are not directly tracked by source control (example: Subversion). Another thing to note about the WAR file is that by convention, its name (without the .war extension, if present) is the path portion of the URL where the web server mounts the web application. For example, if you have an apache-solr-1.4.war file, then you would access it at http://localhost:8983/apache-solr-1.4/, assuming it's on the local machine and running at that default port.
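The naming convention just described, where the WAR file's name minus its extension becomes the URL path, is simple enough to express as a function. This is a minimal sketch; webapp_url is a hypothetical helper, and the host and port defaults are those of the example Jetty setup.

```python
import os


def webapp_url(war_name, host="localhost", port=8983):
    """Derive the conventional URL for a deployed WAR file or expanded directory."""
    base = os.path.basename(war_name)
    context = base[:-4] if base.endswith(".war") else base
    return f"http://{host}:{port}/{context}/"


print(webapp_url("apache-solr-1.4.war"))  # http://localhost:8983/apache-solr-1.4/
print(webapp_url("solr.war"))             # http://localhost:8983/solr/
```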
We're going to deploy this WAR file into the Jetty servlet engine included with Solr. If you are using a pre-built downloaded Solr distribution, then Solr is already deployed into Jetty as solr.war. Solr has an ant target that does this (and some other things we don't care about) called example, so you can simply run it like ant example. This target didn't keep the original WAR filename when copying it. It abbreviated it to simply solr.war. This means that the URL path is just solr. By the way, because ant targets generally call other necessary ant targets, it was technically not necessary to run ant dist earlier in order for this step to work. This would not have run the tests, however.
Now we're going to start up Jetty and finally see Solr running (albeit without any
data to query yet). First go to the example directory, and then run Jetty's start.jar
file by typing the following command:
cd example
java -jar start.jar
You'll see about a page of output including references to Solr. When it is finished, you should see this output at the very end of the command prompt:
2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
The 0.0.0.0 means it's listening to connections from any host (not just localhost, notwithstanding potential firewalls) and 8983 is the port. If Jetty reports this, then it doesn't necessarily mean that Solr was deployed successfully. You might see an error such as a stack trace in the output, if something went wrong. Even if it did go wrong, you should be able to access the web server at this address:
http://localhost:8983. It will show you a list of links to web applications which will just be Solr for this setup. Solr should have this link: http://localhost:8983/solr, and if you go there, then you should either see details about an error if Solr wasn't loaded correctly, or a simple page with a link to Solr's admin page, which should be http://localhost:8983/solr/admin/. You'll be visiting that link often.
To quit Jetty (and many other command line programs for that
matter), hit Ctrl-C on the keyboard.
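If you want a script to confirm that the admin page mentioned above is reachable, a plain HTTP request is enough. The following is a minimal sketch using only Python's standard library; admin_url and is_responding are hypothetical helpers, and treating anything other than an HTTP 200 as "not responding" is a simplifying assumption.

```python
import urllib.request


def admin_url(host="localhost", port=8983, webapp="solr"):
    """Build the admin page URL for the example setup."""
    return f"http://{host}:{port}/{webapp}/admin/"


def is_responding(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    url = admin_url()
    print(url, "is up" if is_responding(url) else "is not responding")
```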
A quick tour of Solr!
Start up Jetty if it isn't already up and point your browser to the admin URL:
http://localhost:8983/solr/admin/, so that we can get our bearings on this interface that is not yet familiar to you. We're not going to discuss any page in any depth at this point.
This part of Solr is somewhat rough and is subject to change more than any other part of Solr.
The top gray area in the previous screenshot is a header that is on every page. When you start dealing with multiple Solr instances (development machine versus production, multicore, Solr clusters), it is important to know where you are. The IP and port are obvious. The (example) is a reference to the name of the schema. That's just a simple label at the top of the schema file to name the schema. If you have multiple schemas for different data sets, then this is a useful differentiator. Next is the current working directory cwd, and Solr's home.
The block below this is a navigation menu to the different admin screens and
configuration data. The navigation menu is explained as follows:
SCHEMA: This downloads the schema configuration file (XML) directly to the browser.
Firefox conveniently displays XML data with syntax highlighting. Safari, on the other hand, tries to render it and the result is unusable. Your mileage will vary depending on the browser you use. You can always use your browser's view source command if needed.
CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.
ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.
SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.
STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.
INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.
DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations. More information on this is in Chapter 9.
PING: Ignore this, although it can be used for a health-check in distributed mode.
LOGGING: This allows you to adjust the logging levels for different parts of
Solr at runtime. For Jetty as we're running it, this output goes to the console
and nowhere else.
Solr uses SLF4J for its logging, which in Solr is configured by default to use Java's built-in logging (that is, JUL or JDK14 Logging). If you're more familiar with another framework like Log4J, then you can switch to it by simply removing the slf4j-jdk14 JAR file and adding slf4j-log4j12 (not included). If you're using Solr 1.3, then you're stuck with JUL.
JAVA PROPERTIES: It lists Java system properties.
THREAD DUMP: This displays a Java thread dump useful for experienced Java developers in diagnosing problems.
After the main menu is the Make a Query text box where you can type in a simple query. There's no data in Solr yet, so there's no point trying that right now.
FULL INTERFACE: As you might guess, it brings you to a form with more options, especially useful when diagnosing query problems or if you forget what the URL parameters are for some of the query options. The form is still very limited, however, and only allows a fraction of the query options that you can submit to Solr.
Finally, the bottom Assistance area contains useful information for Solr online. The
last section of this chapter has more information on such resources.
Loading sample data
Solr happens to come with some sample data and a loader script, found in the example/exampledocs directory. We're going to use that, but just for the remainder of this chapter so that we can explore Solr more without getting into schema decision making and deeper data loading options. For the rest of the book, we'll base the examples on the supplemental files, which are provided online.
Firstly, ensure that Solr is running. You should assume that it is always in a running state throughout this book to follow any example. Now go into the example/exampledocs directory, and run the following:
exampledocs$ java -jar post.jar *.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
SimplePostTool: POSTing file ipod_video.xml
SimplePostTool: POSTing file vidcard.xml
SimplePostTool: COMMITting Solr index changes..
Or if you are using a Unix-like environment, you have the option of using the post.sh shell script, which behaves similarly. The command above invokes the Java program embedded in post.jar with each file in the current directory ending in .xml. post.jar is a simple program that iterates over each argument given (a file reference), and HTTP posts it to Solr running on the current machine at the example server's default configuration (being http://localhost:8983/solr/update). I recommend examining the contents of the post.sh shell script for illustrative purposes. As seen above, the command will mention the files it is sending. Finally it will send a commit command, which will cause the documents that were posted since the previous commit to be saved and visible.
The post.sh and post.jar programs could theoretically be used in a production scenario, but they are intended just for demonstration of the technology with the example data.
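To see that there is nothing magical about these programs, here is a rough Python equivalent of what they are described as doing: POST each XML file to the update URL and then POST a commit. It is a sketch under assumptions, not a port of post.jar; the function names are hypothetical and the Content-Type header value is my assumption. Nothing is sent until you call post_files yourself with Solr running.

```python
import urllib.request

# Example server default, as given in the text.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=UPDATE_URL):
    """Build an HTTP POST carrying an XML update message (assumed content type)."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def post_files(paths, url=UPDATE_URL):
    """POST each UTF-8 XML file, then send a separate <commit/> message."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            request = build_update_request(handle.read(), url)
        urllib.request.urlopen(request).close()
    urllib.request.urlopen(build_update_request("<commit/>", url)).close()


# Example call (requires a running Solr): post_files(["hd.xml", "monitor.xml"])
```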
Let's take a look at one of these documents like monitor.xml:
<add>
  <doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <field name="manu">Dell, Inc.</field>
    <field name="cat">electronics</field>
    <field name="cat">monitor</field>
    <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
    <field name="includes">USB cable</field>
    <field name="weight">401.6</field>
    <field name="price">2199</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
  </doc>
</add>
The schema for the XML files that are posted to Solr is very simple. This one here doesn't demonstrate all of it, but this is most of what matters. Multiple documents (represented by the <doc> tag) can be present in series within the <add> tag, which is recommended in bulk data loading scenarios for performance. Remember that Solr gets a <commit/> tag sent to it in a separate POST. This syntax and command-set may very well be all that you use. More about these options and other data loading choices will be discussed in Chapter 3.
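If your data lives in a program rather than in files, you can generate this XML instead of writing it by hand. The sketch below builds an <add> message from a list of dictionaries using Python's standard library, repeating the <field> tag for multi-valued fields such as cat. The build_add_xml function is hypothetical, and values are converted with str(), so booleans would need to be passed as the strings "true" or "false".

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts into a Solr <add> message; lists become repeated fields."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


xml_text = build_add_xml([
    {"id": "3007WFP", "cat": ["electronics", "monitor"], "price": 2199},
])
print(xml_text)
```

Using an XML library rather than string concatenation means special characters in field values are escaped for you.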
A simple query
On the main admin page, let's run a simple query searching for monitor.
When using Solr's search form, don't hit the return key. It would be nice if it submits the form, but it adds a carriage return to the search box instead. If you leave this carriage return there and hit Search, then you'll get an error. Perhaps this will be fixed at some point.
Before we go over the XML output, I want to point out the URL and its parameters, which you will become very familiar with:
http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on
The form (whether the basic one or the Full Interface one) simply constructs a URL with appropriate parameters, and your browser sees the XML results. It is convenient to use the form at first, but then subsequently make direct modifications to the URL in the browser instead of returning to the form. The form only controls a basic subset of all possible parameters. The main benefit to the form is that it applies the URL escaping for special characters in the query, and for some basic options, you needn't remember what the parameter names are.
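Once you move from the form to your own client code, you take over the job of URL escaping. As a rough sketch, Python's standard urlencode function can assemble the same URL that the form produced. The helper name build_query_url is my own, and the parameter values are simply those from the URL shown earlier.

```python
from urllib.parse import urlencode

# Select URL of the example server, as seen in the browser's address bar.
SELECT_URL = "http://localhost:8983/solr/select/"


def build_query_url(q, start=0, rows=10, **extra):
    """Build a query URL; urlencode escapes special characters in the values."""
    params = {"q": q, "version": "2.2", "start": start, "rows": rows, "indent": "on"}
    params.update(extra)  # any further parameters the caller wants to pass
    return SELECT_URL + "?" + urlencode(params)


print(build_query_url("monitor"))
# A phrase query: the quotes and the space must be escaped in the URL.
print(build_query_url('"USB cable"'))
```

The second call shows why escaping matters: the quotation marks and the space in the phrase would otherwise produce an invalid URL.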
Solr's search results from its web interface are in XML. As suggested earlier, you'll probably find that using the Firefox web browser provides the best experience due to the syntax coloring. Internet Explorer displays XML content well too. If you, at some point, want Solr to return a web page to your liking or an alternative XML structure, then that will be covered later. Here is the XML response with my comments:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="indent">on</str>
<str name="rows">10</str>
<str name="start">0</str>
<str name="q">monitor</str>
<str name="version">2.2</str>
</lst>
</lst>
The first section of the response, which precedes the <result> tag that is about to follow, indicates how long the query took (measured in milliseconds), as well as listing the parameters that define the query. Solr has some sophisticated caching, and you will find that your queries will often complete in a millisecond or less, if you've run the query before. In the params list, q is clearly your query. rows and start have to do with paging. Clearly you wouldn't want Solr to always return all of the results at once, unless you really knew what you were doing. indent indents the XML output, which is convenient for experimentation. version isn't used much, but if you start building clients that interact with Solr, then you'll want to specify the version to reduce the possibility of things breaking, if you were to upgrade Solr. These parameters in the output are convenient for experimentation but can be configured to be omitted. Next up is the most important part, the results.
<result name="response" numFound="2" start="0">
The numFound number is self-explanatory. start is the index into the query results of the first document returned in the XML. Often, you'll want to see the score of the documents. However, the very basic query performed from the front Solr page doesn't include the score, despite the fact that the results are sorted by it (Solr's default). The full interface form includes the score by default. Queries that include the score will include a maxScore attribute in the result tag. The maxScore for the query is independent of any paging, so no matter which part of the result set you've paged into (by using the start parameter), the maxScore will be the same. The content of the result tag is a list of documents that matched the query in score-sorted order. Later, we'll do some sorting by specified fields.
<doc>
<arr name="cat"><str>electronics</str><str>monitor</str></arr>
<arr name="features"><str>30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</str></arr>
<str name="id">3007WFP</str>
<bool name="inStock">true</bool>
<str name="includes">USB cable</str>
<str name="manu">Dell, Inc.</str>
<str name="name">Dell Widescreen UltraSharp 3007WFP</str>
<int name="popularity">6</int>
<float name="price">2199.0</float>
<str name="sku">3007WFP</str>
<arr name="spell"><str>Dell Widescreen UltraSharp 3007WFP</str>
</arr>
<date name="timestamp">2008-08-09T03:56:41.487Z</date>
<float name="weight">401.6</float>
</doc>
<doc>
...
</doc>
</result>
</response>
The document list is pretty straightforward. By default, Solr will list all of the stored fields, plus the score if you asked for it (we didn't in this case). Remember that not all of the fields are necessarily stored (that is, you can query on them but not store them for retrieval—an optimization choice). Notice that basic data types str, bool, date, int, and float are used. Also note that certain fields are multi-valued, as indicated by an arr tag.
This was a basic query. As you start adding more query options like faceting, highlighting, and so on, you will see additional XML following the result tag.
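To show how a client might consume this output, here is a small sketch that parses a trimmed copy of the response above with Python's standard library. The function names and the choice to keep dates as plain text are my own simplifications. A real client would more likely use an existing Solr client library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (only one of the two documents).
SAMPLE = """<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<arr name="cat"><str>electronics</str><str>monitor</str></arr>
<str name="id">3007WFP</str>
<bool name="inStock">true</bool>
<int name="popularity">6</int>
<float name="price">2199.0</float>
</doc>
</result>
</response>"""

# One converter per basic data type tag used in the response.
CONVERTERS = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda s: s == "true",
    "date": str,  # kept as text here for simplicity
}


def parse_value(el):
    """Convert one field element; an arr tag becomes a list (multi-valued)."""
    if el.tag == "arr":
        return [parse_value(child) for child in el]
    return CONVERTERS[el.tag](el.text or "")


def parse_response(xml_text):
    """Extract the header values, the result attributes, and the documents."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    result = root.find("result[@name='response']")
    return {
        "status": int(header.find("int[@name='status']").text),
        "qtime_ms": int(header.find("int[@name='QTime']").text),
        "num_found": int(result.get("numFound")),
        "start": int(result.get("start")),
        "docs": [{f.get("name"): parse_value(f) for f in doc}
                 for doc in result.findall("doc")],
    }


parsed = parse_response(SAMPLE)
print(parsed["num_found"], parsed["docs"][0]["cat"])
```

Note that num_found counts all matches, while docs holds only the documents on the current page.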
Some statistics
Let's take a look at the statistics page: http://localhost:8983/solr/admin/stats.jsp. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should be 26. If you're wondering what maxDocs is and how it differs, maxDocs reports a number that is in some situations higher due to documents that have been deleted but not yet committed. That can happen either due to an explicit delete posted to Solr or by adding a document that replaces another in order to enforce a unique primary key. While you're at this page, notice that the request handler named /update has some stats too:
name: /update
class: org.apache.solr.handler.XmlUpdateRequestHandler
version: $Revision: 679936 $
description: Add documents with XML
stats:
handlerStart: 1218253728453
requests: 19
errors: 4
timeouts: 0
totalTime: 1392
avgTimePerRequest: 73.26316
avgRequestsPerSecond: 2.850955E-4
In my case, as seen above, there are some errors reported because I was fooling around, posting all of the files in the exampledocs directory, not just the XML ones. Another Solr handler name you'll want to examine is standard, which has been processing our queries.
These statistics cover only the period since Solr was last started; they are not stored to disk. As such, you cannot use them for long-term statistics.
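The two averages in the /update statistics can be checked against the raw counters. The short calculation below assumes that avgTimePerRequest is totalTime divided by requests, and that avgRequestsPerSecond is requests divided by the handler's uptime in seconds. The first assumption matches the numbers shown. The second is my reading of the figures, not something stated on the page.

```python
# Counters copied from the /update statistics shown above.
request_count = 19
total_time_ms = 1392
avg_requests_per_second = 2.850955e-4

# Assumed definition: average time is total time divided by request count.
avg_time_per_request = total_time_ms / request_count

# Assumed definition: requests per second is request count divided by uptime,
# so the uptime can be recovered by rearranging.
uptime_seconds = request_count / avg_requests_per_second

print(round(avg_time_per_request, 5))
print(round(uptime_seconds / 3600, 1), "hours since the handler started")
```

Under these assumptions, a very small avgRequestsPerSecond simply reflects a server that has been running for many hours with little traffic.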
The schema and configuration files
Solr's configuration files are extremely well documented. We're not going to go over the details here but this should give you a sense of what is where.
The schema (defined in schema.xml) contains field type definitions (defined within the <types> tag) and lists the fields that make up your schema (within the <fields> tag), each of which references a type. The schema contains other information too, such as the primary key (the field that uniquely identifies each document, a constraint that Solr enforces) and the default search field. The sample schema in Solr uses the field named text as the default search field. Confusingly, there is a field type named text too. But remember that the monitor.xml document we reviewed earlier had no field named text, right? It is common for the schema to call for certain fields to be copied to other fields, particularly fields not in input documents. So, even though the input documents don't have a field named text, there are <copyField> tags in the schema, which call for the fields named cat, name, manu, features, and includes to be copied to text. This is a popular technique to speed up queries, so that queries can search over a small number of fields rather than a long list of them. Fields used this way are rarely stored, as they are needed only for querying and so are only indexed. There is a lot more we could talk about in the schema, but we're going to move on for now.
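The schema configuration file contains the field type definitions and lists the fields along with their types. The schema also specifies the primary key and the default search field. The sample schema uses a field named text, whose field type is also named text. However, the monitor.xml document has no field named text. Such a field is a shared field into which other fields are copied (note that it is not present in the input documents). So even though the input documents have no text field, the copyField tags copy cat, name, manu, features, and includes into text. This common technique speeds up queries, because a query can search over very few fields. Fields used this way are rarely stored; they are only indexed and used for querying.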
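The effect of the copy instructions can be modeled in a few lines of Python. This is only a conceptual sketch: Solr performs the copy internally at index time, and the function below is a made-up illustration that uses the source field names listed in the text.

```python
# Source fields that the sample schema copies into text, as listed above.
COPY_SOURCES = ["cat", "name", "manu", "features", "includes"]


def apply_copy_fields(doc, sources=COPY_SOURCES, dest="text"):
    """Return a copy of doc whose dest field gathers the source field values."""
    out = dict(doc)
    copied = []
    for name in sources:
        value = doc.get(name, [])
        # A multi-valued source contributes each of its values.
        copied.extend(value if isinstance(value, list) else [value])
    out[dest] = copied
    return out


monitor = {
    "id": "3007WFP",
    "name": "Dell Widescreen UltraSharp 3007WFP",
    "manu": "Dell, Inc.",
    "cat": ["electronics", "monitor"],
    "price": 2199,
}
print(apply_copy_fields(monitor)["text"])
```

The id and price values are not copied, so in this model a search against text alone would not match on them.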
Solr's solrconfig.xml file contains lots of parameters that can be tweaked. At the moment, we're just going to take a peek at the request handlers that are defined with <requestHandler> tags. They make up about half of the file. In our first query, we didn't specify any request handler, so we got the default one. It's defined here:
Introducing request handlers
<requestHandler name="standard" class="solr.SearchHandler"
default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<!--
<int name="rows">10</int>
<str name="fl">*</str>
<str name="version">2.1</str>
-->
</lst>
</requestHandler>
When you POST commands to Solr (such as to index a document) or query Solr (HTTP GET), the request goes through a particular request handler. Handlers can be registered against certain URL paths. When we uploaded the documents earlier, the requests went to the handler defined like this:
When we POST a command or run a query, it passes through a request handler; a handler can be associated with a URL path.
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
The request handlers oriented to querying using the class solr.SearchHandler are much more interesting.
The important thing to realize about request handlers is that they are nearly completely configurable through URL parameters or POSTed form parameters. Parameters can also be specified in solrconfig.xml within lst blocks named defaults, appends, or invariants, which serve to establish defaults. More on this is in Chapter 4. This arrangement allows you to set up a request handler for a particular application that will be querying Solr without forcing the application to specify all of its query options.
This is covered in detail in Chapter 4.
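As a rough mental model of how the three kinds of blocks interact with the parameters in a request, consider the following sketch. It reflects my understanding of the usual semantics (defaults apply only when the request omits a parameter, appends add values to whatever is present, and invariants always win) and is not Solr's actual code. The function name and sample values are hypothetical.

```python
def resolve_params(request, defaults=None, appends=None, invariants=None):
    """Combine request parameters with handler-level settings.

    Every argument maps a parameter name to a list of values.
    """
    # Start from the handler's defaults.
    resolved = {k: list(v) for k, v in (defaults or {}).items()}
    # A value supplied in the request replaces the default.
    for k, v in request.items():
        resolved[k] = list(v)
    # Appended values are added to whatever is already there.
    for k, v in (appends or {}).items():
        resolved.setdefault(k, []).extend(v)
    # Invariants replace anything the request tried to set.
    for k, v in (invariants or {}).items():
        resolved[k] = list(v)
    return resolved


resolved = resolve_params(
    {"q": ["monitor"], "rows": ["5"], "fq": ["cat:monitor"], "wt": ["json"]},
    defaults={"rows": ["10"], "echoParams": ["explicit"]},
    appends={"fq": ["inStock:true"]},
    invariants={"wt": ["xml"]},
)
print(resolved)
```

In this model the application only needs to send q. Everything else can be fixed once in the handler's configuration.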
The standard request handler defined previously doesn't really define any defaults other than the parameters that are to be echoed in the response. Remember its presence at the top of the XML output? By changing explicit to none you can have it omitted, or use all and you'll potentially see more parameters, if other defaults happened to be configured in the request handler. This parameter can alternatively be specified in the URL through echoParams=none. Remember to separate URL parameters with ampersands.
Solr resources outside this book
The following are some prominent Solr resources that you should be aware of:
Solr's Wiki: http://wiki.apache.org/solr/ has a lot of great documentation and miscellaneous information. For a Wiki, it's fairly organized too. In particular, if you are going to use a particular app-server in production, then there is probably a Wiki page there on specific details.
Within the Solr installation, you will also find that there are README.txt files in many directories within Solr and that the configuration files are very well documented.
Solr's mailing lists contain a wealth of information. If you have a few discriminating keywords then you can find nuggets of information in there with a search engine. The mailing lists of Solr and other Lucene sub-projects are best searched at: http://www.lucidimagination.com/search/ or Nabble.com.
It is highly recommended to subscribe to the Solr-users mailing list. You'll learn a lot and potentially help others too.
Solr's issue tracker, a JIRA installation at http://issues.apache.org/jira/browse/SOLR, contains information on enhancements and bugs. Some of the comments for these issues can be extensive and enlightening. JIRA also uses a Lucene-powered search.
Notation convention: Solr's JIRA issues are referenced like this: SOLR-64. You'll see such references in this book and elsewhere. You can easily look these up at Solr's JIRA. You may also see issues for Lucene that follow the same convention, for example, LUCENE-1215.
Summary
This completes a quick introduction to Solr. In the ensuing chapters, you're really going to get familiar with what Solr has to offer. I recommend you proceed in order from the next chapter through Chapter 6, because these build on each other and expose nearly all of the capabilities in Solr. These chapters are also useful as a reference to Solr's features. You can of course skip over sections that are not interesting to you. Chapter 8 is one you might peruse at any time, as it may have a section particularly applicable to your Solr usage scenario.
Accompanying the book at PACKT's web site are both source code and data to be indexed by Solr. In order to try out the same examples used in the book, you will have to download this material and run the provided ant task, which prepares it for you. This first chapter is the only one that is not based on that supplemental content.