
Solr in Action

©Manning Publications Co. Please post comments or corrections to the Author Online forum:
http://www.manning-sandbox.com/forum.jspa?forumID=828

MEAP Edition
Manning Early Access Program
Solr in Action
version 6

For more information on this and other Manning titles go to
www.manning.com

PART 1: MEET SOLR
1 Introduction to Solr
2 Getting to know Solr
3 Key Solr concepts
4 Configuring Solr
5 Indexing
6 Text analysis
PART 2: CORE SOLR CAPABILITIES
7 Performing queries & handling results
8 Faceted search
9 Hit highlighting
10 Search suggestions
11 Result Grouping / Field Collapsing
12 Taking Solr to production
PART 3: TAKING SOLR TO THE NEXT LEVEL
13 Scaling Solr / SolrCloud
14 Multi-lingual Search
15 Complex data operations
16 Relevancy tuning
17 Thinking outside the box
APPENDIXES
A Building Solr from source
B Working with the Solr community

Licensed to Jiwen Zhang [email protected]
1 Introduction to Solr
This chapter covers
• Characteristics of the types of data handled by search engines
• Common search engine use cases
• Key components of Solr
• Reasons to choose Solr
• Feature overview
With fast-growing technologies like social media, cloud computing, mobile applications,
and big data, these are exciting times to be in computing. One of the main challenges facing
software architects is the need to handle massive volumes of data consumed and produced
by a huge global user base. In addition, users expect online applications to always be
available and responsive. To address the scalability and availability needs of modern web
applications, we’ve seen a growing interest in specialized, non-relational data storage and
processing technologies, collectively known as NoSQL (Not only SQL). These systems share a
common design pattern of matching the storage and processing engine to specific types of
data rather than forcing all data into the once de facto standard relational model. In other
words, NoSQL technologies are optimized to solve a specific class of problems for specific
types of data. The need to scale has led to hybrid architectures composed of a variety of
NoSQL and relational databases; gone are the days of the one-size-fits-all data processing
solution.
This book is about a specific NoSQL technology, Apache Solr, which, like its non-relational
brethren, is optimized for a unique class of problems. Specifically, Solr is a scalable,
ready-to-deploy enterprise search engine that’s optimized to search large volumes of text-centric
data and return results sorted by relevance. That was a bit of a mouthful, so let’s break the
previous statement down into its basic parts:
• Scalable: Solr scales by distributing work (indexing and query processing) to multiple
servers in a cluster
• Ready to deploy: Solr is open source, is easy to install and configure, and provides
a preconfigured example to help you get started
• Optimized for search: Solr is fast and can execute complex queries in under a second,
often in only tens of milliseconds
• Large volumes of documents: Solr is designed to deal with indexes containing
millions of documents
• Text-centric: Solr is optimized for searching natural language text, like emails, web
pages, resumes, PDF documents, and social messages like tweets or blogs
• Results sorted by relevance: Solr returns documents in ranked order based on how
relevant each document is to the user’s query
In this book, you’ll learn how to use Solr to design and implement scalable search
solutions. We’ll begin our journey by learning about the types of data and use cases Solr
supports. This will help you understand where Solr fits into the big picture of modern
application architectures and which problems Solr is designed to solve.
1.1 Why do I need a search engine?
We suspect that because you’re looking at this book, you already have an idea about why
you need a search engine. Therefore, rather than speculate on why you’re considering Solr,
we’ll get right down to the hard questions you need to answer about your data and use cases
in order to decide if a search engine is right for you. In the end, it comes down to
understanding your data and users and then picking a technology that works for both. Let’s
start by looking at the properties of data that a search engine is optimized to handle.
1.1.1 Managing text-centric data
A hallmark of modern application architectures is matching the storage and processing
engine to your data. If you’re a programmer, then you know to select the best data structure
based on how you use the data in an algorithm; for example, you don’t use a linked list when you
need fast random lookups. The same principle applies with search engines. There are four
main characteristics of data that search engines like Solr are optimized to handle:
• Text-centric
• Read-dominant
• Document-oriented
• Flexible schema
A possible fifth characteristic is having a large volume of data to deal with, that is, “big
data,” but our focus is on what makes a search engine special among other NoSQL
technologies. It goes without saying that Solr can deal with large volumes of data.
Although these are the four main characteristics of data that search engines like Solr
handle efficiently, you should think of them as rough guidelines and not strict rules. Let’s dig
into each of these characteristics to see why they’re important for search. For now, we’ll
focus on the high-level concepts and get into the “how” in later chapters.
TEXT CENTRIC
You’ll undoubtedly encounter the term “unstructured” to describe the type of data that’s
handled by a search engine. We think “unstructured” is a little ambiguous because any text
document based on human language has implicit structure. You can think of the term
“unstructured” as being from the perspective of a computer, which sees text as a stream of
characters. The character stream must be parsed using language-specific rules to extract the
structure and make it searchable, which is exactly what search engines do.
We think the label “text-centric” is more appropriate for describing the type of data Solr
handles because a search engine is specifically designed to extract the implicit structure of
text into its index to improve searching. Text-centric data implies that the text of a
document contains “information” that users are interested in finding. Of course, a search
engine also supports non-text data like dates and numbers, but its primary strength is
handling text data based on natural language.
Also, the “centric” part is important because if users aren’t interested in the information
in the text, then a search engine may not be the best solution for your problem. For
example, consider an application where employees create travel expense reports. Each
report contains a number of structured data fields like date, expense type, currency, and
amount. In addition, each expense may include a notes field where employees can provide a
brief description of the expense. This would be an example of data that contains text but
isn’t “text-centric” in that it’s unlikely that the accounting department needs to search the
notes field when generating monthly expense reports. Put simply, just because data contains
text fields doesn’t mean it’s a natural fit for a search engine.
Take a moment and think about whether your data is “text-centric.” The main
consideration for us is whether or not the text fields in your data contain information that
users will want to query. If yes, then a search engine is probably a good choice. We’ll see
how to unlock the structure in text using Solr’s text analysis in chapters 5 and 6.
READ DOMINANT
Another key aspect of data that search engines handle efficiently is that it’s read-dominant:
reads far outnumber writes. First, though, let’s be clear that Solr does allow you to update
existing documents in your index. You can think of read-dominant as meaning that
documents are read much more often than they’re created or updated. But don’t take this to
mean that you can’t write a lot of data or that you have limits on how frequently you can
write new data. In fact, one of the key features in Solr 4 is near-real-time search, which
allows you to index thousands of documents per second and have them be searchable almost
immediately.
The key point behind read-dominant data is that when you do write data to Solr, it’s
intended to be read and reread many, many times over its lifetime. You can think of a search
engine as being optimized for executing queries (a read operation), for example, as opposed
to storing data (a write operation). Also, if you have to update existing data in a search
engine often, then that could be an indication that a search engine might not be the best
solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice
when you need fast random writes to existing data.
DOCUMENT ORIENTED
To this point, we’ve used the generic label “data” but in reality, search engines work with
documents. In a search engine, a document is a self-contained collection of fields where each
field only holds data and doesn’t contain nested fields. In other words, a document in a
search engine like Solr has a flat structure and doesn’t depend on other documents. The
“flat” concept is slightly relaxed in Solr in that a field can have multiple values but fields
don’t contain sub-fields. That is, you can store multiple values in a single field but you can’t
nest fields inside of other fields.
The flat, document-oriented approach in Solr works well with data that’s already in
document format, such as a web page, blog, or PDF document, but what about modeling
normalized data stored in a relational database? In this case, you need to denormalize data
spread across multiple tables into a flat, self-contained document structure. We’ll learn how
to approach problems like this in chapter 3.
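As a rough sketch of what denormalizing can look like, the Python snippet below flattens rows from two hypothetical relational tables (listings and agents) into one self-contained document per listing. The table layouts and field names are assumptions made up for illustration; they aren't taken from Solr or from the later chapters.

```python
# Hypothetical normalized data: each listing row references an agent row by ID.
agents = {
    7: {"name": "Pat Doe", "phone": "555-0100"},
}
listings = [
    {"id": 44, "city": "Austin", "agent_id": 7, "features": ["pool", "garage"]},
]


def to_document(listing, agents):
    """Flatten one listing row plus its agent row into a single flat document."""
    agent = agents[listing["agent_id"]]
    return {
        "id": str(listing["id"]),
        "city": listing["city"],
        "features": list(listing["features"]),  # multi-valued, but still flat
        "agent_name": agent["name"],    # copied in rather than joined at query time
        "agent_phone": agent["phone"],
    }


docs = [to_document(row, agents) for row in listings]
```

Each resulting document carries everything needed to match and display it, at the cost of repeating the agent's details in every listing that agent owns.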
You also want to consider which fields in your documents must be stored in Solr and
which should be stored in another system, such as a database. Put simply, a search engine
isn’t the place to store data unless it’s useful for search or displaying results. For example, if
you have a search index for online videos, then you don’t want to store the binary video files
in Solr. Rather, large binary fields should be stored in another system, such as a content
distribution network (CDN). In general, you should store the minimal set of information for
each document needed to satisfy search requirements. This is a clear example of not treating
Solr as a general data storage technology; Solr’s job is to find videos of interest and not to
manage large binary files.
FLEXIBLE SCHEMA
The last main characteristic of search engine data is that it has a flexible schema. This
means that documents in a search index don’t need to have a uniform structure. In a
relational database, every row in a table has the same structure. In Solr, documents can
have different fields. Of course, there should be some overlap between the fields in
documents in the same index but they don’t have to be identical.
For example, imagine a search application for finding homes for rent or sale. Listings will
obviously share fields like location, number of bedrooms, and number of bathrooms, but
they’ll also have different fields based on the listing type. A home for sale would have fields
for listing price and annual property taxes, whereas a home for rent would have a field for
monthly rent and pet policy.
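To make the idea concrete, here is a small Python sketch that models the two kinds of listings as plain dictionaries and compares their fields. The field names and values are hypothetical; the only point is that both documents could live in the same index even though their fields only partly overlap.

```python
# Two documents for the same hypothetical index, with only partly shared fields.
for_sale = {
    "id": "1", "listing_type": "sale", "city": "Denver",
    "bedrooms": 3, "bathrooms": 2,
    "listing_price": 350000, "annual_property_tax": 4200,
}
for_rent = {
    "id": "2", "listing_type": "rent", "city": "Denver",
    "bedrooms": 2, "bathrooms": 1,
    "monthly_rent": 1500, "pet_policy": "cats only",
}

shared = set(for_sale) & set(for_rent)      # fields common to both listings
sale_only = set(for_sale) - set(for_rent)   # fields specific to homes for sale
rent_only = set(for_rent) - set(for_sale)   # fields specific to rentals
```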
To summarize, search engines in general and Solr in particular are optimized to handle
data having four specific characteristics: text-centric, read-dominant, document-oriented,
and flexible schema. Overall, this implies that Solr isn’t a general-purpose data storage and
processing technology, which is one of the main differentiating factors of NoSQL (not only
SQL) technologies.
The whole point of having a wide variety of options for storing and processing data is that
you don’t have to find a one-size-fits-all technology. Search engines are good at specific
things and quite horrible at others. This means in most cases, you’re going to find Solr
complements relational databases and other NoSQL technologies more than it replaces them.
Now that we’ve talked about the type of data Solr is optimized to handle, let’s think about
the primary use cases a search engine like Solr is designed to handle. These use cases are
intended to help you understand how a search engine is different than other data processing
technologies.
1.1.2 Common search engine use cases
In this section, we look at some of the things you can do with a search engine like Solr. As
with our discussion of the types of data in section 1.1.1, use these as guidelines and not
strict rules. Before we get into specifics though, you should keep in mind that the bar for
excellence in search is high. Modern users are accustomed to web search engines like Google
and Bing being fast and effective at serving modern web information needs. Moreover, most
popular websites have powerful search solutions to help people find information quickly.
When you’re evaluating a search engine like Solr and designing your search solution, make
sure you put user experience as a high priority.
BASIC KEYWORD SEARCH
It almost seems obvious to point out that a search engine supports keyword search, as that’s
its main purpose. But it’s worth mentioning because keyword search is the most typical way
users will begin working with your search solution. It’d be rare for a user to want to fill out a
complex search form initially. Because basic keyword search will be the most common way
users interact with your search engine, it stands to reason that this feature must
provide a great user experience.
In general, users want to type in a few simple keywords and get back great results. This
may sound like a simple task of matching query terms to documents but consider a few of
the issues that must be addressed to provide a great user experience:
• Relevant results must be returned quickly, within a second or less in most cases
• Spell correction in case the user misspells some of the query terms
• Auto-suggestions to save users some typing, particularly for mobile applications
• Handling synonyms of query terms
• Matching documents containing linguistic variations of query terms
• Phrase handling, that is, does the user want documents matching all words or any of
the words in a phrase
• Proper handling of queries with common words like “a,” “an,” “of,” and “the”
• Giving the user a way to see more results if the top results aren’t satisfactory
As you can see, a number of issues exist that make a seemingly simple feature hard to
implement without a specialized approach. But with a search engine like Solr, these features
come out of the box and are easy to implement. Once you give users a powerful tool to
execute keyword searches, you need to consider how to display the results. This brings us to
our next use case of ranking results based on their relevance to the user’s query.
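To give a feel for a few of the issues listed above (common words, synonyms, and linguistic variations), here is a deliberately tiny Python sketch of a text analysis step. It is not how Solr's analyzers are implemented; the stop word list, the synonym map, and the crude suffix stripping are all assumptions chosen for illustration.

```python
import re

STOP_WORDS = {"a", "an", "of", "the"}
SYNONYMS = {"purchase": "buy", "house": "home"}  # hypothetical synonym map


def stem(token):
    """Very crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase and tokenize, drop stop words, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]
```

With these rules, `analyze("Buying a house")` and `analyze("purchase a home")` both reduce to the same two terms, which is what lets differently worded queries match the same documents.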
RANKED RETRIEVAL
A search engine stands alone as a way to return “top” documents for a query. In a SQL
query to a relational database, a row either matches a query or not and results are sorted
based on one of the columns. On the other hand, a search engine returns documents sorted
in descending order by a score that indicates the strength of the match of the document to
the query. How strength of match is calculated depends on a number of factors but in
general a higher score means the document is more relevant to the query.
Ranking documents by relevancy is important for a couple of reasons. First, modern
search engines typically store a large volume of documents, often millions or billions of
documents. Without ranking documents by relevance to the query, users can become
overloaded with results without any clear way to navigate them. Second, users are more
comfortable and accustomed to getting results from other search engines using only a few
keywords. Users are impatient and expect the search engine to “do what I mean, not what I
say.” This is especially true of search solutions backing mobile applications where users on the go will
enter short queries with potential misspellings and expect it to “just work.”
To influence ranking, you can assign more weight or “boost” certain documents, fields, or
specific terms. For example, you can boost results by their age to help push newer
documents towards the top of search results. We’ll learn about ranking documents in chapter
3.
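The following Python sketch illustrates the general idea of ranked retrieval with boosting. It uses a simplified term frequency / inverse document frequency score, which we are assuming here purely for illustration; it is not Solr's actual scoring formula, and the documents and boost values are made up.

```python
import math
from collections import Counter

# Hypothetical document texts, keyed by document ID.
docs = {
    "d1": "cozy home with pool and large yard",
    "d2": "home home home near downtown",
    "d3": "downtown office space",
}


def score(query, doc_id, boost=1.0):
    """Sum of tf * idf over the query terms, times an optional document boost."""
    term_counts = Counter(docs[doc_id].split())
    total = 0.0
    for term in query.split():
        df = sum(1 for text in docs.values() if term in text.split())
        if df == 0:
            continue  # term appears in no document
        idf = math.log(1 + len(docs) / df)
        total += term_counts[term] * idf
    return total * boost


def search(query, boosts=None):
    """Return IDs of matching documents, highest score first."""
    boosts = boosts or {}
    scored = [(score(query, d, boosts.get(d, 1.0)), d) for d in docs]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]
```

Without boosts, `search("home")` ranks `d2` ahead of `d1` because the term occurs more often in `d2`; passing `boosts={"d1": 5.0}` is enough to reverse that order.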
BEYOND KEYWORD SEARCH
With a search engine like Solr, users can type in a few keywords and get back some results.
For many users, though, this is only the first step in a more interactive session where the
search results give them the ability to keep exploring. One of the primary use cases of a
search engine is to drive an information discovery session. Frequently, your users won’t
know exactly what they’re looking for and typically don’t have any idea what information is
contained in your system. A good search engine helps users narrow in on their information
needs.
The central idea here is to return some documents from an initial query, as well as some
tools to help users refine their search. In other words, in addition to returning matching
documents, you also return tools that give your users an idea of what to do next. For
example, you can categorize search results using document features to allow users to narrow
down their results. This is known as faceted search and is one of the main strengths of Solr.
We’ll see an example of faceted search for a real estate search in section 1.2. Facets are
covered in depth in chapter 8.
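Conceptually, a facet is a count of how many documents in the result set share each value of a field. The Python sketch below computes such counts over a small, hypothetical result set and then narrows the results by one facet value. It only illustrates the idea; Solr computes facets inside the engine, as chapter 8 explains.

```python
from collections import Counter

# Hypothetical result set for some query.
results = [
    {"id": "1", "home_style": "ranch", "listing_type": "sale"},
    {"id": "2", "home_style": "colonial", "listing_type": "sale"},
    {"id": "3", "home_style": "ranch", "listing_type": "foreclosure"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def refine(docs, field, value):
    """Keep only the documents whose field has the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]
```

A user interface would display the counts next to each value and call something like `refine` when the user clicks one.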
DON’T USE A SEARCH ENGINE TO DO …
Lastly, let’s consider a few use cases where a search engine wouldn’t be useful. First, search
engines are designed to return a small set of documents per query, usually 10 to 100. More
documents for the same query can be retrieved using Solr’s built-in paging support. Consider
a query that matches a million documents—if you request all of those documents back at
once, you should be prepared to wait a long time. The query itself will likely execute quickly
but reconstructing a million documents from the underlying index structure will be extremely
slow. Engines like Solr store fields on disk in a format that makes it easy to create a few
documents but slow to reconstruct many documents when generating results.
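As a minimal sketch of paging, the Python function below builds the URL for one page of results using Solr's `start` and `rows` query parameters. The host, port, and core name in `SELECT_URL` are assumptions for a local example server, and the function only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr core; adjust for your installation.
SELECT_URL = "http://localhost:8983/solr/collection1/select"


def page_url(query, page, page_size=10):
    """Build the URL for one page of results; page numbers start at 0."""
    params = {
        "q": query,
        "start": page * page_size,  # offset of the first document to return
        "rows": page_size,          # number of documents to return
        "wt": "json",
    }
    return SELECT_URL + "?" + urlencode(params)
```

Requesting a small page at a time keeps each response cheap to build, which is the access pattern the index is designed for.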
Another use case where you shouldn’t use a search engine is for deep analytics tasks that
require access to a large subset of the index. Even if you avoid the previous issue by paging
through results, the underlying data structure of a search index isn’t designed for retrieving
large portions of the index at once.
We’ve touched on this previously, but we’ll reiterate that search engines aren’t the place
for querying across relationships between documents. Solr does support querying using a
parent-child relationship, but doesn’t provide support for navigating complex relational
structures. In chapter 3, you’ll learn some techniques to adapt relational data to work with
Solr’s flat document structure.
Lastly, there’s no direct support in most search engines for document-level security, at
least not in Solr. If you need fine-grained permissions on documents, then you’ll have to
handle that outside of the search engine.
Now that we’ve seen the types of data and use cases where a search engine is the right
solution, it’s time to dig into what Solr does and how it does it from a high level. In the next
section, you’ll learn what Solr does and how it approaches important software design
principles like integration with external systems, scalability, and high-availability.
1.2 What is Solr?
In this section, we introduce the key components of Solr by designing a search application
from the ground up. This will help you understand what specific features Solr provides and
the motivation for their existence. But, before we get into the specifics of what Solr is, let’s
make sure you know what Solr isn’t:

• Solr isn’t a web search engine like Google or Bing
• Solr has nothing to do with search engine optimization (SEO) for a website
Now, imagine we need to design a real estate search web application for potential
homebuyers. The central use case for this application will be searching for homes to buy
across the United States using a web browser. Figure 1.1 depicts a screen shot from this
fictitious web application. Don’t focus too much on the layout or design of the user interface,
it’s only a quick mock-up to give us some visual context. What’s important is the type of
experience that Solr can support.

Figure 1.1 Mock-up screen shot of fictitious search application to depict Solr features
Let’s take a quick tour of the screen shot in figure 1.1 to illustrate some of Solr’s key
features. First, starting at the top-left corner and working clockwise, Solr provides powerful
features to support a keyword search box. As we discussed in section 1.1.2, providing a
great user experience with basic keyword search requires complex infrastructure that Solr
provides out of the box. Specifically, Solr provides spell checking, auto-suggesting as the
user types, synonym handling, phrase queries, and text-analysis tools to deal with linguistic
variations in query terms, such as “buying a house” or “purchase a home.”
Solr also provides a powerful solution for implementing geo-spatial queries. In figure 1.1,
matching home listings are displayed on a map based on their distance from the latitude /
longitude of the center of some fictitious neighborhood. With Solr’s geo-spatial support, you
can sort documents by geo-distance or even rank documents by geo-distance. It’s also
important that geo-spatial searches are fast and efficient to support a user interface that
allows users to zoom in and out and move around on a map.
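To illustrate what sorting by geo-distance means, the Python sketch below orders a few hypothetical listings by their great-circle distance from a center point, using the haversine formula. This is only a stand-in for the concept; it says nothing about how Solr implements or optimizes geo-spatial search, and the coordinates are made up.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in kilometers
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


center = (30.2672, -97.7431)  # hypothetical neighborhood center (lat, lon)
listings = [
    {"id": "far", "lat": 30.5083, "lon": -97.6789},
    {"id": "near", "lat": 30.2700, "lon": -97.7400},
]

# Closest listings first.
by_distance = sorted(
    listings, key=lambda d: haversine_km(center[0], center[1], d["lat"], d["lon"])
)
```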
Once the user performs a query, the results can be further categorized using Solr’s
faceting support to show features of the documents in the result set. Facets are a way to
categorize the documents in a result set in order to drive discovery and query refinement. In
figure 1.1, search results are categorized into three facets for features, home style, and
listing type.
Now that we have a basic idea of the type of functionality we need to support our real
estate search application, let’s see how we’d implement these features with Solr. To begin,
we need to know how Solr matches home listings in the index to queries entered by users,
as this is the basis for all search applications.
1.2.1 Information retrieval engine
Solr is built on Apache Lucene, a popular Java-based open source information retrieval
library. We’ll save a detailed discussion of what information retrieval is for chapter 3. For
now, we’ll touch on the key concepts behind information retrieval starting with the formal
definition taken from one of the prominent academic texts on modern search concepts:
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
Introduction to Information Retrieval, Manning et al., 2008, p. 1
In the case of our example real estate application, the user’s primary information need is
finding a home to purchase based on location, home style, features, and price. Our search
index will contain home listings from across the U.S., which definitely qualifies as a “large
collection.” In a nutshell, Solr uses Lucene to provide the core infrastructure for indexing
documents and executing searches to find documents.
Under the covers, Lucene is a Java-based library for building and managing an inverted
index, which is a specialized data structure for matching query terms to text-based
documents. Figure 1.2 provides a simplified depiction of a Lucene inverted index for our
example real estate search application.

Figure 1.2 The key data structure supporting information retrieval is the inverted index
You’ll learn all about how an inverted index works in chapter 3. For now, it’s sufficient to
review figure 1.2 to get a feel for what happens when a new document (#44 in the diagram)
is added to the index and how documents are matched to query terms using the inverted
index.
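For readers who prefer code to diagrams, here is a minimal Python sketch of an inverted index: a mapping from each term to the IDs of the documents that contain it. The document texts are invented for this example (they aren't the ones shown in figure 1.2), and a real Lucene index stores much more than this, but the lookup idea is the same.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return IDs of documents containing every query term (an AND query)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical documents; adding document 44 simply adds its ID to each term's list.
docs = {1: "new home for sale", 2: "home with pool", 44: "new pool home"}
index = build_index(docs)
```

Answering a query never requires scanning the document texts; it only requires looking up each query term and intersecting the resulting ID lists.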
You might be thinking that a relational database could easily return the same results
using a SQL query, which is true for this simple example. But one key difference between a
Lucene query and a database query is that in Lucene, results are ranked by their relevance
to a query and database results can only be sorted by one of the table columns. In other
words, ranking documents by relevance is a key aspect of information retrieval and helps
differentiate it from other types of search.
Building a web-scale inverted index
It might surprise you that search engines like Google also use an inverted index for
searching the web. In fact, the need to build a web-scale inverted index led to the
invention of MapReduce.
MapReduce is a programming model that distributes large-scale data processing
operations across a cluster of commodity servers by formulating an algorithm into two
phases: map and reduce. With roots in functional programming, MapReduce was adapted
by Google for building its massive inverted index to power web search. Using MapReduce,
the map phase emits each unique term along with the ID of the document where the term occurs. In the
reduce phase, terms are sorted so that all term / docID pairs are sent to the same
reducer process for each unique term. The reducer sums up all term frequencies for each
term to generate the inverted index.
Apache Hadoop provides an open source implementation of MapReduce and is used by
the Apache Nutch open source project to build a Lucene inverted index for web-scale
search. A thorough discussion of Hadoop and Nutch is beyond the scope of this book,
but we encourage you to investigate these projects if you need to build a large-scale
search index.
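The map and reduce phases described in this sidebar can be imitated in a few lines of Python. The sketch below runs in a single process, so it shows only the shape of the algorithm and none of the distribution that Hadoop provides; the function names and sample documents are our own.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per term occurrence in the document."""
    return [(term, doc_id) for term in text.lower().split()]


def reduce_phase(term, doc_ids):
    """Collapse all pairs for one term into {doc_id: term frequency}."""
    freqs = defaultdict(int)
    for doc_id in doc_ids:
        freqs[doc_id] += 1
    return term, dict(freqs)


def build_index_mapreduce(docs):
    """Run map, sort by term, then reduce each group of pairs."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    pairs.sort()  # stands in for the sort step that groups pairs by term
    return dict(
        reduce_phase(term, [doc_id for _, doc_id in group])
        for term, group in groupby(pairs, key=lambda p: p[0])
    )


mr_index = build_index_mapreduce({1: "home sweet home", 2: "sweet deal"})
```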

Now that we know that Lucene provides the core infrastructure to support search, let’s
look at what value Solr adds on top of Lucene, starting with how you define how your index
is structured using Solr’s flexible schema.xml configuration document.
1.2.2 Flexible schema management
Although Lucene provides the infrastructure for indexing documents and executing queries,
what’s missing from Lucene is an easy way to configure how you want your index to be
structured. With Lucene you need to write Java code to define fields and how to analyze
those fields. Solr adds a simple, declarative way to define the structure of your index and
how you want fields to be represented and analyzed using an XML configuration document
named schema.xml. Under the covers, Solr translates the schema.xml document into a
Lucene index. This saves you programming and makes your index structure easier to
understand and communicate to others. At the same time, a Solr-built index is 100%
compatible with a programmatically built Lucene index.
Solr also adds a few nice constructs on top of the core Lucene indexing functionality.
Specifically, Solr provides copy fields and dynamic fields. Copy fields provide a way to take
the raw text contents of one or more fields and have them applied to a different field.
Dynamic fields allow you to apply the same field type to many different fields without
explicitly declaring them in schema.xml. This is useful for modeling documents that have
many fields. We cover schema.xml in depth in chapters 5 and 6.
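The Python sketch below mimics the behavior of dynamic fields and copy fields so you can see what they accomplish. It is a toy model, not Solr code or schema.xml syntax: the suffix patterns, type names, and the `text_all` destination field are all hypothetical choices for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in the spirit of dynamic fields: name pattern -> field type.
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "int"), ("*_t", "text")]
# Hypothetical copy rules: source field -> catch-all destination field.
COPY_FIELDS = [("title", "text_all"), ("description", "text_all")]


def field_type(name):
    """Return the type of the first matching pattern, or None if nothing matches."""
    for pattern, ftype in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return ftype
    return None


def apply_copy_fields(doc):
    """Return a copy of doc with source values appended to their destination field."""
    out = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out[dest] = out.get(dest, []) + [doc[source]]
    return out
```

Here a field such as `pet_policy_s` gets a type without being declared individually, and the raw text of `title` and `description` ends up in one combined field that a single query can search.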
In terms of our example real estate application, it might surprise you that we can use the
Solr example server out-of-the-box without making any changes to the schema.xml. This
shows how flexible Solr’s schema support is, because the example Solr server is designed to
support product search but works fine for our real estate search example.
At this point, we know that Lucene provides a powerful library for indexing documents,
executing queries, and ranking results. And, with schema.xml, you have a flexible way to
define the index structure using an XML configuration document instead of having to
program to the Lucene API. Now you need a way to access these services from the web. In the
next section, we learn how Solr runs as a Java web application and integrates with other
technologies using proven standards such as XML, JSON, and HTTP.
1.2.3 Java web application
Solr is a Java web application that runs in any modern Java Servlet engine like Jetty or
Tomcat, or a full J2EE application server like JBoss or Oracle AS. Figure 1.3 depicts major
software components of a Solr server.

Figure 1.3 Diagram of the main components of Solr 4
Admittedly, figure 1.3 is a little overwhelming at first glance. Take a moment to scan over
the diagram to get a feel for some of the terminology. Don’t worry if you’re not familiar with
all of the terms and concepts represented in the diagram. After reading this book, you
should have a strong understanding of all the concepts presented in figure 1.3.
As we mentioned in the introduction to this chapter, the Solr designers recognized that
Solr fits better as a complementary technology that works within existing architectures. In
fact, you’ll be hard-pressed to find an environment where Solr doesn’t drop right in. As we’ll
see in chapter 2, you can start the example Solr server in a couple of minutes after you
finish the download.
To achieve the goal of easy integration, Solr’s core services need to be accessible from
many different applications and languages. Solr provides simple REST-like services based on
proven standards of XML, JSON, and HTTP. As a brief aside, we avoid the “RESTful” label for
Solr’s HTTP-based API as it doesn’t strictly adhere to all REST (representational state
transfer) principles. For instance, in Solr you use HTTP POST to delete documents instead of
HTTP DELETE.
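To make the POST-for-delete point concrete, here is a minimal Python sketch that builds (but does not send) a delete request. The host, port, core name (collection1), and document ID are assumptions for illustration, and depending on your Solr version and configuration the JSON update handler may be mapped to a different path such as /update/json.

```python
import json
import urllib.request

# Assumed location of a local example server and core; adjust for your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_delete_request(doc_id, update_url=SOLR_UPDATE_URL):
    """Build (without sending) an HTTP POST asking Solr to delete one document."""
    body = json.dumps({"delete": {"id": doc_id}}).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # the delete command travels in the body of a POST
    )


request = build_delete_request("listing-42")  # hypothetical document ID
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to a running server; the sketch stops short of that so it can be read without one.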
A REST-like interface is nice as a foundation, but developers often like to have
access to a client library in their language of choice to abstract away some of the boilerplate
machinery of invoking a web service and processing the response. The good news here is
that most popular languages have a Solr client library including Python, PHP, Java, .NET, and
Ruby.
1.2.4 Multiple indexes in one server
One hallmark of modern application architectures is the need for flexibility in the face of
rapidly changing requirements. One of the ways Solr helps this situation is that you don’t
have to do all things in Solr with one index, because Solr supports running multiple cores in
a single engine. In figure 1.3, we’ve depicted multiple cores as separate layers all running in
the same Java web application environment.
Think of each core as a separate index with its own configuration; there can be many cores in
a single Solr instance. This allows you to manage multiple cores from one server so that you
can share server resources and administration tasks like monitoring and maintenance. Solr
provides an API for creating and managing multiple cores.
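As a rough sketch of what a call to the core management API can look like, the following Python builds a URL for Solr's CoreAdmin handler with a CREATE action. The base URL, the core name "land", and its instance directory are assumptions chosen to match the real estate example; nothing is sent to a server.

```python
from urllib.parse import urlencode


def core_create_url(name, instance_dir, base_url="http://localhost:8983/solr"):
    """Return a CoreAdmin URL that would create a new core (assumed local server)."""
    params = urlencode(
        {"action": "CREATE", "name": name, "instanceDir": instance_dir}
    )
    return f"{base_url}/admin/cores?{params}"


# Hypothetical core for rural land listings, kept separate from home listings.
land_url = core_create_url("land", "land")
print(land_url)
```

The instance directory is expected to already contain the core's configuration, which is why the sketch takes it as an explicit argument.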
One use of Solr’s multicore support is data partitioning, such as having one core for
recent documents and another core for older documents, known as chronological sharding.
Another use of Solr’s multicore support is to support multitenant applications.
In our real estate application, we might use multiple cores to manage different types of
listings that are different enough to justify having different indexes for each. Consider real
estate listings for rural land instead of homes. Buying rural land is a different process than
buying a home in a city, so it stands to reason that we might want to manage our land
listings in a separate core.
1.2.5 Extendable (plug-ins)
Figure 1.3 depicts three main sub-systems in Solr: document management, query
processing, and text analysis. Of course, these are high-level abstractions for complex sub-
systems in Solr; we’ll learn about each later in the book. Each of these systems is composed
of a modular “pipeline” that allows you to plug in new functionality. This means that instead
of overriding the entire query-processing engine in Solr, you plug a new search
component into an existing pipeline. This makes the core Solr functionality easy to extend
and customize to meet your specific application needs.
1.2.6 Scalable
Lucene is an extremely fast search library and Solr takes full advantage of Lucene’s speed.
But regardless of how fast Lucene is, a single server will reach its limits in terms of how
many concurrent queries from different users it can handle due to CPU and I/O constraints.
As a first pass to scalability, Solr provides flexible cache management features that help
your server avoid recomputing expensive operations. Specifically, Solr comes preconfigured
with a number of caches to save expensive recomputations, such as caching the results of a
query filter. We’ll learn about Solr’s cache management features in chapter 4.
Caching only gets you so far; at some point, you’re going to need to scale out your
capacity to handle more documents and higher query throughput by adding more servers.
For now let’s focus on the two most common dimensions of scalability in Solr. First is query
throughput, which is the number of queries your engine can support per second. Even
though Lucene can execute each query quickly, it’s limited in terms of how many concurrent
requests a single server can handle. For higher query throughput, you add replicas of your
index so that more servers can handle more requests. This means if your index is replicated
across three servers, then you can handle roughly three times the number of queries per
second because each server handles one-third of the query traffic. In practice, it’s rare to
achieve perfect linear scalability, so three servers may only allow you to handle two
and a half times the query volume of one server.
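The arithmetic behind this estimate can be written down in a few lines of Python. The figure of 100 queries per second for a single server is purely an assumption for illustration; the efficiency factor of 2.5/3 corresponds to the two-and-a-half-times example in the text.

```python
def estimated_qps(single_server_qps, replicas, efficiency=1.0):
    """Estimate total query throughput for an index replicated across servers.

    efficiency is the fraction of perfect linear scaling actually achieved.
    """
    return single_server_qps * replicas * efficiency


ideal = estimated_qps(100, 3)                          # perfect linear scaling
realistic = estimated_qps(100, 3, efficiency=2.5 / 3)  # 2.5x instead of 3x
print(ideal, round(realistic, 1))
```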
The other dimension of scalability is the number of documents indexed. If you’re dealing
with large volumes of documents, then you’ll likely reach a point where you have too many
documents in a single instance and query performance will suffer. To handle more
documents, you split the index into smaller chunks called “shards” and then distribute the
searches across the shards.
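The idea of splitting an index into shards and distributing a search across them can be sketched with a toy in-memory model. This is not how Solr implements sharding; the CRC32-based routing, the document IDs, and the scores are all assumptions chosen to keep the example small and deterministic.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(docs, num_shards):
    """Distribute (doc_id, score) pairs across num_shards in-memory shards."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, score in docs:
        shards[shard_for(doc_id, num_shards)].append((doc_id, score))
    return shards


def distributed_top_n(shards, n):
    """Ask every shard for its local top n, then merge into a global top n."""
    local_tops = [heapq.nlargest(n, shard, key=lambda d: d[1]) for shard in shards]
    merged = [doc for top in local_tops for doc in top]
    return heapq.nlargest(n, merged, key=lambda d: d[1])


# Twenty hypothetical listings with distinct relevance scores.
docs = [(f"listing-{i}", i * 1.5) for i in range(20)]
shards = index_documents(docs, num_shards=4)
top3 = distributed_top_n(shards, 3)
print([doc_id for doc_id, _ in top3])
```

Each shard only has to rank its own documents, and the merge step sees at most n candidates per shard, which is what lets the work be spread over several servers.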
Scaling-out with virtualized commodity hardware
One trend in modern computing is building software architectures that can scale
horizontally using virtualized commodity hardware. Put simply, add more commodity
servers to handle more traffic. Fueling this trend towards using virtualized commodity
hardware are cloud-computing providers such as Amazon EC2. Although Solr will run on
virtualized hardware, you should be aware that search is I/O and memory intensive.
Therefore, if search performance is a top priority for your organization, then you should
consider deploying Solr on higher end hardware with high-performance disks, ideally
solid-state drives (SSD). Hardware considerations for deploying Solr are discussed in
chapter 13.

Scalability is important, but the ability to survive failures is also important for a modern
system. In the next section, we discuss how Solr handles software and hardware failures.
1.2.7 Fault tolerant
Beyond scalability, you need to consider what happens if one or more of your servers fails,
particularly if you’re planning to deploy Solr on virtualized hardware or commodity hardware.
The bottom line is that you must plan for failures. Even the best architectures and high-end
hardware will experience failures.
Let’s assume you have four shards for your index and the server hosting shard #2 loses
power. At this point, Solr can’t continue indexing documents and can’t service queries, so
your search engine is effectively “down.” To avoid this situation, you can add replicas of each
shard. In this case, when shard #2 fails, Solr reroutes indexing and query traffic to the
replica and your Solr cluster remains online. The impact of this failure is that indexing and
queries can still be processed but may not be as fast because you have one fewer server to
handle requests. We’ll discuss failover scenarios in chapter 13.
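The rerouting behavior described here can be illustrated with a toy model in Python. This is only a sketch of the idea, not Solr's actual failover mechanism; the shard names, node names, and the "first live node wins" rule are assumptions.

```python
def pick_node(shard, cluster, down):
    """Return the first live node hosting the shard, or None if none is left."""
    for node in cluster[shard]:
        if node not in down:
            return node
    return None


# Hypothetical four-shard cluster where every shard has one replica.
cluster = {
    "shard1": ["node-a", "node-e"],
    "shard2": ["node-b", "node-f"],
    "shard3": ["node-c", "node-g"],
    "shard4": ["node-d", "node-h"],
}

healthy = pick_node("shard2", cluster, down=set())
after_failure = pick_node("shard2", cluster, down={"node-b"})
no_replica_left = pick_node("shard2", cluster, down={"node-b", "node-f"})
print(healthy, after_failure, no_replica_left)
```

With a replica in place, losing node-b only shifts shard2's traffic to node-f; without one, the shard has no live node and the index as a whole is incomplete.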
At this point, you’ve seen that Solr has a modern, well-designed architecture that’s
scalable and fault-tolerant. Although these are important aspects to consider if you’ve
already decided to use Solr, you still might not be convinced that Solr is the right choice for
your needs. In the next section, we describe the benefits of Solr from the perspective of
different stakeholders, such as the software architect, system administrator, and CEO.
1.3 Why Solr?
In this section, we hope to provide you with some key information to help you decide if Solr
is the right technology for your organization. Let’s begin by addressing why Solr is attractive
to software architects.
1.3.1 Solr for the software architect
When evaluating new technology, software architects must consider a number of factors, but
chief among those are stability, scalability, and fault-tolerance. Solr scores high marks in all
three categories.
In terms of stability, Solr is a mature technology supported by a vibrant community and
seasoned committers. One thing that shocks new users to Solr and Lucene is that it isn’t
unheard of to deploy from source code pulled directly from the trunk rather than waiting for
an official release. We won’t advise you either way on whether this is acceptable for your
organization. We only point this out because it’s a testament to the depth and breadth of
automated testing in Lucene and Solr. Put simply, if you have a nightly build off trunk where
all the automated tests pass, then you can be sure the core functionality is solid.
We’ve touched on Solr’s approach to scalability and fault-tolerance in sections 1.2.6 and
1.2.7. As an architect, you’re probably most curious about the limitations of Solr’s approach
to scalability and fault-tolerance. First, you should realize that the sharding and replication
features in Solr have been rewritten in Solr 4 to be robust and easier to manage. The new
approach to scaling is called SolrCloud. Under the covers, SolrCloud uses Apache Zookeeper
to distribute configuration across a cluster of Solr nodes and to keep track of cluster state.
Here are some highlights of the new SolrCloud features in Solr:
 Centralized configuration
 Distributed indexing with no Single Point of Failure (SPoF)
 Automated fail-over to a new shard leader
 Queries can be sent to any node in a cluster to trigger a full distributed search across
all shards with fail-over and load-balancing support built-in
But this isn’t to say that Solr scaling doesn’t have room for improvement. SolrCloud is
lacking in two areas. First, not all features work in distributed mode, such as joins. Second,
the number of shards for an index is a fixed value that can’t be changed without reindexing
all of the documents. We’ll get into all of the specifics of SolrCloud in chapter 13, but we
want to make sure architects are aware that Solr scaling has come a long way in the past
couple of years yet still has some more work to do.
1.3.2 Solr for the system administrator
As a system administrator, high among your priorities in adopting a new technology like Solr
is whether it fits into your existing infrastructure. The easy answer is yes, it does. As Solr is
Java based, it runs on any OS platform that has a J2SE 6.x/7.x JVM. Out of the box, Solr
embeds Jetty, the open source Java servlet engine provided by the Eclipse Foundation. Otherwise, Solr is a
standard Java web application that deploys easily to any Java web application server like
JBoss and Oracle AS.
All access to Solr can be done via HTTP and Solr is designed to work with caching HTTP
reverse proxies like Squid and Varnish. Solr also works with JMX so you can hook it up to
your favorite monitoring application, such as Nagios.
Lastly, Solr provides a nice administration console for checking configuration settings,
statistics, issuing test queries, and monitoring the health of SolrCloud. Figure 1.4 provides a
screen shot of the Solr 4 administration console. We’ll learn more about the administration
console in chapter 2.

Figure 1.4 Screen shot of Solr 4 administration console where you can send test queries, ping the server,
view configuration settings, and see how your shards and replicas are distributed in a cluster.
1.3.3 Solr for the CEO
Although it’s unlikely that a CEO will be reading this book, here are some key talking points
about Solr in case your CEO stops you in the hall. First, executive types like to know an
investment in a technology today is going to pay off in the long term. With Solr, you can
emphasize that many companies are still running on Solr 1.4, which was released in 2009,
which means Solr has a successful track record and is constantly being improved.
Also, CEOs like technologies that are predictable. As you’ll see in the next chapter, Solr
“just works” and you can have it up and running in minutes. Another concern is what
happens if the Solr guy walks out the door—will business come to a halt? It’s true that Solr is
complex technology but having a vibrant community behind it means you have help when
you need it. And, you have access to the source code, which means if something is broken
and you need a fix, you can do it yourself. Many commercial service providers can also help
you plan, implement, and maintain your Solr installation, and many of them offer training
courses for Solr.
This may be one for the CFO, but Solr doesn’t require much initial investment to get
started. Without knowing the size and scale of your environment, we’re confident in saying
that you can start up a Solr server in a few minutes and be indexing documents quickly. A
modest server running in the cloud can handle millions of documents and many queries with
sub-second response times.
1.4 Feature overview
Finally, let’s do a quick run-down of Solr’s main features organized around the following
categories:
 User experience
 Data modeling
 New features in Solr 4
Providing a great user experience with your search solution will be a common theme
throughout this book so let’s start by seeing how Solr helps make your users happy.
1.4.1 User experience features
Solr provides a number of important features that help you deliver a search solution that’s
easy to use, intuitive, and powerful. But you should note that Solr only exposes a REST-like
HTTP API and doesn’t provide search-related UI components in any language or framework.
You’ll have to roll up your sleeves and develop your own search UI components that take
advantage of some of the following user experience features:
 Pagination and sorting
 Faceting
 Auto-suggest
 Spell checking
 Hit highlighting
 Geo-spatial search
PAGINATION AND SORTING
Rather than returning all matching documents, Solr is optimized to serve paginated requests
where only the top N documents are returned on the first page. If users don’t find what
they’re looking for on the first page, then you can request subsequent pages using simple
API request parameters. Pagination helps with two key outcomes: 1) results are returned
more quickly, because each request only returns a small subset of the entire search results,
and 2) you can track how many queries result in requests for more pages, which may be an
indication of a problem in relevance scoring. You’ll learn about paging and sorting in chapter
7.
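As a small illustration of requesting pages, the following Python sketch builds the query string for a given page using Solr's start and rows parameters. The query term, page size, and the price sort field are assumptions for the real estate example.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10, sort=None):
    """Build query-string parameters for a 1-based page of results."""
    params = {"q": query, "start": (page - 1) * page_size, "rows": page_size}
    if sort:
        params["sort"] = sort  # e.g. "price asc"; the field name is hypothetical
    return urlencode(params)


first_page = page_params("fireplace", page=1)
third_page = page_params("fireplace", page=3, sort="price asc")
print(first_page)
print(third_page)
```

The start parameter is a zero-based offset into the full result list, so page 3 with ten rows per page begins at offset 20.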
FACETING
Faceting provides users with tools to refine their search criteria and discover more
information by categorizing search results into sub-groups using facets. In our real estate
example (figure 1.1) we saw how search results from a basic keyword search were organized
into three facets: Features, Home Style, and Listing Type. Solr faceting is one of the more
popular and powerful features available in Solr; we cover faceting in depth in chapter 8.
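Conceptually, a facet is a count of how many matching documents share each value of a field. The toy Python below computes such counts over a handful of in-memory results; it is only a sketch of the idea, not how Solr computes facets, and the documents and field names are made up to mirror the real estate example.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many matching documents carry each value of the given field."""
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-element list
        counts.update(values)
    return counts


# Hypothetical results of a keyword search.
results = [
    {"id": "1", "listing_type": "For Sale", "features": ["Fireplace", "Pool"]},
    {"id": "2", "listing_type": "For Sale", "features": ["Fireplace"]},
    {"id": "3", "listing_type": "Foreclosure", "features": ["Pool", "Deck"]},
]

by_type = facet_counts(results, "listing_type")
by_feature = facet_counts(results, "features")
print(dict(by_type))
print(dict(by_feature))
```

Displaying each value next to its count is what lets users narrow a search, for example by clicking a value to restrict results to listings that have it.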
AUTO-SUGGEST
Most users will expect your search application to “do the right thing” even if they provide
incomplete information. Auto-suggest helps users by allowing them to see a list of suggested
terms and phrases based on documents in your index. Solr’s auto-suggest feature allows
users to start typing a few characters and receive a list of suggested queries as they type.
This reduces the number of incorrect queries, particularly because many users may be
searching from mobile devices with small keyboards.
Auto-suggest gives users examples of terms and phrases available in the index. Referring
back to our real estate example, as a user types “hig…” Solr’s auto-suggestion feature can
return suggestions like “highlands neighborhood” or “highlands ranch.” We cover auto-
suggest in chapter 10.
SPELL CHECKER
In the age of mobile devices and people on the go, spell-correction support is essential.
Again, users expect to be able to type misspelled words into the search box and the search
engine should handle it gracefully. Solr’s spell checker supports two basic modes:
 Auto-correct: Solr can make the spell correction automatically based on whether the
misspelled term exists in the index.
 Did you mean: Solr can return a suggested query that might produce better results so
that you can display a hint to your users, such as “Did you mean highlands?” if your user
typed in “hilands.”
Spell correction was revamped in Solr 4 to be easier to manage and maintain; we’ll see
how this works in chapter 10.
HIT HIGHLIGHTING
When searching documents that have a significant amount of text, you can display specific
sections of each document using Solr’s hit highlighting feature. This is most useful for
longer-format documents, where it helps users find relevant information by quickly
scanning highlighted sections in search results. We cover hit highlighting in chapter 9.
GEO-SPATIAL
Geographical location is a first-class concept in Solr 4 in that it has built-in support for
indexing latitude and longitude values as well as sorting or ranking documents by
geographical distance. Solr can find and sort documents by distance from a geo-location
(latitude and longitude). In the real estate example, matching listings are displayed on an
interactive map where users can zoom in/out and move the map center point to find near-by
listings using geo-spatial search.
Another exciting addition to Solr 4 is that you can index geographical shapes such as
polygons, which allows you to find documents that intersect geographical regions. This might
be useful for finding home listings in specific neighborhoods using a precise geographical
representation of a neighborhood. We cover Solr’s geo-spatial search features in chapter 14.
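To give a feel for a distance-based search request, the following Python sketch assembles parameters for a radius filter and a sort by distance, using Solr's geofilt filter and geodist function. The field name location, the coordinates, and the radius are assumptions for illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode


def nearby_params(lat, lon, radius_km, field="location"):
    """Parameters for listings within radius_km of a point, nearest first."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",       # keep only documents inside the radius
        "sfield": field,           # hypothetical latitude/longitude field
        "pt": f"{lat},{lon}",      # center point of the search
        "d": radius_km,            # radius in kilometers
        "sort": "geodist() asc",   # nearest documents first
    }


params = nearby_params(39.55, -104.97, radius_km=5)
query_string = urlencode(params)
print(query_string)
```

Moving the map center in a search UI would then amount to re-issuing the same request with a new pt value.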
1.4.2 Data modeling features
As we discussed in section 1.1, Solr is optimized to work with specific types of data. In this
section, we provide an overview of key features that help you model data for search,
including:
 Field collapsing / grouping
 Flexible query support
 Joins
 Clustering
 Importing rich document formats like PDF and Word
 Importing data from relational databases
 Multilingual support
FIELD COLLAPSING / GROUPING
Although Solr requires a flat, denormalized document, Solr allows you to treat multiple
documents as a group based on some common property shared by all documents in a group.
Field grouping, also known as field collapsing, allows you to return unique groups instead of
individual documents in the results.
The classic example of field collapsing is threaded email discussions where emails
matching a specific query could be grouped under the original email message that started
the conversation. You’ll learn about field grouping in chapter 11.
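The email example can be mimicked with a few lines of Python that collapse matching messages into groups keyed by a shared field. This is a conceptual sketch only, not Solr's grouping implementation; the thread_id field and the sample messages are assumptions.

```python
def collapse(results, field):
    """Group matching documents by the value of a shared field, keeping order."""
    groups = {}
    for doc in results:
        groups.setdefault(doc[field], []).append(doc)
    return groups


# Hypothetical emails matching a query, already sorted by relevance.
emails = [
    {"id": "m1", "thread_id": "t1", "subject": "Solr upgrade plan"},
    {"id": "m2", "thread_id": "t2", "subject": "Index size question"},
    {"id": "m3", "thread_id": "t1", "subject": "Re: Solr upgrade plan"},
]

threads = collapse(emails, "thread_id")
for thread_id, messages in threads.items():
    print(thread_id, [m["id"] for m in messages])
```

Three matching documents become two results, one per conversation, which is the effect field collapsing has on a result list.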
POWERFUL AND FLEXIBLE QUERY SUPPORT
Solr provides a number of powerful query features including:
 Conditional logic using and, or, not
 Wildcard matching
 Range queries for dates and numbers
 Phrase queries with slop to allow for some distance between terms
 Fuzzy string matching
 Regular expression matching
 Function queries
Don’t worry if you don’t know what all of these terms mean as we’ll cover all of them in
depth in chapter 7.
JOINS
In SQL, you use a JOIN to create a relation by pulling data from two or more tables together
using a common property such as a foreign key. But in Solr, joins are more like sub-queries
in SQL in that you don’t build documents by joining data from other documents. For
example, with Solr joins, you can return child documents of parents that match your search
criteria. One example where Solr joins are useful would be grouping all retweets of a Twitter
message into a single group. We discuss joins in chapter 14.
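For a sense of what a join looks like in a query, the Python sketch below assembles a query string using Solr's join query parser syntax. The field names (id, retweet_of, text) are hypothetical; the intent is to find documents whose retweet_of field points at the id of a document matching the inner query.

```python
def join_query(from_field, to_field, inner_query):
    """Build a Solr join query string in the {!join from=... to=...} form."""
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"


# Find retweets (documents whose retweet_of refers to a matching tweet's id).
retweets_query = join_query("id", "retweet_of", "text:solr")
print(retweets_query)
```

Note that only the documents on the "to" side are returned; no fields are copied over from the documents matched by the inner query, which is why the comparison to a SQL sub-query fits better than a SQL JOIN.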
DOCUMENT CLUSTERING
Document clustering allows you to identify groups of documents that are similar, based on
the terms present in each document. This is helpful to avoid returning many documents
containing the same information in search results. For example, if your search engine is
based on news articles pulled from multiple RSS feeds, then it’s likely that you’ll have many
documents for the same news story. Rather than returning multiple results for the same
story, you can use clustering to pick a single representative story. Clustering techniques are
discussed briefly in chapter 17.
IMPORT COMMON DOCUMENT FORMATS LIKE PDF AND WORD
In some cases, you may want to take a bunch of existing documents in common formats like
PDF and Microsoft Word and make them searchable. With Solr this is easy because it
integrates with the Apache Tika project that supports most popular document formats. Importing
rich format documents is covered in chapter 12.
IMPORT DATA FROM RELATIONAL DATABASES
If the data you want to search with Solr is in a relational database, then you can configure
Solr to create documents using a SQL query. We cover Solr’s data import handler (DIH) in
chapter 12.
MULTILINGUAL SUPPORT
Solr and Lucene have a long history of working with multiple languages. Solr has language
detection built-in and provides language-specific text analysis solutions for many languages.
We’ll see Solr’s language detection in action in chapter 6.
1.4.3 New features in Solr 4
Before we wrap up this chapter, let’s look at a few of the exciting new features in Solr 4. In
general, version 4 is a huge milestone for the Apache Solr community as it addresses many
of the major pain-points discovered by real users over the past several years. We selected a
few of the main features to highlight here but we’ll also point out new features in Solr 4
throughout the book.
 Near-real-time search
 Atomic updates with optimistic concurrency
 Real-time get
 Write durability using a transaction log
 Easy sharding and replication using Zookeeper
NEAR-REAL-TIME SEARCH
Solr’s Near-Real-Time (NRT) search feature supports applications that have a high velocity of
documents that need to be searchable within seconds of being added to the index. With NRT,
you can use Solr to search rapidly changing content sources such as breaking news and
social networks. We cover NRT in chapter 13.
ATOMIC UPDATES WITH OPTIMISTIC CONCURRENCY
The atomic update feature allows a client application to add, update, delete, and increment
fields on an existing document without having to resend the entire document. For example, if
the price of a home in our example real estate application from section 1.2 changes, then we
can send an atomic update to Solr to change the price field specifically.
You might be wondering what happens if two different users attempt to change the same
document concurrently. In this case, Solr guards against incompatible updates using
optimistic concurrency. In a nutshell, Solr uses a special version field named _version_ to
enforce safe update semantics for documents. In the case of two different users trying to
update the same document concurrently, the user that submits updates last will have a stale
version field so their update will fail. Atomic updates and optimistic concurrency are covered
in chapter 12.
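A price change like the one described above might be expressed as the following JSON payload, built here in Python. The document ID, the price field, and the version number are assumptions; the "set" modifier and the _version_ field follow Solr's atomic update conventions, and the payload is only constructed, not sent.

```python
import json


def price_update(doc_id, new_price, expected_version=None):
    """Build a JSON atomic update that changes only the price field."""
    doc = {"id": doc_id, "price": {"set": new_price}}
    if expected_version is not None:
        # Ask Solr to reject the update if the stored version differs.
        doc["_version_"] = expected_version
    return json.dumps([doc])


payload = price_update("listing-42", 299000, expected_version=1234567890)
print(payload)
```

If another client has changed the document since version 1234567890 was read, the stored version no longer matches and the update is expected to be rejected, so the client can re-read the document and retry.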
REAL-TIME GET
At the beginning of this chapter, we stated that Solr is a NoSQL technology. Solr’s real-time
get feature definitely fits within the NoSQL approach by allowing you to retrieve the latest
version of a document using its unique identifier regardless of whether that document has
been committed to the index. This is similar to using a key-value store like Cassandra to
retrieve data using a row key.
Prior to Solr 4, a document wasn’t retrievable until it was committed to the Lucene index.
With the real-time get feature in Solr 4, you can safely decouple the need to retrieve a
document by its unique ID from the commit process. This can be useful if you need to update
an existing document after it’s sent to Solr without having to do a commit first. As we’ll learn
in chapter 5, commits can be expensive and impact query performance.
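A real-time get request is a lookup by unique ID. The following Python sketch builds such a URL for Solr's /get handler; the base URL, core name (collection1), and document ID are assumptions, and no request is sent.

```python
from urllib.parse import urlencode


def realtime_get_url(doc_id, core_url="http://localhost:8983/solr/collection1"):
    """Return a URL that fetches the latest version of one document by ID."""
    return f"{core_url}/get?{urlencode({'id': doc_id})}"


get_url = realtime_get_url("listing-42")
print(get_url)
```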
DURABLE WRITES
When a document is sent to Solr for indexing, it’s written to a transaction log to prevent data
loss in the event of server failure. Solr’s transaction log sits between the client application
and the Lucene index. It also plays a role in servicing real-time get requests as documents
are retrievable by their unique identifier regardless of whether they’re committed to Lucene.
The transaction log allows Solr to decouple update durability from update visibility. This
means that documents can be on durable storage but aren’t visible in search results yet. This
gives your application control over when to commit documents to make them visible in
search results without risking data loss if a server fails before you commit. We’ll discuss
durable writes and commit strategies in chapter 5.
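The separation between durability and visibility can be illustrated with a toy Python model. This is a deliberately simplified sketch and not Solr's transaction log implementation: a plain list stands in for durable storage, and the class and method names are hypothetical.

```python
class ToyIndex:
    """Toy model: documents are logged on arrival but searchable only after commit."""

    def __init__(self):
        self.transaction_log = []  # stands in for durable storage
        self.committed = {}        # documents visible to searches

    def add(self, doc):
        """Record the document in the log; it is not yet searchable."""
        self.transaction_log.append(doc)

    def commit(self):
        """Make all logged documents visible in search results."""
        for doc in self.transaction_log:
            self.committed[doc["id"]] = doc
        self.transaction_log.clear()

    def search_ids(self):
        """Return the IDs of all documents visible to searches."""
        return sorted(self.committed)

    def realtime_get(self, doc_id):
        """Look in the log first (newest entry wins), then in committed documents."""
        for doc in reversed(self.transaction_log):
            if doc["id"] == doc_id:
                return doc
        return self.committed.get(doc_id)


index = ToyIndex()
index.add({"id": "listing-42", "price": 299000})
print(index.search_ids(), index.realtime_get("listing-42"))  # logged, not visible
index.commit()
print(index.search_ids())  # visible after commit
```

Before the commit, a search finds nothing but a lookup by ID still returns the document, which mirrors the role the log plays for real-time get.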
EASY SHARDING AND REPLICATION WITH ZOOKEEPER
If you’re new to Solr, then you may not be aware that scaling previous versions of Solr was a
cumbersome process at best. With SolrCloud, scaling is simple and automated because Solr
uses Apache Zookeeper to distribute configuration and manage shard leaders and replicas.
The Apache website (zookeeper.apache.org) describes Zookeeper as a “centralized service
for maintaining configuration information, naming, providing distributed synchronization, and
providing group services.”
In Solr, Zookeeper is responsible for assigning shard leaders and replicas and keeps track
of which servers are available to service requests. SolrCloud bundles Zookeeper so you don’t
need to do any additional configuration or setup to get started with SolrCloud. We’ll dig into
the details of SolrCloud in chapter 13.
1.5 Summary
We hope you now have a good sense for what types of data and use cases Solr supports. As
you learned in section 1.1, Solr is optimized to handle data that’s text-centric, read-
dominant, document-oriented, and has a flexible schema. We also learned that search
engines like Solr aren’t general-purpose data storage and processing solutions but are
intended to power keyword search, ranked retrieval, and information discovery. Using the
example of a fictitious real estate search application, we saw how Solr builds upon Lucene to
add declarative index configuration and web services based on HTTP, XML, and JSON. Solr 4
can be scaled in two dimensions to support millions of documents and high query traffic
using sharding and replication. Solr 4 has no single points of failure.
We also touched on some key reasons to choose Solr based on the perspective of
key stakeholders. We saw how Solr addresses the concerns of software architects, system
administrators, and even the CEO. Lastly, we covered some of Solr’s main features and gave
you pointers where you can learn more about each feature in this book.
We hope you’re excited to continue learning about Solr, so now it’s time to download the
software and run it on your local system, which is what we’ll do in chapter 2.
2
Getting to know Solr
This chapter covers
 Downloading and installing the Apache Solr 4.1 binary distribution
 Starting the example Solr server and indexing example documents
 Basic search features: sorting, paging, and results formatting
 Exploring the Solritas example search UI
 Self-guided tour of the Solr administration console
 Adapting the Solr example server to your specific needs
It’s natural to have a sense of uneasiness when you start using an unfamiliar technology.
You can put your mind at ease because Solr is designed to be easy to install and set up. In
the spirit of being agile, you can start out simple and incrementally add complexity to your
Solr configuration. For example, Solr allows you to split a large index into smaller subsets,
called shards, as well as add replicas to increase your capacity to serve queries. But you
don’t need to worry about index sharding or replication until you need them.
By the end of this chapter, you’ll have Solr running on your computer, know how to start
and stop Solr, know your way around the web-based administration console, and have a
basic understanding of key Solr terminology such as solr home, core, and collection.
What’s in a name? Solr 4 vs. SolrCloud
You may have heard of SolrCloud and wondered what the difference is between Solr 4
and SolrCloud. Technically, SolrCloud is the code name for a sub-set of features in Solr 4
that makes it easier to configure and run a scalable, fault-tolerant cluster of Solr servers.
Think of SolrCloud as a way to configure a distributed installation of Solr 4.
Also, SolrCloud doesn’t have anything to do with running Solr in a cloud-computing
environment like Amazon EC2, although you can run Solr in a cloud. We presume that the
“cloud” part of the name reflects the underlying goal of the SolrCloud feature set to
enable elastic scalability, high-availability, and ease of use we’ve all come to expect from
cloud-based services. We cover SolrCloud in-depth in chapter 13.

Let’s get started by downloading Solr from the Apache website and installing it on your
computer.
2.1 Getting started
Before we can get to know Solr, we have to get it running on your local computer. This starts
with downloading the binary distribution of Solr 4.1 from Apache and extracting the
downloaded archive. Once installed, we’ll show you how to start the example Solr server and
verify that it’s running by visiting the Solr administration console from your web browser.
Throughout this process, we assume you’re comfortable executing simple commands from
the command line of your chosen operating system. There is no GUI installer for Solr, but
you’ll soon see that the process is so simple you won’t need a GUI-driven installer.
2.1.1 Installing Solr
Installing Solr is a bit of a misnomer in that all you really need to do is download the binary
distribution (.zip or .tgz) and extract it. Before you do that, let’s make sure you have the
necessary prerequisite, Java 1.6 or greater (also known as Java SE 6), installed. To verify you
have the correct version of Java, open a command line on your computer and run:
java -version
You should see output that looks similar to the following:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
If you don’t have Java installed, we recommend you use Oracle’s JVM.
1
Also, keep in
mind that even though the Solr server requires Java, that doesn’t mean you have to use
Java in your application to interact with Solr. All client interaction with Solr happens over
HTTP so you can use any language that provides an HTTP client library. In addition, a
number of open source client libraries are available for Solr for popular languages like .NET,
Python, Ruby, PHP, and of course Java.
Assuming you have Java installed, you’re now ready to install Solr. Apache provides source
and binary distributions of Solr; for now, we’ll focus on the installation steps using the binary
distribution. We cover how to build Solr from source in appendix A.
In your browser, go to the Solr home page http://lucene.apache.org/solr and click on the
Download button for Apache Solr 4.1 on the right; there is also a button for downloading
Solr 3.6.x, so be sure to choose the one for Solr 4. This will direct you to a mirror site for
Apache downloads; it’s advisable to download from a mirror site to avoid overloading the
main Apache site. If you’re on Windows, download solr-4.1.0.zip. If you’re on Unix, Linux,
or Mac OS X, download solr-4.1.0.tgz.

1 http://www.oracle.com/technetwork/java/javase/downloads/index.html
After downloading, move the downloaded file to a permanent location on your computer.
For example, on Windows, you could move it to the C:\ root directory, or on Linux, choose a
location like /opt/solr. For Windows users, we highly recommend that you extract Solr to a
directory that doesn’t have spaces in the name; that is, avoid extracting Solr in directories like
C:\Documents and Settings or C:\Program Files. Your mileage may vary on this, but
because Solr is Java-based software, you’re likely to run into issues with paths that contain a space.
There is no formal installer needed because Solr is self-contained in a single archive file—
all you need to do is extract it. When you extract the archive, all files will be created under
the solr-4.1.0 directory. On Windows, you can use the built-in Zip extraction support or a
tool like WinZip. On Unix, Linux or Mac, do: tar zxf solr-4.1.0.tgz. This will create
the directory structure shown in figure 2.1.

Figure 2.1 Directory listing of the solr-4.1.0 installation after extracting the downloaded archive on your
computer. We’ll refer to the top-level directory as $SOLR_INSTALL in the rest of this chapter.
We refer to the location where you extracted the Solr archive (.zip or .tgz) as
$SOLR_INSTALL in the rest of this chapter. We use this name because as you’ll see shortly,
Solr home will be a different path, so we didn’t want to use $SOLR_HOME as the alias for the
top-level directory where you extracted Solr. So now that Solr is installed, you’re ready to
start it up.
2.1.2 Starting the Solr example server
To start Solr, open a command line and do the following:

cd $SOLR_INSTALL/example
java -jar start.jar

Remember that $SOLR_INSTALL is the alias we’re using to represent the directory where
you extracted the Solr download archive, such as C:\solr-4.1.0 on Windows. That’s all there
is to starting Solr.
During initialization, you’ll see some log messages printed to the console. If all goes well,
you should see the following log message at the bottom:
… :INFO:oejs.AbstractConnector:Started SocketConnector@0.0.0.0:8983
WHAT HAPPENED?
That was so easy, you might be wondering what was actually accomplished. To be clear, you
now have a running version of Solr 4.1 on your computer. You can verify that Solr started
correctly by directing your web browser to the Solr administration page at:
http://localhost:8983/solr. Figure 2.2 provides a screen shot of the Solr administration
console; please take a minute to get acquainted with the layout and navigational tools in the
console.

Figure 2.2 The Solr 4 administration console, which provides a wealth of tools for working with your new
Solr instance. Click on the link labeled “collection1” to access more tools such as the query form.
Behind the scenes, start.jar launched a Java web server named Jetty, listening on port
8983. Solr is a web application running in Jetty. Figure 2.3 illustrates what is now running on
your computer.

Figure 2.3 Solr from a systems perspective showing the Solr web application (solr.war) running in Jetty on
top of Java. There is one Solr home directory set per Jetty server using Java system property
solr.solr.home. Solr can host multiple cores per server and each core has a separate directory containing
core-specific configuration and index (data) under Solr home, e.g. collection1.
TROUBLESHOOTING
There’s not much that can go wrong when starting the example server. The most common
issue if the server doesn’t start correctly is that the default port 8983 is already in use by
another process. If this is the case, you’ll see an error that looks like: java.net.BindException:
Address already in use. This is easy to resolve by changing your start command to specify a
different port for Jetty to bind to: java -Djetty.port=8080 -jar start.jar. Using this
command, Jetty will bind to port 8080 instead of 8983.
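To find out whether a port is taken before starting Jetty, you can probe it from a short script. The sketch below uses only Python’s standard socket module; the helper names are hypothetical, and it checks the loopback address only, which is an assumption that suits a local example server.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use", the same condition Jetty reports.
            return False
        return True

def first_free_port(candidates):
    """Return the first free port from an iterable of candidates, or None."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None

print(first_free_port([8983, 8080, 8984]))
```

Whichever port the script reports can then be passed to Jetty with the -Djetty.port option shown above.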
Jetty versus Tomcat
We recommend just staying with Jetty when first learning Solr. If your organization
already uses Tomcat or some other Java web application server, such as Resin, then you
can deploy the Solr WAR file. Since we’re just getting to know Solr in this chapter, we’ll
refer you to appendix B to learn how to deploy the Solr WAR.
Solr uses Jetty to make the initial setup and configuration process a no-brainer. However,
this doesn’t mean that Jetty is a bad choice for production deployment. If your
organization already has a standard Java web application platform, then Solr will work
with it. However, if you have some choice, then we recommend you try out Jetty. It’s
fast, stable, mature, and easy to administer and customize. In fact, Google uses Jetty for
their AppEngine, see http://www.infoq.com/news/2009/08/google-chose-jetty/, which
gives great credibility to Jetty as a solid platform for running Solr in even the most
demanding environments!

STOPPING SOLR
For local operation, you can just kill the Solr server by doing Ctrl-C in the console window
where you started Solr. Typically, this is safe enough for development and testing. Jetty does
provide a safer mechanism for stopping the server, which will be discussed in chapter 12.
Now that we have a running server, let’s take a minute to understand where Solr gets its
configuration information and where it manages its Lucene index. Understanding how the
example server you just started is configured will help you when you’re ready to start
configuring a Solr server for your application.
2.1.3 Understanding Solr Home
In Solr, a “core” is composed of a set of configuration files, Lucene index files, and Solr’s
transaction log. One Solr server running in Jetty can host multiple cores. Recall that in chapter 1
we designed a real estate search application that had multiple cores, one for houses and a
separate core for land listings. We used two separate cores because the indexed data was
different enough to justify having two different index structures. The Solr example server
you started in section 2.1.2 has a single core named “collection1”.
As a brief aside, Solr also uses the term “collection”, which really only has meaning in the
context of a Solr cluster where a single index is distributed across multiple servers.
Consequently, we feel it’s easier to focus on understanding what a Solr core is for now. We’ll
return to the distinction between core and collection in chapter 13 when we discuss
SolrCloud.
Solr home is a directory structure that encapsulates one or more cores, which are
configured by solr.xml. Solr also provides a core admin API that allows you to create,
update, and delete cores programmatically from your application. Behind the scenes the core
admin API makes changes to the solr.xml configuration file. Listing 2.1 shows the default
solr.xml for the example server. Out of the box, you don’t need to make any changes to this
file.
Listing 2.1 Default solr.xml for example server defining the collection1 core
<solr persistent="true"> #A
  <cores adminPath="/admin/cores" #B
         defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:}"
         hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1" /> #C
  </cores>
</solr>
#A Persistent attribute controls whether or not changes made from the core admin API are persisted
to this file
#B Define one or more cores under the <cores> element
#C The collection1 core configuration and index files are in the collection1 directory under solr
home
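To see how an application might read this file, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The embedded XML is a hand-written sample that follows the layout the annotations describe (a cores element containing core elements); its attribute values are assumptions for illustration, so compare them with the solr.xml in your own Solr home.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the solr.xml style described above; the values are
# illustrative assumptions, not a copy of the file on your disk.
SOLR_XML = """
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
"""

def list_cores(xml_text):
    """Return (default core name, {core name: instance directory})."""
    root = ET.fromstring(xml_text)
    cores = root.find("cores")
    mapping = {core.get("name"): core.get("instanceDir")
               for core in cores.findall("core")}
    return cores.get("defaultCoreName"), mapping

default_core, core_dirs = list_cores(SOLR_XML)
print(default_core, core_dirs)
```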
Each Solr process has one and only one solr home directory, which is set by a global Java
system property: solr.solr.home. Figure 2.4 shows a directory listing of the default Solr
home “solr” for the example server.

Figure 2.4 Directory listing of the default Solr home directory for the Solr examples. It contains a single
core named “collection1,” which is configured in solr.xml. The collection1 directory corresponds to the core
named “collection1” and contains core-specific configuration files, Lucene index, and transaction log.
We’ll learn more about the main Solr configuration file for a core, named solrconfig.xml,
in chapter 4. Also, schema.xml is the main configuration file that governs index structure
and text analysis for documents and queries; you’ll learn all about schema.xml in chapter 5.
For now, just take a moment to scan figure 2.4 so that you have a sense for the basic
structure.
The example directory contains two other solr home directories for exploring advanced
functionality. Specifically, the example/example-DIH directory provides a Solr core for
learning about the data import handler (DIH) feature in Solr. Also, the
example/multicore directory provides an example of a multi-core configuration. We’ll
learn more about these features later in the book. For now, let’s continue with the simple
example by adding some documents to the index, which you’ll need to work through the
examples in section 2.2 below.
2.1.4 Indexing the example documents
When you first start Solr, there are no documents in the index. It’s just an empty server
waiting to be filled with data to search. We cover indexing in more detail in chapter 5. For
now, we’ll gloss over the details in order to get some example data into the Solr index so that
we can try out some queries. Open a new command-line interface and do the following:
cd $SOLR_INSTALL/example/exampledocs
java -jar post.jar *.xml
You should see output that looks like:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml…
POSTing file gb18030-example.xml
POSTing file hd.xml
POSTing file ipod_other.xml
POSTing file ipod_video.xml
POSTing file manufacturers.xml
POSTing file mem.xml
POSTing file money.xml
POSTing file monitor.xml
POSTing file monitor2.xml
POSTing file mp500.xml
POSTing file sd500.xml
POSTing file solr.xml
POSTing file utf8-example.xml
POSTing file vidcard.xml
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update…

The post.jar file sends XML documents to Solr using HTTP POST. After all the documents
are sent to Solr, the post.jar application issues a commit, which makes the example
documents findable in Solr. To verify that the example documents were added successfully,
go to the Query page in the Solr administration console (http://localhost:8983/solr) and
execute the “find all documents” query (*:*). You need to click on the “collection1” link on
the left to access the Query page. Figure 2.5 shows what you should see after executing the
find all documents query.

Figure 2.5 Screenshot of the Query form on the Solr administration console. You can verify that the
example documents were indexed correctly by executing the find all documents query *:*
At this point, we have a running Solr instance with some example documents loaded.
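To make the POST-then-commit sequence used by post.jar more concrete, the following Python sketch builds the two HTTP requests without sending them. It relies on Solr’s XML update message format (add, doc, and field elements, followed by a commit element) and on the update URL shown in the post.jar output; the document content and the helper names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

UPDATE_URL = "http://localhost:8983/solr/update"  # URL from the post.jar output

def build_add_message(doc):
    """Serialize one document (a dict of field name to value) as an <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")

def build_requests(doc):
    """Build the POST that adds the document and the POST that commits it."""
    headers = {"Content-Type": "application/xml"}
    add_req = urllib.request.Request(
        UPDATE_URL, data=build_add_message(doc), headers=headers)
    commit_req = urllib.request.Request(
        UPDATE_URL, data=b"<commit/>", headers=headers)
    return add_req, commit_req

# Hypothetical document; the field names must exist in your schema.
add_req, commit_req = build_requests({"id": "demo-1", "name": "Hypothetical test document"})
print(add_req.get_method(), add_req.data)
# To send them to a running server, pass each request to urllib.request.urlopen().
```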
2.2 Searching is what it’s all about
Now it’s time to see Solr shine. Without a doubt, Solr’s main strength is powerful query
processing. Think about it this way, who cares how scalable or fast a search engine is if the
results it returns aren’t useful or accurate? In this section, you’ll see Solr query processing in
action, which we think will help you see why Solr is such a powerful search technology.
Throughout this section, pay close attention to the link between each query we execute
and the documents that Solr returns, especially the order of the documents in the results.
This will help you start to think like a search engine, which will come in handy in chapter 3
when we cover core search concepts.
2.2.1 Exploring Solr’s query form
You’ve already used Solr’s query form to execute the “find all documents” query (*:*). Let’s
take a quick tour of the other features on this form so that you get a sense for the types of
queries Solr supports. Figure 2.6 provides some annotations of key sections of this form.
Take a minute to read through each annotation in the diagram.

Figure 2.6 Annotated screen shot of Solr’s query form to illustrate the main features of Solr query
processing, such as filters, results format, sorting, paging, and search components.
In figure 2.6, we formulate a query that returns two of the example documents we added
in section 2.1.4 above. Take a moment to fill out the form and execute the query in your
own environment. Do the two documents that Solr returned make sense? Table 2.1 provides
an overview of the form fields we’re using for this example.

Table 2.1 Overview of query parameters from figure 2.6
Form field | Value | Description
q | iPod | Main query parameter; documents are scored by their similarity to terms in this parameter.
fq | manu:Belkin | Filter query; restricts the result set to documents matching this filter but doesn’t affect scoring. In this example, we keep only results whose manufacturer field “manu” equals “Belkin”.
sort | price asc | Specifies the sort field and sort order; in this case we want results sorted by the price field in ascending order (asc) so that documents with the lowest price are listed first.
start | 0 | Specifies the zero-based offset of the first result to return; since this is our first request, we want the first page, which starts at offset 0. Start should be incremented by the page size to advance to the next page.
rows | 10 | Page size; restricts the number of results returned per page, in this case 10. The next page starts at offset 10, not 1.
fl | name,price,features,score | List of fields to return for each document in the result set. The “score” field is a built-in field that holds each document’s relevancy score for the query. You have to request the score field explicitly, as is done in this example.
df | text | Default search field for any query terms that don’t specify which field to search on; text is the catch-all field for the example server.
wt | xml | Response writer type; governs the format of the response.

As we discussed in chapter 1, section 1.2.3, all interaction with Solr’s core services, such
as query processing, is performed with HTTP requests. When you fill out the query form, an
HTTP GET request is created and sent to Solr. The form field names shown in table 2.1
correspond to parameters passed to Solr in the HTTP GET request. Listing 2.2 shows the
actual HTTP GET request sent to Solr when you execute the query depicted in figure 2.6;
note that the actual request doesn’t include line breaks between the parameters, which
we’ve included here to make it easier to see the separate parameters.
Listing 2.2 Breakdown of the HTTP GET request sent by the Query form
http://localhost:8983/solr/collection1/select? #A
q=iPod& #B
fq=manu%3ABelkin& #C
sort=price+asc& #D
fl=name%2Cprice%2Cfeatures%2Cscore& #E
df=text& #F
wt=xml& #G
start=0&rows=10 #H
#A Invokes the “select” request handler for the “collection1” core
#B Main query component looking for documents containing “iPod”
#C Filter documents that have manu field equal to “Belkin”
#D Sort results by price in ascending order (smallest to largest)
#E Return the name, price, features, and score fields in results
#F Default search field is “text”
#G Return results in XML format
#H Start at offset 0 and return up to 10 results
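Because the request is just a URL with encoded parameters, any HTTP client library can build it. As a minimal sketch, the Python snippet below reproduces the request from the breakdown above with the standard urllib.parse.urlencode function; the function name build_select_url is ours, and the parameter values are the ones from table 2.1.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/collection1/select"

# Same parameters as the query form example; dict order is preserved.
params = {
    "q": "iPod",
    "fq": "manu:Belkin",
    "sort": "price asc",
    "fl": "name,price,features,score",
    "df": "text",
    "wt": "xml",
    "start": 0,
    "rows": 10,
}

def build_select_url(base, query_params):
    """URL-encode the parameters (':' -> %3A, ',' -> %2C, space -> '+')."""
    return base + "?" + urlencode(query_params)

print(build_select_url(SELECT_URL, params))
```

Passing the resulting URL to an HTTP client, or pasting it into a browser, sends the same query that the form sends.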
Looking for more example queries?
We cover queries in more depth in chapter 7. But, if you don’t want to wait that long and
want to see more queries in action, then we recommend looking at the Solr tutorial
provided with Solr. Open $SOLR_INSTALL/docs/tutorial.html in your web browser
and you’ll find some additional queries for the example documents you loaded in the
previous section 2.1.4.

Lastly, we probably don’t have to tell you that this form isn’t designed for end users; the
Solr team built the query form so that developers and administrators have a way to send
queries without having to formulate HTTP requests manually or develop a client application
just to send a query to Solr. But, let’s be clear that with Solr, you’re responsible for
developing the user interface to Solr. As we’ll see in section 2.2.5 below, Solr provides an
example search UI, called Solritas, to help you get started building your own awesome
search application.
2.2.2 What comes back from Solr when you search
We’ve seen what gets sent to Solr, so now let’s learn about what comes back from Solr in
the results. The key point in this section is that Solr returns documents that match the query
as well as additional information that can be processed by your Solr client to deliver a quality
search experience. The operative phrase is “by your Solr client”: Solr returns the raw
materials that you need to translate into a quality search experience for your users.
Figure 2.7 shows what comes back from the example query we used in section 2.2.1. As
you can see, the results are in XML format and are sorted from lowest to highest price. Each
document contains the term “iPod”. Paging doesn’t really come into play with this result set
because there are only two results total.

Figure 2.7 Solr response in XML format from our sample query from figure 2.6
So far, we’ve only seen results returned as XML, but Solr also supports other formats such as
CSV, JSON, and language-specific formats for popular languages. For instance, Solr can
return a Python-specific format that can be parsed into a Python object tree using the eval
function; because eval executes whatever it is given, use it only on responses from a Solr
server you trust.
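As a small illustration of consuming the Python format, the sketch below parses a response body with ast.literal_eval from the standard library, which accepts only Python literals and is therefore a more defensive choice than eval. The response text is a hand-written sample shaped like a Solr response (responseHeader and response sections); the document names and prices in it are made up.

```python
import ast

# Hand-written sample shaped like a wt=python response; the values are
# illustrative and were not captured from a real server.
SAMPLE = (
    "{'responseHeader':{'status':0,'QTime':1},"
    "'response':{'numFound':2,'start':0,'docs':["
    "{'name':'Hypothetical product A','price':11.5},"
    "{'name':'Hypothetical product B','price':19.95}]}}"
)

def parse_python_response(text):
    """Parse a Python-format response body without executing arbitrary code."""
    return ast.literal_eval(text)

result = parse_python_response(SAMPLE)
for doc in result["response"]["docs"]:
    print(doc["name"], doc["price"])
```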
2.2.3 Ranked retrieval
As we touched upon in chapter 1, the key differentiator between Solr’s query processing and
that of a database or other NoSQL data store is ranked retrieval—the process of sorting
documents by their relevance to a query, where the most relevant documents are listed first.
Let’s see ranked retrieval at work with some of the example documents you indexed in
section 2.1.4. To begin, enter “iPod” in the q text box, “name,features,score” in the fl
text field, and push the Execute button. This should return three documents sorted in
descending order by score. Take a moment to scan the results and decide if you agree with
the ranking for this simple query.
Intuitively, the ordering makes sense because the query term “iPod” occurs three times in
the first document listed, twice in the name and once in the features; it only occurs once in
the other documents. The actual numeric value of the score field isn’t important in itself; it’s
used internally by Lucene to do the ranking. The key take-away is that every document that
matches a query is assigned a relevance score for that specific query and results are
returned in descending order by score; the higher the score, the more relevant the document
is to the query.
Next, change your query to be “iPod power” and you’ll see that the same three
documents are returned and are in the same order. This is because all three documents
contain both query terms in either their name or features field. However, the scores of the
top two documents are much closer: 1.521 and 1.398 for the second query versus 1.333 and
0.770 (rounded) for the first query. This makes sense because “power” occurs twice in the
second document so its relevance to the “iPod power” query is much higher than its
relevance to the “iPod” query.
Lastly, change your query to be “iPod power^2” which boosts the “power” query term
by 2. In a nutshell, this means that the “power” term is twice as important to this query as
the “iPod” term, which has an implicit boost of 1. Again, the same three documents are
returned but in a different order. Now the top document in the results is “Belkin Mobile
Power Cord for iPod w/ Dock” because it contains the term “power” in the name and features
field and we told Solr that “power” is twice as important as “iPod” for this query.
Now you have a taste of what ranked retrieval looks like. You’ll learn more about ranked
retrieval and boosting in chapters 3 and 7. Let’s move on and see some other features of
query processing, starting with how to work with queries that return more than three
documents using paging and sorting.
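To build some intuition for how a boost can reorder results, here is a deliberately simplified Python sketch. It is not Lucene’s scoring formula: it just multiplies each query term’s boost by the number of times the term occurs in a document. The document texts and function names are hypothetical; only the term^boost syntax comes from the queries above.

```python
def parse_query(query):
    """Split 'iPod power^2' into [('ipod', 1.0), ('power', 2.0)]."""
    terms = []
    for token in query.split():
        term, _, boost = token.partition("^")
        terms.append((term.lower(), float(boost) if boost else 1.0))
    return terms

def toy_score(text, query):
    # Toy relevance: boost * occurrence count, summed over the query terms.
    # This is NOT Lucene's formula; it only mimics the direction of the effect.
    words = text.lower().split()
    return sum(boost * words.count(term) for term, boost in parse_query(query))

def rank(docs, query):
    """Return the names of matching documents, highest toy score first."""
    scored = [(toy_score(text, query), name) for name, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ordered if score > 0]

docs = {  # hypothetical document texts
    "doc_a": "ipod ipod dock ipod",
    "doc_b": "power cord ipod power",
}
print(rank(docs, "iPod"))
print(rank(docs, "iPod power^2"))
```

With these made-up texts, doc_a ranks first for “iPod”, while boosting “power” moves doc_b to the top, which mirrors the reordering described above.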
2.2.4 Paging and sorting
Our example Solr index only contains 32 documents, but a production Solr instance typically
has millions of documents. You can imagine that in a Solr instance for an electronics super-
store, a query for “iPod” would probably match thousands of products and accessories. To
ensure results are returned quickly, especially on bandwidth constrained mobile devices, you
don’t want to return thousands of results at once, even if the most relevant are listed first.
PAGING
The solution, of course, is to return a small sub-set of results called a “page” along with
navigational tools to allow the user to request more pages if needed. Paging is a first-class
concept in Solr query processing in that every query includes parameters that control the
page size (rows) and starting position (start). By default, Solr uses a page size of 10, but
you can control that using the “rows” parameter in the query request. To request the “next”
page in the results, you increment the start parameter by the page size. For example, if
you’re on the first page of results (start=0), then to get the next page, you increment the
start parameter by the page size, that is, start=10.
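The arithmetic is simple enough to capture in a couple of helper functions. The Python sketch below computes the start and rows parameters for a zero-based page number, plus the number of pages needed for a given total match count; the function names are ours.

```python
def page_params(page_number, rows=10):
    """Return the start/rows parameters for a zero-based page number."""
    if page_number < 0 or rows <= 0:
        raise ValueError("page_number must be >= 0 and rows must be > 0")
    return {"start": page_number * rows, "rows": rows}

def page_count(num_found, rows=10):
    """Number of pages needed to show num_found results (ceiling division)."""
    return (num_found + rows - 1) // rows

print(page_params(0), page_params(1), page_count(32))
```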
It’s important to use as small a page size as possible to satisfy your requirements
because the underlying Lucene index isn’t optimized for returning many documents at once.
Rather, Lucene is optimized for query processing, so the underlying data structures are
designed to make matching and scoring documents as fast as possible. Once the search results
are identified, Solr must reconstruct each document, in most cases by reading data off disk. It
uses intelligent caching to be as efficient as possible, but in comparison to query processing,
results construction is a slow process, especially for large page sizes. Consequently, you’ll
get much better performance from Solr using small page sizes.
SORTING
As we learned in section 2.2.3, results are sorted by relevance score, in descending order
(highest to lowest score). However, you can request Solr to sort results by other fields in
your documents. You’ve already seen an example of this in section 2.2.1 where we sorted
results by the price field in ascending order, which produces the lowest priced products at
the top.
Sorting and paging go hand-in-hand because the sort order determines the page position
for results. To help get you thinking about sorting and paging, consider whether Solr
will return deterministic results when paging without specifying a sort order.
On the surface, the answer seems obvious because the results are sorted by score in descending
order if you don’t specify a sort parameter. But what if all documents matching a query have the same score?
For example, if your query is “inStock:true” then all matching documents will have the
same score; be sure to verify this yourself using the Query form.
It turns out that Solr will indeed return all documents when you page through the results
even though the score is the same. This works because Solr finds all documents that match a
query and then applies the sorting and paging offsets to the entire set of documents. In
other words, Solr keeps track of the entire set of documents that match a query
independently of the sorting and paging offsets.
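The “order the whole match set, then apply the offsets” idea can be sketched in a few lines of Python. The documents below are hypothetical and all share one score, as they would for a query like inStock:true. The tie-break on the id field is our own assumption, added to make the order explicit; it is not a statement about how Solr orders equal scores internally.

```python
def page_of(matches, start, rows):
    """Order the full match set, then cut out one page."""
    # Highest score first; ties are broken by id so the order is repeatable.
    ordered = sorted(matches, key=lambda doc: (-doc["score"], doc["id"]))
    return ordered[start:start + rows]

# Hypothetical matches that all share the same score.
matches = [{"id": "doc%d" % i, "score": 1.0} for i in (3, 1, 5, 2, 4)]

seen = []
for start in range(0, len(matches), 2):
    seen.extend(doc["id"] for doc in page_of(matches, start, rows=2))
print(seen)
```

Paging through with a page size of 2 visits every document exactly once. In a real request, you could make the tie-break explicit by listing a second sort clause after the score, for example a sort on a unique field.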
2.2.5 Enabling search components
The query form contains a list of check boxes that enable advanced functionality during
query processing; in Solr speak these additional features are called “search components”. As
shown in figure 2.6, the form contains check boxes that reveal additional form fields to
activate the following search components:
• dismax – Disjunction Max query processor (chapter 7)
• edismax – Extended Disjunction Max query processor (chapter 7)
• hl – Hit highlighting (chapter 9)
• facet – Faceting (chapter 8)
• spatial – Geo-spatial search, such as sorting by geo-distance (chapter 14)
• spellchecking – Spell checking on query terms (chapter 10)
If you click on any of these checkboxes, you’ll see that it’s not clear what to do when you
look at the form. That’s because using these search components from the query form
requires some additional knowledge that we can’t cover quickly in this getting started
chapter. Rest assured that we cover each of these components in-depth later in the book.
For now, though, we can see some of these search components in action using Solr’s
example search interface, called “Solritas”, available in your local Solr instance at:
http://localhost:8983/solr/collection1/browse. Navigate to this URL in your web
browser and you’ll see a screen that looks like figure 2.8.

Figure 2.8 The Solritas Simple example, which illustrates how to use various search components, such as
faceting, More Like This, hit highlighting, and spatial, to provide a rich search experience for your users.
As shown at the top of figure 2.8, Solr provides three examples to choose from: Simple,
Spatial, and Group By. We’ll briefly cover the key aspects of the Simple example here and
encourage you to explore the other two examples on your own.
Take a moment to scan over figure 2.8 to identify the various search components at
work. One of the more interesting search components in this example is the facet search
component shown on the left side of the page, starting with the header Field Facets. The
facet component categorizes field values in the search results into useful sub-sets to help the
user refine their query and discover new information. For instance, when we search for
“video”, Solr returns three example documents and the faceting component categorizes the
“cat” field of these documents into three sub-sets: electronics (3), graphics card (2), and
music (1). Click on the music facet link and you’ll see the results are filtered from three
documents down to only one. The idea here is that in addition to search results, you can help
users refine their search criteria by categorizing the results into different filters. Take a few
minutes to explore the various facets shown in the example. We cover facets in detail in
chapter 8.
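Conceptually, a field facet is a count of field values across the documents that matched the query. The Python sketch below imitates that idea with collections.Counter over a few hypothetical documents chosen to give the same counts as the “video” example; it illustrates the concept only and says nothing about how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical matches for "video"; each document lists its categories in "cat".
matches = [
    {"id": "a", "cat": ["electronics", "graphics card"]},
    {"id": "b", "cat": ["electronics", "graphics card"]},
    {"id": "c", "cat": ["electronics", "music"]},
]

def facet_counts(docs, field):
    """Count how many matching documents carry each value of a field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return counts.most_common()

def apply_filter(docs, field, value):
    """Keep only documents whose field contains the value, like clicking a facet link."""
    return [doc for doc in docs if value in doc.get(field, [])]

print(facet_counts(matches, "cat"))
print([doc["id"] for doc in apply_filter(matches, "cat", "music")])
```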
Next, let’s take a look at another search component that isn’t immediately obvious from
figure 2.8, namely the spell check component. To see how spell checking works, type
“vydeoh” in the search box instead of “video”. Of course no results are found, as shown in
figure 2.9, but Solr does return a link that effectively asks the user if they meant “video” and
if so, they can re-search using the link.

Figure 2.9 Example of how the spell check component allows your search UI to prompt the user to re-
search using the correct spelling of a query term; in this case, Solr found “video” as the closest match for
“vydeoh”.
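The general idea of suggesting the closest known term can be imitated with Python’s standard difflib module. This is a stand-in for illustration only: difflib’s similarity ratio is not the algorithm Solr’s spell check component uses, and the vocabulary below is a hypothetical list of indexed terms.

```python
import difflib

def suggest(term, vocabulary):
    """Return the vocabulary word closest to a possibly misspelled term, or None."""
    close = difflib.get_close_matches(term, vocabulary, n=1)
    return close[0] if close else None

vocabulary = ["video", "ipod", "power", "music"]  # hypothetical indexed terms
print(suggest("vydeoh", vocabulary))
```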
There’s a lot of powerful functionality packed into the three Solritas examples and we
encourage you to spend a little time with each. For now, let’s move on and tour the rest of
the administration console.
2.3 Tour the Solr administration console
At this point, you should have a good feel for the query form, so let’s take a quick tour of the
rest of the administration console shown in figure 2.10.

Figure 2.10 The Solr administration console; explore each page using the toolbar on the left.
Rather than spending your time reading about the administration panel, we think it’s
better to just start clicking through some of the pages yourself. Thus, we leave it as an
exercise for you to visit all the links in the administration console to get a sense for what is
available on each page. To get you started, here are some highlights of what the
administration console provides:
• See how your Solr is configured from Dashboard
• View recent log messages from Logging
• Temporarily change log verbosity settings from Level under Logging
• Add and manage multiple cores from Core Admin
• View Java system properties from Java Properties
• Get a dump of all active threads in your JVM from Thread Dump
In addition to the main pages described above, there are a number of core specific pages
for each core in your server. Recall that the example server we’ve been working with has
only one core named “collection1”. The core-specific pages allow you to do the following:
 View core specific properties such as the number of Lucene segments from the main
core page, e.g. collection1
 Send a quick request to a core to make sure it’s alive using Ping
 Execute queries against the core’s index using Query
 View the currently active schema.xml for the core from Schema; you’ll learn all about
schema.xml in chapters 5 and 6
 View the currently active solrconfig.xml for the core from Config; you’ll learn more
about solrconfig.xml in chapter 4
 See how your index is replicated to other servers from Replication
 Analyze text from Analysis; you’ll learn all about text analysis in chapter 6, including
how to use the Analysis form
 Determine how fields in your documents are analyzed from Schema Browser
 Get information about the top terms for a field using Load Term Info on the Schema
Browser
 View the status and configuration for Plug-ins from PlugIns / Stats; you’ll learn all
about PlugIns in chapter 4
 View statistics about core Solr cache regions, such as how many documents are in the
documentCache from PlugIns / Stats
 Manage the data import handler from Dataimport; this isn’t enabled in the example
server
We’ll dig into the details for most of these pages in various places throughout the book,
when it’s more appropriate. For instance, you’ll learn all about the Analysis page in chapter 6
when we cover text analysis. For now, take a few moments to explore these pages on your
own. To give your self-guided tour some direction, see if you can answer the following
questions about your Solr server.

What’s the value of the lucene-spec version property for your Solr server?
What’s the log level of the org.apache.solr.core.SolrConfig class?
What’s the value of the maxDoc property for the collection1 core?
What’s the value of the java.vm.vendor Java system property?
What’s the segment count for the collection1 core?
What’s the response time of pinging your server?
What’s the top term for the manu field? (hint: select the “manu” field in the schema
browser and click on the Load Term Info button)
What’s the current size of your documentCache? (hint: think stats)
What’s the analyzed value of the name “Belkin Mobile Power Cord for iPod w/ Dock”?
(hint: select the name field on the Analysis page)
Let’s now turn our attention to what needs to be done to start customizing Solr for your
specific needs.
2.4 Adapting the example to your needs
So now that you’ve had a chance to work with the example server, you might be
wondering what’s the best way to proceed with adapting it to your specific needs. You have a
couple of choices here. First, you could just use the example directory as-is and start making
changes to it to meet your needs. However, we think it’s better to keep a copy of example
around and make your application specific changes in a clone of example. This allows you to
refer back to example in case you break something when working on your own application.
If you choose the latter approach, then you need to select a name for the directory that is
more appropriate for your application than “example”. For example, if we were building the
real estate search application described in chapter 1, then we might name the directory:
realestate. Once you’ve settled on a name, do the following steps to create a clone of the
example directory in Solr:
1. Create a deep copy of the example directory; e.g. cp -R example realestate
2. Clean up the cloned directory to remove unused solr home directories, such as
example-DIH and multicore; they’ll be in example if you need to refer back to them.
3. Under the solr home directory, rename collection1 to something more intuitive for
your application.
4. Update solr.xml to point to the name of your new core by replacing “collection1”
with the name of your core from step 3.
Note that you don’t need to make any changes to the Solr configuration files, such as
solrconfig.xml or schema.xml, at this time. These files are designed to provide a good
experience out-of-the-box and let you adapt them to your needs iteratively without having to
swallow such a big pill at once.
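If you expect to repeat the cloning steps, they can be scripted. The following Python function is only a sketch under a few assumptions: that the solr home directory is the solr subdirectory of example, that solr.xml lives directly in that solr home, and that a plain text replacement of “collection1” is enough to update it. The function name clone_example and its parameters are our own invention for illustration, not part of Solr.

```python
import shutil
from pathlib import Path


def clone_example(install_dir, app_name, core_name):
    """Clone the example directory and rename its single core.

    Assumes the layout <install_dir>/example/solr/collection1 with
    solr.xml stored in <install_dir>/example/solr (an assumption).
    """
    install = Path(install_dir)
    clone = install / app_name

    # Step 1: deep copy of the example directory (like cp -R).
    shutil.copytree(install / "example", clone)

    # Step 2: remove solr home directories we do not need; skipped if absent.
    for unused in ("example-DIH", "multicore"):
        shutil.rmtree(clone / unused, ignore_errors=True)

    # Step 3: rename the core directory under the solr home.
    solr_home = clone / "solr"
    (solr_home / "collection1").rename(solr_home / core_name)

    # Step 4: point solr.xml at the renamed core.
    solr_xml = solr_home / "solr.xml"
    solr_xml.write_text(solr_xml.read_text().replace("collection1", core_name))
    return clone


# Example call (the path and names are placeholders):
# clone_example("/opt/solr-4.1.0", "realestate", "listings")
```

The original example directory is left untouched, so you can still refer back to it if something breaks in the clone.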
Cleaning up your index
There may come a time when you want to start with a fresh index. After stopping Solr,
you can remove all documents by deleting the contents of the data directory for your
core, such as solr/collection1/data/*. When you restart Solr, you’ll have a fresh index
with 0 documents.

Restart Solr from your new directory using the same process from section 2.1.2. For
example, to restart our clone for the realestate application, we’d do:
cd $SOLR_INSTALL/realestate
java -jar start.jar
Lastly, you might be wondering about setting JVM options, configuring backups,
monitoring, setting Solr up as a service, and so on. These are important concerns
when you’re ready to go to production, so we cover them in chapter 12, when we
discuss taking Solr to production.
2.5 Summary
To recap, we started by installing Solr 4.1 from the binary distribution Apache provided. In
reality, the installation process was only a matter of choosing an appropriate directory where
to extract the downloaded archive (.zip or .tgz) and then doing the extraction. Next, we
started the example Solr server and added some example documents using the post.jar
command-line application.
After adding documents, we introduced you to Solr’s query form, where you learned the
basic components of a Solr query. Specifically, you learned how to construct queries
containing a main query parameter “q” as well as an optional filter “fq.” You saw how to
control which fields are returned using the “fl” parameter and how to control the ordering of
results using “sort.” We also touched on ranked retrieval concepts where results are ordered
by relevancy score. You’ll learn more about queries in chapter 7.
We also introduced you to search components and provided some insights into how they
work in Solr using the Solritas example user interface. Specifically, you saw an example of
how the facet component allows users to refine their search criteria using dynamically
generated filters called facets. We also touched on how the spell check component allows
you to prompt users “did you mean X?” when their query contains a misspelled term.
Next, we gave you some tips on what other tools are available in the Solr administration
console. You’ll find many great tools and statistics available about Solr, so we hope you were
able to answer the questions we posed as you walked through the administration console in
your browser. Lastly, we presented the steps to clone the example directory to begin
customizing it for your own application. We think this is a good way to start so that you
always have a working example to refer to as you customize Solr for your specific needs.
So now that you have a running Solr instance, it’s time to learn about core Solr concepts.
In chapter 3 you’ll gain a better understanding of core search concepts that will help you
throughout the rest of your Solr journey.
3
Key Solr concepts
This chapter covers
 What differentiates Solr from traditional database technologies
 The basic structure of Solr’s internal index
 How Solr performs complex queries using terms, phrases, and fuzzy matching
 How Solr calculates scores for matching queries to most relevant documents
 How to balance returning relevant results versus ensuring they’re all returned
 How to model your content into a denormalized document
 How search scales across servers to handle billions of documents and queries
Now that we have Solr up and running, it’s important to gain a basic understanding of how a
search engine operates and why you’d choose to use Solr to store and retrieve your content
as opposed to a traditional database. In this chapter, we’ll provide a solid understanding of
how a search engine stores documents in its internal index, how it calculates a relevancy
score to ensure only the “best” results are returned for display, and how Solr is able to scale
to handle billions of documents while still maintaining lightning-fast search response times.
Our main goal for this chapter is to provide you with the theoretical underpinnings
necessary to understand and maximize your use of Solr. If you have a solid background in
search technology or information retrieval already, then you may wish to skip some or all of
this chapter, but if not, it will help you understand more advanced topics later in this book
and maximize the quality of your users’ search experience. Although the content in this
chapter is generally applicable to most search engines, we’ll be specifically focusing upon
Solr’s implementation of each of the concepts. By the end of this chapter, you should have a
solid understanding of how Solr’s internal index works, how to perform complex Boolean and
fuzzy queries with Solr, how Solr’s default relevancy scoring model works, and how Solr’s
architecture enables queries to remain fast as it scales to handle billions of documents across
many servers.
Let’s begin with a discussion of the core concepts behind search in Solr, including how the
search index works, how a search engine matches queries and documents, and how Solr
enables powerful query capabilities to make finding content a problem of the past.
3.1 Searching, matching, and finding content
Many different kinds of systems exist today to help us solve challenging data storage and
retrieval problems—relational databases, key-value stores, map-reduce engines operating
upon files on disk, and graph databases, among thousands of others. Search engines, and
Solr in particular, help to solve a particular class of problem very well – those requiring the
ability to search across large amounts of unstructured text and pull back the most relevant
results.
In this section, we’ll describe the core features of a modern search engine, including an
explanation of a search “document”, an overview of the inverted search index at the core of
Solr’s fast full-text searching capabilities, and a broad overview of how this inverted search
index enables arbitrarily complex term, phrase, and partial matching queries.
3.1.1 What is a document?
We posted some documents to Solr in chapter 2 and then ran some example searches
against Solr, so this is not the first time we have mentioned documents. It is important,
however, that we have a solid understanding of the kind of information which we can put
into Solr to be searched upon (a document), and how that information is structured.
Solr is a document storage and retrieval engine. Every piece of data submitted to Solr for
processing is a document. A document could be a newspaper article, a resume or social
profile, or in an extreme case even an entire book.
Each document contains one or more fields, each of which is modeled as a particular
field type: string, tokenized text, boolean, date-time, lat/long, etc. The number of
potential field types is infinite because a field type is composed of zero or more analysis
steps that change how the data in the field is processed and mapped into the Solr index.
Each field is defined in Solr’s schema (discussed in chapter 5) as a particular field type,
which allows Solr to know how to handle the content as it is received. Listing 3.1 shows an
example document, defining the values for each field.
Listing 3.1 Example Solr document

company123
Atlanta
Georgia
Code Monkeys R Us, LLC
we write lots of code
2013-06-01T15:26:37Z

When we run a search against Solr, we can search on one or more of these fields (or
even fields not contained in this particular document), and Solr will return documents that
contain content in those fields matching the query specified in the search.
It is worth noting that, unlike Lucene, Solr is not schema-less. All field types must be
defined, and all field names (or dynamic field naming patterns) must be specified in Solr’s
schema.xml, as we’ll discuss further in chapter 5. This does not mean that every document
must contain every field, only that all possible fields must be mapped to a particular field
type should they appear in a document and need to be processed.
A document, then, is a collection of fields that map to particular field types defined in a
schema. Each field in a document is analyzed according to its field type and stored in a
search index in order to later retrieve the document by sending in a related query. The
primary search results returned from a Solr query are documents containing one or more
fields.
3.1.2 The fundamental search problem
Before we dive into an overview of how search works in Solr, it is helpful to understand what
fundamental problem search engines are solving.
Let’s say, for example, you were tasked with creating some search functionality that
helps users search for books. Your initial prototype might look something like figure 3.1.

Figure 3.1 Example search interface, as would be seen on a typical website, demonstrating how a user
would submit a query to your application
Now, imagine a customer wants to find a book on purchasing a new home and searches
for “buying a home.” Some potentially relevant book titles you may want to return for this
search are listed in table 3.1.
Table 3.1 Books relevant to the query “buying a home”
Potentially Relevant Books
The Beginner’s Guide to Buying a House
How to Buy Your First House
Purchasing a Home
Becoming a New Home Owner
Buying a New Home
Decorating Your Home

All other book titles, as listed in table 3.2, would not be considered relevant for customers
interested in purchasing a new home.
Table 3.2 Books not relevant to the query “buying a home”
Irrelevant Books
A Fun Guide to Cooking
How to Raise a Child
Buying a New Car
A naïve approach to implementing this search using a traditional SQL (Structured Query
Language) database would be to simply query for the exact text that users enter:
SELECT * FROM Books
WHERE Name = 'buying a home';
The problem with this approach, of course, is that none of the book titles in your book
catalog will match the text that customers type in exactly, which means they will not find
any results for this query. In addition, customers will only ever see results for future queries
if the query matches the full book title exactly.
Perhaps a more forgiving approach would be to search for each single word within a
customer’s query:
SELECT * FROM Books
WHERE Name LIKE '%buying%'
AND Name LIKE '%a%'
AND Name LIKE '%home%';
The previous query, while relatively expensive for a traditional database to handle
because it can’t use available database indexes, would at least produce one match for the
customer which contains all desired words, as shown in table 3.3.
Table 3.3 Results from database “like” query requiring a fuzzy match for every term
Matching Books         Non-matching Books
Buying a New Home      The Beginner’s Guide to Buying a House
                       How to Buy Your First House
                       Purchasing a Home
                       Becoming a New Home Owner
                       A Fun Guide to Cooking
                       How to Raise a Child
                       Buying a New Car
                       Decorating Your Home
Of course, you may believe that requiring documents to match all of the words your
customers include in their queries is overly restrictive. You could easily make the search
experience more flexible by only requiring a single word to exist in any matching book titles,
by issuing the following SQL query:
SELECT * FROM Books
WHERE Name LIKE '%buying%'
OR Name LIKE '%a%'
OR Name LIKE '%home%';
The results of the above query can be seen in table 3.4. You’ll see that this query
matched many more book titles than the previous query because this query only required a
minimum of one of the keywords to match. Additionally, because this query is performing
only partial string matching on each keyword, any book title that merely contains the letter
“a” is also returned. The preceding example, which required all of the terms, also matched
on the letter “a”, but we did not experience this problem of returning too many results
because the other keywords were more restrictive.
Table 3.4 Results from database “like” query only requiring a fuzzy match for at least one term
Matching books                             Non-matching books
A Fun Guide to Cooking                     How to Buy Your First House
Decorating Your Home
How to Raise a Child
Buying a New Car
Buying a New Home
The Beginner’s Guide to Buying a House
Purchasing a Home
Becoming a New Home owner

We have just seen that the first query (requiring all words to match) resulted in many
relevant books not being found, while the second query (requiring only one of the words to
match) resulted in many more relevant books being found, but resulted in many irrelevant
books being found, as well.
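To make the trade-off between the two queries concrete, the LIKE behavior can be imitated in a few lines of Python. This is only a sketch: the titles come from tables 3.1 and 3.2, the helper names like_all and like_any are made up for illustration, and the case-insensitive substring test assumes a database collation that ignores case.

```python
# Book titles from tables 3.1 and 3.2.
TITLES = [
    "The Beginner's Guide to Buying a House",
    "How to Buy Your First House",
    "Purchasing a Home",
    "Becoming a New Home Owner",
    "Buying a New Home",
    "Decorating Your Home",
    "A Fun Guide to Cooking",
    "How to Raise a Child",
    "Buying a New Car",
]


def contains(title, word):
    # Imitates Name LIKE '%word%' under a case-insensitive collation (an assumption).
    return word.lower() in title.lower()


def like_all(query):
    # Every word must appear somewhere in the title (the AND query).
    words = query.split()
    return [t for t in TITLES if all(contains(t, w) for w in words)]


def like_any(query):
    # At least one word must appear somewhere in the title (the OR query).
    words = query.split()
    return [t for t in TITLES if any(contains(t, w) for w in words)]


print(like_all("buying a home"))
# ['Buying a New Home']

print(sorted(set(TITLES) - set(like_any("buying a home"))))
# ['How to Buy Your First House']
```

The second result shows how little the OR version filters: the only title it rejects is the one that happens to contain no letter “a” at all.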
The above examples demonstrate several difficulties with this implementation:
 It does not understand linguistic variations of words such as “buying” vs “buy”
 It does not understand synonyms of words such as “buying” and “purchasing”
 Unimportant words such as “a” prevent results from matching as expected (either
excluding relevant results or including irrelevant results depending upon whether “all”
or “any” of the words must match).
 There is no sense of relevancy ordering in the results – books which only match one of
the queried words often show up higher than books matching multiple or all of the
words in the customer’s query.
These queries will become slow as the size of the book catalog grows or the number of
customer queries grows, as the query must scan through every book’s title to find partial
matches instead of using an index to look up the words.
Search engines like Solr really shine in solving problems like the ones listed above. Solr is
able to perform text analysis on content and on search queries to determine textually similar
words, understand and match on synonyms, remove unimportant words like “a”, “the”, and
“of”, and score each result based upon how well it matches the incoming query to ensure
that the best results are returned first and that your customers do not have to page through
countless less relevant results to find the content they were expecting. Solr accomplishes all
of this utilizing an index that maps content to documents instead of mapping documents to
content like a traditional database model. This “inverted index” is at the heart of how search
engines work.
3.1.3 The inverted index
Solr utilizes Lucene’s inverted index to power its fast searching capabilities, as well as
many of the additional bells and whistles it provides at query time. While we’ll not get into
many of the internal Lucene data structures in this book (we recommend picking up a copy of
Lucene in Action if you want a deeper dive), it is important to understand the high-level
structure of the inverted index.
Recalling our previous book-searching example, we can get a feel for what an index
mapping each term to each document would look like from table 3.5.

Table 3.5 Mapping of text from multiple documents into an inverted index. The first table
contains nine original documents with their content, while the second table represents an
inverted search index containing each of the terms from the original content mapped back to
the documents in which they can be found.

Original Documents
Doc #   Content Field
1       A Fun Guide to Cooking
2       Decorating Your Home
3       How to Raise a Child
4       Buying a New Car
5       Buying a New Home
6       The Beginner’s Guide to Buying a House
7       Purchasing a Home
8       Becoming a New Home owner
9       How to Buy Your First House

Lucene’s Inverted Index
Term          Doc #
a             1,3,4,5,6,7,8
…             …
becoming      8
beginner’s    6
buy           9
buying        4,5,6
car           4
child         3
cooking       1
decorating    2
first         9
fun           1
…             …
guide         1,6
home          2,5,7,8
house         6,9
how           3,9
new           4,5,8
owner         8
purchasing    7
raise         3
the           6
to            1,3,6,9
…             …
your          2,9

Table 3.5 demonstrates the process of mapping the content from documents into an
inverted index. While a traditional database representation of multiple documents would
contain a document’s id mapped to one or more content fields containing all of the
words/terms in that document, an inverted index inverts this model and maps each
word/term in the corpus to all of the documents in which it appears. You can tell from
looking at table 3.5 that the original input text was split on spaces and that each term was
transformed into lowercase text before being inserted into the inverted index, but everything
else remained the same. It is worth noting that any additional text transformations are
possible, not just these simple ones – terms can be modified, added, or even removed
during the content analysis process, which will be covered in detail in chapter 6.
A few final important details should be noted about the inverted index:
 All terms in the index map to one or more documents
 Terms in the inverted index are sorted in ascending alphanumeric order
 This view of the inverted index is greatly simplified – we’ll see in section 3.1.6 that
additional information can also be stored in the index to improve Solr’s querying and
scoring capabilities.
As you will see in the next section, the structure of Lucene’s inverted index allows many
powerful query capabilities that maximize both the speed and flexibility of keyword based
searching.
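To see how little machinery the simplified index in this section requires, here is a small Python sketch that builds one from the nine book titles. It mirrors only the two transformations described above (splitting on spaces and lowercasing); the function name build_inverted_index is our own, and real Lucene indexes are stored very differently on disk.

```python
from collections import defaultdict

# The nine book titles, keyed by document number.
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
    5: "Buying a New Home",
    6: "The Beginner's Guide to Buying a House",
    7: "Purchasing a Home",
    8: "Becoming a New Home owner",
    9: "How to Buy Your First House",
}


def build_inverted_index(docs):
    """Map each lowercased, space-separated term to the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures a document is listed once per term, even if the term repeats.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    # Return the terms in ascending order, as in the inverted index.
    return dict(sorted(index.items()))


index = build_inverted_index(DOCS)
print(index["home"])   # [2, 5, 7, 8]
print(index["new"])    # [4, 5, 8]
```

Looking up a term is now a single dictionary access instead of a scan over every title, which is the essential advantage of the inverted structure.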
3.1.4 Terms, phrases, & Boolean logic
Now that we’ve seen what content looks like in Lucene’s inverted index, let’s jump into
the mechanics of how a query is able to make use of this index to find matching documents.
In this section, we’ll go over the basics of looking up terms and phrases in this inverted
search index and utilizing Boolean logic and fuzzy queries to enhance these lookup
capabilities. Referring back to our book-searching example, let’s take a look at a simple
query of “new house”, as portrayed in figure 3.2.

Figure 3.2 Simple search to demonstrate nuances of query interpretation
We saw in the last section that all of the text in our content field was broken up into
individual terms when inserted into the Lucene index. Now that we have an incoming query,
we need to select from among several options for querying the index:
 Search for two different terms, “new” and “house”, requiring both to match
 Search for two different terms, “new” and “house”, requiring only one to match
 Search for the exact phrase “new house”
All of these options are perfectly valid approaches, depending upon your use case, and
thanks to Solr’s powerful querying capabilities built using Lucene, are very easy to
accomplish using boolean logic.
REQUIRED TERMS
Let’s examine the first option, breaking the query into multiple terms and requiring them all
to match. There are two equivalent ways to write this query using the standard query parser in
Solr:
 +new +house
 new AND house
These two are logically identical, and in fact, the second example gets parsed and
ultimately reduced down to the first example. The “+” symbol is a unary operator which
means that part of the query immediately following it is required to exist in any documents
matched, whereas the “AND” keyword is a binary operator which means that the part of the
query immediately preceding and the part of the query immediately following it are both
required.
OPTIONAL TERMS
In contrast to the “AND” operator, Solr also supports the “OR” binary operator, which
means that either the part of the query preceding or the part of the query following it is
required to exist in any documents matched. By default, Solr is also configured to treat any
part of the query without an explicit operator as an optional parameter, making the following
identical:
 new house
 new OR house
Solr’s Default Operator
While the default configuration in Solr assumes that a term or phrase by itself is an
optional term, this is configurable on a per-query basis using the q.op URL parameter with
many of Solr’s query handlers.
/select?q=new house&q.op=OR vs. /select?q=new house&q.op=AND
Do note, however, that if you change the default operator from OR to AND, parts of the
query without an operator will always be required unless you explicitly place an OR
between them to override the default AND operator.
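When building such requests from application code, the space in the query must be URL-encoded. The sketch below uses Python’s standard urllib.parse module to do that; the base URL assumes the example server on the default port 8983 with the collection1 core, and the function name select_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the example core's select handler; adjust for your server.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def select_url(query, default_operator="OR"):
    """Build a select URL with an explicit default operator (q.op)."""
    params = {"q": query, "q.op": default_operator}
    # urlencode escapes the space in the query as '+'.
    return BASE_URL + "?" + urlencode(params)


print(select_url("new house"))
# http://localhost:8983/solr/collection1/select?q=new+house&q.op=OR
print(select_url("new house", "AND"))
# http://localhost:8983/solr/collection1/select?q=new+house&q.op=AND
```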

NEGATED TERMS
In addition to making parts of a query optional and required, it is also possible to require
that they NOT exist in any matched documents through either of the following equivalent
queries:
• new house -rental
• new house NOT rental
In the queries above, no document which contains the word “rental” will be returned, only
documents matching “new” or “house.”
PHRASES
Solr does not just support searching single terms, however; it can also search for phrases,
which ensures that multiple terms appear together, in order:
• "new home" OR "new house"
• "3 bedrooms" AND "walk in closet" AND "granite countertops"
GROUPED EXPRESSIONS
In addition to the above query expressions, one final basic Boolean construct which Solr
supports is the grouping of terms, phrases, and other query expressions. The Solr query
syntax can represent arbitrarily complex queries through grouping terms together using
parentheses like the following examples:
• New AND (house OR (home NOT improvement NOT depot NOT grown))
• (+(buying purchasing -renting) +(home house residence -(+property -bedroom)))
The use of required terms, optional terms, negated terms, and grouped expressions
provides a very powerful and flexible set of query capabilities that allow arbitrarily complex
lookup operations against the search index, as we’ll see in the following section.
3.1.5 Finding sets of documents
With a basic understanding of terms, phrases, and Boolean queries in place, we can now dive
into exactly how Solr is able to utilize the internal Lucene inverted index to find matching
documents. Let us recall our index of books from earlier, re-produced in table 3.6.
Table 3.6 Inverted index of terms from a collection of book titles
Term          Document
a             1,3,4,5,6,7,8
…             …
becoming      8
beginner’s    6
buy           9
buying        4,5,6
car           4
child         3
cooking       1
decorating    2
first         9
fun           1
…             …
guide         1,6
home          2,5,7,8
house         6,9
how           3,9
new           4,5,8
owner         8
purchasing    7
raise         3
the           6
to            1,3,6,9
…             …
your          2,9

If a customer now passes in the query new home, how exactly is Solr able to find
documents matching that query given the above inverted index?
The answer is that the query new home is actually a two term query (there is a default
operator between new and home, remember?). As such both terms must be looked up
separately in the Lucene Index:
Term Documents
New => 4,5,8
Home => 2,5,7,8
Once the list of matching documents is found for each term, Lucene will perform set
operations to arrive at an appropriate final result set which matches the query. Assuming our
default operator is an OR, this query would result in a union of the result sets for both terms,
as pictured in the Venn diagram in figure 3.3.

Figure 3.3 Results returned from a union query using the “OR” operator
Likewise, if our query had been new AND home or if the default operator had been set to
AND, then the intersection of the results for both terms would have been calculated to return
a result set of only document 5 and document 8, as shown in figure 3.4.


Figure 3.4 Results returned from an intersection query using the “AND” operator
In addition to union and intersection queries, negating particular terms is also common.
Figure 3.5 demonstrates a breakdown of the results expected for each of the result set
permutations of this two-term search query (assuming a default OR operator).

Figure 3.5 Graphical representation of using common boolean query operators
As you can see, the ability to search for required terms, optional terms, negated terms,
and grouped terms provides a very powerful mechanism for looking up single keywords. As
we’ll see in the following section, Solr also provides the ability to query for multi-term
phrases.
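The set operations described in this section map directly onto Python’s built-in set type. The sketch below uses the document lists for new and home from table 3.6; it illustrates the logic only, as Lucene works with sorted lists of document ids rather than Python sets.

```python
# Document lists for the two query terms, taken from table 3.6.
postings = {
    "new": {4, 5, 8},
    "home": {2, 5, 7, 8},
}

new, home = postings["new"], postings["home"]

print(sorted(new | home))  # new OR home   (union)        -> [2, 4, 5, 7, 8]
print(sorted(new & home))  # new AND home  (intersection) -> [5, 8]
print(sorted(new - home))  # new NOT home  (difference)   -> [4]
print(sorted(home - new))  # home NOT new  (difference)   -> [2, 7]
```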
3.1.6 Phrase Queries and Term Positions
We saw earlier that, in addition to querying for terms in our Lucene index, it is also
possible to query Solr for phrases. Recalling that the index contains only individual terms,
however, you may be wondering exactly how we can search for full phrases.
The short answer is that each term in a phrase query is still looked up in the Lucene
Index individually, just as if the query new home had been submitted instead of “new home.”
Once the overlapping document set is found, however, a feature of the index that we
conveniently left out of our initial inverted index discussion is utilized. This feature, called
Term Positions, is the optional recording of the relative position of terms within a document.
Table 3.7 demonstrates how documents (listed first) map into an inverted index containing
Term Positions (listed second).
Table 3.7 Inverted Index with term positions

Original Documents
DOCUMENT #   CONTENT FIELD
1            A Fun Guide to Cooking
2            Decorating Your Home
3            How to Raise a Child
4            Buying a New Car
5            Buying a New Home
6            The Beginner’s Guide to Buying a House
7            Purchasing a Home
8            Becoming a New Home owner
9            How to Buy Your First House

Lucene’s Inverted Index with Term Positions
TERM          DOCUMENT   TERM POSITION
a             1          1
              3          4
              4          2
              …          …
cooking       1          5
decorating    2          1
your          2          2
              9          4
home          2          3
              5          4
              7          3
              8          4
…             …          …
new           4          3
              5          3
              8          3
car           4          4
the           6          1
beginner’s    6          2
house         6          7
              9          6
purchasing    7          1
…             …          …

From the inverted index in table 3.7, you can see that a query for new AND home would
yield a result containing documents 5 and 8. The term position actually goes one step
further, telling us where in the document each term appears. Table 3.8 shows a condensed
version of the inverted index focused upon only the primary terms under discussion: new
and home.
Table 3.8 Condensed inverted index with term positions
TERM         DOCUMENT   TERM POSITION
home         5          4
             8          4
new          5          3
             8          3
In this particular example, the term new happens to be in position 3, and the term home
happens to be in position 4 in both of the matched documents. This makes sense, as the
original titles of these books were “Buying a New Home” and “Becoming a New Home
Owner.” Because home directly follows new in both documents (position 4 comes right after
position 3), both documents match the phrase “new home.” We have thus seen the power of
term positions: they allow us to reconstruct the original positions of our indexed terms
within their respective documents, making it possible to search for specific phrases at
query time.
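The two-step lookup described above (intersect the documents for each term, then check
term positions) can be sketched in a few lines of Python. This is only an illustration of
the idea, not how Lucene implements it; the function names and the simple
lowercase-and-split tokenization are assumptions made for the example.

```python
from collections import defaultdict

# Book titles from table 3.7, keyed by document number.
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
    5: "Buying a New Home",
    6: "The Beginner’s Guide to Buying a House",
    7: "Purchasing a Home",
    8: "Becoming a New Home owner",
    9: "How to Buy Your First House",
}

def build_positional_index(docs):
    """Map each lowercased term to {doc_id: [positions]}, counting positions from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, phrase):
    """Return the ids of documents containing the terms of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Step 1: find the documents that contain every term (set intersection).
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Step 2: keep only documents where the terms sit at consecutive positions.
    matches = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                matches.append(doc_id)
                break
    return matches

INDEX = build_positional_index(DOCS)
print(phrase_search(INDEX, "new home"))  # [5, 8]
```

Reversing the phrase to "home new" returns no documents, even though the same two documents
contain both terms, because the positions no longer line up.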
Searching for specific phrases is not the only benefit provided by term positions, though.
We’ll see in the next section another great example of their use to improve our search
results quality.
3.1.7 Fuzzy matching
It’s not always possible to know up-front exactly what will be found in the Solr index for any
given search, so Solr provides the ability to perform several types of fuzzy matching queries.
Fuzzy matching is defined as the ability to perform inexact matches on terms in the search
index. For example, someone may want to search for any words that start with a particular
prefix (known as wildcard searching), may want to find spelling variations within one or two
characters (known as fuzzy or edit distance searching), or may even want to match two
terms within some maximum distance of each other (known as proximity searching). For use
cases in which multiple variations of the terms or phrases queried may exist across the
documents being searched, these fuzzy matching capabilities serve as a very powerful tool.
In this section, we’ll explore multiple fuzzy matching query capabilities in Solr, including
wildcard searching, range searching, edit distance searching, and proximity searching.
WILDCARD SEARCHING
One of the most common forms of fuzzy matching in Solr is the use of wildcards.
Suppose you wanted to find any documents containing terms which start with the letters
“offic”. One way to do this would be to create a query which enumerates all of the possible
variations:

Query: office OR officer OR official OR officiate OR …
Requiring that this list of words be turned into a query up-front can be an unreasonable
expectation for customers, or even for you on behalf of your customers. Since all of the
variations you could match already exist in the Solr index, you can use the asterisk (*)
wildcard character to perform this same function for you:
Query: offi* --> Matches office, officer, official, etc.
In addition to matching the end of a term, a wildcard character can be used inside of the
search term, as well, such as if you wanted to match both “officer” and “offer”:
Query: off*r --> Matches offer, officer, officiator, etc.
The asterisk wildcard (*) shown above matches zero or more characters in a term. If you
want to match only a single character, you can make use of the question mark (?) for this
purpose:
Query: off?r --> Matches offer, but NOT officer
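To make the wildcard behavior concrete, here is a minimal Python sketch that expands a
wildcard pattern against a sorted list of terms. The term list and function names are
hypothetical, and Lucene's actual term dictionary is far more sophisticated; the sketch
only shows why a literal prefix makes the lookup cheap (seek to the prefix, then scan).

```python
import bisect
import re

# A hypothetical, sorted term dictionary.
TERMS = sorted(["of", "offer", "office", "officer", "official",
                "officiate", "often", "open"])

def prefix_matches(sorted_terms, prefix):
    """Seek to the first term >= prefix, then scan while terms still share the prefix."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def wildcard_matches(sorted_terms, pattern):
    """Expand a pattern where * means zero or more characters and ? means exactly one."""
    # Only the literal characters before the first wildcard can narrow the scan.
    literal_prefix = re.split(r"[*?]", pattern, maxsplit=1)[0]
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    compiled = re.compile(regex)
    return [term for term in prefix_matches(sorted_terms, literal_prefix)
            if compiled.fullmatch(term)]

print(wildcard_matches(TERMS, "offi*"))  # ['office', 'officer', 'official', 'officiate']
print(wildcard_matches(TERMS, "off*r"))  # ['offer', 'officer']
print(wildcard_matches(TERMS, "off?r"))  # ['offer']
```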

Leading Wildcards
While the wildcard functionality in Solr is fairly robust, it is only possible, by default, to
use wildcards inside or at the end of a term. If you needed to match all terms ending in
“ing” (like caring, liking, and smiling), for example, you would receive an exception if you
tried running this search:
Query: *ing
The reason for this is that Solr searches through the inverted index for terms which begin
with the characters provided before the wildcard, and without some initial characters to
begin that lookup, it is too expensive to walk the entire index to find matching terms.
If you really need to be able to search using these leading wildcards, a solution to this
problem does exist, but it will require you to perform some additional configuration. The
solution is achieved by adding the ReversedWildcardFilterFactory to your field
type’s analysis chain (configuring text processing will be discussed in chapter 5).
The ReversedWildcardFilterFactory works by double storing the indexed
content in the Solr index (once for the text of each term, and once for the reversed
text of each term):
Index: caring liking smiling
#gnirac #gnikil #gnilims
When a query is submitted with the leading wildcard of *ing, Solr knows to search
on the reversed version, thus getting around Solr’s inability to perform leading wildcard
searches by turning them into standard wildcard searches on the reversed content.
Do note, however, that turning this feature on requires dual-storing all terms in the Solr
index, increasing the size of the index and thus slowing overall searches down somewhat.
Turning this capability on is thus not recommended unless it is needed within your search
application.
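The reversal trick can be sketched as follows. This is a simplified illustration of the
idea rather than the filter's actual implementation: the "#" marker mirrors the example
above, and the function names are hypothetical.

```python
import bisect

MARKER = "#"  # marker prefix for reversed terms, as in the example above

def index_with_reversed(terms):
    """Store every term twice: once as-is and once reversed behind a marker."""
    entries = set(terms) | {MARKER + term[::-1] for term in terms}
    return sorted(entries)

def prefix_scan(sorted_terms, prefix):
    """Standard prefix lookup: seek to the prefix, then scan while it still matches."""
    start = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        hits.append(term)
    return hits

def leading_wildcard_search(sorted_terms, pattern):
    """Answer a query of the form '*suffix' as a prefix query on the reversed terms."""
    if not pattern.startswith("*"):
        raise ValueError("expected a pattern of the form '*suffix'")
    reversed_prefix = MARKER + pattern[1:][::-1]   # '*ing' becomes '#gni'
    hits = prefix_scan(sorted_terms, reversed_prefix)
    # Strip the marker and reverse again to recover the original terms.
    return sorted(hit[len(MARKER):][::-1] for hit in hits)

INDEX = index_with_reversed(["care", "caring", "liking", "smiling"])
print(leading_wildcard_search(INDEX, "*ing"))  # ['caring', 'liking', 'smiling']
```

Note that the index now holds eight entries for four terms, which is the storage cost
the text warns about.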

One last important point to note about wildcard searching is that wildcards are only
meant to work on individual search terms, not on phrase searches, as demonstrated by the
following example:
Works: softwar* eng?neering
Does Not work: “softwar* eng?neering”
If you need the ability to perform wildcard searches within a phrase, you will have to
store the entire phrase in the index as a single term, which you should feel comfortable
doing by the end of chapter 5.
RANGE SEARCHING
Solr also provides the ability to search for values which fall between known values. This can
be useful when you want to search for a particular subset of documents falling within a
range. For example, if you only wanted to match documents created in the six months
between February 2nd, 2012 and August 2nd, 2012, you could perform the following search:
Query: created:[2012-02-02T00:00:00Z TO 2012-08-02T00:00:00Z]
This range query format also works on other field types:
Query: yearsOld:[18 TO 21] //matches 18, 19, 20, 21
Query: title:[boat TO boulder] //matches boat, boil, book, boulder, etc.
Query: price:[12.99 TO 14.99] //matches 12.99, 13.000009, 14.99, etc.
Each of the above range queries surrounds the range with square brackets, which is the
“inclusive” range syntax. Solr also supports exclusive range searching through use of curly
braces:
Query: yearsOld:{18 TO 21} //Matches 19 and 20 but NOT 18 or 21
Though it may look odd syntactically, Solr also provides the ability to mix and match
inclusive and exclusive bounds:
Query: yearsOld:[18 TO 21} //matches 18, 19, 20, but NOT 21
While range searches perform more slowly than searches on a single term, they provide
tremendous flexibility to find documents matching dynamically defined groups of values
which lie within a particular range within the Solr index. It is important to note that a
range query relies on the order in which terms are found in the Solr index, which is a
lexicographic (character-by-character) sorted order. Thus if you were to create a text field
containing integers, those integers would actually be found in the following order: 1, 11,
111, 12, 120, 13, etc. Numeric types in Solr, at least the ones we’ll recommend in the
coming chapters, compensate for this by indexing the incoming content in a special way, but
it is important to understand that the sort order within the Solr index is dependent upon how
the data within the field is processed when it is written to the Solr index. We’ll dive much
deeper into this kind of content analysis in chapter 5.
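The difference between text ordering and numeric ordering is easy to reproduce outside
of Solr. The short Python sketch below uses plain lists (not a Solr index) and a
hypothetical in_range helper that mimics the inclusive and exclusive bounds shown earlier.

```python
values = [1, 11, 111, 12, 120, 13]

# Indexed as text, the values sort character by character.
text_order = sorted(str(v) for v in values)
# Indexed as numbers, they sort by magnitude.
numeric_order = sorted(values)

def in_range(value, low, high, include_low=True, include_high=True):
    """Mimic [low TO high] (inclusive) and {low TO high} (exclusive) bounds."""
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below

print(text_order)     # ['1', '11', '111', '12', '120', '13']
print(numeric_order)  # [1, 11, 12, 13, 111, 120]

# The "same" range of 12 to 13 matches different values in each ordering.
print([t for t in text_order if in_range(t, "12", "13")])   # ['12', '120', '13']
print([v for v in numeric_order if in_range(v, 12, 13)])    # [12, 13]

# Mixed bounds, as in yearsOld:[18 TO 21}
print([age for age in range(17, 23)
       if in_range(age, 18, 21, include_high=False)])       # [18, 19, 20]
```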
FUZZY/EDIT DISTANCE SEARCHING
For many search applications, it is important not only to match a customer’s text exactly,
but also to allow some flexibility for handling spelling errors or even slight variations in
correct spellings. Solr provides the ability to handle character variations using edit distance
measurements based upon Damerau-Levenshtein distance, whose single-character edits account
for more than 80% of all human misspellings (see footnote 1).

Solr achieves these fuzzy edit distance searches through the use of the tilde (~)
character as follows:
Query: administrator~
Matches: adminstrator, administrater, administratior, etc.
The above query matches both the original term (administrator) and any other terms
within two “edit distances” of the original term. An edit distance here is defined as an
insertion, a deletion, a substitution, or a transposition of characters. The term
adminstrator (missing the “i” in the 6th position) is one edit distance away from
administrator because it has one character deletion. Likewise the term sadministrator is one
edit distance away because it has one insertion (the “s” that was added to the beginning),
and the term administratro is one edit distance away because it has transposed the last two
characters (“or” became “ro”).
It is also possible to modify the strictness of edit distance searches by specifying the
maximum edit distance allowed after the tilde:
Query: administrator~1 --> matches within one edit distance
Query: administrator~2 --> matches within two edit distances
(this is the default if no edit distance is provided)
Query: administrator~N --> matches within N edit distances
Please note that searches requesting edit distances above 2 become increasingly slow
and are more likely to match unexpected terms. Term searches with edit distances of 1 or
2 are performed using a very efficient Levenshtein Automaton, but Solr will fall back to a
slower edit distance implementation for edit distances above 2.
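For readers who want to see how such a distance is counted, here is a minimal Python
sketch of the "optimal string alignment" variant of Damerau-Levenshtein distance, using a
straightforward dynamic-programming table. It is not the Levenshtein Automaton that Solr
uses, and the fuzzy_matches helper and its vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Count insertions, deletions, substitutions, and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def fuzzy_matches(term, vocabulary, max_edits=2):
    """Return the vocabulary terms within max_edits of term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]

print(edit_distance("administrator", "adminstrator"))    # 1 (deletion)
print(edit_distance("administrator", "sadministrator"))  # 1 (insertion)
print(edit_distance("administrator", "administratro"))   # 1 (transposition)
```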
PROXIMITY SEARCHING
In the last section, we saw that edit distances could be used to find terms which were “close”
to the original term, but not exactly the same. This edit distance principle is applicable
beyond just searching for alternate characters within a term – it can also be applied between
terms for variations of phrases.
Let’s say that you want to search across a Solr index of employee profiles for
executives within your company. One way to do this would be to enumerate each of the
possible executive titles within your company:
Query: “chief executive officer” OR “chief financial officer” OR
“chief marketing officer” OR “chief technology officer” OR …
Of course, this assumes you know all of the possible titles, which may be unrealistic if
you’re searching across other companies with which you’re poorly acquainted, or if you have
a more challenging use case. Another possible strategy is to search for each term
independently:

1 Fred J. Damerau, “A Technique for Computer Detection and Correction of Spelling Errors,”
Communications of the ACM 7(3):171-176 (1964).
Query: chief AND officer
This should match all of the possible use cases, but it will also match any document which
contains both of those words ANYWHERE in the document. One problematic example would
be a document containing the text “One chief concern arising from the incident was the
safety of the police officer on duty.” This document is clearly a poor match for our use case,
but it and similar bad matches would be returned given the above query.
Thankfully, Solr provides a simple solution to this problem: proximity searching. In the
above example, a good strategy would be to ask Solr to bring back all documents which
contain the term “chief” near the term “officer”. This can be accomplished through the
following example queries:
Query: “chief officer”~1
chief and officer must be a maximum of one edit distance (one term position move) away.
Examples: “chief executive officer”, “chief financial officer”
Query: “chief officer”~2
Examples: “chief business development officer”, “officer chief” (swapping two adjacent
terms counts as two moves)
Query: “chief officer”~N
Finds chief within N edit distances of officer.
The edit distances above can be seen as nothing more than sloppy phrase searches. In
fact, an exact phrase search of “chief development officer” could easily be rewritten as “chief
development officer”~0. These queries will yield the exact same results, because an edit
distance of zero is the very definition of an exact phrase search. Both mechanisms make use
of the term positions stored in the Solr index (which we discussed in section 3.1.6) to
calculate the edit distances.
3.1.8 Quick Recap
At this point, you should have a basic grasp of how Solr stores information in its inverted
index and queries that index to find matching documents. This includes looking up terms,
using Boolean logic to create arbitrarily complex queries, and combining the results of
each of the term lookups through set operations. We also discussed how Solr stores term
positions and is able to use those to find exact phrases and even fuzzy phrase matches
through the use of proximity queries and edit distance calculations. For fuzzy searching
within single terms, we examined the use of wildcards and edit distance searching to find
misspellings or very similar words. While Solr’s query capabilities will be expanded upon in
chapter 6, these key operations serve as the foundation for generating most Solr queries.
They also prepare us nicely with the needed background for our discussion of Solr’s keyword
relevancy scoring model, which we’ll discuss in the next section.
3.2 Relevancy
Finding matching documents is the first critical step in creating a great search experience,
but it is only the first step. No customer is willing to wade through page after page of search
results to find the document he or she is seeking. In my experience, only 10% of customers
are willing to go beyond the first page of any given search on most websites, and only 1%
are willing to navigate to the third page.
Solr does a very good job out of the box at ensuring the ordering of search results brings
back the best matches at the top of the results list. It does this by calculating a relevancy
score for each document and then sorting the search results from the top score to the
lowest. This section will provide an overview of how these relevancy scores are calculated
and what factors influence them. We’ll dig into both the theory behind Solr’s default
relevancy calculation and also into the specific calculations used to calculate the relevancy
scores, providing intuitive examples along the way to ensure you leave this section with a
solid understanding of what, to many, can be the most elusive aspect of working with Solr.
We’ll start by discussing the Similarity class, which is responsible for most aspects of a
query’s relevancy score calculation.
3.2.1 Default similarity
Solr’s relevancy scores are based upon a Similarity class which can be defined on a per-field
basis in Solr’s schema.xml (discussed later in chapter 5). The Similarity class is a Java class
that defines how a relevancy score is calculated based upon the results of a query. While you
can choose from multiple similarity classes, or even write your own, it is important to
understand Solr’s default Similarity implementation and the theory behind why it works so
well.
By default, Solr uses Lucene’s (appropriately named) DefaultSimilarity class. This class
utilizes a two-pass model to calculate similarity. First, it makes use of a Boolean model
(described in the last section) to filter out any documents that do not match the customer’s
query. Then, it utilizes a Vector Space model for scoring, drawing the query as a vector, as
well as an additional vector for each document. The similarity score for each document is
based upon the cosine of the angle between the query vector and that document’s vector, as
depicted in figure 3.6.

Figure 3.6 Cosine Similarity of Term Vectors. The query term vector, v(q), is closer to the document 1
term vector, v(d1), than to the document 2 term vector, v(d2), as measured by the cosine of the angle
between each document term vector and the query vector. The smaller the angle between the query term
vector and a document term vector, the more similar the query and the document are considered to be.
In this Vector Space scoring model a term vector is calculated for each document and is
compared with the corresponding term vector for the query. The similarity of two vectors can
be found by calculating a cosine between them: with a cosine of 1 being a perfect match and
a cosine of zero representing no similarity. More intuitively, the closer the two vectors appear
together, as in the image above, the more similar they are. Thus, the smaller the angle
between vectors, or the larger the cosine, the closer the match.
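As a minimal illustration of the cosine calculation, the Python sketch below compares a
query vector with three document vectors. The three-term vocabulary and the raw 0/1
weights are assumptions made for the example; Solr's real vectors are weighted by the
factors described in the rest of this section.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: one dimension per term, in the order (new, home, car).
query = [1, 1, 0]   # "new home"
doc_a = [1, 1, 0]   # contains "new" and "home"
doc_b = [1, 0, 1]   # contains "new" and "car"
doc_c = [0, 0, 1]   # contains only "car"

for name, vector in (("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)):
    print(name, round(cosine_similarity(query, vector), 2))
```

Here doc_a points in the same direction as the query (cosine of about 1), doc_b shares
one of two terms (cosine of about 0.5), and doc_c shares none (cosine of 0).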
Of course, the most challenging part of this whole process is actually coming up with
reasonable vectors which represent the important features of the query and of each
document for comparison. Let’s take a look at the entire, complicated relevancy formula for
the DefaultSimilarity class. We’ll then go line by line to explain intuitively what each
component of the relevancy formula is attempting to accomplish.
Given a query (q) and a document (d), the similarity score for the document to the query
can be calculated as shown in figure 3.7.

Figure 3.7 DefaultSimilarity Scoring Algorithm. Each component in this formula will be explained in detail in
the following sections.
Wow, that equation can be quite overwhelming, especially at first glance. Fortunately, it
is much more intuitive when each of the pieces is broken down. The math is presented
above for reference, but you will likely never need to really dig into the full equation unless
you decide to override the similarity for your search application.
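Since the image for figure 3.7 does not survive in plain text, the following LaTeX sketch
restates the scoring function as it is documented for Lucene's TF-IDF similarity in the
3.x/4.x line. Treat it as a reference summary under that assumption rather than a
transcription of the figure; the helper quantities (frequency, numDocs, docFreq,
sumOfSquaredWeights, overlap, maxOverlap, numTerms) follow the naming in the Lucene
documentation.

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
  t.\mathrm{getBoost}()\cdot\mathrm{norm}(t,d)\Big)
\]
\[
\mathrm{tf}(t \in d)=\sqrt{\mathrm{frequency}}, \qquad
\mathrm{idf}(t)=1+\ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}+1}\right)
\]
\[
\mathrm{queryNorm}(q)=\frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}}, \qquad
\mathrm{coord}(q,d)=\frac{\mathrm{overlap}}{\mathrm{maxOverlap}}
\]
\[
\mathrm{norm}(t,d)=d.\mathrm{getBoost}()\cdot\mathrm{lengthNorm}(f)\cdot
  \prod_{f \in d} f.\mathrm{getBoost}(), \qquad
\mathrm{lengthNorm}(f)=\frac{1}{\sqrt{\mathrm{numTerms}}}
\]
```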
At a high level, the important concepts in this formula are Term Frequency (tf),
Inverse Document Frequency (idf), Term Boosts (t.getBoost), the Field Normalization (norm),
the Coordination Factor (coord), and the Query Normalization (queryNorm). Let’s dive into
the purpose of each of these.
3.2.2 Term Frequency (tf)
Term Frequency is a measure of how often a particular term appears in a matching
document, and is an indication of how “well” a document matches the term.
If you were searching through a search index filled with newspaper articles for an article
on the President of the United States, would you prefer to find articles which only mention
the president once, or articles which discuss the president consistently throughout the
article? What if an article just happens to contain the phrases “President” and “United States”
each one time (perhaps out of context) – should it be considered as relevant as an article
that contains these phrases multiple times?
Take the example from table 3.9. Clearly the second article is more relevant than
the first, and the appearance of the phrases “President” and “United States” multiple times
throughout the article provides a strong indication that the content of the second article is
more closely related to this query.
Table 3.9 Documents mentioning “President” and “United States”
Article 1 (less relevant):
Dr. Smolla is the president of Furman University, one of the top liberal arts universities
in the southern United States. In 2011, Furman was ranked the 2nd most rigorous college in
the country by Newsweek magazine, behind St. John’s College (NM). Furman also consistently
ranks among the most beautiful campuses to visit and ranks among the top 50 liberal arts
colleges nation-wide each year.

Article 2 (more relevant):
Today, international leaders met with the President of the United States to discuss options
for dealing with growing instability in global financial markets. President Obama indicated
that the United States is cautiously optimistic about the potential for significant
improvements in several struggling world economies pending the results of upcoming
elections. The President indicated that the United States will take whatever actions
necessary to promote continued stability in the global financial markets.
In general, a document is considered to be more relevant for a particular topic (or query
term) if the topic appears multiple times.
This is the basic premise behind the TF (Term Frequency) component of the default Solr
relevancy formula. The more times the search term appears within a document, the more
relevant that document is considered. It is unlikely, however, that 10 appearances of a term
make the document 10 times more relevant, so TF is actually calculated using the
square root of the number of times the search term appears within the document, in order to
diminish the additional contribution to the relevancy score for each subsequent appearance
of the search term.
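The dampening effect of the square root is easy to see with a few numbers. This tiny
Python sketch (the function name tf is just for illustration) applies the square root
described above to different occurrence counts.

```python
import math

def tf(occurrences):
    """Term frequency factor: the square root of the number of occurrences."""
    return math.sqrt(occurrences)

for count in (1, 4, 10):
    print(count, round(tf(count), 2))
# 1 1.0
# 4 2.0
# 10 3.16
```

Ten occurrences therefore contribute a little more than three times the weight of a
single occurrence, not ten times.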
3.2.3 Inverse Document Frequency (idf)
Not all search terms are created equal. Imagine if someone were to search for the book The
Cat in the Hat by Dr. Seuss, and the top results returned were those which included a high
term frequency for the words “the” and “in” instead of “cat” and “hat”. Common sense would
indicate that words which are more rare across all documents are likely to be better matches
for a query than terms which are more common.
Inverse Document Frequency (IDF) is a measure of how “rare” a search term is, and it is
calculated by finding the document frequency (how many total documents the search term
appears within), and calculating its inverse (see the full formula in section 3.2.1 for the
actual calculation).
Because the Inverse Document Frequency appears for the term in both the query and the
document, it is squared in the relevancy formula.
Figure 3.8 shows a visual example of the “rare-ness” of each of the words in the title
“The Cat in the Hat,” with a higher IDF being represented as a larger term.
Figure 3.8 Visual depiction of the relative significance of terms as measured by Inverse Document
Frequency. The terms which are more rare are depicted as larger, indicating a larger Inverse Document
Frequency.
Likewise, if someone were searching for a profile for “an experienced Solr development
team lead,” we wouldn’t expect the documents which best match the words “an”, “team”, or
“experienced” to rank highest. Instead, we would expect the important terms to resemble the
largest terms in figure 3.9.

Figure 3.9 Another demonstration of relative score of terms derived from Inverse Document Frequency.
Once again, a higher Inverse Document Frequency indicates a more rare and more relevant term, which is
depicted here using larger text.
Clearly the user is looking for someone who knows Solr who can be a team lead, so these
terms stand out with considerably more weight when found in any document.
Term Frequency and Inverse Document Frequency, when multiplied together in the
relevancy calculation, provide a nice counter-balance to each other. The term frequency
elevates terms which appear multiple times within a document, while the inverse document
frequency penalizes those terms which appear commonly across many documents. Thus
common words such as “the”, “an”, and “of” in the English language ultimately yield very low
scores, even though they may appear many times in any given document.
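The counter-balance between the two factors can be sketched numerically. The Python
example below assumes the idf formula documented for Lucene's default TF-IDF similarity,
1 + ln(numDocs / (docFreq + 1)), and uses made-up document counts purely for
illustration.

```python
import math

def tf(occurrences):
    """Square root of the number of times the term appears in the document."""
    return math.sqrt(occurrences)

def idf(doc_freq, num_docs):
    """Assumed form of the default idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_weight(occurrences, doc_freq, num_docs):
    """tf multiplied by idf squared, as in the relevancy formula."""
    return tf(occurrences) * idf(doc_freq, num_docs) ** 2

NUM_DOCS = 1000  # hypothetical index size

# "the" appears 16 times in the document but occurs in 990 of 1000 documents;
# "cat" appears only 4 times but occurs in just 20 documents.
weight_the = term_weight(occurrences=16, doc_freq=990, num_docs=NUM_DOCS)
weight_cat = term_weight(occurrences=4, doc_freq=20, num_docs=NUM_DOCS)

print(round(weight_the, 2), round(weight_cat, 2))
```

Even though “the” appears four times as often in the document, the rarer term “cat”
ends up with a weight roughly ten times larger in this example.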
3.2.4 Boosting (t.getBoost)
It is not necessary to leave all aspects of your relevancy calculations up to Solr. If you
have domain knowledge about your content (for example, that certain fields or terms are more
or less important than others), you can supply boosts at either indexing time or query time
to ensure that the weights of those fields or terms are adjusted accordingly.
Query-time boosting is the most flexible and easiest to understand form of boosting,
utilizing the following syntax:
Query: title:(solr in action)^2.5 description:(solr in action)
The above example provides a boost of 2.5 to the search phrase in the title field, while
providing the default boost of 1.0 to the description field. It is important to note that, unless
otherwise specified, all terms receive a default boost of 1.0 (which means multiply the
calculated score by 1, or leave it as calculated).
Query boosts can also be used to “penalize” certain terms if a boost of less than 1.0 is
used:
Query: title:(solr in action) description:(solr in action)^0.2
Do note, however, that a boost of less than 1 is still a positive boost. It does not
penalize the document in absolute terms; it simply boosts the term less than the normal
boost of 1 that it otherwise would have received.
These query-time boosts can be applied to any part of the query:
Query: title:(solr^2 in^.01 action^1.5)^3 OR “solr in action”^2.5
Certain query parsers even allow boosts to be applied to an entire field by default, which
we’ll cover further in chapter 6.
In addition to query time boosting, it is also possible to boost documents or fields within
documents at index time. These boosts are factored into the Field Norm, which is covered in
the following section.
3.2.5 Normalization Factors
The default Solr relevancy formula calculates three kinds of normalization factors
(norms): Field Norms, Query Norms, and the Coord Factor.
FIELD NORMS (NORM)
The Field Normalization Factor (Field Norm) is a combination of factors describing the
importance of a particular field on a per-document basis. Field Norms are calculated at index
time and are represented as an additional byte per field in the Solr index. This byte packs a
lot of information in: the boost set on the document when indexed, the boost set on the field
when indexed, and a length normalization factor which penalizes longer documents and helps
shorter documents (under the assumption that finding any given keyword in a longer
document is more likely and thus less relevant). The actual field norms are calculated using
the formula in figure 3.10.

Figure 3.10 Field Norms Calculation. Field Norms factor in the matching document’s boost, the matching
field’s boost, and a length normalization factor which penalizes longer documents. These three fairly
separate pieces of data are stored as a single byte in the Solr index, which is the only reason they are
combined together into this single “field norms” variable.
The d.getBoost() component represents the boost applied to the document when it is sent
to Solr, and the f.getBoost() component represents the boost applied to the field for which
the norm is being calculated. It is worth mentioning that Solr actually allows the same field
to be added to a document multiple times (performing some magic under the covers to
actually map each separate entry for the field into the same underlying Lucene field).
Because duplicate fields are ultimately mapped to the same underlying field, if multiple
copies of the field exist, the f.getBoost() actually becomes the product of the field boost for
each of the multiple fields with the same name.
If the title field were added to a document 3 times, for example, once with a boost of 3,
once with a boost of 1, and once with a boost of .5, then the f.getBoost() for each of the
three fields (or one underlying field) would be:
Boost: (3)· (1) · (.5) = 1.5
In addition to the index time boosts, a parameter called the Length Norm is also factored
into the Field Norm. The Length Norm is computed by taking one divided by the square root of
the number of terms in the field for which it is calculated, so fields containing more terms
receive a smaller norm.
The purpose of the Length Norm is to adjust for documents of varying lengths, such that
longer documents do not maintain an unfair advantage simply by having a larger likelihood
of containing any particular term a given number of times.
For example, let’s say that you perform a search for the keyword “Beijing”. Would you
prefer for a news article to come up that mentions Beijing 5 times, or would you rather have
an obscure 300 page book come back which also happens to mention Beijing only 5 times?
Common sense would indicate that the document in which Beijing is proportionally more
prevalent is probably a better match, everything else being equal. This is what the Length
Norm attempts to take into account.
The overall Field Norm, calculated from the product of the document boost, the field
boost, and the length norm, is encoded into a single byte which is stored in the Solr index.
Because the amount of information being encoded from this product is larger than a single
byte can store, some precision loss does occur during this encoding. In reality, this loss of
fidelity generally has negligible effects on overall relevancy, as it is usually only big
differences which matter given the variance in all other relevancy criteria.
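The pieces of the Field Norm can be put together in a short Python sketch. It assumes the
length norm is one over the square root of the number of terms (as in Lucene's default
similarity, so that longer fields receive a smaller norm) and it ignores the lossy
single-byte encoding; the function names and the term counts are hypothetical.

```python
import math

def length_norm(num_terms):
    """Assumed length norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

def field_norm(doc_boost, field_boosts, num_terms):
    """Document boost x product of the field boosts x length norm."""
    combined_field_boost = 1.0
    for boost in field_boosts:
        combined_field_boost *= boost
    return doc_boost * combined_field_boost * length_norm(num_terms)

# The title field added three times with boosts of 3, 1, and .5 (product: 1.5),
# on a document with the default boost of 1 and a 4-term title.
print(field_norm(doc_boost=1.0, field_boosts=[3, 1, 0.5], num_terms=4))  # 0.75

# With equal boosts, a short news article outranks a very long book on length alone.
article = field_norm(1.0, [1.0], num_terms=500)
book = field_norm(1.0, [1.0], num_terms=100000)
print(article > book)  # True
```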
QUERY NORMS (QUERYNORM)
The Query Norm is one of the least interesting factors in the default Solr relevancy
calculation. It does not affect the overall relevancy ordering, as the same queryNorm is
applied to all documents. It merely serves as a normalization factor to attempt to make
scores between queries comparable. It utilizes the sum of the squared weights for each of
the query terms to generate this factor, which is multiplied with the rest of the relevancy
score to normalize it. The Query Norm should not affect the relative weighting of each
document that matches a given query.
THE COORD FACTOR (COORD)
One final normalization factor taken into account in the default Solr relevancy calculation is
the Coord factor. The Coord factor’s role is to measure how much of the query each
document matches. For example, let’s say you perform the following search:
Query: Accountant AND (“San Francisco” OR “New York” OR “Paris”)
You may prefer to find an accountant with offices in each of the cities you mentioned, as
opposed to an accountant who has happened to mention “New York” many times over and
over again.
If all four of these terms match, the Coord factor is 4/4. If three match, the Coord factor
is 3/4, and if only one matches, then it is 1/4.
The idea behind the Coord factor is that, all things being equal, documents which
contain more of the terms in the query should score higher than documents which only
match a few.
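The arithmetic above can be sketched in a few lines. This is a simplified illustration that treats the four query terms as a flat list and ignores how nested boolean clauses are actually combined; the function name coord and the lowercase term strings are our own choices, not Solr API names.

```python
def coord(matched_terms, query_terms):
    """Fraction of the distinct query terms that a document matched."""
    query = set(query_terms)
    return len(query & set(matched_terms)) / len(query)

query_terms = ["accountant", "san francisco", "new york", "paris"]

print(coord(query_terms, query_terms))                          # 1.0  (4/4)
print(coord(["accountant", "new york", "paris"], query_terms))  # 0.75 (3/4)
print(coord(["accountant"], query_terms))                       # 0.25 (1/4)
```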
We have now discussed all of the major components of the default relevancy algorithm in
Solr. We discussed term frequency and inverse document frequency, the two most important
components of the relevancy score calculation. We then went through boosting and
normalization factors, which refine the scores calculated by term frequency and inverse
document frequency alone. With a solid conceptual understanding and a detailed overview of
the specific components of the relevancy scoring formula, we’re now set to discuss Precision
and Recall, two important aspects for measuring the overall quality of the result sets
returned from any search system.
3.3 Precision and recall
The information retrieval concepts of Precision (a measure of accuracy) and Recall (a
measure of thoroughness) are very simple to explain, but are also very important to
understand when building any search application or understanding why the results being
returned are not meeting your business requirements. We’ll provide a brief summary here of
each of these key concepts.
3.3.1 Precision
The precision of a search results set (the documents which match a query) is a measurement
attempting to answer the question “Were the documents which came back the
ones I was looking for?”
More technically, precision is defined as a value between 0.0 and 1.0:
#Correct Matches / #Total Results Returned
Let’s return to our earlier example from section 3.1 about searching for a book on the topic
of buying a new home. We’ve determined by our internal company measurements that
the books in table 3.10 would be considered good matches for such a query.
Table 3.10 Relevant List of Books
Relevant Books
1 The Beginner’s Guide to Buying a House
2 How to Buy Your First House
3 Purchasing a Home
All other book titles, for purposes of this example, would not be considered relevant for
someone interested in purchasing a new home. A few examples are listed in table 3.11.
Table 3.11 Irrelevant List of Books
Irrelevant Books
4 A Fun Guide to Cooking
5 How to Raise a Child
6 Buying a New Car
For this example, if all of the documents which were supposed to be returned (documents
1, 2, and 3) were returned, and no more, the precision of this query would be 1.0 (3 Correct
Matches / 3 Total Results Returned), which would be perfect.
If, however, all six results came back, the precision would only be 0.5, since half of the results
which were returned were not correct – that is, they were not “precise.”
Likewise, if only one result came back from the relevant list (number 2, for example), the
precision would still be 1.0, because all of the results which came back were correct. As you
can see, Precision is a measure of how “good” each of the results of a query is, but it pays
no attention to how thorough they are – a query which returns a single correct document
out of a million correct documents is still considered perfectly precise.
Because Precision only considers the overall accuracy of the results that come back and
not the comprehensiveness of the result set, we need to counterbalance the Precision
measurement with one that takes thoroughness into account – Recall.
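The three precision scenarios above can be checked with a short sketch. The function name precision and the use of the book numbers from tables 3.10 and 3.11 as document ids are our own illustrative choices.

```python
def precision(returned, relevant):
    """#Correct Matches / #Total Results Returned."""
    returned = set(returned)
    if not returned:
        raise ValueError("precision is undefined for an empty result set")
    correct = returned & set(relevant)
    return len(correct) / len(returned)

relevant_books = {1, 2, 3}  # the books listed in table 3.10

print(precision({1, 2, 3}, relevant_books))           # 1.0
print(precision({1, 2, 3, 4, 5, 6}, relevant_books))  # 0.5
print(precision({2}, relevant_books))                 # 1.0
```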
3.3.2 Recall
Whereas Precision measures how “correct” each of the results being returned is, Recall is
a measure of how thorough the search results are. Recall essentially answers the
question “How many of the correct documents were returned?”
More technically, Recall is defined as:
#Correct Matches / (#Correct Matches + #Missed Matches)
To demonstrate the Recall calculation, our example showing relevant
books and irrelevant books from the last section has been recreated in table 3.12 for
reference purposes.
Table 3.12 List of relevant and irrelevant books
Relevant Books
1 The Beginner’s Guide to Buying a House
2 How to Buy Your First House
3 Purchasing a Home

Irrelevant Books
4 A Fun Guide to Cooking
5 How to Raise a Child
6 Buying a New Car
If all six documents were returned for a search query, the Recall would actually be 1.0,
since all correct matches were found and there were no missed matches (whereas we saw
earlier that the Precision would be 0.5).
Likewise, if only document 1 were returned, the Recall would only be 1/3, since 2 of the
documents that should have been returned/recalled were missing.
This underscores the critical difference between Precision and Recall – Precision is high when
the results returned are “correct”, whereas Recall is high when the correct results are
“present.” Recall does not care that all of the results are correct, whereas Precision does not
care that all of the results are present.
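The Recall formula can be sketched the same way. As before, the function name recall and the use of the book numbers from table 3.12 as document ids are illustrative assumptions.

```python
def recall(returned, relevant):
    """#Correct Matches / (#Correct Matches + #Missed Matches)."""
    returned, relevant = set(returned), set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    correct = len(relevant & returned)  # relevant documents that came back
    missed = len(relevant - returned)   # relevant documents that did not
    return correct / (correct + missed)

relevant_books = {1, 2, 3}  # the relevant books in table 3.12

print(recall({1, 2, 3, 4, 5, 6}, relevant_books))  # 1.0, although precision is only 0.5
print(recall({1}, relevant_books))                 # one third, although precision is 1.0
```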
In the next section, we’ll talk about strategies for striking an appropriate balance between
Precision and Recall.
3.3.3 Striking the right balance
Though there is clearly tension between the two, Precision and Recall are not mutually
exclusive. In the case from the previous example where the query only returns documents 1,
2, and 3, the Precision and Recall are actually both 1.0, because all of the results were
correct, and all of the correct results were found.
Maximizing for full Precision and full Recall is the ultimate goal of just about every search
relevancy-tuning endeavor. With a contrived example (or a hand-tuned set of results), this
seems easy, but in reality, this is a very challenging problem.
Many techniques can be undertaken within Solr to improve either Precision or Recall,
though most are geared more toward increasing Recall in terms of the full document set
being returned. Aggressive textual analysis (to find multiple variations of words) is a great
example of trying to find “more” matches, though these additional matches may hurt overall
precision if the textual analysis is so aggressive that it matches incorrect word variations.
One common way to approach the Precision versus Recall problem in Solr is to actually
attempt to solve for both: measuring for Recall across the entire result set and measuring for
Precision only within the first page (or few pages) of search results. Following this model,
“better” matches will be boosted to the top of the search results based upon how well you
tune your use of Solr’s relevancy scoring calculations, but you will also find that many poorer
matches appear at the bottom of your search results list if you actually went to the last page
of the search results.
This is only one way to approach the problem, however. Since many websites, for
example, want to appear to have as much content as possible, and since those sites know
that visitors will rarely go beyond the first few pages, they can show very
precise results on the first few pages while still yielding a very high Recall value across the entire
result set, since they increase the chances of pulling back more content by being very
lenient in which keywords are able to match the initial query.
The decision on how to best balance Precision and Recall is ultimately dependent upon
your use case. In scenarios like legal discovery, there is a very heavy emphasis placed on
Recall, as there are legal ramifications if any documents are missed. For other use cases, the
requirement may simply be to find a few really great matches and find nothing that does not
exactly match every term within the query.
Most search applications fall somewhere between these two extremes, and striking the
right balance between Precision and Recall is a never-ending challenge – mostly because
there is generally no one right answer. Regardless, understanding the concepts of Precision
and Recall and why changes you make swing you more towards one of these two conceptual
goals (and likely away from the other) is critical to effectively improving the quality of your
search results. We have an entire chapter dedicated to relevancy tuning, chapter 16, so you
can be sure you will see this tension between precision and recall surface again.
3.4 Searching at scale
One of the most appealing aspects of Solr, beyond its speed, relevancy, and powerful text
searching features, is how well it scales. Solr is able to scale to handle billions of documents
and very high query volumes by adding servers. Chapter 12 will provide an in-depth
overview of scaling Solr in production, but this section will lay the groundwork for how to
think about the necessary characteristics for operating a scalable search engine. Specifically,
we’ll discuss the nature of Solr documents as denormalized documents and why this enables
linear scaling across servers, how distributed searching works, the conceptual shift from
thinking about servers to thinking about clusters of servers, and some of the limits of scaling
Solr.
3.4.1 The denormalized document
Central to Solr is the concept of all documents being denormalized. A completely
denormalized document is one in which all fields are self-contained within the document, even
if the values in those fields are duplicated across many documents. This concept of
denormalized data is common to many NoSQL technologies. A good example of
denormalization is a user profile document having a “city”, “state”, and “postalCode” field,
even though in most cases the “city” and “state” field will be exactly the same across all
documents for each unique “postalCode” value. This is in contrast to a normalized document
where relationships between parts of the document may be broken up into multiple smaller
documents, the pieces of which can be joined back together at query time. A normalized
document would only have a “postalCode” field, and a separate location document would
exist for each unique “postalCode” so that the “city” and “state” would not need to be
duplicated on each user profile document. If you have any training whatsoever in building
normalized tables for relational databases, please leave that training at the door when
thinking about modeling content into Solr. Figure 3.11 demonstrates a traditional normalized
database table model, with a big “X” over it to make it obvious that this is not the kind of
data-modeling strategy you will use with Solr.

Figure 3.11 Solr documents do not follow the traditional normalized model of a relational database. This
figure demonstrates how NOT to think of Solr Documents. Instead of thinking in terms of multiple entities
with relationship to each other, a Solr Document is modeled as a flat, denormalized data structure, as
shown in listing 3.2.
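As a rough sketch of what denormalizing means in practice, the snippet below flattens a normalized user record and its location record into one self-contained document. The “city”, “state”, and “postalCode” field names come from the discussion above; the “id” and “name” field names and the postal code values are hypothetical and used only for illustration.

```python
# Normalized data: each user references a location by postal code.
# The postal code values are hypothetical.
locations = {
    "30092": {"city": "Norcross", "state": "Georgia"},
    "30303": {"city": "Atlanta", "state": "Georgia"},
}
users = [
    {"id": "456", "name": "Coco", "postalCode": "30092"},
    {"id": "123", "name": "John Doe", "postalCode": "30303"},
]

def denormalize(user, locations):
    """Copy the location fields into the user record to build one flat document."""
    doc = dict(user)
    doc.update(locations[user["postalCode"]])
    return doc

docs = [denormalize(user, locations) for user in users]
for doc in docs:
    print(doc)
```

Each resulting document carries its own copy of the city and state, so no lookup against a second record is needed at query time.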
Notice that the information in figure 3.11 represents two users working at a company
called “Code Monkeys R Us, LLC”. While this figure shows the data nicely normalized into
separate tables for the employees’ personal information, location, and company, this is not
how we would represent these users in a Solr Document. Listing 3.2 shows the denormalized
representation for each of these employees as mapped to a Solr Document.
Listing 3.2 Two Denormalized User Documents

456
Coco
I’m a real monkey
Norcross
Georgia
Code Monkeys R Us, LLC
we write lots of code
Decatur
Georgia
2013-06-01T15:26:37Z

123
John Doe
Senior Software Engineer with 10 years of experience with java, ruby, and .net
Atlanta
Georgia
Code Monkeys R Us, LLC
we write lots of code
Decatur
Georgia
2013-06-05T12:25:12Z
Notice that all of the company information is repeated in both the first and second user’s
documents from listing 3.2, which seems to go against the principles of normalized database
design for reducing data redundancy and minimizing data dependency. In a traditional
relational database, a query can be constructed which will join data from multiple tables
when resolving a query. While some basic “join” functionality does now exist in Solr (which will
be discussed in chapter 14), it is only recommended for cases where it is impractical to
actually denormalize content. Solr knows about terms that map to documents but does not
natively know about any relationships between documents. That is, if you wanted to search
for all users (in the previous example) who work for companies in Decatur, GA, you would

Figure 4.1 Depiction of how solr.xml and solrconfig.xml are used to configure Solr during initialization
During initialization, Solr locates solr.xml in the top-level Solr home directory; in the
example server this is $SOLR_INSTALL/example/solr/solr.xml. The solr.xml file
identifies one or more cores to initialize. Listing 4.1 shows the solr.xml for the example
server.
Listing 4.1: Default solr.xml for example server defining the collection1 core
<solr persistent="true"> #A
  <cores adminPath="/admin/cores" #B
         defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:}"
         hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1" /> #C
  </cores>
</solr>
#A persistent attribute controls whether changes made from the core admin API are persisted to this
file
#B define one or more cores under the <cores> element
#C the collection1 core configuration and index files are in the collection1 directory under solr home
The initial configuration only has a single core named “collection1”, but in general there
can be many cores defined in solr.xml. For each core, Solr locates the solrconfig.xml file,
under $SOLR_HOME/$instanceDir/conf/solrconfig.xml, where $instanceDir is the
directory for a specific core as specified in solr.xml. Solr uses the solrconfig.xml file to
initialize the core.
Now that we’ve seen how Solr identifies configuration files during startup, let’s turn our
attention to understanding the main sections of the solrconfig.xml, as that will give you an
idea of what’s to come in the rest of this chapter.
4.1 Overview of solrconfig.xml
To illustrate the concepts in solrconfig.xml, we’ll build upon the work done in chapter 2 by
using the pre-configured example server and the Solritas example search UI. To begin, we
recommend that you start up the example server we used in chapter 2 using:
cd $…
The http utility provides a couple of other options to allow you to override the address of
your Solr server or to change the response type to something other than XML, such as JSON.
To see a full list of options, simply do:
java -jar target/sia-examples.jar http -h
Figure 4.4 shows the sequence of events and main components involved in handling this
Solr request.

Figure 4.4 Sequence of events to process a request to the /select request handler.
Starting at the top-left of figure 4.4:

1. A client application sends an HTTP GET request to
http://localhost:8983/solr/collection1/select?q=… Query parameters are passed along
in the query string of the GET request.
2. Jetty accepts the request and routes it to Solr’s unified request dispatcher using the
/solr context in the request path. In technical terms, the unified request dispatcher
is a Java servlet filter mapped to /* for the solr Web application; see
org.apache.solr.servlet.SolrDispatchFilter.
3. Solr’s request dispatcher uses the “collection1” part of the request path to determine
the core name. Next, the dispatcher locates the /select request handler registered in
solrconfig.xml for the collection1 core.
4. The /select request handler processes the request using a pipeline of search
components (covered in section 4.2.4 below).
5. After the request is processed, results are formatted by a response writer component
and returned to the client application. By default, the /select handler returns results
as XML. Response writers are covered in section 4.5.
The main purpose of the request dispatcher is to locate the correct core to handle the
request, such as collection1, and then route the request to the appropriate request handler
registered in the core, in this case /select. In practice, the default configuration for the
request dispatcher is sufficient for most applications. On the other hand, it is common to
define a custom search request handler or to customize one of the existing handlers, such as
/select. Let’s dig into how the /select handler works to gain a better understanding of how
to customize a request handler.
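To illustrate the routing decision the dispatcher makes, here is a small conceptual sketch that splits a request URL into a core name, a handler path, and query parameters. It is not Solr’s implementation (the real work happens inside the SolrDispatchFilter servlet filter); the route function and its context argument are hypothetical names used only for this example.

```python
from urllib.parse import parse_qs, urlsplit

def route(url, context="/solr"):
    """Split a request URL into (core name, handler path, query parameters)."""
    parts = urlsplit(url)
    prefix = context + "/"
    if not parts.path.startswith(prefix):
        raise ValueError("request is not under the %s context" % context)
    # The path segment after the context is the core; the remainder is the handler.
    core, _, handler = parts.path[len(prefix):].partition("/")
    return core, "/" + handler, parse_qs(parts.query)

core, handler, params = route("http://localhost:8983/solr/collection1/select?q=iPod")
print(core)     # collection1
print(handler)  # /select
print(params)   # {'q': ['iPod']}
```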
4.2.2 Search handler
Listing 4.5 shows the definition of the /select request handler from solrconfig.xml.
Listing 4.5 Definition of /select request handler from solrconfig.xml
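<requestHandler name="/select" #A
                class="solr.SearchHandler"> #B
  <lst name="defaults"> #C
    <str name="echoParams">explicit</str>
    <int name="rows">10</int> #D
    <str name="df">text</str>
  </lst>
</requestHandler>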
#A A specific type of request handler designed to process queries
#B Java class that implements the request handler
#C List of default parameters (name/value pairs)
#D Sets the default page size to 10
Behind the scenes, all request handlers are implemented by a Java class, in this case
solr.SearchHandler. At runtime, solr.SearchHandler resolves to the built-in Solr
class: org.apache.solr.handler.component.SearchHandler. In general, anytime
you see “solr.” as a prefix of a class in solrconfig.xml, you know that Solr will resolve it to a
fully qualified class name in one of its built-in packages under org.apache.solr, such as
org.apache.solr.handler.component in this case. This shorthand
notation helps reduce clutter in Solr’s configuration documents. In Solr, there are two main
types of request handlers:
search handler – query processing
update handler – indexing
We’ll learn more about update handlers in the next chapter when we cover indexing. For
now, let’s concentrate on how search request handlers process queries, as depicted in figure
4.5.

Figure 4.5 Search request handler made up of parameter decoration (defaults, appends, invariants), first-
components, components, and last-components
The search handler structure depicted in figure 4.5 is designed to make it easy for you to
adapt Solr’s query processing pipeline for your application. For example, you can define your
own request handler or, more commonly, add a custom search component to an existing
request handler, such as /select. In general, a search handler is comprised of the following
phases, where each phase can be customized in solrconfig.xml:

1. request parameter decoration using:
   a. defaults: set default parameters on the request if they are not explicitly
   provided by the client
   b. invariants: set parameters to static values, which override values
   provided by the client
   c. appends: additional parameters to be combined with the parameters
   provided by the client
2. first-components: optional chain of search components that are applied first to
perform pre-processing tasks
3. components: primary chain of search components; must at least include the query
component
4. last-components: optional chain of search components that are applied last to
perform post-processing tasks
A request handler does not need to define all phases in solrconfig.xml. As you can see
from listing 4.5, the /select handler only defines the defaults section. This means that all other
phases are inherited from the base solr.SearchHandler implementation. In practice,
customized request handlers are commonly used to simplify client applications. For instance,
the Solritas example we introduced in chapter 2 uses a custom request handler /browse to
power a feature-rich search experience while keeping the client-side code for Solritas very
simple.
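The parameter decoration phase described above can be sketched as a simple merge of four parameter maps. This is a conceptual illustration of the defaults, appends, and invariants semantics, not Solr’s implementation; the decorate function and the inStock:true filter value are hypothetical.

```python
def decorate(request, defaults=None, appends=None, invariants=None):
    """Merge client parameters with handler-level parameter decoration.

    Every argument maps a parameter name to a list of values.
    """
    params = {name: list(values) for name, values in request.items()}
    # defaults: used only when the client did not send the parameter
    for name, values in (defaults or {}).items():
        params.setdefault(name, list(values))
    # appends: combined with whatever the client sent
    for name, values in (appends or {}).items():
        params.setdefault(name, []).extend(values)
    # invariants: always override what the client sent
    for name, values in (invariants or {}).items():
        params[name] = list(values)
    return params

decorated = decorate(
    request={"q": ["iPod"], "rows": ["50"], "wt": ["json"]},
    defaults={"rows": ["10"], "df": ["text"]},
    appends={"fq": ["inStock:true"]},  # hypothetical filter
    invariants={"wt": ["xml"]},
)
print(decorated)
```

In this example the client’s rows=50 wins over the default of 10, the df default is filled in because the client did not send it, the filter is appended, and the client’s wt=json is replaced by the invariant wt=xml.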
4.2.3 Browse request handler for Solritas: an example
Hiding complexity from client code is at the heart of Web services and object-oriented
design. Solr adopts this proven design pattern by allowing you to define a custom search
request handler for your application, which allows you to hide complexity from your Solr
client. For example, rather than requiring every query to send the correct parameters to
enable spell correction, you can use a custom request handler that has spell correction
enabled by default.
The Solr example server comes pre-configured with a great example of this design
pattern at work to support the Solritas example application. Listing 4.6 shows an abbreviated
definition of the /browse request handler from solrconfig.xml.
Listing 4.6 Browse request handler for Solritas
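<requestHandler name="/browse" class="solr.SearchHandler"> #A
  <lst name="defaults"> #B
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str> #C
    <str name="v.template">browse</str> #C
    <str name="v.layout">layout</str> #C
    <str name="title">Solritas</str> #C
    <str name="defType">edismax</str> #D
    <str name="qf">
      text^0.5 features^1.0 … #E
    </str>
    <str name="mlt.qf">
      text^0.5 features^1.0 … #F
    </str>
    <str name="facet">on</str> #G
    …
    <str name="hl">on</str> #H
    …
    <str name="spellcheck">on</str> #I
    …
  </lst>
  <arr name="last-components">
    <str>spellcheck</str> #J
  </arr>
</requestHandler>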
#A A SearchHandler invokes query processing pipeline
#B default list of query parameters
#C VelocityResponseWriter settings
#D Use the extended dismax query parser
#E Query settings
#F Enable the MoreLikeThis component
#G Enable the Facet component
#H Enable the Highlight component
#I Enable spell checking
#J Invoke the spell checking component as the last step in the pipeline
We recommend that you take a minute to go through all the sections of the /browse
request handler in the actual solrconfig.xml file. One thing that should stand out to you is
that a great deal of effort was put into configuring this handler, in order to demonstrate
many of the great features in Solr. When starting out with Solr, you definitely do not need to
configure something similar for your application all at once. In other words, you can build up
a custom request handler over time as you gain experience with Solr.
Let’s see the /browse request handler in action using the Solritas example. With the
example Solr server running, direct your browser to
http://localhost:8983/solr/collection1/browse. Enter “iPod” into the search box as shown in
Figure 4.6.

Figure 4.6 Screen shot of Solritas example powered by the /browse request handler
Take a moment to scan over figure 4.6 to see all the search features activated for this
simple query. Behind the scenes, the Solritas search form submits a query to the /browse
request handler. In the log, we see:
INFO: [collection1] webapp=/solr path=/browse params={q=iPod} hits=3
status=0 QTime=22
Notice that the only parameter sent by the search form is q=iPod, but the response
includes facets, more like this, spell correction, paging, and hit highlighting. That’s an
impressive list of features for a simple request like q=iPod! As you may have guessed,
these features are enabled using default parameters in the /browse request handler.
The defaults element from listing 4.6 is an ordered list of name/value pairs that
provides default values for query parameters if they are not explicitly sent by the client
application. For example, the default value for the response writer type parameter “wt” is
“velocity” (<str name="wt">velocity</str>). Velocity is an open source templating
engine written in Java [2].
From the log message shown above, the only parameter sent by the form was “q”, so all
other parameters are set by defaults. Let’s do a little experiment to see the actual query that
gets processed. Instead of using response writer type “velocity”, let’s set the wt parameter
to “xml” so we can see the response in raw form without the HTML decoration provided by
Velocity. Also, in order to see all the query parameters, we need to set the echoParams
value to “all”. This is a good example of overriding default values by explicitly passing
parameters from the client. Listing 4.7 shows the GET URL and a portion of the
<lst name="params"> element returned with the response; remember to use the http tool provided with the book
source code to execute this request. Notice how the number of parameters actually sent to
the /browse request handler is quite large.
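The override described above amounts to adding explicit parameters to the query string. As a minimal sketch, the following builds that GET URL with Python’s standard library; the browse_url helper is a hypothetical convenience function and the snippet only constructs the URL, it does not send the request.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/collection1/browse"

def browse_url(query, **overrides):
    """Build a /browse URL; explicitly passed parameters override handler defaults."""
    params = {"q": query}
    params.update(overrides)
    return BASE_URL + "?" + urlencode(params)

# Ask for raw XML and echo every parameter, instead of the Velocity-rendered page.
url = browse_url("iPod", wt="xml", echoParams="all")
print(url)
# http://localhost:8983/solr/collection1/browse?q=iPod&wt=xml&echoParams=all
```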
Listing 4.7 Actual list of parameters sent to the /browse request handler for q=iPod
http://localhost:8983/solr/collection1/browse?q=iPod&wt=xml&echoParams=all

on
text,features,name,sku,id,manu,cat,title,description,keywords,author,resourcename
+1YEAR
50
*:*
200
layout
all
*,score
600
0
cat
manu_exact
content_type
author_s
html
browse
2
10
NOW/YEAR-10YEARS
false
3
content features title name
750
true
xml
edismax
10
after
0
title
cat,inStock
0
on
5

price
popularity
manufacturedate_dt

on
Solritas
text

ipod
GB

… #A

#A Many more default parameters in this request not shown here
[2] http://velocity.apache.org/engine/index.html
From looking at listing 4.7, it should be clear that parameter decoration for a search
request handler is a powerful feature in Solr. Specifically, the defaults list provides two main
benefits to your application. First, it helps simplify client code by establishing sensible defaults
for your application in one place. For instance, setting the response writer type “wt” to
“velocity” means that client applications do not need to worry about setting this parameter.
Moreover, if you ever swap out Velocity for another templating engine, your client code does
not need to change!
Second, as you can see from listing 4.7, the actual request includes a number of complex
parameters needed to configure search components used by Solritas. For example, there are
over twenty different parameters to configure the faceting component for Solritas. By pre-
configuring complex components like faceting, you can establish consistent behavior for all
queries while keeping your client code simple.
The /browse handler serves as a good example of what is possible with Solr query
processing, but it’s also unlikely that it can be used by your application because the default
parameters are tightly coupled to the Solritas data model. For example, range faceting is
configured for the price, popularity, and manufacturedate_dt fields. Consequently, you
should treat the /browse handler as an example and not a 100% reusable solution when
designing your own application-specific request handler.
4.2.4 Extending query processing with search components
Beyond a set of defaults, the /browse request handler defines an array of search
components to be applied after the default set of search components is applied to the
request using the <arr name="last-components"> element. From listing 4.6, notice that the /browse
request handler specifies:
<arr name="last-components">
  <str>spellcheck</str>
</arr>
This configuration means that the default set of search components is applied and then the
spellcheck component is applied. This is a very common design pattern for search request
handlers. In fact, you’ll be hard-pressed to come up with an example of where you need to
redefine the components phase for a search handler. Figure 4.7 shows the chain of six
built-in search components that get applied during the components phase of query
processing:

Figure 4.7 Chain of six built-in search components
QUERY COMPONENT
The query component is the core of Solr’s query processing pipeline. At a high level, the
query component parses and executes queries using the active searcher, which is discussed
in section 4.3 below. The specific query parsing strategy is controlled by the “defType”
parameter. For instance, the /browse request handler uses the edismax query parser
(<str name="defType">edismax</str>), which will be discussed in chapter 7.
The query component identifies all documents in the index that match the query. The set
of matching documents can then be used by other components in the query processing
chain, such as the facet component. The query component is always enabled and all other
components need to be explicitly enabled using query parameters.
FACET COMPONENT
Given a result set identified by the query component, the facet component, if enabled,
calculates field-level facets. We cover faceting in depth in chapter 8. The key takeaway for
now is that faceting is built into every search request and just needs to be enabled with
query request parameters. For /browse, faceting is enabled using the default parameter:
<str name="facet">on</str>.
MORELIKETHIS COMPONENT
Given a result set created by the query component, the More Like This component, if
enabled, identifies other documents that are similar to the documents in search results. To
see an example of the More Like This component in action, search for “hard drive” in the
Solritas example. Click on the More Like This link for the Samsung SpinPoint P120
SP2514N - hard drive - 250 GB - ATA-133 result to see a list of similar documents as
shown in figure 4.8 below.

Figure 4.8 Example of similar items found by the More Like This search component
We cover the More Like This component in chapter 10.
HIGHLIGHT COMPONENT
If enabled, the highlight component highlights sections of text in matching documents to
help the user identify the most relevant passages. Hit highlighting
is covered in chapter 9.
STATS COMPONENT
The stats component computes simple statistics like min, max, sum, mean, and standard
deviation for numeric fields in matching documents. To see an example of what the stats
component produces, execute the GET request shown in listing 4.8:
Listing 4.8 Request summary statistics for the price field using the stats component
http://localhost:8983/solr/collection1/select?
q=*%3A*&
wt=xml&
stats=true&
stats.field=price #A
<lst name="stats_fields">
  <lst name="price"> #B
    <double name="min">0.0</double>
    <double name="max">2199.0</double>
    <long name="count">16</long>
    <long name="missing">16</long>
    <double name="sum">5251.270030975342</double>
    <double name="sumOfSquares">6038619.175900028</double>
    <double name="mean">328.20437693595886</double>
    <double name="stddev">536.3536996709846</double>
  </lst>
</lst>
#A Request statistics for the price field
#B Summary statistics returned for the price field
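As a quick sanity check on how these figures relate to one another, the mean and standard deviation can be recomputed from the count, sum, and sum of squares reported in listing 4.8. The sketch below assumes the sample (n - 1) form of the standard deviation, which appears consistent with the numbers in the listing.

```python
import math

# Values reported for the price field in listing 4.8
count = 16
total = 5251.270030975342
sum_of_squares = 6038619.175900028

mean = total / count
# Sample variance expressed in terms of the sum and the sum of squares
variance = (count * sum_of_squares - total ** 2) / (count * (count - 1))
stddev = math.sqrt(variance)

print(mean)    # should agree with the mean value in the listing
print(stddev)  # should agree with the stddev value in the listing
```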

DEBUG COMPONENT
The Debug component returns the parsed query string that was executed and detailed
information about how the score was calculated for each document in the result set. The
parsed query value is returned to help you track down query formulation issues. The debug
component is useful for troubleshooting ranking problems. To see the debug component at
work, direct your browser to the following URL:
http://localhost:8983/solr/collection1/browse?q=iPod&wt=xml&debugQuery=true
You should notice that this is the exact same query that we executed from the Solritas
form except we changed the response writer type “wt” to XML (instead of Velocity) and
enabled the debug component using debugQuery=true in the HTTP GET request. Listing
4.9 shows a snippet of the XML output produced by the debug component:
Listing 4.9 Snippet of the XML output produced by the Debug component
http://localhost:8983/solr/collection1/browse?
q=iPod&
wt=xml&
debugQuery=true #A
<str name="rawquerystring">iPod</str>
<str name="querystring">iPod</str>
<str name="parsedquery">(+DisjunctionMaxQuery((id:iPod^10.0 | #B
author:ipod^2.0 | title:ipod^10.0 | text:ipod^0.5 | cat:iPod^1.4 |
keywords:ipod^5.0 | manu:ipod^1.1 | description:ipod^5.0 |
resourcename:ipod | name:ipod^1.2 | features:ipod |
sku:ipod^1.5)))/no_coord</str>
... #C
0.13513829 = (MATCH) max of:
  0.045974977 = (MATCH) weight(text:ipod^0.5 in 4) [DefaultSimilarity], result of:
    0.045974977 = score(doc=4,freq=3.0 = termFreq=3.0 ), ...

#A Enable the debug component
#B Query produced by the edismax query parser
#C Explanation of score calculation for each document in the request
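If you prefer to build the debug request from a script rather than typing it into the browser, the following sketch assembles the same URL. The helper name build_debug_url is hypothetical; the base URL and parameters are the ones used in listing 4.9.

```python
from urllib.parse import urlencode


def build_debug_url(query, base="http://localhost:8983/solr/collection1/browse"):
    """Return a request URL that enables the debug component for `query`."""
    params = {
        "q": query,
        "wt": "xml",           # XML response writer instead of Velocity
        "debugQuery": "true",  # turns on the debug component
    }
    return base + "?" + urlencode(params)


print(build_debug_url("iPod"))
```

Using urlencode instead of string concatenation means query terms with spaces or special characters are escaped for you.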
At this point, you should have a solid understanding of how Solr processes query
requests. Before we move on to another configuration topic, you should be aware that the
Solr administration console provides access to all active search request handlers under
Plugins / Stats > QUERYHANDLER. Figure 4.9 shows properties and statistics for the
/browse search handler, which as you might have guessed is just another MBean.

Figure 4.9 Screen shot showing properties and statistics for the /browse request handler in the Solr
administration console under Plugins / Stats > QUERYHANDLER
Now let’s turn our attention to configuration settings that help optimize query
performance.
4.3 Managing searchers
The <query> element contains settings that allow you to optimize query performance using
techniques like caching, lazy field loading, and new searcher warming. It goes without saying
that designing for optimal query performance from the start is critical to the success of your
search application. In this section, you’ll learn about managing searchers, which is one of the
most important techniques for optimizing query performance.
4.3.1 New searcher overview
In Solr, queries are processed by a component called a searcher. There is only one “active”
searcher in Solr at any given time. All query components for all search request handlers
execute queries against the active searcher.
The active searcher has a read-only view of a snapshot of the underlying Lucene index. It
follows that if you add a new document to Solr, then it is not visible in search results from
the current searcher. This raises a question: how do new documents become visible in
search results? The answer, of course, is to close the current searcher and open a new one
that has a read-only view of the updated index. This is what it means to commit documents
to Solr. The actual commit process in Solr is more complicated but we’ll save a thorough
discussion of the nuances of commits for the next chapter. For now you can think of a
commit as a black-box operation that makes new documents and any updates to your
existing index visible in search results by opening a new searcher.
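To make the snapshot idea concrete, here is a toy model of it. ToyIndex and ToySearcher are hypothetical names invented for this sketch; they are not Solr or Lucene classes, and real commits involve far more than swapping an object.

```python
class ToyIndex:
    """Toy stand-in for an index; not Solr's or Lucene's actual API."""

    def __init__(self):
        self._docs = []                          # everything that has been added
        self.searcher = ToySearcher(self._docs)  # the single active searcher

    def add(self, doc):
        self._docs.append(doc)                   # not yet visible to the active searcher

    def commit(self):
        # Replace the active searcher with one that sees the updated index
        self.searcher = ToySearcher(self._docs)


class ToySearcher:
    def __init__(self, docs):
        self._snapshot = tuple(docs)             # read-only copy taken at open time

    def search(self, term):
        return [d for d in self._snapshot if term in d]


index = ToyIndex()
index.add("hard drive")
before = index.searcher.search("drive")   # searcher was opened before the add
index.commit()
after = index.searcher.search("drive")    # new searcher sees the document
print(before, after)
```

The document only shows up in results once commit() has replaced the active searcher, which is the behavior described above.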
Figure 4.10 shows the active searcher MBean for the collection1 core in the example
server available under the CORE section of the Plugins / Stats page.

Figure 4.10 Inspecting the active searcher MBean in the collection1 core from the Solr admin console
On the CORE page, take note of the searcherName property (in the figure it is
“Searcher@25082661 main”). Let’s trigger the creation of a new searcher by re-sending all
the example documents to your server as we did in section 2.1.4 using:
cd $SOLR_INSTALL/example/exampledocs
java -jar post.jar *.xml
Now, refresh the CORE page and notice that the searcherName property has changed
to be a different instance of the searcher. A new searcher was created because the post.jar
command sent a commit after adding the example documents.
So now that we know a commit creates a new searcher to make new documents and
updates visible, let’s think about the implications of creating a new searcher. First, the old
searcher must be destroyed. However, there could be queries currently executing against the
old searcher so Solr must wait for all in-progress queries to complete.
Also, any cached objects that are based on the current searcher’s view of the index must
be invalidated. We’ll learn more about Solr cache management in the next section. For now,
think about a cached result set from a specific query. As some of the documents in the
cached results may have been deleted and new documents may now match the query, it
should be clear that the cached result set is not valid for the new searcher.
Because pre-computed data, such as a cached query result set, must be invalidated and
re-computed, it stands to reason that opening a new searcher on your index is potentially an
expensive operation. This can have a direct impact on user experience. For instance, imagine
a user paging through search results and a new searcher is opened after they click on page 2
but before they request page 3. When the user requests the next page, all of the previously
computed filters and cached documents are no longer valid. Without some care, the user is
likely to experience some slowness, especially if their query is complex.
The good news is that Solr has a number of tools to help alleviate this situation. First and
foremost, Solr supports the concept of warming a new searcher in the background and
keeping the current searcher active until the new one is fully warmed.
4.3.2 Warming a new searcher
Solr takes the approach that it is better to serve stale results for a short period of time
rather than allowing query performance to slow down significantly. This means that Solr does
not close the current searcher until a new searcher is warmed up and ready to execute
queries with optimal performance.
Warming a new searcher is much like a sprinter warming up before a race. Before a sprinter
goes full speed, she makes sure her muscles are warmed up and ready to perform at full
speed when the gun fires to start the race. Just as a sprinter wouldn’t start a race with cold
muscles, Solr shouldn’t activate a “cold” searcher.
In general, there are two types of warming activities: 1) auto-warming new caches from
the old caches, and 2) executing cache warming queries. We’ll learn more about auto-
warming caches in the next section when we dig into Solr’s cache management features.
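The second warming activity, executing queries against the new searcher before it goes live, can be sketched with a toy model. ToySearcher and open_warmed_searcher are hypothetical names for illustration only; Solr's real caches and searcher registration are considerably more involved.

```python
class ToySearcher:
    """Toy searcher with a result cache; a sketch, not Solr's implementation."""

    def __init__(self, docs):
        self._docs = tuple(docs)
        self.cache = {}                      # query -> cached result list

    def search(self, query):
        if query not in self.cache:          # cache miss: do the "expensive" work
            self.cache[query] = [d for d in self._docs if query in d]
        return self.cache[query]


def open_warmed_searcher(docs, warming_queries):
    """Build a new searcher and populate its cache before it is registered."""
    new_searcher = ToySearcher(docs)
    for q in warming_queries:
        new_searcher.search(q)
    return new_searcher


active = ToySearcher(["ipod nano"])
docs_after_commit = ["ipod nano", "ipod shuffle"]
# The old searcher stays active (and stale) while the new one warms
stale = active.search("ipod")
active = open_warmed_searcher(docs_after_commit, ["ipod"])
print(stale, "ipod" in active.cache, active.search("ipod"))
```

By the time the new searcher replaces the old one, the warmed query is already in its cache, so the first user to run it does not pay the cost of a cache miss.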
A cache-warming query is a pre-configured query (in solrconfig.xml) that gets executed
against a new searcher in order to populate the new searcher’s caches. Listing 4.11 shows
the configuration of cache warming queries for the example server.

Listing 4.11 Define a listener for newSearcher events to execute warming queries
<listener event="newSearcher" class="solr.QuerySenderListener"> #A
  <arr name="queries"> #B
    <!-- #C
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
    <lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
    -->
  </arr>
</listener>
#A Define a listener to handle newSearcher events
#B Define a named list of query objects to warm the new searcher
#C Intentionally commented out so that you configure application-specific queries for your
environment
The configuration settings in listing 4.11 register a named list (<arr name="queries">) of
queries to execute whenever a newSearcher event occurs in Solr, such as after a commit.
Also, note that the actual queries are commented out! This is intentional because there is a
cost to executing warming queries and the Solr developers wanted to ensure you configure
warming queries explicitly for your application. In other words, the cache warming queries
are application specific so the out-of-the-box defaults are strictly for example purposes. Put
simply, you are responsible for configuring warming queries.
CHOOSING WARMING QUERIES
Having the facility to warm new searchers by executing queries is only a great feature if you
can identify queries that will help improve query performance. As a rule of thumb, warming
queries should contain query parameters (q, fq, sort, etc.) that are used frequently by your
application. Since we haven't covered Solr query syntax yet, we'll table the discussion of
creating warming queries until chapter 7. For now, it's sufficient to make a mental note that
you need to revisit this topic once you have a more thorough understanding of Solr query
construction. We should also mention that you do not need to have any warming queries for
your application. If query performance begins to suffer after commits, then you'll know it is
time to consider using warming queries.
TOO MANY WARMING QUERIES
The old adage of "less is more" applies to warming queries. Each query takes time to
execute so having many warming queries configured can lead to long delays in opening new
searchers. Thus, it's best to keep the list of warming queries to the minimal set of the most
important queries for your search application.
You might be wondering why it is a problem for a new searcher to take a long time to warm
up. It turns out that warming too many searchers in your application concurrently can
consume too many resources (CPU and memory), thus leading to a degraded search
experience.
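One simple way to arrive at a short list of warming queries is to count how often each query appears in your own request logs and keep only the top few. The sketch below assumes you already have a list of normalized query strings; the log contents and the helper name pick_warming_queries are hypothetical.

```python
from collections import Counter


def pick_warming_queries(query_log, limit):
    """Return the `limit` most frequent queries from a log of query strings."""
    return [q for q, _ in Counter(query_log).most_common(limit)]


# Hypothetical log of (already normalized) query parameter strings
log = [
    "q=ipod&sort=price asc",
    "q=ipod&sort=price asc",
    "q=hard drive",
    "q=ipod&sort=price asc",
    "q=hard drive",
    "q=monitor",
]
print(pick_warming_queries(log, 2))
```

Keeping limit small follows the "less is more" advice: each extra warming query adds to the time a new searcher needs before it can be registered.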
USECOLDSEARCHER
Before we turn our attention to Solr’s cache management, we want to mention two additional
searcher-related elements in solrconfig.xml. The <useColdSearcher> element covers the
case where a new search request is made and there is no currently registered searcher. If
<useColdSearcher> is false, then Solr will block until the warming searcher has
completed executing all warming queries; this is the default configuration for the example
Solr server: <useColdSearcher>false</useColdSearcher>
On the other hand, if <useColdSearcher> is true, then Solr will immediately register
the warming searcher regardless of how “warm” it is. Returning to our track-and-field
analogy, false would mean the starting official waits to start the race until our sprinter is
fully warmed-up, regardless of how long that takes. Conversely, a true value means that the
race will start immediately regardless of how warmed-up our sprinter is.
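The two behaviors can be summarized as a small decision function. This is only a sketch of the rule described above, using a hypothetical function name and plain strings in place of real searcher objects.

```python
def searcher_for_request(registered, warming, use_cold_searcher):
    """Decide which toy searcher serves a request.

    Returns (searcher, must_wait); must_wait is True when the request has to
    block until warming finishes.
    """
    if registered is not None:
        return registered, False        # normal case: use the active searcher
    if use_cold_searcher:
        return warming, False           # register the cold searcher right away
    return warming, True                # block until warming completes


print(searcher_for_request(None, "warming-searcher", False))
print(searcher_for_request(None, "warming-searcher", True))
```

The setting only matters when there is no registered searcher; once one exists, requests are served by it regardless of the flag.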
MAXWARMINGSEARCHERS
It’s conceivable that a new commit is issued before the new searcher warming process
completes, which implies another new searcher needs to be warmed up. This is especially
true if your searchers take considerable time to warm up. The <maxWarmingSearchers>
element allows you to control the maximum number of searchers that can be warming
