Solr performance tips

Page query result:

  • purpose: pre-load more document ids than a single page needs, so that navigating to the following pages is served from the query result cache instead of running additional queries

  • how: solrconfig.xml, queryResultWindowSize = documents per page * max number of pages browsed, queryResultMaxDocsCached = queryResultWindowSize
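
  • example: a minimal sketch, assuming 10 documents per page and users who browse at most 5 pages (both numbers are illustrative, not measured); the elements go inside the <query> section of solrconfig.xml

    <query>
        <!-- assumed: 10 docs per page * 5 pages = 50 doc ids collected per query -->
        <queryResultWindowSize>50</queryResultWindowSize>
        <!-- same value, so that a full window is still eligible for caching -->
        <queryResultMaxDocsCached>50</queryResultMaxDocsCached>
    </query>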


Configure document cache:

  • purpose: set the right document cache size, so that cache is large enough to avoid fetching data from index multiple times during a single query

  • how: solrconfig.xml, documentCache section, size = number of concurrent queries * max docs fetched per query, initialSize = size

  • note: consider using FastLRUCache if there are more reads than writes
    autowarmCount does not need to be set for the document cache, because Lucene doc ids change after a re-index
    set initialSize = size to avoid wasting time on cache resizing
    monitor the cache after adjusting its size: frequent evictions mean the cache is too small; a low hit rate means the cache is not paying off (consider turning it off)
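
  • example: a sketch of the sizing rule above, assuming about 20 concurrent queries that each fetch at most 50 documents (illustrative numbers), which gives size = 20 * 50 = 1000

    <!-- assumed workload: 20 concurrent queries * 50 docs per query = 1000 -->
    <documentCache class="solr.LRUCache"
                   size="1000"
                   initialSize="1000"
                   autowarmCount="0"/>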


Configure query result cache:

  • purpose: load query results from cache as much as possible

  • how: solrconfig.xml, queryResultCache section, size = number of unique queries * number of sort criteria * 2 (asc or desc), initialSize = size, autowarmCount = size * 1/4

  • note: consider using FastLRUCache if there are more reads than writes
    autowarmCount specifies how many entries of the old cache are used to pre-populate the new cache after invalidation (e.g. by a commit)
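
  • example: a sketch of the formula above, assuming 200 unique queries and 2 sort criteria (illustrative numbers): size = 200 * 2 * 2 = 800, autowarmCount = 800 / 4 = 200

    <!-- assumed: 200 unique queries * 2 sort criteria * 2 directions = 800 -->
    <queryResultCache class="solr.FastLRUCache"
                      size="800"
                      initialSize="800"
                      autowarmCount="200"/>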


Configure filter cache:

  • purpose: load filter query (fq) results from cache as much as possible

  • how: solrconfig.xml, filterCache section, size = number of unique filters
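
  • example: a sketch assuming about 200 unique filters (an illustrative number); initialSize and autowarmCount simply follow the pattern used for the other caches and are assumptions, not recommendations from the text above

    <!-- assumed: roughly 200 distinct fq values in use -->
    <filterCache class="solr.FastLRUCache"
                 size="200"
                 initialSize="200"
                 autowarmCount="50"/>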


Startup or after commit warmup tuning:

  • purpose: pre-load result for heavily used or slow queries into cache to avoid warmup slowness (startup or commit)

  • how: solrconfig.xml

    after startup:

    <listener event="firstSearcher" class="solr.QuerySenderListener">

        <arr name="queries">
            <lst>
                <str name="q">cats</str>
                <str name="fq">category:1</str>

            </lst>

            <lst>...</lst>

        </arr>

    </listener>


    after commit:

    <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
            <lst>
                <str name="q">cats</str>
                <str name="fq">category:1</str>

            </lst>

            <lst>...</lst>

        </arr>

    </listener>


Cache whole result pages (HTTP cache):

  • purpose: cache solr http response on client side by using http cache

  • how:

    <requestDispatcher handleSelect="true">
        <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
            <cacheControl>max-age=3600, public</cacheControl>
        </httpCaching>
    </requestDispatcher>

  • note: handleSelect=true enables handler resolution via the request parameter qt
    set "max-age" to half of the index update interval
    use "private" instead of "public" in cacheControl if only the browser should cache the Solr response
    lastModFrom="openTime" is the default: the Last-Modified value (and validation against If-Modified-Since requests) is relative to when the current Searcher was opened. Change it to lastModFrom="dirLastMod" if the value should correspond exactly to when the physical index was last modified.
    etagSeed="..." can be changed to force the ETag header (and validation against If-None-Match requests) to be different even if the index has not changed (i.e. after making significant changes to the config file).
    lastModifiedFrom and etagSeed are both ignored if the never304="true" option is used (useful when a proxy server should handle the tag / modified time calculation)
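
  • example: a sketch of the browser-only variant described in the note, assuming an index that is updated about once per hour (so max-age is set to 1800 seconds) and a proxy that handles validation itself; both are assumptions for illustration

    <requestDispatcher handleSelect="true">
        <!-- never304="true": Solr does no ETag / Last-Modified validation -->
        <httpCaching never304="true">
            <!-- private: only the browser may cache the response -->
            <cacheControl>max-age=1800, private</cacheControl>
        </httpCaching>
    </requestDispatcher>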


Improve facet performance:

  • purpose: improve facet query performance by choosing the facet method, when the query result contains many documents and the facet field's cardinality is low

  • how: add facet.method=enum to the query, or f.<fieldname>.facet.method=enum to set it for a single field

  • note: facet.method=fc (default) iterates over the result documents and calculates the count for each facet value, while facet.method=enum intersects each facet term's doc ids with the doc ids of the query result set
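
  • example: instead of adding the parameter to every request, it can be set as a handler default in solrconfig.xml; a sketch assuming a low-cardinality field named category (a hypothetical field name) and the standard /select handler

    <requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
            <!-- per-field override: enum only for the low-cardinality field -->
            <str name="f.category.facet.method">enum</str>
        </lst>
    </requestHandler>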


Improve indexing time on large doc set:

  • purpose: improve response time for indexing a large number of documents by committing more frequently

  • how: use Solr's auto commit feature

    commit within specified time:

    <updateHandler class="solr.DirectUpdateHandler2">
        <autoCommit>
            <maxTime>60000</maxTime>
            <openSearcher>true</openSearcher>
        </autoCommit>
    </updateHandler>


    commit after indexing specified number of documents:

    <updateHandler class="solr.DirectUpdateHandler2">
        <autoCommit>
            <maxDocs>50000</maxDocs>
            <openSearcher>true</openSearcher>
        </autoCommit>
    </updateHandler>



    commit sooner than the auto-commit setting for specific docs with commitWithin, given in milliseconds (XML update format shown):

    <add commitWithin="100">
        <doc>
            <field name="id">1</field>
            <field name="title">Book 1</field>
        </doc>
    </add>


Analyzing performance:

  • purpose: get a detailed view of Solr query execution so that slow queries can be tuned

  • how: add the request parameter debugQuery=true

    Eg. http://localhost:8983/solr/select?q=metal&facet=true&facet.field=date&facet.query=from:[10+TO+2000]&debugQuery=true 

    this adds a debug section to the Solr response XML, which breaks down the time spent in each component

  • note: Solr processing is divided into two phases, prepare and process, and the time is reported for each


Avoid filter caching:

  • purpose: for filters that are effectively unique per request, such as a search on an exact timestamp, caching the filter only wastes memory and CPU, so it is better to skip the filter cache

  • how: add the local parameter {!cache=false} to the filter query
    Eg. q=solr+cookbook&fq=category:books&fq={!cache=false}date:"2012-06-12T13:22:12Z"

  • note: filters that are not cached are executed in parallel with the query


Control filter query execution order:

  • purpose: a request may contain multiple filter queries; we want to control the order of execution so that cheap filters are applied first to narrow down the result set as much as possible, and expensive filters (e.g. functions) are applied later

  • how: specify a cost on each fq clause; filters with a lower cost run first

    Eg. q=solr+cookbook&fq=category:books&fq={!frange l=10 u=100 cache=false cost=50}log(sum(sqrt(popularity),100))&fq={!frange l=0 u=10 cache=false cost=150}if(exists(price_promotion),sum(0,price_promotion),sum(0,price))

  • note: the order of execution can only be controlled for non-cached filter queries


Improve numeric query performance:

  • purpose: improve numeric range search performance

  • how: decrease the precisionStep of a float field

    <fieldType name="float" class="solr.TrieFloatField" precisionStep="4" positionIncrementGap="0"/>

  • note: text range search is usually faster than numeric range search
    decreasing precisionStep generates more tokens per value and slightly increases index size; for an integer, precisionStep = 4 results in 32 bits / 4 = 8 tokens
    precisionStep=0 turns off indexing of multiple tokens per value
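
  • example: a sketch of the opposite setting, for a field that is never used in range queries and therefore does not need the extra tokens; the type name float_single and the field name weight are hypothetical

    <!-- precisionStep="0": one token per value, smallest index, slower range queries -->
    <fieldType name="float_single" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <field name="weight" type="float_single" indexed="true" stored="true"/>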


Use near real time search feature:

  • purpose: Solr supports near real time indexing by performing a soft commit. A hard commit syncs index changes to disk, which is time consuming. A soft commit is much faster, and the searcher sees index changes immediately.

  • how: solrconfig.xml

    <autoSoftCommit>

       <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>

    </autoSoftCommit>

  • note: a good article that explains soft commit, hard commit and the transaction log in Solr:
    http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

    Another way to improve search responsiveness is to use a "get" query if the document id is known. This retrieves the document directly from the transaction log, even if the data has not been committed yet. Eg. http://localhost:8983/solr/get?ids=mydoc See https://wiki.apache.org/solr/RealTimeGet
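
  • example: a sketch of how soft and hard commits are commonly combined, assuming a hard commit every 60 seconds that only flushes to disk (openSearcher=false) while the soft commit every 5 seconds makes changes visible; the intervals are illustrative and should be tuned per workload

    <updateHandler class="solr.DirectUpdateHandler2">
        <!-- transaction log, also needed by the real-time "get" handler -->
        <updateLog>
            <str name="dir">${solr.ulog.dir:}</str>
        </updateLog>
        <!-- hard commit: durability only, no new searcher is opened -->
        <autoCommit>
            <maxTime>60000</maxTime>
            <openSearcher>false</openSearcher>
        </autoCommit>
        <!-- soft commit: visibility for searches -->
        <autoSoftCommit>
            <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
        </autoSoftCommit>
    </updateHandler>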

        
