The Well-Configured Solr Instance

This chapter describes how to tune a Solr instance for best performance.
Configuring solrconfig.xml
The configuration in solrconfig.xml has a major impact on how Solr behaves. It covers the following:
request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm-up caches
the Request Dispatcher for managing HTTP communications
the Admin Web interface
parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
Topics covered:
DataDir and DirectoryFactory in SolrConfig
Lib Directives in SolrConfig
Schema Factory Definition in SolrConfig
IndexConfig in SolrConfig
RequestHandlers and SearchComponents in SolrConfig
InitParams in SolrConfig
UpdateHandlers in SolrConfig
Query Settings in SolrConfig
RequestDispatcher in SolrConfig
Update Request Processors
Codec Factory
Substituting Properties in Solr Config Files
solrconfig.xml supports substituting property values dynamically:
${propertyname[:option default value]}
If a property is not set at runtime, its default value is used; if there is no default either, an error is raised. Property values can be specified in several ways:
JVM System Properties
Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables in any XML configuration file in Solr.
For example, in the sample solrconfig.xml files, you will see this value which defines the locking type to use:
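Based on the substitution syntax above and the "native" default described next, the setting presumably reads:

```xml
<lockType>${solr.lock.type:native}</lockType>
```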
This means the lock type defaults to "native", but when starting Solr you could override this using a JVM system property, by launching Solr with:

bin/solr start -Dsolr.lock.type=none

In general, any Java system property that you want to set can be passed through the bin/solr script using the standard -Dproperty=value syntax. Alternatively, you can add common system properties to the SOLR_OPTS environment variable defined in the Solr include file (bin/solr.in.sh). For more information about how the Solr include file works, refer to: Taking Solr to Production.
So there are two ways to set these properties: pass them at startup, or define them in the Solr include file.
solrcore.properties
If the configuration directory for a Solr core contains a file named solrcore.properties, that file can contain any arbitrary user-defined property names and values, using the Java standard properties file format. Those properties can then be used as variables in the XML configuration files for that Solr core.

For example, the following solrcore.properties file could be created in the conf/ directory of a collection using one of the example configurations, to override the lockType used:

#conf/solrcore.properties
solr.lock.type=none
This second approach uses solrcore.properties. By default the file lives in the core's conf/ directory; an alternative name and location can be specified in core.properties.
User defined properties from core.properties
For example, consider the following core.properties file:
#core.properties
name=collection2
my.custom.prop=edismax

The my.custom.prop property can then be used as a variable, such as in solrconfig.xml:
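A sketch of such a usage inside a request handler definition (the handler and parameter shown are illustrative):

```xml
<requestHandler name="/select">
  <lst name="defaults">
    <str name="defType">${my.custom.prop}</str>
  </lst>
</requestHandler>
```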
Implicit Core Properties
Implicitly defined core properties:
All implicit properties use the solr.core. name prefix, and reflect the runtime value of the equivalent core.properties property:

solr.core.name
solr.core.config
solr.core.schema
solr.core.dataDir
solr.core.transient
solr.core.loadOnStartup
DataDir and DirectoryFactory in SolrConfig
Specifying a Location for Index Data with the dataDir Parameter
Use dataDir to specify where index data is stored. If you are using replication to replicate the Solr index (as described in Legacy Scaling and Distribution), then the data directory should match the index directory used in the replication configuration. Covers relative vs. absolute paths and replica settings.
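A minimal sketch (the path is illustrative):

```xml
<dataDir>/var/data/solr/</dataDir>
```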
Specifying the DirectoryFactory For Your Index
You can force a particular implementation by specifying solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.

The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use this DirectoryFactory to store your index in RAM.
The default factory automatically picks a suitable filesystem implementation for the operating system. The index can also be built on HDFS: when running on HDFS, use solr.HdfsDirectoryFactory instead of either of the above implementations.
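The stock solrconfig.xml declares the factory roughly as below (NRTCachingDirectoryFactory is the commonly cited default wrapper):

```xml
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
```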
Lib Directives in SolrConfig
lib directives tell Solr where to find additional .jar files; a regular expression can be supplied to filter filenames. All directories are resolved as relative to the Solr instanceDir.
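Two representative lib directives with regex filters (the paths match the example configsets shipped with Solr; adjust to your layout):

```xml
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
```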
Schema Factory Definition in SolrConfig
While the "read" features of the Schema API are supported for all schema types, support for making schema modifications programmatically depends on the schemaFactory in use.
Managed Schema Default
Solr implicitly uses a ManagedIndexSchemaFactory if no schemaFactory is explicitly configured.
An example:

mutable - controls whether changes may be made to the Schema data. This must be set to true to allow edits to be made with the Schema API.

managedSchemaResourceName - an optional parameter that defaults to "managed-schema", and defines a new name for the schema file that can be anything other than "schema.xml".
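A schemaFactory declaration using these two parameters would look like:

```xml
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
```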
Classic schema.xml

An explicitly configured ClassicIndexSchemaFactory requires a manually edited schema.xml, and disallows any programmatic changes to the Schema at run time. Run-time modifications are not supported; changes take effect only after editing the file and reloading.
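Such a classic declaration is a one-liner:

```xml
<schemaFactory class="ClassicIndexSchemaFactory"/>
```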
Switching from schema.xml to Managed Schema
An uneditable schema.xml can be converted to the editable managed schema by configuring ManagedIndexSchemaFactory in solrconfig.xml.
Changing to Manually Edited schema.xml
Steps to change back to a manually edited schema.xml:

1. Rename the managed-schema file to schema.xml.
2. Modify solrconfig.xml to replace the schemaFactory class: remove any ManagedIndexSchemaFactory definition if it exists, and add a ClassicIndexSchemaFactory definition as shown above.
3. Reload the core(s).

If you are using SolrCloud, you may need to modify the files via ZooKeeper.
IndexConfig in SolrConfig
In most cases, the defaults are fine
...
Parameters covered in this section:

Writing New Segments
Merging Index Segments
Compound File Segments
Index Locks
Other Indexing Settings
Writing New Segments
ramBufferSizeMB
maxBufferedDocs
useCompoundFile
The parameters above control when Solr flushes in-memory updates and writes new segments to disk.
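A sketch of these settings inside <indexConfig> (the values shown are the commonly cited defaults, not a tuned recommendation):

```xml
<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <useCompoundFile>false</useCompoundFile>
</indexConfig>
```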
Merging Index Segments
mergePolicyFactory
The default in Solr is to use a TieredMergePolicy. Other policies available are the LogByteSizeMergePolicy and LogDocMergePolicy.
Controlling Segment Sizes: Merge Factors
For TieredMergePolicy, this is controlled by setting the maxMergeAtOnce and segmentsPerTier options. Merging index segments speeds up searches, at the cost of extra time spent during indexing and commits.
Customizing Merge Policies
An example:
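A TieredMergePolicy factory configured with the two options named above (the values are illustrative):

```xml
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>
```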
mergeScheduler
The merge scheduler controls how merges are performed
The default, ConcurrentMergeScheduler, performs merges in the background using separate threads. The alternative, SerialMergeScheduler, performs merges serially on the calling thread.
mergedSegmentWarmer
Warming merged segments ahead of time benefits near real-time search.
Compound File Segments
A segment's files can be combined into a single compound file.
Index Locks
lockType

The lock type to use:

native (the default, used with StandardDirectoryFactory)
simple
single
hdfs
writeLockTimeout
The maximum time to wait for a write lock.
Other Indexing Settings
Some remaining parameters:
reopenReaders
deletionPolicy
infoStream
An example:
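A sketch combining the settings above (the deletion policy values follow the stock sample configuration):

```xml
<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>
<infoStream>false</infoStream>
```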
RequestHandlers and SearchComponents in SolrConfig
Request Handlers
SearchHandlers
UpdateRequestHandlers
ShardHandlers
Other Request Handlers
Search Components
Default Components
First-Components and Last-Components
Components
Other Useful Components
Request Handlers
Request handlers map request paths to the code that processes them.
SearchHandlers
Covers the parameters and characteristics of search handlers.
UpdateRequestHandlers
ShardHandlers
Other Request Handlers
Solr does not actually have many handler types — only four or five at present, of which two are commonly used: search and update.
Search Components
Search components define the logic that is used by the SearchHandler to perform queries for users.
They correspond to the SearchHandler.
Default Components
Unless overridden with first-components and last-components, the default components execute in the following order:
query (solr.QueryComponent) - described in the section Query Syntax and Parsing.
facet (solr.FacetComponent) - described in the section Faceting.
mlt (solr.MoreLikeThisComponent) - described in the section MoreLikeThis.
highlight (solr.HighlightComponent) - described in the section Highlighting.
stats (solr.StatsComponent) - described in the section The Stats Component.
debug (solr.DebugComponent) - described in the section on Common Query Parameters.
expand (solr.ExpandComponent) - described in the section Collapse and Expand Results.
A default component can be replaced by registering a component with the same name.
First-Components and Last-Components
Components
If components are declared directly (instead of via first-components/last-components), the default components are not used.
Other Useful Components
SpellCheckComponent, described in the section Spell Checking.
TermVectorComponent, described in the section The Term Vector Component.
QueryElevationComponent, described in the section The Query Elevation Component.
TermsComponent, described in the section The Terms Component.
InitParams in SolrConfig
An initParams section lets you apply uniform default configuration to the request handlers matching the specified paths.
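A representative initParams block setting a shared default search field for several handler paths (field name and paths are illustrative):

```xml
<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>
```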
If we later want the /query request handler to search a different field by default, we could override the shared default within that single handler's own definition.
Wildcards
An example:
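A sketch of wildcard matching ("*" matches one path segment, "**" matches any depth; the chain name is illustrative):

```xml
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>
```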
UpdateHandlers in SolrConfig
...
Topics covered in this section:

Commits
commit and softCommit
autoCommit
commitWithin
Event Listeners
Transaction Log
Commits
Data sent to Solr is not searchable until it has been committed to the index.
commit and softCommit
commit is a hard commit: data is flushed all the way to stable storage. softCommit makes new documents visible quickly, enabling near real-time search, but changes not yet hard-committed are lost if the machine crashes.
autoCommit
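A typical autoCommit configuration, along the lines of the sample solrconfig.xml (a hard commit at most every 15 seconds, without opening a new searcher):

```xml
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```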
commitWithin
commitWithin forces commits to happen within a certain time period. It is used most frequently with near real-time search, and for that reason the default is to perform a soft commit.
With this configuration, when you call commitWithin as part of
your update message, it will automatically perform a hard commit every time.
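The configuration being referenced is presumably the one that disables soft commits for commitWithin:

```xml
<commitWithin>
  <softCommit>false</softCommit>
</commitWithin>
```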
Event Listeners
These can be triggered to occur after any commit (event="postCommit") or only after optimize commands (event="postOptimize")
These are the two listener event types; when the event fires, the listener can run its configured processing:
RunExecutableListener
It takes several parameters:
Transaction Log
A transaction log is required for the real-time get feature. It is configured in the updateHandler section of solrconfig.xml.
It has several configuration parameters:
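The minimal declaration from the sample configuration, with the log directory substitutable at startup:

```xml
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
```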
Query Settings in SolrConfig
The settings in this section affect the way that Solr will process and respond to queries
...
Topics covered in this section:
Caches
Query Sizing and Warming
Query-Related Listeners
Caches
Solr caches query conditions and results, so a repeated query can be answered from the cache, improving query speed. When a new index searcher is opened, its caches can be pre-warmed from the old searcher. Three implementations are available:
In Solr, there are three cache implementations: solr.search.LRUCache, solr.search.FastLRUCache, and solr.search.LFUCache .
filterCache
When the fq parameter is used, the filter condition and its matching document set are cached, so a later query with the same filter is answered quickly:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
queryResultCache
This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128" maxRamMB="1000"/>
documentCache
This cache holds Lucene Document objects (the stored fields for each document).
Since Lucene internal document IDs are transient, this cache is not auto-warmed.
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

User Defined Caches

User-defined caches can also be declared:

<cache name="myUserCache" class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="1024" regenerator="org.mycompany.mypackage.MyRegenerator" />

Another option for the regenerator, for auto-warming without custom regeneration logic, is regenerator="solr.NoOpRegenerator".
Query Sizing and Warming
maxBooleanClauses
The maximum number of boolean clauses allowed in a query. This is effectively global, so the value from the last core initialized wins:
enableLazyFieldLoading
useFilterForSortedQuery
Useful when you are not sorting by score.

queryResultWindowSize

When a query result is cached, a window of at least this many results is cached, so requests for nearby pages can be served from the cache:
queryResultMaxDocsCached
useColdSearcher

This setting controls whether search requests for which there is not a currently registered searcher should wait for a new searcher to warm up (false) or proceed immediately (true). When set to "false", requests will block until the searcher has warmed its caches.

maxWarmingSearchers
Query-Related Listeners
Two types of listeners:
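These are the newSearcher and firstSearcher events; a warming listener sketch (the query values are illustrative):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="sort">price asc</str>
    </lst>
  </arr>
</listener>
```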
RequestDispatcher in SolrConfig
Topics in this section:

handleSelect Element
requestParsers Element
httpCaching Element
handleSelect Element
Kept for backwards compatibility ...
requestParsers Element
The requestParsers sub-element controls settings related to parsing requests.

Introduces several parameters:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" formdataUploadLimitInKB="2048" addHttpRequestToContext="false" />
httpCaching Element
<httpCaching never304="false" lastModifiedFrom="openTime" etagSeed="Solr">
  <cacheControl>max-age=30, public</cacheControl>
</httpCaching>
cacheControl Element
Update Request Processors
Anatomy and life cycle
Configuration
Update processors in SolrCloud
Using custom chains
Update Request Processor Factories
Anatomy and life cycle
Updates pass through a default processor chain unless you configure a chain of your own. Each processor is created by a factory, which must meet two requirements:
An update request processor need not be thread safe because it is used by one and only one request thread and destroyed once the request is complete.
The factory class can accept configuration parameters and maintain any state that may be required between requests. The factory class must be thread-safe.
Configuration
Chains are configured in solrconfig.xml and loaded at startup, or assembled at request time via parameters. A custom chain should be modeled on the default one, which includes some required processing steps.
The default update request processor chain
In order:
LogUpdateProcessorFactory - Tracks the commands processed during this request and logs them.
DistributedUpdateProcessorFactory - Responsible for distributing update requests to the right node, e.g. routing requests to the leader of the right shard and distributing updates from the leader to each replica. This processor is activated only in SolrCloud mode.
RunUpdateProcessorFactory - Executes the update using internal Solr APIs.
Custom update request processor chain
updateRequestProcessorChain
Solr will automatically insert DistributedUpdateProcessorFactory into any chain that does not include it, just prior to the RunUpdateProcessorFactory.
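The reference guide's "dedupe" chain is a representative custom chain; a sketch of it:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```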
Configuring individual processors as top-level plugins
updateProcessor
These top-level processors can then be referenced by name from updateRequestProcessorChains, and via the processor/post-processor request parameters.
Update processors in SolrCloud
A critical SolrCloud functionality is the routing and distributing of requests – for update requests this routing is implemented by the DistributedUpdateRequestProcessor, and this processor is given a special status by Solr due to its important function.
In a distributed update chain, the processors before DistributedUpdateProcessor run on the node that first receives the request. DistributedUpdateProcessor then routes the update to the leader of the correct shard; the leader logs it and distributes it to each replica for processing.

For example:
For example, consider the "dedupe" chain which we saw in a section above. Assume that a 3-node SolrCloud cluster exists where node A hosts the leader of shard1, node B hosts the leader of shard2, and node C hosts the replica of shard2. Assume that an update request is sent to node A, which forwards the update to node B (because the update belongs to shard2), which then distributes the update to its replica, node C. Let's see what happens at each node:

Node A: Runs the update through the SignatureUpdateProcessor (which computes the signature and puts it in the "id" field), then LogUpdateProcessor and then DistributedUpdateProcessor. This processor determines that the update actually belongs to node B, and it is forwarded to node B. The update is not processed further. This is required because the next processor, RunUpdateProcessor, would execute the update against the local shard1 index, which would lead to duplicate data on shard1 and shard2.

Node B: Receives the update and sees that it was forwarded by another node. The update is sent directly to DistributedUpdateProcessor, because it has already been through the SignatureUpdateProcessor on node A and doing the same signature computation again would be redundant. The DistributedUpdateProcessor determines that the update indeed belongs to this node, distributes it to its replica on node C, and then forwards the update further in the chain to RunUpdateProcessor.

Node C: Receives the update and sees that it was distributed by its leader. The update is sent directly to DistributedUpdateProcessor, which performs some consistency checks and forwards the update further in the chain to RunUpdateProcessor.

In summary:

All processors before DistributedUpdateProcessor are run only on the first node that receives an update request, whether it be a forwarding node (e.g. node A in the above example) or a leader (e.g. node B). We call these pre-processors, or just processors.

All processors after DistributedUpdateProcessor run only on the leader and the replica nodes. They are not executed on forwarding nodes. Such processors are called "post-processors".
Using custom chains
The update.chain request parameter

You can select which configured update processor chain handles a request by passing the update.chain request parameter:
curl "http://localhost:8983/solr/gettingstarted/update/json?update.chain=dedupe&commit=true" -H 'Content-type: application/json' -d '
[
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] },
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] }
]'
processor & post-processor request parameters
These two parameters let you construct a processing chain dynamically at request time.
Constructing a chain at request time
# Executing processors configured in solrconfig.xml as (pre-)processors
curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks,signature&commit=true" -H 'Content-type: application/json' -d '
[
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] },
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] }
]'

# Executing processors configured in solrconfig.xml as pre- and post-processors
curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks&post-processor=signature&commit=true" -H 'Content-type: application/json' -d '
[
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] },
  { "name": "The Lightning Thief", "features": "This is just a test", "cat": ["book","hardcover"] }
]'
Configuring a custom chain as a default
Two ways to make a custom chain the default:
This can be done by adding either "update.chain" or "processor" and "post-processor" as default parameters for a given path, which can be done either via InitParams in SolrConfig or by adding them in a "defaults" section, which is supported by all request handlers.
An example, via InitParams:

<initParams path="/update/extract">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>

Or in the "defaults" section of a request handler:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</requestHandler>
Update Request Processor Factories
The following factory classes are available; see the reference documentation for what each does:
AddSchemaFieldsUpdateProcessorFactory
CloneFieldUpdateProcessorFactory
DefaultValueUpdateProcessorFactory
DocBasedVersionConstraintsProcessorFactory
DocExpirationUpdateProcessorFactory
IgnoreCommitOptimizeUpdateProcessorFactory
RegexpBoostProcessorFactory
SignatureUpdateProcessorFactory
StatelessScriptUpdateProcessorFactory
TimestampUpdateProcessorFactory
URLClassifyProcessorFactory
UUIDUpdateProcessorFactory
FieldMutatingUpdateProcessorFactory derived factories
ConcatFieldUpdateProcessorFactory
CountFieldValuesUpdateProcessorFactory
FieldLengthUpdateProcessorFactory
FirstFieldValueUpdateProcessorFactory
HTMLStripFieldUpdateProcessorFactory
IgnoreFieldUpdateProcessorFactory
LastFieldValueUpdateProcessorFactory
MaxFieldValueUpdateProcessorFactory
MinFieldValueUpdateProcessorFactory
ParseBooleanFieldUpdateProcessorFactory
ParseDateFieldUpdateProcessorFactory
ParseNumericFieldUpdateProcessorFactory derived classes:
  ParseDoubleFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Double values.
  ParseFloatFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Float values.
  ParseIntFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Integer values.
  ParseLongFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Long values.
PreAnalyzedUpdateProcessorFactory
RegexReplaceProcessorFactory
RemoveBlankFieldUpdateProcessorFactory
TrimFieldUpdateProcessorFactory
TruncateFieldUpdateProcessorFactory
UniqFieldsUpdateProcessorFactory
Update Processor factories that can be loaded as plugins
Several factory classes can be loaded as plugins:

LangDetectLanguageIdentifierUpdateProcessorFactory (based on the Google Code language-detection project?)
TikaLanguageIdentifierUpdateProcessorFactory
UIMAUpdateRequestProcessorFactory
Update Processor factories you should not modify or remove
It is best not to modify or remove Solr's built-in update request processor factories.
Codec Factory
Defines how the index is encoded when written to disk; if no codecFactory is defined, Solr uses its default. Configured in solrconfig.xml.
A compressionMode option:
BEST_SPEED (default) is optimized for search speed performance
BEST_COMPRESSION is optimized for disk space usage
An example:
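A codecFactory declaration selecting the compression mode:

```xml
<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>
```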
Solr Cores and solr.xml
In Solr, the term core is used to refer to a single index and associated transaction log and configuration files (including the solrconfig.xml and Schema files, among others).
In standalone mode, solr.xml must reside in solr_home. In SolrCloud mode, solr.xml will be loaded from ZooKeeper if it exists, with fallback to solr_home.
The recommended way is to dynamically create cores/collections using the APIs
The following sections describe these options in more detail.
Format of solr.xml: details on how to define solr.xml, including the acceptable parameters for the solr.xml file
Defining core.properties: details on placement of core.properties and available property options
CoreAdmin API: tools and commands for core administration using a REST API
Config Sets: how to use configsets to avoid duplicating effort when defining a new core
Format of solr.xml
This section will describe the default solr.xml file included with Solr and how to modify it for your needs. For details on how to configure core.properties, see the section Defining core.properties.
Defining solr.xml
Solr.xml Parameters: the <solr>, <solrcloud>, <logging>, <logging><watcher>, and <shardHandlerFactory> elements
Substituting JVM System Properties in solr.xml
Defining solr.xml
You can find solr.xml in your Solr Home directory or in ZooKeeper. The default solr.xml file looks like this:

Unless -DzkHost or -DzkRun is specified at startup time, the <solrcloud> section is ignored.
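The default file is approximately the following (values may differ between Solr releases):

```xml
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>
```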
Solr.xml Parameters
The <solr> Element

Introduces several attribute values.
The <solrcloud> Element
This section is ignored unless the solr instance is started with either -DzkRun or -DzkHost
Parameters for SolrCloud mode, including ZooKeeper access-control (credentials) configuration.
The <logging> Element

The logging class and whether logging is enabled.
The <logging><watcher> Element

Configuration for the logging watcher.
The <shardHandlerFactory> Element

Defines a shard handler:
Custom shard handlers can be defined in solr.xml if you wish to create a custom shard handler.
Since this is a custom shard handler, sub-elements are specific to the implementation.
Substituting JVM System Properties in solr.xml
JVM system properties can be referenced in solr.xml using the ${propertyname[:default value]} syntax; a property set dynamically at JVM startup overrides the configured default.
Defining core.properties
core.properties is a standard Java properties file. For example:
name=my_core_name
Placement of core.properties
core.properties is placed in a core's directory under solr_home.
Defining core.properties Files
name
The name of the SolrCore. You'll use this name to reference the SolrCore when running commands with the CoreAdminHandler.
config
The configuration file name for a given core. The default is solrconfig.xml.
schema
The schema file name for a given core. The default is schema.xml, but please note that if you are using a "managed schema" (the default behavior) then any value for this property which does not match the effective managedSchemaResourceName will be read once, backed up, and converted for managed schema use.
dataDir
The core's data directory (where indexes are stored), as either an absolute pathname or a path relative to the value of instanceDir. This is data by default.
configSet
The name of a defined configset, if desired, to use to configure the core
properties
The name of the properties file for this core. The value can be an absolute pathname or a path relative to the value of instanceDir.
transient
If true, the core can be unloaded if Solr reaches the transientCacheSize. The default, if not specified, is false. Cores are unloaded in least-recently-used order. Setting this to true is not recommended in SolrCloud mode.
loadOnStartup
If true (the default if not specified), the core will be loaded when Solr starts. Setting this to false is not recommended in SolrCloud mode.
coreNodeName
Used only in SolrCloud, this is a unique identifier for the node hosting this replica. By default a coreNodeName is generated automatically, but setting this attribute explicitly allows you to manually assign a new core to replace an existing replica. For example, when replacing a machine that has had a hardware failure by restoring from backups on a new machine with a new hostname or port.
ulogDir
The absolute or relative directory for the update log for this core (SolrCloud)
shard

The shard to assign this core to (SolrCloud).

collection

The name of the collection this core is part of (SolrCloud).
roles
A future parameter for SolrCloud, or a way for users to mark nodes for their own use. (The exact purpose is unclear to me.)
Additional "user defined" properties may be specified for use as variables. For more information on how to define local properties, see the section Substituting Properties in Solr Config Files.
CoreAdmin API

The CoreAdmin API is primarily for standalone (non-SolrCloud) use.
SolrCloud users should not typically use the CoreAdmin API directly
The CoreAdminHandler is not attached to a single core; it manages all the cores running in that node and is accessible at the /solr/admin/cores path.

The API is invoked via HTTP requests that specify an "action" request parameter. All action names are uppercase, and are defined in depth in the sections below.
STATUS
CREATE
RELOAD
RENAME
SWAP
UNLOAD
MERGEINDEXES
SPLIT
REQUESTSTATUS
STATUS
The STATUS action returns the status of all running Solr cores, or status for only the named core.
http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
Input
core: the name of the core for which to report status.
indexInfo: whether to include index information in the response (returned by default). When there are many cores, setting this to false speeds up the response.
CREATE
The CREATE action creates a new core and registers it. If a Solr core with the given name already exists, it will continue to handle requests while the new core is initializing. When the new core is ready, it will take new requests and the old core will be unloaded.
If the named core already exists, the old core is replaced once the new one is ready.
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data
Input

name
instanceDir
config
schema
dataDir
configSet
collection
shard
property.name=value
async
Example
http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&collection=my_collection&shard=shard2
RELOAD
The RELOAD action loads a new core from the configuration of an existing, registered Solr core.
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
Input
core
RENAME
The RENAME action changes the name of a Solr core.
http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=core5
Input
core
other
async
SWAP
The SWAP action atomically swaps the names used to access two existing cores.
http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core0
UNLOAD
The UNLOAD action removes a core from Solr.
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0
Input
MERGEINDEXES
The MERGEINDEXES action merges one or more indexes to another index.
http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core_name&indexDir=/solr_home/core1/data/index&indexDir=/solr_home/core2/data/index
Alternatively, we can instead use a srcCore parameter, as in this example:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=new_core_name&srcCore=core1&srcCore=core2
SPLIT
The SPLIT action splits an index into two or more indexes.
The SPLIT action supports five parameters, which are described in the table below
Input
core
path
Multi-valued; the directory path(s) in which a piece of the index will be written.
targetCore
Multi-valued; the target Solr core(s) to which a piece of the index will be merged.
ranges - a comma-separated list of hash ranges in hexadecimal format. (This one is unclear to me.)
split.key
async

Examples

The core index will be split into as many pieces as the number of path or targetCore parameters.

Usage with two targetCore parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2

Usage with two path parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&path=/path/to/index/1&path=/path/to/index/2

Usage with the split.key parameter:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&split.key=A!

Usage with the ranges parameter:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2&targetCore=core3&ranges=0-1f4,1f5-3e8,3e9-5dc
REQUESTSTATUS
The REQUESTSTATUS action returns the status of an asynchronous request.
Input
requestid
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1
Config Sets
A way to share configuration files among multiple cores.
On a multicore Solr instance, you may find that you want to share configuration between a number of different cores. You can achieve this using named configsets, which are essentially shared configuration directories stored under a configurable configset base directory.

To create a configset, simply add a new directory under the configset base directory. The configset will be identified by the name of this directory. Then copy into it the config directory you want to share. The structure should look something like this:

/
  /configset1
    /conf
      /managed-schema
      /solrconfig.xml
  /configset2
    /conf
      /managed-schema
      /solrconfig.xml

The default base directory is $SOLR_HOME/configsets, and it can be configured in solr.xml.

To create a new core using a configset, pass configSet as one of the core properties. For example, if you do this via the core admin API:

http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&configSet=configset1
Configuration APIs
Solr includes several APIs that can be used to modify settings in solrconfig.xml.
Blob Store API
Config API
Request Parameters API
Managed Resources
Blob Store API
The Blob Store REST API provides REST methods to store, retrieve or list files in a Lucene index.
The blob store is only available when running in SolrCloud mode
The blob store API is implemented as a requestHandler. A special collection named ".system" must be created as the collection that contains the blob store index.
Create a .system Collection
You can create the .system collection with the Collections API, as in this example:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=.system&replicationFactor=2"
Upload Files to Blob Store
After the .system collection has been created, files can be uploaded to the blob store with a request similar to the following:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename} http://localhost:8983/solr/.system/blob/{blobname}

For example, to upload a file named "test1.jar" as a blob named "test", you would make a POST request like:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @test1.jar http://localhost:8983/solr/.system/blob/test

A GET request will return the list of blobs and other details:

curl http://localhost:8983/solr/.system/blob?omitHeader=true

Output:

{
  "response":{"numFound":1,"start":0,"docs":[
    {
      "id":"test/1",
      "md5":"20ff915fa3f5a5d66216081ae705c41b",
      "blobName":"test",
      "version":1,
      "timestamp":"2015-02-04T16:45:48.374Z",
      "size":13108}]
  }
}

Details on individual blobs can be accessed with a request similar to:

curl http://localhost:8983/solr/.system/blob/{blobname}

For example, this request will return only the blob named 'test':

curl http://localhost:8983/solr/.system/blob/test?omitHeader=true

Output:

{
  "response":{"numFound":1,"start":0,"docs":[
    {
      "id":"test/1",
      "md5":"20ff915fa3f5a5d66216081ae705c41b",
      "blobName":"test",
      "version":1,
      "timestamp":"2015-02-04T16:45:48.374Z",
      "size":13108}]
  }
}

The filestream response writer can return a particular version of a blob for download, as in:

curl http://localhost:8983/solr/.system/blob/{blobname}/{version}?wt=filestream > {outputfilename}

For the latest version of a blob, the {version} can be omitted:

curl http://localhost:8983/solr/.system/blob/{blobname}?wt=filestream > {outputfilename}
This covers uploading, listing, and downloading files.
Use a Blob in a Handler or Component
To use the blob as the class for a request handler or search component, you create a request handler in solrconfig.xml as usual. You will need to define the following parameters:
class: the fully qualified class name. For example, if you created a new request handler class called CRUDHandler, you would enter org.apache.solr.core.CRUDHandler.
runtimeLib: set to true to require that this component should be loaded from the classloader that loads the runtime jars.
Config API
This feature is enabled by default and works similarly in both SolrCloud and standalone mode.
When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.
API Entry Points
Commands
  Commands for Common Properties
  Commands for Custom Handlers and Local Components
  Commands for User-Defined Properties
How to Map solrconfig.xml Properties to JSON
Examples
  Creating and Updating Common Properties
  Creating and Updating Request Handlers
  Creating and Updating User-Defined Properties
How It Works
  Empty Command
  Listening to config Changes
API Entry Points
/config: retrieve or modify the config. GET to retrieve and POST for executing commands.
/config/overlay: retrieve the details in the configoverlay.json alone.
/config/params: allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml.
Commands
The config commands are categorized into 3 different sections which manipulate various data structures in solrconfig.xml. Each of these is described below.
Common Properties Components User-defined properties
The common properties are those that frequently need to be customized in a Solr instance. They are manipulated with two commands:
set-property: Set a well known property. The names of the properties are predefined and fixed. If the property has already been set, this command will overwrite the previous setting. unset-property: Remove a property set using the set-property command.
Commands for Custom Handlers and Local Components
Command names are case-insensitive; there are three kinds of operations: add, update, and delete.
The full list of available commands follows below:
General Purpose Commands
These commands are the most commonly used:
add-requesthandler
update-requesthandler
delete-requesthandler
add-searchcomponent
update-searchcomponent
delete-searchcomponent
add-initparams
update-initparams
delete-initparams
add-queryresponsewriter
update-queryresponsewriter
delete-queryresponsewriter
Advanced Commands
These commands allow registering more advanced customizations to Solr:
add-queryparser
update-queryparser
delete-queryparser
add-valuesourceparser
update-valuesourceparser
delete-valuesourceparser
add-transformer
update-transformer
delete-transformer
add-updateprocessor
update-updateprocessor
delete-updateprocessor
add-queryconverter
update-queryconverter
delete-queryconverter
add-listener
update-listener
delete-listener
add-runtimelib
update-runtimelib
delete-runtimelib
What about <updateRequestProcessorChain>?
The Config API does not let you create or edit <updateRequestProcessorChain> elements. However, you can create individual <updateProcessor> entries with the add-updateprocessor command and use them by name in a request.

Example:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-updateprocessor": {
    "name": "firstFld",
    "class": "solr.FirstFieldValueUpdateProcessorFactory",
    "fieldName": "test_s"
  }
}'

You can then use this processor directly in a request by adding the processor parameter, such as processor=firstFld.
Commands for User-Defined Properties
Solr lets users templatize solrconfig.xml using the placeholder format ${variable_name:default_val}. You could set the values using system properties, for example, -Dvariable_name=my_custom_value. The same can be achieved at runtime using these commands:

set-user-property: Set a user-defined property. If the property has already been set, this command will overwrite the previous setting.
unset-user-property: Remove a user-defined property.

The structure of the request is similar to the structure of requests using other commands, in the format "command":{"variable_name":"property_value"}. You can add more than one variable at a time if necessary.
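The substitution rule above can be sketched in Python (an illustrative model only; Solr's resolver also consults JVM system properties and the user-defined properties set through this API):

```python
import re

# Illustrative model of Solr-style ${variable_name:default_val} substitution.
def substitute(text, props):
    def repl(m):
        name, sep, default = m.group(1).partition(":")
        if name in props:
            return str(props[name])
        if sep:                    # a ":" was present, so a default exists
            return default
        raise KeyError("no value or default for " + name)
    return re.sub(r"\$\{([^}]+)\}", repl, text)

print(substitute("<lockType>${solr.lock.type:native}</lockType>", {}))
# <lockType>native</lockType>
print(substitute("<lockType>${solr.lock.type:native}</lockType>", {"solr.lock.type": "none"}))
# <lockType>none</lockType>
```

A name with no supplied value falls back to its default; if neither is available, configuration loading fails with an error, mirroring the behavior described above.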
This is equivalent to setting JVM system properties, but done at runtime.
How to Map solrconfig.xml Properties to JSON
This section shows how request handler and component configuration maps to the JSON structures used by the API.
Here is what a request handler looks like in solrconfig.xml:

<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
The same request handler defined with the Config API would look like this:
{
  "add-requesthandler": {
    "name": "/query",
    "class": "solr.SearchHandler",
    "defaults": {
      "echoParams": "explicit",
      "wt": "json",
      "indent": true
    }
  }
}

A searchComponent in solrconfig.xml looks like this:

<searchComponent name="elevator" class="QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

And the same searchComponent with the Config API:

{
  "add-searchcomponent": {
    "name": "elevator",
    "class": "QueryElevationComponent",
    "queryFieldType": "string",
    "config-file": "elevate.xml"
  }
}

Set autoCommit properties in solrconfig.xml:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

Define the same properties with the Config API:

{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 15000,
    "updateHandler.autoCommit.openSearcher": false
  }
}
Name Components for the Config API
Components that are defined without a name must be given one explicitly so the Config API can reference them.
Examples
Creating and Updating Common Properties
This change sets query.filterCache.autowarmCount to 1000 items and unsets query.filterCache.size.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "set-property": {"query.filterCache.autowarmCount": 1000},
  "unset-property": "query.filterCache.size"
}'

Using the /config/overlay endpoint, you can verify the changes with a request like this:

curl http://localhost:8983/solr/gettingstarted/config/overlay?omitHeader=true

And you should get a response like this:

{
  "overlay":{
    "znodeVersion":1,
    "props":{"query":{"filterCache":{
      "autowarmCount":1000,
      "size":25}}}}}
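The nested props structure in the overlay response comes from splitting the dotted property name into path segments; a sketch of that transformation (illustrative, not Solr code):

```python
# Illustrative: turn a dotted property name into the nested structure
# seen in the /config/overlay response.
def nest(dotted, value):
    keys = dotted.split(".")
    node = result = {}
    for k in keys[:-1]:
        node = node.setdefault(k, {})  # descend, creating maps as needed
    node[keys[-1]] = value
    return result

print(nest("query.filterCache.autowarmCount", 1000))
# {'query': {'filterCache': {'autowarmCount': 1000}}}
```

This is why set-property takes flat dotted names while the overlay file stores a tree.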
Creating and Updating Request Handlers
To create a request handler, we can use the add-requesthandler command:
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-requesthandler": {
    "name": "/mypath",
    "class": "solr.DumpRequestHandler",
    "defaults": {"x": "y", "a": "b", "wt": "json", "indent": true},
    "useParams": "x"
  }
}'

Make a call to the new request handler to check if it is registered:

curl http://localhost:8983/solr/techproducts/mypath?omitHeader=true

And you should see the following as output:

{
  "params":{
    "indent":"true",
    "a":"b",
    "x":"y",
    "wt":"json"},
  "context":{
    "webapp":"/solr",
    "path":"/mypath",
    "httpMethod":"GET"}}
To update a request handler, you should use the update-requesthandler command:
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "update-requesthandler": {
    "name": "/mypath",
    "class": "solr.DumpRequestHandler",
    "defaults": {"x": "new value for X", "wt": "json", "indent": true},
    "useParams": "x"
  }
}'

As another example, we'll create another request handler, this time adding the 'terms' component as part of the definition:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-requesthandler": {
    "name": "/myterms",
    "class": "solr.SearchHandler",
    "defaults": {"terms": true, "distrib": false},
    "components": ["terms"]
  }
}'
Creating and Updating User-Defined Properties
This command sets a user property.
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "set-user-property": {"variable_name": "some_value"}
}'

Again, we can use the /config/overlay endpoint to verify the changes have been made:

curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true

And we would expect to see output like this:

{"overlay":{
  "znodeVersion":5,
  "userProps":{
    "variable_name":"some_value"}}
}

To unset the variable, issue a command like this:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "unset-user-property": "variable_name"
}'
How It Works
Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API 'touches' the directory (sets an empty byte[] to trigger watches), and all watchers are notified. Every core then checks whether the schema file, solrconfig.xml, or configoverlay.json has been modified by comparing znode versions; if so, the core is reloaded. If params.json is modified, the params object is just updated without a core reload.
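The decision described above can be sketched as follows (an illustrative Python model, not Solr source; the file names are examples of the watched files):

```python
# Illustrative model of the reload decision: compare znode versions and
# choose an action based on which configuration file changed.
RELOAD_FILES = {"solrconfig.xml", "managed-schema", "configoverlay.json"}

def action_on_change(filename, old_version, new_version):
    if new_version == old_version:
        return "no-op"              # nothing actually changed
    if filename in RELOAD_FILES:
        return "reload-core"        # schema/config changes need a full reload
    if filename == "params.json":
        return "refresh-params"     # params object updated in place
    return "no-op"

print(action_on_change("configoverlay.json", 4, 5))  # reload-core
print(action_on_change("params.json", 2, 3))         # refresh-params
```

This separation is why params.json edits are cheap: they never trigger a core reload.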
When the configuration is changed, the ZooKeeper watchers are notified; each core checks for modifications and, if any are found, reloads automatically.
Empty Command
If an empty command is sent to the /config endpoint, the watch is triggered on all cores using this configset.
For example:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{}'

Directly editing any files without 'touching' the directory will not make the change visible to all nodes. It is possible for components to watch for the configset 'touch' events by registering a listener using SolrCore#registerConfListener().
An empty command triggers the watches.
Listening to config Changes
Any component can register a listener using:
SolrCore#addConfListener(Runnable listener) to get notified of config changes. This is not very useful if the modified files result in core reloads (i.e., configoverlay.json or the schema). Components can use this to reload the files they are interested in.
Registering a listener for config changes.
Request Parameters API
The Request Parameters API allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml.
In this case, the parameters are stored in a file named params.json. This file is
kept in ZooKeeper or in the conf directory of a standalone Solr instance.
The settings stored in params.json are used at query time to override settings defined in solrconfig.xml in some cases as described below.
When might you want to use this feature?
To avoid frequently editing your solrconfig.xml to update request parameters that change often.
To reuse parameters across various request handlers.
To mix and match parameter sets at request time.
To avoid a reload of your collection for small parameter changes.
The Request Parameters Endpoint
All requests are sent to the /config/params endpoint of the Config API.
Setting Request Parameters
The request to set, unset, or update request parameters is sent as a set of Maps with names. These objects can be directly used in a request or a request handler definition. The available commands are:

set: Create or overwrite a parameter set map.
unset: Delete a parameter set map.
update: Update a parameter set map. This is equivalent to a map.putAll(newMap). Both maps are merged, and any keys in the new map that also exist in the old map are overwritten.

You can mix these commands into a single request if necessary.
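The update command's map.putAll semantics can be sketched as (illustrative only; the paramset contents are examples):

```python
# Illustrative model of the "update" command: merge the new map into
# the existing paramset; keys present in both are overwritten.
def update_paramset(existing, new):
    merged = dict(existing)
    merged.update(new)   # equivalent to Java's map.putAll(newMap)
    return merged

my_facets = {"facet": "true", "facet.limit": 5}
print(update_paramset(my_facets, {"facet.limit": 10}))
# {'facet': 'true', 'facet.limit': 10}
```

Note that unlike set, update never discards keys that are absent from the new map.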
Each map must include a name so it can be referenced later, either in a direct request to Solr or in a request handler definition. In the following example, we are setting 2 sets of parameters named 'myFacets' and 'myQueries'.

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set": {
    "myFacets": {
      "facet": "true",
      "facet.limit": 5}},
  "set": {
    "myQueries": {
      "defType": "edismax",
      "rows": "5",
      "df": "text_all"}}
}'
In the above example, all the parameters are equivalent to the "defaults" in solrconfig.xml. It is also possible to add invariants and appends, as follows:
curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set": {
    "my_handler_params": {
      "facet.limit": 5,
      "_invariants_": {
        "facet": true,
        "wt": "json"
      },
      "_appends_": {"facet.field": ["field1", "field2"]}
    }}
}'
Now it is possible to define a request handler that references this parameter set:

<requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>

It will be equivalent to a request handler definition with the same values inlined:

<requestHandler name="/my_handler" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="facet">true</str>
    <str name="wt">json</str>
  </lst>
  <lst name="appends">
    <arr name="facet.field">
      <str>field1</str>
      <str>field2</str>
    </arr>
  </lst>
  <lst name="defaults">
    <str name="facet.limit">5</str>
  </lst>
</requestHandler>

Update example:

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "update": {
    "myFacets": {
      "facet.limit": 10}}
}'

This command will add (or replace) the facet.limit param in the myFacets map, keeping all other existing myFacets params.

To see the parameters that have been set, you can use the /config/params endpoint to read the contents of params.json, or use the name in the request:
curl http://localhost:8983/solr/techproducts/config/params
# Or use the params name
curl http://localhost:8983/solr/techproducts/config/params/myQueries
The useParams Parameter
When making a request, the useParams parameter applies a parameter set to the request. This is translated at request time to the actual params. For example (using the names we set up in the earlier example; please replace with your own names):

http://localhost/solr/techproducts/select?useParams=myQueries

It is possible to pass more than one parameter set in the same request. For example:

http://localhost/solr/techproducts/select?useParams=myFacets,myQueries

In the above example, the param set 'myQueries' is applied on top of 'myFacets', so values in 'myQueries' take precedence over values in 'myFacets'. Additionally, any values passed directly in the request take precedence over 'useParams' params. In this respect, parameter sets act like the "defaults" specified in solrconfig.xml. Parameter sets can also be used directly in a request handler definition. Note that a useParams value specified in the request handler definition is always applied, even if the request itself contains useParams.
How to use the defined request parameter sets.
To summarize, parameters are applied in this order:
parameters defined in <invariants> in solrconfig.xml
parameters defined in _invariants_ in params.json and specified in the request handler definition or in the request
parameters defined directly in the request
parameter sets defined in the request, in the order they have been listed with useParams
parameter sets defined in params.json that have been referenced in the request handler definition
parameters defined in <defaults> in the request handler definition
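The precedence order above can be modeled by applying the parameter maps from lowest to highest priority, so that later updates win (an illustrative sketch, not Solr's implementation; the parameter values are examples):

```python
# Illustrative model of request-parameter precedence: apply maps from
# lowest priority (handler defaults) to highest (invariants).
def resolve(defaults, handler_paramsets, request_paramsets,
            request_params, json_invariants, xml_invariants):
    params = {}
    params.update(defaults)            # <defaults> in the handler definition
    for ps in handler_paramsets:       # paramsets named in the handler
        params.update(ps)
    for ps in request_paramsets:       # useParams=..., later sets win
        params.update(ps)
    params.update(request_params)      # parameters in the request itself
    params.update(json_invariants)     # _invariants_ in params.json
    params.update(xml_invariants)      # <invariants> in solrconfig.xml
    return params

print(resolve({"rows": "10"}, [], [{"rows": "5"}], {"rows": "3"}, {}, {}))
# {'rows': '3'}
```

Here a rows value passed in the request beats both the useParams set and the handler defaults, matching the ordering listed above.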
Public APIs
Accessing request parameters from Java.
The RequestParams Object can be accessed using the method SolrConfig#getRequestParams(). Each paramset can be accessed by their name using the method RequestParams#getRequestParams(String name).
Managed Resources
There are several ways to manage resources.
All of the examples in this section assume you are running the "techproducts" Solr example:
bin/solr -e techproducts
Overview
Let's begin learning about managed resources by looking at a couple of examples provided by Solr for managing stop words and synonyms using a REST API. After reading this section, you'll be ready to dig into the details of how managed resources are implemented in Solr so you can start building your own implementation.
Stop words
To begin, you need to define a field type that uses the ManagedStopFilterFactory, such as:

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
  </analyzer>
</fieldType>
There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.

The REST endpoint for managing the English stop words in the techproducts collection is: /solr/techproducts/schema/analysis/stopwords/english. The example resource path should be mostly self-explanatory. It should be noted that the ManagedStopFilterFactory implementation determines the /schema/analysis/stopwords part of the path, which makes sense because this is an analysis component defined by the schema. It follows that a field type using a filter with managed="french" would resolve to the path: /solr/techproducts/schema/analysis/stopwords/french.

So now let's see this API in action, starting with a simple GET request:
curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
Assuming you sent this request to Solr, the response body is a JSON document:

{
  "responseHeader":{
    "status":0,
    "QTime":1
  },
  "wordSet":{
    "initArgs":{"ignoreCase":true},
    "initializedOn":"2014-03-28T20:53:53.058Z",
    "managedList":[
      "a",
      "an",
      "and",
      "are",
      ... ]
  }
}

The sample_techproducts_configs config set ships with a pre-built set of managed stop words; however, you should only interact with this file using the API and not edit it directly.

One thing that should stand out to you in this response is that it contains a managedList of words as well as initArgs. This is an important concept in this framework: managed resources typically have configuration and data. For stop words, the only configuration parameter is a boolean that determines whether to ignore the case of tokens during stop word filtering (ignoreCase=true|false). The data is a list of words, which is represented as a JSON array named managedList in the response.

Now, let's add a new word to the English stop word list using an HTTP PUT:

curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"

Here we're using cURL to PUT a JSON list containing a single word "foo" to the managed English stop words set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.

You can test whether a specific word exists by sending a GET request for that word as a child resource of the set, such as:

curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"

This request will return a status code of 200 if the child resource (foo) exists, or 404 if it does not exist in the managed list.

To delete a stop word, you would do:

curl -X DELETE "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"

Note: PUT/POST is used to add terms to an existing list instead of replacing the list entirely. This is because it is more common to add a term to an existing list than it is to replace a list altogether, so the API favors the more common approach of incrementally adding terms, especially since deleting individual terms is also supported.
CRUD operations on stop words via the REST API.
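The endpoint-path composition described in this section can be sketched as follows (an illustrative helper, not part of Solr's API: the factory fixes the /schema/analysis/stopwords segment, while the managed="..." attribute supplies the final one):

```python
# Illustrative: compose the REST path for a managed stop word set.
def stopwords_endpoint(collection, managed_name):
    return "/solr/%s/schema/analysis/stopwords/%s" % (collection, managed_name)

print(stopwords_endpoint("techproducts", "french"))
# /solr/techproducts/schema/analysis/stopwords/french
```

The same pattern holds for other managed analysis resources, with the middle segment determined by the managing factory (e.g., synonyms).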
Synonyms
For the most part, the API for managing synonyms behaves similarly to the API for stop words, except that instead of working with a list of words, it uses a map, where the value for each entry in the map is a set of synonyms for a term. As with stop words, the sample_techproducts_configs config set includes a pre-built set of synonym mappings suitable for the sample data, activated by the following field type definition in schema.xml:

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
    <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
  </analyzer>
</fieldType>
Similar to stop words, except that synonyms are stored and used as a map rather than a list.
To get the map of managed synonyms, send a GET request to:
curl "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

This request will return a response that looks like:

{
  "responseHeader":{
    "status":0,
    "QTime":3},
  "synonymMappings":{
    "initArgs":{
      "ignoreCase":true,
      "format":"solr"},
    "initializedOn":"2014-12-16T22:44:05.33Z",
    "managedMap":{
      "GB": ["GiB", "Gigabyte"],
      "TV": ["Television"],
      "happy": ["glad", "joyful"]}}}
Managed synonyms are returned under the managedMap property, which contains a JSON map where the value of each entry is a set of synonyms for a term; for example, "happy" has the synonyms "glad" and "joyful" above.

To add a new synonym mapping, you can PUT/POST a single mapping such as:

curl -X PUT -H 'Content-type:application/json' --data-binary '{"mad":["angry","upset"]}' "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

The API will return status code 200 if the PUT request was successful. To determine the synonyms for a specific term, you send a GET request for the child resource; for example, /schema/analysis/synonyms/english/mad would return ["angry","upset"].

You can also PUT a list of symmetric synonyms, which will be expanded into a mapping for each term in the list. For example, you could PUT the following list of symmetric synonyms using the JSON list syntax instead of a map:

curl -X PUT -H 'Content-type:application/json' --data-binary '["funny", "entertaining", "whimsical", "jocular"]' "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"

Note that the expansion is performed when processing the PUT request, so the underlying persistent state is still a managed map. Consequently, if after sending the previous PUT request you did a GET for /schema/analysis/synonyms/english/jocular, you would receive a list containing ["funny", "entertaining", "whimsical"]. Once you've created synonym mappings using a list, each term must be managed separately.

Lastly, you can delete a mapping by sending a DELETE request to the managed endpoint.
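The symmetric-synonym expansion described above can be sketched as follows (illustrative only; the persisted state remains a plain managed map, exactly as the text notes):

```python
# Illustrative: expand a symmetric synonym list into a managed map
# where each term maps to all of the other terms.
def expand_symmetric(terms):
    return {t: [s for s in terms if s != t] for t in terms}

mapping = expand_symmetric(["funny", "entertaining", "whimsical", "jocular"])
print(mapping["jocular"])
# ['funny', 'entertaining', 'whimsical']
```

Because the expansion happens once at PUT time, later edits to one term's list do not propagate to the others.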
CRUD operations on synonyms.
Applying Changes
Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single-server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before the changes become active.
Changes take effect only after a reload.
This approach is required when running in distributed mode so that changes are applied to all cores in a collection at the same time, keeping behavior consistent and predictable. It goes without saying that you don't want one of your replicas working with a different set of stop words or synonyms than the others.

One subtle outcome of this apply-changes-at-reload approach is that once you make changes with the API, there is no way to read the active data. In other words, the API returns the most up-to-date data from an API perspective, which could differ from what is currently being used by Solr components. However, the intent of this API implementation is that changes will be applied via a reload within a short time frame after they are made, so the window in which the data returned by the API differs from what is active in the server should be negligible.
The correct activation workflow: make changes, then reload (and re-index if necessary).
RestManager Endpoint
Metadata about registered ManagedResources is available using the /schema/managed and /config/managed endpoints for each collection. Assuming you have the managed_en field type shown above defined in your schema.xml, sending a GET request to the following resource will return metadata about which schema-related resources are being managed by the RestManager:

curl "http://localhost:8983/solr/techproducts/schema/managed"

The response body is a JSON document containing metadata about managed resources under the /schema root:

{
  "responseHeader":{
    "status":0,
    "QTime":3
  },
  "managedResources":[
    {
      "resourceId":"/schema/analysis/stopwords/english",
      "class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource",
      "numObservers":"1"
    },
    {
      "resourceId":"/schema/analysis/synonyms/english",
      "class":"org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager",
      "numObservers":"1"
    }
  ]
}

You can also create a new managed resource using PUT/POST to the appropriate URL, before ever configuring anything that uses the resource. For example, imagine we want to build up a set of German stop words. Before we can start adding stop words, we need to create the endpoint:

/solr/techproducts/schema/analysis/stopwords/german

To create this endpoint, send the following PUT/POST request to the endpoint we wish to create:

curl -X PUT -H 'Content-type:application/json' --data-binary \
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"

Solr will respond with status code 200 if the request is successful. Effectively, this action registers a new endpoint for a managed resource in the RestManager. From here you can start adding German stop words as we saw above:

curl -X PUT -H 'Content-type:application/json' --data-binary '["die"]' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"

For most users, creating resources in this way should never be necessary, since managed resources are created automatically when configured.

However, you may want to explicitly delete managed resources that are no longer being used by any Solr component. For instance, the managed resource for German that we created above can be deleted because no Solr components are using it, whereas the managed resource for English stop words cannot be deleted because a token filter declared in schema.xml is using it.

curl -X DELETE "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
You can register and delete a managed resource, but a resource that is referenced in schema.xml cannot be deleted.
Solr Plugins
Solr allows you to load custom code to perform a variety of tasks within Solr, from custom Request Handlers to
process your searches, to custom Analyzers and Token Filters for your text field. You can even load custom Field Types. These pieces of custom code are called plugins. Not everyone will need to create plugins for their Solr instances - what's provided is usually enough for most applications. However, if there's something that you need, you may want to review the Solr Wiki documentation on plugins at SolrPlugins. If you have a plugin you would like to use, and you are running in SolrCloud mode, you can use the Blob Store API and the Config API to load the jars to Solr. The commands to use are described in the section Adding Custom Plugins in SolrCloud Mode.
Solr lets you define custom plugins if you need them.
Adding Custom Plugins in SolrCloud Mode
When running Solr in SolrCloud mode and you want to use custom code (such as custom analyzers, tokenizers, query parsers, and other plugins), it can be cumbersome to add jars to the classpath on all nodes in your cluster. Using the Blob Store API and special commands with the Config API, you can upload jars to a special system-level collection and dynamically load plugins from them at runtime without needing to restart any nodes.
Uploading jars to SolrCloud with these commands is more convenient than placing a copy of each jar on every node.
This Feature is Disabled By Default
In addition to requiring that Solr be running in SolrCloud mode, this feature is also disabled by default unless all Solr nodes are run with the -Denable.runtime.lib=true option on startup. Before enabling this feature, users should carefully consider the issues discussed in the Securing Runtime Libraries section below.
This feature is disabled by default and must be enabled with a startup parameter.
Uploading Jar Files
The first step is to use the Blob Store API to upload your jar files. This will put your jars in the .system collection and distribute them across your SolrCloud nodes. These jars are added to a separate classloader and are only accessible to components that are configured with the property runtimeLib=true. Such components are loaded lazily because the .system collection may not be loaded when a particular core is loaded.
Uploading the jars.
Config API Commands to use Jars as Runtime Libraries
The runtime library feature uses a special set of commands for the Config API to add, update, or remove jar files currently available in the blob store to the list of runtime libraries. The following commands are used to manage runtime libs:

add-runtimelib
update-runtimelib
delete-runtimelib
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-runtimelib": { "name": "jarblobname", "version": 2 },
  "update-runtimelib": { "name": "jarblobname", "version": 3 },
  "delete-runtimelib": "jarblobname"
}'
Managing jars with these commands.
The name to use is the name of the blob that you specified when you uploaded your jar to the blob store. You should also include the version of the jar found in the blob store that you want to use. These details are added to configoverlay.json.

The default SolrResourceLoader does not have visibility of the jars that have been defined as runtime libraries. A separate classloader that can access these jars is made available only to those components which are specially annotated. Every pluggable component can have an optional extra attribute called runtimeLib=true, which means the component is not loaded at core load time. Instead, it will be loaded on demand. If all the dependent jars are not available when the component is loaded, an error is thrown.

This example shows creating a ValueSourceParser using a jar that has been loaded to the Blob store:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "create-valuesourceparser": {
    "name": "nvl",
    "runtimeLib": true,
    "class": "solr.org.apache.solr.search.function.NvlValueSourceParser",
    "nvlFloatValue": 0.0
  }
}'
Components loaded at runtime must set the runtimeLib=true attribute so the runtime classloader is used; they are then loaded on demand.
Securing Runtime Libraries
A drawback of this feature is that it could be used to load malicious executable code into the system. However, it is possible to restrict the system to load only trusted jars by using PKI to verify that the executables loaded into the system are trustworthy. The following steps will allow you to enable security for this feature. The instructions assume you have started all your Solr nodes with the -Denable.runtime.lib=true option.
A drawback of this approach is that malicious jars could be loaded into the system; the following steps show how to load jars securely.
Step 1: Generate an RSA Private Key
The first step is to generate an RSA private key. The example below uses a 512-bit key, but you should use the strength appropriate to your needs.

$ openssl genrsa -out priv_key.pem 512
Generate a private key with the RSA algorithm.
Step 2: Output the Public Key
The public portion of the key should be output in DER format so Java can read it.

$ openssl rsa -in priv_key.pem -pubout -outform DER -out pub_key.der
The public key is output in DER format so that Java can read it.
Step 3: Load the Key to ZooKeeper
The .der files that are output from Step 2 should then be loaded to ZooKeeper under a node /keys/exe so they are available to every node. You can load any number of public keys to that node and all are valid. If a key is removed from the directory, the signatures made with that key will cease to be valid. So, before removing a key, make sure to update your runtime library configurations with valid signatures using the update-runtimelib command.

At the current time, you can only use the ZooKeeper zkCli.sh (or zkCli.cmd on Windows) script to issue these commands (the Solr version has the same name, but is not the same). If you are running the embedded ZooKeeper that is included with Solr, you do not already have this script; in order to use it, you will need to download a copy of ZooKeeper v3.4.6 from http://zookeeper.apache.org/. Don't worry about configuring the download, you're just trying to get the command line utility script. When you start the script, you will connect to the embedded ZooKeeper. If you have your own ZooKeeper ensemble running already, you can find the script in $ZK_INSTALL/bin/zkCli.sh (or zkCli.cmd if you are using Windows).

To load the keys, you will need to connect to ZooKeeper with zkCli.sh, create the directories, and then create the key file, as in the following example.

# Connect to ZooKeeper
# Replace the server location below with the correct ZooKeeper connect string for your installation.
$ ./bin/zkCli.sh -server localhost:9983

# After connection, you will interact with the ZK prompt.
# Create the directories
[zk: localhost:9983(CONNECTED) 5] create /keys
[zk: localhost:9983(CONNECTED) 5] create /keys/exe

# Now create the public key file in ZooKeeper
# The second path is the path to the .der file on your local machine
[zk: localhost:9983(CONNECTED) 5] create /keys/exe/pub_key.der /myLocal/pathTo/pub_key.der

After this, any attempt to load a jar will fail. All your jars must be signed with one of your private keys for Solr to trust them. The process to sign your jars and use the signature is outlined in Steps 4-6.
Use the ZooKeeper command-line client to create the directories and upload the public key to the designated location.
Step 4: Sign the jar File
Next, you need to sign the sha1 digest of your jar file and get the base64 string.

$ openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64

The output of this step is a string that you will need when you add the jar to your classpath in Step 6 below.
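For intuition, the digest-and-encode portion of the openssl pipeline above can be sketched in Python. This is illustrative only: the real signature also encrypts the SHA-1 digest with the RSA private key, which is omitted here.

```python
import base64
import hashlib

# Illustrative: the digest and base64 steps of the signing pipeline.
# (The actual openssl command additionally RSA-signs the digest with
# the private key before base64-encoding.)
def sha1_digest_b64(data):
    return base64.b64encode(hashlib.sha1(data).digest()).decode("ascii")

print(sha1_digest_b64(b"fake jar bytes"))
```

Solr later verifies the signature against the public keys stored under /keys/exe in ZooKeeper, so only jars signed with a matching private key are trusted.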
Generate a signature for your jar and save it.
Step 5: Load the jar to the Blob Store
Load your jar to the Blob store, using the Blob Store API. This step does not require a signature; you will need the signature in Step 6 to add it to your classpath.

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @myjar.jar http://localhost:8983/solr/.system/blob/{blobname}

The blob name that you give the jar file in this step will be used as the name in the next step.
Upload the jar to the .system collection; this step is the same as a normal blob upload.
Step 6: Add the jar to the Classpath
Finally, add the jar to the classpath using the Config API as detailed above. In this step, you will need to provide the signature of the jar that you got in Step 4.
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-runtimelib": {
    "name": "blobname",
    "version": 2,
    "sig": "mW1Gwtz2QazjfVdrLFHfbGwcr8xzFYgUOLu68LHqWRDvLG0uLcy1McQ+AzVmeZFBf1yLPDEHBWJb5KXr8bdbHN/PYgUB1nsr9pk4EFyD9KfJ8TqeH/ijQ9waa/vjqyiKEI9U550EtSzruLVZ32wJ7smvV0fj2YYhrUaaPzOn9g0="
  }
}'
Use the signature from Step 4 to add the uploaded jar to the classpath; it is then ready for use.
JVM Settings
Configuring your JVM can be a complex topic. A full discussion is beyond the scope of this document. Luckily,
most modern JVMs are quite good at making the best use of available resources with default settings. The following sections contain a few tips that may be helpful when the defaults are not optimal for your situation. For more general information about improving Solr performance, see https://wiki.apache.org/solr/SolrPerformanceFactors.
The default JVM settings are usually good; tune them only if your situation requires it.
Choosing Memory Heap Settings
The two most important JVM command-line options control the size of the memory heap.
These are -Xms,
which sets the initial size of the JVM's memory heap, and -Xmx, which sets the maximum size to which the heap is allowed to grow.
Also consider the JVM's garbage collection behavior and I/O performance when sizing the heap.
Use the Server HotSpot VM
If you are using Sun's JVM, add the -server command-line option when you start Solr. This tells the JVM that it should optimize for a long-running server process. If the Java runtime on your system is a JRE, rather than a full JDK distribution (including javac and other development tools), then it is possible that it may not support the -server JVM option. Test this by running java -help and looking for -server as an available option in the displayed usage message.
Use the -server option when running Solr.
Checking JVM Settings
A great way to see what JVM settings your server is using, along with other useful information, is to use the
admin RequestHandler, solr/admin/system. This request handler will display a wealth of server statistics and settings. You can also use any of the tools that are compatible with the Java Management Extensions (JMX). See the section Using JMX with Solr in Managing Solr for more information.
How to inspect the JVM settings of a running Solr instance.
Managing Solr
This section describes how to run Solr and how to look at Solr when it is running. It contains the following sections:

Taking Solr to Production: Describes how to install Solr as a service on Linux for production environments.
Securing Solr: How to use the Basic and Kerberos authentication and rule-based authorization plugins for Solr, and how to enable SSL.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
Making and Restoring Backups of SolrCores: Describes backup strategies for your Solr indexes.
Configuring Logging: Describes how to configure logging for Solr.
Using JMX with Solr: Describes how to use Java Management Extensions with Solr.
MBean Request Handler: How to use Solr's MBeans for programmatic access to the system plugins and stats.
As you can see, this part covers quite a lot of material.
|
Taking Solr to Production
Running Solr in a production environment.
This section provides guidance on how to setup Solr to run in production on *nix platforms, such as Ubuntu.
Specifically, we’ll walk through the process of setting up to run a single Solr instance on a Linux host and then provide tips on how to support multiple Solr nodes running on the same host.
Service Installation Script
Planning your directory structure
Solr Installation Directory
Separate Directory for Writable Files
Create the Solr user
Run the Solr Installation Script
Solr Home Directory
Environment overrides include file
Log settings
init.d script
Progress Check
Fine tune your production setup
Memory and GC Settings
Out-of-Memory Shutdown Hook
SolrCloud
ZooKeeper chroot
Solr Hostname
Override settings in solrconfig.xml
Enable Remote JMX Access
Running multiple Solr nodes per host
Service Installation Script
Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a service on Linux. Currently, the script only supports Red Hat, Ubuntu, Debian, and SUSE Linux distributions.
Before running the script, you need to determine a few parameters about your setup. Specifically, you need to decide where to install Solr and which system user should be the owner of the Solr files and process.
Use bin/install_solr_service.sh to install a Solr instance quickly.
Planning your directory structure
We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr
distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a system administrator.
The guide's recommendations for the Solr installation directory layout.
Solr Installation Directory
By default, the service installation script will extract the distribution archive into /opt. You can change this location using the -i option when running the installation script. The script will also create a symbolic link to the versioned directory of Solr. For instance, if you run the installation script for Solr X.0.0, then the following directory structure will be used:

/opt/solr-X.0.0
/opt/solr -> /opt/solr-X.0.0

Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the road, you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the upgraded version of Solr. We’ll use /opt/solr to refer to the Solr installation directory in the remaining sections of this page.
The default installation creates a symbolic link; when upgrading later, you only need to repoint the link at the new versioned Solr directory.
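The upgrade-by-repointing pattern can be sketched with plain shell commands. Paths and version numbers below are hypothetical, and /tmp is used instead of /opt so the sketch is safe to run without root:

```shell
# Hypothetical versioned install dirs under /tmp (the real installer uses /opt)
INSTALL_ROOT=/tmp/solr-demo
mkdir -p "$INSTALL_ROOT/solr-6.0.0"
# /tmp/solr-demo/solr -> /tmp/solr-demo/solr-6.0.0
ln -sfn "$INSTALL_ROOT/solr-6.0.0" "$INSTALL_ROOT/solr"

# Upgrading later: extract the new version, then repoint the link.
# Scripts that reference $INSTALL_ROOT/solr keep working unchanged.
mkdir -p "$INSTALL_ROOT/solr-6.1.0"
ln -sfn "$INSTALL_ROOT/solr-6.1.0" "$INSTALL_ROOT/solr"

readlink "$INSTALL_ROOT/solr"
```

The -n flag makes ln replace the existing link itself rather than creating a link inside the directory it points to.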
Separate Directory for Writable Files
You should also separate writable Solr files into a different directory; by default, the installation script uses /var/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr will remain untouched and all files that change while Solr is running will live under /var/solr.
The default software installation location, and the data (writable) location, both of which can be overridden.
Create the Solr user
Running Solr as root is not recommended for security reasons. Consequently, you should determine the username of a system user that will own all of the Solr files and the running Solr process. By default, the installation script will create the solr user, but you can override this setting using the -u option. If your organization has specific requirements for creating new user accounts, then you should create the user before running the script. The installation script will make the Solr user the owner of the /opt/solr and /var/solr directories. You are now ready to run the installation script.
For security, run Solr under a dedicated non-root user; by default the script creates a solr user, and this can also be overridden.
Run the Solr Installation Script
To run the script, you'll need to download the latest Solr distribution archive and then do the following (NOTE: replace solr-X.Y.Z with the actual version number):

$ tar xzf solr-X.Y.Z.tgz solr-X.Y.Z/bin/install_solr_service.sh --strip-components=2

The previous command extracts the install_solr_service.sh script from the archive into the current directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation script (sudo yum install lsof). The installation script must be run as root:

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz

By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr, and runs Solr as the solr user. Consequently, the following command produces the same result as the previous command:

$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983

You can customize the service name, installation directories, port, and owner using options passed to the installation script. To see available options, simply do:

$ sudo bash ./install_solr_service.sh -help

Once the script completes, Solr will be installed as a service and running in the background on your server (on port 8983). To verify, you can do:

$ sudo service solr status

We'll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment. Before moving on, let's take a closer look at the steps performed by the installation script. This gives you a better overview and will help you understand important details about your Solr installation when reading other pages in this guide; such as when a page refers to Solr home, you'll know exactly where that is on your system.
Some example installation commands and status checks; you can also review the options and pick settings that fit your own setup.
My own attempt:
solr-6.0.1/bin/install_solr_service.sh solr-6.0.1.zip -i /usr/local/ -d /zyy/solr
This uses the defaults for everything else.
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
This is the fully spelled-out default command.
Solr Home Directory
The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core
directories with index files. By default, the installation script uses /var/solr/data. If the -d option is used on the install script, then this will change to the data subdirectory in the location given to the -d option. Take a moment to inspect the contents of the Solr home directory on your system. If you do not store solr.xml in ZooKeeper, the home directory must contain a solr.xml file. When Solr starts up, the Solr start script passes the location of the home directory using the -Dsolr.solr.home system property.
The Solr home is communicated to Solr by setting a system property.
solr.xml must exist, whether in ZooKeeper or in the data directory.
Environment overrides include file
The service installation script creates an environment specific include file that overrides defaults used by the bin/solr script. The main advantage of using an include file is that it provides a single location where all of your environment-specific overrides are defined. Take a moment to inspect the contents of the /etc/default/solr.in.sh file, which is the default path setup by the installation script. If you used the -s option on the install script to change the name of the service, then the first part of the filename will be different. For a service named solr-demo, the file will be named /etc/default/solr-demo.in.sh. There are many settings that you can override using this file. However, at a minimum, this script needs to define the SOLR_PID_DIR and SOLR_HOME variables, such as:

SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data

The SOLR_PID_DIR variable sets the directory where the start script will write out a file containing the Solr server’s process ID.
By default /etc/default/solr.in.sh is used. This file sets the startup parameters, and the two required variables mentioned above live in this initialization script. Combined with the init.d script (in my case /usr/local/solr/bin/init.d/solr), this gives you service-style startup. (Check whether the later sections explain this; if not, summarize it myself.)
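A minimal sketch of such an include file, written to /tmp here rather than /etc/default so it can be tried without root; the variable values are the defaults described above:

```shell
# Write a minimal include file with the two required variables
cat > /tmp/solr.in.sh <<'EOF'
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
EOF

# bin/solr sources the include file at startup, picking up the overrides
. /tmp/solr.in.sh
echo "PID dir: $SOLR_PID_DIR, Solr home: $SOLR_HOME"
```

Because the file is plain shell, any variable assignment valid in sh can appear in it.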
Log settings
Solr uses Apache Log4J for logging. The installation script copies /opt/solr/server/resources/log4j.properties to /var/solr/log4j.properties and customizes it for your environment. Specifically it updates the Log4J settings to create logs in the /var/solr/logs directory. Take a moment to verify that the Solr include file is configured to send logs to the correct location by checking the following settings in /etc/default/solr.in.sh:

LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs

For more information about Log4J configuration, please see: Configuring Logging
How to configure the logging settings file and the location of the log files.
init.d script
When running a service like Solr on Linux, it’s common to setup an init.d script so that system administrators can control Solr using the service tool, such as: service solr start. The installation script creates a very basic init.d script to help you get started. Take a moment to inspect the /etc/init.d/solr file, which is the default script name setup by the installation script. If you used the -s option on the install script to change the name of the service, then the filename will be different. Notice that the following variables are setup for your environment based on the parameters passed to the installation script:
/etc/init.d/solr is the script that "service solr start" invokes; this one is important.
SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/etc/default/solr.in.sh
RUNAS=solr
These three variables mean:
the Solr installation location (used to invoke Solr)
the Solr environment override file
the user the Solr process runs as
The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the owner of the Solr process, such as solr; if you don’t set this value, the script will run Solr as root, which is not recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the following as root:

# service solr start

The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind that the init script that ships with Solr is very basic and is intended to show you how to setup Solr as a service. However, it’s also common to use more advanced tools like supervisord or upstart to control Solr as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the installation script sets the Solr service to start automatically when the host machine initializes.
Progress Check
In the next section, we cover some additional environment settings to help you fine-tune your production setup.
However, before we move on, let's review what we've achieved thus far. Specifically, you should be able to control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:

$ sudo service solr restart
$ sudo service solr status

The status command should give some basic information about the running Solr node that looks similar to:

Solr process PID running on port 8983
{
  "version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
  "startTime":"2014-12-19T19:25:46.853Z",
  "uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
  "memory":"85.4 MB (%17.4) of 490.7 MB"}

If the status command is not successful, look for error messages in /var/solr/logs/solr.log.
Fine tune your production setup
Memory and GC Settings
By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for getting started with Solr. For production, you’ll want to increase the maximum heap size based on the memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production servers. When you need to change the memory settings for your Solr server, use the SOLR_JAVA_MEM variable in the include file, such as:

SOLR_JAVA_MEM="-Xms10g -Xmx10g"

Also, the include file comes with a set of pre-configured Java Garbage Collection settings that have been shown to work well with Solr for a number of different workloads. However, these settings may not work well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also be done with the GC_TUNE variable in the /etc/default/solr.in.sh include file. For more information about tuning your memory and garbage collection settings, see: JVM Settings.
Parameter settings for memory and GC.
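As a sketch, the two include-file variables might look like this. The heap size and GC flags are illustrative values, not recommendations, and the fragment is written to /tmp so it can be inspected without root:

```shell
cat > /tmp/solr-mem.in.sh <<'EOF'
# Equal -Xms/-Xmx avoids runtime heap resizing; 10g is only an example value
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
# GC_TUNE replaces the pre-configured GC settings when set (flags illustrative)
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"
EOF

# bin/solr would source this file and pass the flags to the JVM
. /tmp/solr-mem.in.sh
echo "$SOLR_JAVA_MEM"
```

Setting -Xms equal to -Xmx is a common practice for server JVMs so the heap is allocated once at startup.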
Out-of-Memory Shutdown Hook
The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError
occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will perform if it is invoked by the JVM.
When an out-of-memory error occurs, Solr invokes oom_solr.sh to kill the affected process.
SolrCloud
To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your
ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port 2181 (zk1, zk2, and zk3), then you would set:
ZK_HOST=zk1,zk2,zk3
When the ZK_HOST variable is set, Solr will launch in "cloud" mode.
When using SolrCloud mode, set the ZK_HOST variable; once it is set, Solr automatically starts in SolrCloud mode!
ZooKeeper chroot
If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud
znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:

ZK_HOST=zk1,zk2,zk3/solr

Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the zkcli.sh script. We can use the makepath command for that:

$ server/scripts/cloud-scripts/zkcli.sh -zkhost zk1,zk2,zk3 -cmd makepath /solr

If you also want to bootstrap ZooKeeper with existing solr_home, you can instead use zkcli.sh / zkcli.bat's bootstrap command, which will also create the chroot path if it does not exist. See Command Line Utilities for more info.
If you share a ZooKeeper ensemble with other systems, it is recommended to change the root znode SolrCloud uses, configured as above; you need to create that znode yourself!
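A small sketch of how the chroot rides along in the connection string. The host names are the placeholder zk1/zk2/zk3 from above, and only string handling is shown; no ZooKeeper ensemble is contacted:

```shell
# Connection string with a /solr chroot appended after the last host
ZK_HOST="zk1,zk2,zk3/solr"

# Everything after the final '/' is the chroot path ZooKeeper clients will use
CHROOT="${ZK_HOST##*/}"
echo "chroot znode: /$CHROOT"

# Before first use, the znode must exist, e.g. (run against a real ensemble):
# server/scripts/cloud-scripts/zkcli.sh -zkhost zk1,zk2,zk3 -cmd makepath /solr
```

All Solr nodes in the cluster must use the identical ZK_HOST string, chroot included, or they will register in different znode trees.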
Solr Hostname
Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.
SOLR_HOST=solr1.example.com
Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this
determines the address of the node when it registers with ZooKeeper.
In SolrCloud mode, setting the SOLR_HOST parameter is recommended.
Override settings in solrconfig.xml
Solr allows configuration properties to be overridden using Java system properties passed at startup using the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set to: In general, whenever you see a property in a Solr configuration file that uses the ${solr.PROPERTY:DEFAULT_VALUE} syntax, then you know it can be overridden using a Java system property. For instance, to set the maxTime for soft-commits to be 10 seconds, then you can start Solr with -Dsolr.autoSoftCommit.maxTime=10000, such as:

$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000

The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file. Keeping with our soft-commit example, in /etc/default/solr.in.sh, you would do:

SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
An example of how to set a system property in the startup include file:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
That is all it takes.
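The include file accumulates such flags by appending to SOLR_OPTS; a sketch of the accumulation (no Solr is started, we only build the variable):

```shell
# Each override is appended, preserving earlier flags
SOLR_OPTS=""
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.lock.type=native"

# bin/solr passes every -D flag through to the JVM at startup
echo "$SOLR_OPTS"
```

Because the variable references itself on the right-hand side, multiple include-file lines can each add one override without clobbering the rest.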
Enable Remote JMX Access
If you need to attach a JMX-enabled Java profiling tool, such as JConsole or VisualVM, to a remote Solr server,
then you need to enable remote JMX access when starting the Solr server. Simply change the ENABLE_REMOTE_JMX_OPTS property in the include file to true. You’ll also need to choose a port for the JMX RMI connector to bind to, such as 18983. For example, if your Solr include script sets:
Example settings:
ENABLE_REMOTE_JMX_OPTS=true
RMI_PORT=18983

The JMX RMI connector will allow Java profiling tools to attach to port 18983. When enabled, the following properties are passed to the JVM when starting Solr:

-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.port=18983 \
-Dcom.sun.management.jmxremote.rmi.port=18983

We don’t recommend enabling remote JMX access in production, but it can sometimes be useful when doing performance and user-acceptance testing prior to going into production.
How to enable remote JMX access to the system (not recommended for production).
Running multiple Solr nodes per host
The bin/solr script is capable of running multiple instances on one machine, but for a typical installation, this
is not a recommended setup. Extra CPU and memory resources are required for each additional instance. A single instance is easily capable of handling multiple indexes.
When to ignore the recommendation
For every recommendation, there are exceptions, particularly when discussing extreme scalability. The best reason for running multiple Solr nodes on one host is decreasing the need for extremely large heaps.

When the Java heap gets very large, it can result in extremely long garbage collection pauses, even with the GC tuning that the startup script provides by default. The exact point at which the heap is considered "very large" will vary depending on how Solr is used. This means that there is no hard number that can be given as a threshold, but if your heap is reaching the neighborhood of 16 to 32 gigabytes, it might be time to consider splitting nodes. Ideally this would mean more machines, but budget constraints might make that impossible.

There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use compressed pointers, but above that point, larger pointers are required, which uses more memory and slows down the JVM. Because of the potential garbage collection issues and the particular issues that happen at 32GB, if a single instance would require a 64GB heap, performance is likely to improve greatly if the machine is set up with two nodes that each have a 31GB heap.
Running multiple Solr instances on a single machine is generally not recommended; the main exception relates to Java garbage collection with very large heaps.
If your use case requires multiple instances, at a minimum you will need unique Solr home directories for each
node you want to run; ideally, each home should be on a different physical disk so that multiple Solr nodes don’t have to compete with each other when accessing files on disk. Having different Solr home directories implies that you’ll need a different include file for each node. Moreover, if using the /etc/init.d/solr script to control Solr as a service, then you’ll need a separate script for each node. The easiest approach is to use the service installation script to add multiple services on the same host, such as:
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -s solr2 -p 8984
The command shown above will add a service named solr2 running on port 8984 using /var/solr2 for
writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:

$ sudo service solr2 restart
$ sudo service solr2 status
If you really need to start multiple instances, handle it with a command like the one above.
This effectively installs a second copy; we could write a script of our own to automate the deployment.
...once I've finished studying the Linux material, I'll write automated deployment scripts for both the cluster and single-node setups.
|
Securing Solr
When planning how to secure Solr, you should consider which of the available features or approaches are right
for you.
Authentication or authorization of users using:
Kerberos Authentication Plugin
Basic Authentication Plugin
Rule-Based Authorization Plugin
Custom authentication or authorization plugin
Enabling SSL
If using SolrCloud, ZooKeeper Access Control
This part mainly covers securing Solr: authentication and authorization.
I've written about this before; I only need to look into access control for the standalone version.
|
Kerberos Authentication Plugin
Kerberos still requires a server to act as a central ticket-granting authority.
|
SolrCloud
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high
availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting the following features:
Central configuration for the entire cluster
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration.
In this section, we'll cover everything you need to know about using Solr in SolrCloud mode. We've split up the
details into the following topics:
Getting Started with SolrCloud
How SolrCloud Works
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
SolrCloud Configuration and Parameters
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
ConfigSets API
Rule-based Replica Placement
Cross Data Center Replication (CDCR) |
Getting Started with SolrCloud
SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed
content and query requests across multiple servers. It's a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.
This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of
what it is you're trying to accomplish. This page provides a simple tutorial to start Solr in SolrCloud mode, so you can begin to get a sense for how shards interact with each other during indexing and when serving queries. To that end, we'll use simple examples of configuring SolrCloud on a single machine, which is obviously not a real production environment; a real production environment would include several servers or virtual machines, and you would use real machine names instead of the "localhost" we've used here. In this section you will learn how to start a SolrCloud cluster using startup scripts and a specific configset.
SolrCloud Example
Interactive Startup
The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of launching Solr nodes in cloud mode and adding a collection. To get started, simply do:

$ bin/solr -e cloud

This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local cluster, with the default being 2.

Welcome to the SolrCloud example! This interactive session will help you launch a SolrCloud cluster on your local workstation. To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]

The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. These nodes will each exist on a single machine, but will use different ports to mimic operation on different servers. Next, the script will prompt you for the port to bind each of the Solr nodes to, such as:

Please enter the port for node1 [8983]

Choose any available port for each node; the default for the first node is 8983 and 7574 for the second node. The script will start each node in order and shows you the command it uses to start the server, such as:

solr start -cloud -s example/cloud/node1/solr -p 8983

The first node will also start an embedded ZooKeeper server bound to port 9983. The Solr home for the first node is in example/cloud/node1/solr as indicated by the -s option. After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:

Please provide a name for your new collection: [gettingstarted]

The suggested default is "gettingstarted" but you might want to choose a name more appropriate for your specific search application. Next, the script prompts you for the number of shards to distribute the collection across.
Sharding is covered in more detail later on, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is distributed across multiple nodes in a SolrCloud cluster. Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is handled in SolrCloud. Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose basic_configs, data_driven_schema_configs, or sample_techproducts_configs. The configuration directories are pulled from server/solr/configsets/ so you can review them beforehand if you wish. The data_driven_schema_configs configuration (the default) is useful when you're still designing a schema for your documents and need some flexibility as you experiment with Solr. At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run the status command:

$ bin/solr status

If you encounter any errors during this process, check the Solr log files in example/cloud/node1/logs and example/cloud/node2/logs. You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud. Solr also provides a way to perform basic diagnostics for a collection using the healthcheck command:

$ bin/solr healthcheck -c gettingstarted

The healthcheck command gathers basic information about each replica in a collection, such as number of docs, current status (active, down, etc), and address (where the replica lives in the cluster). Documents can now be added to SolrCloud using the Post Tool. To stop Solr in SolrCloud mode, you would use the bin/solr script and issue the stop command, as in:

$ bin/solr stop -all
How to start a SolrCloud-mode example.
Starting with -noprompt
Start SolrCloud using the default values.
You can also get SolrCloud started with all the defaults instead of the interactive session using the following
command:

$ bin/solr -e cloud -noprompt
Restarting Nodes
Restart the corresponding cluster nodes.
You can restart your SolrCloud nodes using the bin/solr script. For instance, to restart node1 running on port
8983 (with an embedded ZooKeeper server), you would do:

$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr

To restart node2 running on port 7574, you can do:

$ bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr

Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join the cluster with node1.
Adding a node to a cluster
Add a new node to the SolrCloud cluster.
Adding a node to an existing cluster is a bit advanced and involves a little more understanding of Solr. Once you
startup a SolrCloud cluster using the startup scripts, you can add a new node to it by creating a Solr home directory for the new node and starting it in cloud mode:

$ mkdir <solr home for new node>
$ cp <existing solr.xml path> <new solr home>
$ bin/solr start -cloud -s <new solr home>/solr -p <port> -z <zk hosts>

Notice that the above requires you to create a Solr home directory. You either need to copy solr.xml to the solr_home directory, or keep it centrally in ZooKeeper at /solr.xml. Example (with directory structure) that adds a node to an example started with "bin/solr -e cloud":

$ mkdir -p example/cloud/node3/solr
$ cp server/solr/solr.xml example/cloud/node3/solr
$ bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983

The previous command will start another Solr node on port 8987 with Solr home set to example/cloud/node3/solr. The new node will write its log files to example/cloud/node3/logs. Once you're comfortable with how the SolrCloud example works, we recommend using the process described in Taking Solr to Production for setting up SolrCloud nodes in production. |
How SolrCloud Works
The following sections provide general information about how various SolrCloud features work. To understand these features, it's important to first understand a few key concepts that relate to SolrCloud.
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
If you are already familiar with SolrCloud concepts and basic functionality, you can skip to the section covering SolrCloud Configuration and Parameters.
Key SolrCloud Concepts
A SolrCloud cluster consists of some "logical" concepts layered on top of some "physical" concepts.
Logical
A Cluster can host multiple Collections of Solr Documents. A collection can be partitioned into multiple Shards, which contain a subset of the Documents in the Collection. The number of Shards that a Collection has determines:
The theoretical limit to the number of Documents that Collection can reasonably contain.
The amount of parallelization that is possible for an individual search request.
Physical
A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process.
Each Node can host multiple Cores. Each Core in a Cluster is a physical Replica for a logical Shard. Every Replica uses the same configuration specified for the Collection that it is a part of. The number of Replicas that each Shard has determines:
The level of redundancy built into the Collection and how fault tolerant the Cluster can be in the event that some Nodes become unavailable.
The theoretical limit on the number of concurrent search requests that can be processed under heavy load.
Shards and Indexing Data in SolrCloud
When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index. A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined. Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however, several problems with the distributed approach that necessitated improvement with SolrCloud: Splitting of the core into shards was somewhat manual. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own what shards to send documents to. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them and if one shard died it was just gone. SolrCloud fixes all those problems. There is support for distributing both the index process and the queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have multiple replicas for additional robustness. In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection. If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.
When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a leader. If the machine is a replica, the document is forwarded to the leader for processing. If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.
Why shards and replicas are used, and how they work.
Document Routing
Solr offers the ability to specify the router implementation used by a collection by specifying the router.name p
arameter when creating your collection. If you use the "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to. Then at query time, you include the prefix(es) into your query with the _route_ parameter (i.e., q=solr&_rout e_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards The compositeId router supports prefixes containing up to 2 levels of routing. For example: a prefix routing first by region, then by customer: "USA!IBM!12345" Another use case could be if the customer "IBM" has a lot of documents and you want to spread it across multiple shards. The syntax for such a use case would be : "shard_key/num!document_id" where the /num is the number of bits from the shard key to use in the composite hash. So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards in the collection. Likewise if the num value was 2 it would spread the documents across 1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits into your query with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to specific shards. 
If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID. If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify a shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.
Document routing rules for sharding and how to use the routing parameters.
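The composite ID formats above can be sketched with a small shell helper (illustrative only - the build_routed_id function is not part of Solr; it just joins a shard key and a document ID with the '!' separator Solr expects):

```shell
# Illustrative helper: build a compositeId-routed document ID.
# $1 = shard key, optionally with a bit count appended as "key/num"
# $2 = document ID
build_routed_id() {
  printf '%s!%s\n' "$1" "$2"
}

build_routed_id IBM 12345        # prints IBM!12345
build_routed_id IBM/3 12345      # prints IBM/3!12345  (3 bits from shard key)
build_routed_id 'USA!IBM' 12345  # prints USA!IBM!12345 (2-level routing)
```

The same prefix (minus the document ID) is what you would pass in the _route_ parameter at query time, e.g. _route_=IBM! or _route_=IBM/3!.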
Shard Splitting
When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.
The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready. More details on how to use shard splitting are in the section on the Collections API.
Shard splitting lets you grow the number of shards, addressing the fact that the shard count is fixed at creation time.
Ignoring Commits from Client Applications in SolrCloud
In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster.
To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commit and/or optimize requests from client applications without having to refactor your client application code. To activate this request processor you'll need to add the following to your solrconfig.xml:
For distributed SolrCloud you should avoid explicit client commits and rely on automatic commits plus soft commits instead. Since client behavior cannot always be guaranteed, solrconfig.xml can be configured as follows to ignore commit/optimize requests.
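The solrconfig.xml snippet referenced here appears to have been lost in extraction; a chain along the following lines, modeled on the stock Solr example, ignores client commits while returning a 200 status (the chain name and status code are the conventional values, shown here as a sketch):

```xml
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```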
As shown in the example above, the processor will return 200 to the client but will ignore the commit / optimize request. Notice that you need to wire-in the implicit processors needed by SolrCloud as well, since this custom chain is taking the place of the default chain.
In the following example, the processor will raise an exception with a 403 code with a customized error message:
You can also define an exception to be raised when a commit or optimize command is sent.
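A sketch of such a configuration (the statusCode and responseMessage values are illustrative):

```xml
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">403</int>
    <str name="responseMessage">Thou shall not issue a commit!</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```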
Lastly, you can also configure it to just ignore optimize and let commits pass through by doing:
It can also be configured to ignore only optimize operations and let commit commands pass through.
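A minimal sketch using the ignoreOptimizeOnly flag (the chain name is illustrative):

```xml
<updateRequestProcessorChain name="ignore-optimize-only-from-client">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <bool name="ignoreOptimizeOnly">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```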
Distributed Requests
When a Solr node receives a search request, that request is routed behind the scenes to a replica of some shard that is part of the collection being searched. The chosen replica will act as an aggregator: creating internal requests to randomly chosen replicas of every shard in the collection, coordinating the responses, issuing any subsequent internal requests as needed (for example, to refine facet values, or request additional stored fields) and constructing the final response for the client.
Limiting Which Shards are Queried
Directing queries at specific shards.
One of the advantages of using SolrCloud is the ability to distribute very large collections across multiple shards - but in some cases you may know that you are only interested in results from a subset of your shards. You have the option of searching over all of your data or just parts of it.
Querying all shards for a collection should look familiar; it's as though SolrCloud didn't even come into play:
http://localhost:8983/solr/gettingstarted/select?q=*:*
If, on the other hand, you wanted to search just one shard, you can specify that shard by its logical ID, as in:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1
If you want to search a group of shard IDs, you can specify them together:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,shard2
In both of the above examples, the shard ID(s) will be used to pick a random replica of that shard. Alternatively, you can specify the explicit replicas you wish to use in place of shard IDs:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted,localhost:8983/solr/gettingstarted
Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the pipe symbol (|):
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted
And of course, you can specify a list of shards (separated by commas), each defined by a list of replicas (separated by pipes). In this example, 2 shards are queried, the first being a random replica from shard1, the second being a random replica from the explicit pipe-delimited list:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted
Configuring the ShardHandlerFactory
You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer-grained control, and you can tune it to target your own specific requirements. The default configuration favors throughput over latency. To configure the standard handler, provide a configuration like this in solrconfig.xml:
Configuring the shard handler and its parameters.
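The configuration referenced here appears to be missing from the extracted text; a sketch of the standard form, with illustrative timeout values in milliseconds:

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- other request handler params go here -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="socketTimeout">1000</int>
    <int name="connTimeout">5000</int>
  </shardHandlerFactory>
</requestHandler>
```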
Configuring statsCache (Distributed IDF)
Document and term statistics are needed in order to calculate relevancy. Solr provides four implementations out of the box when it comes to document stats calculation:
LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases with uniform term distribution across shards, this works reasonably well. This option is the default if no statsCache is configured.
ExactStatsCache: This implementation uses global values (across the collection) for document frequency.
ExactSharedStatsCache: This is exactly like the exact stats cache in its functionality, but the global stats are reused for subsequent requests with the same terms.
LRUStatsCache: This implementation uses an LRU cache to hold global stats, which are shared between requests.
The implementation can be selected by setting a statsCache element in solrconfig.xml. For example, the following line makes Solr use the ExactStatsCache implementation:
Document and term statistics and the available implementations.
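The referenced line appears to have been lost in extraction; as a sketch, it would look like this in solrconfig.xml:

```xml
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```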
Avoiding Distributed Deadlock
Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should
be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock. For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the incoming requests will be blocked until the other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like this.
Configure enough threads to avoid distributed deadlock.
Prefer Local Shards
Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, then the query controller will look for local replicas to service the query instead of selecting replicas at random from across the cluster. This is useful when a query requests many fields or large fields to be returned per document, because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases, because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed towards collections with a small number of shards and many replicas.
Also, this option should only be used if you are load balancing requests across all nodes that host replicas for the collection you are querying, as Solr's CloudSolrClient will do. If not load-balancing, this feature can introduce a hotspot in the cluster since queries won't be evenly distributed across the cluster.
Preferring local replicas can be more efficient in some scenarios.
Read and Write Side Fault Tolerance
Read- and write-side fault tolerance.
SolrCloud supports elasticity, high availability, and fault tolerance in reads and writes. What this means, basically, is that when you have a large cluster, you can always make requests to the cluster: Reads will return results whenever possible, even if some nodes are down, and Writes will be acknowledged only if they are durable; i.e., you won't lose data.
Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read requests across all the replicas in a collection. You still need a load balancer on the 'outside' that talks to the cluster, or you need a smart client which understands how to read and interact with Solr's metadata in ZooKeeper and only requests the ZooKeeper ensemble's address to start discovering to which nodes it should send requests. (Solr provides a smart Java SolrJ client called CloudSolrClient.)
Even if some nodes in the cluster are offline or unreachable, a Solr node will be able to correctly respond to a search request as long as it can communicate with at least one replica of every shard, or one replica of every relevant shard if the user limited the search via the 'shards' or '_route_' parameters. The more replicas there are of every shard, the more likely the Solr cluster will be able to serve search requests in the event of node failures.
zkConnected
The zkConnected flag reports the node's ZooKeeper connection state at the time it handled the request.
A Solr node will return the results of a search request as long as it can communicate with at least one replica of every shard that it knows about, even if it cannot communicate with ZooKeeper at the time it receives the request. This is normally the preferred behavior from a fault-tolerance standpoint, but it may result in stale or incorrect results if there have been major changes to the collection structure that the node has not been informed of via ZooKeeper (i.e., shards may have been added or removed, or split into sub-shards).
A zkConnected header is included in every search response, indicating whether the node that processed the request was connected with ZooKeeper at the time:
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 107,
    "start": 0,
    "docs": [ ... ]
  }
}
shards.tolerant
In the event that one or more shards queried are completely unavailable, Solr's default behavior is to fail the request. However, there are many use cases where partial results are acceptable, and so Solr provides a boolean shards.tolerant parameter (default 'false'). If shards.tolerant=true then partial results may be returned. If the returned response does not contain results from all the appropriate shards, then the response header contains a special flag called 'partialResults'. The client can specify 'shards.info' along with the 'shards.tolerant' parameter to retrieve more fine-grained details.
Example response with the partialResults flag set to 'true':
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "partialResults": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 77,
    "start": 0,
    "docs": [ ... ]
  }
}
Write Side Fault Tolerance
Write-side fault tolerance.
SolrCloud is designed to replicate documents to ensure redundancy for your data, and enable you to send update requests to any node in the cluster. That node will determine if it hosts the leader for the appropriate shard, and if not it will forward the request to the leader, which will then forward it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. If the leader goes down, another replica can take its place. This architecture enables you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real Time Searching.
Describes the update flow and disaster recovery.
Recovery
A Transaction Log is created for each node so that every change to content or organization is noted. The log is
used to determine which content in the node should be included in a replica. When a new replica is created, it refers to the Leader and the Transaction Log to know which content to include. If it fails, it retries.
A replica learns from the leader and the transaction log which data it should store.
Since the Transaction Log consists of a record of updates, it allows for more robust indexing because it includes
redoing the uncommitted updates if indexing is interrupted. If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential leader is identified, it runs a synch process against the other replicas. If this is successful, everything should be consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the system asks for a full replication/replay-based recovery. If an update fails because cores are reloading schemas and some have finished but others have not, the leader tells the nodes that the update failed and starts the recovery procedure.
Achieved Replication Factor
When using a replication factor greater than one, an update request may succeed on the shard leader but fail on one or more of the replicas. For instance, consider a collection with one shard and a replication factor of three. In this case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but fails on both replicas, for whatever reason, the update request is still considered successful from the perspective of the client. The replicas that missed the update will sync with the leader when they recover. Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current leader).
Solr supports the optional min_rf parameter on update requests, which causes the server to return the achieved replication factor for an update request in the response. For the example scenario described above, if the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted, as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor, as Solr does not support rolling back updates that succeed on a subset of replicas.
On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded and then resend the updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that an update request was accepted while the collection is in a degraded state.
When an update succeeds on the leader but fails on the replicas, SolrCloud still treats the update as successful; the replicas that missed it will later recover by syncing with the leader. You can, however, use the min_rf parameter to learn how many replicas actually received the update, and handle the result on the client side accordingly.
SolrCloud Configuration and Parameters
Cluster configuration and parameters.
In this section, we'll cover the various configuration options for SolrCloud.
The following sections cover these topics:
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
ConfigSets API
Setting Up an External ZooKeeper Ensemble
Using an external ZooKeeper ensemble.
Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this
internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum of more than half its servers running at any given time, this can be a problem. The solution to this problem is to set up an external ZooKeeper ensemble. Fortunately, while this process can seem intimidating due to the number of powerful options, setting up a simple ensemble is actually quite straightforward, as described below.
Why to use an external ZooKeeper ensemble, and why setting one up is simple.
How Many ZooKeepers?
ZooKeeper deployments are usually made up of an odd number of machines.
When planning how many ZooKeeper nodes to configure, keep in mind that the main principle for a ZooKeeper ensemble is maintaining a majority of servers to serve requests. This majority is also called a quorum. It is generally recommended to have an odd number of ZooKeeper servers in your ensemble, so a majority is maintained.
For example, if you only have two ZooKeeper nodes and one goes down, 50% of available servers is not a majority, so ZooKeeper will no longer serve requests. However, if you have three ZooKeeper nodes and one goes down, you still have 66% of your servers available, and ZooKeeper will continue normally while you repair the one down node. If you have 5 nodes, you could continue operating with two down nodes if necessary.
More information on ZooKeeper clusters is available from the ZooKeeper documentation at http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup.
Download Apache ZooKeeper
Download the software.
The first step in setting up Apache ZooKeeper is, of course, to download the software. It's available from http://zookeeper.apache.org/releases.html.
Solr currently uses Apache ZooKeeper v3.4.6.
Steps to set up ZooKeeper.
Setting Up a Single ZooKeeper
Create the instance
Creating the instance is a simple matter of extracting the files into a specific target directory. The actual directory itself doesn't matter, as long as you know where it is, and where you'd like to have ZooKeeper store its internal data.
Configure the instance
The next step is to configure your ZooKeeper instance. To do that, create the following file: <ZOOKEEPER_HOME>/conf/zoo.cfg. To this file, add the following information:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
tickTime: Part of what ZooKeeper does is to determine which servers are up and running at any given time, and the minimum session timeout is defined as two "ticks". The tickTime parameter specifies, in milliseconds, how long each tick should be.
dataDir: This is the directory in which ZooKeeper will store data about the cluster. This directory should start out empty.
clientPort: This is the port on which Solr will access ZooKeeper.
Once this file is in place, you're ready to start the ZooKeeper instance.
The configuration settings explained.
Run the instance
To run the instance, you can simply use the ZOOKEEPER_HOME/bin/zkServer.sh script provided, as with this
command: zkServer.sh start Again, ZooKeeper provides a great deal of power through additional configurations, but delving into them is beyond the scope of this tutorial. For more information, see the ZooKeeper Getting Started page. For this example, however, the defaults are fine.
Point Solr at the instance
Pointing Solr at the ZooKeeper instance you've created is a simple matter of using the -z parameter when using the bin/solr script. For example, in order to point the Solr instance to the ZooKeeper you've started on port 2181, this is what you'd need to do:
Starting the cloud example with ZooKeeper already running at port 2181 (with all other defaults):
bin/solr start -e cloud -z localhost:2181 -noprompt
Add a node pointing to an existing ZooKeeper at port 2181:
bin/solr start -cloud -s <path to solr home for new node> -z localhost:2181
NOTE: When you are not using an example to start Solr, make sure you upload the configuration set to ZooKeeper before creating the collection.
Start SolrCloud using this ZooKeeper instance.
Shut down ZooKeeper
To shut down ZooKeeper, use the zkServer script with the "stop" command: zkServer.sh stop
Setting up a ZooKeeper Ensemble
Setting up the ZooKeeper ensemble.
With an external ZooKeeper ensemble, you need to set things up just a little more carefully as compared to the Getting Started example. The difference is that rather than simply starting up the servers, you need to configure them to know about and talk to each other first. So your original zoo.cfg file might look like this:
dataDir=/var/lib/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Here you see three new parameters:
initLimit: Amount of time, in ticks, to allow followers to connect and sync to a leader. In this case, you have 5 ticks, each of which is 2000 milliseconds long, so the server will wait as long as 10 seconds to connect and sync with the leader.
syncLimit: Amount of time, in ticks, to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped.
server.X: These are the IDs and locations of all servers in the ensemble, and the ports on which they communicate with each other. The server ID must additionally be stored in the dataDir of each ZooKeeper instance. The ID identifies each server, so in the case of this first instance, you would create the file /var/lib/zookeeperdata/1/myid with the content "1".
Now, whereas with Solr you need to create entirely new directories to run multiple instances, all you need for a new ZooKeeper instance, even if it's on the same machine for testing purposes, is a new configuration file. To complete the example you'll create two more configuration files. The second, zoo2.cfg, should contain:
tickTime=2000
dataDir=/var/lib/zookeeperdata/2
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
You'll also need to create zoo3.cfg:
tickTime=2000
dataDir=/var/lib/zookeeperdata/3
clientPort=2183
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Finally, create your myid files in each of the dataDir directories so that each server knows which instance it is.
The ID in the myid file on each machine must match the "server.X" definition. So, the ZooKeeper instance (or machine) named "server.1" in the above example must have a myid file containing the value "1". The myid file can contain any integer between 1 and 255, and must match the server IDs assigned in the zoo.cfg file.
To start the servers, you can simply explicitly reference the configuration files:
cd <ZOOKEEPER_HOME>
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
bin/zkServer.sh start zoo3.cfg
Once these servers are running, you can reference them from Solr just as you did before:
bin/solr start -e cloud -z localhost:2181,localhost:2182,localhost:2183 -noprompt
For more information on getting the most power from your ZooKeeper installation, check out the ZooKeeper Administrator's Guide.
Configure and start the ZooKeeper ensemble, then start Solr against it.
Securing the ZooKeeper connection
You may also want to secure the communication between ZooKeeper and Solr. To set up ACL protection of znodes, see ZooKeeper Access Control.
Using ZooKeeper to Manage Configuration Files
With SolrCloud your configuration files are kept in ZooKeeper. These files are uploaded in any of the following cases:
When you start a SolrCloud example using the bin/solr script.
When you create a collection using the bin/solr script.
When you explicitly upload a configuration set to ZooKeeper.
Startup Bootstrap
When you try SolrCloud for the first time using bin/solr -e cloud, the related configset gets uploaded to ZooKeeper automatically and is linked with the newly created collection.
The command below would start SolrCloud with the default collection name (gettingstarted) and the default configset (data_driven_schema_configs) uploaded and linked to it:
$ bin/solr -e cloud -noprompt
You can also explicitly upload a configuration directory when creating a collection using the bin/solr script with the -d option, such as:
$ bin/solr create -c mycollection -d data_driven_schema_configs
The create command will upload a copy of the data_driven_schema_configs configuration directory to ZooKeeper under /configs/mycollection. Refer to the Solr Start Script Reference page for more details about the create command for creating collections.
Once a configuration directory has been uploaded to ZooKeeper, you can update it using the ZooKeeper Command Line Interface (zkCLI).
Configuration files are uploaded automatically when a collection is created.
Uploading configs using zkcli or SolrJ
Upload configuration files using zkcli or the SolrJ client.
In production situations, Config Sets can also be uploaded to ZooKeeper independently of collection creation, using either Solr's zkcli.sh script or the CloudSolrClient.uploadConfig Java method. The command below can be used to upload a new configset using the zkcli script (the placeholders in angle brackets are to be filled in for your environment):
$ sh zkcli.sh -cmd upconfig -zkhost <host:port> -confname <name for configset> -solrhome <solrhome> -confdir <path to directory with configset>
More information about the ZooKeeper Command Line Utility to help manage changes to configuration files can be found in the section on Command Line Utilities.
Managing Your SolrCloud Configuration Files
To update or change your SolrCloud configuration files:
Download the latest configuration files from ZooKeeper, using the source control checkout process.
Make your changes.
Commit your changed file to source control.
Push the changes back to ZooKeeper.
Reload the collection so that the changes will be in effect.
The configuration-file management workflow. ("Source control" here refers to your own version control system, such as Git.)
Preparing ZooKeeper before first cluster start
If you will share the same ZooKeeper instance with other applications you should use a chroot in ZooKeeper. Please see Taking Solr to Production#ZooKeeperchroot for instructions.
There are certain configuration files containing cluster-wide configuration. Since some of these are crucial for the cluster to function properly, you may need to upload such files to ZooKeeper before starting your Solr cluster for the first time. Examples of such configuration files (not exhaustive) are solr.xml, security.json and clusterprops.json.
If, for example, you would like to keep your solr.xml in ZooKeeper to avoid having to copy it to every node's solr_home directory, you can push it to ZooKeeper with the zkcli.sh utility (Unix example):
zkcli.sh -zkhost localhost:2181 -cmd putfile /solr.xml /path/to/solr.xml
Before the Solr cluster's first start, upload the required configuration files to ZooKeeper - e.g. solr.xml, the security file, and cluster properties.
ZooKeeper Access Control
This section describes using ZooKeeper access control lists (ACLs) with Solr. For information about ZooKeeper ACLs, see the ZooKeeper documentation at http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#sc_ZooKeeperAccessControl.
About ZooKeeper ACLs
How to Enable ACLs
Changing ACL Schemes
Example Usages
ZooKeeper access control.
About ZooKeeper ACLs
SolrCloud uses ZooKeeper for shared information and for coordination. This section describes how to configure Solr to add more restrictive ACLs to the ZooKeeper content it creates, and how to tell Solr about the credentials required to access the content in ZooKeeper. If you want to use ACLs in your ZooKeeper nodes, you will have to activate this functionality; by default, Solr behavior is open-unsafe ACL everywhere and uses no credentials.
Changing Solr-related content in ZooKeeper might damage a SolrCloud cluster. For example:
Changing configuration might cause Solr to fail or behave in an unintended way.
Changing cluster state information into something wrong or inconsistent might very well make a SolrCloud cluster behave strangely.
Adding a delete-collection job to be carried out by the Overseer will cause data to be deleted from the cluster.
You may want to enable ZooKeeper ACLs with Solr if you grant access to your ZooKeeper ensemble to entities you do not trust, or if you want to reduce the risk of bad actions resulting from, e.g.:
Malware that found its way into your system.
Other systems using the same ZooKeeper ensemble (a "bad thing" might be done by accident).
You might even want to limit read access, if you think there is stuff in ZooKeeper that not everyone should know about. Or you might just in general work on a need-to-know basis.
Protecting ZooKeeper itself could mean many different things. This section is about protecting Solr content in ZooKeeper. ZooKeeper content basically lives persisted on disk and (partly) in memory of the ZooKeeper processes. This section is not about protecting ZooKeeper data at the storage or ZooKeeper process levels - that's for ZooKeeper to deal with.
But this content is also available to "the outside" via the ZooKeeper API. Outside processes can connect to ZooKeeper and create/update/delete/read content; for example, a Solr node in a SolrCloud cluster wants to create/update/delete/read, and a SolrJ client wants to read from the cluster. It is the responsibility of the outside processes that create/update content to set up ACLs on the content. ACLs describe who is allowed to read, update, delete, create, etc. Each piece of information (znode/content) in ZooKeeper has its own set of ACLs, and inheritance or sharing is not possible. The default behavior in Solr is to add one ACL on all the content it creates - one ACL that gives anyone permission to do anything (in ZooKeeper terms this is called "the open-unsafe ACL").
Why to use ACLs, and what they cover.
How to Enable ACLs
We want to be able to:
Control the credentials Solr uses for its ZooKeeper connections. The credentials are used to get permission to perform operations in ZooKeeper.
Control which ACLs Solr will add to znodes (ZooKeeper files/folders) it creates in ZooKeeper.
Control it "from the outside", so that you do not have to modify and/or recompile Solr code to turn this on.
Solr nodes, clients and tools (e.g. ZkCLI) always use a Java class called SolrZkClient to deal with their ZooKeeper stuff. The implementation of the solution described here is all about changing SolrZkClient. If you use SolrZkClient in your application, the descriptions below will be true for your application too.
All of Solr's interactions with ZooKeeper go through SolrZkClient - that is the key point.
Controlling Credentials
You control which credentials provider will be used by configuring the zkCredentialsProvider property in solr.xml's solrcloud section:
public interface ZkCredentialsProvider {
  public class ZkCredentials {
    String scheme;
    byte[] auth;
    public ZkCredentials(String scheme, byte[] auth) {
      super();
      this.scheme = scheme;
      this.auth = auth;
    }
    String getScheme() { return scheme; }
    byte[] getAuth() { return auth; }
  }
  Collection<ZkCredentials> getCredentials();
}
Solr determines which credentials to use by calling the getCredentials() method of the given credentials provider. If no provider has been configured, the default implementation, DefaultZkCredentialsProvider, is used.
How Solr obtains credentials.
Out of the Box Implementations
You can always make you own implementation, but Solr comes with two implementations: org.apache.solr.common.cloud.DefaultZkCredentialsProvider: Its getCredentials() returns a list of length zero, or "no credentials used". This is the default and is used if you do not configure a provider in solr.xml. org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsPr ovider: This lets you define your credentials using system properties. It supports at most one set of credentials. The schema is "digest". The username and password are defined by system properties "zkDiges tUsername" and "zkDigestPassword", respectively. This set of credentials will be added to the list of credentials returned by getCredentials() if both username and password are provided. If the one set of credentials above is not added to the list, this implementation will fall back to default behavior and use the (empty) credentials list from DefaultZkCredentialsProvider
Solr ships two implementations: the first is the default, empty-credentials provider; the second lets you supply one set of credentials via system properties.
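As a minimal sketch of the property-driven logic described above, assuming only the JDK: the class name is hypothetical and this is an illustration of the decision rule, not the actual org.apache.solr.common.cloud implementation. One "digest" credential is produced only when both system properties are set.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DigestCredentialsSketch {
    static class ZkCredentials {
        final String scheme;
        final byte[] auth;
        ZkCredentials(String scheme, byte[] auth) {
            this.scheme = scheme;
            this.auth = auth;
        }
    }

    // Returns one "digest" credential if both properties are present, else an empty list.
    static List<ZkCredentials> getCredentials() {
        List<ZkCredentials> creds = new ArrayList<>();
        String user = System.getProperty("zkDigestUsername");
        String pass = System.getProperty("zkDigestPassword");
        if (user != null && pass != null) {
            creds.add(new ZkCredentials("digest",
                    (user + ":" + pass).getBytes(StandardCharsets.UTF_8)));
        }
        return creds;
    }

    public static void main(String[] args) {
        System.setProperty("zkDigestUsername", "admin-user");
        System.setProperty("zkDigestPassword", "admin-password");
        System.out.println(getCredentials().get(0).scheme); // prints "digest"
    }
}
```

With neither property set the list is empty, which is exactly the fall-back to DefaultZkCredentialsProvider behavior described above.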
Controlling ACLs
You control which ACLs will be added by configuring the zkACLProvider property in solr.xml's <solrcloud> section to the name of a class implementing the following interface:
package org.apache.solr.common.cloud;

  public interface ZkACLProvider {
    List<ACL> getACLsToAdd(String zNodePath);
  }

When Solr wants to create a new znode, it determines which ACLs to put on the znode by calling the getACLsToAdd() method of the given ACL provider. If no provider has been configured, the default implementation, DefaultZkACLProvider, is used.
When Solr creates a znode, it calls this method to decide who gets which permissions on that node.
Out of the Box Implementations
You can always make your own implementation, but Solr comes with:

- org.apache.solr.common.cloud.DefaultZkACLProvider: It returns a list of length one for all zNodePaths. The single ACL entry in the list is "open-unsafe". This is the default and is used if you do not configure a provider in solr.xml.
- org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider: This lets you define your ACLs using system properties. Its getACLsToAdd() implementation does not use zNodePath for anything, so all znodes get the same set of ACLs. It supports adding one or both of these options:
  - A user that is allowed to do everything. The permission is "ALL" (corresponding to all of CREATE, READ, WRITE, DELETE, and ADMIN), and the scheme is "digest". The username and password are defined by the system properties "zkDigestUsername" and "zkDigestPassword", respectively. This ACL will not be added to the list of ACLs unless both username and password are provided.
  - A user that is only allowed to perform read operations. The permission is "READ" and the scheme is "digest". The username and password are defined by the system properties "zkDigestReadonlyUsername" and "zkDigestReadonlyPassword", respectively. This ACL will not be added to the list of ACLs unless both username and password are provided.

If neither of the above ACLs is added to the list, the (empty) ACL list of DefaultZkACLProvider will be used by default.

Notice the overlap in system property names with the credentials provider VMParamsSingleSetCredentialsDigestZkCredentialsProvider (described above). This lets the two providers collaborate in a natural and perhaps common way: we always protect access to content by limiting it to two users - an admin user and a readonly user - AND we always connect with credentials corresponding to this same admin user, basically so that we can do anything to the content/znodes we create ourselves. You can give the readonly credentials to "clients" of your SolrCloud cluster - e.g. to be used by SolrJ clients. They will be able to read whatever is necessary to run a functioning SolrJ client, but they will not be able to modify any content in ZooKeeper.
Access control lists - two out-of-the-box implementations.
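The selection logic of the ACL provider just described can be sketched with plain JDK code. The class and field names here are hypothetical, not the real Solr classes; each digest ACL is added only when its username/password pair is complete, and the zNodePath argument is deliberately ignored, mirroring the description above.

```java
import java.util.ArrayList;
import java.util.List;

public class AclSelectionSketch {
    static class AclEntry {
        final String perms;    // "ALL" or "READ"
        final String identity; // "user:password" for the digest scheme
        AclEntry(String perms, String identity) {
            this.perms = perms;
            this.identity = identity;
        }
    }

    static List<AclEntry> getACLsToAdd(String zNodePath) {
        List<AclEntry> acls = new ArrayList<>();
        String u = System.getProperty("zkDigestUsername");
        String p = System.getProperty("zkDigestPassword");
        if (u != null && p != null) acls.add(new AclEntry("ALL", u + ":" + p));
        String ru = System.getProperty("zkDigestReadonlyUsername");
        String rp = System.getProperty("zkDigestReadonlyPassword");
        if (ru != null && rp != null) acls.add(new AclEntry("READ", ru + ":" + rp));
        // Neither pair set: behave like the default provider's single open-unsafe entry.
        if (acls.isEmpty()) acls.add(new AclEntry("ALL", "world:anyone"));
        return acls;
    }
}
```

Note how the same "zkDigestUsername"/"zkDigestPassword" pair drives both this ACL sketch and the credentials sketch earlier: that is the collaboration between the two providers the text describes.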
Changing ACL Schemes
Over the lifetime of operating your Solr cluster, you may decide to move from an unsecured ZooKeeper to a secured instance. Changing the configured zkACLProvider in solr.xml will ensure that newly created nodes are secure, but will not protect the already existing data. To modify all existing ACLs, you can use: ZkCLI -cmd updateacls /zk-path.

Changing ACLs in ZK should only be done while your SolrCloud cluster is stopped. Attempting to do so while Solr is running may result in inconsistent state and some nodes becoming inaccessible.

To configure the new ACLs, run ZkCLI with the following VM properties: -DzkACLProvider=... -DzkCredentialsProvider=...

- The Credentials Provider must be one that has current admin privileges on the nodes. When omitted, the process will use no credentials (suitable for an unsecured configuration).
- The ACL Provider will be used to compute the new ACLs. When omitted, the process will grant all permissions to all users, removing any security present.

You may use the VMParamsSingleSetCredentialsDigestZkCredentialsProvider and VMParamsAllAndReadonlyDigestZkACLProvider implementations described earlier on this page for these properties. After changing the ZK ACLs, make sure that the contents of your solr.xml match, as described for the initial setup.

To apply access control to nodes that already exist: reconfigure, stop the cluster, then run this command.
Example Usages
A worked example:
Let's say that you want all Solr-related content in ZooKeeper protected. You want an "admin" user that is able to do anything to the content in ZooKeeper - this user will be used for initializing Solr content in ZooKeeper and for server-side Solr nodes. You also want a "readonly" user that is only able to read content from ZooKeeper - this user will be handed over to "clients".

In the examples below:
- The "admin" user's username/password is admin-user/admin-password.
- The "readonly" user's username/password is readonly-user/readonly-password.

The provider class names must first be configured in solr.xml:

  <solr>
    ...
    <solrcloud>
      ...
      <str name="zkCredentialsProvider">org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider</str>
      <str name="zkACLProvider">org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider</str>
    </solrcloud>
  </solr>

To use ZkCLI:

  SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
    -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
  java ... $SOLR_ZK_CREDS_AND_ACLS ... org.apache.solr.cloud.ZkCLI -cmd ...

For operations using bin/solr, add the following at the bottom of bin/solr.in.sh:

  SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
    -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
  SOLR_OPTS="$SOLR_OPTS $SOLR_ZK_CREDS_AND_ACLS"

For operations using bin\solr.cmd, add the following at the bottom of bin\solr.in.cmd:

  set SOLR_ZK_CREDS_AND_ACLS=-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password ^
    -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password
  set SOLR_OPTS=%SOLR_OPTS% %SOLR_ZK_CREDS_AND_ACLS%

To start your own "clients" (using SolrJ):

  SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=readonly-user -DzkDigestPassword=readonly-password"
  java ... $SOLR_ZK_CREDS_AND_ACLS ...

Or, since you yourself are writing the code creating the SolrZkClients, you might want to override the provider implementations at the code level instead.

Collections API

The Collections API enables you to create, remove, or reload collections, but in the context of SolrCloud you can also use it to create collections with a specific number of shards and replicas.
Purpose of the API.
API Entry Points
The base URL for all API calls below is http://<hostname>:<port>/solr.

/admin/collections?action=CREATE: Create a collection
/admin/collections?action=MODIFYCOLLECTION: Modify certain attributes of a collection
/admin/collections?action=RELOAD: Reload a collection
/admin/collections?action=SPLITSHARD: Split a shard into two new shards
/admin/collections?action=CREATESHARD: Create a new shard
/admin/collections?action=DELETESHARD: Delete an inactive shard
/admin/collections?action=CREATEALIAS: Create or modify an alias for a collection
/admin/collections?action=DELETEALIAS: Delete an alias for a collection
/admin/collections?action=DELETE: Delete a collection
/admin/collections?action=DELETEREPLICA: Delete a replica of a shard
/admin/collections?action=ADDREPLICA: Add a replica of a shard
/admin/collections?action=CLUSTERPROP: Add/edit/delete a cluster-wide property
/admin/collections?action=MIGRATE: Migrate documents to another collection
/admin/collections?action=ADDROLE: Add a specific role to a node in the cluster
/admin/collections?action=REMOVEROLE: Remove an assigned role
/admin/collections?action=OVERSEERSTATUS: Get status and statistics of the overseer
/admin/collections?action=CLUSTERSTATUS: Get cluster status
/admin/collections?action=REQUESTSTATUS: Get the status of a previous asynchronous request
/admin/collections?action=DELETESTATUS: Delete the stored response of a previous asynchronous request
/admin/collections?action=LIST: List all collections
/admin/collections?action=ADDREPLICAPROP: Add an arbitrary property to a replica specified by collection/shard/replica
/admin/collections?action=DELETEREPLICAPROP: Delete an arbitrary property from a replica specified by collection/shard/replica
/admin/collections?action=BALANCESHARDUNIQUE: Distribute an arbitrary property, one per shard, across the nodes in a collection
/admin/collections?action=REBALANCELEADERS: Distribute the leader role based on the "preferredLeader" assignments
/admin/collections?action=FORCELEADER: Force a leader election in a shard if the leader is lost
/admin/collections?action=MIGRATESTATEFORMAT: Migrate a collection from the shared clusterstate.json to per-collection state.json

That is the full list of actions; detailed parameter descriptions and examples follow.
Parameter Reference
Cluster Parameters
numShards
SolrCloud Instance Parameters
These are set in solr.xml, but by default the host and hostContext parameters are set up to also work with system properties.
SolrCloud Instance ZooKeeper Parameters
Command Line Utilities
Note the difference between Solr's bundled ZooKeeper CLI and ZooKeeper's own CLI - how each connects and where each lives.
Using Solr's ZooKeeper CLI
-cmd <command>: the CLI command to be executed: bootstrap, upconfig, downconfig, linkconfig, makepath, get, getfile, put, putfile, list, clear or clusterprop. This parameter is mandatory.
There are quite a few of them.
ZooKeeper CLI Examples
Upload a configuration directory
Upload a configuration directory to ZooKeeper.
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
-cmd upconfig -confname my_new_config -confdir server/solr/configsets/basic_configs/conf
Bootstrap ZooKeeper from existing SOLR_HOME
Seeds ZooKeeper with the configuration found in an existing SOLR_HOME.
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181
-cmd bootstrap -solrhome /var/solr/data
Put arbitrary data into a new ZooKeeper file
Create a znode in ZooKeeper and store data in it.
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983
-cmd put /my_zk_file.txt 'some data'
Put a local file into a new ZooKeeper file
Store a local file's contents as a ZooKeeper file.
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983
-cmd putfile /my_zk_file.txt /tmp/my_local_file.txt
Link a collection to a configuration set
Map a collection to a named configuration set.
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983
-cmd linkconfig -collection gettingstarted -confname my_new_config
Create a new ZooKeeper path
Create a new znode path in ZooKeeper.
Set a cluster property
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181
-cmd clusterprop -name urlScheme -val https
Note: this list is incomplete - there are several more commands.
SolrCloud with Legacy Configuration Files
ConfigSets API
Applies only in SolrCloud mode.
API Entry Points
The base URL for all API calls is http://<hostname>:<port>/solr.

/admin/configs?action=CREATE: create a ConfigSet, based on an existing ConfigSet
/admin/configs?action=DELETE: delete a ConfigSet
/admin/configs?action=LIST: list all ConfigSets
Rule-based Replica Placement
When Solr needs to assign nodes to collections, it can either assign them automatically and randomly, or the user can specify a set of nodes where it should create the replicas. With very large clusters, it is hard to specify exact node names, and doing so still does not give you fine-grained control over how nodes are chosen for a shard. The user should be in complete control of where the nodes are allocated for each collection, shard and replica; this helps to optimally allocate hardware resources across the cluster.

Rule-based replica assignment allows the creation of rules to determine the placement of replicas in the cluster. In the future, this feature will help to automatically add or remove replicas when systems go down, or when higher throughput is required. This enables a more hands-off approach to administration of the cluster.

This feature is used in the following instances:
- Collection creation
- Shard creation
- Replica creation
- Shard splitting
Automates resource allocation for collections, shards and replicas.
Common Use Cases
There are several situations where this functionality may be used. A few of the rules that could be implemented are listed below:

- Don't assign more than 1 replica of this collection to a host.
- Assign all replicas to nodes with more than 100GB of free disk space, or assign replicas where there is more disk space.
- Do not assign any replica on a given host because I want to run an overseer there.
- Assign only one replica of a shard in a rack.
- Assign replicas in nodes hosting less than 5 cores.
- Assign replicas in nodes hosting the least number of cores.
Rule Conditions
A rule is a set of conditions that a node must satisfy before a replica core can be created there.
The conditions a node must satisfy.
There are three possible conditions:

- shard: this is the name of a shard or a wildcard (* means all shards). If shard is not specified, then the rule applies to the entire collection.
- replica: this can be a number or a wildcard (* means any number, zero to infinity).
- tag: this is an attribute of a node in the cluster that can be used in a rule, e.g. "freedisk", "cores", "rack", "dc", etc. The tag name can be a custom string. If creating a custom tag, a snitch is responsible for providing tags and values. The section Snitches below describes how to add a custom tag, and defines six pre-defined tags (cores, freedisk, host, port, node, and sysprop).
Six pre-defined tags.
Rule Operators
A condition can have one of the following operators to set the parameters for the rule:

- equals (no operator required): tag:x means the tag value must be equal to 'x'
- greater than (>): tag:>x means the tag value must be greater than 'x'. x must be a number
- less than (<): tag:<x means the tag value must be less than 'x'. x must be a number
- not equal (!): tag:!x means the tag value must NOT be equal to 'x'. The equals and not-equal operators may also be applied to String values
The rule operators.
Fuzzy Operator (~)
This can be used as a suffix to any condition. Solr will first try to satisfy the rule strictly; if it can't find enough nodes to match the criterion, it tries to find the next best match, which may not satisfy the criterion. For example, given the rule freedisk:>200~, Solr will try to assign replicas of this collection on nodes with more than 200GB of free disk space. If that is not possible, the node with the most free disk space will be chosen instead.
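The operator semantics above (including the stripped `~` suffix) can be sketched in a few lines of Java. This is a toy matcher for illustration, not Solr's actual rule engine; the relaxed "best available node" fallback that `~` triggers is a scheduling decision left to the caller.

```java
public class RuleConditionSketch {
    static boolean matches(String condition, String value) {
        if (condition.endsWith("~")) {                       // fuzzy: do the strict test first
            condition = condition.substring(0, condition.length() - 1);
        }
        if (condition.startsWith(">")) {                     // greater than: numeric compare
            return Double.parseDouble(value) > Double.parseDouble(condition.substring(1));
        }
        if (condition.startsWith("<")) {                     // less than: numeric compare
            return Double.parseDouble(value) < Double.parseDouble(condition.substring(1));
        }
        if (condition.startsWith("!")) {                     // not equal: also works on strings
            return !condition.substring(1).equals(value);
        }
        return condition.equals(value);                      // equals: no operator required
    }
}
```

For example, `matches(">200", "250")` holds while `matches(">200", "150")` does not, and `matches("!192.45.67.3", "192.45.67.4")` holds, mirroring the host-exclusion example later in this section.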
Choosing Among Equals
When many nodes match a rule, the rule is also used to sort them, so the best of the matching nodes is picked for assignment. For example, if there is a rule such as freedisk:>20, nodes are sorted on disk space descending and the node with the most free disk space is picked first. Or, if the rule is cores:<5, nodes are sorted on number of cores ascending and the node with the fewest cores is picked first.
Picks the best node after sorting.
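The ordering just described can be sketched as follows. The method and map names are hypothetical; the point is only the sort direction: descending for a '>' rule (most free disk first), ascending for a '<' rule (fewest cores first).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class NodeSortSketch {
    // tagValueByNode maps a node name to its tag value (e.g. freedisk or cores).
    static List<String> order(Map<String, Integer> tagValueByNode, boolean greaterThanRule) {
        List<String> nodes = new ArrayList<>(tagValueByNode.keySet());
        Comparator<String> byTag = Comparator.comparing(tagValueByNode::get);
        // '>' rules want the largest value first; '<' rules want the smallest first.
        nodes.sort(greaterThanRule ? byTag.reversed() : byTag);
        return nodes;
    }
}
```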
Rules for new shards
The rules are persisted along with the collection state, so when a new replica is created, the system will assign replicas satisfying the rules. When a new shard is created as a result of CREATESHARD, ensure that you have created rules specific to that shard name. Rules can be altered using the MODIFYCOLLECTION command; however, this is not required if the rules do not specify explicit shard names. For example, a rule such as shard:shard1,replica:*,ip_3:168 will not apply to any new shard created, but a rule such as replica:*,ip_3:168 will apply to any new shard created.

The same is applicable to shard splitting. Shard splitting is treated exactly the same way as shard creation. Even though shard1_1 and shard1_2 may be created from shard1, the rules treat them as distinct, unrelated shards.
How rules apply to newly created shards.
Snitches
Tag values come from a plugin called a Snitch. If there is a tag named 'rack' in a rule, there must be a Snitch which provides the value of 'rack' for each node in the cluster. A snitch implements the Snitch interface. Solr, by default, provides a snitch which supplies the following tags:

- cores: number of cores on the node
- freedisk: disk space available on the node
- host: host name of the node
- port: port of the node
- node: node name
- ip_1, ip_2, ip_3, ip_4: the IP fragments for each node. For example, on a host with IP 192.168.1.2, ip_1 = 2, ip_2 = 1, ip_3 = 168 and ip_4 = 192
- sysprop.{PROPERTY_NAME}: values available from system properties. sysprop.key means a value that is passed to the node as -Dkey=keyValue during node startup. It is possible to use rules like sysprop.key:expectedVal,shard:*
The tags that can be used in rules.
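The ip_N numbering above counts octets from the end of the address, which is easy to get backwards. A tiny sketch of that numbering (hypothetical helper, not a Solr API):

```java
public class IpTagSketch {
    // ip_1 is the LAST octet, ip_4 the first: for 192.168.1.2, ip_1 = 2 and ip_4 = 192.
    static String ipTag(String ip, int n) {
        String[] octets = ip.split("\\.");
        return octets[octets.length - n];
    }

    public static void main(String[] args) {
        System.out.println(ipTag("192.168.1.2", 3)); // prints "168"
    }
}
```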
How Snitches are Configured
It is possible to use one or more snitches for a set of rules. If the rules only need tags from the default snitch, it need not be explicitly configured. Otherwise, a snitch is specified as, for example:

  snitch=class:fqn.ClassName,key1:val1,key2:val2,key3:val3

How Tag Values are Collected

1. Identify the set of tags used in the rules.
2. Create instances of the Snitches specified. The default snitch is always created.
3. Ask each Snitch whether it can provide values for any of the tags. If even one tag does not have a snitch, the assignment fails.
4. After identifying the Snitches, they provide the tag values for each node in the cluster. If the value for a tag is not obtained for a given node, that node cannot participate in the assignment.
How snitches are configured, and how they work.
Examples
Keep less than 2 replicas (at most 1 replica) of this collection on any node
For this rule, we define the replica condition with the operator for "less than 2", and use the pre-defined tag node to match nodes with any name:

  replica:<2,node:*
Ensure no node in the cluster holds more than one replica of this collection.
For a given shard, keep less than 2 replicas on any node
For this rule, we use the shard condition to match any shard name, the replica condition with the operator for "less than 2", and the pre-defined tag node to match nodes with any name:

  shard:*,replica:<2,node:*
For any given shard, no node may hold more than one of its replicas.
Assign all replicas in shard1 to rack 730
This rule limits the shard condition to 'shard1', but allows any number of replicas. We're also referencing a custom tag named rack. Before defining this rule, we will need to configure a custom Snitch which provides values for the tag rack:

  shard:shard1,replica:*,rack:730

In this case, the default value of replica is * (all replicas), so it can be omitted and the rule can be reduced to:

  shard:shard1,rack:730
Uses a custom snitch-provided tag, rack.
Create replicas in nodes with less than 5 cores only
This rule uses the replica condition to match any number of replicas, and the pre-defined tag cores with the operator for "less than 5":

  replica:*,cores:<5

Again, we can simplify this by using the default value for replica:

  cores:<5
Replicas may only be created on nodes hosting fewer than 5 cores.
Do not create any replicas in host 192.45.67.3
This rule uses only the pre-defined tag host, with the not-equal operator, to name an IP address where replicas must not be placed:

  host:!192.45.67.3
Do not create replicas on the specified host.
Defining Rules
Rules are specified per collection, as request parameters during collection creation. It is possible to specify multiple 'rule' and 'snitch' params, as in this example:

  snitch=class:EC2Snitch&rule=shard:*,replica:1,dc:dc1&rule=shard:*,replica:<2,dc:dc3

These rules are persisted in clusterstate.json in ZooKeeper and are available throughout the lifetime of the collection. This enables the system to perform any future node allocation without direct user interaction. The rules added during collection creation can be modified later using the MODIFYCOLLECTION API.
Defining rules with your own snitch.
Cross Data Center Replication (CDCR)
The SolrCloud architecture is not particularly well suited for situations where a single SolrCloud cluster consists of nodes in separated data centers connected by an expensive pipe. The root problem is that SolrCloud is designed to support Near Real Time searching by immediately forwarding updates between nodes in the cluster on a per-shard basis.
"CDCR" features exist to help mitigate the risk of an entire Data Center outage.
Mitigates the risk of losing an entire data center.
What is CDCR?
Contents of this section:

- Glossary
- Architecture
- Major Components
  - CDCR Configuration
  - CDCR Initialization
  - Inter-Data Center Communication
  - Updates Tracking & Pushing
  - Synchronization of Update Checkpoints
  - Maintenance of Updates Log
  - Monitoring
  - CDC Replicator
  - Limitations
- Configuration
  - Source Configuration
  - Target Configuration
  - Configuration Details
  - The Replica Element
  - The Replicator Element
  - The updateLogSynchronizer Element
  - The Buffer Element
- CDCR API
  - API Entry Points (Control)
  - API Entry Points (Monitoring)
  - Control Commands
  - Monitoring commands
- Initial Startup
- Monitoring
- ZooKeeper settings
- Upgrading and Patching Production
What it is, its components, how to configure and use it, and so on.
What is CDCR?
The goal of the project is to replicate data to multiple Data Centers. The initial version of the solution will cover
the active-passive scenario where data updates are replicated from a Source Data Center to a Target Data Center. Data updates include adding/updating and deleting documents.
Originally built to replicate data updates from a Source Data Center to a Target Data Center.
Data changes on the Source Data Center are replicated to the Target Data Center only after they are persisted
to disk. The data changes can be replicated in real-time (with a small delay) or could be scheduled to be sent in intervals to the Target Data Center. This solution pre-supposes that the Source and Target data centers begin with the same documents indexed. Of course the indexes may be empty to start.
Changes are persisted to disk first, then replicated to the Target in near real time (or on a schedule); both data centers should start from the same set of documents.
Each shard leader in the Source Data Center will be responsible for replicating its updates to the appropriate
collection in the Target Data Center. When receiving updates from the Source Data Center, shard leaders in the Target Data Center will replicate the changes to their own replicas.
Each Source shard leader pushes its updates to the Target collection; Target shard leaders then replicate to their own replicas.
This replication model is designed to tolerate some degradation in connectivity, accommodate limited bandwidth, and support batch updates to optimize communication.
Replication supports both a new empty index and pre-built indexes. In the scenario where the replication is set
up on a pre-built index, CDCR will ensure consistency of the replication of the updates, but cannot ensure consistency on the full index. Therefore any index created before CDCR was set up will have to be replicated by other means (described in the section Starting CDCR the first time with an existing index) in order that Source and Target indexes be fully consistent.
Indexes that existed before CDCR was enabled must be synchronized by other means to keep the two sides consistent.
The active-passive nature of the initial implementation implies a "push" model from the Source collection to the Target collection. Therefore, the Source configuration must be able to "see" the ZooKeeper ensemble in the Target cluster. The Target ZooKeeper ensemble is configured in the Source's solrconfig.xml file.
A push model: the Source collection pushes data to the Target collection, so the Target's ZooKeeper ensemble must be configured in the Source's solrconfig.xml.
CDCR is configured to replicate from collections in the Source cluster to collections in the Target cluster on a
collection-by-collection basis. Since CDCR is configured in solrconfig.xml (on both Source and Target clusters), the settings can be tailored for the needs of each collection. CDCR can be configured to replicate from one collection to a second collection within the same cluster. That is a specialized scenario not covered in this document.
Supports both cross-cluster and same-cluster replication.
Glossary
Terms used in this document include:
- Node: A JVM instance running Solr; a server.
- Cluster: A set of Solr nodes managed as a single unit by a ZooKeeper ensemble, hosting one or more Collections.
- Data Center: A group of networked servers hosting a Solr cluster. In this document, the terms Cluster and Data Center are interchangeable, as we assume that each Solr cluster is hosted in a different group of networked servers.
- Shard: A sub-index of a single logical collection. This may be spread across multiple nodes of the cluster. Each shard can have as many replicas as needed.
- Leader: Each shard has one node identified as its leader. All the writes for documents belonging to a shard are routed through the leader.
- Replica: A copy of a shard for use in failover or load balancing. Replicas comprising a shard can either be leaders or non-leaders.
- Follower: A convenience term for a replica that is not the leader of a shard.
- Collection: Multiple documents that make up one logical index. A cluster can have multiple collections.
- Updates Log: An append-only log of write operations maintained by each node.
Architecture
The data flow is as follows.
Updates and deletes are first written to the Source cluster, then forwarded to the Target cluster. The data flow sequence is:

1. A shard leader receives a new data update that is processed by its Update Processor.
2. The data update is first applied to the local index.
3. Upon successful application of the data update on the local index, the data update is added to the Updates Log queue.
4. After the data update is persisted to disk, the data update is sent to the replicas within the Data Center.
5. After Step 4 is successful, CDCR reads the data update from the Updates Log and pushes it to the corresponding collection in the Target Data Center. This is necessary in order to ensure consistency between the Source and Target Data Centers.
6. The leader on the Target Data Center writes the data locally and forwards it to all its followers.

Steps 1, 2, 3 and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push batch updates in order to minimize network communication overhead. Also, if CDCR is unable to push the update at a given time - for example, due to a degradation in connectivity - it can retry later without any impact on the Source Data Center.

One implication of the architecture is that the leaders in the Source cluster must be able to "see" the leaders in the Target cluster. Since leaders may change, this effectively means that all nodes in the Source cluster must be able to "see" all Solr nodes in the Target cluster, so firewalls, ACL rules, etc., must be configured with care.
The update data flow, step by step.
Major Components
There are a number of key features and components in CDCR’s architecture:
CDCR Configuration
In order to configure CDCR, the Source Data Center requires the host address of the ZooKeeper cluster
associated with the Target Data Center. The ZooKeeper host address is the only information needed by CDCR to instantiate the communication with the Target Solr cluster. The CDCR configuration file on the Source cluster will therefore contain a list of ZooKeeper hosts. The CDCR configuration file might also contain secondary/optional configuration, such as the number of CDC Replicator threads, batch updates related settings, etc.
The Target ZooKeeper host list is required; everything else is optional tuning.
CDCR Initialization
CDCR supports incremental updates to either new or existing collections. CDCR may not be able to keep up with very high volume updates, especially if there are significant communications latencies due to a slow "pipe" between the data centers. Some scenarios:

- There is an initial bulk load of a corpus followed by lower-volume incremental updates. In this case, one can do the initial bulk load, replicate the index, and then keep them synchronized via CDCR. See the section Starting CDCR the first time with an existing index for more information.
- The index is being built up from scratch, without a significant initial bulk load. CDCR can be set up on empty collections and keep them synchronized from the start.
- The index is always being updated at a volume too high for CDCR to keep up. This is especially possible in situations where the connection between the Source and Target data centers is poor. This scenario is unsuitable for CDCR in its current form.
When CDCR is (and is not) a good fit.
Inter-Data Center Communication
Communication between Data Centers will be achieved through HTTP and the Solr REST API using the SolrJ
client. The SolrJ client will be instantiated with the ZooKeeper host of the Target Data Center. SolrJ will manage the shard leader discovery process.
Cross-cluster communication goes over HTTP via the SolrJ client.
Updates Tracking & Pushing
CDCR replicates data updates from the Source to the Target Data Center by leveraging the Updates Log.
Replication is driven by the Updates Log.
A background thread regularly checks the Updates Log for new entries, and then forwards them to the Target
Data Center. The thread therefore needs to keep a checkpoint in the form of a pointer to the last update successfully processed in the Updates Log. Upon acknowledgement from the Target Data Center that updates have been successfully processed, the Updates Log pointer is updated to reflect the current checkpoint.
A background thread on the Source periodically scans the Updates Log from the last checkpoint; once the Target acknowledges the updates, the checkpoint advances.
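The checkpointed forwarding loop above can be sketched as follows, assuming hypothetical names: the pointer advances only after the Target acknowledges an update, so a failed push simply leaves the checkpoint where it was and the remaining entries are retried on the next pass.

```java
import java.util.List;
import java.util.function.Predicate;

public class ForwarderSketch {
    long checkpoint = -1; // version of the last update the Target acknowledged

    // `log` holds update versions in order; `push` returns true on a Target ack.
    void forwardOnce(List<Long> log, Predicate<Long> push) {
        for (long version : log) {
            if (version <= checkpoint) continue; // already replicated, skip
            if (!push.test(version)) return;     // Target unreachable: retry later
            checkpoint = version;                // advance only after the ack
        }
    }
}
```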
This pointer must be synchronized across all the replicas. In the case where the leader goes down and a new
leader is elected, the new leader will be able to resume replication from the last update by using this synchronized pointer. The strategy to synchronize such a pointer across replicas will be explained next.
This checkpoint must be synchronized across all replicas in the Source cluster.
If for some reason the Target Data Center is offline or fails to process the updates, the thread will periodically try to contact the Target Data Center and push the updates.

If the Target cluster goes down, the Source keeps retrying the push periodically.
Synchronization of Update Checkpoints
A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to avoid introducing inconsistency between the Source and Target Data Centers. Another important requirement is that the synchronization must be performed with minimal network traffic, to maximize scalability.
In order to achieve this, the strategy is to:

- Uniquely identify each update operation. This unique identifier will serve as the pointer.
- Rely on two storages: an ephemeral storage on the Source shard leader, and a persistent storage on the Target cluster.
A unique identifier, plus storage on both sides.
The shard leader in the Source cluster will be in charge of generating a unique identifier for each update
operation, and will keep a copy of the identifier of the last processed updates in memory. The identifier will be sent to the Target cluster as part of the update request. On the Target Data Center side, the shard leader will receive the update request, store it along with the unique identifier in the Updates Log, and replicate it to the other shards.
SolrCloud already provides a unique identifier for each update operation: a "version" number. This version number is generated using a time-based Lamport clock, which is incremented for each update operation sent. This provides a "happened-before" ordering of the update operations that will be leveraged in (1) the initialization of the update checkpoint on the Source cluster, and (2) the maintenance strategy of the Updates Log.
SolrCloud's time-derived "version" field serves as the unique identifier.
The persistent storage on the Target cluster is used only during the election of a new shard leader on the Source
cluster. If a shard leader goes down on the Source cluster and a new leader is elected, the new leader will contact the Target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a request, the Target cluster will retrieve the latest identifier received across all the shards, and send it back to the Source cluster. To retrieve the latest identifier, every shard leader will look up the identifier of the first entry in its Update Logs and send it back to a coordinator. The coordinator will have to select the highest among them.
If a Source shard leader dies, the newly elected leader fetches the checkpoint from the Target cluster, which reports the highest version seen across its shards.
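Per the description above, each Target shard leader reports the version of the first entry in its Updates Log, and the coordinator answers the Source's newly elected leader with the highest of them. A one-line sketch of that selection (hypothetical names):

```java
import java.util.Collections;
import java.util.List;

public class CheckpointElectionSketch {
    // Each element is one Target shard's first-entry version; the coordinator
    // picks the highest, per the "select the highest among them" rule above.
    static long checkpointForNewLeader(List<Long> firstEntryVersionPerShard) {
        return Collections.max(firstEntryVersionPerShard);
    }
}
```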
This strategy does not require any additional network traffic and ensures reliable pointer synchronization. Consistency is principally achieved by leveraging SolrCloud: the update workflow of SolrCloud ensures that every update is applied to the leader and also to its replicas. If the leader goes down, a new leader is elected. During the leader election, a synchronization is performed between the new leader and the other replicas; as a result, the new leader has an Updates Log consistent with the previous leader's. Having a consistent Updates Log means that:

- On the Source cluster, the update checkpoint can be reused by the new leader.
- On the Target cluster, the update checkpoint will be consistent between the previous and new leader. This ensures the correctness of the update checkpoint sent by a newly elected leader from the Target cluster.
Maintenance of Updates Log
The CDCR replication logic requires modification to the maintenance logic of the Updates Log on the Source Data Center. Initially, the Updates Log acts as a fixed-size queue, limited to 100 update entries. In the CDCR scenario, the Updates Log must act as a queue of variable size, as it needs to keep track of all the updates up through the last update processed by the Target Data Center. Entries in the Updates Log are removed only when all pointers (one pointer per Target Data Center) are past them.

If the communication with one of the Target Data Centers is slow, the Updates Log on the Source Data Center can grow to a substantial size. In such a scenario, it is necessary for the Updates Log to be able to efficiently find a given update operation given its identifier. Since the identifier is an incremental number, an efficient search strategy is possible: each transaction log file contains, as part of its filename, the version number of its first element. This makes it quick to traverse the transaction log files and find the one containing a specific version number.
Maintenance of the Updates Log files.
Monitoring
CDCR provides the following monitoring capabilities over the replication operations:
- Monitoring of the outgoing and incoming replications, with information such as the Source and Target nodes, their status, etc.
- Statistics about the replication, with information such as operations (add/delete) per second, number of documents in the queue, etc.
Information about the lifecycle and statistics will be provided on a per-shard basis by the CDC Replicator thread. The CDCR API can then aggregate this information at a collection level.
Some of the monitoring capabilities provided.
CDC Replicator
The CDC Replicator is a background thread that is responsible for replicating updates from a Source Data
Center to one or more Target Data Centers. It will also be responsible for providing monitoring information on a per-shard basis. As there can be a large number of collections and shards in a cluster, we will use a fixed-size pool of CDC Replicator threads that will be shared across shards.
Limitations
The current design of CDCR has some limitations. CDCR will continue to evolve over time and many of these
limitations will be addressed. Among them are:
- CDCR is unlikely to be satisfactory for bulk-load situations where the update rate is high, especially if the bandwidth between the Source and Target clusters is restricted. In this scenario, the initial bulk load should be performed first, the Source and Target data centers synchronized, and CDCR then used for incremental updates.
- CDCR is currently only active-passive; data is pushed from the Source cluster to the Target cluster. There is active work being done in this area in the 6x code line to remove this limitation.
Some current limitations:
High update rates require sufficient bandwidth.
Currently only active push from Source to Target.
Configuration
The Source and Target configurations differ in the case of the data centers being in separate clusters. "Cluster"
here means separate ZooKeeper ensembles controlling disjoint Solr instances. Whether these data centers are physically separated or not is immaterial for this discussion.
Separate clusters are required.
Source Configuration
An example Source cluster configuration.
Here is a sample of a Source configuration file, a section in solrconfig.xml. The presence of this section causes CDCR to use this cluster as the Source, and it should not be present in the Target collections in the cluster-to-cluster case. Details about each setting are after the two examples:
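A minimal sketch of such a Source-side section; the ZooKeeper ensemble address, collection names, and replicator values are placeholders, not prescriptions:

```xml
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <!-- ZooKeeper ensemble of the Target cluster (placeholder address) -->
    <str name="zkHost">10.240.19.241:2181,10.240.19.242:2181</str>
    <!-- Source and Target collection names (placeholders) -->
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">60000</str>
  </lst>
</requestHandler>
```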
Target Configuration
Target cluster configuration.
Here is a typical Target configuration.
The Target instance must configure an update processor chain that is specific to CDCR. The update processor chain must include the CdcrUpdateProcessorFactory. The task of this processor is to ensure that the version numbers attached to update requests coming from a CDCR Source SolrCloud are reused and not overwritten by the Target. A properly configured Target configuration looks similar to this.
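A sketch of a Target-side configuration under the same assumptions: the /cdcr handler with buffering disabled, and a CDCR-specific update chain (the chain name is illustrative) wired into the /update handler:

```xml
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <!-- preserves the version numbers sent by the Source instead of reassigning them -->
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```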
Configuration Details
The configuration details, defaults and options are as follows:
The Replica Element
CDCR can be configured to forward update requests to one or more replicas. A replica is defined with a “replica” list as follows:
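For example (the zkHost address and collection names are placeholders):

```xml
<lst name="replica">
  <!-- ZooKeeper ensemble of the Target cluster -->
  <str name="zkHost">10.240.19.241:2181,10.240.19.242:2181</str>
  <!-- name of the collection on the Source -->
  <str name="source">collection1</str>
  <!-- name of the collection on the Target -->
  <str name="target">collection1</str>
</lst>
```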
The Replicator Element
The replicator configuration element.
The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator will monitor the update logs of the Source collection and will forward any new updates to the Target collection. The replicator uses a fixed thread pool to forward updates to multiple replicas in parallel. If more than one replica is configured,
one thread will forward a batch of updates from one replica at a time in a round-robin fashion. The replicator can be configured with a “replicator” list as follows:
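For example (the values shown are illustrative, not required defaults):

```xml
<lst name="replicator">
  <!-- number of threads in the shared replicator pool -->
  <str name="threadPoolSize">8</str>
  <!-- how often, in milliseconds, the replicator polls the update logs -->
  <str name="schedule">1000</str>
  <!-- maximum number of updates forwarded per batch -->
  <str name="batchSize">128</str>
</lst>
```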
The updateLogSynchronizer Element
Expert: Non-leader nodes need to synchronize their update logs with their leader node from time to time in order
to clean deprecated transaction log files. By default, such a synchronization process is performed every minute. The schedule of the synchronization can be modified with an "updateLogSynchronizer" list as follows:
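For example, keeping the interval at its stated default of one minute:

```xml
<lst name="updateLogSynchronizer">
  <!-- synchronization interval in milliseconds (one minute) -->
  <str name="schedule">60000</str>
</lst>
```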
Update log synchronization between non-leader nodes and the leader.
The Buffer Element
CDCR is configured by default to buffer any new incoming updates. When buffering updates, the updates log will store all the updates indefinitely. Replicas do not need to buffer updates, and it is recommended to disable the buffer on the Target SolrCloud. The buffer can be disabled at startup with a "buffer" list and the parameter "defaultState" as follows:
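For example, to start with buffering disabled (the alternative state being the default "enabled"):

```xml
<lst name="buffer">
  <str name="defaultState">disabled</str>
</lst>
```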
CDCR API
The CDCR API is used to control and monitor the replication process. Control actions are performed at a collection level, i.e., by using the following base URL for API calls: http://
Monitor actions are performed at a core level, i.e., by using the following base URL for API calls: http:// Currently, none of the CDCR API calls have parameters.
API entry points and functions.
API Entry Points (Control)
- collection/cdcr?action=STATUS: Returns the current state of CDCR.
- collection/cdcr?action=START: Starts CDCR replication.
- collection/cdcr?action=STOP: Stops CDCR replication.
- collection/cdcr?action=ENABLEBUFFER: Enables the buffering of updates.
- collection/cdcr?action=DISABLEBUFFER: Disables the buffering of updates.
API Entry Points (Monitoring)
- core/cdcr?action=QUEUES: Fetches statistics about the queue for each replica and about the update logs.
- core/cdcr?action=OPS: Fetches statistics about the replication performance (operations per second) for each replica.
- core/cdcr?action=ERRORS: Fetches statistics and other information about replication errors for each replica.
Control Commands
Initial Startup
- Upload the modified solrconfig.xml to ZooKeeper on both Source and Target.
- Sync the index directories from the Source collection to the Target collection across the corresponding shard nodes. Tip: rsync works well for this. For example, if there are 2 shards on collection1 with 2 replicas for each shard, copy the corresponding index directories from
- Start the ZooKeeper on the Target (DR) side.
- Start the SolrCloud on the Target (DR) side.
- Start the ZooKeeper on the Source side.
- Start the SolrCloud on the Source side. Tip: As a general rule, the Target (DR) side of the SolrCloud should be started before the Source side.
- Activate CDCR on the Source instance using the CDCR API: http://host:port/solr/collection_name/cdcr?action=START
There is no need to run the /cdcr?action=START command on the Target.
- Disable the buffer on the Target: http://host:port/solr/collection_name/cdcr?action=DISABLEBUFFER
- Re-enable indexing.

Monitoring
Network and disk space monitoring are essential. Ensure that the system has plenty of available storage to queue up changes if there is a disconnect between the Source and Target. A network outage between the two data centers can cause your disk usage to grow.
Tip: Set a monitor for your disks to send alerts when the disk gets over a certain percentage (e.g., 70%).
Tip: Run a test. With moderate indexing, how long can the system queue changes before you run out of disk space?
Create a simple way to check the counts between the Source and the Target. Keep in mind that if indexing is running, the Source and Target may not match document for document. Set an alert to fire if the difference is greater than some percentage of the overall cloud size.
Monitor disk space and the document counts between the two clusters, and set alerts.
ZooKeeper settings
With CDCR, the Target ZooKeepers will have connections from the Target clouds and the Source clouds.
You may need to increase the maxClientCnxns setting in the zoo.cfg.
## increase the number of connections allowed per client
## (maxClientCnxns=0 means no limit)
maxClientCnxns=800
Upgrading and Patching Production
When rolling in upgrades to your indexer or application, you should shut down the Source (production) and the Target (DR). Depending on your setup, you may want to pause/stop indexing. Deploy the release or patch and re-enable indexing. Then start the Target (DR).
Tip: There is no need to reissue the DISABLEBUFFER or START commands. These are persisted.
Tip: After starting the Target, run a simple test. Add a test document to each of the Source clouds. Then check for it on the Target.
#send to the Source
curl http://
Legacy Scaling and Distribution
What Problem Does Distribution Solve?
If searches are taking too long or the index is approaching the physical limitations of its machine, you should consider distributing the index across two or more Solr servers. To distribute an index, you divide the index into partitions called shards, each of which runs on a separate machine. Solr then partitions searches into sub-searches, which run on the individual shards, reporting results collectively. The architectural details underlying index sharding are invisible to end users, who simply experience faster performance on queries against very large indexes.
Problems distribution solves: searches are too slow, or the index is too large.
What Problem Does Replication Solve?
Replicating an index is useful when:
- You have a large search volume which one machine cannot handle, so you need to distribute searches across multiple read-only copies of the index.
- There is a high volume/high rate of indexing which consumes machine resources and reduces search performance on the indexing machine, so you need to separate indexing and searching.
- You want to make a backup of the index (see Making and Restoring Backups of SolrCores).
Distributed Search with Index Sharding
Distributing Documents across Shards
Configuring the ReplicationHandler
In addition to ReplicationHandler configuration options specific to the master/slave roles, there are a few special configuration options that are generally supported (even when using SolrCloud).
maxNumberOfBackups: an integer value dictating the maximum number of backups this node will keep on disk as it receives backup commands.
Similar to most other request handlers in Solr, you may configure a set of "defaults, invariants, and/or appends" parameters corresponding with any request parameters supported by the ReplicationHandler when processing commands.
The example below shows a possible 'master' configuration for the ReplicationHandler, including a fixed
number of backups and an invariant setting for the maxWriteMBPerSec request parameter to prevent slaves from saturating its network interface.
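A sketch of such a master configuration; the confFiles list, backup count, and rate limit are illustrative values:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <!-- keep at most 2 backups on disk -->
  <str name="maxNumberOfBackups">2</str>
  <lst name="invariants">
    <!-- cap the replication transfer rate so slaves cannot saturate the network -->
    <str name="maxWriteMBPerSec">16</str>
  </lst>
</requestHandler>
```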
Replicating solrconfig.xml
In the configuration file on the master server, include a line like the following:
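Such a confFiles entry could look like this (the extra file names x.xml and y.xml are placeholders for other files replicated under their own names):

```xml
<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>
```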
This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the
slave. All other files will be saved with their original names. On the master server, the file name of the slave configuration file can be anything, as long as the name is correctly identified in the confFiles string; then it will be saved as whatever file name appears after the colon ':'.
Configuring the Replication RequestHandler on a Slave Server
The code below shows how to configure a ReplicationHandler on a slave.
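A sketch, assuming a placeholder master host, port, and core name:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- URL of the master core's replication handler (placeholder host and port) -->
    <str name="masterUrl">http://remote_host:port/solr/core_name/replication</str>
    <!-- how often to poll the master for a newer index version -->
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
```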
Setting Up a Repeater with the ReplicationHandler
A master may be able to serve only so many slaves without affecting performance. Some organizations have deployed slave servers across multiple data centers. If each slave downloads the index from a remote data center, the resulting download may consume too much network bandwidth. To avoid performance degradation in cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both a master and a slave.
To configure a server as a repeater, the definition of the Replication requestHandler in the solrconfig.xml file must include file lists of use for both masters and slaves. Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize on the main master. This is because on a repeater (or any slave), a commit is called only after the index is downloaded. The optimize command is never called on slaves. Optionally, one can configure the repeater to fetch compressed files from the master through the compression parameter to reduce the index download time. Here is an example of a ReplicationHandler configuration for a repeater:
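A sketch combining both roles; the master URL, confFiles list, and poll interval are placeholders:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <!-- master section: this node serves its downloaded index to its own slaves -->
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
  </lst>
  <!-- slave section: this node pulls the index from the main master -->
  <lst name="slave">
    <str name="masterUrl">http://master_host:port/solr/core_name/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```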
Commit and Optimize Operations
The replicateAfter parameter can accept multiple arguments. For example:
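A sketch showing the element repeated once per trigger event:

```xml
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>
```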
Slave Replication
Details of the slave replication process.
The master is totally unaware of the slaves. The slave continuously polls the master (depending on the pollInterval parameter) to check the current index version of the master. If the slave finds that the master has a newer version of the index, it initiates a replication process. The steps are as follows:
- The slave issues a filelist command to get the list of the files. This command returns the names of the files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).
- The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding) to download the full content or a part of each file. If the connection breaks in between, the download resumes from the point it failed. At any point, the slave tries 5 times before giving up a replication altogether.
- The files are downloaded into a temp directory, so that if either the slave or the master crashes during the download process, no files will be corrupted. Instead, the current replication will simply abort.
- After the download completes, all the new files are moved to the live index directory and each file's timestamp is made the same as its counterpart on the master.
- A commit command is issued on the slave by the slave's ReplicationHandler and the new index is loaded.
Replicating Configuration Files
To replicate configuration files, list them using the confFiles parameter. Only files found in the conf directory of the master's Solr instance will be replicated. Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is changed on the master, that file will be replicated only after there is a new commit/optimize on the master's index.
Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files are compared by checksum. The schema.xml files (on master and slave) are judged to be identical if their checksums match.
As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before moving them into their ultimate location in the conf directory. The old configuration files are then renamed and kept in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files. If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a core-reload command instead of a commit command.
Automated updating of configuration files on slaves.
Resolving Corruption Issues on Slave Server
If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will
not undertake any action to put itself in sync, until the master has new index data. When a commit operation takes place on the master, the index version of the master becomes different from that of the slave. The slave then fetches the list of files and finds that some of the files present on the master are also present in the local index but with different sizes and timestamps. This means that the master and slave have incompatible indexes. To correct this problem, the slave then copies all the index files from master to a new index directory and asks the core to load the fresh index from the new directory.
How a slave recovers when its index conflicts with the master's.
HTTP API Commands for the ReplicationHandler
You can use the HTTP commands below to control the ReplicationHandler's operations.
Distribution and Optimization
Optimizing an index is not something most users should generally worry about, but in particular users should be aware of the impacts of optimizing an index when using the ReplicationHandler.

The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes. A very large index may take hours. The variables include the size of the index and the speed of the hardware. Distributing a newly optimized index may take only a few minutes or up to an hour or more, again depending on the size of the index and the performance capabilities of network connections and disks. During optimization the machine is under load and does not process queries very well. Given a schedule of updates being driven a few times an hour to the slaves, we cannot run an optimize with every committed snapshot.

Copying an optimized index means that the entire index will need to be transferred during the next snappull. This is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a three-slave one-master configuration, distributing a newly-optimized index takes approximately 80 seconds total. Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this optimize were rolled across the query tier, and if each slave node being optimized were disabled and not receiving queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half. Additionally, the files would need to be synchronized so that, following the optimize, snappull would not think that the independently optimized files were different in any way. This would also leave the door open to independent corruption of indexes instead of each being a perfect copy of the master.

Optimizing on the master allows for a straightforward optimization operation. No query slaves need to be taken out of service. The optimized index can be distributed in the background as queries are being normally serviced. The optimization can occur at any time convenient to the application providing index updates.

While optimizing may have some benefits in some situations, a rapidly changing index will not retain those benefits for long, and since optimization is an intensive process, it may be better to consider other options, such as lowering the merge factor (discussed in the section on Index Configuration).
Combining Distribution and Replication
----> Use SolrCloud directly instead.
Merging Indexes
If you need to combine indexes from two different projects or from multiple servers previously used in a
distributed configuration, you can use either the IndexMergeTool included in lucene-misc or the CoreAdminHandler. To merge indexes, they must meet these requirements:
- The two indexes must be compatible: their schemas should include the same fields and they should analyze fields the same way.
- The indexes must not include duplicate data.
Optimally, the two indexes should be built using the same schema.
Using IndexMergeTool
Merging:
To merge the indexes, do the following:
Make sure that both indexes you want to merge are closed. Then issue this command: java -cp $SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-VERSION.jar:$SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-misc-VERSION.jar org.apache.lucene.misc.IndexMergeTool /path/to/newindex /path/to/old/index1 /path/to/old/index2
This will create a new index at /path/to/newindex that contains both index1 and index2.
Copy this new directory to the location of your application's Solr index (move the old one aside first, of course) and start Solr.
Using CoreAdmin
The MERGEINDEXES command of the CoreAdminHandler can be used to merge indexes into a new core, either
from one or more arbitrary indexDir directories or by merging from one or more existing srcCore core names. See the CoreAdminHandler section for details.
Client APIs
This section discusses the available client APIs for Solr. It covers the following topics:
- Introduction to Client APIs: A conceptual overview of Solr client APIs.
- Choosing an Output Format: Information about choosing a response format in Solr.
- Using JavaScript: Explains why a client API is not needed for JavaScript responses.