Managing collections via the Collections API (SolrCloud: the Solr 4.3 API for dynamically managing collections)

The Collections API lets you manage collections. Under the hood, it generally uses the CoreAdmin API to asynchronously (through the Overseer) manage SolrCores on each server - it is essentially sugar for actions that you could handle yourself by making individual CoreAdmin API calls to each server on which you wanted an action to take place.

Create http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=4

About the params:

  • name: The name of the collection to be created.

  • numShards: The number of logical shards (sometimes called slices) to be created as part of the collection.

  • replicationFactor: The number of copies of each document (or, the number of physical replicas to be created for each logical shard of the collection.) A replicationFactor of 3 means that there will be 3 replicas (one of which is normally designated to be the leader) for each logical shard. NOTE: in Solr 4.0, replicationFactor was the number of *additional* copies as opposed to the total number of copies.

  • maxShardsPerNode: A create operation spreads numShards*replicationFactor shard replicas across your live Solr nodes - fairly distributed, and never two replicas of the same shard on the same Solr node. If a Solr node is not live at the time the create operation is carried out, it will not get any part of the new collection. To prevent too many replicas being created on a single Solr node, use maxShardsPerNode to limit how many replicas the create operation is allowed to create on each node - the default is 1. If the entire collection (numShards*replicationFactor replicas) cannot fit on your live Solr nodes, nothing is created at all.

  • createNodeSet: If not provided the create operation will create shard-replica spread across all of your live Solr nodes. You can provide the "createNodeSet" parameter to change the set of nodes to spread the shard-replica across. The format of values for this param is "<node-name1>,<node-name2>,...,<node-nameN>" - e.g. "localhost:8983_solr,localhost:8984_solr,localhost:8985_solr"

  • collection.configName: The name of the config (must be already stored in zookeeper) to use for this new collection. If not provided the create operation will default to the collection name as the config name.
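As a rough illustration of how the CREATE parameters combine, the Python sketch below builds a CREATE request URL and checks two necessary conditions from the maxShardsPerNode description: the total replica count must fit on the live nodes, and no shard may need more replicas than there are nodes. The helper names (create_collection_url, fits_on_cluster) and the base URL are assumptions made for this example; they are not part of Solr.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host and port for your installation.
COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"


def create_collection_url(name, num_shards, replication_factor, **extra):
    """Build a Collections API CREATE URL from the parameters described above."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    # Optional parameters such as maxShardsPerNode or createNodeSet.
    params.update(extra)
    return COLLECTIONS_API + "?" + urlencode(params)


def fits_on_cluster(num_shards, replication_factor, live_nodes, max_shards_per_node=1):
    """Check the two necessary conditions for a create operation to succeed."""
    total_replicas = num_shards * replication_factor
    enough_slots = total_replicas <= live_nodes * max_shards_per_node
    # Two replicas of the same shard never share a node.
    enough_nodes = replication_factor <= live_nodes
    return enough_slots and enough_nodes


url = create_collection_url("mycollection", 3, 4, maxShardsPerNode=2)
print(url)
print(fits_on_cluster(3, 4, live_nodes=6, max_shards_per_node=2))
```

With numShards=3 and replicationFactor=4 the create operation needs 12 replica slots, so with the default maxShardsPerNode of 1 it would need 12 live nodes.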

<!> Solr4.2

About the params (for the CREATEALIAS action described under Collection Aliases):

  • name: The name of the collection alias to be created.

  • collections: A comma-separated list of one or more collections to alias to.

Delete http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection

About the params:

  • name: The name of the collection to be deleted.

Reload http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection

About the params:

  • name: The name of the collection to be reloaded.

Split Shard http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=<collection_name>&shard=shardId

<!> Solr4.3

About the params:

  • collection: The name of the collection

  • shard: The shard to be split

This command cannot be used by clusters with custom hashing because such clusters do not rely on a hash range. It should only be used by clusters that use the "plain" or "compositeId" router.

The SPLITSHARD command will create two new shards by splitting the given shard's index into two pieces. The split is performed by dividing the shard's range into two equal partitions and dividing up the documents in the parent shard according to the new sub-ranges. This is a synchronous operation. The new shards will be named by appending _0 and _1 to the parent shard name e.g. if shard=shard1 is to be split, the new shards will be named as shard1_0 and shard1_1. Once the new shards are created, they are set active and the parent shard is set to inactive so that no new requests are routed to the parent shard.
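The range arithmetic and the naming convention described above can be sketched as follows. This is a simplified model, not Solr's implementation: the function name split_shard is hypothetical, and the example assumes the hash range is an inclusive range of signed 32-bit integers.

```python
def split_shard(name, range_start, range_end):
    """Split an inclusive hash range into two halves named <name>_0 and <name>_1."""
    if range_end <= range_start:
        raise ValueError("range must contain at least two hash values")
    # Last hash value that still belongs to the first sub-shard.
    mid = range_start + (range_end - range_start) // 2
    return {
        name + "_0": (range_start, mid),
        name + "_1": (mid + 1, range_end),
    }


# Example: a parent shard covering the lower half of a signed 32-bit hash space
# (an assumption made for illustration).
for sub_shard, hash_range in split_shard("shard1", -2**31, -1).items():
    print(sub_shard, hash_range)
```

Each document in the parent shard then belongs to whichever sub-shard's range contains its hash value.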

This feature allows for seamless splitting and requires no down-time. The parent shard is not removed and therefore no data is removed. It is up to the user of the command to unload the shard using the new APIs in SOLR-4693 (under construction).

This feature was released with Solr 4.3. However, due to bugs found after the 4.3 release, it is recommended that you wait for release 4.3.1 before using this feature.

Collection Aliases

Aliasing allows you to create a single 'virtual' collection name that can point to one or more real collections. You can update the alias on the fly.

CreateAlias http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias&collections=collection1,collection2,…

Creates or updates a given alias. An alias that is used to send updates should map to a single collection only. An alias that is used for reads can map to a single collection or to multiple collections.

DeleteAlias http://localhost:8983/solr/admin/collections?action=DELETEALIAS&name=alias

Removes an existing alias.
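The two alias calls above can be built programmatically. The sketch below is only an illustration: the helper names and the for_updates flag are hypothetical, and the single-collection check simply restates the rule for aliases that receive updates.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host and port for your installation.
COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"


def create_alias_url(alias, collections, for_updates=False):
    """Build a CREATEALIAS URL; update aliases must map to exactly one collection."""
    if not collections:
        raise ValueError("an alias needs at least one collection")
    if for_updates and len(collections) != 1:
        raise ValueError("an alias used for updates should map to a single collection")
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # Keep the commas readable instead of percent-encoding them.
    return COLLECTIONS_API + "?" + urlencode(params, safe=",")


def delete_alias_url(alias):
    """Build a DELETEALIAS URL for an existing alias."""
    return COLLECTIONS_API + "?" + urlencode({"action": "DELETEALIAS", "name": alias})


print(create_alias_url("alias", ["collection1", "collection2"]))
print(delete_alias_url("alias"))
```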

Creating cores via CoreAdmin

New Solr cores may also be created and associated with a collection via CoreAdmin.

Additional cloud related parameters for the CREATE action:

  • collection - the name of the collection this core belongs to. Default is the name of the core.

  • shard - the shard id this core represents (Optional - normally you want to be auto assigned a shard id)

  • numShards - the number of shards you want the collection to have - this is only respected on the first core created for the collection

  • collection.<param>=<value> - causes a property of <param>=<value> to be set if a new collection is being created.

    • Use collection.configName=<configname> to point to the config for a new collection.

Example:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collection1&shard=shard2'

Distributed Requests

Query all shards of a collection (the collection is implicit in the URL):

http://localhost:8983/solr/collection1/select?

Query all shards of a compatible collection, explicitly specified:

http://localhost:8983/solr/collection1/select?collection=collection1_recent

Query all shards of multiple compatible collections, explicitly specified:

http://localhost:8983/solr/collection1/select?collection=collection1_NY,collection1_NJ,collection1_CT

Query specific shard ids of the (implicit) collection. In this example, the user has partitioned the index by date, creating a new shard every month:

http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001

Explicitly specify the addresses of shards you want to query:

http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr,localhost:7574/solr

Explicitly specify the addresses of shards you want to query, giving alternatives (delimited by |) used for load balancing and fail-over:

http://localhost:8983/solr/collection1/select?shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
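The value of the shards parameter in the last example has two levels: commas separate shards, and | separates equivalent addresses for the same shard. A minimal Python sketch of building and parsing such a value (the function names are hypothetical):

```python
def shards_param(shard_groups):
    """Join shard addresses into a value for the shards request parameter.

    Each inner list holds the alternative addresses for one shard.
    """
    return ",".join("|".join(alternatives) for alternatives in shard_groups)


def parse_shards_param(value):
    """Split a shards value back into a list of alternatives per shard."""
    return [group.split("|") for group in value.split(",")]


value = shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
])
print(value)
print(parse_shards_param(value))
```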

Required Config

All of the required config is already set up in the example configs shipped with Solr. The following is what you need to add if you are migrating old config files, or what you should not remove if you are starting with new config files.

schema.xml

You must have a _version_ field defined:

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

solrconfig.xml

You must have an UpdateLog defined - this should be defined in the updateHandler section.

    <!-- Enables a transaction log, currently used for real-time get.

         "dir" - the target directory for transaction logs, defaults to the

         solr data directory.  -->

    <updateLog>

      <str name="dir">${solr.data.dir:}</str>

    </updateLog>

You must have a replication handler called /replication defined:

    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />

You must have a realtime get handler called /get defined:

    <requestHandler name="/get" class="solr.RealTimeGetHandler">

      <lst name="defaults">

        <str name="omitHeader">true</str>

     </lst>

    </requestHandler>

You must have the admin handlers defined:

    <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />

The DistributedUpdateProcessor is part of the default update chain and is automatically injected into any of your custom update chains. You can still explicitly add it yourself as follows:

   <updateRequestProcessorChain name="sample">

     <processor class="solr.LogUpdateProcessorFactory" />

     <processor class="solr.DistributedUpdateProcessorFactory"/>

     <processor class="my.package.UpdateFactory"/>

     <processor class="solr.RunUpdateProcessorFactory" />

   </updateRequestProcessorChain>

If you do not want the DistributedUpdateProcessorFactory auto-injected into your chain (say you want to use SolrCloud functionality, but you want to distribute updates yourself), then specify the following update processor factory in your chain: NoOpDistributingUpdateProcessorFactory

solr.xml

You must leave the admin path as the default:

    <cores adminPath="/admin/cores"

Re-sizing a Cluster

You can control cluster size by passing the numShards when you start up the first SolrCore in a collection. This parameter is used to auto assign which shard each instance should be part of. Any SolrCores that you start after starting numShards instances are evenly added to each shard as replicas (as long as they all belong to the same collection).

To add more SolrCores to your collection, simply keep starting new SolrCores up. You can do this at any time and the new SolrCore will sync up its data with the current replicas in the shard before becoming active.
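The auto-assignment described above can be pictured with a small model: the first numShards cores each start a new shard, and every later core joins the shard that currently has the fewest cores. This is a simplified sketch for intuition only; the function name and the tie-breaking rule are assumptions, and Solr's actual assignment logic may differ in detail.

```python
def assign_cores(core_names, num_shards):
    """Model of shard auto-assignment for cores started one after another."""
    shards = {"shard%d" % (i + 1): [] for i in range(num_shards)}
    for core in core_names:
        # Pick the shard with the fewest cores so far; ties go to the
        # lowest-numbered shard (an assumption of this model).
        target = min(shards, key=lambda shard: len(shards[shard]))
        shards[target].append(core)
    return shards


# Five cores started in order against a collection created with numShards=2.
layout = assign_cores(["core_a", "core_b", "core_c", "core_d", "core_e"], 2)
for shard, cores in layout.items():
    print(shard, cores)
```

In this model the first core listed for each shard is the one that created it; the remaining cores are its replicas.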

If you want to start your cluster on fewer machines and then expand over time beyond just adding replicas, you can choose to start by hosting multiple shards per machine (using multiple SolrCores) and then later migrate shards onto new machines by starting up a new replica for a given shard and eventually removing the shard from the original machine.

<!> Solr4.3 The new "SPLITSHARD" collection API can be used to split an existing shard into two shards containing exactly half the range of the parent shard each. More details can be found under the "Managing collections via the Collections API" section.

If you want to use the Near Realtime search support, you will probably want to enable auto soft commits in your solrconfig.xml file before putting it into zookeeper. Otherwise you can send explicit soft commits to the cluster as you desire. See NearRealtimeSearch

Parameter Reference

Cluster Params

numShards

Defaults to 1

The number of shards to hash documents to. There will be one leader per shard and each leader can have N replicas.

SolrCloud Instance Params

These are set in solr.xml, but by default they are set up in solr.xml to also work with system properties. Important note: the hostPort value found here will be used (via ZooKeeper) to inform the rest of the cluster what port each Solr instance is using. The default port is 8983. The example solr.xml uses the jetty.port system property, so if you want to use a port other than 8983, you either have to set this property when starting Solr or change solr.xml to fit your particular installation. If you do not do this, the cluster will think all your Solr servers are using port 8983, which may not be what you want.

host

Defaults to the first local host address found

If the wrong host address is found automatically, you can override the host address with this param.

hostPort

Defaults to the jetty.port system property

The port that Solr is running on - by default this is found by looking at the jetty.port system property.

hostContext

Defaults to solr

The context path for the Solr webapp. (Note: in Solr 4.0, it was mandatory that the hostContext not contain "/" or "_" characters. Beginning with Solr 4.1, this limitation was removed, and it is recommended that you specify the beginning slash. When running in the example jetty configs, the "hostContext" system property can be used to control both the servlet context used by jetty and the hostContext used by SolrCloud - e.g. -DhostContext=/solr)

SolrCloud Instance ZooKeeper Params

zkRun

Defaults to localhost:<solrPort+1001>

Causes Solr to run an embedded version of ZooKeeper. Set to the address of ZooKeeper on this node - this allows us to know who 'we are' in the list of addresses in the zkHost connect string. Simply using -DzkRun gets you the default value. Note this must be one of the exact strings from zkHost; in particular, the default localhost will not work for a multi-machine ensemble.

zkHost

No default

The host address for ZooKeeper - usually this should be a comma separated list of addresses to each node in your ZooKeeper ensemble.

zkClientTimeout

Defaults to 15000

The time a client is allowed to not talk to ZooKeeper before having its session expired.

zkRun and zkHost are set up using system properties. zkClientTimeout is set up in solr.xml but, by default, can also be set using a system property.

SolrCloud Core Params

shard

The shard id. Defaults to being automatically assigned based on numShards

Allows you to specify the id used to group SolrCores into shards.

shard can be configured in solr.xml for each core element as an attribute.

Getting your Configuration Files into ZooKeeper

Config Startup Bootstrap Params

There are two different ways you can use system properties to upload your initial configuration files to ZooKeeper the first time you start Solr. Remember that these are meant to be used only on first startup or when overwriting configuration files - every time you start Solr with these system properties, any current configuration files in ZooKeeper may be overwritten when 'conf set' names match.

1. Look at solr.xml and upload the conf for each SolrCore found. The 'config set' name will be the collection name for that SolrCore, and collections will use the 'config set' that has a matching name.

bootstrap_conf

No default

If you pass -Dbootstrap_conf=true on startup, each SolrCore you have configured will have its configuration files automatically uploaded and linked to the collection that SolrCore is part of.

2. Upload the given directory as a 'conf set' with the given name. No linking of collection to 'config set' is done. However, if only one 'conf set' exists, a collection will auto link to it.

bootstrap_confdir

No default

If you pass -Dbootstrap_confdir=<directory> on startup, that specific directory of configuration files will be uploaded to ZooKeeper with a 'conf set' name defined by the system property below, collection.configName.

collection.configName

Defaults to configuration1

Determines the name of the conf set pointed to by bootstrap_confdir
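Because all of these bootstrap options are plain Java system properties, a startup command is just a list of -D flags followed by -jar start.jar. The sketch below assembles such a command in Python; the function name is hypothetical, and the property values (the conf directory path, the name myconf, numShards=2) are illustrative assumptions rather than required settings.

```python
def solr_start_command(system_properties):
    """Assemble a java command line from a mapping of system properties."""
    args = ["java"]
    for key, value in system_properties.items():
        # A value of None stands for a bare flag such as -DzkRun.
        if value is None:
            args.append("-D%s" % key)
        else:
            args.append("-D%s=%s" % (key, value))
    args += ["-jar", "start.jar"]
    return args


command = solr_start_command({
    "bootstrap_confdir": "./solr/collection1/conf",  # illustrative path
    "collection.configName": "myconf",               # illustrative conf set name
    "zkRun": None,
    "numShards": 2,
})
print(" ".join(command))
```

Remember that these properties are intended for the first startup only; leaving them in a permanent startup script would re-upload the configuration on every start.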

Command Line Util

The CLI tool also lets you upload config to ZooKeeper. It allows you to do it in the same two ways described above. It also provides a few other commands that let you link config sets to collections, make ZooKeeper paths or clear them, and download configs from ZooKeeper to the local filesystem.

usage: ZkCLI

 -c,--collection <arg>   for linkconfig: name of the collection

 -cmd <arg>              cmd to run: bootstrap, upconfig, downconfig,

                         linkconfig, makepath, clear

 -d,--confdir <arg>      for upconfig: a directory of configuration files

 -h,--help               bring up this help page

 -n,--confname <arg>     for upconfig, linkconfig: name of the config set

 -r,--runzk <arg>        run zk internally by passing the solr run port -

                         only for clusters on one machine (tests, dev)

 -s,--solrhome <arg>     for bootstrap, runzk: solrhome location

 -z,--zkhost <arg>       ZooKeeper host address

Examples

# try uploading a conf dir

java -classpath "example/solr-webapp/WEB-INF/lib/*" org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 127.0.0.1:9983 -confdir example/solr/collection1/conf -confname conf1

# try linking a collection to a conf set

java -classpath "example/solr-webapp/WEB-INF/lib/*" org.apache.solr.cloud.ZkCLI -cmd linkconfig -zkhost 127.0.0.1:9983 -collection collection1 -confname conf1

# try bootstrapping all the conf dirs in solr.xml

java -classpath "example/solr-webapp/WEB-INF/lib/*" org.apache.solr.cloud.ZkCLI -cmd bootstrap -zkhost 127.0.0.1:9983 -solrhome example/solr

Scripts

There are scripts in example/cloud-scripts that handle the classpath and class name for you if you are using Solr out of the box with Jetty. Commands then become:

sh zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:9983 -collection collection1 -confname conf1

Zookeeper chroot

If you are already using Zookeeper for other applications and you want to keep the ZNodes organized by application, or if you want to have multiple separated SolrCloud clusters sharing one Zookeeper ensemble you can use Zookeeper's "chroot" option. From Zookeeper's documentation: http://zookeeper.apache.org/doc/r3.3.6/zookeeperProgrammers.html#ch_zkSessions

An optional "chroot" suffix may also be appended to the connection string. This will run the client commands while interpreting all paths relative to this root (similar to the unix chroot command). If used the example would look like: "127.0.0.1:4545/app/a" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a" where the client would be rooted at "/app/a" and all paths would be relative to this root - ie getting/setting/etc... "/foo/bar" would result in operations being run on "/app/a/foo/bar" (from the server perspective).

To use this Zookeeper feature, simply start Solr with the "chroot" suffix in the zkHost parameter. For example:

java -DzkHost=localhost:9983/foo/bar -jar start.jar

or

java -DzkHost=zoo1:9983,zoo2:9983,zoo3:9983/foo/bar -jar start.jar

NOTE: With Solr 4.0 you'll need to create the initial path in ZooKeeper before starting Solr. Since Solr 4.1, the initial path will automatically be created if you are using either bootstrap_conf or bootstrap_confdir.
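To make the connect string format concrete, the sketch below splits a zkHost value into its server list and optional chroot, and shows how a client path maps to the path seen by the server. The function names are hypothetical and the parsing is deliberately minimal; it illustrates the format and is not the ZooKeeper client's own parser.

```python
def parse_zk_host(zk_host):
    """Split a zkHost connect string into (server addresses, chroot or None)."""
    servers, slash, path = zk_host.partition("/")
    chroot = slash + path if slash else None
    return servers.split(","), chroot


def server_side_path(chroot, client_path):
    """Path the ZooKeeper server sees for a path used by a chrooted client."""
    return (chroot or "") + client_path


servers, chroot = parse_zk_host("zoo1:9983,zoo2:9983,zoo3:9983/foo/bar")
print(servers)
print(chroot)
print(server_side_path("/app/a", "/foo/bar"))
```

Note that the chroot suffix appears only once, after the last server address, and applies to the whole ensemble.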

Known Limitations

A small number of Solr search components do not support DistributedSearch. In some cases, a component may never get distributed support, in other cases it may just be a matter of time and effort. All of the search components that do not yet support standard distributed search have the same limitation with SolrCloud. You can pass distrib=false to use these components on a single SolrCore.

The Grouping feature only works if groups are in the same shard. You must use the custom sharding feature to use the Grouping feature.

If you are upgrading an existing Solr instance running with SolrCloud from Solr 4.0 to 4.1, be aware that the way the name_node parameter is defined has changed. This may cause a situation where the name_node uses the IP address of the machine instead of the server name, so that SolrCloud is not aware of the existing node. If this happens, you can manually edit the host parameter in solr.xml to refer to the server name, or set the host in your system environment variables (since by default solr.xml is configured to inherit the host name from the environment variables). See also the section Core Admin and Configuring solr.xml for more information about the host parameter.

Glossary

Collection:

A single search index.

Shard:

A logical section of a single collection (also called Slice). Sometimes people will talk about "Shard" in a physical sense (a manifestation of a logical shard)

Replica:

A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore

Leader:

One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard

SolrCore:

Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection.

Node:

A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.

Cluster:

All of the nodes you are using to host SolrCores.

FAQ

  • Q: I'm seeing lots of session timeout exceptions - what to do?

    • A: Try raising the ZooKeeper session timeout by editing solr.xml - see the zkClientTimeout attribute. The minimum session timeout is 2 times your ZooKeeper-defined tickTime. The maximum is 20 times the tickTime. The default tickTime is 2 seconds. You should avoid raising this for no good reason, but it should be high enough that you don't see a lot of false session timeouts due to load, network lag, or garbage collection pauses. The default timeout is 15 seconds, but some environments might need to go as high as 30-60 seconds.

  • Q: How do I use SolrCloud, but distribute updates myself?

    • A: Add the following UpdateProcessorFactory somewhere in your update chain: NoOpDistributingUpdateProcessorFactory

  • Q: What is the difference between a Collection and a SolrCore?

    • A: In classic single node Solr, a SolrCore is basically equivalent to a Collection. It presents one logical index. In SolrCloud, the SolrCores on multiple nodes form a Collection. This is still just one logical index, but multiple SolrCores host different 'shards' of the full collection. So a SolrCore encapsulates a single physical index on an instance. A Collection is a combination of all of the SolrCores that together provide a logical index that is distributed across many nodes.
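The session timeout bounds from the first answer above are simple arithmetic on tickTime. The sketch below works them out, assuming the ZooKeeper server clamps a requested timeout into the range of 2 to 20 times tickTime; the function names are hypothetical.

```python
def session_timeout_bounds(tick_time_ms=2000):
    """Return (minimum, maximum) session timeout in ms for a given tickTime."""
    return 2 * tick_time_ms, 20 * tick_time_ms


def effective_timeout(requested_ms, tick_time_ms=2000):
    """Clamp a requested zkClientTimeout into the allowed range."""
    low, high = session_timeout_bounds(tick_time_ms)
    return max(low, min(requested_ms, high))


print(session_timeout_bounds())               # bounds for the default tickTime
print(effective_timeout(15000))               # the default zkClientTimeout
print(effective_timeout(60000))               # a 60 second request, default tickTime
print(effective_timeout(60000, tick_time_ms=3000))
```

Under that assumption, a 60 second timeout only takes effect if tickTime is raised to at least 3 seconds, because the default tickTime of 2 seconds caps the session timeout at 40 seconds.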
