In a production situation, each shard will consist of multiple servers (identical replicas of one another) to ensure high availability and automated failover.
To partition a collection, we specify a shard key pattern. It names one or more fields that define the key on which the data is distributed. Some example shard key patterns include the following:
{ state : 1 }
{ name : 1 }
{ _id : 1 }
{ lastname : 1, firstname : 1 }
{ tag : 1, timestamp : -1 }
MongoDB's sharding is order-preserving; data adjacent by shard key tends to be on the same server. The config database stores all the metadata indicating the location of data by range.
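If you are curious, you can inspect those ranges yourself by querying the config database; the namespace "test.foo" below is just an assumed example, and the exact document layout varies by version:

> use config
> db.chunks.find( { ns : "test.foo" }, { min : 1, max : 1, shard : 1 } )   // one document per chunk range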
Chunks grow to a maximum size, usually 64MB. Once a chunk has reached that approximate size, the chunk splits into two new chunks. When a particular shard has excess data, chunks will then migrate to other shards in the system. The addition of a new shard will also influence the migration of chunks.
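If you want to watch splitting and migration happen quickly in a small test cluster, one hedged way is to lower the chunk size stored in the config database (the value is in megabytes; 1 MB here is purely an illustrative choice):

> use config
> db.settings.save( { _id : "chunksize", value : 1 } )   // default is 64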
When choosing a shard key, the values must be of high enough cardinality (granular enough) that the data can be broken into many chunks and thus distributed. (This is a recommendation rather than a hard requirement.)
If it is possible that a single value within the shard key range might grow exceptionally large, it is best to use a compound shard key instead so that further discrimination of the values will be possible.
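For example (field names here are purely illustrative), compare a coarse key with a compound one:

{ state : 1 }           // every document for a given state falls into one range; a huge state cannot be split further
{ state : 1, _id : 1 }  // documents within one state can still be divided into multiple chunks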
The config servers store the cluster's metadata. Note that the config servers use their own replication model; they are not run as a replica set.
If any of the config servers is down, the cluster's metadata goes read-only. However, even in such a failure state, the MongoDB cluster can still be read from and written to.
The mongos process can be thought of as a routing and coordination process that makes the various components of the cluster look like a single system. When receiving client requests, the mongos process routes the request to the appropriate server(s) and merges any results to be sent back to the client.
mongos processes have no persistent state; rather, they pull their state from the config servers on startup. Any changes that occur on the config servers are propagated to each mongos process.
mongos processes can run on any server desired. They may be run on the shard servers themselves, but are lightweight enough to exist on each application server. There are no limits on the number of mongos processes that can be run simultaneously since these processes do not coordinate between one another.
For targeted operations, mongos communicates with a very small number of shards -- often a single shard. Such targeted operations are quite efficient.
Global operations involve the mongos process reaching out to all (or most) shards in the system.
The following table shows various operations and their type. For the examples below, assume a shard key of { x : 1 }.
Operation | Type | Comments
---|---|---
db.foo.find( { x : 300 } ) | Targeted | Queries a single shard.
db.foo.find( { x : 300, age : 40 } ) | Targeted | Queries a single shard.
db.foo.find( { age : 40 } ) | Global | Queries all shards.
db.foo.find() | Global | sequential
db.foo.find(...).count() | Variable | Same as the corresponding find() operation
db.foo.find(...).sort( { age : 1 } ) | Global | parallel
db.foo.find(...).sort( { x : 1 } ) | Global | sequential
db.foo.count() | Global | parallel
db.foo.insert( <object> ) | Targeted | 
db.foo.update( { x : 100 }, <object> ) | Targeted | 
db.foo.update( { age : 40 }, <object> ) | Global | 
db.getLastError() | | 
db.foo.ensureIndex(...) | Global | 
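One hedged way to check which category a query falls into is to run explain() through mongos and look at how many shards appear in the output; the exact field names differ between versions, so treat this only as a sketch:

> db.foo.find( { x : 300 } ).explain()    // output should mention a single shard (targeted)
> db.foo.find( { age : 40 } ).explain()   // output should mention every shard (global)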
The load on the config servers is almost certainly low, so they do not need dedicated machines. Here is an example where some sharing of physical machines is used to lay out a cluster. The outer boxes are machines (or VMs) and the inner boxes are processes.
In the picture above, a given connection to the database simply connects to a random mongos. mongos is generally very fast, so perfect balancing of those connections is not essential. Additionally, a driver's implementation could be intelligent about balancing these connections (but most are not at the time of this writing).
Yet more configurations are imaginable, especially when it comes to mongos. Alternatively, as suggested earlier, the mongos processes can exist on each application server. There is some potential benefit to this configuration, as the communication between the app server and mongos can then occur over the localhost interface.
Exactly three config server processes are used in almost all sharded mongo clusters. This provides sufficient data safety; more instances would increase coordination cost among the config servers.
First, start the individual shards (mongod's), config servers, and mongos processes.
Shard Servers
To get started with a simple test, we recommend running a single mongod process per shard; one process per shard is enough for a simple initial configuration.
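As a minimal sketch (the data directory and port below are assumptions, not requirements), such a mongod can be started like this; the --shardsvr option marks the process as a shard member and changes its default port to 27018:

mongod --shardsvr --dbpath /data/shard0 --port 27018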
Config Servers
Run mongod on the config server(s) with the --configsvr command line parameter. --configsvr declares that this is a config database of a cluster; the default port is 27019 and the default data directory is /data/configdb.
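For example, using the defaults mentioned above:

mongod --configsvr --dbpath /data/configdb --port 27019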
Note: Replicating data to each config server is managed by the router (mongos); the config servers use a synchronous replication protocol optimized for three machines, if you were wondering why that number. (A cluster runs one to three config server processes.)
mongos Router
Run mongos on the servers of your choice. Specify the --configdb parameter to indicate the location of the config database(s). Note: use DNS names, not IP addresses, for the --configdb parameter's value; otherwise moving config servers later is difficult.
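For example, with three assumed DNS names for the config servers:

mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019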
Start by connecting to one of the mongos processes, and then switch to the admin database before issuing any commands.
The mongos will route commands to the right machine(s) in the cluster and, if commands change metadata, the mongos will update that on the config servers. So, regardless of the number of mongos processes you've launched, you'll only need to run these commands on one of those processes.
You can connect to the admin database via mongos like so:
./mongo <mongos-hostname>:<mongos-port>/admin
> db // 'db' prints the current database
admin
You must explicitly add each shard to the cluster's configuration using the addshard command:
> db.runCommand( { addshard : "<serverhostname>[:<port>]" } );
{"ok" : 1 , "added" : ...}
Run this command once for each shard in the cluster.
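As an illustration with assumed hostnames, a three-shard cluster would be registered like this:

> db.runCommand( { addshard : "shard0.example.net:27018" } );
> db.runCommand( { addshard : "shard1.example.net:27018" } );
> db.runCommand( { addshard : "shard2.example.net:27018" } );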
If the individual shards consist of replica sets, they can be added by specifying replicaSetName/<serverhostname>[:<port>]:
> db.runCommand( { addshard : "foo/<serverhostname>[:<port>]" } );
{"ok" : 1 , "added" : "foo"}
Any databases and collections that already existed in the mongod/replica set will be incorporated into the cluster. The databases will have that mongod/replica set as their "primary" host, and the collections will not be sharded (but you can do so later by issuing a shardCollection command). In other words, the data already present on a newly added shard is left as is; nothing is partitioned until you run shardCollection.
name
Each shard has a name, which can be specified using the name option. If no name is given, one will be assigned automatically.
maxSize
The addshard command accepts an optional maxSize parameter. This parameter lets you tell the system a maximum amount of disk space in megabytes to use on the specified shard.
As an example:
> db.runCommand( { addshard : "sf103", maxSize:100000/*MB*/ } );
To see the current set of configured shards, run the listshards command:
> db.runCommand( { listshards : 1 } );
This way, you can verify that all the shards have been committed to the system.
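For a fuller picture than listshards gives (the shards plus the databases and chunk ranges the cluster knows about), the shell helper below can be run against the admin database through mongos; its output format varies by version:

> db.printShardingStatus();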
To remove a shard, see the removeshard command.
Once you've added one or more shards, you can enable sharding on a database. Unless enabled, all data in the database will be stored on the same shard. After enabling it, you then need to run shardCollection on the relevant collections (i.e., the big ones).
> db.runCommand( { enablesharding : "<dbname>" } );
Once enabled, mongos will place new collections on the primary shard for that database. (The primary shard is the shard that holds all of a database's unsharded data.) Existing collections within the database will stay on their original shard. To enable partitioning of data, we have to shard an individual collection.
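As a sketch (the database name "test" is an assumption), you can see which shard became the primary for a database by looking at the config metadata after enabling sharding:

> db.runCommand( { enablesharding : "test" } );
> use config
> db.databases.find( { _id : "test" } )   // the "primary" field names the primary shard
> use admin                               // switch back before running further admin commands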
When sharding a collection, "pre-splitting", that is, setting a seed set of key ranges, is recommended. Without a seed set of ranges, sharding still works, but the system must learn the key distribution, which takes some time; during this time performance is not as high. The pre-splits do not have to be particularly accurate; the system will adapt to the actual key distribution of the data regardless.
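One hedged way to seed ranges, once the collection has been sharded as described below, is the split command with an explicit middle point; the namespace and key values here are assumptions:

> db.runCommand( { split : "test.foo", middle : { x : 0 } } );
> db.runCommand( { split : "test.foo", middle : { x : 100 } } );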
Use the shardcollection command to shard a collection. When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
> db.runCommand( { shardcollection : "<namespace>", key : <shardkeypatternobject> } );
Running the "shardcollection" command will mark the collection as sharded with a specific key. Once called, there is currently no way to disable sharding or change the shard key, even if all the data is still contained within the same shard. It is assumed that the data may already be spread around the shards. If you need to "unshard" a collection, drop it (of course making a backup of data if needed), and recreate the collection (loading the backup data). |
For example, let's assume we want to shard a GridFS chunks collection stored in the test database. We'd want to shard on the files_id key, so we'd invoke the shardcollection command like so:
> db.runCommand( { shardcollection : "test.fs.chunks", key : { files_id : 1 } } )
{ "collectionsharded" : "mydb.fs.chunks", "ok" : 1 }
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key. (note: prior to version 2.0 this worked only if the collection is empty).
db.runCommand( { shardcollection : "test.users" , key : { email : 1 } , unique : true } );
If the "unique: true" option is not used, the shard key does not have to be unique.
db.runCommand( { shardcollection : "test.products" , key : { category : 1, _id : 1 } } );
You can shard on multiple fields if you are using a compound index.
In the end, picking the right shard key for your needs is extremely important for successful sharding; see Choosing a Shard Key.