原文地址:
http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb
This week marks the one year anniversary of Kiip running MongoDB in production. As of this week, we’ve also moved over 95% of our data off of MongoDB onto systems such as Riak and PostgreSQL, depending which solution made sense for the way we use our data. This post highlights our experience with MongoDB over the past year. A future post will elaborate on the migration process: how we evaluated the proper solutions to migrate to and how we migrated the data from MongoDB.
First, some numbers about our data to give context to the scale being discussed. The figures below represent the peak usage when we were completely on MongoDB — the numbers are actually much higher now but are spread across different data stores.
Data size: 240 GB
Total documents: 85,000,000
Operations per second: 520 (Create, reads, updates, etc.)
The Good
We were initially attracted to MongoDB due to the features highlighted on the website as well as word of mouth from those who had used it successfully. MongoDB delivered on some of its promises, and our early experiences were positive.
Schemaless - Being a document data store, the schemaless-nature of MongoDB helps a lot. It is easy to add new fields, and even completely change the structure of a model. We changed the structure of our heaviest used models a couple times in the past year, and instead of going back and updating millions of old documents, we simply added a “version” field to the document and the application handled the logic of reading both the old and new version. This flexibility was useful for both application developers and operations engineers.
Simple replication - Replica Sets are easy to setup and work well enough. There are some issues that I’ll talk about later, but for the most part as an early stage startup, this feature was easy to incorporate and appeared to work as advertised.
Query Language - Querying into documents and being able to perform atomic operations on your data is pretty cool. Both of these features were used heavily. Unfortunately, these queries didn’t scale due to underlying architectural problems. Early on we were able to use advanced queries to build features quickly into our application.
Full-featured Drivers for Many Languages - 10gen curates official MongoDB drivers for many languages, and in our experience the driver for each language we’ve tried has been top-notch. Drivers were never an issue when working with MongoDB.
The Bad
Although MongoDB has a lot of nice features on the surface, most of them are marred by underlying architectural issues. These issues are certainly fixable, but currently limit the practical usage we were able to achieve with MongoDB. This list highlights some of the major issues we ran into.
Non-counting B-Trees - MongoDB uses non-counting B-trees as the underlying data structure to index data. This impacts a lot of what you’re able to do with MongoDB. It means that a simple count of a collection on an indexed field requires Mongo to traverse the entire matching subset of the B-tree. To support limit/offset queries, MongoDB needs to traverse the leaves of the B-tree to that point. This unnecessary traversal causes data you don’t need to be faulted into memory, potentially purging out warm or hot data, hurting your overall throughput. There has been an open ticket for this issue since September, 2010.
Poor Memory Management - MongoDB manages memory by memory mapping your entire data set, leaving page cache management and faulting up to the kernel. A more intelligent scheme would be able to do things like fault in your indexes before use as well as handle faulting in of cold/hot data more effectively. The result is that memory usage can’t be effectively reasoned about, and performance is non-optimal.
Uncompressed field names - If you store 1,000 documents with the key “foo”, then “foo” is stored 1,000 times in your data set. Although MongoDB supports any arbitrary document, in practice most of your field names are similar. It is considered good practice to shorten field names for space optimization. A ticket for this issue has been open since April 2010, yet this problem still exists today. At Kiip, we built field aliasing into our model layer, so a field with name “username” may actually map to “u” in the database. The database should handle this transparently by keeping a logical mapping between field names and a compressed form, instead of requiring clients to handle it explicitly.
Global write lock - MongoDB (as of the current version at the time of writing: 2.0), has a process-wide write lock. Conceptually this makes no sense. A write on collection X blocks a write on collection Y, despite MongoDB having no concept of transactions or join semantics. We reached practical limitations of MongoDB when pushing a mere 200 updates per second to a single server. At this point, all other operations including reads are blocked because of the write lock. When reaching out to 10gen for assistance, they recommended we look into sharding, since that is their general scaling solution. With other RDBMS solutions, we would at least be able to continue vertically scaling for some time before investigating sharding as a solution.
Safe off by default - This is a crazy default, although useful for benchmarks. As a general analogy: it’s like a car manufacturer shipping a car with air bags off, then shrugging and saying “you could’ve turned it on” when something goes wrong. We lost a sizable amount of data at Kiip for some time before realizing what was happening and using safe saves where they made sense (user accounts, billing, etc.).
Offline table compaction - The on-disk data size with MongoDB grows unbounded until you compact the database. Compaction is extremely time consuming and blocks all other DB operations, so it must be done offline or on a secondary/slave server. Traditional RDBMS systems such as PostgreSQL have handled this with auto-vacuums that clean up the database over time.
Secondaries do not keep hot data in RAM - The primary doesn’t relay queries to secondary servers, preventing secondaries from maintaining hot data in memory. This severely hinders the “hot-standby” feature of replica sets, since the moment the primary fails and switches to a secondary, all the hot data must be once again faulted into memory. Faulting in gigabytes of data can be painfully slow, especially when your data is backed by something like EBS. Distributing reads to secondaries helps with this, but if you’re only using secondaries as a means of backup or failover, the effect on throughput when a primary switch happens can be crippling until your hot data is faulted in.
What We’re Doing Now
Initially, we felt MongoDB gave us the flexibility and power we needed in a database. Unfortunately, underlying architectural issues forced us to investigate other solutions rather quickly. We never attempted to horizontally scale MongoDB since our confidence in the product was hurt by the time that was offered as a solution, and because we believe horizontally scaling shouldn’t be necessary for the relatively small amount of ops per second we were sending to MongoDB.
Over the past 6 months, we’ve “scaled” MongoDB by moving data off of it. This process is an entire blog post itself, but the gist of the matter is that we looked at our data access patterns and chose the right tool for the job. For key-value data, we switched to Riak, which provides predictable read/write latencies and is completely horizontally scalable. For smaller sets of relational data where we wanted a rich query layer, we moved to PostgreSQL. A small fraction of our data has been moved to non-durable purely in-memory solutions if it wasn’t important for us to persist or be able to query later.
In retrospect, MongoDB was not the right solution for Kiip. Although it may be a bit more upfront effort, we recommend using PostgreSQL (or some traditional RDBMS) first, then investigating other solutions if and when you find them necessary. In future blog posts, we’ll talk about how we chose our data stores and the steps we took to migrate data while minimizing downtime.