As the volume of data generated by economic, social, geopolitical and other events keeps growing, big data is becoming increasingly important. Cloud providers are focusing on techniques for managing cloud databases and have introduced different ways to host databases in the public cloud. These services release users from the constraints that traditional hardware places on a database, while the scalability of the cloud makes much larger databases possible. As a consequence, NoSQL (Not Only SQL) databases have come into view. Traditional SQL databases focus on consistency and reliability, which are guaranteed by ACID (Atomicity, Consistency, Isolation and Durability), and the constraints of ACID make SQL databases difficult to scale. Compared with SQL databases, NoSQL cloud databases favour scalability, availability and network partition tolerance over consistency, although a NoSQL database can also trade some availability for higher consistency. However, there are still many problems that NoSQL databases cannot solve. As a result, providers have produced CDBMSs (Cloud Database Management Systems) with distinct data models to satisfy different requirements, and each of them has its own imperfections.
There are four typical data models of NoSQL databases: column family store, document store, key-value store and graph store. This essay discusses the column family and document models, with two corresponding CDBMSs presented as instances of each.
Column-oriented databases store data in columns instead of rows. Columns are normally grouped into column families, each of which holds a set of rows of key-value pairs; each row has a row key and its own set of columns, and reads and writes are done by column rather than by row. Compared with an RDBMS (Relational Database Management System), a column family can be regarded as a table, with the difference that different rows are not forced to have the same columns. The most popular CDBMSs based on the column-oriented data model are Bigtable, HBase and Cassandra.
Cassandra
Apache Cassandra was originally developed by Facebook and released as open source in 2008. It became an Apache Incubator project in 2009. It adopted the data architecture of Google’s Bigtable and borrowed its distribution mechanisms from Dynamo.
HBase
HBase is also an open source CDBMS and a top-level Apache project. Also known as the Hadoop database, it inherits the scalability of Hadoop by running on the Hadoop Distributed File System (HDFS), and it combines a key-value store with MapReduce.
Document-oriented databases store data as a collection of documents made up of key-value pairs. The model is quite similar to the key-value model, but the value part is stored differently: the data is managed in structured documents such as XML, JSON and BSON. These documents are self-describing, hierarchical tree structures whose components can be maps, collections and scalar values. The most popular document-oriented CDBMSs are MongoDB, Couchbase and CouchDB.
MongoDB
MongoDB is an open source project developed by 10gen and released in 2009. It is designed to be a NoSQL database with high performance, high availability and easy scalability. It uses embedding to speed up read and write requests, its master-slave replication model provides high availability, and sharding contributes to scalability.
CouchDB
CouchDB was originally created in 2005 by Damien Katz, a former IBM developer, and it became an Apache Incubator project in 2008. CouchDB stores data in JSON documents and queries the documents with JavaScript. It provides ACID properties, MapReduce and indexes. It uses replication to implement scalability, and provides availability and partition tolerance with eventual consistency.
This section compares two CDBMSs that share the same data model and discusses their advantages and disadvantages. The comparison is built on several features of NoSQL databases, such as partitioning and replication.
Even though both Cassandra and HBase inherit the same data model from Bigtable, there are still some differences between them. In Cassandra, keyspaces are the containers of data; inside a keyspace are many column families, which contain columns. A set of columns is identified by a row key, and each row in a column family can have different columns. In HBase, data is stored in tables made of rows and columns; columns belong to column families, and the columns of a family are stored together. In summary, because of these data model differences, database designs for Cassandra and HBase may vary. Cassandra queries on the row key always achieve better performance, whereas HBase prefers queries on a column family. Cassandra is also optimised more for writes than for reads, whereas HBase handles both writes and reads conveniently.
Cassandra borrows its distribution design from Dynamo, so its architecture is peer-to-peer (p2p) and every node is equal. In contrast, HBase uses a master-slave architecture in which each HMaster has to manage many slave nodes (RegionServers). As a consequence, Cassandra achieves higher availability than HBase.
Cassandra has two partitioning strategies: random partitioning and ordered partitioning. Random partitioning is the default and recommended strategy; as a result, scanning a range of rows is complicated and queries on partial row keys are not permitted in Cassandra. In contrast, HBase supports only ordered partitioning, so HBase queries can be formulated with partial start and end row keys. In addition, because of the different partitioning strategy, HBase can support some simple aggregates, whereas Cassandra cannot.
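The effect of the two partitioning strategies on range scans can be illustrated with a minimal Python sketch. It is not the code of either system: the MD5 hash, the four-node cluster and the sample row keys are assumptions chosen only for illustration.

```python
import hashlib
from bisect import bisect_left

def hash_partition(row_key, num_nodes):
    """Random partitioning: the node is derived from a hash of the row key,
    so neighbouring keys usually land on unrelated nodes."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def range_scan(sorted_keys, start, stop):
    """Ordered partitioning: keys are kept sorted, so every key in
    [start, stop) forms one contiguous slice that can be scanned directly."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]

# Hypothetical row keys; "user$" is the smallest string after every "user#..." key.
keys = sorted(["user#001", "user#002", "user#103", "video#001"])

print(range_scan(keys, "user#", "user$"))          # partial-key scan, ordered layout
print({k: hash_partition(k, 4) for k in keys})     # same keys scattered by hash
```

With the ordered layout a partial start and end key is enough to find the rows; with the hashed layout the same rows would have to be collected from every node.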
Firstly, Cassandra replicates data on every transaction: a coordinator node captures the change and propagates it to the other replicas. Cassandra therefore relies on a high-speed, low-latency network connection. HBase, on the other hand, uses a more practical replication architecture: it captures the change log and puts it into a replication queue, and the replication messages are then propagated to other nodes. Secondly, NoSQL databases provide replication in order to achieve high availability. Cassandra uses a masterless, peer-to-peer replication strategy, which provides a high level of availability and durability. HBase, by contrast, prefers a master-slave replication strategy, and therefore accepts the possibility of a single point of failure.
Both Cassandra and HBase use timestamps to version data and adopt the last-write-wins (LWW) approach. They also write a tombstone marker when a deletion is requested. During compaction, the newest version wins, whether it is a value or a tombstone. Data versioning is an important concept for supporting concurrency.
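A minimal sketch of timestamp-based last-write-wins with tombstones follows. The `Cell` type, the use of `None` as the tombstone marker and the tie-breaking rule are assumptions made for illustration, not the storage format of either system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None stands for a tombstone (a recorded deletion)
    timestamp: int

def merge(a, b):
    """Last write wins: keep the cell with the larger timestamp.
    On equal timestamps the first argument is kept (an arbitrary choice here)."""
    return a if a.timestamp >= b.timestamp else b

def compact(versions):
    """Collapse all versions of one column into the surviving cell.
    Returns None when the newest version is a tombstone."""
    newest = versions[0]
    for cell in versions[1:]:
        newest = merge(newest, cell)
    return None if newest.value is None else newest

print(compact([Cell("a", 1), Cell("b", 3), Cell("c", 2)]))  # newest value survives
print(compact([Cell("a", 1), Cell(None, 5)]))               # newest is a tombstone
```

Because the decision depends only on timestamps, replicas can apply the same versions in any order and still converge on the same result.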
Cassandra satisfies Availability and Partition Tolerance in the CAP (Consistency, Availability, Partition Tolerance) theorem, and HBase satisfies Consistency and Partition Tolerance. To achieve availability, Cassandra has to trade off availability against consistency levels; through configuration it can provide either strong consistency or eventual consistency. HBase, however, offers only strong consistency, and so it has to sacrifice availability.
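The configurable trade-off can be made concrete with the overlap rule commonly used for Dynamo-style stores: with N replicas, a read that contacts R replicas is guaranteed to see the latest write acknowledged by W replicas when R + W > N. The sketch below is a simplified model; the function names are hypothetical, and the level names ONE, QUORUM and ALL are mapped to replica counts only for illustration.

```python
def replicas_required(level, n):
    """Map a consistency level name to the number of replicas that must respond."""
    counts = {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}
    return counts[level]

def reads_see_latest_write(n, write_level, read_level):
    """True when the read set and the write set must share at least one replica."""
    w = replicas_required(write_level, n)
    r = replicas_required(read_level, n)
    return r + w > n

# Three replicas: which level combinations give strong consistency?
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, reads_see_latest_write(3, write_level, read_level))
```

Lower levels answer as soon as fewer replicas respond, which favours availability; higher levels wait for more replicas, which favours consistency.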
Support of traditional database constraints
Traditional relational databases normally use several constraints to limit the data allowed in a table, including NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK and INDEX. Cassandra supports PRIMARY KEY and provides INDEX, but it does not support FOREIGN KEY and does not enforce NOT NULL, DEFAULT, UNIQUE or CHECK in the way a relational database does. Even though INDEX is provided in Cassandra, its use is not recommended. As a consequence, Cassandra has to use many corresponding tables to serve read requests and accept the resulting data redundancy. HBase likewise has no FOREIGN KEY constraint, so references between rows have to be maintained by the application, and INDEX is not offered in HBase.
ACID stands for Atomicity, Consistency, Isolation and Durability. In Cassandra, firstly, all individual writes are atomic at the row level. Secondly, even though Cassandra can provide strict consistency, this has a different scope from the Consistency in ACID. Thirdly, Cassandra provides no transactional isolation. Finally, the commit log makes updates durable. In HBase, firstly, mutations are likewise atomic at the row level. Secondly, scan operations provide a consistent view of the data. Thirdly, writes are isolated by locking rows, whereas reads are not isolated. Finally, all visible or retrievable data is durable.
Existence of any Single Points of Failure
Single points of failure are usually associated with the master-slave replication architecture. Cassandra has no single point of failure, because it does not operate under a master-slave replication strategy. This brings many advantages for availability, but it also causes extra traffic between nodes. HBase, on the other hand, does have a single point of failure because of its master-slave architecture.
CouchDB and MongoDB both use the document-oriented data model, but MongoDB is BSON based and CouchDB is JSON based. In MongoDB, data is stored in documents and documents are grouped into collections, whereas CouchDB has no collections. MongoDB supports both embedding and referencing designs, while CouchDB prefers to model data as self-contained documents. MongoDB supports dynamic queries, and the Mongoose library introduces the constraint that all documents in a collection must follow the same schema, whereas CouchDB is a schema-less database.
MongoDB partitions data by sharding. CouchDB does not support partitioning by itself, but it can partition data with the help of Lounge or Twitter’s Gizzard framework. MongoDB has two partitioning strategies, range-based partitioning and hash-based partitioning; range-based partitioning on the shard key can distribute data unevenly, whereas hash-based partitioning is designed to distribute it evenly.
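The difference in data distribution between the two strategies can be sketched in Python. This is a simplified model rather than MongoDB's implementation: the split points, the four shards, the MD5 hash and the run of monotonically increasing keys are all assumptions for illustration.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

SPLIT_POINTS = [1000, 2000, 3000]  # hypothetical chunk boundaries, giving 4 shards

def range_shard(key):
    """Range-based: the shard is found from the interval the key falls into."""
    return bisect_right(SPLIT_POINTS, key)

def hash_shard(key, num_shards=4):
    """Hash-based: the shard is derived from a hash of the key."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A burst of new, steadily increasing shard keys (for example sequential ids).
new_keys = range(3000, 3100)
range_load = Counter(range_shard(k) for k in new_keys)
hash_load = Counter(hash_shard(k) for k in new_keys)

print("range-based:", dict(range_load))  # every new key lands in the last interval
print("hash-based: ", dict(hash_load))   # the same keys are spread over the shards
```

Range-based placement keeps neighbouring keys together, which helps range queries but can overload one shard; hashing spreads the load at the cost of scattering neighbouring keys.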
The replication in CouchDB and MongoDB is very different. MongoDB uses a design called a replica set, which resembles master-slave replication except that the slaves are capable of failover and arbitration. CouchDB, on the other hand, supports not only master-slave replication but also master-master replication. One of CouchDB’s strengths is synchronising two copies of the same database. In addition, CouchDB uses replication for scalability, while MongoDB uses replication for availability and sharding for scalability.
Versioning
Versioning is an extra feature that is not fundamental to document databases. Neither MongoDB nor CouchDB supports it by default, but different solutions are available.
MongoDB satisfies Consistency and Partition Tolerance in the CAP theorem, whereas CouchDB satisfies Availability and Partition Tolerance. MongoDB achieves strict consistency when all reads are configured to go to the master, and it can also provide eventual consistency by letting reads go to the slaves. With eventual consistency, even load distribution and multi-data-center support become easier for MongoDB. CouchDB achieves eventual consistency through incremental replication, a process in which document changes are periodically copied between servers.
Of the constraints NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK and INDEX, MongoDB and CouchDB do not support NOT NULL, DEFAULT, FOREIGN KEY or CHECK because of the document data model. In MongoDB, the _id field of every document acts as a primary key, and indexes can be used to enforce a UNIQUE constraint. In contrast, CouchDB has no PRIMARY KEY declaration; a document is identified only by its _id.
CouchDB provides ACID semantics through MVCC (Multi-Version Concurrency Control), which can handle a high volume of concurrent reads and writes without locking. MongoDB, on the other hand, is not ACID compliant in the traditional sense, but it is compliant at the document level. For atomicity, a write to a document either completes entirely or not at all; for consistency and isolation, no reader sees a partially applied update; and write concern provides durability.
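The idea behind multi-version concurrency control can be sketched with a small in-memory store in which every write creates a new revision and a writer must name the revision it read. The class and method names below are hypothetical, and integer revisions are a simplification; this is a toy model, not CouchDB's API.

```python
class ConflictError(Exception):
    """Raised when a write is based on a revision that is no longer current."""

class MvccStore:
    def __init__(self):
        self._versions = {}  # doc_id -> list of bodies; list index + 1 is the revision

    def get(self, doc_id, rev=None):
        """Return (revision, body); older revisions stay readable."""
        versions = self._versions[doc_id]
        rev = len(versions) if rev is None else rev
        return rev, versions[rev - 1]

    def put(self, doc_id, body, rev=None):
        """Append a new revision if `rev` matches the current one, else refuse."""
        versions = self._versions.get(doc_id, [])
        current_rev = len(versions) or None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        self._versions[doc_id] = versions + [dict(body)]
        return len(versions) + 1

store = MvccStore()
first = store.put("doc", {"n": 1})              # creates revision 1
second = store.put("doc", {"n": 2}, rev=first)  # based on revision 1, creates 2
try:
    store.put("doc", {"n": 3}, rev=first)       # stale revision is rejected
except ConflictError as err:
    print("rejected:", err)
```

No lock is held at any point: readers keep seeing a complete revision, and a writer with a stale revision is told to re-read and retry instead of being blocked.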
Replica sets in MongoDB provide high availability for shards, which guarantees that MongoDB has no single point of failure. In many master-slave databases, for instance, the data becomes unavailable if the master becomes unavailable, whereas in MongoDB the slaves elect a new master to hold all the data if the master goes down. Because CouchDB supports both master-slave and master-master replication, it is tunable between availability and consistency, which also influences whether a SPOF (Single Point of Failure) exists.
This section compares how appropriate column family based and document based CDBMSs are for a given application. A typical application of Cassandra is discussed first, followed by an analysis of how appropriate MongoDB would be for the same application.
Cassandra has proven to be one of the best solutions for time series data. Time series data is normally used in statistics, econometrics, finance, seismology, meteorology and geophysics. A typical time series application is Financial Transactions, whose functional requirements are recording stock trading data and monitoring it.
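CREATE TABLE stock_ticks (
    symbol text,
    date int,
    trade timeuuid,
    trade_details text,
    PRIMARY KEY ((symbol, date), trade)
) WITH CLUSTERING ORDER BY (trade DESC);

-- All trades of one symbol on one day (? marks a bound value)
SELECT trade, trade_details
FROM stock_ticks
WHERE symbol = ? AND date = ?;

-- Details of a single trade; the full partition key (symbol, date) is required
SELECT trade_details
FROM stock_ticks
WHERE symbol = ? AND date = ? AND trade = ?;

-- The ten most recent trades of one symbol on one day
SELECT trade, trade_details
FROM stock_ticks
WHERE symbol = ? AND date = ?
LIMIT 10;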
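How the compound primary key shapes these queries can be modelled in a few lines of Python. This is only a sketch of the idea: the class name is hypothetical, an integer stands in for the timeuuid, and the symbol and dates are made-up values.

```python
from bisect import insort
from collections import defaultdict

class StockTicks:
    """Toy model: one partition per (symbol, date), rows ordered by trade time."""

    def __init__(self):
        self._partitions = defaultdict(list)  # (symbol, date) -> sorted [(trade_ts, details)]

    def insert(self, symbol, date, trade_ts, details):
        # Rows of the same partition are kept sorted by the clustering column.
        insort(self._partitions[(symbol, date)], (trade_ts, details))

    def latest(self, symbol, date, limit=10):
        # Equivalent of reading one partition in descending trade order with a LIMIT.
        rows = self._partitions.get((symbol, date), [])
        return rows[::-1][:limit]

ticks = StockTicks()
ticks.insert("ABC", 20240102, 3, "sell 5")
ticks.insert("ABC", 20240102, 1, "buy 10")
ticks.insert("ABC", 20240102, 2, "buy 7")
print(ticks.latest("ABC", 20240102, limit=2))
```

Every query names the whole partition key, so it touches a single partition, and the newest trades sit next to each other inside that partition.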
There are many factors to consider before choosing Cassandra, such as the type of data to be stored and the type of queries to be run. Several reasons for choosing Cassandra for this Financial Transactions application are highlighted and discussed below.
Cassandra is designed to store large quantities of data, so the ability to scale is one of its primary goals. Cassandra achieves scalability by partitioning data across the machines of a cluster. Applications with time series data always accumulate a huge volume of incremental data, so the capacity to scale is a significant feature.
Cassandra makes a trade-off between consistency and availability. For applications that require high availability, Cassandra may be a great choice, because its p2p (peer-to-peer) architecture provides a high level of availability. Availability is always a primary requirement for financial data.
Replication is normally used to provide availability in NoSQL databases. Cassandra can replicate across multiple data centers, which may be very important for this application, because it lets Cassandra provide high availability in a global environment.
In Cassandra, a write operation is appended to the commit log and applied to an in-memory data structure known as the memtable; nothing is locked and no random disk I/O is needed. Writes are therefore very fast in Cassandra. An application like Financial Transactions has a workload that involves many writes and few reads, so Cassandra is a very good fit for it.
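The write path can be illustrated with a toy log-structured store in Python. It combines the memtable described here with the commit log mentioned earlier for durability and with immutable SSTables; the class name, the flush threshold and the in-memory lists standing in for files are assumptions for illustration.

```python
class WritePath:
    """Toy model: append to a commit log, update the memtable, flush when full."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in-memory, holds the newest value per key
        self.sstables = []     # immutable sorted tuples, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # sequential append, no lock, no seek
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The memtable is written out once as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need replaying

    def read(self, key):
        # Reads check the memtable first, then the tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

db = WritePath(flush_threshold=2)
db.write("a", 1)
db.write("b", 2)   # reaches the threshold and triggers a flush
db.write("a", 3)   # newer value lives in the memtable
print(db.read("a"), db.read("b"), db.sstables)
```

The sketch also shows why reads are the more expensive side: a write touches two append-friendly structures, while a read may have to look through several tables.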
MongoDB is a schemaless database, which means that collections and fields are flexible. MongoDB and Cassandra share the goal of high scalability, but they adopt different architectures to achieve it. The Financial Transactions application is a typical time series application that can also be implemented with MongoDB; however, some limitations of MongoDB make it an imperfect fit for time series data.
The Financial Transactions application produces a large amount of incremental time series data belonging to the same group. In MongoDB, storing this data as a list is not a good choice, yet storing each data point as its own document slows down the entire application, because the documents are scattered across the disk. Cassandra, on the other hand, uses a column family data model that can organise this data in a single row, so that all the data in the same row is stored together.
Because of its p2p architecture, Cassandra has only one type of node, which makes it much easier to manage. Cassandra’s SSTables are immutable, so data is never updated in place. The Financial Transactions application, like many time series workloads, does not need updates either. As a consequence, backups in Cassandra are very simple, because the SSTables are immutable. Backups in MongoDB, on the other hand, are more complicated, because the documents are constantly being updated.
Time series data requires a large number of inserts, but MongoDB must update every index associated with a collection on each insert, update and delete. Every index on a collection therefore adds some overhead to write operations. In general, MongoDB delivers better performance for reads. So for most time series data, which involves more writes than reads, MongoDB may not be a wise choice.
This section again compares how appropriate column family based and document based CDBMSs are for a given application. A typical application of MongoDB is discussed first, followed by an analysis of how appropriate Cassandra would be for the same application.
If dynamic queries and indexes are needed, MongoDB may be one of the best choices among NoSQL databases. Along with a flexible data model, MongoDB provides many options for querying. A typical demographic application is discussed in this section.
{
    "_id" : String,
    "city" : String,
    "state" : String,
    "pop" : Integer,
    "loc" : List
}
db.pop.createIndex(
    {"city": 1, "state": 1},
    {unique: true}
)
db.pop.aggregate([
    {$group: {_id: "$state", total_pop: {$sum: "$pop"}}},
    {$match: {total_pop: {"$gte": 100000000}}}
])
db.pop.aggregate([
    {$group: {_id: {state: "$state", city: "$city"}, pop: {$sum: "$pop"}}},
    {$sort: {pop: 1}},
    {$group: {
        _id: "$_id.state",
        smallest_city: {$first: "$_id.city"},
        smallest_pop: {$first: "$pop"},
        biggest_city: {$last: "$_id.city"},
        biggest_pop: {$last: "$pop"}
    }}
])
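What the second pipeline is meant to compute, the smallest and the largest city of each state, can be traced stage by stage in plain Python. The function name and the sample documents are hypothetical, and ties in population are broken by city name here, which the pipeline itself does not specify.

```python
from collections import defaultdict

def smallest_and_biggest_city(docs):
    # First $group stage: total population per (state, city).
    city_pop = defaultdict(int)
    for doc in docs:
        city_pop[(doc["state"], doc["city"])] += doc["pop"]

    # Collect the cities of each state for the $sort and second $group stages.
    by_state = defaultdict(list)
    for (state, city), pop in city_pop.items():
        by_state[state].append((pop, city))

    result = {}
    for state, cities in by_state.items():
        cities.sort()                           # ascending by population
        smallest_pop, smallest_city = cities[0]   # corresponds to $first
        biggest_pop, biggest_city = cities[-1]    # corresponds to $last
        result[state] = {
            "smallest_city": smallest_city,
            "smallest_pop": smallest_pop,
            "biggest_city": biggest_city,
            "biggest_pop": biggest_pop,
        }
    return result

# Made-up documents in the shape of the schema above.
sample = [
    {"city": "X", "state": "AA", "pop": 10},
    {"city": "X", "state": "AA", "pop": 5},
    {"city": "Y", "state": "AA", "pop": 40},
    {"city": "Z", "state": "AA", "pop": 3},
    {"city": "W", "state": "BB", "pop": 7},
]
print(smallest_and_biggest_city(sample))
```

In MongoDB the same three stages run inside the database, so the application does not have to fetch every document to produce the summary.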
MongoDB’s flexible data structure, its ability to index and query data, and its auto-sharding make it a strong CDBMS. The example application fits MongoDB well, because MongoDB can easily handle these complex queries. The factors involved in choosing MongoDB are discussed in this section.
MongoDB provides high availability through sharding and replica sets, and provides scalability through sharding. By sharding, MongoDB can store data on multiple machines to cope with data growth. Data can also be managed and retrieved more efficiently with an appropriate shard key. MongoDB’s replica sets provide higher availability than traditional master-slave setups, because the slaves are capable of failover and arbitration.
MongoDB documents and fields are flexible because of the schemaless data model. One collection can hold different documents with distinct numbers of fields, content and size. In this application, the geographic structure of a document may differ because of the distinct rules of different countries. MongoDB’s document data model may therefore be one of the most suitable for it.
MongoDB supports dynamic queries on documents through a document-based query language that is nearly as powerful as SQL, and it provides flexible indexes on any attribute. These advantages handle the complex queries of this demographic application well.
One excellent feature of MongoDB is that updates can happen “in place”, meaning that the database does not need to allocate or write a new copy of the object. This gives high performance for use cases with frequent updates. The population figures in the demographic application need to be updated frequently, and MongoDB can handle this efficiently.
Cassandra provides scalability, tunable consistency, availability and no SPOF, which makes it seem like a perfect NoSQL CDBMS. However, no database is perfect, and many limitations restrict the use of Cassandra. The reasons that limit the use of Cassandra in this application are discussed below.
Cassandra is designed for use cases that have more writes than reads; as a result, it can hardly cope with complex queries or a very large number of queries. In this application, MongoDB can easily obtain the result through aggregation, whereas retrieving the same data from Cassandra would require a very complicated data model design.
Cassandra does not provide aggregation functions such as sum, min and max. Although later versions of Cassandra provide counters that can implement some of these functions, complex aggregation functions remain a challenge for Cassandra.
In Cassandra database design, the views must be defined in advance; that is, tables have to be designed around the read requests. This limits how easily use cases can change and makes applications hard to update: every time a new table is created, a large amount of data has to be migrated into it.