Below are the key features of Hive that differ from RDBMS.
Hive resembles a traditional database by supporting SQL interface but it is not a full database. Hive can be better called as data warehouse instead of database.
Hive enforces schema on read time whereas RDBMS enforces schema on write time.
In RDBMS, a table‘s schema is enforced at data load time, If the data being
loaded doesn’t conform to the schema, then it is rejected. This design is called schema on write.
But Hive does‘t verify the data when it is loaded, but rather when a
it is retrieved. This is called schema on read.
Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load operation is just a file copy or move.
Schema on write makes query time performance faster, since the database can index columns and perform compression on the data but it takes longer to load data into the database.
Hive is based on the notion of Write once, Read many times but RDBMS is designed for Read and Write many times.
In RDBMS, record level updates, insertions and deletes, transactions and indexes are possible. Whereas these are not allowed in Hive because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.
In RDBMS, maximum data size allowed will be in 10鈥檚 of Terabytes but whereas Hive can 100鈥檚 Petabytes very easily.
As Hadoop is a batch-oriented system, Hive doesn鈥檛 support OLTP (Online Transaction Processing) but it is closer to OLAP (Online Analytical Processing) but not ideal since there is significant latency between issuing a query and receiving a reply, due to the overhead of Mapreduce jobs and due to the size of the data sets Hadoop was designed to serve.
RDBMS is best suited for dynamic data analysis and where fast responses are expected but Hive is suited for data warehouse applications, where relatively static data is analyzed, fast response times are not required, and when the data is not changing rapidly.
To overcome the limitations of Hive, HBase is being integrated with Hive to support record level operations and OLAP.
Hive is very easily scalable at low cost but RDBMS is not that much scalable that too it is very costly scale up.
MySQL clusters use auto-sharding, and it's designed to randomly distribute the data so no one machine gets hit with more of the load. On the other hand, Hadoop allows you to explicitly define the data partition so that multiple data points that require simultaneous access will be on the same machine, minimizing the amount of communication among the machines necessary to get the job done. This makes Hadoop better for processing massive data sets in many cases.
Hive’s Basic Architecture
As an overview, the major architectural components of Hive include the following:
Hadoop Distributed File System (HDFS): Hive uses the Hadoop HDFS system for storage.
MapReduce / YARN: As part of the Hadoop umbrella of projects, Hive uses MapReduce for parallel processing functions. It can also use Tez
Hive Driver / Compiler: The Hive Driver executes HiveQL queries which are parsed by the compiler to generate an execution plan by examining both the query blocks and expressions. The query is also compared against the metadata from the metastore to make sure it is valid. The resulting execution plan is executed on the Hadoop cluster via MapReduce or Tez.
Hive Thrift Server: Clients access Hive via the Hive Thrift Server, which allows any JDBC or ODBC-compliant application to access Hive. Along with programming language bindings for languages like PHP and Python a command line interface is also available.
Hive Metastore: The metastore contains information about the partitions and tables in the warehouse, data necessary to perform read and write functions, and HDFS file and data locations.
You can find a full explanation of the Hive architecture on the Apache Wiki.
Hive vs. MySQL
While each tool performs a similar general action, retrieving data, each does it in a very different way. Whereas Hive is intended as a convenience/interface for querying data stored in HDFS, MySQL is intended for online operations requiring many reads and writes.
One good example of this difference in action is in forming table schemas. Hive uses a method of querying data known as “schema on read,” which allows a user to redefine tables to match the data without touching the data. Hive has serialization and deserialization adapters to let the user do this, so it isn’t intended for online tasks requiring heavy read/write traffic. On the flip side, MySQL utilizes “schema on write”, which means you define table schemas before you can add data to the store. This allows MySQL to store it in an optimal way for fast reading and writing. These differing approaches are a good example of how these two technologies differ. You can read more about schema on read vs schema on write on the Marklogic blog.
WHEN TO USE HIVE
If you have large (think terabytes/petabytes) datasets to query: Hive is designed specifically for analytics on large datasets and works well for a range of complex queries. Hive is the most approachable way to quickly (relatively) query and inspect datasets already stored in Hadoop.
If extensibility is important: Hive has a range of user function APIs that can be used to build custom behavior in to the query engine. Check out my guide to Hive functions if you’d like to learn more.
WHEN TO USE MYSQL
If performance is key: If you need to pull data frequently and quickly, such as to support an application that uses online analytical processing (OLAP), MySQL performs much better. Hive isn’t designed to be an online transactional platform, and thus performs much more slowly than MySQL.
If your datasets are relatively small (gigabytes): Hive works very well in large datasets, but MySQL performs much better with smaller datasets and can be optimized in a range of ways.
If you need to update and modify a large number of records frequently: MySQL does this kind of activity all day long. Hive, on the other hand, doesn’t really do this well (or at all, depending). And if you need an interactive experience, use MySQL.
Through this summary of the differences between Hive and MySQL, I hope I’ve helped provide some direction on which platform to use in different applications and environments. For additional points of comparison, check out this post on the Hadoop Tutorial website.