##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR

HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第1张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.
HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第2张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第3张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第4张图片
versioned data

Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docsHBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第5张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第6张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第7张图片
HBase Data Model

HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docs
HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docsHBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第8张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第9张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第10张图片
Sparse data

Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docsHBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第11张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第12张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第13张图片
logical data model vs physical data storage

Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docsHBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第14张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第15张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第16张图片
Hbase column families

The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
[图片上传中。。。(10)]
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docsHBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR https://mapr.com/blog/hbase-and-mapr-db-designed-distribution-scale-and-speed/

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第17张图片
Paste_Image.png

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher powered, more reliable HBase deployments sound appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss its limitations:
Relational databases have provided a standard persistence model
SQL has become a de-facto standard model of data manipulation (SQL)
Relational databases manage concurrency for transactions
Relational database have lots of tools

[图片上传中。。。(1)]
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
[图片上传中。。。(2)]
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines, However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
[图片上传中。。。(3)]
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
[图片上传中。。。(4)]
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
[图片上传中。。。(5)]
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.

##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第18张图片
HBase distributed database

HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
[图片上传中。。。(7)]
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
[图片上传中。。。(8)]
Tables are split into regions = contiguous keys
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
[图片上传中。。。(9)]
The data is stored in HBase table cells. The entire cell, with the added structural information, is called Key Value. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR_第19张图片
Hbase table cells

Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.
[图片上传中。。。(11)]
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row:family:column, there can be multiple versions of the value.
[图片上传中。。。(12)]
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
[图片上传中。。。(13)]
Versioned data
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
Installing HBase on MapR
Getting Started with HBase on MapR
Release notes for HBase on MapR
Apache HBase docs

你可能感兴趣的:(##[图]HBase and MapR-DB: Designed for Distribution, Scale, and Speed | MapR)