Summary
In case you missed something along the way, here is a quick overview of the material
covered in this chapter.
HBase is a database designed for semistructured data and horizontal scalability. It
stores data in tables. Within a table, data is organized over a four-dimensional coordinate
system: rowkey, column family, column qualifier, and version. HBase is schema-less,
requiring only that column families be defined ahead of time. It’s also type-less, storing
all data as uninterpreted arrays of bytes. There are five basic commands for interacting
with data in HBase: Get, Put, Delete, Scan, and Increment. The only way to query
HBase based on non-rowkey values is by a filtered scan.
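One way to picture the four coordinates is as nested sorted maps, each level keyed by one coordinate. The sketch below is a toy in-memory model built only on the Java standard library; the class name CellCoordinatesSketch, its methods, and the sample row and column names are hypothetical and are not part of the HBase API. It uses String keys for readability, whereas HBase treats every coordinate and value as bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of the logical layout only; none of this is HBase API.
class CellCoordinatesSketch {
    // rowkey -> column family -> column qualifier -> version -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>>> rows =
            new TreeMap<>();

    // Store one value at the given four coordinates.
    void put(String row, String family, String qualifier, long version, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Return the newest version of a cell, or null if the cell does not exist.
    byte[] getLatest(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> families = rows.get(row);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, byte[]>> qualifiers = families.get(family);
        if (qualifiers == null) return null;
        NavigableMap<Long, byte[]> versions = qualifiers.get(qualifier);
        if (versions == null) return null;
        return versions.firstEntry().getValue(); // versions are sorted newest first
    }

    // Rowkeys are kept in sorted order; this returns the smallest one.
    String firstRow() {
        return rows.firstKey();
    }

    public static void main(String[] args) {
        CellCoordinatesSketch table = new CellCoordinatesSketch();
        table.put("row-b", "info", "email", 1L, "old@example.com".getBytes(StandardCharsets.UTF_8));
        table.put("row-b", "info", "email", 2L, "new@example.com".getBytes(StandardCharsets.UTF_8));
        table.put("row-a", "info", "email", 1L, "a@example.com".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(table.getLatest("row-b", "info", "email"), StandardCharsets.UTF_8));
        System.out.println(table.firstRow());
    }
}
```

Because the values are plain byte arrays, the caller decides how to interpret them; here they happen to be UTF-8 strings.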
HBase is not an ACID-compliant database
HBase isn’t an ACID-compliant database. But HBase provides some guarantees that
you can use to reason about the behavior of your application’s interaction with the
system. These guarantees are as follows:
1 Operations are row-level atomic. In other words, any Put() on a given row
either succeeds in its entirety or fails and leaves the row the way it was
before the operation started. There will never be a case where part of the row
is written and another part is left out. This property holds regardless of the
number of column families across which the operation is performed.
2 Interrow operations are not atomic. There are no guarantees that all operations
will complete or fail together in their entirety. Each individual row operation
is still atomic, as described in the previous point.
3 checkAnd* and increment* operations are atomic.
4 Multiple write operations to a given row are always independent of each other
in their entirety. This is an extension of the first point.
5 Any Get() operation on a given row returns the complete row as it exists at
that point in time in the system.
6 A scan across a table is not a scan over a snapshot of the table at any point in time.
If a row R is mutated after the scan has started but before R is read by the
scanner object, the updated version of R is read by the scanner. But the data
read by the scanner is consistent and contains the complete row at the time
it’s read.
In the context of building applications with HBase, these are the important points
you need to be aware of.
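To make the atomic increment and check-and-put guarantees concrete, here is a loose in-memory analogy using the Java standard library. It is only an illustration of the semantics; the class AtomicOpsSketch and its methods are hypothetical, it models a single numeric cell per key, and it says nothing about how HBase implements these operations.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory analogy for atomic increment and check-and-put; not HBase API.
class AtomicOpsSketch {
    private final ConcurrentMap<String, Long> cells = new ConcurrentHashMap<>();

    // Increment analogue: the read-modify-write happens as one atomic step.
    long increment(String cell, long amount) {
        return cells.merge(cell, amount, Long::sum);
    }

    // Check-and-put analogue: write newValue only if the cell currently holds
    // the expected value. A null expected value means "the cell must not exist".
    boolean checkAndPut(String cell, Long expected, long newValue) {
        if (expected == null) {
            return cells.putIfAbsent(cell, newValue) == null;
        }
        return cells.replace(cell, expected, newValue);
    }

    // Current value of the cell, or null if it was never written.
    Long get(String cell) {
        return cells.get(cell);
    }

    public static void main(String[] args) {
        AtomicOpsSketch sketch = new AtomicOpsSketch();
        System.out.println(sketch.increment("hits", 1));
        System.out.println(sketch.checkAndPut("hits", 1L, 10)); // check matches, write applied
        System.out.println(sketch.checkAndPut("hits", 1L, 20)); // stale check, write rejected
    }
}
```

The point of the analogy is that a caller never observes a half-applied update: the check and the write either both happen or neither does.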
The data model is logically organized as either a key-value store or a sorted map of maps.
The physical data model is column-oriented along column families and individual records
are stored in a key-value style. HBase persists data records into HFiles, an immutable file
format. Because records can’t be modified once written, new values are persisted to
new HFiles. The view of the data is reconciled on the fly at read time and during compactions.
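The idea of immutable files reconciled at read time can be sketched with a small in-memory model. This is a simplified analogy, not the HFile format or HBase's actual read path: the class ImmutableFilesSketch and its methods are hypothetical, each "file" is just a read-only sorted map, and deletes are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of immutable store files; a loose analogy, not the HFile format.
class ImmutableFilesSketch {
    // Oldest file first, newest file last. Each file is a read-only sorted map.
    private final List<NavigableMap<String, String>> files = new ArrayList<>();

    // Persist a batch of writes as a brand-new immutable file.
    void flush(Map<String, String> batch) {
        NavigableMap<String, String> copy = new TreeMap<>(batch);
        files.add(Collections.unmodifiableNavigableMap(copy));
    }

    // Read-time reconciliation: consult files newest first, return the first hit.
    String get(String key) {
        for (int i = files.size() - 1; i >= 0; i--) {
            String value = files.get(i).get(key);
            if (value != null) return value;
        }
        return null;
    }

    // Compaction: merge all files into one, letting newer values win.
    void compact() {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> file : files) {
            merged.putAll(file); // later (newer) files overwrite earlier entries
        }
        files.clear();
        files.add(Collections.unmodifiableNavigableMap(merged));
    }

    int fileCount() {
        return files.size();
    }

    public static void main(String[] args) {
        ImmutableFilesSketch store = new ImmutableFilesSketch();
        Map<String, String> first = new TreeMap<>();
        first.put("row1", "v1");
        store.flush(first);
        Map<String, String> second = new TreeMap<>();
        second.put("row1", "v2");
        store.flush(second);
        System.out.println(store.get("row1") + " from " + store.fileCount() + " files");
        store.compact();
        System.out.println(store.get("row1") + " from " + store.fileCount() + " file");
    }
}
```

Reads return the same answer before and after compaction; compaction only reduces the number of files a read has to consult.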
The HBase Java client API exposes tables via the HTableInterface. Table connections
can be established by constructing an HTable instance directly. Instantiating an
HTable is expensive, so the preferred approach is to obtain tables from an HTablePool,
which manages connection reuse. Tables are created and manipulated via instances of the
HBaseAdmin, HTableDescriptor, and HColumnDescriptor classes. All five commands
are exposed via their respective command objects: Get, Put, Delete, Scan, and Increment.
Commands are sent to the HTableInterface instance for execution. A variant of
Increment is also available using the HTableInterface.incrementColumnValue()
method. The results of executing Get, Scan, and Increment commands are returned as
Result and ResultScanner instances. Each record returned is represented
by a KeyValue instance. All of these operations are also available on the command line
via the HBase shell.
Schema designs in HBase are heavily influenced by anticipated data-access patterns.
Ideally, the tables in your schema are organized according to these patterns.
The rowkey is the only fully indexed coordinate in HBase, so queries are often implemented
as rowkey scans. Compound rowkeys are a common practice in support of
these scans. An even distribution of rowkey values is often desirable because it spreads
reads and writes across the cluster instead of concentrating them in one place. Hashing
algorithms such as MD5 or SHA-1 are commonly used to achieve even distribution.
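A compound rowkey with a hashed prefix might be assembled as follows. This is a minimal sketch using only the Java standard library, under assumed requirements: the class RowkeySketch, the choice of a user ID as the hashed component, and the timestamp suffix are all hypothetical illustrations, not a layout prescribed by HBase.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Builds a compound rowkey: MD5(userId) followed by an 8-byte timestamp.
class RowkeySketch {
    static final int MD5_LENGTH = 16; // MD5 digests are always 16 bytes

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Fixed-width hash prefix plus big-endian timestamp suffix.
    static byte[] rowkey(String userId, long timestamp) {
        return ByteBuffer.allocate(MD5_LENGTH + Long.BYTES)
                .put(md5(userId))
                .putLong(timestamp)
                .array();
    }

    // Lowercase hex rendering, for printing keys.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(rowkey("abc", 1L)));
        System.out.println(hex(rowkey("abc", 2L)));
    }
}
```

Every key for the same user shares the same 16-byte prefix, so a scan over that prefix retrieves that user's rows, while the hash spreads different users across the key space. The fixed width of the hash also means the timestamp always starts at the same offset.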