COMP9313 - NoSQL Technologies

What does RDBMS provide?

  • Relational model with schemas.
  • Powerful, flexible query language (SQL).
  • Transactional semantics: ACID (Atomicity, Consistency, Isolation, Durability).
  • Rich ecosystems and lots of tool support (MySQL, PostgreSQL, etc.).

Key features of NoSQL:

  • non-relational
  • doesn't require strict schema
  • data are replicated to multiple nodes and can be partitioned
  • down nodes easily replaced
  • no single point of failure
  • horizontal scalability
  • cheap, easy to implement
  • massive write performance
  • fast key-value access

Why NoSQL?
Web apps have different needs(than the apps that RDBMS were designed for)

  • Low and predictable response time.
  • Scalability & elasticity.
  • High availability.
  • Flexible schemas/semi-structured data.
  • Geographic distribution (multiple datacenters).

Web apps can do without:

  • Transactions/strong consistency/integrity
  • Complex queries

CAP Theorem

Three properties of distributed systems:

  • Consistency
    all copies have the same value
  • Availability
    reads and writes always succeed
  • Partition-tolerance
    System properties always hold even when network failures prevent some machines from communicating with others.

For any system sharing data, it is impossible to guarantee all three of these properties simultaneously: a shared-data system can provide at most two of them.
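The trade-off can be made concrete with a toy quorum model (a sketch only; the names N, W, R and the helper functions are illustrative, not from any real system). With N replicas, requiring W write acknowledgements and R read acknowledgements such that R + W > N guarantees reads overlap the latest write, but when a partition leaves too few replicas reachable, operations must fail rather than return stale data:

```python
# Toy model of the consistency/availability trade-off under partitions.
# Quorum settings: R + W > N forces read/write sets to overlap, giving
# strong consistency at the cost of availability during partitions.

N, W, R = 3, 2, 2

def write_ok(reachable_replicas: int) -> bool:
    """A write succeeds only if a write quorum is reachable."""
    return reachable_replicas >= W

def read_ok(reachable_replicas: int) -> bool:
    """A read succeeds only if a read quorum is reachable."""
    return reachable_replicas >= R

# No partition: fully available.
assert write_ok(3) and read_ok(3)
# Partition isolates one node; a majority remains: still available.
assert write_ok(2) and read_ok(2)
# Partition leaves one reachable replica: the system chooses
# consistency, so quorum operations become unavailable.
assert not write_ok(1) and not read_ok(1)
```

Choosing R = W = 1 instead would keep the system available under any partition, but reads could miss the latest write, which is exactly the CP-vs-AP choice the theorem describes.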

Consistency

All clients have the same view of the data.
Once a writer has written data, all readers will see that data.

Two kinds of consistency:

  • Strong consistency: ACID
  • Weak consistency: BASE (Basically Available, Soft state, Eventually consistent)

A consistency model determines rules for visibility and apparent order of updates.
CAP theorem states: strong consistency can't be achieved at the same time as availability and partition-tolerance.

Eventual Consistency
When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent.

For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from the service.

Under CAP, large shared-data systems that relax strong consistency follow BASE:

  • Basically Available: the system appears to work at all times.
  • Soft state: it doesn't have to be consistent all the time.
  • Eventually consistent: it becomes consistent at some later time.
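Convergence can be sketched with a toy anti-entropy exchange (illustrative only; the replica state and sweep function are invented for this example, not a real protocol). Each replica holds a (timestamp, value) pair; pairwise exchanges keep the newer write, so once updates stop, every replica ends up with the latest value:

```python
# Minimal sketch of eventual consistency: replicas reconcile pairwise,
# always keeping the write with the newest timestamp.

replicas = [(0, "old")] * 3
replicas[0] = (1, "new")          # a write accepted by one replica only

def anti_entropy_sweep(reps):
    """One full pairwise exchange; each pair keeps the newest write."""
    for a in range(len(reps)):
        for b in range(a + 1, len(reps)):
            newest = max(reps[a], reps[b])   # tuples compare by timestamp first
            reps[a] = reps[b] = newest

anti_entropy_sweep(replicas)      # updates have stopped; reconciliation runs

assert all(r == (1, "new") for r in replicas)   # all nodes now consistent
```

Real systems gossip with random peers rather than sweeping all pairs, but the invariant is the same: given no new writes, every accepted update eventually reaches every live node.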

Availability

System is available during software and hardware upgrades and node failures.

Partition-tolerance

A system can continue to operate in the presence of network partitions.

NoSQL Taxonomy

  • Key-value stores
    Simple K/V lookups (Distributed Hash Table (DHT))
  • Column stores
    Each key is associated with many attributes (columns)
    NoSQL column stores are actually hybrid row/column stores.
  • Document stores
    Store semi-structured documents (JSON)
  • Graph databases

Key-value

Focus on scaling to huge amounts of data
Designed to handle massive load
Data model: collection of key-value pairs
Ring partitioning and replication
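Ring partitioning can be sketched with consistent hashing, as used by Dynamo-style stores (a minimal illustration; the node names and the `preference_list` helper are hypothetical, not any real system's API). Nodes and keys hash onto a ring; a key is owned by the first node clockwise from its position, and the next nodes hold replicas:

```python
import hashlib
from bisect import bisect

def ring_hash(s: str) -> int:
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Build the ring: nodes sorted by their hash position.
nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_hash(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def preference_list(key: str, replicas: int = 2):
    """Nodes responsible for `key`: the owner plus successor replicas."""
    start = bisect(positions, ring_hash(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replicas)]

owners = preference_list("user:42")
assert len(owners) == 2 and owners[0] != owners[1]   # owner + one replica
assert all(n in nodes for n in owners)
```

The appeal of the ring is elasticity: adding or removing a node only moves the keys adjacent to it, instead of rehashing the whole keyspace.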

Pros:
very fast
very scalable
simple data model
eventual consistency
fault-tolerant

Cons:
Can't model more complex data structures such as objects.

Key-value based: SimpleDB, Redis, Dynamo, Voldemort.

Document-based

Data model: collection of documents.
Document: JSON (JavaScript Object Notation), XML, and other semi-structured formats.
Document-based: MongoDB, Couchbase, Elasticsearch.
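The document model can be sketched as a schema-less store of JSON documents with a scan-based field query (illustrative only; the `insert`/`find` helpers are invented for this example, not any real database's API):

```python
import json

docs = {}   # document id -> document (a dict of arbitrary fields)

def insert(doc_id, doc):
    """Store a document; no schema is enforced."""
    docs[doc_id] = json.loads(json.dumps(doc))   # keep plain JSON data

def find(**criteria):
    """Full scan returning documents whose fields match all criteria."""
    return [d for d in docs.values()
            if all(d.get(k) == v for k, v in criteria.items())]

insert("u1", {"name": "Ada", "city": "Sydney"})
# The second document has an extra field: no fixed schema required.
insert("u2", {"name": "Bob", "city": "Sydney", "age": 30})

assert len(find(city="Sydney")) == 2
assert find(name="Bob")[0]["age"] == 30
```

Real document stores add secondary indexes so `find` doesn't scan every document, but the data model, self-describing documents queried by their fields, is the same.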

Column-based

Based on Google's BigTable paper.
Like column-oriented relational databases (which store data in column order), but with a twist.
Tables: similar to RDBMS, but handle semi-structured data.
One column family can have a variable number of columns.
Cells within the same column family are stored together physically.
Tables are very sparse: most cells have null values.
Comparison RDBMS vs column-based NoSQL:

  • RDBMS: must fetch data from several places on disk and glue it together.
  • Column-based NoSQL: only fetch the column families of the columns required by a query. All columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation (locality).

Column-based: BigTable, HBase, Hypertable, Cassandra, PNUTS.
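The locality argument above can be sketched with one in-memory "file" per column family (a toy model; the family names and `read_family` helper are invented for illustration). A query touching only one family reads one file and never pays I/O for the others:

```python
# One dict per column family, standing in for one on-disk file per family.
# Each "file" maps row key -> {column qualifier: value}.
family_files = {
    "anchor":  {"row1": {"a.com": "link text"}},
    "content": {"row1": {"html": "<html>...</html>"}},
}

def read_family(row_key, family):
    """Fetch one family's columns for a row; other families are untouched."""
    return family_files[family].get(row_key, {})

# A query needing only 'anchor' data never opens the 'content' file.
assert read_family("row1", "anchor") == {"a.com": "link text"}
# Sparse rows simply have no entry in a family's file.
assert read_family("row2", "anchor") == {}
```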

Graph-based

  • Focus on modeling the structure of data(interconnectivity).
  • Scales to the complexity of data.
  • Inspired by mathematical graph theory (G = (V, E)).
  • Interfaces and query languages vary.
  • Single-step vs path expressions vs full recursion.
  • Examples: Neo4j, FlockDB, InfoGrid, ...

Pros and Cons of NoSQL:

Advantages:

  1. Massive scalability.
  2. High availability.
  3. Lower cost (than competitive solutions at that scale).
  4. Predictable elasticity.
  5. Schema flexibility, sparse & semi-structured data.

Disadvantages:

  1. Doesn't fully support relational features.
  2. Eventual consistency is not intuitive to program for.
  3. Not always easy to integrate with other applications that support SQL.
  4. Relaxed ACID -> fewer guarantees.

NoSQL databases cover only a part of data-intensive cloud applications (mainly web applications).

Problems with cloud computing:

  • SaaS
  • Hybrid solutions

Introduction to HBase

HBase is an open-source, distributed, column-oriented database built on top of HDFS and based on BigTable.

  • Distributed - uses HDFS for storage.
  • Row/Column store.
  • Column-oriented and sparse - nulls don't occupy any storage space.
  • Multi-dimensional (versions).
  • Untyped - stores byte[].

HBase is part of Apache Hadoop.
HBase is the Hadoop application to use when you require real-time random read/write access to very large datasets; it is designed for low-latency random access.

HBase is a sparse, distributed, persistent multi-dimensional sorted map.

Sparse:

  • Sparse data is supported with no waste of costly storage space.
  • HBase can handle the fact that we don't know in advance what information we will store.
  • HBase is a schema-less data store; that is, it is fluid: we can add to, subtract from, or modify the schema.

Distributed and persistent:

  • Persistent simply means that the data you store in HBase will persist after your program or session ends.
  • HDFS is an open-source implementation of GFS; HBase is an open-source implementation of BigTable.
  • HBase leverages HDFS to persist its data to disk storage.
  • By storing data in HDFS, HBase offers reliability, availability, seamless scalability, and high performance, all on cost-effective distributed servers.

Multi-dimensional sorted map:

  • A map (also known as an associative array) is an abstract collection of key-value pairs, where each key is unique.
  • Keys are stored in HBase in sorted order.

HBase vs HDFS

Both are distributed systems that scale to hundreds or thousands of nodes.

HDFS is good for batch processing (scans over big files):

  • Not good for record lookup.
  • Not good for incremental addition of small batches.
  • Not good for updates.

HBase is designed to efficiently address the above points.
HBase updates are done by creating new versions of values.


[Figure: HBase vs HDFS comparison]

Too big, or not too big

Two types of data: too big, or not too big.
If the data is not too big, a relational database should be used.
If the data is too big, the relational data model doesn't scale easily.

You need to:

  • Add indexes.
  • Write really complex SQL queries.
  • De-normalize.
  • Cache

HBase data model

Table: Design-time namespace, has multiple sorted rows.

Row:

  • Atomic key/value container, with one row key.
  • Rows are sorted lexicographically by row key.

Column: A column consists of a column family and a column qualifier, delimited by a : (colon) character.
Table schema only defines its Column Families.
Rows get flushed to disk periodically.
The table can be broken up into different files with different properties, and reads can look at just a subset.

Column:

  • A column qualifier is added to a column family to provide the index for a given piece of data.
  • Given a column family content, a column qualifier can take various forms (e.g., content:html or content:pdf).
  • Column families are fixed at table creation, but column qualifiers are mutable and may differ greatly between rows.

A column family divides columns into physical files.

  • Columns within the same family are stored together.

Why?

  • Table is sparse, many columns.
  • No need to scan the whole row when accessing a few columns.
  • Having one file per column would generate too many files.

Timestamp: long milliseconds, sorted descending

  • A timestamp is written alongside each value, and is the identifier for a given version of a value.
  • By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.

Cell:
A combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value's version.

(Row, Family:Column, Timestamp) -> Value
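The mapping above can be sketched as a Python map keyed by (row, column), with versions kept newest-first (a toy model of the logical data model, not HBase's implementation; `put`/`get` here are invented helpers):

```python
# Logical model: (row, family:qualifier, timestamp) -> value,
# with timestamps sorted descending so the newest version comes first.

cells = {}   # {(row, column): [(ts, value), ...], newest first}

def put(row, column, value, ts):
    """Write a new version of a cell; old versions are retained."""
    versions = cells.setdefault((row, column), [])
    versions.append((ts, value))
    versions.sort(reverse=True)           # newest timestamp first

def get(row, column, ts=None):
    """Latest version, or the newest version at or before `ts`."""
    for t, v in cells.get((row, column), []):
        if ts is None or t <= ts:
            return v
    return None                           # empty cells are simply absent

put("row1", "cf:qual", "v1", ts=100)
put("row1", "cf:qual", "v2", ts=200)      # an update = a new version

assert get("row1", "cf:qual") == "v2"          # newest version wins
assert get("row1", "cf:qual", ts=150) == "v1"  # read as of an older time
assert get("row2", "cf:qual") is None          # sparse: nothing stored
```

This also shows why HBase updates are cheap: an update just appends a new (timestamp, value) version rather than rewriting data in place.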

HBase schema consists of several Tables.
Each table consists of a set of Column Families (columns are not part of the schema).

HBase has Dynamic Columns.

  • Because column names are encoded inside the cells.
  • Different cells can have different columns.

The version number can be user-supplied.

  • Version numbers are unique within each key.

Table can be very sparse.

  • Many cells are empty.

Keys are indexed as the primary key.
Each column family is stored in a separate file (an HFile). Column families are stored separately on disk: access one without wasting I/O on the others.
Key & Version numbers are replicated with each column family.
Empty cells are not stored.

HBase regions:
Each table is partitioned horizontally into regions, which are the HBase counterpart of HDFS blocks.
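Region lookup over the sorted row-key space can be sketched as a search over region start keys (the start keys below are invented examples; in HBase the first region's start key is the empty string):

```python
from bisect import bisect_right

# Each region covers a contiguous, sorted range of row keys.
# Region 0: [-inf, "g"), region 1: ["g", "p"), region 2: ["p", +inf).
region_start_keys = ["", "g", "p"]

def region_for(row_key: str) -> int:
    """Index of the region whose range contains `row_key`."""
    return bisect_right(region_start_keys, row_key) - 1

assert region_for("apple") == 0
assert region_for("grape") == 1
assert region_for("zebra") == 2
```

Because rows are sorted, any row key maps to exactly one region, and a region that grows too large can be split at a middle key without touching the rest of the table.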

HBase Architecture

  • Major Components
  1. The MasterServer (HMaster)
    There is one master server, responsible for coordinating the slaves.
    Assigns regions and detects failures.
    Performs admin functions.
  2. The RegionServer (HRegionServer)
    Many region servers.
    Each manages data regions (HRegion).
    Serves data for readers and writers.
  3. The HBase Client

HBase benefits

No real indexes.
Automatic partitioning.
Scales linearly and automatically with new nodes.
Commodity hardware.
Fault tolerance.
Batch processing.

When to use HBase?

  • You need random write, random read, or both (otherwise, stick to HDFS).
  • You need to do many thousands of operations per second on multiple TB of data.
  • Your access patterns are well-known and simple.
