Google Cloud Data Engineer Exam - Bigtable Review Notes

Bigtable Summary

What is it?
-> relatively expensive, because you pay for every node in your cluster
-> with 10 nodes: about 100,000 queries per second at about 6 ms latency
-> low latency
-> high throughput -> fast
-> structured data
-> NOT transactional
-> NOT SQL
-> global availability
-> durable and replicated, so your data stays accessible
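
As a rough sizing aid, the figure above (10 nodes for about 100,000 QPS) can be turned into a per-node estimate. This is only a sketch: it assumes throughput scales linearly with node count, and the function name `nodes_needed` is made up for illustration. Real capacity depends on the workload.

```python
import math

# Assumption taken from the notes: 10 nodes serve about 100,000 QPS,
# which works out to roughly 10,000 QPS per node.
QPS_PER_NODE = 100_000 // 10


def nodes_needed(target_qps, qps_per_node=QPS_PER_NODE):
    """Return the smallest node count whose estimated capacity covers target_qps."""
    if target_qps <= 0:
        return 1  # hypothetical floor of one node for this sketch
    return math.ceil(target_qps / qps_per_node)


print(nodes_needed(100_000))  # 10
print(nodes_needed(250_000))  # 25
print(nodes_needed(12_000))   # 2
```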


Serverless?
No

Benefits

  • Incredible scalability. Cloud Bigtable scales in direct proportion to the number of machines in your cluster. A self-managed HBase installation has a design bottleneck that limits the performance after a certain QPS is reached. Cloud Bigtable does not have this bottleneck, and so you can scale your cluster up to handle more queries.
  • Simple administration. Cloud Bigtable handles upgrades and restarts transparently, and it automatically maintains high data durability. To replicate your data, simply add a second cluster to your instance, and replication starts automatically. No more managing masters or regions; just design your table schemas, and Cloud Bigtable will handle the rest for you.
  • Cluster resizing without downtime. You can increase the size of a Cloud Bigtable cluster for a few hours to handle a large load, then reduce the cluster's size again—all without any downtime. After you change a cluster's size, it typically takes just a few minutes under load for Cloud Bigtable to balance performance across all of the nodes in your cluster.

What is it good for?
Time-series data is a natural fit for Cloud Bigtable. Typical uses:

  • Time-series data, such as CPU and memory usage over time for multiple servers.
  • Marketing data, such as purchase histories and customer preferences.
  • Financial data, such as transaction histories, stock prices, and currency exchange rates.
  • Internet of Things data, such as usage reports from energy meters and home appliances.
  • Graph data, such as information about how users are connected to one another.

How to use it?

cbt

  • a command-line interface for performing several different operations on Cloud Bigtable.

HBase shell

  • use the HBase shell to connect to a Cloud Bigtable instance, perform basic administrative tasks, and read and write data in a table

Indexing
-> rows are indexed only by row key; no other column can be indexed
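
To see why this matters, here is a toy model in plain Python (not the Bigtable client API; the table contents and function names are invented). Rows are kept sorted by row key, so a lookup by key prefix can jump straight to the right place, while a lookup by any other column has to examine every row.

```python
from bisect import bisect_left

# Hypothetical toy table: row key -> column values.
table = {
    "user#alice#2018-06-01": {"country": "AU"},
    "user#alice#2018-06-02": {"country": "AU"},
    "user#bob#2018-06-01": {"country": "NZ"},
    "user#carol#2018-06-01": {"country": "AU"},
}
sorted_keys = sorted(table)  # rows are stored in row-key order


def scan_prefix(prefix):
    """Cheap: binary-search to the first matching key, then read consecutive rows."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def filter_by_column(column, wanted):
    """Costly: there is no index on columns, so every row is checked."""
    return [key for key in sorted_keys if table[key].get(column) == wanted]


print(scan_prefix("user#alice#"))
print(filter_by_column("country", "AU"))
```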

Design
In summary, strike a balance between:

Distributing the read load across tablets (you don't want every read to hit a single tablet)
AND
Distributing the write load across tablets (you don't want every write to hit a single tablet)
AND
Designing the row key so that common queries return consecutive rows
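
The last point can be illustrated with a small sketch. It assumes a composite key of the form server#metric#timestamp (this layout, the server names, and `make_row_key` are illustrative choices, not something prescribed by these notes). The timestamp is zero-padded so that string order matches numeric order.

```python
def make_row_key(server_id, metric, epoch_seconds):
    """Build a composite row key; zero-padding keeps timestamps in sortable order."""
    return f"{server_id}#{metric}#{epoch_seconds:010d}"


# Hypothetical readings, deliberately out of order.
readings = [
    ("server-b", "cpu", 1530000060),
    ("server-a", "mem", 1530000000),
    ("server-a", "cpu", 1530000120),
    ("server-b", "cpu", 1530000000),
    ("server-a", "cpu", 1530000000),
    ("server-a", "cpu", 1530000060),
]
keys = sorted(make_row_key(*r) for r in readings)  # storage order

# Common query: all CPU readings for server-a within a time window.
start_key = make_row_key("server-a", "cpu", 1530000000)
end_key = make_row_key("server-a", "cpu", 1530000120)
hits = [k for k in keys if start_key <= k <= end_key]
positions = [keys.index(k) for k in hits]

print(hits)
print(positions)  # the matching rows sit next to each other
```

Because the matching rows are adjacent in key order, the query becomes a single range scan rather than many scattered lookups.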

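First, check whether the thing you need to query is in the row key.

Then check whether the key contains any of the following, to avoid hotspotting: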

Avoid using a row key that is a domain or starts with a domain (a domain can still appear later in the key)

-> because certain domains are far more active than others

-> the tablets holding the rows for those busy domains will become hotspots

Avoid using the user ID as the row key if user IDs are assigned sequentially

-> it is OK if the user ID is assigned randomly, e.g. derived from a hash code

-> because in many applications, newer users are more active than users created 6-7 years ago

-> so if user IDs are assigned in sequential order, the tablets that hold new users will tend to be more active -> hotspotting
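
One way to get the "randomly assigned" effect is to prefix each ID with a hash of itself. The sketch below is an assumption, not a method from these notes: it uses MD5 purely as a cheap, deterministic scrambler, and the name `hashed_user_key` is made up.

```python
import hashlib


def hashed_user_key(user_id):
    """Prefix the ID with part of its hash so neighbouring IDs sort far apart."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return f"{digest[:8]}#{user_id}"


user_ids = list(range(1000, 1100))  # hypothetical sequential IDs
keys = [hashed_user_key(uid) for uid in user_ids]

# In key order, the newest users are no longer clustered together.
for key in sorted(keys)[:5]:
    print(key)
```

The trade-off is that a range scan over consecutive user IDs is no longer possible; the key for a single user can still be recomputed from the ID.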

Avoid using a static identifier as the row key, especially one that keeps getting reused

-> if the row key is just "memory usage", "CPU usage", or "disk usage" and you update those rows over and over, the nodes that serve this constantly updated data get overworked

Avoid using a date as the row key (or as its leading part): most writes carry the latest date and therefore land on the same tablet -> hotspotting
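
The effect can be shown with a toy simulation. It compares where new writes would land in sorted key order for a date-first key versus a key that puts an identifier before the timestamp. The sensor names and both key layouts are hypothetical, and identifier-first is only one common remedy.

```python
from bisect import bisect_left

sensors = ["sensor-a", "sensor-b", "sensor-c", "sensor-d"]  # hypothetical


def date_first(sensor, ts):
    return f"{ts:010d}#{sensor}"


def id_first(sensor, ts):
    return f"{sensor}#{ts:010d}"


def insert_positions(make_key, old_times=range(100), new_time=100):
    """Return where each sensor's newest write falls among the existing sorted keys."""
    existing = sorted(make_key(s, t) for s in sensors for t in old_times)
    return [bisect_left(existing, make_key(s, new_time)) for s in sensors]


print(insert_positions(date_first))  # [400, 400, 400, 400]
print(insert_positions(id_first))    # [100, 200, 300, 400]
```

With the date first, every new write sorts after all existing rows, so they all pile onto the end of the key space. With the identifier first, the new writes are spread across it.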
