这周一直在准备谷歌云数据工程师的考试 (Google Cloud Data Engineer Certification),就把一篇学习笔记贴上来。
时间仓促暂时没法翻译成中文,希望看官们谅解。
BigQuery Summary
What is?
Google's fully managed, petabyte scale, low cost analytics data warehouse.
Serverless?
Yes
Benefits
Allows near real-time analysis
-> not totally real-time
-> if need milliseconds delay in transactional database, BigQuery is not the answer; the answer will be Cloud SQL or Spanner
-> but it is a good choice for ad-hoc analysis of very large datasets
-> data warehousing and business intelligence
Completely no-ops
-> no need to create cluster before querying
-> pay for amount of data processed
Pricing
-> pay for use
OR
-> flat-rate pricing
Durable
-> if you put data in BigQuery, it is just like GCS
Immutable audit logs
-> audit logs cannot be tampered with
-> you know if someone actually uses the dataset
A way to share data beyond silos of your company’s structure
-> cluster less, so can take BigQuery and share it with anybody under your company’s domain
-> anybody that you allow to access your table is able to access your table without any ops that is typically required to dealt with
-> no need to worry that “how do I log in the person’s cluster so I can use their data” it is cluster free!
-> BigQuery can become the form by which you get data collaboration across your company -> way of sharing analysis, and that is live!
How to use?
- BigQuery web UI
- BigQuery command-line tool's bq query command
- BigQuery REST API to programmatically call the jobs.query or query-type jobs.insert methods
- BigQuery client libraries
How to load data?
You can load data:
- From Google Cloud Storage
- By inserting individual records using streaming inserts
- Using a Google Cloud Dataflow pipeline to write data to BigQuery
- From a readable data source (such as your local machine)
- From other Google services, such as DoubleClick and Google AdWords
- Using DML statements to perform bulk inserts
Not loaded but can query external data sources
BigQuery offers support for querying data directly from:
- Google Cloud Bigtable
- Google Cloud Storage
- Google Drive
BigQuery supports loading data from Cloud Storage and readable data sources in the following formats:
Google Cloud Storage:
CSV
JSON (newline delimited only)
Avro
Parquet
ORC (Beta)
Google Cloud Datastore backups
Readable data source (such as your local machine):
CSV
JSON (newline delimited only)
Avro
Parquet
ORC (Beta)
You can load data into a new table or partition, you can append data to an existing table or partition, or you can overwrite a table or partition. For more information on working with partitions, see Managing Partitioned Tables.
When you load data into BigQuery, you can supply the table or partition schema, or for supported data formats, you can use schema auto-detection.
How to load schema?
- Manually specify the schema:
- Using the BigQuery web UI.
- Inline using the CLI.
- Create a schema file in JSON format.
- Call the jobs.insert method and configure the configuration.load.schema property.
- Call the tables.insert method and configure the schema in the table resource using theschema property.
Indexing
BigQuery does not use or support indexes.
Cost Reduction best practice
- Avoid SELECT *. Query only the columns that you need.
- Don't run queries to explore or preview table data. Use preview options.
- Before running queries, estimate costs.
- Use the maximum bytes billed setting to limit query costs.
- Do not use a LIMIT clause as a method of cost control.
- Create a dashboard to view your billing data so you can make adjustments to your BigQuery usage. Also consider streaming your audit logs to BigQuery so you can analyze usage patterns.
- Partition your tables by date.
- If possible, materialize your query results in stages.
- If you are writing large query results to a destination table, use the default table expiration time to remove the data when it's no longer needed.
- Use streaming inserts only if your data must be immediately available.
Performance Improve best practice
- Avoid self-joins. Use a window function instead.
- If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
- Avoid unbalanced joins
- Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.
- Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.
Billing
Free
-> loading
-> exporting
-> queries on metadata
-> cached queries: if you run a query and you run exactly the same query in that project, it’s free
-> per user cache
-> if two users on the same project, they don’t share the cache
-> query with errors
Storage
-> You get charged based on the amount of data in the table
-> Streaming data
-> get charged based on the ingest rate of that streaming data
-> automatically get a discount for older data
Processing
-> on-demand: charge for queries based on the amount of data that’s being processed, 1TB / month free
-> flat rate plan: talk with sales team
-> opt-in to run high-compute queries: JS UDF
总结就是:
data input: streaming insert收钱,其他不收
data storage: 收钱
data processing: 按量收
IAM / Permission
BigQuery uses Identity and Access Management (IAM) to manage access to resources. The three types of resources available in BigQuery are organizations, projects, and datasets. In the IAM policy hierarchy, datasets are child resources of projects. Tables and views are child resources of datasets — they inherit permissions from their parent dataset.
To grant access to a resource, assign one or more roles to a user, group, or service account. Organization and project roles affect the ability to run jobs or manage the project's resources, whereas dataset roles affect the ability to access or modify the data inside of a particular dataset.
IAM provides two types of roles: predefined and primitive roles. When you assign both predefined and primitive roles to a user, the permissions granted are a union of each role's permissions.
View Access Management
You cannot assign access controls directly to views. You can control view access by configuring access controls at the dataset level or at the project level.
Dataset-level access controls specify the operations users, groups, and service accounts are allowed to perform on views in that specific dataset. If you assign only dataset-level permissions, you must also assign a primitive or predefined, project-level role that provides access to the project, for example, bigquery.user.
Instead of granting access to individual datasets, you can assign predefined, project-level IAM roles that grant permissions to all views in all datasets in a project.
You can also create IAM custom roles. If you create a custom role, the permissions you grant depend on the view operations you want the user, group, or service account to be able to perform.
For more information on roles and permissions, see:
- Understanding Roles in the IAM documentation
- BigQuery Access Controls
- Assigning access controls to datasets.
Encryption
By default, BigQuery encrypts customer content stored at rest. BigQuery handles and manages this default encryption for you without any additional actions on your part. First, data in a BigQuery table is encrypted using a data encryption key. Then, those data encryption keys are encrypted with key encryption keys, which is known as envelope encryption. Key encryption keys do not directly encrypt your data but are used to encrypt the data encryption keys that Google uses to encrypt your data.
If you want to control encryption yourself, you can use customer-managed encryption keys (CMEK) for BigQuery. Instead of Google managing the key encryption keys that protect your data, you control and manage key encryption keys in Cloud KMS.
BigQuery API
To use the Google BigQuery API, you must first authenticate to verify your client's identity. BigQuery authorizes access to resources based on the verified identity.
A service account is a Google account that is associated with your GCP project. Use a service account to access the BigQuery API if your application can run jobs associated with service credentials rather than an end-user's credentials, such as a batch processing pipeline.
Use user credentials to ensure that your application has access only to BigQuery tables that are available to the end user.