Thanks to my company for sponsoring the Google Cloud Platform (GCP) Coursera courses (https://www.coursera.org/), which cover cloud infrastructure, application development, and data lakes and data warehouses.
The hands-on lab platform for Google Cloud is https://www.qwiklabs.com/. The Google Cloud Coursera certificates I earned (which include the Qwiklabs labs) are as follows:
2020/3/26-2020/4/1 | Essential Google Cloud Infrastructure: Core Services | Certified |
2020/4/2-2020/4/5 | Essential Google Cloud Infrastructure: Foundation | Certified |
2020/4/6-2020/4/11 | Essential Google Cloud Infrastructure: Core Services | Certified |
2020/4/12-2020/4/16 | Elastic Google Cloud Infrastructure: Scaling and Automation | Certified |
2020/4/17-2020/4/21 | Reliable Google Cloud Infrastructure: Design and Process | Certified |
2020/4/22-2020/4/26 | Getting Started With Application Development | Certified |
2020/4/24-2020/5/10 | Modernizing Data Lakes and Data Warehouses with GCP | Certified |
Building Batch Data Pipelines on GCP |
I also recommend the book Google Cloud Platform in Action by John J. Geewax as further reading.
Table of Contents
What is Cloud Computing
Characteristics of Cloud Computing
Categories of Cloud Computing
Region and Zone
IAM
VPC Network
GCP Services
Computing
Google Kubernetes Engine (GKE)
Cloud Storage
Data & Analytics
Cloud SQL
Cloud Spanner
DataStore
Bigtable
BigQuery
Dataproc
Cloud Pub/Sub
Datalab
Comparing
Cloud Composer & Apache Airflow
Data Catalog
Google Data Studio
Monitoring
Logging
Machine Learning
Cloud Build
Cloud Run & Cloud Functions & App Engine
Management Tool
Pricing
Google Cloud homepage: https://cloud.google.com/
First, GCP is short for Google Cloud Platform. GCP mainly comprises four major categories of services: Compute, Storage, Big Data, and Machine Learning (AI); there are also other categories such as Networking, Pricing, SDKs, Management Tools, IoT, and Mobile.
By service model, cloud computing can be roughly divided into three layers: IaaS, PaaS, and SaaS.
Infrastructure as a Service (IaaS): provides IT infrastructure capabilities (compute, storage, networking, etc.) to users over the network.
Platform as a Service (PaaS): provides an environment for deploying and running application software on top of the cloud infrastructure.
Software as a Service (SaaS): a model in which software applications are delivered as a service over the network.
To use building a house as an analogy: IaaS only gives you a plot of land, and after buying it you still have to do all the work yourself; PaaS builds the building on that land for you, so you only need to decorate it before moving in; SaaS not only builds the building but also decorates it, so you can move in as soon as you buy it.
By target users, cloud computing is divided into public cloud, private cloud, hybrid cloud, and industry cloud (dedicated cloud).
Regions and zones. Each region contains multiple zones, and network latency between zones in the same region is typically under 5 milliseconds. For disaster recovery, you can distribute your application across multiple regions.
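To see which regions and zones are available to your project, the following gcloud commands can be used (the region in the filter is just an example value):
gcloud compute regions list
gcloud compute zones list --filter="region:us-east1"   # zones in one region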
Identity and Access Management (IAM).
It consists of three parts:
Who:
Can be defined by a Google account, a Google group, or a service account.
Can do what: defined by an IAM role, which is a collection of permissions.
There are three types of roles:
Primitive role
Predefined role
Custom role: can only be defined at the organization or project level, not in folders
On which resource
GCP resource hierarchy:
Policies can be defined at the organization, folder, and project levels, and they are inherited down the hierarchy.
Projects are the main way you organize your GCP resources.
Each project has:
Project ID: immutable (assigned by you)
Project Name: mutable (assigned by you)
Project number: immutable (assigned by GCP)
Policies defined at the organization level are inherited by all children.
GCP applies the principle of least privilege in managing any kind of compute infrastructure.
The policies implemented at a higher level in this hierarchy can't take away access that's granted at a lower level.
E.g.: if you grant the Editor role on the organization and the Viewer role on a folder, the folder effectively gets the Editor role, because the less restrictive parent policy wins.
Projects can have different owners and users; they are built separately and managed separately.
When you use GCP, Google handles most of the lower security layers, while the upper layers remain the customer's responsibility.
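A minimal sketch of granting a predefined role with gcloud; the project name and user email below are placeholders:
# Grant a predefined role on a project, then inspect the resulting policy
gcloud projects add-iam-policy-binding my-project \
    --member=user:alice@example.com \
    --role=roles/bigquery.dataViewer
gcloud projects get-iam-policy my-project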
Virtual Private Cloud: it connects your GCP resources to each other and to the internet.
In the example below, us-east1-b and us-east1-c are on the same subnet but in different zones
VPCs have routing tables, and you can define firewall rules in terms of tags on Compute Engine instances.
VPC Peering: establishes a peering relationship between VPC networks, including networks in different projects.
Shared VPC: share a single VPC network across projects and use IAM to control which projects and users can use it.
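A minimal sketch of creating a custom-mode VPC, a subnet, and a tag-based firewall rule with gcloud; the names and IP range are placeholders:
gcloud compute networks create my-vpc --subnet-mode=custom
gcloud compute networks subnets create my-subnet \
    --network=my-vpc --region=us-east1 --range=10.0.1.0/24
# Allow SSH only to instances carrying the "ssh-allowed" tag
gcloud compute firewall-rules create allow-ssh \
    --network=my-vpc --allow=tcp:22 --target-tags=ssh-allowed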
1. The four main categories of GCP services are:
2. There are four ways to interact with GCP:
GCP console
https://cloud.google.com/console
Cloud Shell and Cloud SDK
These include gcloud, gsutil (Cloud Storage), bq (BigQuery), and so on.
As shown in the figure above, click the Activate Cloud Shell icon next to the user avatar, and a shell command line appears at the bottom of the web console.
You can click "Open Editor":
Click the "Open Terminal" button to return to the command-line interface.
For local use, download the official Google Cloud SDK from https://cloud.google.com/sdk/docs/install. On Windows you need to add the bin directory to PATH; other systems also require configuring environment variables.
Initialize the SDK: gcloud init
gcloud config list
gcloud info
gcloud compute instances list
gcloud components list
gcloud components update
gcloud auth list
export GOOGLE_APPLICATION_CREDENTIALS, etc.
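A minimal sketch of authenticating as a service account and pointing client libraries at its key via GOOGLE_APPLICATION_CREDENTIALS; the service account and file name are placeholders, and the service account is assumed to already exist:
# Create a key for an existing service account and activate it
gcloud iam service-accounts keys create key.json \
    --iam-account=my-sa@my-project.iam.gserviceaccount.com
gcloud auth activate-service-account --key-file=key.json
# Client libraries pick up the key through this environment variable
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/key.json"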
API
APIs Explorer is an interactive tool that lets you easily try GCP APIs using a browser
https://developers.google.com/apis-explorer
Use libraries within your code
Cloud Client Libraries: https://cloud.google.com/apis/docs/cloud-client-libraries
Google API Client Libraries: https://developers.google.com/api-client-library
Cloud Console Mobile App
3. Cloud Marketplace (formerly Cloud Launcher)
It lets you quickly deploy software packages on GCP, for example a LAMP (Linux + Apache + MySQL + PHP) stack.
As an example, I deployed a blog on a LAMP (Linux + Apache + MySQL + PHP) stack; the final result looks like this:
Google's cloud computing services can be categorized as follows:
Compute Engine is IaaS, Kubernetes Engine is hybrid, App Engine is PaaS, and Cloud Functions is serverless.
Google Kubernetes Engine (GKE) provides container orchestration: it manages and scales applications. A Pod is the smallest deployable unit in Kubernetes.
In GKE, a node is a VM running in Compute Engine. A Pod often has a single container, but it can have multiple containers, in which case the containers share networking and the same disk storage volumes.
For a demo and the common commands, see the official tutorial Deploying a containerized web application: https://cloud.google.com/kubernetes-engine/docs/tutorials/hello-app
docker build -t gcr.io/${PROJECT_ID}/hello-app:v1 .
Run the docker images command to verify that the build succeeded:
docker images
Test the container image using your local Docker engine:
docker run --rm -p 8080:8080 gcr.io/${PROJECT_ID}/hello-app:v1
You must upload the container image to a registry so that your GKE cluster can download and run it. In Google Cloud, Container Registry is enabled by default.
Enable the Container Registry API for the Google Cloud project you are working in:
gcloud services enable containerregistry.googleapis.com
Configure the Docker command-line tool to authenticate to Container Registry:
gcloud auth configure-docker
Push the Docker image you just built to Container Registry:
docker push gcr.io/${PROJECT_ID}/hello-app:v1
Create a cluster named hello-cluster:
Standard cluster:
gcloud container clusters create hello-cluster
Autopilot cluster:
gcloud container clusters create-auto hello-cluster
Creating and health-checking the GKE cluster takes a few minutes.
After the command finishes, run the following command to see the cluster's three worker VM instances:
gcloud compute instances list
Now you can deploy the Docker image you built to the GKE cluster.
Create a Kubernetes Deployment for the hello-app Docker image:
kubectl create deployment hello-app --image=gcr.io/${PROJECT_ID}/hello-app:v1
In older versions this was done with kubectl run.
Set the baseline number of Deployment replicas to 3:
kubectl scale deployment hello-app --replicas=3
Create a HorizontalPodAutoscaler resource for your Deployment:
kubectl autoscale deployment hello-app --cpu-percent=80 --min=1 --max=5
To see the Pods created, run the following command:
kubectl get pods
Output: NAME READY STATUS RESTARTS AGE
hello-app-784d7569bc-hgmpx 1/1 Running 0 10s
hello-app-784d7569bc-jfkz5 1/1 Running 0 10s
hello-app-784d7569bc-mnrrl 1/1 Running 0 15s
Use the kubectl expose command to generate a Kubernetes Service for the hello-app Deployment:
kubectl expose deployment hello-app --name=hello-app-service --type=LoadBalancer --port 80 --target-port 8080
Here, the --port flag specifies the port number configured on the load balancer, and the --target-port flag specifies the port number that the hello-app container is listening on.
Run the following command to get the details of hello-app-service:
kubectl get service
Copy the EXTERNAL_IP address to the clipboard (for example: 203.0.113.0).
Note: provisioning the load balancer may take a few minutes. Until the load balancer is provisioned, you might see a pending IP address.
Now that the hello-app Pods are exposed to the internet through a Kubernetes Service, you can open a new browser tab and navigate to the Service IP address you copied. You will see a Hello, World! message along with a Hostname field. The Hostname corresponds to the one of the three hello-app Pods that served the HTTP request to your browser.
In this section, you will upgrade hello-app to a new version by building a new Docker image and deploying it to the GKE cluster.
GKE's rolling update feature lets you update a Deployment without downtime. During a rolling update, the GKE cluster incrementally replaces the existing hello-app Pods with Pods containing the Docker image for the new version. During the update, the load balancer service routes traffic only to the available Pods.
Return to Cloud Shell, where you have already cloned the hello app source code and Dockerfile. Update the files in the project to the new version 2.0.0.
Build and tag a new hello-app Docker image:
docker build -t gcr.io/${PROJECT_ID}/hello-app:v2 .
Push the image to Container Registry:
docker push gcr.io/${PROJECT_ID}/hello-app:v2
Now you can update the hello-app Kubernetes Deployment to use the new Docker image.
Perform a rolling update of the existing Deployment by updating its image:
kubectl set image deployment/hello-app hello-app=gcr.io/${PROJECT_ID}/hello-app:v2
After the Pods running the v1 image stop, the system starts new Pods running the v2 image.
watch kubectl get pods
Output: NAME READY STATUS RESTARTS AGE
hello-app-89dc45f48-5bzqp 1/1 Running 0 2m42s
hello-app-89dc45f48-scm66 1/1 Running 0 2m40s
In a separate tab, navigate again to the external IP of hello-app-service. You should now see Version set to 2.0.0.
Cleanup
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the Service: this deallocates the Cloud Load Balancer created for the Service:
kubectl delete service hello-app-service
Delete the cluster: this deletes the resources that make up the cluster, such as the compute instances, disks, and network resources:
gcloud container clusters delete hello-cluster
Delete the container images: this deletes the Docker images pushed to Container Registry:
gcloud container images delete gcr.io/${PROJECT_ID}/hello-app:v1 --force-delete-tags --quiet
gcloud container images delete gcr.io/${PROJECT_ID}/hello-app:v2 --force-delete-tags --quiet
Here is a small experiment of my own:
You can see it under VM instances:
The result is as follows:
Object storage; each object can be accessed by a unique key. In Cloud Storage, every object has a URL, and that URL is immutable.
Cloud Storage keeps a modification history and stores object versions; you can list the versions and restore or delete them.
Cloud Storage provides lifecycle management; for example, you can delete objects older than 5 days.
Use cases:
serving website content
storing data for archival and disaster recovery
distributing large data objects to your end users via direct download
For most cases, IAM is sufficient, but if you need finer control, you can create ACLs (access control lists).
Each access control list consists of:
a user or group
a permission
Cloud Storage offers different storage classes: Multi-Regional, Regional, Nearline, Coldline
3 Ways to bring data into Cloud Storage:
Online Transfer
Storage Transfer Service
Transfer Appliance
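A minimal sketch of the online-transfer option using gsutil; the bucket name and files are placeholders, and lifecycle.json is assumed to contain a lifecycle rule (for example, delete objects older than 5 days):
gsutil mb -l us-east1 gs://my-example-bucket          # create a bucket
gsutil cp ./backup.tar.gz gs://my-example-bucket/archive/
gsutil ls -l gs://my-example-bucket/archive/
gsutil lifecycle set lifecycle.json gs://my-example-bucket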
Cloud SQL is a managed RDBMS that currently supports MySQL, PostgreSQL, and SQL Server. The maximum data size is 10 TB; if your data is larger than 10 TB, Cloud Spanner is recommended.
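A minimal sketch of provisioning and connecting to a Cloud SQL for MySQL instance with gcloud; the instance name, database name, tier, and region are placeholders:
gcloud sql instances create my-sql-instance \
    --database-version=MYSQL_8_0 --tier=db-f1-micro --region=us-east1
gcloud sql databases create mydb --instance=my-sql-instance
gcloud sql connect my-sql-instance --user=root   # opens a mysql client session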
horizontally scalable RDBMS
When should you use it?
Spanner vs Cloud SQL
Spanner is not compatible with MySQL/PostgreSQL/SQL Server.
Spanner architecture
Each project can have only one Datastore.
When to use Datastore
The application needs to scale
ACID transactions, e.g. transferring funds
Use cases: product catalog with real-time inventory; user profiles for mobile apps; game save state.
When not to use Datastore
Relational Database vs Datastore
Queries and indexes
Queries
Indexes
queries get results from indexes
Caveat: avoid over-using indexes
Solutions:
Data consistency
Performance vs Accuracy
Here is an example of the entity details:
The JSON structure returned by the program is:
The result is as follows:
NoSQL, with high throughput for both reads and writes and low latency. Major Google products such as Google Analytics and Gmail use Bigtable.
Bigtable's hierarchy involves instances, clusters, and nodes, while the data model of each instance involves tables, rows, column families, and column qualifiers.
The table design is shown in the figure:
The row key is the only indexed item.
It offers an API similar to HBase's; as we know, HBase was open-sourced based on the design described in Google's 2006 Bigtable paper.
Differences:
Bigtable can scale and be managed quickly and easily (Bigtable can scale to a larger number of nodes more easily, so it can handle more overall throughput for a given instance. HBase's design requires a single master node to handle failover and other administrative operations, which means that as you add more and more nodes, up to thousands, to handle more and more requests, the master node becomes a performance bottleneck).
Bigtable encrypts data in-flight and at rest
Access to Bigtable can be controlled with IAM
Bigtable infrastructure
When you first start writing data, a Bigtable cluster may place most of the data on a single node.
At startup, Bigtable may place the data on a single node.
As more tablets accumulate on a single node, the cluster may relocate some of them to another node so that the data is redistributed more evenly:
As more data is written over time, some tablets may be accessed more frequently than others. As shown in the figure below, three tablets account for 35% of all read queries in the system.
In such a scenario, several hot tablets sit on one node, and Bigtable rebalances the cluster by moving some of the less frequently accessed tablets to other nodes with more capacity, so that each of the three nodes sees one third of the total traffic:
It can also happen that a single tablet becomes too hot (it is written to or read from too frequently). Simply moving the tablet to another node would not solve the problem; instead, Bigtable can split that tablet and then rebalance:
The most important thing is to choose row keys carefully so that they do not concentrate traffic in one place.
Hands-on practice:
Console operations:
In the Cloud Console, navigate to Bigtable in the left navigation bar and create an instance.
After filling in the Instance ID and the related information:
With Node.js, before writing any code that interacts with Cloud Bigtable you need to install the client by running npm install @google-cloud/bigtable.
Once the client is installed, you can test it by listing instances and clusters, as shown below:
const bigtable = require('@google-cloud/bigtable')({
  projectId: 'your-project-id'
});
const instance = bigtable.instance('test-instance');
instance.createTable('todo', {
  families: ['completed']
}).then((data) => {
  const table = data[0];
  console.log('Created table', table.id);
});
Command-line operations:
install cbt in Google Cloud SDK
gcloud components update
gcloud components install cbt
set env variable
create table
list table
add column family
list column family
add value to row1, column family cf1, column qualifier c1
read table
delete table
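A minimal sketch of typical cbt commands for the steps above; the project, instance, table, and value names are all placeholders:
echo "project = my-project" > ~/.cbtrc      # set env variables used by cbt
echo "instance = my-instance" >> ~/.cbtrc
cbt createtable my-table                    # create table
cbt ls                                      # list tables
cbt createfamily my-table cf1               # add column family
cbt ls my-table                             # list column families
cbt set my-table row1 cf1:c1=test-value     # add value to row1, column family cf1, qualifier c1
cbt read my-table                           # read table
cbt deletetable my-table                    # delete table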
Data warehouse, for near real-time analysis of petabyte-scale data.
How BigQuery works
Structure
IAM
Command-line mode:
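A minimal sketch of the bq command-line tool; the dataset, table, and file names are placeholders:
bq ls                                        # list datasets in the current project
bq mk mydataset                              # create a dataset
bq load --autodetect --source_format=CSV mydataset.mytable ./data.csv
bq query --use_legacy_sql=false \
    'SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`'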
A BigQuery example
Find correlation between rain and bicycle rentals
How about joining the bicycle rentals data against weather data to learn whether there are fewer bicycle rentals on rainy days?
Use the public datasets provided by GCP:
After the data is imported successfully, enter the following SQL in the query editor:
WITH bicycle_rentals AS (
SELECT
COUNT(starttime) as num_trips,
EXTRACT(DATE from starttime) as trip_date
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
GROUP BY trip_date
),
rainy_days AS
(
SELECT
date,
(MAX(prcp) > 5) AS rainy
FROM (
SELECT
wx.date AS date,
IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp
FROM
`bigquery-public-data.ghcn_d.ghcnd_2015` AS wx
WHERE
wx.id = 'USW00094728'
)
GROUP BY
date
)
SELECT
ROUND(AVG(bk.num_trips)) AS num_trips,
wx.rainy
FROM bicycle_rentals AS bk
JOIN rainy_days AS wx
ON wx.date = bk.trip_date
GROUP BY wx.rainy
The query result is:
This, together with Stackdriver Logging and Stackdriver Monitoring, provides a complete data platform.
Cloud Dataproc has two ways to customize clusters: optional components and initialization actions. Pre-configured optional components can be selected when deploying via the console or the command line and include Anaconda, Jupyter Notebook, Zeppelin Notebook, Presto, and ZooKeeper.
Setup(Create a cluster):
Configure:
For configuration, the cluster can be set up as a single VM, which is usually done to keep costs down for development and experimentation. Standard mode has a single master node, and high-availability mode has three master nodes. You can choose between a region and a zone, or select the global region and allow the service to choose the zone for you. The cluster defaults to a global endpoint, but defining a regional endpoint may offer increased isolation and, in certain cases, lower latency. The master node is where the HDFS NameNode runs, as well as the YARN ResourceManager and job drivers. HDFS replication defaults to two in Cloud Dataproc. Optional components from the Hadoop ecosystem include Anaconda (a Python distribution and package manager), WebHCat, Jupyter Notebook, and Zeppelin Notebook. Cluster properties are runtime values that can be used by configuration files for more dynamic startup options, and user labels can be used to tag your cluster for your own solutions or reporting purposes. The master node, worker nodes, and preemptible worker nodes (if enabled) have separate VM options such as vCPU, memory, and storage. Preemptible nodes run the YARN NodeManager, but they don't run HDFS. There is a minimum number of worker nodes (the default is two); the maximum number of worker nodes is determined by a quota and the number of SSDs attached to each worker. You can also specify initialization actions, such as the initialization script we saw earlier, to further customize your worker nodes on startup. Metadata can be defined so that the VMs share state information with each other. This may be the first time you have seen preemptible nodes offered as an option for your cluster.
Optimize:
The main reason to use preemptible VMs (PVMs) is to lower costs for fault-tolerant workloads. PVMs can be pulled from service at any time within 24 hours, but if the workload and the cluster architecture are a healthy mix of VMs and PVMs, you may be able to withstand the interruptions and get a significant discount in the cost of running your job. Custom machine types allow you to specify the balance of memory and CPU to tune the VM to the load, so you are not wasting resources. A custom image can be used to pre-install software, so that the customized node becomes operational in less time than if you installed the software at boot time using an initialization script. You can also use a persistent SSD boot disk for faster cluster startup.
Dataproc performance optimization
Utilize: (how do you submit a job to Cloud Dataproc for processing? )
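A minimal sketch of submitting jobs with gcloud; the region, bucket, and job files are placeholders (the cluster name reuses sparktodp from the lab below):
# Submit a PySpark job stored in Cloud Storage
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/analysis.py \
    --cluster=sparktodp --region=us-central1
# Submit the built-in SparkPi example as a Spark job
gcloud dataproc jobs submit spark --cluster=sparktodp --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000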
Monitoring:
Use Stackdriver, or build a custom dashboard with graphs and set up monitoring alert policies that, for example, send emails so you are notified when incidents happen.
Any details from HDFS or YARN, metrics about a particular job, or overall cluster metrics such as CPU utilization, disk, and network usage can all be monitored and alerted on with Stackdriver.
Cloud Dataproc Initialization Actions
See: https://github.com/GoogleCloudDataproc/initialization-actions
There are a lot of pre-built startup scripts that you can leverage for common Hadoop cluster setup tasks, such as Flink, Jupyter, and more.
Use initialization actions to add other software to the cluster at startup (the cluster name below is a placeholder):
gcloud dataproc clusters create my-cluster --initialization-actions gs://$MY_BUCKET/hbase/hbase.sh --num-masters 3 --num-workers 2
It's pretty easy to adapt existing Hadoop code to use GCS instead of HDFS; it's just a matter of changing the storage prefix from hdfs:// to gs://.
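For example, on a Dataproc node the GCS connector lets standard Hadoop tooling address gs:// paths directly; the paths and bucket below are placeholders:
hadoop fs -ls hdfs:///user/my-user/data/
hadoop fs -ls gs://my-bucket/data/
# One-off copy of existing HDFS data into Cloud Storage
hadoop distcp hdfs:///user/my-user/data gs://my-bucket/data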
Converting from HDFS to Google Cloud Storage
Create a Dataproc cluster:
Enter sparktodp as the Cluster Name, choose the Image Type and Version, check Enable Gateway, and under Optional Components check Jupyter Notebook:
Click Notebook:
Click "OPEN JUPYTERLAB" to open Jupyter and run 01_spark.ipynb (Run All, or step through the cells one by one). It first reads the data into HDFS, and you can see:
Read the data:
Spark analysis:
One way is to use the DataFrame API:
The other is to use Spark SQL:
Execution result:
Finally, you can plot the attack_stats result above with matplotlib:
Replace HDFS by Google Cloud Storage
Load csv to BigQuery
bq mk sparktobq
BUCKET='cloud-training-demos-ml' # CHANGE
bq --location=US load --autodetect --source_format=CSV sparktobq.kdd_cup_raw gs://$BUCKET/kddcup.data_10_percent.gz
Using Cloud Functions, launch analysis every time there is a new file in the bucket. (serverless)
%%bash
wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
gunzip kddcup.data_10_percent.gz
BUCKET='cloud-training-demos-ml' # CHANGE
gsutil cp kdd* gs://$BUCKET/
bq mk sparktobq
%%writefile main.py
from google.cloud import bigquery
import google.cloud.storage as gcs
import tempfile
import os
def create_report(BUCKET, gcsfilename, tmpdir):
    """
    Creates report in gs://BUCKET/ based on contents in gcsfilename (gs://bucket/some/dir/filename)
    """
    # connect to BigQuery
    client = bigquery.Client()
    destination_table = 'sparktobq.kdd_cup'

    # Specify table schema. Autodetect is not a good idea for production code
    job_config = bigquery.LoadJobConfig()
    schema = [
        bigquery.SchemaField("duration", "INT64"),
    ]
    for name in ['protocol_type', 'service', 'flag']:
        schema.append(bigquery.SchemaField(name, "STRING"))
    for name in 'src_bytes,dst_bytes,wrong_fragment,urgent,hot,num_failed_logins'.split(','):
        schema.append(bigquery.SchemaField(name, "INT64"))
    schema.append(bigquery.SchemaField("unused_10", "STRING"))
    schema.append(bigquery.SchemaField("num_compromised", "INT64"))
    schema.append(bigquery.SchemaField("unused_12", "STRING"))
    for name in 'su_attempted,num_root,num_file_creations'.split(','):
        schema.append(bigquery.SchemaField(name, "INT64"))
    for fieldno in range(16, 41):
        schema.append(bigquery.SchemaField("unused_{}".format(fieldno), "STRING"))
    schema.append(bigquery.SchemaField("label", "STRING"))
    job_config.schema = schema

    # Load CSV data into BigQuery, replacing any rows that were there before
    job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.skip_leading_rows = 0
    job_config.source_format = bigquery.SourceFormat.CSV
    load_job = client.load_table_from_uri(gcsfilename, destination_table, job_config=job_config)
    print("Starting LOAD job {} for {}".format(load_job.job_id, gcsfilename))
    load_job.result()  # Waits for table load to complete.
    print("Finished LOAD job {}".format(load_job.job_id))

    # connections by protocol
    sql = """
        SELECT COUNT(*) AS count
        FROM sparktobq.kdd_cup
        GROUP BY protocol_type
        ORDER by count ASC
    """
    connections_by_protocol = client.query(sql).to_dataframe()
    connections_by_protocol.to_csv(os.path.join(tmpdir, "connections_by_protocol.csv"))
    print("Finished analyzing connections")

    # attacks plot
    sql = """
        SELECT
          protocol_type,
          CASE label
            WHEN 'normal.' THEN 'no attack'
            ELSE 'attack'
          END AS state,
          COUNT(*) as total_freq,
          ROUND(AVG(src_bytes), 2) as mean_src_bytes,
          ROUND(AVG(dst_bytes), 2) as mean_dst_bytes,
          ROUND(AVG(duration), 2) as mean_duration,
          SUM(num_failed_logins) as total_failed_logins,
          SUM(num_compromised) as total_compromised,
          SUM(num_file_creations) as total_file_creations,
          SUM(su_attempted) as total_root_attempts,
          SUM(num_root) as total_root_acceses
        FROM sparktobq.kdd_cup
        GROUP BY protocol_type, state
        ORDER BY 3 DESC
    """
    attack_stats = client.query(sql).to_dataframe()
    ax = attack_stats.plot.bar(x='protocol_type', subplots=True, figsize=(10, 25))
    ax[0].get_figure().savefig(os.path.join(tmpdir, 'report.png'));
    print("Finished analyzing attacks")

    bucket = gcs.Client().get_bucket(BUCKET)
    for blob in bucket.list_blobs(prefix='sparktobq/'):
        blob.delete()
    for fname in ['report.png', 'connections_by_protocol.csv']:
        bucket.blob('sparktobq/{}'.format(fname)).upload_from_filename(os.path.join(tmpdir, fname))
    print("Uploaded report based on {} to {}".format(gcsfilename, BUCKET))

def bigquery_analysis_cf(data, context):
    # check that trigger is for a file of interest
    bucket = data['bucket']
    name = data['name']
    if ('kddcup' in name) and not ('gz' in name):
        filename = 'gs://{}/{}'.format(bucket, data['name'])
        print(bucket, filename)
        with tempfile.TemporaryDirectory() as tmpdir:
            create_report(bucket, filename, tmpdir)

# test that the function works
import main as bq
BUCKET='cloud-training-demos-ml'  # CHANGE
try:
    bq.create_report(BUCKET, 'gs://{}/kddcup.data_10_percent'.format(BUCKET), "/tmp")
except Exception as e:
    print(e.errors)
gcloud functions deploy bigquery_analysis_cf --runtime python37 --trigger-resource $BUCKET --trigger-event google.storage.object.finalize
Verify that the Cloud Function is being run. You can do this from the Cloud Functions part of the GCP Console.
Once the function is complete (in about 30 seconds), see if the output folder contains the report:
gsutil ls gs://$BUCKET/sparktobq
Dataflow
is managed data pipelines
Processes data using Compute Engine
Clusters are sized for you
Automated scaling
Write code for batch and streaming
Auto scaling, No-Ops, Stream and Batch Processing
Built on Apache Beam
Pipelines are regional
Why use Cloud Dataflow?
ETL
Data analytics: batch or streaming
Orchestration: create pipelines that coordinate services, including external services
Integrates with GCP services
Data Processing
Solution:
Apache Beam + Cloud Dataflow
Data Transformation
Cloud Dataproc vs Cloud Dataflow
Key Terms
Element : single entry of data (eg. table row)
PCollection: Distributed data set, input and output
Transform: Data processing in pipeline
ParDo: Type of Transform
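A minimal sketch of running the canonical Apache Beam wordcount example on Dataflow; it assumes the apache-beam[gcp] package is installed, and the project, region, and bucket names below are placeholders:
pip install 'apache-beam[gcp]'
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project my-project \
    --region us-central1 \
    --temp_location gs://my-bucket/tmp/ \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://my-bucket/results/output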
is scalable, reliable messaging
Supports many-to-many asynchronous messaging
Push/pull to topics
Support for offline consumers
At least once delivery policy
Global scale messaging buffer/coupler
No-ops
Decouples senders and receivers
Equivalent to Kafka
At-least-once delivery
Pub/Sub overview
Topic: publisher sends messages to topic
Messages are stored in message store until they are delivered and acknowledged by subscribers
Pub/Sub forwards messages from a topic to subscribers. messages can be pushed by Pub/Sub to subscriber or pulled by subscribers from Pub/Sub
Subscriber receives pending messages from subscription and acknowledge to Pub/Sub
After message is acknowledged by the subscriber, it is removed from the subscription’s queue of messages.
Push and Pull
Push = lower latency, more real-time
Push subscribers must be Webhook endpoints that accept POST over HTTPS
Pull ideal for large volume of messages - batch delivery
Demo: how to publish and receive messages in PubSub with Java
create topic
create subscription to this topic
git clone project into cloud shell
go into the sample
modify PublisherExample.java and SubscribeAsyncExample.java to put the right project id, topic id and subscription id
compile project
run subscriber
run publisher in another screen and observe subscriber
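A minimal gcloud sketch of creating the topic and subscription used in the demo and smoke-testing them; the topic and subscription names are placeholders:
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topic
gcloud pubsub topics publish my-topic --message="hello"
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=5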
interactive data exploration (Notebook)
Built on Jupyter (formerly IPython)
Easily deploy models to BigQuery. You can visualize data with Google Charts or matplotlib.
Relational database: “Consistency and Reliability over Performance”
Non-Relational Database: “Performance over Consistency”
How to choose the right storage
Orchestrating work between GCP services with Cloud Composer
With Cloud Composer on Google Cloud, you don't need to install Airflow yourself; you only need to focus on the workflows.
Cloud Composer uses GCS (Google Cloud Storage) to store Apache Airflow DAGs, and we can add, update, and delete DAGs in our environment.
The DAGs folder is simply a GCS bucket where you load your pipeline code; this bucket is created automatically for you when you launch your Cloud Composer instance.
DAGs can be event-triggered through Cloud Functions, or executed periodically on a schedule.
Monitoring, Logging, and so on let you click into the corresponding job details to see how the job ran.
Airflow官网:https://airflow.incubator.apache.org/
Airflow是开源的:https://github.com/apache/airflow
Airflow官方文档:https://airflow.incubator.apache.org/docs/apache-airflow/stable/index.html
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow.
It integrates end-to-end with many Google Cloud products, including BigQuery, Dataflow, Dataproc, Datastore, Cloud Storage, Pub/Sub, and AI Platform, giving users full flexibility to orchestrate data pipelines and to author, schedule, and monitor workflows.
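A minimal sketch of creating a Composer environment and uploading a DAG with gcloud; the environment name, location, and DAG file are placeholders:
gcloud composer environments create my-composer-env --location=us-central1
gcloud composer environments storage dags import \
    --environment=my-composer-env --location=us-central1 --source=my_dag.py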
What is a Workflow?
Installing and using Airflow:
pip3 install apache-airflow
airflow db init
airflow webserver -p 8080
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
Visit http://localhost:8080/ and log in with admin as both the username and the password:
Graph View:
example_bash_operator:
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""Example DAG demonstrating the usage of the BashOperator."""
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
args = {
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['example', 'example2'],
    params={"example_key": "example_value"},
)

run_this_last = DummyOperator(
    task_id='run_this_last',
    dag=dag,
)

# [START howto_operator_bash]
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    dag=dag,
)
# [END howto_operator_bash]

run_this >> run_this_last

for i in range(3):
    task = BashOperator(
        task_id='runme_' + str(i),
        bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        dag=dag,
    )
    task >> run_this

# [START howto_operator_bash_template]
also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag,
)
# [END howto_operator_bash_template]
also_run_this >> run_this_last

if __name__ == "__main__":
    dag.cli()
After triggering the DAG, you can view its logs.
Alternatively, install Airflow with Docker:
docker-compose.yml
version: '3'
services:
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
  webserver:
    image: puckel/docker-airflow:1.10.1
    build:
      context: https://github.com/puckel/docker-airflow.git#1.10.1
      dockerfile: Dockerfile
      args:
        AIRFLOW_DEPS: gcp_api,s3
        PYTHON_DEPS: sqlalchemy==1.2.0
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - FERNET_KEY=jsDPRErfv8Z_eVTnGfF8ywd19j4pyqE3NpdUBA_oRTo=
    volumes:
      - ./examples/intro-example/dags:/usr/local/airflow/dags
      # Uncomment to include custom plugins
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
docker-compose up
You can then see the Airflow web UI at http://localhost:8080/
docker-compose logs
docker-compose down
Or use a Dockerfile like the following:
# Base Image
FROM python:3.7-slim-buster
# Arguments that can be set with docker build
ARG AIRFLOW_VERSION=1.10.1
ARG AIRFLOW_HOME=/usr/local/airflow
# Export the environment variable AIRFLOW_HOME where airflow will be installed
ENV AIRFLOW_HOME=${AIRFLOW_HOME}
ENV AIRFLOW_GPL_UNIDECODE=1
# Install dependencies and tools
RUN apt-get update -yqq && \
apt-get upgrade -yqq && \
apt-get install -yqq --no-install-recommends \
wget \
libczmq-dev \
curl \
libssl-dev \
git \
inetutils-telnet \
bind9utils freetds-dev \
libkrb5-dev \
libsasl2-dev \
libffi-dev libpq-dev \
freetds-bin build-essential \
default-libmysqlclient-dev \
apt-utils \
rsync \
zip \
unzip \
gcc \
locales \
procps \
&& apt-get clean
# Load custom configuration
COPY ./airflow.cfg ${AIRFLOW_HOME}/airflow.cfg
# Upgrade pip
# Create airflow user
# Install apache airflow with subpackages
RUN pip install --upgrade pip && \
useradd -ms /bin/bash -d ${AIRFLOW_HOME} airflow && \
pip install apache-airflow==${AIRFLOW_VERSION} --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.1/constraints-3.7.txt"
# Copy the entrypoint.sh from host to container (at path AIRFLOW_HOME)
COPY ./entrypoint.sh /entrypoint.sh
# Set the entrypoint.sh file to be executable
RUN chmod +x ./entrypoint.sh
# Set the owner of the files in AIRFLOW_HOME to the user airflow
RUN chown -R airflow: ${AIRFLOW_HOME}
# Set the username to use
USER airflow
# Set workdir (it's like a cd inside the container)
WORKDIR ${AIRFLOW_HOME}
# Create the dags folder which will contain the DAGs
RUN mkdir dags
# Expose the webserver port
EXPOSE 8080
# Execute the entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
entrypoint.sh:
#!/usr/bin/env bash
# Initialize the metadata database
airflow initdb
# Run the scheduler in background
airflow scheduler &> /dev/null &
# Run the web server in foreground (for docker logs)
exec airflow webserver
Then build the Airflow image:
docker build --tag airflow .
Run the Airflow container
docker run --name my_airflow -it -d -p 8080:8080 airflow
Verify that your Airflow container is running and healthy:
docker ps
Check out the logs:
docker logs my_airflow
Mount a DAG written as a Python file under the /xxx directory into the dags directory under AIRFLOW_HOME:
docker run --name my_airflow -it -d -p 8080:8080 --mount type=bind,source=/xxx/my_dag.py,target=/usr/local/airflow/dags/my_dag.py airflow
Verify that my_dag is in the dags directory:
docker exec -it my_airflow ls /usr/local/airflow/dags
exec into the container to access the shell.
docker exec -it my_airflow bash
Next, make sure the DAG was parsed correctly:
python dags/my_dag.py
Choose the Airflow and Python versions and click Create, and the environment is created.
You can also install Python dependencies:
Next, we can write a DAG following the example_bash_operator example above:
To integrate with BigQuery you can use bigquery_operator and configure a Connection in the web UI in order to work with the datasets in BigQuery; a task can contain SQL inline or point to a SQL file.
Another commonly used Airflow feature is Variables, which are simply key-value pairs.
The following Chinese translation of the Airflow documentation is recommended:
https://www.kancloud.cn/luponu/airflow-doc-zh/889656
And the following YouTube videos:
Airflow tutorial 1: Introduction to Apache Airflow
Airflow tutorial 2: Set up airflow environment with docker
Airflow tutorial 3: Set up airflow environment using Google Cloud Composer
Airflow tutorial 4: Writing your first pipeline
Airflow tutorial 5: Airflow concept
Airflow tutorial 6: Build a data pipeline using Google Bigquery
Airflow tutorial 7: Airflow variables
Metadata management
(1) System: BIGQUERY
Type: Dataset, Table
Resource URL: link to BigQuery URL
Tags
Schema and column tags: Name, Type (NUMERIC, STRING, etc.), Mode (e.g. NULLABLE), Column tags, Policy tags, Description list
(2) System: CLOUD_PUBSUB
Resource URL: link to Cloud Pub/Sub URL
Tags
The Cloud Pub/Sub details include Topics, Subscriptions (Delivery type: Pull, etc.), View Message, and charts such as Publish message request count/sec and Publish message operation count/sec.
(3) GCS
Entry group, Entries, Bucket, Type: FILESET, etc
Connect to data sources and build visual BI reports; you can share reports and view reports shared with you or owned by you.
Incidents, dashboards, alerting, and so on.
Logs Explorer, Logs Dashboard, Logs Storage retention period, and so on.
TensorFlow
Cloud ML
Machine Learning APIs
Why use the Cloud Machine Learning Platform?
For structured data
Classification and regression
Recommendation
Anomaly detection
For unstructured data
Image and video analytics
Text analytics
Gain insight from images
Detect inappropriate content
Analyze sentiment
Extract text
Cloud Natural Language API
Cloud Speech API
can return text in real time
Highly accurate, even in noisy environments
Access from any device
Cloud Translation API
Translate strings
Programmatically detect a document’s language
Support for dozen’s languages
Cloud Video Intelligence API
Annotate the contents of video
Detect scene changes
Flag inappropriate content
Support for a variety of video formats
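Several of these APIs can be tried directly from gcloud; the local file names below are placeholders, and the exact command set may vary by SDK version:
gcloud ml vision detect-labels ./image.jpg
gcloud ml language analyze-sentiment --content="I love this product"
gcloud ml speech recognize ./audio.flac --language-code=en-US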
Cloud Build runs infrastructure as code. It lets you orchestrate build steps that run as container images and automate Terraform workflows.
See https://github.com/agmsb/googlecloudbuild-terraform
Cloud Build can import source code from a variety of repositories or cloud storage spaces, execute a build to your specifications, and produce artifacts such as Docker containers or Java archives.
You can use Cloud Build through the Google Cloud Console, the gcloud command-line tool, or Cloud Build's REST API.
In the Cloud Console, you can view Cloud Build results on the Build history page and automate builds with build triggers.
You can use the gcloud tool to create and manage builds, and run commands for tasks such as submitting a build, listing builds, and canceling a build.
You can request builds using the Cloud Build REST API.
As with other Cloud Platform APIs, you must authorize access with OAuth2. Once access is authorized, you can use the API to start new builds, view build status and details, list builds per project, and cancel builds that are currently in progress.
Build configuration and build steps
You can write a build config to provide Cloud Build with instructions on what tasks to perform. You can configure builds to fetch dependencies; run unit tests, static analysis, and integration tests; and create software artifacts with build tools such as docker, gradle, maven, bazel, and gulp.
Cloud Build executes your build as a series of build steps, where each build step runs in a Docker container. Executing build steps is analogous to executing commands in a script.
You can use the build steps provided by Cloud Build and the Cloud Build community, or write your own custom build steps:
Cloud Build-provided build steps: Cloud Build publishes a set of supported open-source build steps for common languages and tasks.
Community-contributed build steps: the Cloud Build user community provides open-source build steps.
Custom build steps: you can create your own build steps to use in your builds.
Each build step runs with its container attached to a local Docker network named cloudbuild. This allows build steps to communicate with each other and share data.
You can use standard Docker Hub images in Cloud Build, such as Ubuntu and Gradle.
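Two common ways to start a build with gcloud; the image name and config file are placeholders:
gcloud builds submit --tag gcr.io/my-project/my-app .   # build with the Docker builder and push the image
gcloud builds submit --config cloudbuild.yaml .         # run the steps defined in cloudbuild.yaml
gcloud builds list --limit=5                            # inspect recent build results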
How builds work
The following steps describe, in general, the lifecycle of a Cloud Build build:
Program output:
Below is an example scenario of deploying Java code; for more scenarios see the other official docs:
Using App Engine: https://cloud.google.com/appengine/docs/flexible/java/quickstart
Using Compute Engine: https://cloud.google.com/java/getting-started/getting-started-on-compute-engine
Building Java containers with Jib: https://cloud.google.com/java/getting-started/jib
https://cloud.tencent.com/developer/news/612944
Deployment Manager
Run the following command to see that my-vm was created successfully:
The details of my-vm are as follows:
Budget and Alerts
Based on the billing account of a GCP project, you can define alerts triggered at 50%, 90%, and 100% of the budget; billing details can be exported, and spending details can be viewed in reports. Quotas help prevent over-consumption of resources; there are rate quotas and allocation quotas. For example, the Kubernetes service can be limited to at most 1,000 calls every 100 seconds, and each project can be limited to at most 5 VPNs.