前言
- dbt 是 Data Build Tool 的简称,目前由 dbt labs 公司负责开发和维护,公司的前身是Fishtown Analytics。
- dbt 主要处理
ETL
中的T
(transform) dbt 主要分为两部分
dbt Core: cli 命令行工具 dbt Cloud: 云服务
- dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
- qbit 个人觉得 dbt Core 的官方入门教程不友好,一开始就要求谷歌的 BigQuery 服务、GitHub 账号等,其实
dbt Core
可以不联网完全离线使用。 - 本文主要参照:【大数据架构之旅】2 从零起步学 dbt
技术栈
Windows 10
Python 3.8.10
poetry 1.3.1
git 2.35.1.windows.2
PostgreSQL 15.1
poetry 项目配置
.../test_dbt/pyproject.toml
[tool.poetry] name = "test-dbt" version = "0.1.0" description = "" authors = ["qbit
"] readme = "README.md" packages = [{include = "test_dbt"}] [[tool.poetry.source]] name = "aliyun" url = "https://mirrors.aliyun.com/pypi/simple/" default = true [tool.poetry.dependencies] python = "^3.8" dbt-core = "~1.3.1" dbt-postgres = "~1.3.1" [build-system] requires = ["poetry-core"] build-backend = "poetry.core.masonry.api"
- 本文中的
poetry 项目文件夹
指.../test_dbt/
- 本文中的
dbt 项目文件夹
指.../test_dbt/dbt_demo/
创建数据库与初始化项目
- 在 PostgreSQL 里面新建数据库
dbt-demo
初始化 dbt 项目,在
.../test_dbt/
目录下执行以下命令poetry run dbt init dbt_demo
按提示执行完后,
.../test_dbt/
的目录结构如下$ tree . ├── dbt_demo │ ├── analyses │ ├── dbt_project.yml │ ├── macros │ ├── models │ │ └── example │ │ ├── my_first_dbt_model.sql │ │ ├── my_second_dbt_model.sql │ │ └── schema.yml │ ├── README.md │ ├── seeds │ ├── snapshots │ └── tests ├── logs │ └── dbt.log ├── poetry.lock └── pyproject.toml
在个人用户目录下找到文件
~/.dbt/profiles.yml
,qbit
的绝对路径是C:\Users\qbit\.dbt\profiles.yml
,将以下内容复制粘贴到文件,中括号里面的内容按自己的实际情况填写dbt_demo: outputs: dev: type: postgres threads: 1 host: [host] port: [port] user: [dev_username] pass: [dev_password] dbname: dbt_demo schema: dev_schema prod: type: postgres threads: 1 host: [host] port: [port] user: [prod_username] pass: [prod_password] dbname: dbt_demo schema: prod_schema target: dev
- 下文命令若无特殊说明,都在
.../test_dbt/dbt_demo/
目录下运行
自带模型
进入
.../test_dbt/dbt_demo/
目录后运行以下命令检查环境信息$ poetry run dbt debug 06:41:09 Running with dbt=1.3.1 dbt version: 1.3.1 python version: 3.8.10 python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe os info: Windows-10-10.0.19045-SP0 Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml Configuration: profiles.yml file [OK found and valid] dbt_project.yml file [OK found and valid] Required dependencies: - git [OK found] Connection: host: 192.168.1.52 port: 5432 user: postgres database: dbt_demo schema: dev_schema search_path: None keepalives_idle: 0 sslmode: None Connection test: [OK connection ok] All checks passed!
生成自带模型
$ poetry run dbt run 06:47:04 Running with dbt=1.3.1 06:47:04 Partial parse save file not found. Starting full parse. 06:47:04 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 06:47:04 06:47:05 Concurrency: 1 threads (target='dev') 06:47:05 06:47:05 1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 06:47:05 1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s] 06:47:05 2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 06:47:05 2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s] 06:47:05 06:47:05 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s). 06:47:05 06:47:05 Completed successfully 06:47:05 06:47:05 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 检查 PostgreSQL 数据库,应该可以看到一张名为
my_first_dbt_model
的表,和一个名为my_second_dbt_model
的视图
外部数据
创建底表
- 从加州大学欧文分校下载共享单车的数据集,将
hour.csv
和day.csv
放到.../test_dbt/dbt_demo/seeds
文件夹下。 在
.../test_dbt/dbt_demo/seeds
目录下创建文件bike_share.yml
,内容如下:version: 2 seeds: - name: hour config: column_types: dteday: date - name: day config: column_types: dteday: date
- 这里指定
date
类型是为了避免dbt
自动推断类型出错 - 关于数据列的类型,可参考官方文档 column_types
执行以下命令导入基础数据
$ poetry run dbt seed 07:19:26 Running with dbt=1.3.1 07:19:26 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:19:26 07:19:26 Concurrency: 1 threads (target='dev') 07:19:26 07:19:26 1 of 2 START seed file dev_schema.day .......................................... [RUN] 07:19:31 1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s] 07:19:31 2 of 2 START seed file dev_schema.hour ......................................... [RUN] 07:21:24 2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s] 07:21:24 07:21:24 Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s). 07:21:24 07:21:24 Completed successfully 07:21:24 07:21:24 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 检查 PostgreSQL 数据库,应该可以看到新建了 2 张表:
hour
和day
自定义模型
- 创建模型目录
.../test_dbt/dbt_demo/models/bike_share
,下面 sql 文件中ref('day')
是引用的上面导入的day
数据库表 在
.../models/bike_share
目录内创建weather_trend.sql
,内容如下/* 天气趋势表 */ {{ config(materialized='table') }} with weather_trend as ( select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weathersit )
在
.../models/bike_share
目录内创建season_trend.sql
,内容如下/* 季节趋势表 */ {{ config(materialized='table') }} with season_trend as ( select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by season )
在
.../models/bike_share
目录内创建month_trend.sql
,内容如下/* 月度趋势表 */ {{ config(materialized='table') }} with month_trend as ( select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by mnth )
在
.../models/bike_share
目录内创建weekday_trend.sql
,内容如下/* 星期几趋势表 */ {{ config(materialized='table') }} with weekday_trend as ( select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt from {{ ref('day') }} group by weekday )
执行以下命令生成模型
$ poetry run dbt run 07:43:59 Running with dbt=1.3.1 07:43:59 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:43:59 07:43:59 Concurrency: 1 threads (target='dev') 07:43:59 07:43:59 1 of 6 START sql table model dev_schema.month_trend ............................ [RUN] 07:44:00 1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s] 07:44:00 2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN] 07:44:00 2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s] 07:44:00 3 of 6 START sql table model dev_schema.season_trend ........................... [RUN] 07:44:00 3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s] 07:44:00 4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN] 07:44:00 4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s] 07:44:00 5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN] 07:44:00 5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s] 07:44:00 6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN] 07:44:00 6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s] 07:44:00 07:44:00 Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s). 07:44:00 07:44:00 Completed successfully 07:44:00 07:44:00 Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
文档生成和查看
运行以下命令生成文档
$ poetry run dbt docs generate 07:49:09 Running with dbt=1.3.1 07:49:09 Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics 07:49:09 07:49:09 Concurrency: 1 threads (target='dev') 07:49:09 07:49:09 Done. 07:49:10 Building catalog 07:49:10 Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
运行以下命令,会启动 http 服务,并打开默认浏览器查看文档
$ poetry run dbt docs serve 07:49:53 Running with dbt=1.3.1 07:49:53 Serving docs at 0.0.0.0:8080 07:49:53 To access from your browser, navigate to: http://localhost:8080 07:49:53 07:49:53 07:49:53 Press Ctrl+C to exit. 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 - 127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
- 网页界面
- 点击右下角按钮查看血缘关系图(Lineage Graph)
目录结构
最后再看一眼目录结构
$ tree ./dbt_demo/ -L 3 ./dbt_demo/ ├── analyses ├── dbt_packages ├── dbt_project.yml ├── logs │ └── dbt.log ├── macros ├── models │ ├── bike_share │ │ ├── month_trend.sql │ │ ├── season_trend.sql │ │ ├── weather_trend.sql │ │ └── weekday_trend.sql │ └── example │ ├── my_first_dbt_model.sql │ ├── my_second_dbt_model.sql │ └── schema.yml ├── README.md ├── seeds │ ├── bike_share.yml │ ├── day.csv │ └── hour.csv ├── snapshots ├── target │ ├── catalog.json │ ├── compiled │ │ └── dbt_demo │ ├── graph.gpickle │ ├── index.html │ ├── manifest.json │ ├── partial_parse.msgpack │ ├── run │ │ └── dbt_demo │ └── run_results.json └── tests
本文出自 qbit snap