poetry 下运行 dbt(qbit)

前言

  • dbt 是 Data Build Tool 的简称,目前由 dbt labs 公司负责开发和维护,公司的前身是Fishtown Analytics。
  • dbt 主要处理 ETL 中的 T(transform)
  • dbt 主要分为两部分

    dbt Core: cli 命令行工具
    dbt Cloud: 云服务
  • dbt Core 的 github 地址是:https://github.com/dbt-labs/d...
  • qbit 个人觉得 dbt Core 的官方入门教程不友好,一开始就要求谷歌的 BigQuery 服务、GitHub 账号等,其实 dbt Core 可以不联网完全离线使用。
  • 本文主要参照:【大数据架构之旅】2 从零起步学 dbt

技术栈

Windows    10
Python     3.8.10
poetry     1.3.1
git        2.35.1.windows.2
PostgreSQL 15.1

poetry 项目配置

  • .../test_dbt/pyproject.toml

    [tool.poetry]
    name = "test-dbt"
    version = "0.1.0"
    description = ""
    authors = ["qbit "]
    readme = "README.md"
    packages = [{include = "test_dbt"}]
    
    [[tool.poetry.source]]
    name = "aliyun"
    url = "https://mirrors.aliyun.com/pypi/simple/"
    default = true
    
    [tool.poetry.dependencies]
    python = "^3.8"
    dbt-core = "~1.3.1"
    dbt-postgres = "~1.3.1"
    
    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"
  • 本文中的 poetry 项目文件夹.../test_dbt/
  • 本文中的 dbt 项目文件夹.../test_dbt/dbt_demo/

创建数据库与初始化项目

  • 在 PostgreSQL 里面新建数据库 dbt-demo
  • 初始化 dbt 项目,在 .../test_dbt/ 目录下执行以下命令

    poetry run dbt init dbt_demo
  • 按提示执行完后,.../test_dbt/ 的目录结构如下

    $ tree
    .
    ├── dbt_demo
    │   ├── analyses
    │   ├── dbt_project.yml
    │   ├── macros
    │   ├── models
    │   │   └── example
    │   │       ├── my_first_dbt_model.sql
    │   │       ├── my_second_dbt_model.sql
    │   │       └── schema.yml
    │   ├── README.md
    │   ├── seeds
    │   ├── snapshots
    │   └── tests
    ├── logs
    │   └── dbt.log
    ├── poetry.lock
    └── pyproject.toml
  • 在个人用户目录下找到文件 ~/.dbt/profiles.ymlqbit 的绝对路径是 C:\Users\qbit\.dbt\profiles.yml,将以下内容复制粘贴到文件,中括号里面的内容按自己的实际情况填写

    dbt_demo:
    outputs:
    
      dev:
        type: postgres
        threads: 1
        host: [host]
        port: [port]
        user: [dev_username]
        pass: [dev_password]
        dbname: dbt_demo
        schema: dev_schema
    
      prod:
        type: postgres
        threads: 1
        host: [host]
        port: [port]
        user: [prod_username]
        pass: [prod_password]
        dbname: dbt_demo
        schema: prod_schema
    
    target: dev
  • 下文命令若无特殊说明,都在 .../test_dbt/dbt_demo/ 目录下运行

自带模型

  • 进入 .../test_dbt/dbt_demo/ 目录后运行以下命令检查环境信息

    $ poetry run dbt debug
    06:41:09  Running with dbt=1.3.1
    dbt version: 1.3.1
    python version: 3.8.10
    python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe
    os info: Windows-10-10.0.19045-SP0
    Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml
    Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml
    
    Configuration:
    profiles.yml file [OK found and valid]
    dbt_project.yml file [OK found and valid]
    
    Required dependencies:
     - git [OK found]
    
    Connection:
    host: 192.168.1.52
    port: 5432
    user: postgres
    database: dbt_demo
    schema: dev_schema
    search_path: None
    keepalives_idle: 0
    sslmode: None
    Connection test: [OK connection ok]
    
    All checks passed!
  • 生成自带模型

    $ poetry run dbt run
    06:47:04  Running with dbt=1.3.1
    06:47:04  Partial parse save file not found. Starting full parse.
    06:47:04  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
    06:47:04
    06:47:05  Concurrency: 1 threads (target='dev')
    06:47:05
    06:47:05  1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
    06:47:05  1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s]
    06:47:05  2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
    06:47:05  2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s]
    06:47:05
    06:47:05  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s).
    06:47:05
    06:47:05  Completed successfully
    06:47:05
    06:47:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 检查 PostgreSQL 数据库,应该可以看到一张名为 my_first_dbt_model 的表,和一个名为 my_second_dbt_model 的视图

外部数据

创建底表

  • 加州大学欧文分校下载共享单车的数据集,将 hour.csvday.csv 放到 .../test_dbt/dbt_demo/seeds 文件夹下。
  • .../test_dbt/dbt_demo/seeds 目录下创建文件 bike_share.yml,内容如下:

    version: 2
    seeds:
    - name: hour
      config:
        column_types: 
          dteday: date
    - name: day
      config:
        column_types:
          dteday: date
  • 这里指定 date 类型是为了避免 dbt 自动推断类型出错
  • 关于数据列的类型,可参考官方文档 column_types
  • 执行以下命令导入基础数据

    $ poetry run dbt seed
    07:19:26  Running with dbt=1.3.1
    07:19:26  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:19:26
    07:19:26  Concurrency: 1 threads (target='dev')
    07:19:26
    07:19:26  1 of 2 START seed file dev_schema.day .......................................... [RUN]
    07:19:31  1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s]
    07:19:31  2 of 2 START seed file dev_schema.hour ......................................... [RUN]
    07:21:24  2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s]
    07:21:24
    07:21:24  Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s).
    07:21:24
    07:21:24  Completed successfully
    07:21:24
    07:21:24  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
  • 检查 PostgreSQL 数据库,应该可以看到新建了 2 张表:hourday

自定义模型

  • 创建模型目录 .../test_dbt/dbt_demo/models/bike_share,下面 sql 文件中 ref('day') 是引用的上面导入的 day数据库表
  • .../models/bike_share 目录内创建 weather_trend.sql,内容如下

    /* 天气趋势表 */
    
    {{ config(materialized='table') }}
    
    with weather_trend as (
      select weathersit, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by weathersit
    )
  • .../models/bike_share 目录内创建 season_trend.sql,内容如下

    /* 季节趋势表 */
    
    {{ config(materialized='table') }}
    
    with season_trend as (
      select season, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by season
    )
  • .../models/bike_share 目录内创建 month_trend.sql,内容如下

    /* 月度趋势表 */
    
    {{ config(materialized='table') }}
    
    with month_trend as (
      select mnth, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by mnth
    )
  • .../models/bike_share 目录内创建 weekday_trend.sql,内容如下

    /* 星期几趋势表 */
    
    {{ config(materialized='table') }}
    
    with weekday_trend as (
      select weekday, min(temp*47+8) as min_temp, max(temp*47+8) as max_temp, avg(temp*47+8) as mean_temp, sum(casual) as total_casual, sum(registered) as total_registered, sum(cnt) as total_cnt
      from {{ ref('day') }}
      group by weekday
    )
  • 执行以下命令生成模型

    $ poetry run dbt run
    07:43:59  Running with dbt=1.3.1
    07:43:59  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:43:59
    07:43:59  Concurrency: 1 threads (target='dev')
    07:43:59
    07:43:59  1 of 6 START sql table model dev_schema.month_trend ............................ [RUN]
    07:44:00  1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s]
    07:44:00  2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
    07:44:00  2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s]
    07:44:00  3 of 6 START sql table model dev_schema.season_trend ........................... [RUN]
    07:44:00  3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s]
    07:44:00  4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN]
    07:44:00  4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s]
    07:44:00  5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN]
    07:44:00  5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s]
    07:44:00  6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
    07:44:00  6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s]
    07:44:00
    07:44:00  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s).
    07:44:00
    07:44:00  Completed successfully
    07:44:00
    07:44:00  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

文档生成和查看

  • 运行以下命令生成文档

    $ poetry run dbt docs generate
    07:49:09  Running with dbt=1.3.1
    07:49:09  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
    07:49:09
    07:49:09  Concurrency: 1 threads (target='dev')
    07:49:09
    07:49:09  Done.
    07:49:10  Building catalog
    07:49:10  Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
  • 运行以下命令,会启动 http 服务,并打开默认浏览器查看文档

    $ poetry run dbt docs serve
    07:49:53  Running with dbt=1.3.1
    07:49:53  Serving docs at 0.0.0.0:8080
    07:49:53  To access from your browser, navigate to:  http://localhost:8080
    07:49:53
    07:49:53
    07:49:53  Press Ctrl+C to exit.
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 -
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 -
    127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
  • 网页界面
  • 点击右下角按钮查看血缘关系图(Lineage Graph)

目录结构

  • 最后再看一眼目录结构

    $ tree ./dbt_demo/ -L 3
    ./dbt_demo/
    ├── analyses
    ├── dbt_packages
    ├── dbt_project.yml
    ├── logs
    │   └── dbt.log
    ├── macros
    ├── models
    │   ├── bike_share
    │   │   ├── month_trend.sql
    │   │   ├── season_trend.sql
    │   │   ├── weather_trend.sql
    │   │   └── weekday_trend.sql
    │   └── example
    │       ├── my_first_dbt_model.sql
    │       ├── my_second_dbt_model.sql
    │       └── schema.yml
    ├── README.md
    ├── seeds
    │   ├── bike_share.yml
    │   ├── day.csv
    │   └── hour.csv
    ├── snapshots
    ├── target
    │   ├── catalog.json
    │   ├── compiled
    │   │   └── dbt_demo
    │   ├── graph.gpickle
    │   ├── index.html
    │   ├── manifest.json
    │   ├── partial_parse.msgpack
    │   ├── run
    │   │   └── dbt_demo
    │   └── run_results.json
    └── tests
本文出自 qbit snap

你可能感兴趣的:(python教程数据sql)