LinkedIn DataHub --- Experience Sharing


  • ⚽⚽Passion begets persistence⚽⚽
    • 1. Docker command
      • 1.1 docker quickstart
      • 1.2 python3 -m datahub docker nuke --keep-data
      • 1.3 docker data volumes
    • 2. Error
      • 2.1 DPI-1047: Cannot locate a 64-bit Oracle Client library
      • 2.2 UI cannot cancel
    • 3. Delete metadata
    • 4. Oracle permission
    • 5. Neo4j or Elasticsearch
    • 6. Ingest metadata by json
      • 6.1 Json template
      • 6.2 Run
    • 7. Create Lineage
      • 7.1 Yml template
      • 7.2 Run
    • 8. Ingest CSV
      • 8.1 Csv Template
      • 8.2 Run
    • 9. Transformers
      • 9.1 Simple Demo
    • 10. Actions
      • 10.1 Install plugin
      • 10.2 Config Action
      • 10.3 Run
      • 10.4 Kafka topic
    • 11. Data Quality
      • 11.1 Initialize
      • 11.2 Connect DB
      • 11.3 Create expectation
    • 12. Openapi
      • 12.1 Swagger
      • 12.2 API test
    • 13. Pending

⚽⚽Passion begets persistence⚽⚽


DataHub docs: https://datahubproject.io/docs/
GitHub: https://github.com/datahub-project/datahub
Online demo: https://demo.datahubproject.io
Recipe demos: https://github.com/datahub-project/datahub/tree/master/metadata-ingestion/examples/recipes


1. Docker command

1.1 docker quickstart

  • GitHub: https://github.com/datahub-project/datahub
  • My Docker setup does not include Neo4j, so refer to the without-neo4j quickstart yml.
  • The versioned metadata is written to MySQL.
  • Enter the MySQL container: docker exec -it containerId /bin/bash (see the sketch below)
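
A minimal sketch of the whole flow; the container id placeholder and the datahub/datahub MySQL credentials are assumptions taken from the quickstart compose file:

python3 -m datahub docker quickstart        # bring up the stack (Elasticsearch backend, no Neo4j)
docker ps | grep mysql                      # find the MySQL container id
docker exec -it <containerId> /bin/bash     # open a shell inside it
mysql -u datahub -pdatahub datahub          # default quickstart credentials; then e.g. SHOW TABLES;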

1.2 python3 -m datahub docker nuke --keep-data

This tears down all the DataHub containers while keeping the Docker data volumes, so the stored metadata survives the next quickstart.

1.3 docker data volumes

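To see where that data actually lives, something like the following works; the exact volume names vary by version, so treat them as assumptions:

docker volume ls | grep datahub           # e.g. datahub_mysqldata, datahub_esdata (names are assumptions)
docker volume inspect datahub_mysqldata   # shows the mountpoint on the host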

2. Error

2.1 DPI-1047: Cannot locate a 64-bit Oracle Client library

  • First make sure the Oracle Client is fully installed, along with cx_Oracle.
    Installation guide: https://blog.csdn.net/weixin_43916074/article/details/124827554

  • Check the error log and make sure the local datahub and python versions match the server. In my case, running the commands below caused a version mismatch, which is why cx_Oracle could not be found.

    python3 -m datahub docker nuke --keep-data
    python3 -m datahub docker quickstart --version v0.8.38

  • To upgrade, keep the data and then upgrade directly:
    python3 -m datahub docker nuke --keep-data
    Official reference: https://datahubproject.io/docs/cli
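
If the versions match and DPI-1047 still appears, the loader usually cannot find the Instant Client libraries. A minimal sketch, assuming the client was unpacked to the hypothetical path /opt/oracle/instantclient_21_6:

# Point the dynamic loader at the 64-bit Oracle Instant Client
export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_6:$LD_LIBRARY_PATH
# Smoke test: this import raises DPI-1047 when the client cannot be located
python3 -c "import cx_Oracle; print(cx_Oracle.version)"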

2.2 UI cannot cancel

  • The ingestion has already finished, but the UI keeps spinning.
    No workaround for now.

3. Delete metadata

  • Delete the Oracle datasets in the QA environment:
    python3 -m datahub delete --env QA --entity_type dataset --platform oracle
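
Two related variants I find handy; the URN is a placeholder, and the flags should be double-checked against datahub delete --help for your version:

# Soft-delete a single dataset by URN (the default behaviour)
python3 -m datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:oracle,MY_SCHEMA.MY_TABLE,QA)"
# Hard-delete also removes the underlying rows from storage
python3 -m datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:oracle,MY_SCHEMA.MY_TABLE,QA)" --hard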

4. Oracle permission

  • Restrict the Oracle account so it can only see the tables/views under a single schema.

  • Configure allow/deny regex patterns in the DataHub recipe (see the sketch below).
  • Read the official docs to get familiar with the configuration options.
    Official reference: https://datahubproject.io/docs/generated/ingestion/sources/oracle
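
A minimal recipe sketch with a schema allow-pattern; the host, credentials, and schema name are all placeholders:

source:
  type: oracle
  config:
    host_port: 'oracle-host:1521'     # placeholder
    service_name: 'ORCLPDB1'          # placeholder
    username: 'datahub_reader'        # the restricted account from above
    password: 'xxxx'
    schema_pattern:
      allow:
        - '^MY_SCHEMA$'               # only ingest this one schema
sink:
  type: 'datahub-rest'
  config:
    server: 'http://localhost:8080'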

5. Neo4j or Elasticsearch

  • Neo4j and Elasticsearch are alternatives: DataHub uses one or the other as its graph backend.
  • If Docker has pulled Neo4j, it is used.
  • If Neo4j was not pulled, Elasticsearch is used instead.

6. Ingest metadata by json

Official reference: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/demo_data/demo_data.json

6.1 Json template

[
    {
        "auditHeader": null,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,bigquery-schema-data.covid19,QA)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "bigquery-schema-data.covid19",
                            "platform": "urn:li:dataPlatform:bigquery",
                            "version": 0,
                            "created": {
                                "time": 1621882982738,
                                "actor": "urn:li:corpuser:etl",
                                "impersonator": null
                            },
                            "lastModified": {
                                "time": 1621882982738,
                                "actor": "urn:li:corpuser:etl",
                                "impersonator": null
                            },
                            "deleted": null,
                            "dataset": null,
                            "cluster": null,
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.MySqlDDL": {
                                    "tableSchema": ""
                                }
                            },
                            "fields": [
                                {
                                    "fieldPath": "county_code",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.StringType": {}
                                        }
                                    },
                                    "nativeDataType": "String()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                {
                                    "fieldPath": "county_name",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.StringType": {}
                                        }
                                    },
                                    "nativeDataType": "String()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                
                                {
                                    "fieldPath": "county_number",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "Integer()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                {
                                    "fieldPath": "hospital_bed_number",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "Integer()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                }
                            ],
                            "primaryKeys": null,
                            "foreignKeysSpecs": null
                        }
                    }
                ]
            }
        },
        "proposedDelta": null
    },
    {
        "auditHeader": null,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,bigquery-sehcma-nan.covid19,QA)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "bigquery-schema-nan.covid19",
                            "platform": "urn:li:dataPlatform:bigquery",
                            "version": 0,
                            "created": {
                                "time": 1621882983026,
                                "actor": "urn:li:corpuser:etl",
                                "impersonator": null
                            },
                            "lastModified": {
                                "time": 1621882983026,
                                "actor": "urn:li:corpuser:etl",
                                "impersonator": null
                            },
                            "deleted": null,
                            "dataset": null,
                            "cluster": null,
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.MySqlDDL": {
                                    "tableSchema": ""
                                }
                            },
                            "fields": [
                                {
                                    "fieldPath": "county_code",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.StringType": {}
                                        }
                                    },
                                    "nativeDataType": "String()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                {
                                    "fieldPath": "county_name",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.StringType": {}
                                        }
                                    },
                                    "nativeDataType": "String()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                {
                                    "fieldPath": "total_personnel_number",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "Integer()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                },
                                {
                                    "fieldPath": "total_hospital_number",
                                    "jsonPath": null,
                                    "nullable": true,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "Integer()",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                }
                            ],
                            "primaryKeys": null,
                            "foreignKeysSpecs": null
                        }
                    }
                ]
            }
        },
        "proposedDelta": null
    }
]

6.2 Run

sudo python3 -m datahub ingest -c xxx.yaml

source:
  type: file
  config:
    # Coordinates
    filename: ./xxxx/file.json

# sink configs
sink:
  type: 'datahub-rest'
  config: 
    server: 'http://localhost:8080'

7. Create Lineage

Official reference: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/file_lineage.yml

7.1 Yml template

---
version: 1
lineage:
  - entity:
      name: topic3
      type: dataset
      env: DEV
      platform: kafka
    upstream:
      - entity:
          name: topic2
          type: dataset
          env: DEV
          platform: kafka
      - entity:
          name: topic1
          type: dataset
          env: DEV
          platform: kafka
  - entity:
      name: topic2
      type: dataset
      env: DEV
      platform: kafka
    upstream:
      - entity:
          name: kafka.topic2
          env: PROD
          platform: snowflake
          platform_instance: test
          type: dataset

7.2 Run

sudo python3 -m datahub ingest -c xxx.yaml

source:
  type: datahub-lineage-file
  config:
    file: /path/to/file_lineage.yml
    preserve_upstream: False

# sink configs
sink:
  type: 'datahub-rest'
  config: 
    server: 'http://localhost:8080'

8. Ingest CSV

Official reference: https://datahubproject.io/docs/generated/ingestion/sources/csv

8.1 Csv Template

Official reference: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/demo_data/csv_enricher_demo_data.csv

Notes:

  • A new domain attribute is added.
  • The current version creates a brand-new glossaryTerm, which conflicts with the intended target; the workaround is to reference the term by id.

8.2 Run

sudo python3 -m datahub ingest -c xxx.yaml

source:
  type: csv-enricher
  config:
    filename: /xxxx/csv_xxxx.csv
    delimiter: ','
    array_delimiter: '|'

# sink configs
sink:
  type: 'datahub-rest'
  config: 
    server: 'http://localhost:8080'

9. Transformers

Official reference: https://datahubproject.io/docs/metadata-ingestion/transformers


9.1 Simple Demo


sudo python3 -m datahub ingest -c xxx.yaml

# Use together with the json file from section 6
source:
  type: file
  config:
    # Coordinates
    filename: ./xxxx/file.json

transformers:
  - type: "simple_add_dataset_properties"
    config:
      properties:
        prop1: value1
# sink configs
sink:
  type: 'datahub-rest'
  config: 
    server: 'http://localhost:8080'
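
Other transformers follow the same shape. For instance, a tag-adding variant from the same docs page (the tag name here is purely illustrative):

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"   # illustrative tag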

10. Actions

Official reference: https://datahubproject.io/docs/actions

10.1 Install plugin

  • Install Cli
    sudo python3 -m pip install --upgrade pip wheel setuptools
    sudo python3 -m pip install --upgrade acryl-datahub
    sudo datahub --version
  • Install Action
    sudo python3 -m pip install --upgrade pip wheel setuptools
    sudo python3 -m pip install --upgrade acryl-datahub-actions
    sudo datahub actions version

10.2 Config Action

Official reference: https://datahubproject.io/docs/actions

  • Action Pipeline Name (Should be unique and static)
  • Source Configurations
  • Transform + Filter Configurations
  • Action Configuration
  • Pipeline Options (Optional)
  • DataHub API configs (Optional - required for select actions)
# 1. Required: Action Pipeline Name
name: <action-pipeline-name>

# 2. Required: Event Source - Where to source event from.
source:
  type: <source-type>
  config:
    # Event Source specific configs (map)

# 3a. Optional: Filter to run on events (map)
filter: 
  event_type: <filtered-event-type>
  event:
    # Filter event fields by exact-match
    <filtered-event-fields>

# 3b. Optional: Custom Transformers to run on events (array)
transform:
  - type: <transformer-type>
    config: 
      # Transformer-specific configs (map)

# 4. Required: Action - What action to take on events. 
action:
  type: <action-type>
  config:
    # Action-specific configs (map)

# 5. Optional: Additional pipeline options (error handling, etc)
options: 
  retry_count: 0 # The number of times to retry an Action with the same event. (If an exception is thrown). 0 by default. 
  failure_mode: "CONTINUE" # What to do when an event fails to be processed. Either 'CONTINUE' to make progress or 'THROW' to stop the pipeline. Either way, the failed event will be logged to a failed_events.log file. 
  failed_events_dir: "/tmp/datahub/actions"  # The directory in which to write a failed_events.log file that tracks events which fail to be processed. Defaults to "/tmp/logs/datahub/actions". 

# 6. Optional: DataHub API configuration
datahub:
  server: "http://localhost:8080" # Location of DataHub API
  # token: <your-access-token> # Required if Metadata Service Auth enabled

The demo provided in the official docs:

# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source: Where to source event from.
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Action: What action to take on events. 
action:
  type: "hello_world"

10.3 Run

  • sudo python3 -m datahub actions -c <action-config.yaml> (placeholder for your config file)
  • Writing an event to the topic (as in the official demo) triggers the action.
    Reference: https://datahubproject.io/docs/actions/events/entity-change-event

10.4 Kafka topic

  • This one was embarrassing.
  • Find the Kafka container:
    docker ps
  • Enter the Kafka container (58 is the container id prefix from docker ps):
    docker exec -it 58 /bin/bash
    cd /etc/kafka
    cat kafka.properties
  • The log directory is where the data lives:
    cd /var/lib/kafka/data
  • Inside it are the topics and their offsets.
  • List the consumer groups:
    ./kafka-consumer-groups --bootstrap-server localhost:9092 --list
  • Consume a topic's messages from the beginning:
    cd /usr/bin
    ./kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic PlatformEvent_v1
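
To see which topics exist in the first place, the same directory has the topics tool (the Confluent images ship these CLIs in /usr/bin):

cd /usr/bin
./kafka-topics --bootstrap-server localhost:9092 --list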

11. Data Quality

Official reference: https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/

11.1 Initialize

  • Install GE:
    sudo pip3 install 'acryl-datahub[great-expectations]'
  • Initialize:
    /usr/local/python3/bin/great_expectations init
  • Check the version:
    /usr/local/python3/bin/great_expectations --version

11.2 Connect DB

  • Following the official docs, create a checkpoint and run it:
    /usr/local/python3/bin/great_expectations -v checkpoint run demo_checkpoint.yaml

  • Unsurprisingly, it errored out:
    Could not find Checkpoint 'demo_checkpoint.yaml' (or its configuration is invalid)
    /usr/local/python3/bin/great_expectations datasource new --no-jupyter
    enter option 2 => I used MySQL
    enter option 2

  • I run python3, so the driver had to be installed manually:
    sudo pip3 install psycopg2-binary

  • Re-run the previous step:
    /usr/local/python3/bin/great_expectations datasource new --no-jupyter

  • Continue as prompted:
    jupyter notebook /home/os-nan.zhao/great_expectations/uncommitted/datasource_new.ipynb --allow-root --ip 0.0.0.0

  • Open the printed address in a browser.

  • Enter the token, then set a new password: datahub@123

  • Open datasource_new.ipynb.

  • Find the MySQL config in the git repo (see the connection-string sketch below):
    docker mysql config: https://github.com/datahub-project/datahub/blob/master/docker/quickstart/docker-compose-without-neo4j.quickstart.yml
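
For the quickstart MySQL, the SQLAlchemy URL entered in the notebook looks roughly like this; the datahub/datahub credentials are the compose-file defaults, and the pymysql prefix assumes the PyMySQL driver is installed:

mysql+pymysql://datahub:datahub@<datahub-server-ip>:3306/datahub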

11.3 Create expectation

  • Open another window and continue:
    /usr/local/python3/bin/great_expectations suite new
  • Select option 2:
    enter option 2
  • Enter the index of the table for which you want to create the suite:
    enter option 10
  • Enter the suite name:
    demo01
  • Ugh, port 8889 was not open. With DataHub, do yourself a favour and open every port you need in advance.
  • Edit:
    /usr/local/python3/bin/great_expectations suite edit --no-jupyter
    jupyter notebook /great_expectations/uncommitted/edit_.ipynb --allow-root --ip 0.0.0.0
  • Run:
    /usr/local/python3/bin/great_expectations checkpoint new --no-jupyter
  • Next:
    jupyter notebook /great_expectations/uncommitted/edit_checkpoint_.ipynb --allow-root --ip 0.0.0.0

Great Expectations docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/database/mysql
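
The piece that ties GE back to DataHub is the validation action added to the checkpoint's action_list. A sketch based on the integration docs above, with the server URL assumed to be the local GMS:

action_list:
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://localhost:8080   # GMS endpoint; adjust to your deployment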

12. Openapi


12.1 Swagger

  • The official distribution ships a Swagger UI on the GMS port (8080):
    http://datahub-server-ip:8080/openapi/swagger-ui/index.html#/Timeline

12.2 API test

  • DataHub API docs: https://datahubproject.io/docs/api/openapi/openapi-usage-guide
  • Compare the OpenAPI payloads against the json file from section 6 to see how to shape them (see the sketch below).
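
A hedged sketch of an upsert through the entities endpoint; the payload shape follows the usage guide above, and the URN and description are illustrative, so double-check both against your DataHub version:

curl -X POST 'http://localhost:8080/openapi/entities/v1/' \
  -H 'Content-Type: application/json' \
  --data-raw '[{
    "aspect": {
      "__type": "DatasetProperties",
      "description": "table-level description"
    },
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:bigquery,bigquery-schema-data.covid19,QA)"
  }]'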

13. Pending

    Learning DataHub was equal parts joy and pain. The joy was picking Docker back up, going from stranger to familiar to fluent. It was my first time installing and learning, from zero, a piece of software with almost no resources online; every question had to be asked on Slack, so thanks again to the warm-hearted community. They really do work hard, shipping roughly three releases a month. Pity that my feature request still hasn't been implemented. Haha.

    For an open-source product it is genuinely impressive: the UI is modern and the performance is strong. But in many places the granularity is not fine enough, and while it supports HANA it does not support SAP itself, so in the end we went back to SAP Information Steward.


SAP IS website: https://www.sap.com/products/technology-platform/data-profiling-steward.html
SAP IS document website: https://help.sap.com/docs/SAP_INFORMATION_STEWARD
