Getting Started with DataX and DataX Web

1. Introduction to DataX 3.0
DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable, efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

  • Design philosophy
    To solve the heterogeneous-source synchronization problem, DataX replaces the complex mesh of point-to-point sync links with a star topology: DataX sits in the middle as the transport hub connecting every data source. To support a new data source, you only need to connect it to DataX, after which it can synchronize seamlessly with all the sources already on board.
  • Current adoption
    DataX is used widely across Alibaba Group, where it carries all offline big-data synchronization and has been running stably for six years. It currently completes more than 80,000 sync jobs per day, transferring over 300 TB of data daily.

2. DataX 3.0 Framework Design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins that slot into the synchronization framework.

  • Reader: the data collection module. It reads data from the source and sends it to the Framework.
  • Writer: the data writing module. It continuously pulls data from the Framework and writes it to the target.
  • Framework: connects the Reader and the Writer, serving as the data transfer channel between them and handling the core concerns of buffering, flow control, concurrency, and data conversion.

For details, see the official documentation:
https://github.com/alibaba/DataX/blob/master/introduction.md

3. DataX Installation and Deployment

  • Recommended environment:
    Linux
    JDK (1.8 or later; 1.8 recommended)
    Python (Python 2.6.x recommended)
    Apache Maven 3.x (only needed to compile DataX from source)

  • Deployment
    Option 1: download the prebuilt DataX tool package directly
    http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

    Option 2: download the DataX source and compile it yourself
    https://github.com/alibaba/DataX
    Download the source:

$ git clone [email protected]:alibaba/DataX.git
Build with Maven:
$ cd  {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax/datax/.

4. DataX Example
Reader: MySQL (mysqlreader)
Writer: Hive (hdfswriter, writing into the table's storage directory on HDFS)

  • Prepare the source table and data in MySQL (see the DDL sketch after this list)
  • Create the matching table structure in Hive (also sketched below)
  • Build a JSON job file to drive the extraction (hdfs.json, shown after the DDL)
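A minimal DDL sketch for both tables, consistent with the column list in the job file below (id, test1, test2, test3) and with the rows queried back from Hive at the end of this section; the database and table names follow the JSON config, and the exact MySQL column types are assumptions:

-- MySQL source table (database `test`, table `datax`, matching the
-- jdbcUrl/table entries in the job config; types are inferred from the
-- hdfswriter column list):
CREATE TABLE datax (
    id    BIGINT PRIMARY KEY,
    test1 VARCHAR(32),
    test2 INT,
    test3 INT
) DEFAULT CHARSET=utf8mb4;

-- Sample rows matching the Hive query output at the end of this section:
INSERT INTO datax VALUES
    (1, '你好', 11, 111),
    (2, 'he',   22, 222),
    (3, 'li',   33, 333),
    (4, 'xu',   44, 444),
    (5, 'xiao', 55, 555),
    (6, 'xu',   66, 666);

-- Hive target table: TEXTFILE with a '\t' delimiter matches the hdfswriter
-- settings (fileType "text", fieldDelimiter "\t"), and the default warehouse
-- location for a table named datax is the configured path
-- /user/hive/warehouse/datax:
CREATE TABLE datax (
    id    BIGINT,
    test1 STRING,
    test2 INT,
    test3 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

The extraction job file (hdfs.json):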
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "*"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://100.73.13.37:3306/test"
                                ],
                                "table": [
                                    "datax"
                                ]
                            }
                        ],
                        "password": "datax",
                        "username": "datax"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "BIGINT"
                            },
                            {
                                "name": "test1",
                                "type": "VARCHAR"
                            },
                            {
                                "name": "test2",
                                "type": "INT"
                            },
                            {
                                "name": "test3",
                                "type": "INT"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://jxq-100-73-13-31:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "dataxtest",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/datax",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "2"
            }
        }
    }
}
Run the job with the DataX launcher. (Note that the reader is configured with "column": ["*"]; DataX warns about this in the log below, because a change to the source table's column count or types can affect correctness or make the job fail.)

$ python /datax/bin/datax.py hdfs.json

Script output:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2020-07-21 10:57:07.856 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-07-21 10:57:07.864 [main] INFO  Engine - the machine info  => 

        osInfo: Oracle Corporation 1.8 25.141-b15
        jvmInfo:        Linux amd64 3.10.0-693.5.2.el7.x86_64
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size                      
        PS Eden Space                  | 256.00MB                       | 256.00MB                       
        Code Cache                     | 240.00MB                       | 2.44MB                         
        Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
        PS Survivor Space              | 42.50MB                        | 42.50MB                        
        PS Old Gen                     | 683.00MB                       | 683.00MB                       
        Metaspace                      | -0.00MB                        | 0.00MB                         


2020-07-21 10:57:07.880 [main] INFO  Engine - 
{...
}

2020-07-21 10:57:07.901 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-07-21 10:57:07.903 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-07-21 10:57:07.903 [main] INFO  JobContainer - DataX jobContainer starts job.
2020-07-21 10:57:07.904 [main] INFO  JobContainer - Set jobId = 0
2020-07-21 10:57:08.237 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2020-07-21 10:57:08.239 [job-0] WARN  OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
Jul 21, 2020 10:57:08 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-07-21 10:57:09.225 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2020-07-21 10:57:09.226 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2020-07-21 10:57:09.226 [job-0] INFO  JobContainer - DataX Writer.Job [hdfswriter] do prepare work .
2020-07-21 10:57:09.321 [job-0] INFO  HdfsWriter$Job - 由于您配置了writeMode append, 写入前不做清理工作, [/user/hive/warehouse/datax] 目录下写入相应文件名前缀  [dataxtest] 的文件
2020-07-21 10:57:09.321 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2020-07-21 10:57:09.321 [job-0] INFO  JobContainer - Job set Channel-Number to 2 channels.
2020-07-21 10:57:09.326 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2020-07-21 10:57:09.328 [job-0] INFO  HdfsWriter$Job - begin do split...
2020-07-21 10:57:09.331 [job-0] INFO  HdfsWriter$Job - splited write file name:[hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77]
2020-07-21 10:57:09.331 [job-0] INFO  HdfsWriter$Job - end do split.
2020-07-21 10:57:09.331 [job-0] INFO  JobContainer - DataX Writer.Job [hdfswriter] splits to [1] tasks.
2020-07-21 10:57:09.345 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2020-07-21 10:57:09.347 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2020-07-21 10:57:09.349 [job-0] INFO  JobContainer - Running by standalone Mode.
2020-07-21 10:57:09.355 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-07-21 10:57:09.368 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-07-21 10:57:09.368 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2020-07-21 10:57:09.375 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-07-21 10:57:09.379 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select * from datax 
] jdbcUrl:[jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2020-07-21 10:57:09.403 [0-0-0-writer] INFO  HdfsWriter$Task - begin do write...
2020-07-21 10:57:09.403 [0-0-0-writer] INFO  HdfsWriter$Task - write to file : [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77]
2020-07-21 10:57:09.413 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select * from datax 
] jdbcUrl:[jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2020-07-21 10:57:09.745 [0-0-0-writer] INFO  HdfsWriter$Task - end do write
2020-07-21 10:57:09.776 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[402]ms
2020-07-21 10:57:09.777 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-07-21 10:57:19.364 [job-0] INFO  StandAloneJobContainerCommunicator - Total 6 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-07-21 10:57:19.365 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2020-07-21 10:57:19.365 [job-0] INFO  JobContainer - DataX Writer.Job [hdfswriter] do post work.
2020-07-21 10:57:19.365 [job-0] INFO  HdfsWriter$Job - start rename file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz] to file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz].
2020-07-21 10:57:19.372 [job-0] INFO  HdfsWriter$Job - finish rename file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz] to file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz].
2020-07-21 10:57:19.373 [job-0] INFO  HdfsWriter$Job - start delete tmp dir [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865] .
2020-07-21 10:57:19.389 [job-0] INFO  HdfsWriter$Job - finish delete tmp dir [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865] .
2020-07-21 10:57:19.390 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2020-07-21 10:57:19.390 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2020-07-21 10:57:19.391 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /data/lilin/datax/hook
2020-07-21 10:57:19.393 [job-0] INFO  JobContainer - 
         [total cpu info] => 
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
                -1.00%                         | -1.00%                         | -1.00%
                        

         [total gc info] => 
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
                 PS MarkSweep         | 1                  | 1                  | 1                  | 0.029s             | 0.029s             | 0.029s             
                 PS Scavenge          | 1                  | 1                  | 1                  | 0.018s             | 0.018s             | 0.018s             

2020-07-21 10:57:19.393 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-07-21 10:57:19.394 [job-0] INFO  StandAloneJobContainerCommunicator - Total 6 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-07-21 10:57:19.394 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2020-07-21 10:57:07
任务结束时刻                    : 2020-07-21 10:57:19
任务总计耗时                    :                 11s
任务平均流量                    :                5B/s
记录写入速度                    :              0rec/s
读出记录总数                    :                   6
读写失败总数                    :                   0

Verify the result in Hive:
hive> select * from datax;
OK
1       你好    11      111
2       he      22      222
3       li      33      333
4       xu      44      444
5       xiao    55      555
6       xu      66      666
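A quick consistency check is to compare record counts at both ends; the job summary above reports 6 records read with 0 failures. A minimal sketch, using the database/table names from the job config:

-- On the MySQL source:
SELECT COUNT(*) FROM test.datax;   -- expect 6

-- On the Hive target:
SELECT COUNT(*) FROM datax;        -- should match the source count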

5. DataX Web Standalone Installation

  • Getting the package:
    Option 1: download the official tar package
    https://pan.baidu.com/s/13yoqhGpD00I82K4lOYtQhg (extraction code: cpsk)
    Option 2: build from source; see the official documentation for details: https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md

  • Deployment:
    After unpacking, run the one-click install script:

tar -zxvf datax-web-2.1.2.tar.gz
cd datax-web-2.1.2/bin
sh install.sh --force
  • Database setup (installing MySQL itself is out of scope; refer to a standard MySQL installation guide). Create the datax_web database and load the schema:
mysql> create database datax_web;
mysql> use datax_web;
Database changed
mysql> source {install_path}/datax-web-2.1.2/bin/db/datax_web.sql
mysql> show tables;
+---------------------+
| Tables_in_datax_web |
+---------------------+
| job_group           |
| job_info            |
| job_jdbc_datasource |
| job_lock            |
| job_log             |
| job_log_report      |
| job_logglue         |
| job_permission      |
| job_project         |
| job_registry        |
| job_template        |
| job_user            |
+---------------------+
12 rows in set (0.00 sec)
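The 12 tables above confirm that the schema loaded. The login step in section 6 also relies on datax_web.sql seeding a default admin account; a quick sanity check (a sketch, since the exact seed contents depend on the datax_web.sql shipped with your version):

mysql> SELECT COUNT(*) FROM job_user;  -- expect at least 1 row (the seeded admin account)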
  • Modify the configuration file
    ./modules/datax-admin/conf/application.yml
server:
  port: 8080
  #port: ${server.port}
spring:
  # datasource
  datasource:
    username: datax
    password: datax
    url: jdbc:mysql://localhost:3306/datax_web?serverTimezone=Asia/Shanghai&useLegacyDatetimeCode=false&useSSL=false&nullNamePatternMatchesAll=true&useUnicode=true&characterEncoding=UTF-8
    driver-class-name: com.mysql.jdbc.Driver

Alternatively, the same settings can be made in the bootstrap.properties configuration file.
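The datasource settings above assume a MySQL account datax (password datax) with access to datax_web. If such an account does not exist yet, a minimal sketch to create it (hypothetical; restrict the host mask and choose a stronger password in real deployments):

mysql> CREATE USER 'datax'@'%' IDENTIFIED BY 'datax';
mysql> GRANT ALL PRIVILEGES ON datax_web.* TO 'datax'@'%';
mysql> FLUSH PRIVILEGES;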
  • Modify the ./modules/datax-executor/conf/application.yml configuration file
# web port
server:
  port: ${server.port}
  #port: 8081

# log config
logging:
  config: classpath:logback.xml
  path: ${data.path}/applogs/executor/jobhandler
  #path: ./data/applogs/executor/jobhandler

datax:
  job:
    admin:
      ### datax admin address list, such as "http://address" or "http://address01,http://address02"
      #addresses: http://127.0.0.1:8080
      addresses: http://127.0.0.1:${datax.admin.port}
    executor:
      appname: datax-executor
      ip:
      port: 9999
      #port: ${executor.port:9999}
      ### job log path
      logpath: ./data/applogs/executor/jobhandler
      #logpath: ${data.path}/applogs/executor/jobhandler
      ### job log retention days
      logretentiondays: 30
    ### job, access token
    accessToken:

  executor:
    #jsonpath: D:\\temp\\executor\\json\\
    jsonpath: /data/lilin/datax/bin

  #pypath: F:\tools\datax\bin\datax.py
  pypath: /data/lilin/datax/bin/datax.py
  • Run the startup script:
cd ./datax-web-2.1.2/bin
sh start-all.sh
[root@jxq-100-73-13-3 bin]# jps
7428 DataXAdminApplication
7704 DataXExecutorApplication

After a successful start, the DataXAdminApplication and DataXExecutorApplication processes appear (as in the jps output above). If startup fails, check the logs in modules/datax-admin/bin/console.out or modules/datax-executor/bin/console.out.

6. Running the Web UI

http://$IP:9527/index.html
Log in with username admin and password 123456 to access the system. (The port in the URL must match server.port in datax-admin's configuration; the application.yml example above sets 8080, so adjust the URL accordingly if you changed it.)

References:
DataX official installation guide:
https://github.com/alibaba/DataX/blob/master/userGuid.md
DataX Web official deployment guide:
https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md
DataX 3.0 official introduction:
https://github.com/alibaba/DataX/blob/master/introduction.md
Community write-ups:
https://segmentfault.com/a/1190000022182167?utm_source=tag-newest
https://www.oschina.net/search?scope=blog&q=datax
