1. DataX 3.0 Introduction
DataX is an offline data synchronization tool for heterogeneous data sources. It aims to provide stable and efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
2. DataX 3.0 Framework Design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins that plug into the synchronization framework.
For details, see the official documentation:
https://github.com/alibaba/DataX/blob/master/introduction.md
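Concretely, a DataX job is described by a single JSON file that pairs one Reader with one Writer under job.content and sets concurrency under job.setting. A minimal skeleton (plugin names and parameters depend on the plugins used; a full, concrete example appears in section 4):
{
  "job": {
    "content": [
      {
        "reader": { "name": "<readerPluginName>", "parameter": { } },
        "writer": { "name": "<writerPluginName>", "parameter": { } }
      }
    ],
    "setting": {
      "speed": { "channel": 1 }
    }
  }
}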
3. DataX Installation and Deployment
Recommended environment:
Linux
JDK (1.8 or later; 1.8 recommended)
Python (Python 2.6.x recommended)
Apache Maven 3.x (to compile DataX)
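You can sanity-check these prerequisites first:
$ java -version
$ python -V
$ mvn -v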
Deployment
Option 1: download the prebuilt DataX tool package directly
http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
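For example, fetch and unpack it with standard tools:
$ wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
$ tar -zxvf datax.tar.gz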
Option 2: download the DataX source code and build it yourself
https://github.com/alibaba/DataX
Download the DataX source code:
$ git clone [email protected]:alibaba/DataX.git
Build the package with Maven:
$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true
After a successful build, the DataX package is located at {DataX_source_code_home}/target/datax/datax/
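Either way, the installation can be verified with the self-check job that ships in the package's job/ directory (a stream-to-stream sample; here {datax_home} stands for your DataX install directory):
$ cd {datax_home}
$ python bin/datax.py job/job.json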
4. DataX Example
Reader: mysqlreader (MySQL)
Writer: hdfswriter (writing into a Hive table's HDFS warehouse directory)
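For reference, assume a MySQL source table in the test database shaped like the data shown later in this walkthrough (a hypothetical DDL, for illustration only):
mysql> CREATE TABLE datax (id BIGINT, test1 VARCHAR(50), test2 INT, test3 INT);
mysql> INSERT INTO datax VALUES (1, '你好', 11, 111);
The full job configuration then looks as follows: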
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "column": [
              "*"
            ],
            "connection": [
              {
                "jdbcUrl": [
                  "jdbc:mysql://100.73.13.37:3306/test"
                ],
                "table": [
                  "datax"
                ]
              }
            ],
            "password": "datax",
            "username": "datax"
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "column": [
              {
                "name": "id",
                "type": "BIGINT"
              },
              {
                "name": "test1",
                "type": "VARCHAR"
              },
              {
                "name": "test2",
                "type": "INT"
              },
              {
                "name": "test3",
                "type": "INT"
              }
            ],
            "compress": "gzip",
            "defaultFS": "hdfs://jxq-100-73-13-31:8020",
            "fieldDelimiter": "\t",
            "fileName": "dataxtest",
            "fileType": "text",
            "path": "/user/hive/warehouse/datax",
            "writeMode": "append"
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": "2"
      }
    }
  }
}
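Note that hdfswriter only writes files into the target HDFS path; the Hive table must already exist over that path. A sketch of a matching DDL, assuming a plain text table delimited by \t (the writer's VARCHAR maps to Hive STRING here):
hive> CREATE TABLE datax (id BIGINT, test1 STRING, test2 INT, test3 INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE
    > LOCATION '/user/hive/warehouse/datax';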
Save the job configuration as hdfs.json and launch it with the datax.py script:
$ python /datax/bin/datax.py hdfs.json
Script output:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2020-07-21 10:57:07.856 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-07-21 10:57:07.864 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.141-b15
jvmInfo: Linux amd64 3.10.0-693.5.2.el7.x86_64
cpu num: 4
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2020-07-21 10:57:07.880 [main] INFO Engine -
{...
}
2020-07-21 10:57:07.901 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-07-21 10:57:07.903 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-07-21 10:57:07.903 [main] INFO JobContainer - DataX jobContainer starts job.
2020-07-21 10:57:07.904 [main] INFO JobContainer - Set jobId = 0
2020-07-21 10:57:08.237 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2020-07-21 10:57:08.239 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
Jul 21, 2020 10:57:08 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-07-21 10:57:09.225 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-07-21 10:57:09.226 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2020-07-21 10:57:09.226 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do prepare work .
2020-07-21 10:57:09.321 [job-0] INFO HdfsWriter$Job - 由于您配置了writeMode append, 写入前不做清理工作, [/user/hive/warehouse/datax] 目录下写入相应文件名前缀 [dataxtest] 的文件
2020-07-21 10:57:09.321 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-07-21 10:57:09.321 [job-0] INFO JobContainer - Job set Channel-Number to 2 channels.
2020-07-21 10:57:09.326 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2020-07-21 10:57:09.328 [job-0] INFO HdfsWriter$Job - begin do split...
2020-07-21 10:57:09.331 [job-0] INFO HdfsWriter$Job - splited write file name:[hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77]
2020-07-21 10:57:09.331 [job-0] INFO HdfsWriter$Job - end do split.
2020-07-21 10:57:09.331 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] splits to [1] tasks.
2020-07-21 10:57:09.345 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2020-07-21 10:57:09.347 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-07-21 10:57:09.349 [job-0] INFO JobContainer - Running by standalone Mode.
2020-07-21 10:57:09.355 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-07-21 10:57:09.368 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-07-21 10:57:09.368 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-07-21 10:57:09.375 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-07-21 10:57:09.379 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select * from datax
] jdbcUrl:[jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2020-07-21 10:57:09.403 [0-0-0-writer] INFO HdfsWriter$Task - begin do write...
2020-07-21 10:57:09.403 [0-0-0-writer] INFO HdfsWriter$Task - write to file : [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77]
2020-07-21 10:57:09.413 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select * from datax
] jdbcUrl:[jdbc:mysql://100.73.13.37:3306/test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2020-07-21 10:57:09.745 [0-0-0-writer] INFO HdfsWriter$Task - end do write
2020-07-21 10:57:09.776 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[402]ms
2020-07-21 10:57:09.777 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-07-21 10:57:19.364 [job-0] INFO StandAloneJobContainerCommunicator - Total 6 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-07-21 10:57:19.365 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2020-07-21 10:57:19.365 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do post work.
2020-07-21 10:57:19.365 [job-0] INFO HdfsWriter$Job - start rename file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz] to file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz].
2020-07-21 10:57:19.372 [job-0] INFO HdfsWriter$Job - finish rename file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz] to file [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax/dataxtest__8813b2e1_2cd2_45fa_a81d_a3f762cb2b77.gz].
2020-07-21 10:57:19.373 [job-0] INFO HdfsWriter$Job - start delete tmp dir [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865] .
2020-07-21 10:57:19.389 [job-0] INFO HdfsWriter$Job - finish delete tmp dir [hdfs://jxq-100-73-13-31:8020/user/hive/warehouse/datax__3ee8db03_9653_4f5f_bdba_82e1459a7865] .
2020-07-21 10:57:19.390 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work.
2020-07-21 10:57:19.390 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2020-07-21 10:57:19.391 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /data/lilin/datax/hook
2020-07-21 10:57:19.393 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.029s | 0.029s | 0.029s
PS Scavenge | 1 | 1 | 1 | 0.018s | 0.018s | 0.018s
2020-07-21 10:57:19.393 [job-0] INFO JobContainer - PerfTrace not enable!
2020-07-21 10:57:19.394 [job-0] INFO StandAloneJobContainerCommunicator - Total 6 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-07-21 10:57:19.394 [job-0] INFO JobContainer -
任务启动时刻 : 2020-07-21 10:57:07
任务结束时刻 : 2020-07-21 10:57:19
任务总计耗时 : 11s
任务平均流量 : 5B/s
记录写入速度 : 0rec/s
读出记录总数 : 6
读写失败总数 : 0
Querying the Hive table confirms that all six records arrived:
hive> select * from datax;
OK
1 你好 11 111
2 he 22 222
3 li 33 333
4 xu 44 444
5 xiao 55 555
6 xu 66 666
5. DataX Web Standalone Installation
Prepare the installation package:
Option 1: download the official tar package
https://pan.baidu.com/s/13yoqhGpD00I82K4lOYtQhg
Extraction code: cpsk
Option 2: build and package from source, as sketched below; for the detailed requirements see the official documentation: https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md
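A minimal build sketch, assuming the standard Maven workflow described in that document (the installable tarball is produced under the project's build/ directory):
$ git clone https://github.com/WeiYe-Jing/datax-web.git
$ cd datax-web
$ mvn clean install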
Deployment:
Extract the package, then run the one-click install script:
tar -zxvf datax-web-2.1.2.tar.gz
cd datax-web-2.1.2/bin
sh install.sh --force
Initialize the metadata database (the database name must be datax_web, matching the import script and the datasource URL configured below):
mysql> create database datax_web;
mysql> use datax_web
mysql> source {datax_web_home}/datax-web-2.1.2/bin/db/datax_web.sql
Then verify the import:
mysql> use datax_web
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
+---------------------+
| Tables_in_datax_web |
+---------------------+
| job_group |
| job_info |
| job_jdbc_datasource |
| job_lock |
| job_log |
| job_log_report |
| job_logglue |
| job_permission |
| job_project |
| job_registry |
| job_template |
| job_user |
+---------------------+
12 rows in set (0.00 sec)
Once the tables are in place, edit the datax-admin configuration (application.yml), setting the web port and the MySQL datasource:
server:
  port: 8080
  #port: ${server.port}
spring:
  # datasource
  datasource:
    username: datax
    password: datax
    url: jdbc:mysql://localhost:3306/datax_web?serverTimezone=Asia/Shanghai&useLegacyDatetimeCode=false&useSSL=false&nullNamePatternMatchesAll=true&useUnicode=true&characterEncoding=UTF-8
    driver-class-name: com.mysql.jdbc.Driver
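The credentials above must belong to a MySQL account with access to datax_web; a hypothetical grant (tighten the host and privileges for production use):
mysql> CREATE USER 'datax'@'%' IDENTIFIED BY 'datax';
mysql> GRANT ALL PRIVILEGES ON datax_web.* TO 'datax'@'%';
mysql> FLUSH PRIVILEGES;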
Alternatively, the same settings can be changed in the bootstrap.properties configuration file. Next, configure the datax-executor module (application.yml), pointing it at the admin address and at the local DataX installation:
# web port
server:
  port: ${server.port}
  #port: 8081
# log config
logging:
  config: classpath:logback.xml
  path: ${data.path}/applogs/executor/jobhandler
  #path: ./data/applogs/executor/jobhandler
datax:
  job:
    admin:
      ### datax admin address list, such as "http://address" or "http://address01,http://address02"
      #addresses: http://127.0.0.1:8080
      addresses: http://127.0.0.1:${datax.admin.port}
    executor:
      appname: datax-executor
      ip:
      port: 9999
      #port: ${executor.port:9999}
      ### job log path
      logpath: ./data/applogs/executor/jobhandler
      #logpath: ${data.path}/applogs/executor/jobhandler
      ### job log retention days
      logretentiondays: 30
    ### job, access token
    accessToken:
  executor:
    #jsonpath: D:\\temp\\executor\\json\\
    jsonpath: /data/lilin/datax/bin
    #pypath: F:\tools\datax\bin\datax.py
    pypath: /data/lilin/datax/bin/datax.py
Finally, start all services:
cd ./datax-web-2.1.2/bin
sh start-all.sh
[root@jxq-100-73-13-3 bin]# jps
7428 DataXAdminApplication
7704 DataXExecutorApplication
After a successful start, the DataXAdminApplication and DataXExecutorApplication processes should both appear. If startup fails, check the logs: modules/datax-admin/bin/console.out or modules/datax-executor/bin/console.out
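As an additional check, confirm the admin web service responds over HTTP (assuming the default datax-web port 9527; adjust to your server.port setting):
$ curl -I http://127.0.0.1:9527/index.html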
6. Running
http://$IP:9527/index.html
Log in with username admin and password 123456 to access the system (adjust the port if you changed server.port above).
References:
DataX official installation guide:
https://github.com/alibaba/DataX/blob/master/userGuid.md
DataX Web official installation guide:
https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md
DataX 3.0 official introduction:
https://github.com/alibaba/DataX/blob/master/introduction.md
Community-contributed articles:
https://segmentfault.com/a/1190000022182167?utm_source=tag-newest
https://www.oschina.net/search?scope=blog&q=datax