DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), DRDS, and more.
As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in theory the framework can therefore synchronize between arbitrary data source types. The plugin system also forms an ecosystem: once a new data source is integrated, it can immediately exchange data with every existing data source.
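The Reader/Writer split can be pictured as a Unix pipeline. The sketch below is only an analogy (it borrows the example table and paths used later in this post), not how DataX is actually invoked:

```bash
# "Reader" dumps rows from the source; the pipe plays the role of DataX's
# channel; the "Writer" loads them into the target. DataX plugins split the
# work the same way, with the framework adding batching, retries, and
# statistics in between.
mysql -N -u root -p test_datax -e "SELECT word, cnt FROM test_table" \
  | hdfs dfs -put - /tmp/test_table.tsv
```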
See: DataX-Introduction
Download: DataX download address
Quick start: Quick Start
DataX already has a fairly comprehensive plugin ecosystem: mainstream RDBMSs, NoSQL stores, and big data computing systems are all supported. The currently supported data sources are listed in the table below; for details see the DataX data source reference guide.
| Type | Data Source | Reader | Writer | Docs |
|---|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Generic RDBMS (all relational databases) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | √ | √ | read, write |
| NoSQL data storage | OTS | √ | √ | read, write |
| | HBase 0.94 | √ | √ | read, write |
| | HBase 1.1 | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Hive | √ | √ | read, write |
| Unstructured data storage | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |
1. Download DataX and upload it to your server.
2. Unpack it, then run the bundled self-check job:
```
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
```
3. Once the self-check passes, create a your_job.json configuration file from the template to match your needs; here we use hive2mysql as the example. Note that reading a Hive table really means reading the files Hive stores on HDFS, which is why the job below uses hdfsreader.
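You can inspect those files first to confirm the path and see a few raw records (a sketch; the warehouse path and namenode address are the ones used in the job below):

```bash
# List the table's files under the Hive warehouse and peek at the data.
hdfs dfs -ls /apps/hive/warehouse/test_table
hdfs dfs -cat /apps/hive/warehouse/test_table/* | head -5
```

The hive2mysql.json configuration: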
```json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/apps/hive/warehouse/test_table/*",
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "column": [
                            { "index": 0, "type": "string" },
                            { "index": 1, "type": "long" }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "username",
                        "password": "password",
                        "column": [ "word", "cnt" ],
                        "session": [ "set session sql_mode='ANSI'" ],
                        "preSql": [ "delete from test_table" ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk",
                                "table": [ "test_table" ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```
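The writer's preSql runs `delete from test_table`, so the target table must already exist on the MySQL side. A minimal sketch of the DDL (the column types and lengths are assumptions; the original post does not show them):

```bash
# Create the target table before running the job. VARCHAR(255)/INT are
# assumed types for the word/cnt columns, not taken from the post.
mysql -h 192.168.10.116 -u username -p test_datax \
  -e "CREATE TABLE IF NOT EXISTS test_table (word VARCHAR(255), cnt INT);"
```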
4. Run the job:
```
[root@Master datax]# python ./bin/datax.py ./job/hive2mysql.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-08-29 23:19:46.674 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-29 23:19:46.691 [main] INFO  Engine - the machine info =>

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo: Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num: 1

    totalPhysicalMemory: -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names [Copy, MarkSweepCompact]

    MEMORY_NAME            | allocation_size | init_size
    Eden Space             | 273.06MB        | 273.06MB
    Code Cache             | 240.00MB        | 2.44MB
    Survivor Space         | 34.13MB         | 34.13MB
    Compressed Class Space | 1,024.00MB      | 0.00MB
    Metaspace              | -0.00MB         | 0.00MB
    Tenured Gen            | 682.69MB        | 682.69MB

2018-08-29 23:19:46.732 [main] INFO  Engine -
...
2018-08-29 23:20:00.489 [job-0] INFO  JobContainer -
    [total cpu info] =>
        averageCpu | maxDeltaCpu | minDeltaCpu
        -1.00%     | -1.00%      | -1.00%

    [total gc info] =>
        NAME             | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
        Copy             | 0            | 0               | 0               | 0.000s      | 0.000s         | 0.000s
        MarkSweepCompact | 1            | 1               | 1               | 0.071s      | 0.071s         | 0.071s

2018-08-29 23:20:00.490 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-29 23:20:00.499 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.285s | Percentage 100.00%
2018-08-29 23:20:00.500 [job-0] INFO  JobContainer -
    Job start time       : 2018-08-29 23:19:46
    Job end time         : 2018-08-29 23:20:00
    Total elapsed time   : 13s
    Average throughput   : 3.05KB/s
    Record write speed   : 93rec/s
    Total records read   : 939
    Read/write failures  : 0
```
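Since the job reports 939 records read with 0 failures, a quick count on the MySQL side should match (connection details from the example config):

```bash
mysql -h 192.168.10.116 -u username -p test_datax \
  -e "SELECT COUNT(*) FROM test_table;"
```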
**Important: the data Hive stores on HDFS is delimited by '\001' by default. Opening a file in vim shows the delimiter as ^A, but in the job configuration it must be written as "fieldDelimiter": "\u0001"!**
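If you are unsure which delimiter a file actually uses, dump the first record as characters (path from the example above); the \001 delimiter prints as 001 in the octal dump, the same byte vim renders as ^A:

```bash
hdfs dfs -cat /apps/hive/warehouse/test_table/* | head -1 | od -c | head -3
```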
Now the reverse direction, mysql2hive.
1. Create the target table in Hive (stored as a text file):
```
hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
OK
Time taken: 0.194 seconds
hive> select * from mysql_table limit 10;
OK
Time taken: 0.162 seconds
```
2. Write the mysql2hive.json configuration file. Note that the hdfswriter's "fieldDelimiter" below matches the ',' declared in the table DDL above:
```json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [ "word", "cnt" ],
                        "splitPk": "cnt",
                        "connection": [
                            {
                                "table": [ "test_table" ],
                                "jdbcUrl": [ "jdbc:mysql://192.168.10.116:3306/test_datax" ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "fileType": "text",
                        "path": "/apps/hive/warehouse/mysql_table",
                        "fileName": "mysql_table",
                        "column": [
                            { "name": "word", "type": "string" },
                            { "name": "cnt", "type": "int" }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": ",",
                        "compress": "gzip"
                    }
                }
            }
        ]
    }
}
```
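With "compress": "gzip", hdfswriter leaves gzip-compressed text files under the table's directory, which Hive reads transparently. You can confirm after the run (paths from the example; the exact file names carry a suffix DataX generates):

```bash
# `hdfs dfs -text` decompresses known codecs such as gzip when printing.
hdfs dfs -ls /apps/hive/warehouse/mysql_table
hdfs dfs -text /apps/hive/warehouse/mysql_table/mysql_table* | head -5
```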
3. Run the job:
```
[root@Master datax]# python ./bin/datax.py ./job/mysql2hive.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-08-30 01:03:36.399 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-30 01:03:36.415 [main] INFO  Engine - the machine info =>

    osInfo: Oracle Corporation 1.8 25.161-b12
    jvmInfo: Linux amd64 2.6.32-696.23.1.el6.x86_64
    cpu num: 1

    totalPhysicalMemory: -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names [Copy, MarkSweepCompact]

    MEMORY_NAME            | allocation_size | init_size
    Eden Space             | 273.06MB        | 273.06MB
    Code Cache             | 240.00MB        | 2.44MB
    Survivor Space         | 34.13MB         | 34.13MB
    Compressed Class Space | 1,024.00MB      | 0.00MB
    Metaspace              | -0.00MB         | 0.00MB
    Tenured Gen            | 682.69MB        | 682.69MB

2018-08-30 01:03:36.469 [main] INFO  Engine -
...
2018-08-30 01:03:52.094 [job-0] INFO  JobContainer -
    [total cpu info] =>
        averageCpu | maxDeltaCpu | minDeltaCpu
        -1.00%     | -1.00%      | -1.00%

    [total gc info] =>
        NAME             | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
        Copy             | 1            | 1               | 1               | 0.346s      | 0.346s         | 0.346s
        MarkSweepCompact | 1            | 1               | 1               | 0.206s      | 0.206s         | 0.206s

2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-30 01:03:52.096 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.975s | All Task WaitReaderTime 0.022s | Percentage 100.00%
2018-08-30 01:03:52.101 [job-0] INFO  JobContainer -
    Job start time       : 2018-08-30 01:03:36
    Job end time         : 2018-08-30 01:03:52
    Total elapsed time   : 15s
    Average throughput   : 1.90KB/s
    Record write speed   : 93rec/s
    Total records read   : 939
    Read/write failures  : 0
```
4. Check that the Hive table now contains data:
```
hive> select * from mysql_table limit 10;
OK
at-spi-python-1.28.1    1
mesa-dri-drivers-9.2    1
m4-1.4.13-5.el6.x86_    1
libXxf86vm-1.1.2-2.e    1
libuuid-2.17.2-12.14    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
lvm2-2.02.100-8.el6.    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 0.148 seconds, Fetched: 10 row(s)
hive> select * from mysql_table order by cnt desc limit 10;
Query ID = hdfs_20180830010430_68021b2b-9bd8-44a0-92df-dbe0c717053a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1535363634130_0005)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.63 s
--------------------------------------------------------------------------------
OK
Installing              918
warning:                2
***                     2
tcp_wrappers-libs-7.    1
m4-1.4.13-5.el6.x86_    1
httpd-2.2.15-29.el6.    1
libgweather-2.28.0-5    1
libXxf86vm-1.1.2-2.e    1
xmlrpc-c-1.16.24-121    1
parted-2.1-21.el6.x8    1
Time taken: 33.713 seconds, Fetched: 10 row(s)
```
Two errors hit during these experiments, and their causes:

Error 1:

```
2018-08-29 22:37:54.327 [job-0] ERROR JobContainer - Exception when job run
com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-01],
Description:[Failed to obtain table column information.] - Failed to fetch the
column metadata of table test_table. Please ask your DBA to verify the database
and table. - com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
Unknown column 'id' in 'field list'
```

Cause: a column name defined in "column": [ ] does not match the MySQL table.
Error 2:

```
According to DataX's analysis, the most likely cause of this job's failure is:
com.alibaba.datax.common.exception.DataXException: Code:[UnstructuredStorageReader-03],
Description:[An illegal parameter value was supplied.] - Only single-character
field delimiters are supported; the configured delimiter is: ['01']
```

Cause: a field delimiter problem; see the "Important" note above.
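The first error can be caught before running a job by comparing the "column" list in the config against the actual table definition (connection details from the example):

```bash
mysql -h 192.168.10.116 -u root -p test_datax -e "DESC test_table;"
```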
MysqlWriter converts DataX internal types to MySQL data types as follows:

| DataX internal type | MySQL data type |
|---|---|
| Long | int, tinyint, smallint, mediumint, bigint, year |
| Double | float, double, decimal |
| String | varchar, char, tinytext, text, mediumtext, longtext |
| Date | date, datetime, timestamp, time |
| Boolean | bit, bool |
| Bytes | tinyblob, mediumblob, blob, longblob, varbinary |

* Conversion for the bit type is currently undefined.
MysqlReader converts MySQL data types to DataX internal types as follows:

| DataX internal type | MySQL data type |
|---|---|
| Long | int, tinyint, smallint, mediumint, bigint |
| Double | float, double, decimal |
| String | varchar, char, tinytext, text, mediumtext, longtext, year |
| Date | date, datetime, timestamp, time |
| Boolean | bit, bool |
| Bytes | tinyblob, mediumblob, blob, longblob, varbinary |

Please note:

* Types other than those listed above are not supported.
* tinyint(1) is treated by DataX as an integer.
* year is treated by DataX as a string.
* bit is undefined behavior in DataX.
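As a worked illustration of the mappings above, every column in this hypothetical table falls into one of the supported types:

```bash
# BIGINT -> Long, DECIMAL -> Double, VARCHAR -> String,
# DATETIME -> Date, BOOL -> Boolean, BLOB -> Bytes.
mysql -u root -p test_datax -e "
  CREATE TABLE type_demo (
    id      BIGINT,
    price   DECIMAL(10,2),
    name    VARCHAR(64),
    ts      DATETIME,
    ok      BOOL,
    payload BLOB
  );"
```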