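Summary: an introduction to DataX 3.0, Alibaba Cloud's open-source offline synchronization tool.
I. DataX 3.0 Overview
DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable, efficient data synchronization between sources such as relational databases (MySQL, Oracle, and others), HDFS, Hive, ODPS, HBase, and FTP.
If you are not familiar with it, read this introduction first: https://developer.aliyun.com/article/59373
Source code: https://github.com/alibaba/DataX?spm=a2c6h.12873639.0.0.21084f64hM6IE9
DataX already has a fairly complete plugin system: mainstream RDBMS databases, NoSQL stores, and big-data computing systems are all supported. The currently supported data sources are listed in the table below.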
| Type | Data source | Reader (read) | Writer (write) | Docs |
|---|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Generic RDBMS (supports all relational databases) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | √ | √ | read, write |
| NoSQL data storage | OTS | √ | √ | read, write |
| | Hbase0.94 | √ | √ | read, write |
| | Hbase1.1 | √ | √ | read, write |
| | Phoenix4.x | √ | √ | read, write |
| | Phoenix5.x | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Hive | √ | √ | read, write |
| | Cassandra | √ | √ | read, write |
| Unstructured data storage | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |
| Time-series databases | OpenTSDB | √ | | read |
| | TSDB | √ | √ | read, write |
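The job configuration used here, test.json, reads from MySQL with mysqlreader and writes to Elasticsearch with elasticsearchwriter: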
{
  "job": {
    "setting": {
      "speed": {
        "channel": 2
      }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "datax",
            "password": "123456",
            "where": "updated_at>='${start_time} 00:00:00' and updated_at<='${end_time} 23:59:59'",
            "column": [
              "id",
              "app_id",
              "collection_phone",
              "transaction_number",
              "pay_amount",
              "if(auto_tags is null,'',replace(replace(replace(auto_tags,'[',''),']',''),'\"','')) as auto_tags",
              "if(manual_tags is null,'',replace(replace(replace(manual_tags,'[',''),']',''),'\"','')) as manual_tags",
              "if(latest_days_ordered_at is null,'',replace(replace(latest_days_ordered_at,'[',''),']','')) as latest_days_ordered_at",
              "if(latest_days_paid_at is null,'',replace(replace(latest_days_paid_at,'[',''),']','')) as latest_days_paid_at",
              "if(latest_days_visited_at is null,'',replace(replace(latest_days_visited_at,'[',''),']','')) as latest_days_visited_at",
              "latest_ordered_at",
              "visited_products",
              "ordered_products"
            ],
            "connection": [
              {
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db_user?com.mysql.jdbc.faultInjection.serverCharsetIndex=45"],
                "table": [
                  "user"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "elasticsearchwriter",
          "parameter": {
            "endpoint": "http://127.0.0.1:9200",
            "accessId": "elastic",
            "accessKey": "123456",
            "index": "user",
            "type": "traces",
            "settings": {"index": {"number_of_shards": 5, "number_of_replicas": 1}},
            "batchSize": 5000,
            "splitter": ",",
            "column": [
              {"name": "pk", "type": "id"},
              {"name": "app_id", "type": "keyword"},
              {"name": "collection_phone", "type": "keyword"},
              {"name": "transaction_number", "type": "integer"},
              {"name": "pay_amount", "type": "integer"},
              {"name": "auto_tags", "type": "keyword", "array": true},
              {"name": "manual_tags", "type": "keyword", "array": true},
              {"name": "latest_days_ordered_at", "type": "long", "array": true},
              {"name": "latest_days_paid_at", "type": "long", "array": true},
              {"name": "latest_days_visited_at", "type": "long", "array": true},
              {"name": "latest_ordered_at", "type": "long"},
              {"name": "visited_products", "type": "nested"},
              {"name": "ordered_products", "type": "nested"}
            ]
          }
        }
      }
    ]
  }
}
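To make the nested replace() calls in the reader columns concrete, here is a minimal Python sketch of what they appear to do. It assumes the tag columns hold JSON-style array text such as ["vip","new_user"]; that sample value and the helper names flatten_array_text and split_for_writer are hypothetical and not part of DataX. The brackets and double quotes are stripped so that only comma-separated text remains, which matches the writer's "splitter": "," setting for columns marked "array": true.

```python
def flatten_array_text(value):
    """Mimic if(col is null,'',replace(replace(replace(col,'[',''),']',''),'"',''))."""
    if value is None:
        return ""
    for ch in ("[", "]", '"'):
        value = value.replace(ch, "")
    return value


def split_for_writer(text, splitter=","):
    """Split flattened text into the list an array-typed column would carry."""
    return text.split(splitter) if text else []


if __name__ == "__main__":
    raw = '["vip","new_user"]'  # hypothetical stored value, not taken from the article
    flat = flatten_array_text(raw)
    print(flat)
    print(split_for_writer(flat))
```

The three latest_days_* columns in the job only strip the brackets, presumably because their elements are numbers without quotes.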
python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"
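The -p option in the command above passes -D definitions that fill the ${start_time} and ${end_time} placeholders in the job's where clause. For a scheduled daily run, the same command line could be assembled in Python as in the sketch below. The function name build_datax_command is hypothetical; the paths are the ones used in this article.

```python
import datetime


def build_datax_command(job_file, day, datax_py="/usr/local/datax/bin/datax.py"):
    """Return the argument list for syncing a single day, mirroring the command above."""
    stamp = day.strftime("%Y-%m-%d")
    params = "-Dstart_time={0} -Dend_time={0}".format(stamp)
    return ["python", datax_py, job_file, "-p", params]


if __name__ == "__main__":
    cmd = build_datax_command("./test.json", datetime.date(2020, 9, 2))
    print(" ".join(cmd))
    # To actually launch DataX (requires a working installation):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps the whole -p value, which contains a space, as a single argument without shell quoting.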
The run failed immediately with the following error:
2020-09-02 15:49:33.747 [main] WARN ConfigParser - 插件[mysqlreader,elasticsearchwriter]加载失败,1s后重试... Exception:Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
2020-09-02 15:49:34.765 [main] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-12], Description:[DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 插件加载失败,未完成指定插件加载:[elasticsearchwriter, mysqlreader]
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:142)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
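The log reports error code Framework-12, a plugin initialization error: loading did not complete for the plugins elasticsearchwriter and mysqlreader. Since it says the plugins could not be loaded, let's go and look at what is actually installed and let the facts speak.
mysqlreader is already there!
Oops, it looks like elasticsearchwriter really is missing. Let's quietly go and install it...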
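The manual check described above can also be scripted. The sketch below reads the plugin names from a job definition and reports which ones have no directory under the DataX installation. It assumes the layout plugin/reader/<name> and plugin/writer/<name>: the writer half matches the copy target used later in this article, while the reader half is assumed by symmetry. The function names are hypothetical.

```python
import os


def required_plugins(job):
    """Collect (kind, name) pairs for every reader and writer named in a job dict."""
    found = []
    for item in job["job"]["content"]:
        found.append(("reader", item["reader"]["name"]))
        found.append(("writer", item["writer"]["name"]))
    return found


def missing_plugins(job, datax_home):
    """Return the plugins that have no directory under <datax_home>/plugin/<kind>/."""
    return [
        (kind, name)
        for kind, name in required_plugins(job)
        if not os.path.isdir(os.path.join(datax_home, "plugin", kind, name))
    ]


if __name__ == "__main__":
    # Trimmed-down stand-in for test.json; only the plugin names matter here.
    job = {"job": {"content": [{
        "reader": {"name": "mysqlreader"},
        "writer": {"name": "elasticsearchwriter"},
    }]}}
    for kind, name in missing_plugins(job, "/usr/local/datax"):
        print("missing %s plugin: %s" % (kind, name))
```

In practice the job dict would come from json.load on the real job file rather than from an inline literal.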
// The original list contains every module; usually you build only the ones you need
common
core
transformer
mysqlreader
drdsreader
sqlserverreader
postgresqlreader
oraclereader
odpsreader
otsreader
otsstreamreader
txtfilereader
hdfsreader
streamreader
ossreader
ftpreader
mongodbreader
rdbmsreader
hbase11xreader
hbase094xreader
tsdbreader
opentsdbreader
cassandrareader
gdbreader
mysqlwriter
drdswriter
odpswriter
txtfilewriter
ftpwriter
hdfswriter
streamwriter
otswriter
oraclewriter
sqlserverwriter
postgresqlwriter
osswriter
mongodbwriter
adswriter
ocswriter
rdbmswriter
hbase11xwriter
hbase094xwriter
hbase11xsqlwriter
hbase11xsqlreader
elasticsearchwriter
tsdbwriter
adbpgwriter
gdbwriter
cassandrawriter
clickhousewriter
plugin-rdbms-util
plugin-unstructured-storage-util
hbase20xsqlreader
hbase20xsqlwriter
After the change:
// The original list contains every module; usually you build only the ones you need
common
core
transformer
mysqlreader
elasticsearchwriter
plugin-rdbms-util
plugin-unstructured-storage-util
hbase20xsqlreader
hbase20xsqlwriter
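The trimming shown above is done by hand in the article. As a rough sketch only, the same edit could be automated, assuming the module names live in <module> elements inside the <modules> section of a Maven pom.xml (the article does not say which file is edited, so this is an assumption). The function name keep_modules and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"


def keep_modules(pom_text, wanted):
    """Drop every <module> whose name is not in `wanted`; return the new XML text."""
    ET.register_namespace("", POM_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(pom_text)
    modules = root.find("{%s}modules" % POM_NS)
    for module in list(modules):  # iterate over a copy while removing
        if (module.text or "").strip() not in wanted:
            modules.remove(module)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<project xmlns="http://maven.apache.org/POM/4.0.0"><modules>'
        "<module>common</module><module>mysqlreader</module>"
        "<module>drdsreader</module><module>elasticsearchwriter</module>"
        "</modules></project>"
    )
    print(keep_modules(sample, {"common", "mysqlreader", "elasticsearchwriter"}))
```

Note that ElementTree discards XML comments when parsing, so a real pom.xml rewritten this way would lose its comments; editing the file by hand, as the article does, avoids that.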
mvn clean install -Dmaven.test.skip=true
cp -r /usr/local/DataX-master/elasticsearchwriter/target/datax/plugin/writer/elasticsearchwriter /usr/local/data/datax/datax/plugin/writer
python /usr/local/datax/bin/datax.py ./test.json -p "-Dstart_time=2020-09-02 -Dend_time=2020-09-02"