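A Quick Introduction to DataX
Overview
What is DataX
DataX is an open-source offline synchronization tool for heterogeneous data sources, developed by Alibaba. It aims to provide stable, efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
DataX design
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point sync links into a star-shaped topology, with DataX acting as the central transport layer that connects the data sources.
To add a new data source, you only need to connect it to DataX, and it can then synchronize seamlessly with every data source already supported.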
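To make the benefit of the star topology concrete, here is a rough sketch that counts what has to be built for N data sources. It assumes, as a simplification, that a full mesh needs one dedicated link per ordered source/target pair, while the star layout needs one reader and one writer plugin per source. The function names are illustrative and not part of DataX.

```python
def mesh_links(n):
    """Directed point-to-point links needed so each source can sync to every other."""
    return n * (n - 1)


def star_plugins(n):
    """Plugins needed in a star layout: one reader plus one writer per source."""
    return 2 * n


if __name__ == "__main__":
    # Compare both layouts for a few sizes.
    for n in (3, 6, 10):
        print(n, mesh_links(n), star_plugins(n))
```

Under these assumptions the mesh grows quadratically with the number of sources, while the star grows linearly.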
Framework design
How it works
Quick start
Official links
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source code: https://github.com/alibaba/DataX
Prerequisites
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
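Before installing, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of DataX: the helper name `missing_tools` is made up for illustration, and the script itself needs Python 3.3+ (for `shutil.which`), independent of whichever interpreter you later use to launch datax.py.

```python
import shutil


def missing_tools(required=("java", "python"), which=shutil.which):
    """Return the names from `required` that cannot be found on PATH."""
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("java and python were both found on PATH")
```

This only checks that the executables exist; it does not verify their versions.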
Installation
1) Upload the downloaded datax.tar.gz to /opt/software on the host other
[root@other software]$ ls datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[root@other software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Run the self-check script
[root@other ~]# cd /opt/module/datax/bin/
[root@other bin]# ll
total 40
-rwxr-xr-x 1 62265 users 8993 Nov 24 2017 datax.py
-rwxr-xr-x 1 62265 users 6906 Nov 24 2017 dxprof.py
-rwxr-xr-x 1 62265 users 16897 Nov 24 2017 perftrace.py
[root@other bin]# python datax.py /opt/module/datax/job/job.json
Usage examples
Reading from a stream and printing to the console
1) View the configuration template
[root@other bin]# python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [],
"sliceRecordCount": ""
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
[root@other bin]#
2) Write a configuration file based on the template
[root@other job]# cat stream2stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
[root@other job]#
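If you prefer to generate the job file rather than edit JSON by hand, the same configuration can be built as a Python dict and dumped. This is only a sketch: `build_stream_job` is a hypothetical helper, and its keys and values simply mirror stream2stream.json above.

```python
import json


def build_stream_job(record_count=10, channel=1):
    """Build a streamreader -> streamwriter job dict mirroring stream2stream.json."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string", "value": "hello,DataX"},
                            ],
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to a file such as stream2stream.json to use it.
    print(json.dumps(build_stream_job(), indent=4))
```

Building the config in code makes it easy to vary `sliceRecordCount` or `channel` without copying the whole file.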
3) Run the job
[root@other job]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/stream2stream.json
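To launch jobs from another script (for example a scheduler), the command above can be wrapped with `subprocess`. This is a minimal sketch: the helper names are made up for illustration, the install path is the one used in this post, and it assumes datax.py returns a non-zero exit code when a job fails.

```python
import subprocess

DATAX_HOME = "/opt/module/datax"  # install path used in this post


def datax_command(job_file, datax_home=DATAX_HOME, python_bin="python"):
    """Return the argument list that launches datax.py for one job file."""
    return [python_bin, datax_home + "/bin/datax.py", job_file]


def run_job(job_file):
    """Run the job and return the exit code of the datax.py process."""
    return subprocess.call(datax_command(job_file))
```

For example, `run_job("/opt/module/datax/job/stream2stream.json")` runs the job from this section and returns the process exit code.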
Oracle database
I installed Oracle directly with Docker; see my earlier blog post if you need the steps.
Create a user
Create a table and insert data:
SQL>create TABLE student(id INTEGER,name VARCHAR2(20));
SQL>insert into student values (1,'zhangsan');
SQL> select * from student;
ID NAME
---------- ----------------------------------------
1 zhangsan
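SQL differences between Oracle and MySQL
Item | Oracle | MySQL |
---|---|---|
Integer | number(N)/integer | int/integer |
Floating point | float | float/double |
String | varchar2(N) | varchar(N) |
NULL | '' is treated as NULL | NULL and '' are different |
Pagination | rownum | limit |
Double quotes ("") | Heavily restricted; generally avoided | Same as single quotes |
Price | Closed source, paid | Open source, free |
Auto-increment primary key | × | √ |
if not exists | × | √ |
auto_increment | × | √ |
create database | × | √ |
select * from table as t | × | √ |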
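As a small illustration of the type differences in the table, the sketch below rewrites an Oracle CREATE TABLE statement into a MySQL-flavoured one. It is an assumption-laden toy: it covers only `varchar2(N)` and single-argument `number(N)`, it maps every `number(N)` to `INT` regardless of precision, and `oracle_ddl_to_mysql` is a made-up helper, not a real dialect converter.

```python
import re

# Illustrative rewrites based on the table above; far from a full converter.
_TYPE_RULES = [
    (re.compile(r"\bVARCHAR2\s*\(", re.IGNORECASE), "VARCHAR("),
    (re.compile(r"\bNUMBER\s*\(\s*\d+\s*\)", re.IGNORECASE), "INT"),
]


def oracle_ddl_to_mysql(ddl):
    """Apply the simple type substitutions to an Oracle CREATE TABLE statement."""
    for pattern, replacement in _TYPE_RULES:
        ddl = pattern.sub(replacement, ddl)
    return ddl


if __name__ == "__main__":
    print(oracle_ddl_to_mysql("create TABLE student(id INTEGER,name VARCHAR2(20))"))
```

Applied to the student table from the Oracle section, only the `VARCHAR2(20)` column changes, since `INTEGER` is accepted by both databases.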
DataX example
Reading data from Oracle and writing it to MySQL
1) Create the table in MySQL
mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));
2) Write the DataX configuration file
[root@other job]# cat oracle2mysql.json
{
"job": {
"content": [
{
"reader": {
"name": "oraclereader",
"parameter": {
"column": ["*"],
"connection": [
{
"jdbcUrl": ["jdbc:oracle:thin:@192.168.1.121:1521:helowin"],
"table": ["student"]
}
],
"password": "123456",
"username": "dalianpai"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://192.168.1.121:3306/oracle",
"table": ["student"]
}
],
"password": "root",
"username": "root",
"writeMode": "insert"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
[root@other job]#
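One detail that is easy to miss in the config above: the reader's `jdbcUrl` is a JSON array, while the writer's `jdbcUrl` is a plain string. The sketch below checks a loaded job for that shape. It is based only on this example config, so treat the rule as an assumption rather than a guarantee for every plugin; `check_jdbc_shapes` is a hypothetical helper.

```python
import json


def check_jdbc_shapes(job):
    """Return a list of problems with the jdbcUrl fields of each content entry."""
    problems = []
    for i, item in enumerate(job["job"]["content"]):
        # Reader connections: jdbcUrl is expected to be a list of URLs.
        for conn in item["reader"]["parameter"].get("connection", []):
            if not isinstance(conn.get("jdbcUrl"), list):
                problems.append("content[%d]: reader jdbcUrl should be a list" % i)
        # Writer connections: jdbcUrl is expected to be a single string.
        for conn in item["writer"]["parameter"].get("connection", []):
            if not isinstance(conn.get("jdbcUrl"), str):
                problems.append("content[%d]: writer jdbcUrl should be a string" % i)
    return problems


def check_job_file(path):
    """Load a job JSON file and run the jdbcUrl shape check on it."""
    with open(path) as handle:
        return check_jdbc_shapes(json.load(handle))
```

An empty list from `check_job_file("/opt/module/datax/job/oracle2mysql.json")` means both fields have the expected shape.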
3) Run the command
/opt/module/datax/bin/datax.py /opt/module/datax/job/oracle2mysql.json
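Output:
Result:
Note: this is only a brief demo. My HDFS is installed in CDH and I did not want to start that many virtual machines, so I will dig further when I have time. datax-web seems friendlier and also provides a web UI.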