MySQL to Hive data import: DataX configuration and scripts

1. Create the table in Hive
create table test0604(
  CD_ID int,
  COMMEN string,
  COLUMN_NAME string,
  TYPE_NAME string,
  INTEGER_IDX int
)
PARTITIONED BY (`date` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC;
Note: the company requires ORC as the storage format in Hive.
For incremental loads into a partitioned table, the partition must be created in Hive first (manually or on a schedule); only then can DataX import the MySQL data into that partition.
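Whether it is created by hand or by the scheduled job in step 2, the partition has to exist before DataX writes into it. A quick way to confirm from the shell, assuming the table sits in the aries database suggested by the warehouse path in step 4:

hive -e "USE aries; SHOW PARTITIONS test0604;"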
2. Create the partition on a schedule by having Oozie run the SQL: ALTER TABLE test0604 ADD PARTITION (`date`='${partdt}');
(Oozie coordinator time functions are described at https://blog.csdn.net/nieson2012/article/details/70156012.)
For example, yesterday's date can be written as ${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), 'yyyyMMdd')}.
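If Oozie is not available, the partition can also be created from a plain shell job before the import runs. A minimal sketch, assuming the table lives in the aries database (as the warehouse path in step 4 suggests) and the hive CLI is on the PATH:

#!/bin/bash
# Create yesterday's partition ahead of the DataX load (idempotent thanks to IF NOT EXISTS)
yesterday=$(date -d "-1 day" +%Y%m%d)
hive -e "USE aries; ALTER TABLE test0604 ADD IF NOT EXISTS PARTITION (\`date\`='$yesterday');"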
3. Run the import script
#!/bin/bash
# yesterday's date, e.g. 20210604
yesterday=$(date -d "-1 day" +%Y%m%d)
echo "$yesterday"
# $1 is the MySQL table name; $yesterday becomes the partition directory name date=YYYYMMDD
python /opt/module/datax/bin/datax.py -p "-Dtable=$1 -Dyesterday=date=$yesterday" /opt/module/datax/job/test0604.json
(Note: this small demo takes two parameters: the MySQL table name, passed in when the script is invoked, and the date, produced by the shell command above.)
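A usage example, assuming the script above is saved as /opt/module/datax/job/mysql2hive.sh (a hypothetical name) and the source table in the test database is called columns_v2 (also hypothetical):

sh /opt/module/datax/job/mysql2hive.sh columns_v2
# if run on 2021-06-05, the last line of the script expands to:
# python /opt/module/datax/bin/datax.py -p "-Dtable=columns_v2 -Dyesterday=date=20210604" /opt/module/datax/job/test0604.json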

4. JSON configuration (reads from MySQL, writes to HDFS; on the MySQL side a querySql statement can be used instead of listing the whole table). Note that the writer's column list must correspond one-to-one with the columns read from MySQL; the partition column `date` is carried by the target path, not written as a data column.
{
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "CD_ID",
                            "COMMEN",
                            "COLUMN_NAME",
                            "TYPE_NAME",
                            "INTEGER_IDX"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://localhost:3306/test"],
                                "table": ["$table"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://devtest.node2.com:8020",
                        "fileType": "orc",
                        "path": "/data/user/hive/warehouse/aries.db/test0604/$yesterday",
                        "fileName": "Mytest0604",
                        "column": [
                            { "name": "CD_ID",        "type": "int" },
                            { "name": "COMMEN",       "type": "string" },
                            { "name": "COLUMN_NAME",  "type": "string" },
                            { "name": "TYPE_NAME",    "type": "string" },
                            { "name": "INTEGER_IDX",  "type": "int" }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "NONE"
                    }
                }
            }
        ]
    }
}
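After a run finishes, it is worth checking that the files landed in the expected partition directory and that Hive can read them through the partition created in step 2. A minimal check, assuming yesterday was 20210604:

# files written by DataX should appear under the partition directory
hdfs dfs -ls /data/user/hive/warehouse/aries.db/test0604/date=20210604
# the partition must already exist in the metastore for this query to return rows
hive -e "SELECT COUNT(*) FROM aries.test0604 WHERE \`date\`='20210604';"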

5. After writing the JSON configuration, run it through a JSON validator (an online one is fine) to make sure the syntax is correct.
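A quick local check works just as well, for example:

python -m json.tool /opt/module/datax/job/test0604.json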

Then run the shell script; remember to replace the paths in the script with the ones on your own server.
