A First Look at DataX, Alibaba's Offline Data Synchronization Tool

DataX is an offline synchronization framework for heterogeneous data sources; the whole sync process is driven by a plugin system. Reader plugins handle reading from the source, writer plugins handle writing to the target, and the framework sitting between them can host transformer plugins when data needs to be reshaped in flight.
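The reader → framework → writer flow can be sketched conceptually like this (DataX itself is written in Java; the function names below are illustrative stand-ins, not the real plugin API):

```python
# Conceptual sketch of the DataX plugin model. Names are illustrative.

def mongo_reader(records):
    """Reader plugin: pulls records from the source (stand-in for a MongoDB cursor)."""
    for rec in records:
        yield rec

def uppercase_transform(stream):
    """Optional transform plugin applied inside the framework."""
    for rec in stream:
        yield {k: v.upper() if isinstance(v, str) else v for k, v in rec.items()}

def hdfs_writer(stream, sink):
    """Writer plugin: pushes records to the target."""
    for rec in stream:
        sink.append(rec)

# The framework wires reader -> (transform) -> writer:
source = [{"cityid": "1", "searchstr": "datax"}]
sink = []
hdfs_writer(uppercase_transform(mongo_reader(source)), sink)
print(sink)  # [{'cityid': '1', 'searchstr': 'DATAX'}]
```

Swapping a different reader or writer into the chain is what lets DataX pair any supported source with any supported target.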

Sqoop only supports synchronization between relational databases and HDFS/Hive; DataX covers a much richer set of data sources.

The currently supported data sources are listed at: https://github.com/alibaba/DataX/wiki/DataX-all-data-channels

Usage:

$ tar zxvf datax.tar.gz
$ sudo chmod -R 755 {YOUR_DATAX_HOME}
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py ../job/job.json

Example JSON job configuration (MongoDB → HDFS/Hive):

mongotest.json

{
    "job": {
        "setting": {
            "speed": {
                "channel": "2"
            }
        },
        "content": [{
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": [""],
                        "userName": "",
                        "userPassword": "",
                        "dbName": "",
                        "collectionName": "",
                        "column": [{ "name": "cityid", "type": "string" }, { "name": "searchstr", "type": "string" }, { "name": "pv", "type": "string" } ] }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [{ "name": "cityid", "type": "string" }, { "name": "searchstr", "type": "int" }, { "name": "pv", "type": "int" } ],
                        "defaultFS": "hdfs://*",
                        "fieldDelimiter": "\t",
                        "fileName": "mongotest",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/temp.db/mongotest",
                        "writeMode": "append" }
                }
            }
        ]
    }
}
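DataX maps reader columns to writer columns by position, so it is easy to introduce an off-by-one when editing the two lists. A quick sanity check (my own helper, not a DataX feature) over a trimmed-down copy of the config above:

```python
import json

# Illustrative check: the reader and writer column lists of a job file
# should have the same names in the same order, since DataX pairs them
# by position. The config is trimmed to just the relevant fields.
job_text = """
{
  "reader": {"parameter": {"column": [
      {"name": "cityid", "type": "string"},
      {"name": "searchstr", "type": "string"},
      {"name": "pv", "type": "string"}]}},
  "writer": {"parameter": {"column": [
      {"name": "cityid", "type": "string"},
      {"name": "searchstr", "type": "string"},
      {"name": "pv", "type": "int"}]}}
}
"""
job = json.loads(job_text)
r_cols = [c["name"] for c in job["reader"]["parameter"]["column"]]
w_cols = [c["name"] for c in job["writer"]["parameter"]["column"]]
assert r_cols == w_cols, f"column order mismatch: {r_cols} vs {w_cols}"
print("columns aligned:", r_cols)
```

Note that the types may legitimately differ between the two sides (here pv is read as a string from MongoDB and written as an int for Hive); only the names and order must line up.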

Synchronization steps:

  1. Create the Hive table temp.mongotest
  2. Run: python {DATAX_HOME}/bin/datax.py ../mongotest.json
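The Hive table in step 1 has to match the hdfswriter settings: same columns, a tab field delimiter, text storage, and a warehouse path equal to the writer's path. A small sketch (table and column names taken from the example config; adjust the types to your data) that derives the matching DDL from those writer parameters:

```python
# Build the Hive DDL implied by the example's hdfswriter settings.
# Names come from the job config above; this is illustrative, not a DataX tool.
writer = {
    "column": [{"name": "cityid", "type": "string"},
               {"name": "searchstr", "type": "string"},
               {"name": "pv", "type": "int"}],
    "fieldDelimiter": "\t",
    "path": "/user/hive/warehouse/temp.db/mongotest",
}
cols = ",\n  ".join(f'{c["name"]} {c["type"]}' for c in writer["column"])
# Render the tab as the literal '\t' Hive expects in DDL text.
delim = writer["fieldDelimiter"].encode("unicode_escape").decode()
ddl = (
    f"CREATE TABLE temp.mongotest (\n  {cols}\n)\n"
    f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delim}'\n"
    f"STORED AS TEXTFILE"
)
print(ddl)
```

No LOCATION clause is needed here because the writer's path is exactly the default warehouse directory for a managed table named mongotest in database temp.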
