Transporter: An Elasticsearch Data Migration Tool

What is Transporter?

Transporter is a simple yet powerful data migration tool. By converting everything to an agnostic message format, it can easily move data of different formats between different data sources.

Transporter can convert and migrate data between different databases, and can also migrate plain-text files into a database. The component that connects Transporter to a data store is called an Adaptor. An Adaptor can be configured as a Source (the read side) or as a Sink (the write side). A typical Transporter setup consists of one Source and one Sink, connected by a pipeline that transforms and transports the data. Transporter also includes a set of transformers, either native or written as JavaScript functions, which filter and reshape the source data so it can be written correctly to the Sink.
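A minimal sketch of this model, reusing the adaptors and the omit transformer described later in this article (the URIs are environment-variable placeholders):

var source = mongodb({"uri": "${MONGODB_URI}"})            // Source adaptor (read side)
var sink = elasticsearch({"uri": "${ELASTICSEARCH_URI}"})  // Sink adaptor (write side)

// Source -> transformer -> Sink: drop a field, then write to Elasticsearch
t.Source(source).Transform(omit({"fields": ["internalid"]})).Save(sink)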

Transporter Commands

Transporter provides the following commands:

  • init - Configure Transporter; generates the pipeline.js configuration file
  • about - List available adaptors
  • run - Run Transporter
  • test - Test the Transporter configuration
  • xlog - Manage the commit log
  • offset - Manage the offsets for sinks (see xlog)

init

transporter init [source adaptor name] [sink adaptor name]

Running the init command generates a basic pipeline.js configuration file in the current directory, for example:

$ transporter init mongodb elasticsearch
Writing pipeline.js...
$ cat pipeline.js
var source = mongodb({
  "uri": "${MONGODB_URI}"
  // "timeout": "30s",
  // "tail": false,
  // "ssl": false,
  // "cacerts": ["/path/to/cert.pem"],
  // "wc": 1,
  // "fsync": false,
  // "bulk": false,
  // "collection_filters": "{}"
})

var sink = elasticsearch({
  "uri": "${ELASTICSEARCH_URI}"
  // "timeout": "10s", // defaults to 30s
  // "aws_access_key": "ABCDEF", // used for signing requests to AWS Elasticsearch service
  // "aws_access_secret": "ABCDEF" // used for signing requests to AWS Elasticsearch service
})

t.Source(source).Save(sink)
// t.Source("source", source).Save("sink", sink)
// t.Source("source", source, "/.*/").Save("sink", sink, "/.*/")

Edit pipeline.js to configure the source and the sink as needed.
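For example, a filled-in configuration pointing at local services might look like this (both URIs below are hypothetical placeholders; adjust them to your environment):

var source = mongodb({
  "uri": "mongodb://localhost:27017/mydb"   // hypothetical local MongoDB database
})

var sink = elasticsearch({
  "uri": "http://localhost:9200"            // hypothetical local Elasticsearch endpoint
})

t.Source(source).Save(sink)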

about

transporter about [adaptor name]

Lists all available adaptors; specifying an adaptor name shows that adaptor's configuration details, for example:

$ transporter about
elasticsearch - an elasticsearch sink adaptor
file - an adaptor that reads / writes files
mongodb - a mongodb adaptor that functions as both a source and a sink
postgres - a postgres adaptor that functions as both a source and a sink
rabbitmq - an adaptor that handles publish/subscribe messaging with RabbitMQ
rethinkdb - a rethinkdb adaptor that functions as both a source and a sink


$ transporter about rethinkdb
rethinkdb - a rethinkdb adaptor that functions as both a source and a sink

 Sample configuration:
{
   "uri": "${RETHINKDB_URI}"
  // "timeout": "30s",
  // "tail": false,
  // "ssl": false,
  // "cacerts": ["/path/to/cert.pem"]
}
$

run

transporter run [-log.level "info"] [<filename>]

Runs the given pipeline script; if no filename is specified, it defaults to pipeline.js.
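For example, assuming the generated pipeline.js references the ${MONGODB_URI} and ${ELASTICSEARCH_URI} environment variables, a run might look like this (both URIs are hypothetical placeholders):

$ export MONGODB_URI="mongodb://localhost:27017/mydb"
$ export ELASTICSEARCH_URI="http://localhost:9200"
$ transporter run pipeline.js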

test

transporter test [-log.level "info"] [<filename>]

test is typically used for debugging: it only establishes connections to the configured source and sink data stores and does not perform the actual data migration.

xlog

transporter xlog --log_dir=/path/to/log oldest|current|show [OFFSET]

offset

transporter offset --log_dir=/path/to/log list|show|mark|delete [SINK] [OFFSET]

switches

-log.level "info" - sets the log level; defaults to info, and can be set to debug or error.
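For example, to run a pipeline with verbose logging:

$ transporter run -log.level "debug" pipeline.js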

Adaptor

An adaptor's job is to read data from a source or write data to a sink. Transporter uses adaptors as its input and output media. Run transporter about to list the available adaptors.

elasticsearch

The elasticsearch adaptor functions only as a sink: it writes data to an Elasticsearch index as JSON documents.

MongoDB

The MongoDB adaptor functions as both a source and a sink. When writing, it can execute update and insert requests in batches via bulk mode.
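For example, bulk mode can be enabled through the bulk option that appears (commented out) in the generated configuration; a sketch:

var sink = mongodb({
  "uri": "${MONGODB_URI}",
  "bulk": true   // batch writes instead of issuing one request per document
})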

file

The file adaptor functions as both a source (reading files from disk) and a sink (writing files to disk). When writing, it serializes Transporter's internal Message data structure to a JSON string and writes it to the file; when reading, it expects each line of the file to be a JSON string, which it parses into Transporter's internal data structure (Message).
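A sketch of a file-to-Elasticsearch pipeline, assuming the file adaptor accepts a file:// URI and that data.json contains one JSON document per line (both the path and the URI scheme are assumptions here):

var source = file({
  "uri": "file:///path/to/data.json"   // assumed file:// URI; one JSON object per line
})

var sink = elasticsearch({
  "uri": "${ELASTICSEARCH_URI}"
})

t.Source(source).Save(sink)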

Beyond these, Transporter also ships adaptors for postgres, rethinkdb, and rabbitmq.

The Message Data Structure

A Message is the intermediate data structure that carries data from source to sink, enabling conversion between data stores that cannot exchange data directly. A message is a JavaScript object composed of four fields.

{
    "ns":"message.namespace",
    "ts":12345, // time represented in milliseconds since epoch
    "op":"insert",
    "data": {
        "id": "abcdef",
        "name": "hello world"
    }
}

data

Contains the document read from the source, as key/value JSON data.

ns

ns (namespace) contains the namespace string, which is matched against the namespace parameters configured in the Transporter pipeline.

op

The operation that this data should be used to reflect when being written by a sink. The op field is determined by the source when the message is read, and can be insert, update, delete, command, or noop.

ts

ts (timestamp) is the Unix epoch time reflecting when the message was created.

Transformers

There are two kinds of transformers: native and JavaScript functions. Native transformers are written in Go and built into Transporter.

JavaScript transformers are functions written by the user in JavaScript.

A basic pipeline with no transformer:

t.Source(source).Save(sink);

A transformer is added with the .Transform() function:

t.Source(source).Transform(transformer({"param":"value"})).Save(sink);

For example, add an omit transformer to drop the "internalid" field:

t.Source(source).Transform(omit({"fields":["internalid"]})).Save(sink);
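Transformers can also be chained, with each message flowing through them in order; a sketch combining the native transformers described below:

t.Source(source)
    .Transform(omit({"fields": ["internalid"]}))           // drop a field first
    .Transform(rename({"field_map": {"count": "total"}}))  // then rename another
    .Save(sink);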

Native transformer

omit

omit() - removes top-level fields from messages, filtering out fields that should not be migrated.

omit({"fields": ["name"]})

For example:

Input:

{
    "_id": 0,
    "name": "transporter",
    "type": "function"
}

omit filters out the "type" field:

omit({"fields":["type"]})

Output:

{
    "_id": 0,
    "name": "transporter"
}

pick

Selects the fields to keep for migration; the inverse of omit.

pick({"fields": ["name"]})

For example:

Input:

{
    "_id": 0,
    "name": "transporter",
    "type": "function"
}

pick selects the _id and name fields:

pick({"fields":["_id", "name"]})

Output:

{
    "_id": 0,
    "name": "transporter"
}

rename

rename() - renames input fields.

rename({"field_map": {"test":"renamed"}})

For example, given the input:

{
    "_id": 0,
    "name": "transporter",
    "type": "function",
    "count": 10
}

rename renames the count field to total:

rename({"field_map": {"count":"total"}})

Output:

{
    "_id": 0,
    "name": "transporter",
    "type": "function",
    "total": 10
}

skip

skip() - skips source records that do not satisfy the configured criteria, excluding them from the migration.

skip() evaluates the data against the configured criteria and determines whether the message should continue down the pipeline or be skipped. When evaluating the data, true results in the message being sent down the pipeline, and false results in the message being skipped.

skip({"field": "test", "operator": "==", "match": 10})

For example, given the input:

{
    "_id": 0,
    "name": "transporter",
    "type": "function",
    "count": 10
}

skip passes source records with count == 10 into the pipeline:

skip({"field": "count", "operator": "==", "match": 10})

Output:

{
    "_id": 0,
    "name": "transporter",
    "type": "function",
    "count": 10
}

skip passes source records with count > 20 into the pipeline:

skip({"field": "count", "operator": ">", "match": 20})

The record above, with count = 10, will therefore be skipped.

pretty

pretty pretty-prints the outgoing JSON data.
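A sketch of its use in a pipeline; the spaces option (indentation width) is an assumption based on the pretty transformer's own documentation, not something shown elsewhere in this article:

t.Source(source).Transform(pretty({"spaces": 2})).Save(sink);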

JavaScript transformer

A JavaScript transformer runs a user-supplied JavaScript script against each message. There are currently two JavaScript engines in Transporter: the older otto and the newer goja. The js() transformer function is actually an alias for the goja engine. The only external difference between the two is how the transform function is declared.

goja style

function transform(msg) {
    return msg
}

For example, given the following data:

{
    "ns":"message.namespace",
    "ts":12345, // time represented in milliseconds since epoch
    "op":"insert",
    "data": {
        "id": "abcdef",
        "name": "hello world"
    }
}

NOTE when working with data from MongoDB, the _id field will be represented in the following fashion:

{
    "ns":"message.namespace",
    "ts":12345, // time represented in milliseconds since epoch
    "op":"insert",
    "data": {
        "_id": {
            "$oid": "54a4420502a14b9641000001"
        },
        "name": "hello world"
    }
}

goja invokes the JS script given by filename:

goja({"filename": "/path/to/transform.js"})

For example, given the input:

{
    "_id": 0,
    "name": "transporter",
    "type": "function"
}

goja invokes the custom transform.js:

goja({"filename":"transform.js"})

transform.js is as follows:

function transform(doc) {
    // doc is the full message; the document fields live under doc["data"]
    doc["data"]["name_type"] = doc["data"]["name"] + " " + doc["data"]["type"];
    return doc;
}

Output:

{
    "_id": 0,
    "name": "transporter",
    "type": "function",
    "name_type": "transporter function"
}

See: https://github.com/compose/transporter/blob/master/function/gojajs/README.md

Otto style

module.exports = function(msg) {
    return msg;
};
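A fuller otto-style sketch mirroring the goja example above; the otto({"filename": ...}) invocation is assumed from the ottojs README linked below:

// Assumed invocation in pipeline.js: otto({"filename": "transform.js"})
module.exports = function(msg) {
    // same message shape as the goja example: document fields live under msg["data"]
    msg["data"]["name_type"] = msg["data"]["name"] + " " + msg["data"]["type"];
    return msg;
};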

See: https://github.com/compose/transporter/blob/master/function/ottojs/README.md

Transporter Configuration

Run transporter init mongodb elasticsearch to generate a basic pipeline.js configuration file (for migrating from MongoDB to Elasticsearch), or edit an existing pipeline.js directly (the filename is arbitrary).

pipeline.js:

var source = mongodb({
  "uri": "${MONGODB_URI}"
  // "timeout": "30s",
  // "tail": false,
  // "ssl": false,
  // "cacerts": ["/path/to/cert.pem"],
  // "wc": 1,
  // "fsync": false,
  // "bulk": false,
  // "collection_filters": "{}"
})

var sink = elasticsearch({
  "uri": "${ELASTICSEARCH_URI}"
  // "timeout": "10s", // defaults to 30s
  // "aws_access_key": "ABCDEF", // used for signing requests to AWS Elasticsearch service
  // "aws_access_secret": "ABCDEF" // used for signing requests to AWS Elasticsearch service
})

t.Source(source).Save(sink)
// t.Source("source", source).Save("sink", sink)
// t.Source("source", source, "/.*/").Save("sink", sink, "/.*/")

pipelines

var source = mongodb({
  "uri": "${MONGODB_URI}"
})

var sink = elasticsearch({
  "uri": "${ELASTICSEARCH_URI}"
})

t.Source("source", source, "/.*/").Save("sink", sink, "/.*/") // the default pipeline

t is the Transporter object; a pipeline is built by calling three JavaScript functions on t: Source(), Transform(), and Save().

Each function takes three arguments: name, adaptor, and namespace. In the example above, Transporter configures a source named "source", with the variable source as its adaptor and "/.*/" as its namespace; ".*" is a regular expression that matches all MongoDB collections. The default namespace is "/.*/".

namespace

The namespace is specified as a regular expression. Taking MongoDB as an example, Transporter migrates every collection whose name matches the namespace regex. To migrate exactly one MongoDB collection, say table, set the namespace to "/^table$/", i.e. starting with table (^) and ending with table ($). To migrate the collections whose names start with robo, set the Source() namespace to "/^robo.*/"; if MongoDB contains the collections robocop, robosaurus, and robbietherobot, then robocop and robosaurus will be migrated, while robbietherobot (which starts with robb, not robo) will not.
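For example, the robo prefix case above, written with the three-argument form:

t.Source("source", source, "/^robo.*/").Save("sink", sink, "/.*/")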

For Save() sinks and Transform() transformers, the namespace acts as a filter: an incoming message's namespace must match the sink's or transformer's namespace setting for the message to be processed by it. If no namespace is specified, sinks and transformers accept every message.
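A sketch of sink-side filtering, assuming a hypothetical users collection; only messages whose namespace matches /^users$/ are written by the sink:

t.Source("source", source, "/.*/").Save("sink", sink, "/^users$/")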

See: https://github.com/compose/transporter/wiki/Pipelines

Running with Docker

See: https://github.com/compose/transporter/wiki/Running-with-Docker

References

https://github.com/compose/transporter
