future_xiaowu

An Introduction to ES Reindex

References:

https://gaoming.blog.csdn.net/article/details/82734667

https://www.cnblogs.com/Ace-suiyuan008/p/9985249.html

https://www.cnblogs.com/siye1989/p/11561972.html

https://www.elastic.co/guide/en/elasticsearch/reference/6.5/docs-reindex.html

https://www.elastic.co/guide/en/elasticsearch/client/java-rest/6.5/java-rest-high-document-reindex.html

Official documentation (English originals):

Reindex API reference

RestHighLevelClient reference

The ES version used here is 6.5.0.

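Background:

1. When the data volume grows too large and the index was originally created with too few shards, indexing becomes slow and the shard count needs to be increased. Reindex into a new index with more shards is one option.

2. When a field's mapping needs to change but a large amount of data has already been indexed. In ES, a field's mapping cannot be modified once it has been defined and data has been indexed against it, and re-importing everything from the original data source into a new index is too time-consuming, so Reindex is worth considering here as well.

Reindex copies the documents of one index into another index, but it does not copy the source index's configuration such as mappings, shard count, or replica count, so the destination index should be created with the desired configuration beforehand. With index templates in place, any index whose name matches the template pattern picks up the template's configuration automatically, which makes reindexing much more convenient; time-based indices are a common example of this approach.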

A. Usage:

The most common form

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

This copies the documents in twitter into new_twitter.

The equivalent with the Java client:

ReindexRequest request = new ReindexRequest(); 
request.setSourceIndices("source1", "source2"); 
request.setDestIndex("dest");

More than one source index can be set; here documents are copied from source1 and source2 into dest.
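For readers who want to see what the client ends up sending, the following is a minimal JDK-only sketch that assembles the JSON body for POST _reindex with one or more source indices. The class name ReindexBody and its methods are hypothetical helpers written for this illustration; they are not part of the Elasticsearch client.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not an Elasticsearch API): builds a _reindex request body.
final class ReindexBody {
    // Wraps a value in double quotes, escaping backslashes and quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the JSON body for POST _reindex from a list of source indices.
    static String build(List<String> sources, String dest) {
        String sourceList = sources.stream()
                .map(ReindexBody::quote)
                .collect(Collectors.joining(", ", "[", "]"));
        return "{\"source\": {\"index\": " + sourceList + "}, "
                + "\"dest\": {\"index\": " + quote(dest) + "}}";
    }

    public static void main(String[] args) {
        // Same indices as the client snippet above.
        System.out.println(build(Arrays.asList("source1", "source2"), "dest"));
    }
}
```

The resulting string can be sent as the body of POST _reindex with any HTTP client.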

Example response:

{
  "took" : 147,
  "timed_out": false,
  "created": 120,
  "updated": 0,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 120,
  "failures" : [ ]
}

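_reindex copies a snapshot of one index into another index. By default, a document whose _id already exists in the destination is overwritten (this normally only happens when, for example, two indices are copied into one). The version_type parameter controls this behavior: leaving it unset or setting it to internal makes ES simply dump the documents into the destination index, overwriting any document that has the same type and id. Setting it to external preserves the version from the source, creates missing documents, and updates only documents whose version in the destination is older than in the source.

version_type: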

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

request.setDestVersionType(VersionType.INTERNAL); // matches the request above; use VersionType.EXTERNAL to preserve source versions
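To make the external versioning rule concrete, here is a small in-memory model of the decision ES makes for each document when version_type is external. It is a simplified sketch, not Elasticsearch code: ExternalVersioning and its map of id to version are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of version_type=external (hypothetical, not ES code).
final class ExternalVersioning {
    // Returns true if the write is accepted; false models a version conflict.
    static boolean write(Map<String, Long> destVersions, String id, long sourceVersion) {
        Long existing = destVersions.get(id);
        if (existing != null && existing >= sourceVersion) {
            return false; // destination copy is the same age or newer
        }
        destVersions.put(id, sourceVersion); // create, or update an older copy
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> dest = new HashMap<>();
        dest.put("1", 5L);
        System.out.println(write(dest, "1", 3L)); // older source version
        System.out.println(write(dest, "1", 7L)); // newer source version
        System.out.println(write(dest, "2", 1L)); // document missing in dest
    }
}
```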

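op_type:

Setting op_type to create makes _reindex create only the documents that are missing from the destination index; every document that already exists there causes a version conflict.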

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

request.setDestOpType("create");

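conflicts:

By default, a version conflict aborts the _reindex process. Setting conflicts to "proceed" makes it count the conflict and carry on with the remaining documents, much like continue in a loop.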

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

request.setConflicts("proceed");
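The interaction between op_type=create and conflicts can be sketched with two in-memory maps standing in for the source and destination indices. This is a simplified per-document model under stated assumptions: real ES writes in bulk batches, so documents in the same batch as a conflict may still be written before an abort. CreateOnlyCopy and Result are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of op_type=create with conflicts=abort/proceed (not ES code).
final class CreateOnlyCopy {
    // Counters mirroring the "created" and "version_conflicts" response fields.
    static final class Result {
        int created;
        int versionConflicts;
        boolean aborted;
    }

    // Copies source into dest; an id already present in dest is a version conflict.
    static Result copy(Map<String, String> source, Map<String, String> dest, boolean proceed) {
        Result r = new Result();
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (dest.containsKey(e.getKey())) {
                r.versionConflicts++;
                if (!proceed) {   // default behavior: stop at the conflict
                    r.aborted = true;
                    break;
                }
                continue;         // conflicts=proceed: count it and move on
            }
            dest.put(e.getKey(), e.getValue());
            r.created++;
        }
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("1", "a");
        source.put("2", "b");
        source.put("3", "c");
        Map<String, String> dest = new LinkedHashMap<>();
        dest.put("2", "existing");
        Result r = copy(source, dest, true);
        System.out.println(r.created + " created, " + r.versionConflicts + " conflicts");
    }
}
```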

query and type:

Reindex only the documents that match a query and/or belong to a given type:

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

request.setSourceDocTypes("_doc");
request.setSourceQuery(new TermQueryBuilder("user", "kimchy"));

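source and type:

As above, and both index and type in source accept lists. The request below copies documents of type _doc and post from the indices twitter and blog.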

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together",
    "type": "_doc"
  }
}

size:

This parameter limits the number of documents processed. The following example copies only a single document:

POST _reindex
{
  "size": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

request.setSize(1);

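size and sort:

Combining sort with size copies only a chosen subset of the data. The following example copies the first 10000 documents of twitter ordered by the date field descending. Because _reindex is built on scroll, sort makes it less efficient; where possible, use a query instead of combining sort and size.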

POST _reindex
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

request.addSortField("field1", SortOrder.DESC); 
request.addSortField("field2", SortOrder.ASC);

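sourceBatchSize:

By default _reindex processes documents in scroll batches of 1000. The size field inside source changes the batch size. (The routing value "=cat" in the example below is unrelated to batching; it sets the routing of every copied document to cat.)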

POST _reindex
{
  "source": {
    "index": "source",
    "size": 100
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}

request.setSourceBatchSize(100);

script:

Like _update_by_query, _reindex supports a script that modifies the documents being copied:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

request.setScript(
    new Script(
        ScriptType.INLINE, "painless",
        "if (ctx._source.user == 'kimchy') {ctx._source.likes++;}",
        Collections.emptyMap()));

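In a script, ctx.op can only be set to noop or delete. With noop the document is not indexed into the destination and is counted in the noops field of the response; with delete the matching document is deleted from the destination index and is counted in the deleted field of the response.

_id
_type
_index
_version
_routing

These metadata fields can all be changed by the script; see the official documentation for details.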

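pipeline:

Documents can be passed through an ingest pipeline by naming it in dest; see the ingest documentation for how pipelines themselves work.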

POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}

B. Reindex from a remote cluster

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

request.setRemoteInfo(
    new RemoteInfo(
        "https", "localhost", 9002, null, new BytesArray(new MatchAllQueryBuilder().toString()),
        "user", "pass", Collections.emptyMap(), new TimeValue(100, TimeUnit.MILLISECONDS),
        new TimeValue(100, TimeUnit.SECONDS)
    )
);

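In the Java version the query is passed inside RemoteInfo. Building it with a query builder is not recommended here, because the remote cluster may run an older version that does not understand queries produced by a newer builder; writing the JSON query by hand is the safest option.

The host value must contain the scheme, host, and port. username and password are optional; when present, ES connects to the remote node using basic authentication. Use https when supplying them, otherwise the password is sent in plain text.

The remote host must also be whitelisted in elasticsearch.yml through reindex.remote.whitelist, for example: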

reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
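As a rough illustration of how such a whitelist behaves, the sketch below checks a host:port string against comma-separated entries in which * matches any run of characters. This is only an approximation written for this article; the exact matching rules inside Elasticsearch may differ, and WhitelistMatch is a hypothetical class.

```java
import java.util.regex.Pattern;

// Approximate model of reindex.remote.whitelist matching (not ES code).
final class WhitelistMatch {
    // Converts one whitelist entry into a regex where * matches any characters.
    static Pattern toPattern(String entry) {
        String[] parts = entry.trim().split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(".*");
            }
            sb.append(Pattern.quote(parts[i])); // literal text between wildcards
        }
        return Pattern.compile(sb.toString());
    }

    // Returns true if hostPort matches at least one comma-separated entry.
    static boolean allowed(String whitelist, String hostPort) {
        for (String entry : whitelist.split(",")) {
            if (toPattern(entry).matcher(hostPort).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String whitelist = "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*";
        System.out.println(allowed(whitelist, "127.0.10.5:9200"));
        System.out.println(allowed(whitelist, "otherhost:9300"));
    }
}
```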

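URL Parameters (reindex request settings)

The optional URL parameters are pretty, refresh, wait_for_completion, wait_for_active_shards, timeout, and requests_per_second.

1. refresh
The refresh parameter of the Index API refreshes only the shard that received the new data, whereas refresh on reindex refreshes every index the request wrote to.

2. wait_for_completion
When set to false, ES performs some preflight checks, launches the request, and returns a task that can be used with the Task API to cancel the operation or check its status. ES records the task as a document at .tasks/task/${taskId}.

3. wait_for_active_shards
As in the Bulk API, wait_for_active_shards controls how many copies of a shard must be active before reindexing proceeds, and timeout controls how long each write request waits for unavailable shards to become available.

4. requests_per_second
The throttle: the number of requests per second. It accepts any positive decimal number (1.4, 6, 1000, and so on); -1 disables throttling.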
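The official documentation describes the throttle as a wait inserted between batches: the target time for a batch is the batch size divided by requests_per_second, and the wait is that target minus the time already spent writing. The sketch below reproduces that arithmetic; ThrottleMath is a hypothetical helper, and treating any non-positive rate as "unthrottled" is an assumption made to cover the -1 case.

```java
// Arithmetic behind requests_per_second throttling (illustrative helper, not ES code).
final class ThrottleMath {
    // Seconds to wait after a batch so the average rate stays at requestsPerSecond.
    // A non-positive rate (such as -1) is treated as throttling disabled.
    static double waitSeconds(int batchSize, double requestsPerSecond, double writeSeconds) {
        if (requestsPerSecond <= 0) {
            return 0.0;
        }
        double targetSeconds = batchSize / requestsPerSecond;
        return Math.max(0.0, targetSeconds - writeSeconds);
    }

    public static void main(String[] args) {
        // Default batch of 1000 at 500 requests per second, with 0.5 s spent writing:
        // target = 1000 / 500 = 2 s, so the wait is 2 - 0.5 = 1.5 s.
        System.out.println(waitSeconds(1000, 500, 0.5));
    }
}
```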

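The response body looks like this:

{
  "took": 639,               // milliseconds from start to end of the whole operation
  "updated": 0,              // number of documents successfully updated
  "created": 123,            // number of documents successfully created
  "batches": 1,              // number of scroll batches processed
  "version_conflicts": 2,    // number of version conflicts encountered
  "retries": {               // retry counters
    "bulk": 0,               // bulk actions retried
    "search": 0              // search actions retried
  },
  "throttled_millis": 0,     // milliseconds slept to conform to requests_per_second
  "failures": [ ]            // failures, if any
}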

C. Task API operations:

Cancel a reindex operation

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel

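requests_per_second (rethrottling)

The requests_per_second value of a running reindex can be changed with the _rethrottle API. -1 disables throttling; any other positive number sets the new limit. A change that speeds the operation up takes effect immediately, while one that slows it down takes effect after the current batch completes, which prevents scroll timeouts.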

POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

Renaming a field during Reindex

Source index

POST test/_doc/1?refresh
{
  "text": "words words",
  "flag": "foo"
}

Run the reindex and rename the flag field

POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test2"
  },
  "script": {
    "source": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

Query the destination index

GET test2/_doc/1

{
  "found": true,
  "_id": "1",
  "_index": "test2",
  "_type": "_doc",
  "_version": 1,
  "_source": {
    "text": "words words",
    "tag": "foo"
  }
}

As shown, the flag field has been renamed to tag in the destination index.
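The rename script relies on remove returning the removed value, which is then stored under the new key. The same idea on a plain Java map looks like the sketch below; RenameField is a hypothetical helper that operates on a copy of the document source rather than on ctx._source.

```java
import java.util.HashMap;
import java.util.Map;

// Java analogue of: ctx._source.tag = ctx._source.remove("flag")
final class RenameField {
    // Returns a copy of source with the value under oldName moved to newName.
    // If oldName is absent, newName is set to null, as remove() returns null.
    static Map<String, Object> rename(Map<String, Object> source, String oldName, String newName) {
        Map<String, Object> copy = new HashMap<>(source);
        copy.put(newName, copy.remove(oldName));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "words words");
        doc.put("flag", "foo");
        System.out.println(rename(doc, "flag", "tag"));
    }
}
```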

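Slicing:

Slicing splits one large request into several smaller requests that run in parallel, which can make reindexing more efficient.

1. Manual slicing: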

POST _reindex
{
  "source": {
    "index": "my_index_name",
    "slice": {          // request for the first slice
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "my_index_name_new"
  }
}

POST _reindex
{
  "source": {
    "index": "my_index_name",
    "slice": {          // request for the second slice
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "my_index_name_new"
  }
}
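With manual slicing, the requests differ only in the slice id, so they are easy to generate. The sketch below builds one body per slice for ids 0 to max - 1; SliceBodies is a hypothetical helper, and it assumes index names that need no JSON escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: generates one _reindex body per manual slice.
final class SliceBodies {
    // Builds max request bodies, with slice ids 0 .. max - 1.
    // Assumes source and dest contain no characters that need JSON escaping.
    static List<String> build(String source, String dest, int max) {
        List<String> bodies = new ArrayList<>();
        for (int id = 0; id < max; id++) {
            bodies.add("{\"source\": {\"index\": \"" + source + "\", "
                    + "\"slice\": {\"id\": " + id + ", \"max\": " + max + "}}, "
                    + "\"dest\": {\"index\": \"" + dest + "\"}}");
        }
        return bodies;
    }

    public static void main(String[] args) {
        for (String body : build("my_index_name", "my_index_name_new", 2)) {
            System.out.println(body);
        }
    }
}
```

Each generated body would be sent as its own POST _reindex request, ideally concurrently.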

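The result can be verified with the following commands, which refresh the indices and count the documents in the destination index:

GET _refresh
POST my_index_name_new/_search?size=0&filter_path=hits.total

The result looks like this:

{
  "hits": {
    "total": 120
  }
}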

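2. Automatic slicing

The slices URL parameter lets _reindex do the slicing itself: the operation is split into sub-requests that run in parallel, and everything else about the request stays the same. The example below uses 5 slices:

POST _reindex?slices=5&refresh
{
  "source": {
    "index": "my_index_name"
  },
  "dest": {
    "index": "my_index_name_new"
  }
}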

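3. Properties of sliced requests
The sub-requests are visible in the Task API as child tasks of the task for the sliced request;

Fetching the status of the task for the sliced request returns only the status of slices that have completed;

The sub-requests are individually addressable, for example for cancellation and rethrottling;

Rethrottling the sliced request rethrottles the unfinished sub-requests proportionally;

Cancelling the sliced request cancels every sub-request;

Each sub-request receives only part of the documents and the portions are not perfectly even: every document is covered, but some slices are larger than others, with larger slices tending toward a more even distribution;

Parameters such as requests_per_second and size are distributed proportionally across the sub-requests, so combined with the uneven slices, using size may not reindex exactly size documents;

Each sub-request may see a slightly different snapshot of the source index.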

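4. Choosing the number of slices
Do not set the number too high; a value such as 500 adds overhead and can hurt performance, for example through CPU pressure;

For query performance, the most efficient setting is a number of slices equal to the number of shards in the source index; going above the shard count generally does not help;

Indexing performance scales roughly linearly with the number of slices across the available resources;

Whether indexing or query performance dominates the overall run time depends on many factors, such as the documents being reindexed and the resources of the cluster.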

D. Reindexing time-based indices

Existing indices

PUT metricbeat-2016.05.30/_doc/1?refresh
{"system.cpu.idle.pct": 0.908}
PUT metricbeat-2016.05.31/_doc/1?refresh
{"system.cpu.idle.pct": 0.105}

Use a script to rewrite the destination index name, here appending -1 to each source index name:

POST _reindex
{
  "source": {
    "index": "metricbeat-*"
  },
  "dest": {
    "index": "metricbeat"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
  }
}
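The string expression in the script can be tried outside ES, since Painless string methods mirror Java's. The sketch below applies the same expression to an index name; DailyIndexRename is a hypothetical class used only to show what the script computes.

```java
// Java version of the Painless expression that rewrites ctx._index.
final class DailyIndexRename {
    // Keeps the date suffix of a metricbeat-* index name and appends "-1".
    static String rename(String index) {
        String prefix = "metricbeat-";
        return prefix + index.substring(prefix.length(), index.length()) + "-1";
    }

    public static void main(String[] args) {
        System.out.println(rename("metricbeat-2016.05.30"));
        System.out.println(rename("metricbeat-2016.05.31"));
    }
}
```

So documents from metricbeat-2016.05.30 and metricbeat-2016.05.31 end up in metricbeat-2016.05.30-1 and metricbeat-2016.05.31-1 respectively, rather than in the literal dest index metricbeat.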