Bulk Import in Elasticsearch with the Bulk API

Version 7.7. Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.7/docs-bulk.html

Bulk API

Performs multiple index or delete operations in a single API call. This reduces overhead and can greatly increase indexing speed.

For example:

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

Request

POST /_bulk

POST /<index>/_bulk

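Description

The Bulk API makes it possible to perform multiple index, create, delete, and update operations in a single request.

The request body must be in "newline delimited JSON" (NDJSON) format, in which every line, including the last one, ends with a newline character \n.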

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

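The newline character \n may be preceded by a carriage return \r.

Each operation normally takes two JSON objects on two consecutive lines: the first line specifies the action and its metadata, and the second line carries the data for that action (the document source, or for update a partial document under "doc").

The delete action takes only one line; no data line follows it.

Do not put extra blank lines between entries.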
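
The format rules above can be sketched in Python. This is only an illustration, not part of any Elasticsearch client: the function name build_bulk_body and the (action, metadata, source) tuple layout are hypothetical choices made for this example.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into an NDJSON bulk body.

    `source` is None for actions that take no second line (delete).
    """
    lines = []
    for action, metadata, source in operations:
        # First line: the action name and its metadata.
        lines.append(json.dumps({action: metadata}))
        # Second line: the document (or partial document for update), if any.
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with "\n".
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    ("index", {"_index": "test", "_id": "1"}, {"field1": "value1"}),
    ("delete", {"_index": "test", "_id": "2"}, None),
    ("update", {"_index": "test", "_id": "1"}, {"doc": {"field2": "value2"}}),
])
print(body, end="")
```

json.dumps without an indent argument never emits a raw newline, so each JSON object stays on exactly one line, which is what the NDJSON format requires.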

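If the request path specifies an <index>, the action lines in the body no longer need to specify "_index". For example, to insert data only into "test":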

POST /test/_bulk
{"index":{"_id":1}}
{"id":1,"name":"aben","age":18}
{"index":{"_id":2}}
{"id":2,"name":"sky","age":19}
{"index":{"_id":3}}
{"id":3,"name":"tom","age":20}

Submitting data with curl on the command line

To import a large amount of data from a file, we can use curl on the command line.

$ cat requests
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@requests";

1. The "Content-Type" header must be set to "application/x-ndjson" or "application/json".

For example, this command omits the header:

curl -XPOST localhost:9200/info/_bulk --data-binary "info.json"

It fails with:

{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

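2. You must use the "--data-binary" option rather than "-d"; with "-d", curl strips the newlines from the file.

3. The file name must be prefixed with "@", for example: "@data_file_name.json"

The official documentation does not call this out explicitly, and it is a really easy trap to fall into.

For example, the following command has every option in place but is missing the "@", so curl sends the literal text "info2.json" as the body instead of the file contents: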

curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/info2/_bulk --data-binary "info2.json"

It fails with:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\\n]"}],"type":"illegal_argument_exception","reason":"The bulk request must be terminated by a newline [\\n]"},"status":400}
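
The three curl requirements above can be mirrored in a short Python sketch that uses only the standard library. The helper name make_bulk_request is hypothetical, and the URL is simply the localhost address from the curl examples. The function only builds the request; sending it with urllib.request.urlopen is left to the caller, so the demo below runs without a cluster.

```python
import os
import tempfile
import urllib.request


def make_bulk_request(url, path):
    """Build a POST request roughly equivalent to curl --data-binary "@path"."""
    # Read raw bytes so that newlines are preserved, as --data-binary does.
    with open(path, "rb") as f:
        body = f.read()
    # Fail early instead of waiting for the server's 400 response.
    if not body.endswith(b"\n"):
        raise ValueError("bulk body must be terminated by a newline")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )


# Demo: write a small bulk file, build the request, and inspect it.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(b'{"index":{"_index":"test","_id":"1"}}\n{"field1":"value1"}\n')
req = make_bulk_request("http://localhost:9200/_bulk", tmp.name)
print(req.get_method(), req.full_url, len(req.data), "bytes")
os.remove(tmp.name)
```

Reading the file in binary mode plays the role of "@" plus "--data-binary": the body is the file's contents, byte for byte, rather than the file name or a newline-stripped copy.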

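Optimizations for bulk imports

1. Refresh interval

Setting name: index.refresh_interval

Optimization: refresh less often to reduce the potential disk-write overhead. The default refresh interval is 1s, which leads to low write throughput under heavy write load. Raising the interval improves write throughput at the cost of freshness: with the interval set to 60s, for example, newly written data may take up to 60s to become searchable.

While bulk loading a large amount of data into the ES cluster, you can disable refresh entirely by setting the value to -1, then set it back to a reasonable value once the import has finished.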

# bulk optimization 1: refresh interval
# View the current setting
GET /info/_settings/index.refresh_interval?include_defaults=true
# Disable refresh
PUT /info/_settings
{
  "index": {
    "refresh_interval": -1
  }
}
# Restore the default refresh setting
PUT /info/_settings
{
  "index": {
    "refresh_interval": null
  }
}

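2. Number of replicas

While bulk loading a large amount of data into the ES cluster, you can set the number of replicas to 0, then set it back to 1, or whatever number suits your cluster, once the import has finished.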

# bulk optimization 2: number of replicas
# View the current setting
GET /info/_settings/index.number_of_replicas?include_defaults=true
# Set replicas to 0 first
PUT /info/_settings
{
  "index.number_of_replicas": 0
}
# Set the number of replicas again
PUT /info/_settings
{
  "index.number_of_replicas": 1
}
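
Both optimizations follow the same pattern: change a setting before the import and put it back afterwards. A minimal Python sketch of that pattern is shown below. The names bulk_import_settings and put_settings are hypothetical; put_settings stands for whatever function you use to send PUT /<index>/_settings, and the demo replaces it with a recorder so that nothing is sent to a cluster. The restore value of 1 replica is an assumption taken from the example above.

```python
from contextlib import contextmanager

# Settings applied during the import, and the values restored afterwards.
# None is serialized as JSON null, which resets refresh_interval to its default.
IMPORT_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
RESTORE_SETTINGS = {"index": {"refresh_interval": None, "number_of_replicas": 1}}


@contextmanager
def bulk_import_settings(put_settings, index):
    """Apply import-time settings, then restore them even if the import fails.

    `put_settings(index, body)` is a caller-supplied function (hypothetical)
    that sends PUT /<index>/_settings with `body` as JSON.
    """
    put_settings(index, IMPORT_SETTINGS)
    try:
        yield
    finally:
        put_settings(index, RESTORE_SETTINGS)


# Demo with a stand-in that records calls instead of contacting a cluster.
calls = []
with bulk_import_settings(lambda idx, body: calls.append((idx, body)), "info"):
    pass  # run the bulk requests here
print(calls)
```

The try/finally block matters: if the import raises, the index would otherwise be left with refresh disabled and no replicas.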

Appendix: exporting a large amount of data from MySQL to JSON files (here every 100,000 rows go into one file)

// Reading 100,000 rows at once exceeds memory, so read 10,000 rows per query
// and append the results of 10 consecutive queries to the same file.
$idFrom = 0;
$ix = 1;
$batchWriteSize = 1000; // rows per file write, to avoid one write per row
while (true) {
    $sql = 'SELECT * FROM `info` WHERE id > :id ORDER BY id ASC LIMIT 0,10000'; // fetch only 10,000 rows per query
    $rs = Db::select($sql, ['id' => $idFrom]);
    if (empty($rs)) {
        echo 'no more data ...';
        break;
    }
    echo 'Fetched ' . count($rs) . ' rows ======' . PHP_EOL;

    // Do not use modulo here, or the ids would be scattered across files; divide and round up with `ceil` instead.
    $fileName = 'info_' . (ceil($ix / 10) - 1) . '.json';
    echo 'Starting after id ' . $idFrom . ', writing to file ' . $fileName . ':' . PHP_EOL;

    $i = 1;   // position within the current write batch (1 .. $batchWriteSize)
    $str = ''; // buffered lines not yet written to the file
    foreach ($rs as $row) {
        $idFrom = (int)$row['id'];
        // Note: every line of the data must be terminated with \n
        $str .= '{"index":{"_index":"info","_id":' . $row['id'] . '}}' . "\n" . json_encode($row) . "\n";
        if ($i == $batchWriteSize) {
            echo 'Wrote one batch, current id: ' . $idFrom . ', size: ' . strlen($str) . PHP_EOL;
            file_put_contents($fileName, $str, FILE_APPEND);
            $str = '';
            $i = 1;
        }
        else {
            $i++;
        }

    }
    // Write the trailing rows when the row count is not a multiple of $batchWriteSize
    if ($i > 1) {
        echo 'Writing the last ' . ($i - 1) . ' rows (fewer than ' . $batchWriteSize . ')' . PHP_EOL;
        file_put_contents($fileName, $str, FILE_APPEND);
    }

    $ix++;
}
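
The same export idea, paging by id and grouping several queries into one file, can be sketched in Python. This is a simplified illustration under stated assumptions, not a port of the script above: it uses an in-memory sqlite3 table as a stand-in for the MySQL `info` table, the function name export_bulk_files and its parameters are hypothetical, and it writes each query's rows directly instead of buffering 1000 rows at a time.

```python
import json
import os
import sqlite3
import tempfile


def export_bulk_files(conn, rows_per_query, queries_per_file, prefix):
    """Page through `info` by id and write NDJSON bulk files.

    `conn` must return rows that support row["id"] and dict(row).
    Returns the list of file names written.
    """
    last_id, query_no, files = 0, 0, []
    while True:
        # Keyset pagination: continue after the last id seen.
        rows = conn.execute(
            "SELECT * FROM info WHERE id > ? ORDER BY id ASC LIMIT ?",
            (last_id, rows_per_query),
        ).fetchall()
        if not rows:
            break
        # Integer division groups consecutive queries into the same file.
        name = f"{prefix}{query_no // queries_per_file}.json"
        if name not in files:
            files.append(name)
        with open(name, "a", encoding="utf-8") as out:
            for row in rows:
                action = {"index": {"_index": "info", "_id": row["id"]}}
                out.write(json.dumps(action) + "\n")
                out.write(json.dumps(dict(row)) + "\n")
        last_id = rows[-1]["id"]
        query_no += 1
    return files


# Demo: 5 rows, 2 rows per query, 2 queries per file.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE info (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO info VALUES (?, ?)",
                 [(i, f"n{i}") for i in range(1, 6)])
tmpdir = tempfile.mkdtemp()
files = export_bulk_files(conn, 2, 2, os.path.join(tmpdir, "info_"))
print([os.path.basename(f) for f in files])
```

Filtering on `id > last_id` keeps every query cheap no matter how far the export has progressed, which is the same reason the PHP script tracks $idFrom instead of increasing the LIMIT offset.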