Elasticsearch backup and restore: snapshot & restore

This article is based on Elasticsearch 7.1. Some details may not be entirely accurate for older or newer versions.

Elasticsearch can back up and restore a running cluster. You can snapshot the entire cluster (only shards in open/started state; closed indices cannot be snapshotted) or just selected indices. Supported snapshot targets include shared network file systems (NFS), Amazon S3, HDFS, Microsoft Azure (deprecated; may be unsupported in the future), and Google Cloud Storage.

Regarding version compatibility between snapshots and the clusters that restore them, the official documentation says:

A snapshot contains a copy of the on-disk data structures that make up an index. This means that snapshots can only be restored to versions of Elasticsearch that can read the indices:

  • A snapshot of an index created in 6.x can be restored to 7.x.
  • A snapshot of an index created in 5.x can be restored to 6.x.
  • A snapshot of an index created in 2.x can be restored to 5.x.
  • A snapshot of an index created in 1.x can be restored to 2.x.

In other words, a snapshot taken on one major version can be restored on the next major version. Note, however, that if a 1.x snapshot is restored on 2.x and the restored data is then snapshotted again on 2.x, that new snapshot still cannot be restored on 5.x. The reason is that the original index was created under 1.x and retains the 1.x on-disk format; that format is preserved through the restore onto 2.x, and 5.x cannot read index structures created by 1.x. As the documentation puts it: "When backing up your data prior to an upgrade, keep in mind that you won't be able to restore snapshots after you upgrade if they contain indices created in a version that's incompatible with the upgrade version." The workaround is to restore on 2.x, reindex into a new index, and delete the old one; or to use the reindex-from-remote feature to rebuild the index directly from the 1.x cluster onto 2.x.
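The rule above boils down to a one-major-version window, keyed on the version that created the index, not the version that took the snapshot. A minimal sketch (the function name and the integer major-version encoding are illustrative, not any Elasticsearch API):

```python
# Sketch of the compatibility rule: an index is restorable only onto the same
# major version it was created in, or the immediately following one.
def can_restore(index_created_major: int, target_major: int) -> bool:
    """True if an index created in `index_created_major`.x can be restored
    on a `target_major`.x cluster."""
    return target_major - 1 <= index_created_major <= target_major

# A 6.x index restores on 7.x, but a 1.x-created index never restores on 5.x,
# even if its snapshot was taken on a 2.x cluster.
print(can_restore(6, 7))  # True
print(can_restore(1, 5))  # False
```

This is why reindexing on the intermediate version helps: reindexing creates a brand-new index under the current major version, resetting the window.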

To take snapshots you must first register a repository. The walkthrough below uses HDFS as the target; the same material is covered in the official documentation: https://www.elastic.co/guide/en/elasticsearch/plugins/7.3/repository-hdfs.html

1. First, install the HDFS repository plugin. It can be installed online or offline. For online installation, run:
 

sudo bin/elasticsearch-plugin install repository-hdfs

For offline installation, first download the plugin zip matching your Elasticsearch version. For example, my cluster runs 7.0.1, so I need the 7.0.1 package; a mismatched version fails to install. https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-hdfs/repository-hdfs-7.0.1.zip  For other versions, just change the version number in the URL. After downloading, cd into the Elasticsearch bin directory as root and run the command below, adjusting the path after file: to match your setup:
 

./elasticsearch-plugin install file:///bigdata/cluster1/repository-hdfs-7.0.1.zip

The installer prints a warning during installation; ignore it and answer Y to finish. Restart the node afterwards, and repeat these steps on every node in the cluster. Repositories are persisted in the cluster state: when a node comes online it tries to initialize all registered repositories, and if verification fails the repository is marked unusable on that node. After that, the only remedies are deleting and re-registering the repository, or restarting the node.

2. Register the repository. In this example the target is an HA HDFS cluster with three NameNodes.
 

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://hadoopcluster",
    "path": "/elasticsearch/backup",
    "conf.dfs.nameservices": "hadoopcluster",
    "conf.dfs.ha.namenodes.hadoopcluster": "nn1,nn2,nn3",
    "conf.dfs.namenode.rpc-address.hadoopcluster.nn1": "MYSQL1:8020",
    "conf.dfs.namenode.rpc-address.hadoopcluster.nn2": "MYSQL2:8020",
    "conf.dfs.namenode.rpc-address.hadoopcluster.nn3": "MYSQL3:8020",
    "conf.dfs.client.failover.proxy.provider.hadoopcluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
  }
}

The uri above is the HDFS cluster service name, i.e. the fs.defaultFS value from Hadoop's core-site.xml; path is a directory inside HDFS; and the conf.* keys mirror settings from hdfs-site.xml. Through conf.* you can pass any client-side parameter from core-site.xml or hdfs-site.xml that the repository-hdfs plugin recognizes. The full set of supported repository settings is:
 

uri

The uri address for hdfs. ex: "hdfs://<host>:<port>/". (Required)

path

The file path within the filesystem where data is stored/loaded. ex: "path/to/file". (Required)

load_defaults

Whether to load the default Hadoop configuration or not. (Enabled by default)

conf.<key>

Inlined configuration parameter to be added to Hadoop configuration. (Optional) Only client oriented properties from the hadoop core and hdfs configuration files will be recognized by the plugin.

compress

Whether to compress the metadata or not. (Disabled by default)

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

chunk_size

Override the chunk size. (Disabled by default)

security.principal

Kerberos principal to use when connecting to a secured HDFS cluster. If you are using a service principal for your elasticsearch node, you may use the _HOST pattern in the principal name and the plugin will replace the pattern with the hostname of the node at runtime (see Creating the Secure Repository).

 

The readonly parameter matters when multiple clusters point at the same repository. A common use is restoring cluster A's snapshots into cluster B: register a repository with the same location on both clusters, and set readonly to true on cluster B's registration.
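As a sketch, the two registration bodies would differ only in that flag (this just builds the request bodies, reusing the repository location from the example above; no cluster is contacted):

```python
import json

# Both clusters point at the same HDFS location; only the restoring side adds
# "readonly": true, so that two clusters never both write repository metadata.
common = {"uri": "hdfs://hadoopcluster", "path": "/elasticsearch/backup"}

repo_cluster_a = {"type": "hdfs", "settings": dict(common)}  # takes snapshots
repo_cluster_b = {"type": "hdfs",
                  "settings": {**common, "readonly": True}}  # only restores

print(json.dumps(repo_cluster_b, indent=2))
```

PUT each body to _snapshot/<name> on its respective cluster; cluster B can then restore any snapshot cluster A has written.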

Verify that the repository is usable:
 

POST /_snapshot/my_hdfs_repository/_verify

Now you can start taking snapshots.

3. Take a snapshot. Run:
 

PUT _snapshot/my_hdfs_repository/snapshot_1?wait_for_completion=true

The snapshot runs in the background: the request normally returns once snapshot initialization completes. Initialization has to read the metadata of every existing snapshot in the repository, so with a large snapshot history it can take quite a while; once it finishes, the call returns and the actual copying proceeds asynchronously on the snapshot thread pool. Adding wait_for_completion=true, as above, makes the request block until the snapshot finishes.

A snapshot request can also take a body with additional parameters:
 

PUT /_snapshot/my_backup/snapshot_2?wait_for_completion=true
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false,
  "partial": true,
  "metadata": {
    "taken_by": "kimchy",
    "taken_because": "backup before upgrading"
  }
}

indices: back up only the listed indices instead of the entire cluster.
ignore_unavailable: if true, indices that do not exist are ignored instead of failing the request. This does not cover closed indices: trying to snapshot a closed index still returns an error.
include_global_state: whether to include the cluster global state in the snapshot.
partial: by default the whole snapshot fails if any primary shard of an index is unavailable; set to true to snapshot only the available primary shards.
metadata: arbitrary user-defined data, e.g. a note on who took the snapshot and why.

A snapshot name can be a plain string or a date-math expression:

# PUT /_snapshot/my_backup/<snapshot-{now/d}>
PUT /_snapshot/my_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E
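The percent-encoded name above is simply the date-math expression `<snapshot-{now/d}>` URL-encoded, since <, >, {, } and / are not allowed literally in a URL path segment. A quick way to produce it:

```python
from urllib.parse import quote

# URL-encode a date-math snapshot name; safe="" forces "/" to be encoded too.
name = "<snapshot-{now/d}>"
encoded = quote(name, safe="")
print(encoded)  # %3Csnapshot-%7Bnow%2Fd%7D%3E
```

On a day in mid-2019 this would resolve to a snapshot named like snapshot-2019.07.01.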

While a snapshot runs, its progress can be monitored with any of the following:

GET _snapshot/my_hdfs_repository/_all
GET _snapshot/my_hdfs_repository/*
GET _snapshot/my_hdfs_repository/snapshot_1
GET /_snapshot/my_hdfs_repository/snapshot_1/_status
GET _cluster/pending_tasks
GET _cat/pending_tasks?v

The first three return snapshot info, from which you can read the snapshot state; the fourth uses the dedicated snapshot status API. The difference is that the info calls share the thread pool with the snapshot process itself and therefore compete with it for resources, while the status API does not. The fifth and sixth simply inspect the pending-task queue and do not return snapshot details. A snapshot can be in one of the following states:
 

IN_PROGRESS

The snapshot is currently running.

SUCCESS

The snapshot finished and all shards were stored successfully.

FAILED

The snapshot finished with an error and failed to store any data.

PARTIAL

The global cluster state was stored, but data of at least one shard wasn’t stored successfully. The failure section in this case should contain more detailed information about shards that were not processed correctly.

INCOMPATIBLE

The snapshot was created with an old version of Elasticsearch and therefore is incompatible with the current version of the cluster.
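Given the snapshot info response shape ({"snapshots": [{"state": ...}]}), a client-side completion check over these states could look like the sketch below (the sample response is hand-written for illustration, not captured from a cluster):

```python
# IN_PROGRESS is the only non-terminal state in the list above.
TERMINAL_STATES = {"SUCCESS", "FAILED", "PARTIAL", "INCOMPATIBLE"}

def is_finished(info_response: dict) -> bool:
    """True once every snapshot in the info response reached a terminal state."""
    return all(s["state"] in TERMINAL_STATES
               for s in info_response["snapshots"])

sample = {"snapshots": [{"snapshot": "snapshot_1", "state": "IN_PROGRESS"}]}
print(is_finished(sample))  # False
```

In a real poller you would issue GET _snapshot/my_hdfs_repository/snapshot_1 periodically and feed the parsed JSON to such a check.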

Elasticsearch snapshots are incremental: each snapshot stores only what has changed since the previous one. Taking a snapshot does not block writes, but the snapshot only contains the data as of the moment the snapshot process started; changes made while it runs are not included. Only one snapshot operation can run in a cluster at any time, and submitting another while one is in progress returns an error. Shard reallocation is also suspended while a snapshot runs. If a snapshot is taking too long because there is a lot to copy, it can be cancelled by deleting it, and any files already produced are cleaned up automatically. Deleting a snapshot removes its files, except for files still referenced by other snapshots, which are kept.
 

DELETE _snapshot/my_hdfs_repository/snapshot_1047
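The reference-counting behaviour of delete can be pictured with a toy model (snapshot and segment-file names here are made up; real repositories track this in their metadata files):

```python
# Toy model of incremental snapshots: deleting one snapshot frees only the
# files that no remaining snapshot still references.
snapshots = {
    "snapshot_1": {"seg_a", "seg_b"},
    "snapshot_2": {"seg_b", "seg_c"},  # incremental: reuses seg_b
}

def files_freed_by_delete(name: str) -> set:
    """Files physically removable once snapshot `name` is deleted."""
    still_referenced = set().union(
        *(files for n, files in snapshots.items() if n != name))
    return snapshots[name] - still_referenced

print(sorted(files_freed_by_delete("snapshot_1")))  # ['seg_a']
```

Deleting snapshot_1 keeps seg_b on disk because snapshot_2 still depends on it.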

A restore is started with:
 

POST /_snapshot/my_hdfs_repository/snapshot_1/_restore

By default all indices in the snapshot are restored, but the cluster state is not. The restore fails if any index being restored already exists in the cluster. The parameters below let you restore only some indices, restore the cluster state, do a partial restore, rewrite index settings, and more:
 

POST _snapshot/restore_readonly_repository/snapshot_1056/_restore
{
  "indices": "human*,ac_blog",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_index_$1",
  "ignore_unavailable": true,
  "include_global_state": true,
  "partial": true,
  "index_settings":{
    "index.number_of_replicas":0
  },
  "ignore_index_settings":[
    "index.refresh_interval"
  ]
}

indices: restore only the listed indices; comma-separated names and wildcards are supported.
rename_pattern: used together with rename_replacement to rename indices on restore.
rename_replacement: used together with rename_pattern to rename indices on restore.
ignore_unavailable: whether to ignore indices that are missing from the snapshot.
include_global_state: whether to restore the cluster global state; defaults to false. When true, index templates from the snapshot replace existing templates with the same name and are created where absent, and persistent settings are merged into the current cluster.
partial: if the snapshot was taken with partial=true, some indices may contain only a subset of their primary shards; restoring such a snapshot fails unless partial=true is set here as well.
index_settings: override the snapshotted index settings with the given values during restore.
ignore_index_settings: drop the listed index settings from the snapshot and fall back to their defaults.
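The effect of rename_pattern / rename_replacement on index names can be sketched with an ordinary regex substitution. Elasticsearch uses Java-style $1 group references in the replacement; Python's re.sub uses \1 instead, but the matching idea is the same:

```python
import re

# Mimic the rename_pattern / rename_replacement pair from the restore body.
rename_pattern = r"(.+)"
rename_replacement = r"restored_index_\1"  # the ES body would say restored_index_$1

for name in ("human_resources", "ac_blog"):
    print(re.sub(rename_pattern, rename_replacement, name))
# restored_index_human_resources
# restored_index_ac_blog
```

With the pattern (.+), every restored index simply gets the restored_index_ prefix, which is how the example body avoids name collisions with existing indices.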

A restore uses the standard Elasticsearch recovery mechanism, the same one used when indices recover at cluster startup or when shards recover onto other nodes after a node is lost; the only difference is that the data source is the repository. Shard recovery may also make use of the translog, if the most recent changes are all contained in it. Restore progress can be watched with:
 

GET _recovery
GET restored_index_human/_recovery
GET _cat/recovery?v&format=yaml

 
