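In day-to-day work, keeping an Elasticsearch cluster running stably means doing many things on a schedule: purging old data, merging segments, taking and restoring backups, and so on. If we can program, most of this can be done by calling the Elasticsearch API from whatever language we like. But before reinventing the wheel, we should make sure that nobody has already hit the same problem and that no general-purpose tool already covers our needs; only then should we build our own. The Elasticsearch ecosystem is mature, and Curator, a tool from elastic.co written in Python, already provides solid solutions for most operations scenarios. In most cases Curator alone is enough for day-to-day needs.
For installation, see the official documentation. If pip is already installed on the server, a single pip install does the job: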
pip install elasticsearch-curator
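However, many production environments do not have pip installed, and firewalls often block direct access to https://packages.elastic.co, so most of the installation methods described on the official site are not really applicable.
The workaround is to download the complete installation package (RPM or DEB) and install it directly on the server.
Download links: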
Elasticsearch Curator 5.2.0 Binary Package (DEB)
Elasticsearch Curator 5.2.0 Binary Package for newer Debian 9 based systems (DEB)
Elasticsearch Curator 5.2.0 RHEL/CentOS 6 Binary Package (RPM)
Elasticsearch Curator 5.2.0 RHEL/CentOS 7 Binary Package (RPM)
Curator provides two interfaces: curator and curator_cli.
We cover curator_cli first because it is well suited to debugging, but for real operations work I still recommend curator.
$ curator_cli --help
Usage: curator_cli [OPTIONS] COMMAND [ARGS]...
Options:
--config PATH Path to configuration file. Default:
~/.curator/curator.yml
--host TEXT Elasticsearch host.
--url_prefix TEXT Elasticsearch http url prefix.
--port TEXT Elasticsearch port.
--use_ssl Connect to Elasticsearch through SSL.
--certificate TEXT Path to certificate to use for SSL validation.
--client-cert TEXT Path to file containing SSL certificate for client auth.
--client-key TEXT Path to file containing SSL key for client auth.
--ssl-no-validate Do not validate SSL certificate
--http_auth TEXT Use Basic Authentication ex: user:pass
--timeout INTEGER Connection timeout in seconds.
--master-only Only operate on elected master node.
--dry-run Do not perform any changes.
--loglevel TEXT Log level
--logfile TEXT log file
--logformat TEXT Log output format [default|logstash|json].
--version Show the version and exit.
--help Show this message and exit.
Commands:
allocation Shard Routing Allocation
close Close indices
delete_indices Delete indices
delete_snapshots Delete snapshots
forcemerge forceMerge index/shard segments
open Open indices
replicas Change replica count
show_indices Show indices
show_snapshots Show snapshots
snapshot Snapshot indices
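Those are the basic command-line options. The reason I do not recommend curator_cli for routine operations is that it can run only one action per invocation, and writing a complex filter on the command line is painful. In practice, curator_cli is used to help work out the action.yml for curator, or to run simple tests, for example: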
curator_cli --host 10.33.4.160 --port 9200 show_indices --verbose
Output:
.kibana open 54.9KB 6 1 1 2017-09-06T02:13:00Z
.monitoring-alerts-6 open 6.5KB 1 1 1 2017-09-06T02:14:01Z
.monitoring-es-6-2017.10.12 open 376.1MB 556576 1 1 2017-10-12T00:00:06Z
.monitoring-es-6-2017.10.13 open 76.8MB 96220 1 1 2017-10-13T00:00:08Z
.monitoring-kibana-6-2017.10.12 open 3.3MB 8638 1 1 2017-10-12T00:00:08Z
.monitoring-kibana-6-2017.10.13 open 1.3MB 3390 1 1 2017-10-13T00:00:09Z
.monitoring-logstash-6-2017.10.12 open 2.4MB 8211 1 1 2017-10-12T01:09:48Z
.monitoring-logstash-6-2017.10.13 open 1.1MB 3390 1 1 2017-10-13T00:00:08Z
.reporting-2017.09.17 open 376.9KB 2 5 1 2017-09-21T09:58:01Z
.triggered_watches open 9.2MB 19 1 1 2017-09-06T02:14:01Z
.watcher-history-3-2017.10.12 open 6.0MB 7200 1 1 2017-10-12T00:00:03Z
.watcher-history-3-2017.10.13 open 2.4MB 2830 1 1 2017-10-13T00:00:03Z
.watches open 23.6KB 4 1 1 2017-09-06T02:13:00Z
syslog-network-2017.10.11 open 26.1MB 109195 5 1 2017-10-13T02:20:58Z
syslog-network-2017.10.12 open 11.5KB 1 5 1 2017-10-12T20:11:28Z
syslog-platform-2017.10.11 open 1019.5MB 4004662 5 1 2017-10-13T02:36:11Z
syslog-platform-2017.10.12 open 16.0MB 61915 5 1 2017-10-12T03:17:38Z
syslog-platform-2017.10.13 open 20.8MB 90628 5 1 2017-10-12T23:52:10Z
watcher open 69.0KB 5 5 1 2017-09-21T02:23:10Z
watcher_alarms-2017.10.11 open 365.5KB 1 5 1 2017-10-11T08:00:06Z
curator_cli --host 10.33.4.160 --port 9200 close --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":1},{"filtertype":"pattern","kind":"prefix","value":"syslog-"}]'
2017-10-13 17:30:21,573 INFO Closing selected indices: ['syslog-platform-2017.10.12']
2017-10-13 17:30:21,713 INFO Singleton "close" action completed.
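The command above uses --filter_list to select every index created more than one day ago whose name starts with syslog-, and then closes them. As the example shows, curator_cli commands are hard to read.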
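One way to make the `--filter_list` argument less error-prone is to build it from a data structure instead of typing the JSON by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_cli_command` is made up for this illustration and is not part of Curator. It only prints the command and does not contact Elasticsearch.

```python
import json
import shlex

# Filters copied from the close example above.
filters = [
    {
        "filtertype": "age",
        "source": "creation_date",
        "direction": "older",
        "unit": "days",
        "unit_count": 1,
    },
    {"filtertype": "pattern", "kind": "prefix", "value": "syslog-"},
]


def build_cli_command(host, port, action, filter_list):
    """Return a shell-safe curator_cli command line as one string.

    Hypothetical helper for illustration; it only assembles text.
    """
    # Compact separators keep the JSON on one line without extra spaces.
    filter_json = json.dumps(filter_list, separators=(",", ":"))
    argv = [
        "curator_cli",
        "--host", host,
        "--port", str(port),
        action,
        "--filter_list", filter_json,
    ]
    # shlex.quote wraps arguments in single quotes only when needed.
    return " ".join(shlex.quote(arg) for arg in argv)


command = build_cli_command("10.33.4.160", 9200, "close", filters)
print(command)
```

The printed line can be pasted into a shell. Keeping the filters as Python data also makes it easy to copy them later into the `filters:` section of an action file.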
The curator interface is simple to invoke:
curator [--config CONFIG.YML] [--dry-run] ACTION_FILE.YML
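Pass the configuration file after --config, followed by the action file. An action file can contain a whole series of actions, so all of our operations can live in one place. Compared with curator_cli, the centralized config and action files make it easy to reuse settings and are easier to maintain and read.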
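If curator is driven from a script, the invocation can be assembled programmatically. This is only a sketch: the function names `build_curator_argv` and `run_curator` are assumptions introduced here, and the file paths are placeholders. The flags used (`--config`, `--dry-run`) are the ones shown in the usage line above.

```python
import subprocess


def build_curator_argv(action_file, config=None, dry_run=False):
    """Assemble: curator [--config CONFIG.YML] [--dry-run] ACTION_FILE.YML"""
    argv = ["curator"]
    if config:
        argv += ["--config", config]
    if dry_run:
        argv.append("--dry-run")
    argv.append(action_file)
    return argv


def run_curator(action_file, config=None, dry_run=True):
    """Run curator and return its exit code.

    Defaults to a dry run so nothing is changed by accident.
    Requires the curator executable to be on PATH.
    """
    argv = build_curator_argv(action_file, config=config, dry_run=dry_run)
    return subprocess.run(argv, check=False).returncode


# Only build and show the command here; nothing is executed.
print(build_curator_argv("/opt/curator/action.yml",
                         config="/opt/curator/curator.yml",
                         dry_run=True))
```

Running with `--dry-run` first and reading the log is a cheap way to check which indices an action file would touch before letting it make changes.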
The configuration file is conventionally named curator.yml, but any name works as long as it is passed via --config.
---
# Remember, leave a key empty if there is no value. None will be a string,
# not a Python "NoneType"
client:
  hosts:
    - 10.33.4.160
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False

logging:
  loglevel: INFO
  logfile: /var/log/curator.log
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']
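The configuration is straightforward and each parameter is self-explanatory. Note that if a parameter has no value, simply leave it empty; do not write None.
Also, if logfile is left empty, logs go to stdout by default. Writing them to a file, as in the example above, is recommended.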
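Each action consists of three parts:
- action: the operation to perform
- options: the optional settings for that operation
- filters: the conditions that select which indices the action applies to
Compared with curator_cli, curator adds actions such as alias, restore, and shrink: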
- alias
- allocation
- close
- cluster_routing
- create_index
- delete_indices
- delete_snapshots
- forcemerge
- index_settings
- open
- reindex
- replicas
- restore
- rollover
- shrink
- snapshot
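There are many options, and I will not go through them one by one. Look at the examples further on to understand the most important ones, and consult the official documentation for the rest: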
- allocation_type
- continue_if_exception
- count
- delay
- delete_after
- delete_aliases
- disable_action
- extra_settings
- ignore_empty_list
- ignore_unavailable
- include_aliases
- include_global_state
- indices
- key
- max_age
- max_docs
- max_num_segments
- max_wait
- migration_prefix
- migration_suffix
- name
- node_filters
- number_of_replicas
- number_of_shards
- partial
- post_allocation
- preserve_existing
- refresh
- remote_aws_key
- remote_aws_region
- remote_aws_secret_key
- remote_certificate
- remote_client_cert
- remote_client_key
- remote_filters
- remote_ssl_no_validate
- remote_url_prefix
- rename_pattern
- rename_replacement
- repository
- requests_per_second
- request_body
- retry_count
- retry_interval
- routing_type
- setting
- shrink_node
- shrink_prefix
- shrink_suffix
- slices
- skip_repo_fs_check
- timeout
- timeout_override
- value
- wait_for_active_shards
- wait_for_completion
- wait_interval
- warn_if_no_indices
The most commonly used filtertypes are pattern and age:
- age
- alias
- allocated
- closed
- count
- forcemerged
- kibana
- none
- opened
- pattern
- period
- space
- state
---
# Remember, leave a key empty if there is no value. None will be a string,
# not a Python "NoneType"
#
# Also remember that all examples have 'disable_action' set to True. If you
# want to use this action as a template, be sure to set this to False after
# copying it.
actions:
  1:
    action: delete_indices
    description: >-
      Delete metric indices older than 3 days (based on index name), for
      .monitoring-es-6-
      .monitoring-kibana-6-
      .monitoring-logstash-6-
      .watcher-history-3-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      # disable_action: True
    filters:
    - filtertype: pattern
      kind: regex
      value: '^(\.monitoring-(es|kibana|logstash)-6-|\.watcher-history-3-).*$'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 3
  2:
    action: close
    description: >-
      Close indices older than 30 days (based on index name), for syslog-
      prefixed indices.
    options:
      ignore_empty_list: True
      delete_aliases: False
      # disable_action: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: syslog-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
  3:
    action: forcemerge
    description: >-
      forceMerge syslog- prefixed indices older than 2 days (based on index
      name) to 2 segments per shard. Delay 120 seconds between each
      forceMerge operation to allow the cluster to quiesce. Skip indices that
      have already been forcemerged to the minimum number of segments to avoid
      reprocessing.
    options:
      ignore_empty_list: True
      max_num_segments: 2
      delay: 120
      timeout_override:
      continue_if_exception: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: syslog-
      exclude:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 2
    - filtertype: forcemerged
      max_num_segments: 2
      exclude:
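Actions are defined in a single YAML file, with nesting expressed through indentation. The example defines three actions, which are executed in order. Tasks 1, 2, and 3 here do not depend on one another; if there were dependencies, the action being depended on must come first.
The three tasks are: delete indices, close expired indices, and merge index segments.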
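To get a feel for how the pattern and age filters of action 1 narrow down the index list, here is a rough approximation in plain Python. It is not Curator's implementation: the function names, the sample index names, and the fixed `now` value are all made up for this sketch, and Curator's real age calculation may differ in details such as time zone handling.

```python
import re
from datetime import datetime, timedelta, timezone

# Same regex as the pattern filter of action 1.
PATTERN = re.compile(
    r"^(\.monitoring-(es|kibana|logstash)-6-|\.watcher-history-3-).*$"
)
# Regex counterpart of the timestring '%Y.%m.%d'.
DATE_IN_NAME = re.compile(r"\d{4}\.\d{2}\.\d{2}")


def name_date(index):
    """Return the date embedded in an index name, or None if absent."""
    match = DATE_IN_NAME.search(index)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(), "%Y.%m.%d")
    return parsed.replace(tzinfo=timezone.utc)


def select_for_delete(indices, now, days=3):
    """Keep indices that pass both filters (the filters are ANDed)."""
    cutoff = now - timedelta(days=days)
    selected = []
    for index in indices:
        if not PATTERN.match(index):
            continue  # rejected by the pattern filter
        stamp = name_date(index)
        if stamp is None or stamp >= cutoff:
            continue  # rejected by the age filter
        selected.append(index)
    return selected


# Hypothetical index names and a fixed clock, for a repeatable result.
indices = [
    ".monitoring-es-6-2017.10.12",
    ".monitoring-kibana-6-2017.10.08",
    ".watcher-history-3-2017.10.09",
    "syslog-platform-2017.10.01",
    ".kibana",
]
now = datetime(2017, 10, 13, 12, 0, tzinfo=timezone.utc)
print(select_for_delete(indices, now))
```

With these sample inputs, only the two indices that match the regex and carry a date more than three days before `now` remain. Each filter in the list further shrinks the set left by the previous one.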
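Pay special attention to the options here. With multiple actions that do not depend on each other, be sure to set ignore_empty_list: True. It means that if the filters find no index matching the conditions, the action is skipped. If it is set to False and the first action finds no matching index, the whole curator run is aborted.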
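The effect of ignore_empty_list on a multi-action run can be pictured with a small simulation. This is a toy model of the behaviour described above, not Curator code: `run_actions` and the `matches` mapping (action id to the indices its filters would select) are hypothetical names used only here.

```python
def run_actions(actions, matches):
    """Simulate a run; return (completed_ids, aborted_id_or_None).

    `actions` maps an action id to a dict with an optional "options" dict.
    `matches` maps an action id to the list of indices its filters selected.
    """
    completed = []
    for action_id in sorted(actions):
        options = actions[action_id].get("options") or {}
        if not matches.get(action_id):
            if options.get("ignore_empty_list", False):
                continue  # empty list tolerated: skip and move on
            return completed, action_id  # empty list is fatal: stop here
        completed.append(action_id)
    return completed, None


# Action 1 matches nothing; actions 2 and 3 each match one index.
matches = {
    1: [],
    2: ["syslog-platform-2017.09.01"],
    3: ["syslog-platform-2017.10.11"],
}
tolerant = {i: {"options": {"ignore_empty_list": True}} for i in (1, 2, 3)}
strict = {i: {"options": {"ignore_empty_list": False}} for i in (1, 2, 3)}

print(run_actions(tolerant, matches))
print(run_actions(strict, matches))
```

In the tolerant configuration actions 2 and 3 still run even though action 1 had nothing to do. In the strict one the run stops at action 1 and the later actions never execute.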
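The official site has examples for every action, which are worth a look.
Curator is a command-line tool, whereas what we need is automated, periodic maintenance, so we pair it with a scheduler such as crontab, which ships with most Linux distributions. Edit the /etc/crontab file: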
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# For details see man 4 crontabs
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
0 0 * * * root curator --config /opt/curator/curator.yml /opt/curator/action.yml
This runs once a day at midnight and performs the three actions in order: delete indices, close indices, and merge segments.