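Version: aodh-4.0.3
aodh is a component that was split out of ceilometer. Its main job is resource alarming, and it can deliver alarms through log, webhook, and other channels.
Before reading the analysis below, it helps to run aodh alarm create -h to see which alarm types and fields are available at creation time. A separate article translates and analyzes that output.
aodh consists of four main parts:
evaluator: evaluates the conditions that trigger an alarm
notifier: sends alarm notifications
listener: the listening module
api: the aodh API module, which serves external API calls
Code directory: /aodh/evaluator/
It contains several files, each corresponding to one or more alarm types:
composite.py: the composite type
event.py: the event type
gnocchi.py: alarm types whose data is provided by gnocchi
threshold.py: the alarm type whose data is provided by ceilometer
The __init__.py file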
class Evaluator(object):
    """Base class for alarm rule evaluator plugins."""

    def __init__(self, conf):
        self.conf = conf
        self.notifier = queue.AlarmNotifier(self.conf)
        self.storage_conn = None
        self._ks_client = None
        self._alarm_change_notifier = None
This loads the notifier, which is called later when an alarm condition is triggered.
def __init__(self, worker_id, conf):
    super(AlarmEvaluationService, self).__init__(worker_id)
    self.conf = conf

    ef = lambda: futures.ThreadPoolExecutor(max_workers=10)
    self.periodic = periodics.PeriodicWorker.create(
        [], executor_factory=ef)

    self.evaluators = extension.ExtensionManager(
        namespace=self.EVALUATOR_EXTENSIONS_NAMESPACE,
        invoke_on_load=True,
        invoke_args=(self.conf,)
    )
    self.storage_conn = storage.get_connection_from_config(self.conf)

    self.partition_coordinator = coordination.PartitionCoordinator(
        self.conf)
    self.partition_coordinator.start()
    self.partition_coordinator.join_group(self.PARTITIONING_GROUP_NAME)

    # allow time for coordination if necessary
    delay_start = self.partition_coordinator.is_active()

    if self.evaluators:
        @periodics.periodic(spacing=self.conf.evaluation_interval,
                            run_immediately=not delay_start)
        def evaluate_alarms():
            self._evaluate_assigned_alarms()

        self.periodic.add(evaluate_alarms)

    if self.partition_coordinator.is_active():
        heartbeat_interval = min(self.conf.coordination.heartbeat,
                                 self.conf.evaluation_interval / 4)

        @periodics.periodic(spacing=heartbeat_interval,
                            run_immediately=True)
        def heartbeat():
            self.partition_coordinator.heartbeat()

        self.periodic.add(heartbeat)

    t = threading.Thread(target=self.periodic.start)
    t.daemon = True
    t.start()
The code above is the entry point of the evaluator service.
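The timing decisions made in that constructor can be looked at in isolation. The following is a minimal sketch, not aodh code: plan_schedule is a hypothetical helper and the numbers are made-up configuration values. It only mirrors the two rules visible above: the first evaluation is not run immediately when the partition coordinator is active, and the heartbeat spacing is the smaller of the configured heartbeat and a quarter of the evaluation interval.

```python
def plan_schedule(evaluation_interval, coordination_heartbeat,
                  coordinator_active):
    """Return the timing decisions made in the constructor above.

    Hypothetical helper for illustration; aodh computes these inline.
    """
    plan = {
        "evaluate_spacing": evaluation_interval,
        # With an active coordinator the first evaluation is deferred,
        # which leaves time for coordination.
        "evaluate_immediately": not coordinator_active,
        "heartbeat_spacing": None,
    }
    if coordinator_active:
        plan["heartbeat_spacing"] = min(coordination_heartbeat,
                                        evaluation_interval / 4)
    return plan


# Made-up values: a 60 second evaluation interval in both cases.
standalone = plan_schedule(60, 1.0, coordinator_active=False)
clustered = plan_schedule(60, 30.0, coordinator_active=True)
print(standalone)
print(clustered)
```

In the clustered case the configured heartbeat of 30 seconds is larger than a quarter of the interval, so the spacing falls back to 15 seconds.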
Start with the value of self.evaluators, which holds the available evaluator types. There are currently the following six:
combination = aodh.evaluator.combination:CombinationEvaluator
composite = aodh.evaluator.composite:CompositeEvaluator
gnocchi_aggregation_by_metrics_threshold = aodh.evaluator.gnocchi:GnocchiAggregationMetricsThresholdEvaluator
gnocchi_aggregation_by_resources_threshold = aodh.evaluator.gnocchi:GnocchiAggregationResourcesThresholdEvaluator
gnocchi_resources_threshold = aodh.evaluator.gnocchi:GnocchiResourceThresholdEvaluator
threshold = aodh.evaluator.threshold:ThresholdEvaluator
The list above comes from the [aodh.evaluator] section of entry_points.txt.
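Each line in that section maps an alarm type name to a module:Class target. As a rough sketch of what loading such a table involves, the snippet below parses lines of that shape and imports each target with importlib. This is not how stevedore is implemented; load_plugins is a hypothetical helper, and standard-library classes stand in for the aodh evaluators so that the example runs without aodh installed.

```python
import importlib


def load_plugins(lines):
    """Parse 'name = module:attr' lines and import each target."""
    plugins = {}
    for line in lines:
        name, _, target = line.partition("=")
        module_name, _, attr = target.strip().partition(":")
        module = importlib.import_module(module_name)
        plugins[name.strip()] = getattr(module, attr)
    return plugins


# Stand-in targets; a real table would point at aodh.evaluator.* classes.
table = [
    "ordered = collections:OrderedDict",
    "fraction = fractions:Fraction",
]
plugins = load_plugins(table)
print(sorted(plugins))
```

The resulting dictionary plays the role that self.evaluators plays in the service: a lookup from type name to the class that handles it.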
The key part is the following code:
def _evaluate_alarm(self, alarm):
    """Evaluate the alarms assigned to this evaluator."""
    if alarm.type not in self.evaluators:
        LOG.debug('skipping alarm %s: type unsupported', alarm.alarm_id)
        return

    LOG.debug('evaluating alarm %s', alarm.alarm_id)
    try:
        self.evaluators[alarm.type].obj.evaluate(alarm)
    except Exception:
        LOG.exception(_LE('Failed to evaluate alarm %s'), alarm.alarm_id)
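The code above dispatches each alarm to the evaluator that matches its type. The code for a given type can be located through the corresponding entry point listed earlier. The rest of this section uses gnocchi_resources_threshold as the example; the other types work in much the same way.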
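The dispatch step can be reproduced outside aodh with a plain dictionary. The sketch below is illustrative only: FakeAlarm and RecordingEvaluator are hypothetical stand-ins, and a dict replaces the stevedore ExtensionManager, so there is no .obj attribute to go through.

```python
import logging

LOG = logging.getLogger(__name__)


class FakeAlarm(object):
    """Stand-in for an aodh alarm; only the fields used below."""

    def __init__(self, alarm_id, alarm_type):
        self.alarm_id = alarm_id
        self.type = alarm_type


class RecordingEvaluator(object):
    """Hypothetical evaluator that records the alarms it was given."""

    def __init__(self):
        self.seen = []

    def evaluate(self, alarm):
        self.seen.append(alarm.alarm_id)


def evaluate_alarm(evaluators, alarm):
    """Dispatch on alarm.type; return True if an evaluator was found."""
    if alarm.type not in evaluators:
        LOG.debug('skipping alarm %s: type unsupported', alarm.alarm_id)
        return False
    try:
        evaluators[alarm.type].evaluate(alarm)
    except Exception:
        # One failing alarm must not stop the evaluation of the others.
        LOG.exception('Failed to evaluate alarm %s', alarm.alarm_id)
    return True


threshold = RecordingEvaluator()
evaluators = {'gnocchi_resources_threshold': threshold}
handled = evaluate_alarm(
    evaluators, FakeAlarm('a1', 'gnocchi_resources_threshold'))
skipped = evaluate_alarm(evaluators, FakeAlarm('a2', 'event'))
print(handled, skipped, threshold.seen)
```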
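Example
Following the entry point, first locate aodh.evaluator.gnocchi:GnocchiResourceThresholdEvaluator.
The code is quite simple. Note that the excerpt below is the sibling class GnocchiAggregationResourcesThresholdEvaluator from the same file.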
class GnocchiAggregationResourcesThresholdEvaluator(GnocchiBase):
    def _statistics(self, rule, start, end):
        # FIXME(sileht): In case of a heat autoscaling stack decide to
        # delete an instance, the gnocchi metrics associated to this
        # instance will be no more updated and when the alarm will ask
        # for the aggregation, gnocchi will raise a 'No overlap'
        # exception.
        # So temporary set 'needed_overlap' to 0 to disable the
        # gnocchi checks about missing points. For more detail see:
        #   https://bugs.launchpad.net/gnocchi/+bug/1479429
        try:
            return self._gnocchi_client.metric.aggregation(
                metrics=rule['metric'],
                granularity=rule['granularity'],
                query=jsonutils.loads(rule['query']),
                resource_type=rule["resource_type"],
                start=start, stop=end,
                aggregation=rule['aggregation_method'],
                needed_overlap=0,
            )
        except Exception as e:
            LOG.warning(_LW('alarm stats retrieval failed: %s'), e)
            return []
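The only significant call is gnocchi_client.metric.aggregation. As the name suggests, gnocchi_resources_threshold is a gnocchi-backed type, so its data source is gnocchi, and the method above fetches the data through gnocchiclient.
To see what the gnocchi data actually looks like, use the commands below. gnocchi must be installed; for installation and usage, see the earlier articles in this series.
gnocchi metric list  # list all metrics
gnocchi measures show metric_id  # show the measures for a metric id obtained above
That is a very simple example. To query more specific, precisely targeted data, consult the gnocchi command reference.
Back to the code
The code above uses the alarm rule passed in to fetch the data for a specific resource over a specific time window.
Note
The alarm rules for gnocchi_resources_threshold and gnocchi_aggregation_by_resources_threshold both require a resource id, a resource type, and a meter name,
whereas gnocchi_aggregation_by_metrics_threshold only needs metric ids.
In the code, however, all of them end up as metric ids; the first two types have to derive the metric id from the resource id and the meter name.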
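The lookup from resource and meter name to metric id can be pictured as follows. This is a sketch under assumptions: the dictionary imitates the shape of a gnocchi resource description, in which a "metrics" mapping goes from metric name to metric id, and all ids are invented. resolve_metric_id is a hypothetical helper, not an aodh or gnocchiclient function.

```python
def resolve_metric_id(resource, metric_name):
    """Return the metric id registered under metric_name, or None."""
    return resource.get("metrics", {}).get(metric_name)


# Invented sample data shaped like a gnocchi resource description.
instance = {
    "id": "resource-0001",
    "type": "instance",
    "metrics": {
        "cpu_util": "metric-aaaa",
        "memory.usage": "metric-bbbb",
    },
}

# Invented rule fragment: the resource and meter name named by the alarm.
rule = {"resource_id": "resource-0001", "resource_type": "instance",
        "metric": "cpu_util"}
metric_id = resolve_metric_id(instance, rule["metric"])
print(metric_id)
```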
Next comes how the data is compared in order to trigger an alarm.
This code lives in threshold.py and is the key part:
def evaluate(self, alarm):
    if not self.within_time_constraint(alarm):
        LOG.debug('Attempted to evaluate alarm %s, but it is not '
                  'within its time constraint.', alarm.alarm_id)
        return

    state, trending_state, statistics, outside_count = self.evaluate_rule(
        alarm.rule)
    self._transition_alarm(alarm, state, trending_state, statistics,
                           outside_count)
The code above has two steps:
Step 1: compare the fetched data, evaluate_rule
def evaluate_rule(self, alarm_rule):
    """Evaluate alarm rule.

    :returns: state, trending state and statistics.
    """
    start, end = self._bound_duration(alarm_rule)
    statistics = self._statistics(alarm_rule, start, end)
    statistics = self._sanitize(alarm_rule, statistics)
    sufficient = len(statistics) >= alarm_rule['evaluation_periods']
    if not sufficient:
        return evaluator.UNKNOWN, None, statistics, len(statistics)

    def _compare(value):
        op = COMPARATORS[alarm_rule['comparison_operator']]
        limit = alarm_rule['threshold']
        LOG.debug('comparing value %(value)s against threshold'
                  ' %(limit)s', {'value': value, 'limit': limit})
        return op(value, limit)

    compared = list(six.moves.map(_compare, statistics))
    distilled = all(compared)
    unequivocal = distilled or not any(compared)
    number_outside = len([c for c in compared if c])

    if unequivocal:
        state = evaluator.ALARM if distilled else evaluator.OK
        return state, None, statistics, number_outside
    else:
        trending_state = evaluator.ALARM if compared[-1] else evaluator.OK
        return None, trending_state, statistics, number_outside
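Some familiar fields show up here:
threshold (the threshold value), evaluation_periods (the number of evaluation periods), and comparison_operator (>, <, >=, and so on).
These need no detailed explanation, and the rest of the function is fairly clear.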
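To make the comparison concrete, here is a standalone sketch of the same decision logic applied to plain numbers. It is not the aodh implementation: it skips the _bound_duration, _statistics, and _sanitize steps and takes the values directly, and the COMPARATORS table and the state strings are assumptions, since the excerpt above only shows them being used.

```python
import operator

# Assumed comparator table; the excerpt only shows the lookup by name.
COMPARATORS = {
    'gt': operator.gt,
    'lt': operator.lt,
    'ge': operator.ge,
    'le': operator.le,
    'eq': operator.eq,
    'ne': operator.ne,
}

# Assumed state names, used here only as labels.
ALARM, OK, UNKNOWN = 'alarm', 'ok', 'insufficient data'


def evaluate(values, rule):
    """Return (state, trending_state) for a list of aggregated values."""
    if len(values) < rule['evaluation_periods']:
        return UNKNOWN, None
    op = COMPARATORS[rule['comparison_operator']]
    compared = [op(value, rule['threshold']) for value in values]
    if all(compared):
        return ALARM, None
    if not any(compared):
        return OK, None
    # Mixed results: no definite state, only a trend from the latest value.
    trending_state = ALARM if compared[-1] else OK
    return None, trending_state


rule = {'threshold': 80.0, 'comparison_operator': 'gt',
        'evaluation_periods': 3}
all_above = evaluate([85.0, 90.0, 95.0], rule)
all_below = evaluate([10.0, 20.0, 30.0], rule)
mixed = evaluate([10.0, 90.0, 95.0], rule)
too_few = evaluate([90.0], rule)
print(all_above, all_below, mixed, too_few)
```

Only when every value agrees does the function return a definite state; a mixed window yields just a trending state taken from the most recent value.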
Step 2: trigger the alarm, _transition_alarm
def _transition_alarm(self, alarm, state, trending_state, statistics,
                      outside_count):
    unknown = alarm.state == evaluator.UNKNOWN
    continuous = alarm.repeat_actions

    if trending_state:
        if unknown or continuous:
            state = trending_state if unknown else alarm.state
            reason, reason_data = self._reason(alarm, statistics, state,
                                               outside_count)
            self._refresh(alarm, state, reason, reason_data)
            return
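As you can see, alarm, state, reason, and reason_data are exactly the data that the notification step needs. How they are produced is fairly clear, so it is not repeated here.
That covers the main logic of the evaluator.
For the notifier, again start with the entry points: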
[aodh.notifier]
http = aodh.notifier.rest:RestAlarmNotifier
https = aodh.notifier.rest:RestAlarmNotifier
log = aodh.notifier.log:LogAlarmNotifier
test = aodh.notifier.test:TestAlarmNotifier
trust+http = aodh.notifier.trust:TrustRestAlarmNotifier
trust+https = aodh.notifier.trust:TrustRestAlarmNotifier
trust+zaqar = aodh.notifier.zaqar:TrustZaqarAlarmNotifier
zaqar = aodh.notifier.zaqar:ZaqarAlarmNotifier
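The list above covers all of the alarm notification methods; see the official documentation for details on each.
The log notifier is used here as the example for walking through the code.
Step 1
Again, start with the __init__.py file.
The class to look at is AlarmEndpoint: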
@staticmethod
def _process_alarm(notifiers, data):
    """Notify that alarm has been triggered.

    :param notifiers: list of possible notifiers
    :param data: (dict): alarm data
    """
    actions = data.get('actions')
    if not actions:
        LOG.error(_LE("Unable to notify for an alarm with no action"))
        return

    for action in actions:
        AlarmEndpoint._handle_action(notifiers, action,
                                     data.get('alarm_id'),
                                     data.get('alarm_name'),
                                     data.get('severity'),
                                     data.get('previous'),
                                     data.get('current'),
                                     data.get('reason'),
                                     data.get('reason_data'))
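This is where alarm notifications are processed. It calls _handle_action, whose main code is analyzed below:
notifier = notifiers[action.scheme].obj  # look up the notifier for this scheme
Then the concrete notifier's notify method is called: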
try:
    LOG.debug("Notifying alarm %(id)s with action %(act)s",
              {'id': alarm_id, 'act': action})
    notifier.notify(action, alarm_id, alarm_name, severity,
                    previous, current, reason, reason_data)
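The action.scheme lookup works because each alarm action is a URL whose scheme names the notifier. The sketch below illustrates this with the standard library's urlsplit, which is an assumption made for the example; the excerpts here do not show how aodh parses the action. NOTIFIERS is a hypothetical table that maps a few of the schemes from [aodh.notifier] to class names as plain strings.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: scheme -> notifier class name (as a string).
NOTIFIERS = {
    'log': 'LogAlarmNotifier',
    'http': 'RestAlarmNotifier',
    'https': 'RestAlarmNotifier',
    'trust+https': 'TrustRestAlarmNotifier',
}


def pick_notifier(action_url):
    """Return the notifier registered for the URL's scheme, or None."""
    action = urlsplit(action_url)
    return NOTIFIERS.get(action.scheme)


print(pick_notifier('log://'))
print(pick_notifier('https://example.com/hook'))
print(pick_notifier('trust+https://example.com/hook'))
print(pick_notifier('ftp://example.com/'))
```

A scheme with no registered notifier, such as ftp in the last call, yields None here; a real service would have to report that case as an error.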
If the notifier is log, the code is in log.py under the notifier directory:
class LogAlarmNotifier(notifier.AlarmNotifier):
    """Log alarm notifier."""

    @staticmethod
    def notify(action, alarm_id, alarm_name, severity, previous, current,
               reason, reason_data):
        LOG.info(_(
            "Notifying alarm %(alarm_name)s %(alarm_id)s of %(severity)s "
            "priority from %(previous)s to %(current)s with action %(action)s"
            " because %(reason)s.") % ({'alarm_name': alarm_name,
                                        'alarm_id': alarm_id,
                                        'severity': severity,
                                        'previous': previous,
                                        'current': current,
                                        'action': action.geturl(),
                                        'reason': reason}))
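This notifier simply writes a log entry. All of the parameters above can be traced back to the evaluator code. This is also the place where you can implement your own notification logic.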
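As a sketch of what such custom logic could look like, here is a standalone class with the same notify signature that collects notifications in memory instead of logging them. Everything in it is hypothetical: CollectingNotifier does not exist in aodh, it does not inherit from aodh's notifier base class so that the example runs on its own, and the collect:// scheme is invented. A real notifier would also have to be registered under [aodh.notifier] with its own scheme.

```python
from urllib.parse import urlsplit


class CollectingNotifier(object):
    """Hypothetical notifier that keeps alarm transitions in a list."""

    received = []

    @staticmethod
    def notify(action, alarm_id, alarm_name, severity, previous, current,
               reason, reason_data):
        CollectingNotifier.received.append({
            'alarm_id': alarm_id,
            'alarm_name': alarm_name,
            'severity': severity,
            'transition': (previous, current),
            'action': action.geturl(),
            'reason': reason,
        })


# Invented sample call; the action is a parsed URL, as in the log notifier.
CollectingNotifier.notify(
    urlsplit('collect://queue-1'), 'a1', 'cpu_high', 'low',
    'ok', 'alarm', 'sample reason text', {'type': 'threshold'})
print(CollectingNotifier.received[0]['transition'])
```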
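The api code mainly responds to calls on the external interface.
api/controllers/v2/alarm.py contains the alarm post, get, put, and delete methods, which handle the CRUD operations on alarms.
The database CRUD operations live in storage/impl_sqlalchemy.py.
The logic there is fairly clear, so it is left for readers to work through on their own.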