This article describes how to build a Serverless monitoring system simply and conveniently with Alibaba Cloud Function Compute (FC) and Log Service. A typical Log Service use case is to upload monitoring metric data as logs (in JSON/CSV format), for example one log entry per request. With Log Service's powerful, easy-to-use indexing, query and analysis, dashboard, and alerting features, a monitoring dashboard and alerting system can be built at very low cost.

As the business grows, however, once the daily log volume reaches hundreds of millions of entries or more, aggregating more than a month of raw data in real time (for example, showing the P99 latency trend of the past 30 days on a dashboard) is no longer realistic. One possible workaround is to aggregate locally on the server side to reduce the number of logs, but this discards the detailed information in the raw logs and makes later per-request investigation difficult, so it is not ideal. Since the root cause is that long-time-range queries aggregate too much data, a natural solution is to run scheduled pre-aggregation on top of Log Service. We abstract the metric aggregation system shown in the figure below; this article explains how to use FC to implement the Aggregator, leveraging Log Service's query and analysis capabilities to build a Serverless system for aggregating metrics at massive scale.
Below is the architecture of a very simple Serverless metric aggregation system, which requires implementing only a few modules:
The time trigger passes triggerTime into the function through the event (an example event is sketched below). The function takes the window one to two minutes before this time as the aggregation start time and, at one-minute granularity, issues an SQL aggregation query like the one below against Log Service. Log Service reduces the O(N) raw data to O(1) aggregated rows and returns them to the function, which then writes the aggregated data back into Logstore (Agg).
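For reference, here is a sketch of the event the function receives from the time trigger (the field names follow the FC time-trigger convention of triggerTime/triggerName/payload; the concrete values are placeholders). The triggerTime format matches the "%Y-%m-%dT%H:%M:%SZ" parsing used in the function code later on:

{
    "triggerTime": "2018-11-27T10:05:00Z",
    "triggerName": "metrics-agg-trigger",
    "payload": ""
}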
To guard against an exception in the function logic causing aggregation to fail for some time range, the architecture in the figure below can be used instead: rather than relying on triggerTime, the last completed aggregation time is persisted in Table Store and used as the start time of the next aggregation (a sketch of this checkpoint logic is shown below):
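Below is a minimal sketch of that checkpoint idea, assuming the Table Store (OTS) Python SDK (tablestore) is packaged with the function and a table named agg_checkpoint with a single string primary key id already exists. The table, column, and endpoint names and the window arithmetic are illustrative assumptions, not the article's exact implementation:

# Checkpoint-based window selection, sketched with the Table Store Python SDK.
# Assumptions: a table 'agg_checkpoint' with string primary key 'id', and an
# attribute column 'last_agg_time' holding the unix end time of the last run.
import time
from tablestore import OTSClient, Row

TABLE = 'agg_checkpoint'                    # assumed table name
PK = [('id', 'latency-metrics')]            # assumed primary key value

def load_checkpoint(ots):
    # Read the end time of the last successful aggregation, or None on first run
    consumed, row, _ = ots.get_row(TABLE, PK, columns_to_get=['last_agg_time'])
    if row is None:
        return None
    cols = dict((name, value) for name, value, ts in row.attribute_columns)
    return cols.get('last_agg_time')

def save_checkpoint(ots, end_time):
    # Persist the new checkpoint only after the aggregated rows have been written
    ots.put_row(TABLE, Row(PK, [('last_agg_time', end_time)]))

def next_window(ots):
    # Aggregate from the saved checkpoint (or two minutes ago on the first run)
    # forward in one-minute steps, staying at least one minute behind "now" so
    # that the raw logs for the window have already been ingested.
    now = int(time.time())
    from_time = load_checkpoint(ots) or (now - 120)
    to_time = min(from_time + 60, now - 60)
    return from_time, to_time

# Usage inside the handler (endpoint and instance name are placeholders):
# ots = OTSClient('https://instance.cn-shanghai.ots.aliyuncs.com',
#                 access_key_id, secret_key, 'instance', sts_token=security_token)
# from_time, to_time = next_window(ots)
# ... run the aggregation query, put results into the agg logstore ...
# save_checkpoint(ots, to_time)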
Create the function. Here it is written on the python2.7 runtime; the Log Service Python SDK is built into the FC python2.7 runtime, so no extra packaging is needed. The function issues the query below against Log Service, aggregating the raw data into request success counts, error counts, and average, P99, and P99.9 latency.
select (__time__ - __time__ % 60) as t,
       avg(latency) as latencyAvg,
       approx_percentile(latency, 0.99) as latencyP99,
       approx_percentile(latency, 0.999) as latencyP99dot9,
       count_if(status >= 200 and status < 300) as successes,
       count_if(status >= 400 and status < 500) as clientErrors,
       count_if(status >= 500) as serverErrors
group by t
order by t
limit 3000
import json
import logging
import time
from datetime import datetime

from aliyun.log import *


def handler(event, context):
    # The time trigger passes triggerTime (e.g. "2018-11-27T10:05:00Z") in the event
    evt = json.loads(event)
    trigger_time = evt['triggerTime']
    dt = datetime.strptime(trigger_time, "%Y-%m-%dT%H:%M:%SZ")
    starttime_unix = int(time.mktime(dt.timetuple()))

    logger = logging.getLogger()
    logger.info(evt)

    # Log Service endpoint and temporary credentials from the function's service role
    endpoint = 'https://cn-shanghai.log.aliyuncs.com'
    creds = context.credentials
    access_key_id = creds.access_key_id
    secret_key = creds.access_key_secret
    security_token = creds.security_token

    # Replace with your own log project and logstores
    project = 'metrics-project'
    logstore_raw = 'metrics-raw'
    logstore_agg = 'metrics-agg'

    client = LogClient(endpoint, access_key_id, secret_key, securityToken=security_token)
    topic = ""
    source = ""
    query = "*|select (__time__ - __time__ %60) as t, avg(latency) as latencyAvg, approx_percentile(latency, 0.99) as latencyP99, approx_percentile(latency, 0.999) as latencyP99dot9, count_if(status >= 200 and status < 300) as successes, count_if(status >= 400 and status < 500) as clientErrors, count_if(status >= 500) as serverErrors group by t order by t limit 3000"

    # Query time range between trigger_time - 120s and trigger_time - 60s
    from_time = starttime_unix - 120
    to_time = starttime_unix - 60
    logger.info("From " + str(from_time) + ", to " + str(to_time))

    # Retry (up to 3 times) if the get_logs response is not complete
    logitems = []
    for retry_time in range(0, 3):
        # Run the aggregation query against the raw logstore
        get_req = GetLogsRequest(project=project, logstore=logstore_raw, fromTime=from_time,
                                 toTime=to_time, topic=topic, query=query)
        resp = client.get_logs(get_req)
        logitems = []
        if resp is not None and resp.is_completed():
            # Convert each aggregated row into a LogItem
            for log in resp.get_logs():
                logitem = LogItem()
                logitem.set_time(int(time.time()))
                logcontents = log.get_contents()
                contents = []
                for key in logcontents:
                    print(key)
                    print(logcontents[key])
                    contents.append((key, logcontents[key]))
                logitem.set_contents(contents)
                logitems.append(logitem)
            if len(logitems) == 0:
                print("No more logitems to put, breaking")
                break
            # Put the aggregated rows into the "agg" logstore
            put_req = PutLogsRequest(project, logstore_agg, topic, source, logitems)
            client.put_logs(put_req)
            break

    return str(len(logitems)) + " log items were put into " + logstore_agg
Note: the service role must have permissions on the corresponding Log Service logstores.
Configure a time trigger for the Aggregator function; the trigger frequency or rule can be chosen as needed:
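For example, if the function is deployed with Funcraft, a per-minute time trigger might be declared roughly as follows. This is only a sketch: the trigger can equally be created in the console, and the trigger name, cron expression, and payload below are assumptions:

# template.yml fragment (Aliyun::Serverless), sketched for a per-minute trigger
Events:
  agg-every-minute:
    Type: Timer
    Properties:
      CronExpression: '0 */1 * * * *'   # FC cron expressions include a seconds field
      Enable: true
      Payload: ''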
Each per-minute invocation only asks Log Service to aggregate one minute's worth of data. Even at 100 billion log entries per day (million-level TPS), a single minute covers only about 70 million raw entries, and Log Service can complete aggregation over hundreds of millions of log entries within seconds.
Since the aggregated logstore holds very little data, several months of aggregated data can be queried with ease, making long-term monitoring and analysis of the business possible.
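For instance, a 90-day trend query over the agg logstore might look roughly like the sketch below. It is an assumption-laden example: it mirrors the bucketing style of the query above, assumes the fields t, latencyP99, and serverErrors are indexed in the agg logstore, and casts them because logstore contents are stored as strings:

* | select (cast(t as bigint) - cast(t as bigint) % 86400) as day,
        max(cast(latencyP99 as double)) as worstP99,
        sum(cast(serverErrors as bigint)) as serverErrors
    group by day order by day limit 90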
With fewer than 100 lines of Python code, two Log Service logstores, and not a single server, this article builds a simple, lightweight metric pre-aggregation system that covers most monitoring and BI needs, and solves the pain point that ad-hoc queries over massive raw data either cannot complete or cannot return quickly. The system also enjoys the advantages that come naturally with Serverless.
I hope this article serves as a stepping stone, and that developers go on to discover more new ways to apply Serverless in the monitoring space.