diagnosing root cause of isq-yinzheng-2020-vldb

iSQ (intermittent slow query)

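  • Definition: for a SQL query Q, let Q_t be its t-th occurrence and T the total number of recent occurrences of Q. Q_t is an iSQ if its execution time X_t > z (z relates to a probability distribution).
  • Significance: this kind of slow SQL has a large impact on other queries. Ordinary slow queries, by contrast, are usually fixed through SQL optimization, indexes, and similar tuning.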

KPIs (key performance indicators)

  • Definition: metrics from physical machines, Docker instances, and MySQL configurations; 59 KPIs in 8 categories.


KPI symptoms => anomaly types

spike up/down (robust threshold: median, MAD, and Cauchy distribution [tcprt])
level shift-up/shift-down (two windows; check whether the two windows are consistent, using a T-Test [A. Pettitt. A non-parametric approach to the change-point problem. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(2):126–135, 1979.])
void
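
A minimal sketch of the first two anomaly types above, using only the Python standard library. It assumes a median +/- k * MAD band for spikes and Welch's t statistic over two adjacent windows for level shifts. The function names, the multiplier k, and the t_threshold cut-off are illustrative assumptions, not values from the paper, and the Cauchy-distribution part of the robust threshold is not modeled here.

```python
import statistics


def spike_type(window, value, k=3.0):
    """Classify `value` against a robust band built from `window`.

    The band is median +/- k * MAD; k is an assumed multiplier.
    """
    med = statistics.median(window)
    mad = statistics.median([abs(x - med) for x in window])
    if mad == 0:
        mad = 1e-9  # keep a non-zero band on a perfectly flat window
    if value > med + k * mad:
        return "spike up"
    if value < med - k * mad:
        return "spike down"
    return "normal"


def level_shift_type(before, after, t_threshold=3.0):
    """Compare two adjacent windows (each with >= 2 points) via Welch's t."""
    m1, m2 = statistics.mean(before), statistics.mean(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)
    se = (v1 / len(before) + v2 / len(after)) ** 0.5
    if se == 0:
        # Both windows are flat: any difference in level counts as a shift.
        if m1 == m2:
            return "normal"
        return "level shift up" if m2 > m1 else "level shift down"
    t = (m2 - m1) / se
    if t > t_threshold:
        return "level shift up"
    if t < -t_threshold:
        return "level shift down"
    return "normal"
```

A fixed cut-off on the t statistic is used instead of a p-value so the sketch needs no third-party library; a real implementation would likely pick the threshold from the window sizes.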

Other approaches

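Experimental environment => cost, plus machines, events, workload, etc.
DBSherlock: predicate-based illustration of anomalies with a decision-tree-like implementation. Excessive information strongly affects the tree.
1. Focus only on the features and outcome of the given anomalous event.
2. Construct independent KPIs (e.g., if A => B is already known, remove B).
3. Cluster with TOPIC (type-oriented pattern integration clustering).
4. Apply the Bayesian Case Model to each cluster.
5. Label.
6. Online iSQ => classification.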
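
Step 6 can be sketched as a nearest-representative lookup: an online iSQ's pattern is compared with each labeled representative, and the label of the closest one is returned. This is only a sketch under assumptions: the similarity used here (fraction of KPIs whose state matches) is a stand-in, and the KPI names and root-cause labels in the usage note are hypothetical.

```python
def match_score(pattern_a, pattern_b):
    """Fraction of shared KPIs whose state is identical (assumed measure)."""
    shared = pattern_a.keys() & pattern_b.keys()
    if not shared:
        return 0.0
    return sum(pattern_a[k] == pattern_b[k] for k in shared) / len(shared)


def classify(pattern, representatives):
    """Return (root_cause, score) of the most similar representative.

    `representatives` is a list of dicts with hypothetical keys
    "root_cause" and "pattern".
    """
    best = max(representatives, key=lambda r: match_score(pattern, r["pattern"]))
    return best["root_cause"], match_score(pattern, best["pattern"])
```

For instance, with representatives labeled "cause_A" and "cause_B" (made-up labels), `classify(new_pattern, representatives)` returns the label whose pattern agrees with `new_pattern` on the most KPIs.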

detail

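1. Anomaly detection
2. Dependency cleansing
Unlike DBSherlock, which computes mutual information (I did not understand why that is not used here), this step computes the confidence of association rules. DBAs extracted 10 dependencies.
3. TOPIC
The pattern of an iSQ consists of its KPI states (whether each KPI is anomalous).
Similarity between two iSQs: compute the distance within each KPI type, then average.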
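
Steps 2 and 3 above can be sketched as follows. The confidence of a rule A => B is taken as the fraction of records with A anomalous in which B is also anomalous; a KPI is dropped when an already-kept KPI implies it with high confidence. The per-type similarity is assumed to be the fraction of matching KPI states, averaged over types. The min_conf value, the greedy keep order, the function names, and the KPI names in the usage note are all illustrative assumptions.

```python
def rule_confidence(records, a, b):
    """Confidence of a => b: P(b anomalous | a anomalous)."""
    a_hits = [r for r in records if r[a]]
    if not a_hits:
        return 0.0
    return sum(1 for r in a_hits if r[b]) / len(a_hits)


def drop_dependent(kpis, records, min_conf=0.9):
    """Greedily keep KPIs, skipping any implied by an already-kept KPI."""
    kept = []
    for b in kpis:
        if any(rule_confidence(records, a, b) >= min_conf for a in kept):
            continue  # b is treated as dependent on a kept KPI
        kept.append(b)
    return kept


def pattern_similarity(p, q, kpi_types):
    """Average, over KPI types, of the fraction of matching states.

    `kpi_types` maps a type name to the list of KPI names of that type.
    """
    per_type = []
    for kpis in kpi_types.values():
        matches = sum(p[k] == q[k] for k in kpis)
        per_type.append(matches / len(kpis))
    return sum(per_type) / len(per_type)
```

Each record is a dict mapping a KPI name to True when that KPI was anomalous for one iSQ, e.g. `{"cpu": True, "load": True, "io": False}`. Averaging per type rather than over all KPIs keeps a type with many KPIs from dominating the score.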

4. Bayesian Case Model
5. DBAs label the results of steps 3 and 4 and extract 10 root causes.

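Implementation:
319 anomalous SQL queries collected in one day, then labeled. 55% are used offline and 45% online (compared against the 10 representative iSQs obtained from BCM).
Evaluation: F1-score, precision, recall.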
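
For reference, the three evaluation metrics can be computed per root-cause label as below. This is a generic sketch of the standard definitions; the function name and the one-vs-rest treatment of a single `positive` label are assumptions, not the paper's exact evaluation protocol.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one label, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```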

 clustering accuracy()
 normalized mutual information(NMI)
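
A small sketch of NMI between two clusterings of the same items, using only the standard library. Several normalizations exist; this one divides the mutual information by the arithmetic mean of the two entropies, which is an assumption and may differ from the variant used in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(labels_a) + entropy(labels_b)) / 2
    # Two single-cluster assignments carry no information; treat as identical.
    return mi / denom if denom > 0 else 1.0
```

NMI ignores how clusters are named, so two clusterings that group the items identically score 1 even when their label ids differ.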

TOPIC is compared with hierarchical clustering, k-means, and DBSCAN.
