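Abnormal CPU usage in Apache Kylin
Kylin version: 2.6.1
- Abnormal CPU usage in Apache Kylin
- Creating test data
- Creating the test Cube
- Building SQL with many AND/OR conditions
- What many AND/OR operations lead to
- Solution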
Creating test data
First, generate some test data:
# Remove any previous output; -f avoids an error when the file does not exist
rm -f data.csv
for i in {0..1000}
do
echo "$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333
$i,$i,2333" >> data.csv
done
use temp;
drop table if exists test_filter;
create table test_filter(a int, b int, value int)
row format delimited
FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH "/home/kylin/data.csv" into table test_filter;
Creating the test Cube
Omitted.
Building SQL with many AND/OR conditions
The script that generates the SQL:
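#!/bin/bash
# Optional first argument: number of OR-ed ranges per column (default: 4).
n=${1:-4}
echo "select * from temp.test_filter
where ((a >= 0 and a <= 1)"
# Remaining ranges on column a
for ((i = 1; i < n; i++))
do
echo "or (a >= $i and a <= $((i+1)))"
done
echo ") and ((b >= 0 and b <= 1)"
# Remaining ranges on column b
for ((i = 1; i < n; i++))
do
echo "or (b >= $i and b <= $((i+1)))"
done
echo ")"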
Let's start with a simpler SQL statement and walk through the execution steps:
select * from temp.test_filter
where ((a >= 0 and a <= 1)
or (a >= 1 and a <= 2)
or (a >= 2 and a <= 3)
or (a >= 3 and a <= 4)
) and ((b >= 0 and b <= 1)
or (b >= 1 and b <= 2)
or (b >= 2 and b <= 3)
or (b >= 3 and b <= 4)
)
What many AND/OR operations lead to
org.apache.kylin.metadata.filter.TupleFilter, around line 240:
// boolean algebra flatten
if (op == FilterOperatorEnum.AND) {
    flatFilter = new LogicalTupleFilter(FilterOperatorEnum.AND);
    for (TupleFilter andChild : andChildren) {
        flatFilter.addChildren(andChild.getChildren());
    }
    if (!orChildren.isEmpty()) {
        List<TupleFilter> fullAndFilters = cartesianProduct(orChildren, flatFilter);
        flatFilter = new LogicalTupleFilter(FilterOperatorEnum.OR);
        flatFilter.addChildren(fullAndFilters);
    }
}
This code explains why the CPU gets maxed out.
Here, op is the AND between
((a >= 0 and a <= 1)
or (a >= 1 and a <= 2)
or (a >= 2 and a <= 3)
or (a >= 3 and a <= 4)
)
and
((b >= 0 and b <= 1)
or (b >= 1 and b <= 2)
or (b >= 2 and b <= 3)
or (b >= 3 and b <= 4)
)
orChildren consists of
((a >= 0 and a <= 1)
or (a >= 1 and a <= 2)
or (a >= 2 and a <= 3)
or (a >= 3 and a <= 4)
)
and
((b >= 0 and b <= 1)
or (b >= 1 and b <= 2)
or (b >= 2 and b <= 3)
or (b >= 3 and b <= 4)
)
So execution ends up here:
List<TupleFilter> fullAndFilters = cartesianProduct(orChildren, flatFilter);
This computes a Cartesian product in order to flatten the conditions of the WHERE clause into the following form:
((a >= 0 and a <= 1 and b >= 0 and b <= 1)
or (a >= 0 and a <= 1 and b >= 1 and b <= 2)
or (a >= 0 and a <= 1 and b >= 2 and b <= 3)
or (a >= 0 and a <= 1 and b >= 3 and b <= 4)
or (a >= 1 and a <= 2 and b >= 0 and b <= 1)
or (a >= 1 and a <= 2 and b >= 1 and b <= 2)
or (a >= 1 and a <= 2 and b >= 2 and b <= 3)
or (a >= 1 and a <= 2 and b >= 3 and b <= 4)
or (a >= 2 and a <= 3 and b >= 0 and b <= 1)
or (a >= 2 and a <= 3 and b >= 1 and b <= 2)
or (a >= 2 and a <= 3 and b >= 2 and b <= 3)
or (a >= 2 and a <= 3 and b >= 3 and b <= 4)
or (a >= 3 and a <= 4 and b >= 0 and b <= 1)
or (a >= 3 and a <= 4 and b >= 1 and b <= 2)
or (a >= 3 and a <= 4 and b >= 2 and b <= 3)
or (a >= 3 and a <= 4 and b >= 3 and b <= 4)
)
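To make the flattening concrete, here is a minimal Java sketch that reproduces the 4 x 4 expansion above on plain strings. It is only an illustration: the class FlattenSketch and its methods cartesianProduct and ranges are names I made up, and it does not use Kylin's TupleFilter classes or claim to match how Kylin's own cartesianProduct is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not Kylin code.
public class FlattenSketch {

    // Each OR group is a list of AND-terms, e.g. "a >= 0 and a <= 1".
    // Returns every combination of one term per group, joined with " and ".
    public static List<String> cartesianProduct(List<List<String>> orGroups) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> group : orGroups) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String term : group) {
                    next.add(prefix.isEmpty() ? term : prefix + " and " + term);
                }
            }
            result = next;
        }
        return result;
    }

    // Builds the n ranges "col >= i and col <= i+1" used in the test SQL.
    public static List<String> ranges(String col, int n) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            terms.add(col + " >= " + i + " and " + col + " <= " + (i + 1));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<List<String>> groups = new ArrayList<>();
        groups.add(ranges("a", 4));
        groups.add(ranges("b", 4));
        List<String> flat = cartesianProduct(groups);
        System.out.println(flat.size());  // 16
        System.out.println(flat.get(0));  // a >= 0 and a <= 1 and b >= 0 and b <= 1
    }
}
```

Every element of the returned list corresponds to one OR branch of the flattened filter, so the list length is the product of the group sizes.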
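So here is the question: why flatten two AND-connected conditions into so many conditions?
Judging from this code, my guess is that it has to do with the default ordering of dimension combinations when the HTable is built: once the filter is flattened, the algorithm above can scan all the relevant rows in a single pass.
That is a nice optimization of the execution plan, but what is the downside of flattening like this?
In the example above, and([4], [4]) ends up as a List of the form or([16]).
What if each group had 200 conditions instead of 4?
The final List would have 200 * 200 = 40000 elements.
So you can use the script above to generate a SQL statement with 200 conditions per column and run it.
The effect is immediate: a single SQL statement of this kind is almost enough to make the whole system too sluggish to use, not to mention that Tableau may send several similar statements at once (using the "Only Relevant Values" option on a filter in a Tableau report is enough to trigger this kind of SQL). That is undoubtedly fatal.
If a few more columns are filtered, the length of the final List grows exponentially with the number of columns, and the CPU ends up doing nothing but appending elements to the List, inserting tens or hundreds of thousands of elements seemingly without end.
SQL like this is clearly unreasonable, and a single such statement may be enough to drive the CPU usage of the whole instance to abnormal levels.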
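The growth is easy to put into numbers. The sketch below only does the arithmetic: it multiplies the sizes of the OR groups to get the number of AND branches after flattening. FlattenGrowth and flattenedSize are hypothetical names used for illustration and have nothing to do with Kylin's API.

```java
// Hypothetical illustration; not Kylin code.
public class FlattenGrowth {

    // Number of AND branches after flattening: one factor per OR group.
    public static long flattenedSize(int... groupSizes) {
        long size = 1;
        for (int n : groupSizes) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            size = Math.multiplyExact(size, (long) n);
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(flattenedSize(4, 4));           // 16
        System.out.println(flattenedSize(200, 200));       // 40000
        System.out.println(flattenedSize(200, 200, 200));  // 8000000
    }
}
```

With n conditions per column and k filtered columns the result is n^k, so one extra column with 200 conditions multiplies the work by 200.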
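Solution
Lower the value of kylin.query.flat-filter-max-children so that, when the Cartesian product could become too large, the query is rejected with an exception instead of being executed.
Related issues: KYLIN-3797, KYLIN-4180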
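The idea behind such a limit can be sketched as a size check that runs before anything is materialized. This is my own illustration of the general technique, not Kylin's implementation: FlattenGuard, checkedProductSize and the limit of 10000 are all hypothetical, and the limit is an arbitrary example value, not Kylin's default.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a "reject before flattening" guard; not Kylin code.
public class FlattenGuard {

    // Multiplies the OR group sizes and fails fast once the running product
    // exceeds maxChildren, so the huge list is never built.
    public static long checkedProductSize(List<Integer> orGroupSizes, int maxChildren) {
        long size = 1;
        for (int n : orGroupSizes) {
            size *= n; // cannot overflow: size <= maxChildren (an int) before each multiply
            if (size > maxChildren) {
                throw new IllegalStateException(
                        "Flattened filter would exceed " + maxChildren + " children");
            }
        }
        return size;
    }

    public static void main(String[] args) {
        int limit = 10000; // arbitrary example limit
        System.out.println(checkedProductSize(Arrays.asList(4, 4), limit)); // 16
        try {
            checkedProductSize(Arrays.asList(200, 200), limit);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Checking the running product costs one multiplication per OR group, which is negligible next to building tens of thousands of filter objects.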