Functions: the most widely ignored performance tweak
People frequently complain about stored procedure performance in PostgreSQL. In many cases, the cause of the poor performance becomes clear from a look at the function's definition.
In general, a PostgreSQL function can be marked with one of three volatility categories: VOLATILE (the default), STABLE or IMMUTABLE. Independently of that, it can be marked [NOT] LEAKPROOF. What does this actually mean?
To show the impact this can have, we can create a small table consisting of just integer values:
test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000) AS id;
SELECT 1000000
In our example we have added one million rows to the table. In the next step we can define a simple index:
test=# CREATE INDEX idx_test ON t_test (id);
CREATE INDEX
The point now is: many complaints about poor performance arise when people compare the output of a function to a column, as shown in the following listing:
test=# explain SELECT * FROM t_test WHERE id = round(17.5, 0)::int4;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Only Scan using idx_test on t_test  (cost=0.00..8.38 rows=1 width=4)
   Index Cond: (id = 18)
(2 rows)
We want to check whether any row is identical to round(17.5, 0)::int4. As you can see, PostgreSQL can use the index nicely. The reason is that the round function will ALWAYS return 18 if you pass 17.5 as a parameter. It is simply a mathematical fact. Technically this means that PostgreSQL can calculate the function ONCE and use the result to search the index. Inside PostgreSQL the function is marked as IMMUTABLE. Its output will never change for the same input:
test=# SELECT proname, provolatile FROM pg_proc WHERE proname = 'round' LIMIT 1;
 proname | provolatile
---------+-------------
 round   | i
(1 row)
What happens if we try the very same thing using the random() function? The execution plan will be very different:
test=# explain SELECT * FROM t_test WHERE id = random()::int4;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..21925.00 rows=1 width=4)
   Filter: (id = (random())::integer)
(2 rows)
In this case we have to read the entire table to find the right answer. Reading one million rows is expensive, so this is not a good strategy. The problem is that there is no way for the optimizer to use an index. The random() function changes its result every time it is called. So, which value would you look up in the index? The answer is: there is no way to use the index, because every time you inspect a row the output of random() will already be something else. This behavior is called VOLATILE:
test=# SELECT proname, provolatile FROM pg_proc WHERE proname = 'random' LIMIT 1;
 proname | provolatile
---------+-------------
 random  | v
(1 row)
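The third category, STABLE, sits between the two cases shown so far: a STABLE function returns the same result for the same arguments within a single statement, so its result can also serve as an index lookup value. The following is a minimal sketch of that; the table t_event, its column created and the index idx_event are made up for illustration and are not part of the original example. The plan you get depends on your PostgreSQL version and statistics.

```sql
-- Look up the volatility flag for one function of each category.
-- DISTINCT collapses overloaded variants (round has several signatures).
-- i = IMMUTABLE, s = STABLE, v = VOLATILE
SELECT DISTINCT proname, provolatile
FROM pg_proc
WHERE proname IN ('round', 'now', 'random')
ORDER BY proname;

-- Hypothetical table with a timestamp column, used only for this sketch.
CREATE TABLE t_event AS
    SELECT now() - n * interval '1 minute' AS created
    FROM generate_series(1, 1000000) AS n;
CREATE INDEX idx_event ON t_event (created);
ANALYZE t_event;

-- now() is STABLE: it keeps one value for the whole statement, so the
-- planner is allowed to use its result in an index condition.
EXPLAIN SELECT * FROM t_event WHERE created > now() - interval '1 hour';
```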
The performance difference between a VOLATILE and an IMMUTABLE function will be substantial because the VOLATILE function has to be called one million times:
test=# explain analyze SELECT * FROM t_test WHERE id = random()::int4;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..21925.00 rows=1 width=4) (actual time=244.884..244.884 rows=0 loops=1)
   Filter: (id = (random())::integer)
   Rows Removed by Filter: 1000000
 Total runtime: 244.947 ms
(4 rows)

test=# explain analyze SELECT * FROM t_test WHERE id = round(17.5, 0)::int4;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_test on t_test  (cost=0.00..8.38 rows=1 width=4) (actual time=1.064..1.065 rows=1 loops=1)
   Index Cond: (id = 18)
   Heap Fetches: 1
 Total runtime: 1.084 ms
(4 rows)
In our example the execution time of the sequential scan (244.947 ms) is roughly 225 times higher than in the index-optimized case (1.084 ms).
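If the intent is to draw one value and then look it up, rather than to draw a fresh value for every row, the volatile call can be moved into an uncorrelated scalar subquery. This is a sketch of a commonly used rewrite, not something the original example does. Note that it is a different query with different semantics: random() is evaluated once per statement, and every row is compared against that single value. PostgreSQL typically plans such a subquery as an InitPlan, which lets it use the index, but check the plan on your own version.

```sql
-- Evaluates random() for every row: sequential scan, as in the listing above.
EXPLAIN SELECT * FROM t_test WHERE id = random()::int4;

-- Evaluates random() once inside an uncorrelated scalar subquery and
-- compares each row to that one value. Not equivalent to the query above.
EXPLAIN SELECT * FROM t_test WHERE id = (SELECT random()::int4);
```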
Tweaking your own functions
So, if you are writing your own procedures, please never forget to mark your function appropriately. Otherwise the function might be called far too often, which is bad for performance. Defining your function properly is a very simple tweak offering great potential.
Here is how this can be done:
test=# CREATE OR REPLACE FUNCTION ld(int)
       RETURNS numeric AS
'
    SELECT log(2, $1)
' LANGUAGE sql IMMUTABLE;
CREATE FUNCTION
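Once ld is marked IMMUTABLE, it can be checked and put to use as sketched below. The index name idx_test_ld is made up for illustration. Keep in mind that PostgreSQL largely trusts the label you give: marking a function IMMUTABLE when its result can actually change may lead to wrong query results, so the label has to be truthful.

```sql
-- Confirm the volatility flag stored for the function (expected flag: i).
SELECT proname, provolatile FROM pg_proc WHERE proname = 'ld';

-- Only IMMUTABLE functions are accepted in index expressions, so this
-- expression index is possible because of the label chosen above.
CREATE INDEX idx_test_ld ON t_test (ld(id));

-- A condition on the same expression may now be answered from the index.
EXPLAIN SELECT * FROM t_test WHERE ld(id) = 10;

-- The label of an existing function can be changed without redefining it.
ALTER FUNCTION ld(int) IMMUTABLE;
```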
via functions-the-most-widely-ignored-performance-tweak
PostgreSQL 9.2 official documentation: 35.6. Function Volatility Categories
德哥@diag: Thinking PostgreSQL Function's Volatility Categories