查询处理 - 代价估计和计划树

代价估计和计划树

  • 了解代价估计和计划树

简单查询的成本估算

Postgres的查询优化基于成本。成本是无量纲值,这些不是绝对的绩效指标,而是指比较操作相对绩效的指标。
主要有3种成本: start-up, run, total
start-up:获取到第一个目标tuple之前的成本
run:获取到所有tuple的成本
total:start-up和run的总和

postgres=# explain select * from a;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
(1 row)

如上面的例子,cost=0.00..32.60,其中start-up为0.00,total为32.60。

  1. 顺序扫描
    顺序扫描的成功由函数 cost_seqscan()来估算。
    顺序扫描,由于是扫描所有数据页,不需要准备工作,start-up的成本为0,run的成本计算公式为:
    'run cost' = 'cpu run cost' + 'disk run cost'
    =(cpu_tuple_cost + cpu_operator_cost) × Ntuple + seq_page_cost × Npage
    其中seq_page_cost,cpu_tuple_cost,cpu_operator_cost的值在配置文件postgresql.conf中设置,默认值为 1.0, 0.01, and 0.0025。NtupleNpage可在pg_class中查询出来。
postgres=# create table test(id int,col int);
CREATE TABLE
postgres=# insert into test select gene

postgres=# insert into test select generate_series(1,100000),generate_series(1,100000);
INSERT 0 100000
postgres=# analyze;
ANALYZE
postgres=# select relpages, reltuples FROM pg_class WHERE relname='test';
 relpages | reltuples
----------+-----------
      443 |    100000
(1 row)

这里以查询语句 select * from test where id<10000; 为例子来估算顺序扫描的成本,可估算出run cost=(0.01+0.0025) * 100000+1.0 * 443=1693。在命令行查看语句的执行计划:

postgres=# explain select * from test where id<10000;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on test  (cost=0.00..1693.00 rows=9885 width=8)
   Filter: (id < 10000)
(2 rows)
  1. 索引扫描
    索引扫描的成本由函数cost_index()来估算。
// 创建如下表带索引
postgres=# \d+ test
                                   Table "public.test"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           | not null |         | plain   |              |
 col    | integer |           |          |         | plain   |              |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
    "test_idx" btree (col)

以查询语句 select * from test where col < 240; 来估算索引扫描成本。
由查询条件col<240,可走test_idx索引,查询pg_class可得Nindex,page=276,Nindex,tuple=100000。

postgres=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'test_idx';
 relpages | reltuples
----------+-----------
      276 |    100000
(1 row)
  • start-up成本
    索引扫描start-up成本指读取索引页拿到目标表第一个tuple的成本,其估算公式为:
    'start-up cost'={ceil(log2(Nindex,tuple)) + (Hindex + 1) x 50} x cpu_operator_cost
    其中Hindex为索引树的高度。
// 用pageinspect查询索引高度,level为1,高度为1
postgres=# select * from bt_metap('test_idx');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |    3 |     1 |        3 |         1 |           0 |                      -1
(1 row)

cpu_operator_cost默认值为0.0025,则
start-up cost={ceil(log2(100000)) + (1+1) * 50} * 0.0025=0.2925。

  • run成本
    索引扫描的run成本指表和索引cpu成本、IO成本之和:
    'run cost'=('index cpu cost' + 'table cpu cost') + ('index IO cost' + 'table IO cost')
    :仅索引扫描不需估算 'table cpu cost' 和 'table IO cost'。
    其中三个成本估算公式如下:
    'index cpu cost'=Selectivity x Nindex,tuple x (cpu_index_tuple_cost+qual_op_cost),
    'table cpu cost'=Selectivity x Ntuple x cpu_tuple_cost,
    'index IO cost'=ceil(Selectivity x Nindex,page) x random_page_cost
    其中 cpu_index_tuple_cost, cpu_tuple_cost, random_page_cost 在配置文件postgresql.conf的默认值分别为 0.005, 0.01, 4.0。qual_op_cost,粗略的来讲,是评估索引的成本,这里为0.0025。Selectivity 指where子句的索引的搜索范围的比例,它是从0到1的浮点数,如 (Selectivity x Ntuple) 指读取表tuple的数量, (Selectivity x Nindex,page) 指读取索引页的数量。

Selectivity
查询谓词的选择性使用 histogram_bounds 或 MCV(Most Common Value) 来估算,这两者可在pg_stats中查询出来。
表每个字段的MCV存储在pg_stats的most_common_vals 和 most_common_freqs字段,其中:
most_common_vals:指字段 MCV 列表
most_common_freqs:指MCV的频率列表
如查询语句是:select * from test where col < 240;
SELECT most_common_vals, most_common_freqs FROM pg_stats where tablename='test' and attname='col';
可查出表 test 字段 col 值 240 对应的频率,将该频率作为Selectivity值。
如果 MCV 没有查询出结果,则使用 histogram_bounds 来估算。
histogram_bounds:将字段的值分成近似相等的级别的值列表
查看表test字段col的histogram_bounds:

postgres=# SELECT histogram_bounds  FROM pg_stats where tablename='test' and attname='col';

                                                                                                           histogram_bounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {2,1030,2030,3016,3967,4950,5839,6887,7880,8930,10053,11095,12058,13050,14052,15031,16055,17013,18042,19101,20216,21246,22275,23219,24254,25285,26293,27321,28441,29448,30569,31608,32
659,33671,34663,35620,36559,37519,38460,39413,40411,41393,42358,43368,44296,45366,46374,47321,48384,49354,50361,51391,52450,53510,54503,55498,56503,57470,58418,59409,60492,61416,62346
,63312,64390,65286,66356,67311,68350,69363,70277,71322,72283,73306,74362,75403,76297,77265,78300,79236,80238,81261,82321,83317,84335,85287,86225,87188,88206,89199,90201,91152,92060,92
999,94019,94978,96002,97009,97935,99004,99999}
(1 row)

histogram_bounds默认分为100个桶,桶的编号从0开始,histogram_bounds的值是桶的边界值,如第0个桶的histogram_bounds为2,则存储在0号桶的tuple最小值为2,第1个桶的histogram_bounds为1030,则存储在1号桶的tuple最小值为1030,则桶0存储的数据为 2 <= value < 1030。
查询条件 col < 240,240 存储在第0个桶,本例中通过线性插值,可计算:

所以,
index cpu cost = 0.002315175* 100000 * (0.005 + 0.0025) = 1.73638125
table cpu cost = 0.002315175* 100000 * 0.01 = 2.315175
index IO cost = ceil(0.002315175* 276) * 4.0 = 4.0

'table IO cost'估算公式为:
'table IO cost'=max_IO_cost+indexCorrelation2x(min_IO_cost-max_IO_cost)

max_IO_cost是IO成本的最坏情况,即随机扫描所有表页的成本,公式如下 :
max_IO_cost=Npage x random_page_cost
其中Npage=443,
max_IO_cost=443 * 4.0 = 1772

min_IO_cost是IO成本的最佳情况,即顺序扫描所选表页的成本,公式如下:
min_IO_cost=1 x random_page_cost + (ceil(Selectivity x Npage) - 1) x seq_page_cost
则:
min_IO_cost=1 * 4.0 + (ceil(0.002315175* 443) - 1) * 1.0 = 5.0
在本例中,indexCorrelation=1。

postgres=# SELECT tablename,attname, correlation FROM pg_stats WHERE tablename = 'test' and attname='col';
 tablename | attname | correlation
-----------+---------+-------------
 test      | col     |           1
(1 row)

所以,
table IO cost = 1772 + 12 * (5.0 - 1772) = 5.0
最后,
run cost = (1.73638125 + 2.315175) + (4.0 + 5.0) = 13.05155625

  • total成本
    total cost = 0.2925 + 13.05155625 = 13.34405625
    使用explain查询执行计划为:
postgres=# explain select * from test where col < 240;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using test_idx on test  (cost=0.29..13.35 rows=232 width=8)
   Index Cond: (col < 240)
(2 rows)

计划树

PostgreSQL Planner执行三个步骤,如下所示:

  1. 预处理
  2. 通过估算所有可能的访问路径的成本,获得最廉价的访问路径
  3. 从最廉价的路径创建计划树
    访问路径是用于估计成本的处理单元,例如顺序扫描,索引扫描,排序和各种加入操作具有它们对应的路径。 访问路径仅在Planner使用以创建计划树。 访问路径最基本的数据结构是Path,它对应于顺序扫描,所有其他访问路径都基于它。
    要处理上述步骤,Planner内部创建了一个PlannerInfo结构,并保存查询树,查询中包含的relations,访问路径等信息。
// Path 结构
typedef struct Path {
    NodeTag type;

    NodeTag pathtype; /* tag identifying scan/join method */

    RelOptInfo* parent;        /* the relation this path can build */
    ParamPathInfo* param_info; /* parameterization info, or NULL if none */

    /* estimated size/costs for path (see costsize.c for more info) */
    double rows; /* estimated number of global result tuples */
    double multiple;
    Cost startup_cost; /* cost expended before fetching any tuples */
    Cost total_cost;   /* total cost (assuming all tuples fetched) */
    Cost stream_cost;  /* cost of actions invoked by stream but can't be parallelled in this path */

    List* pathkeys;        /* sort ordering of path's output */
    List* distribute_keys; /* distribute key, Var list */
    char locator_type;
    Oid rangelistOid;
    int dop; /* degree of parallelism */
    /* pathkeys is a List of PathKey nodes; see above */
    Distribution distribution;
    int hint_value;       /* Mark this path if be hinted, and hint kind. */
    double innerdistinct; /* join inner rel distinct estimation value */
    double outerdistinct; /* join outer rel distinct estimation value */
} Path;

// PlannerInfo 结构
typedef struct PlannerInfo {
    NodeTag type;

    Query* parse; /* the Query being planned */

    PlannerGlobal* glob; /* global info for current planner run */

    Index query_level; /* 1 at the outermost Query */

    struct PlannerInfo* parent_root; /* NULL at outermost Query */

    /*
     * simple_rel_array holds pointers to "base rels" and "other rels" (see
     * comments for RelOptInfo for more info).  It is indexed by rangetable
     * index (so entry 0 is always wasted).  Entries can be NULL when an RTE
     * does not correspond to a base relation, such as a join RTE or an
     * unreferenced view RTE; or if the RelOptInfo hasn't been made yet.
     */
    struct RelOptInfo** simple_rel_array; /* All 1-rel RelOptInfos */
    int simple_rel_array_size;            /* allocated size of array */

    /*
     * List of changed var that mutated during cost-based rewrite optimization, the
     * element in the list is "struct RewriteVarMapping", for example:
     * - inlist2join
     * - pushjoin2union (will implemented)
     * _ ...
     *
     */
    List* var_mappings;
    Relids var_mapping_rels; /* all the relations that related to inlist2join */

    /*
     * simple_rte_array is the same length as simple_rel_array and holds
     * pointers to the associated rangetable entries.  This lets us avoid
     * rt_fetch(), which can be a bit slow once large inheritance sets have
     * been expanded.
     */
    RangeTblEntry** simple_rte_array; /* rangetable as an array */

    /*
     * all_baserels is a Relids set of all base relids (but not "other"
     * relids) in the query; that is, the Relids identifier of the final join
     * we need to form.
     */
    Relids all_baserels;

    /*
     * join_rel_list is a list of all join-relation RelOptInfos we have
     * considered in this planning run.  For small problems we just scan the
     * list to do lookups, but when there are many join relations we build a
     * hash table for faster lookups.  The hash table is present and valid
     * when join_rel_hash is not NULL.  Note that we still maintain the list
     * even when using the hash table for lookups; this simplifies life for
     * GEQO.
     */
    List* join_rel_list;        /* list of join-relation RelOptInfos */
    struct HTAB* join_rel_hash; /* optional hashtable for join relations */

    /*
     * When doing a dynamic-programming-style join search, join_rel_level[k]
     * is a list of all join-relation RelOptInfos of level k, and
     * join_cur_level is the current level.  New join-relation RelOptInfos are
     * automatically added to the join_rel_level[join_cur_level] list.
     * join_rel_level is NULL if not in use.
     */
    List** join_rel_level; /* lists of join-relation RelOptInfos */
    int join_cur_level;    /* index of list being extended */

    List* init_plans; /* init SubPlans for query */

    List* cte_plan_ids; /* per-CTE-item list of subplan IDs */

    List* eq_classes; /* list of active EquivalenceClasses */

    List* canon_pathkeys; /* list of "canonical" PathKeys */

    List* left_join_clauses; /* list of RestrictInfos for
                              * mergejoinable outer join clauses
                              * w/nonnullable var on left */

    List* right_join_clauses; /* list of RestrictInfos for
                               * mergejoinable outer join clauses
                               * w/nonnullable var on right */

    List* full_join_clauses; /* list of RestrictInfos for
                              * mergejoinable full join clauses */

    List* join_info_list; /* list of SpecialJoinInfos */

    List* lateral_info_list;  /* list of LateralJoinInfos */

    List* append_rel_list; /* list of AppendRelInfos */

    List* rowMarks; /* list of PlanRowMarks */

    List* placeholder_list; /* list of PlaceHolderInfos */

    List* query_pathkeys; /* desired pathkeys for query_planner(), and
                           * actual pathkeys afterwards */

    List* group_pathkeys;    /* groupClause pathkeys, if any */
    List* window_pathkeys;   /* pathkeys of bottom window, if any */
    List* distinct_pathkeys; /* distinctClause pathkeys, if any */
    List* sort_pathkeys;     /* sortClause pathkeys, if any */

    List* minmax_aggs; /* List of MinMaxAggInfos */

    List* initial_rels; /* RelOptInfos we are now trying to join */

    MemoryContext planner_cxt; /* context holding PlannerInfo */

    double total_table_pages; /* # of pages in all tables of query */

    double tuple_fraction; /* tuple_fraction passed to query_planner */
    double limit_tuples;   /* limit_tuples passed to query_planner */

    bool hasInheritedTarget;     /* true if parse->resultRelation is an
                                  * inheritance child rel */
    bool hasJoinRTEs;            /* true if any RTEs are RTE_JOIN kind */
    bool hasLateralRTEs;         /* true if any RTEs are marked LATERAL */
    bool hasHavingQual;          /* true if havingQual was non-null */
    bool hasPseudoConstantQuals; /* true if any RestrictInfo has
                                  * pseudoconstant = true */
    bool hasRecursion;           /* true if planning a recursive WITH item */

    /* Note: qualSecurityLevel is zero if there are no securityQuals */
    Index qualSecurityLevel; /* minimum security_level for quals */

#ifdef PGXC
    /* This field is used only when RemoteScan nodes are involved */
    int rs_alias_index; /* used to build the alias reference */

    /*
     * In Postgres-XC Coordinators are supposed to skip the handling of
     * row marks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE.
     * In order to do that we simply remove such type
     * of row marks from the list rowMarks. Instead they are saved
     * in xc_rowMarks list that is then handeled to add
     * FOR UPDATE/SHARE in the remote query
     */
    List* xc_rowMarks; /* list of PlanRowMarks of type ROW_MARK_EXCLUSIVE & ROW_MARK_SHARE */
#endif

    /* These fields are used only when hasRecursion is true: */
    int wt_param_id;                 /* PARAM_EXEC ID for the work table */
    struct Plan* non_recursive_plan; /* plan for non-recursive term */

    /* These fields are workspace for createplan.c */
    Relids curOuterRels;  /* outer rels above current node */
    List* curOuterParams; /* not-yet-assigned NestLoopParams */

    Index curIteratorParamIndex;
    bool isPartIteratorPlanning;
    int curItrs;
    List* subqueryRestrictInfo; /* Subquery RestrictInfo, which only be used in wondows agg. */

    /* optional private data for join_search_hook, e.g., GEQO */
    void* join_search_private;

    /* Added post-release, will be in a saner place in 9.3: */
    List* plan_params; /* list of PlannerParamItems, see below */

    /* For count_distinct, save null info for group by clause */
    List* join_null_info;
    /* for GroupingFunc fixup in setrefs */
    AttrNumber* grouping_map;

    /* If current query level is correlated with upper level */
    bool is_correlated;

    /* data redistribution for DFS table.
     * dataDestRelIndex is index into the range table. This variable
     * will take effect on data redistribution state.
     * The effective value must be greater than 0.
     */
    Index dataDestRelIndex;

    /* interesting keys of current query level */
    ItstDisKey dis_keys;

    /*
     * indicate if the subquery planning root (PlannerInfo) is under or rooted from
     * recursive-cte planning.
     */
    bool is_under_recursive_cte;

    /*
     * indicate if the subquery planning root (PlannerInfo) is under recursive-cte's
     * recursive branch
     */
    bool is_under_recursive_tree;
    bool has_recursive_correlated_rte; /* true if any RTE correlated with recursive cte */

    int subquery_type;
    Bitmapset *param_upper;
    
    bool hasRownumQual;
} PlannerInfo;
  • 预处理
    创建计划树前,Planner对存储在PlannerInfo structure的查询树执行预处理。
    这里介绍单表查询的主要预处理:
  1. 简化target lists,limit子句等
    例:将'2+2'重写为'4'
  2. 正常化Boolean表达式
    例:'NOT (NOT a)' 重写为 'a'
  3. 平整 AND/OR 表达式
    AND/OR表达式在标准SQL里是二进制运算符,但在PG,它们是n-ary运行符,Planner始终假设所有嵌套的AND,OR表达式平整化。
    如下例子展示将 '(id=1) OR (id=2) OR (id=3)' 平整化。


    flattened OR expression.png
  • 获取最廉价的访问路径
    为了拿到最廉价的访问路径,Planner对所有可能的访问路径进行成本估算,并选择最廉价的访问路径。Planner执行了以下操作:
  1. 创建 RelOptInfo 结构来存储访问路径和对应的成本,RelOptInfo中的baserestrictinfo存储where条件相关信息,indexlist存储目标表存在的索引信息
typedef enum RelOptKind { 
RELOPT_BASEREL, 
RELOPT_JOINREL, 
RELOPT_OTHER_MEMBER_REL, 
RELOPT_DEADREL 
} RelOptKind;

typedef struct RelOptInfo {
    NodeTag type;

    RelOptKind reloptkind;

    /* all relations included in this RelOptInfo */
    Relids relids; /* set of base relids (rangetable indexes) */

    bool isPartitionedTable; /* is it a partitioned table? it is meaningless unless
                         it is a 'baserel' (reloptkind = RELOPT_BASEREL) */
    PartitionFlag partflag;

    /* size estimates generated by planner */
    double rows;           /* estimated number of global result tuples */
    int width;             /* estimated avg width of result tuples */
    int encodedwidth;      /* estimated avg width of encoded columns in result tuples */
    AttrNumber encodednum; /* number of encoded column */

    /* materialization information */
    List* reltargetlist;   /* Vars to be output by scan of relation */
    List* distribute_keys; /* distribute key */
    List* pathlist;        /* Path structures */
    List* ppilist;         /* ParamPathInfos used in pathlist */
    struct Path* cheapest_startup_path;
    List* cheapest_total_path; /* contain all cheapest total paths from different distribute key */
    struct Path* cheapest_unique_path;
    List* cheapest_parameterized_paths;

    /* information about a base rel (not set for join rels!) */
    Index relid;
    Oid reltablespace;   /* containing tablespace */
    RTEKind rtekind;     /* RELATION, SUBQUERY, or FUNCTION */
    AttrNumber min_attr; /* smallest attrno of rel (often <0) */
    AttrNumber max_attr; /* largest attrno of rel */
    Relids* attr_needed; /* array indexed [min_attr .. max_attr] */
    int32* attr_widths;  /* array indexed [min_attr .. max_attr] */
    List* lateral_vars;  /* LATERAL Vars and PHVs referenced by rel */
    Relids lateral_relids; /* minimum parameterization of rel */
    List* indexlist;     /* list of IndexOptInfo */
    RelPageType pages;   /* local size estimates derived from pg_class */
    double tuples;       /* global size estimates derived from pg_class */
    double multiple;     /* how many dn skewed and biased be influenced by distinct. */
    double allvisfrac;

    struct PruningResult* pruning_result; /* pruning result for partitioned table with
                                    baserestrictinfo,it is meaningless unless it
                                    is a 'baserel' (reloptkind = RELOPT_BASEREL) */
    int partItrs;                         /* the number of the partitions in pruning_result */
    struct PruningResult* pruning_result_for_index_usable;
    int partItrs_for_index_usable; /* the number of the partitions in pruning_result_for_seqscan */
    struct PruningResult* pruning_result_for_index_unusable;
    int partItrs_for_index_unusable; /* the number of the partitions in pruning_result_for_seqscan */
    /* information about a partitioned table */
    BucketInfo *bucketInfo;

    /* use "struct Plan" to avoid including plannodes.h here */
    struct Plan* subplan; /* if subquery */
    PlannerInfo* subroot; /* if subquery */
    List *subplan_params; /* if subquery */
    /* use "struct FdwRoutine" to avoid including fdwapi.h here */
    struct FdwRoutine* fdwroutine; /* if foreign table */
    void* fdw_private;             /* if foreign table */

    /* used by various scans and joins: */
    List* baserestrictinfo;          /* RestrictInfo structures (if base
                                      * rel) */
    QualCost baserestrictcost;       /* cost of evaluating the above */
    Index baserestrict_min_security; /* min security_level found in
                                      * baserestrictinfo */
    List* joininfo;                  /* RestrictInfo structures for join clauses
                                      * involving this rel */
    bool has_eclass_joins;           /* T means joininfo is incomplete */
    RelOrientation orientation;      /* the store type of base rel */
    RelstoreType relStoreLocation;   /* the relation store location. */
    char locator_type;               /* the location type of base rel */
    Oid rangelistOid;                /* oid of list/range distributed table, InvalidOid if not list/range table */
    List* subplanrestrictinfo;       /* table filter with correlated column involved */
    ItstDisKey rel_dis_keys;         /* interesting key info for current relation */
    List* varratio;                  /* rel tuples ratio after join to different relation */
    List* varEqRatio;

    /*
     * The alternative rel for cost-based query rewrite
     *
     * Note: Only base rel have valid pointer of this fields, set to NULL for alternative rel
     */
    List* alternatives;

    /*
     * Rel opinter to base rel that in plannerinfo->simple_rel_array[x].
     *
     * Note: Only alternative rels has valid pointer of this field, set to NULL for the
     * origin rel.
     */
    RelOptInfo* base_rel;

    unsigned int num_data_nodes = 0; //number of distributing data nodes
} RelOptInfo;
  1. 估算所有可能的访问路径的成本,将其加入到RelOptInfo 结构
    i. 创建path,估算顺序扫描的成本,并将结果写到path,然后将path增加到RelOptInfo的pathlist里。
    ii. 如果目标表存在索引,则逐个创建索引访问路径path,并估算索引扫描的成本保存到path,再将所有path增加到RelOptInfo的pathlist里。
    iii. 如果支持位图扫描,创建位图扫描访问路径path,并估算位图扫描的成本保存到path,再将所有path增加到RelOptInfo的pathlist里。
  2. 从RelOptInfo的pathlist里获取最廉价的访问路径
  3. 如果需要,估算 LIMIT, ORDER BY, ARREGISFDD 的成本

以下以两个示例来说明Planner的执行。

  • 示例1
    首先创建一张表,不带索引,执行的查询语句带where条件和order by子句。
postgres=# \d+ tbl_1
                                   Table "public.tbl_1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           |          |         | plain   |              |
 data    | integer |           |          |         | plain   |              |

postgres=# select * from tbl_1 where id < 300 order by data;
示例1 - 1.png

(1) 创建RelOptInfo,加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中,另外RelOptInfo的indexlist为空,因为目标表没有索引
(3) 由于语句中有order by子句,增加排序的pathkey到PlannerInfo的sort_pathkeys
(4) 创建顺序扫描的path,并已估算好成本,再增加到RelOptInfo的pathlist
由于该例中目标表没有索引,所以Planner只估算顺序扫描的成本,顺序扫描即为该查询最廉价的访问路径。

示例1 - 2.png

(5) 创建新的RelOptInfo来处理ORDER BY
该RelOptInfo不包含where条件,即baserestrictinfo为空
(6) 创建sort path,估算成本后,加入到新的RelOptInfo的pathlist里。然后,将上面的顺序扫描的path连接到sort path的subpath中。subpath连接最廉价的访问路径。

  • 示例2
    创建表带两个索引,执行的查询语句带where条件
postgres=# \d+ tbl_2
                                   Table "public.tbl_2"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           | not null |         | plain   |              |
 data   | integer |           |          |         | plain   |              |
Indexes:
    "tbl_2_pkey" PRIMARY KEY, btree (id)
    "tbl_2_data_idx" btree (data)

postgres=# select * from tbl_2 where id < 240;
示例2 - 1.png

(1) 创建RelOptInfo,加入到PlannerInfo的simple_rel_array
(2) 将where条件存储到RelOptInfo的baserestrictinfo中,将目标表的索引逐个加入到RelOptInfo的indexlist中
(3) 创建顺序扫描的path,并已估算好成本,再增加到RelOptInfo的pathlist

示例2 - 2.png

(4) 创建索引扫描的访问路径 IndexPath,估算成本后,加入到RelOptInfo的pathlist中
该例中有两个索引,tbl_2_pkey,tbl_2_data_idx,需按顺序处理。首先处理tbl_2_pkey。
这里先创建tbl_2_pkey的indexpath,估算成本后,加入到RelOptInfo的pathlist中。由于tbl_2_pkey关联字段 'id',所以将where条件存储到indexclauses中。
另外,在将访问路径加入到RelOptInfo的pathlist时,会按total成本排序。在本例中,由于索引扫描的成本小于顺序扫描,所以索引扫描的访问路径在顺序扫描前面。
(5) 创建另一个索引的访问路径 IndexPath,估算成本后,加入到RelOptInfo的pathlist中

示例2 - 3.png

(6) 创建新的RelOptInfo
(7) 将最廉价的访问路径加入到新的RelOptInfo的pathlist中
在该例中,最廉价的访问路径是使用tbl_2_pkey索引扫描的访问路径。

  1. 创建计划树
    Planner使用最廉价的访问路径生成计划树。
    计划树的根存储为PlannedStmt结构。
/* ----------------
 *      PlannedStmt node
 *
 * The output of the planner is a Plan tree headed by a PlannedStmt node.
 * PlannedStmt holds the "one time" information needed by the executor.
 * ----------------
 */
typedef struct PlannedStmt {
    NodeTag type;

    CmdType commandType; /* select|insert|update|delete */

    uint64 queryId; /* query identifier,  uniquely indicate this plan in Runtime (copied from Query) */

    bool hasReturning; /* is it insert|update|delete RETURNING? */

    bool hasModifyingCTE; /* has insert|update|delete in WITH? */

    bool canSetTag; /* do I set the command result tag? */

    bool transientPlan; /* redo plan when TransactionXmin changes? */

    bool dependsOnRole; /* is plan specific to current role? */

    Plan* planTree; /* tree of Plan nodes */

    List* rtable; /* list of RangeTblEntry nodes */

    /* rtable indexes of target relations for INSERT/UPDATE/DELETE */
    List* resultRelations; /* integer list of RT indexes, or NIL */

    Node* utilityStmt; /* non-null if this is DECLARE CURSOR */

    List* subplans; /* Plan trees for SubPlan expressions */

    Bitmapset* rewindPlanIDs; /* indices of subplans that require REWIND */

    List* rowMarks; /* a list of PlanRowMark's */

    /*
     * Notice: be careful to use relationOids as it may contain non-table OID
     * in some scenarios, e.g. assignment of relationOids in fix_expr_common.
     */
    List* relationOids; /* contain OIDs of relations the plan depends on */

    List* invalItems; /* other dependencies, as PlanInvalItems */

    int nParamExec; /* number of PARAM_EXEC Params used */

    int num_streams; /* number of stream exist in plan tree */

    int max_push_sql_num; /* number of split sql want push DN execute */

    int gather_count; /* gather_count in query */

    int num_nodes; /* number of data nodes */

    NodeDefinition* nodesDefinition; /* all data nodes' defination */

    int instrument_option; /* used for collect instrument data */

    int num_plannodes; /* how many plan node in this planstmt */

    int query_mem[2]; /* how many memory the query can use ,  memory in kb  */

    int assigned_query_mem[2]; /* how many memory the query is assigned   */

    bool is_dynmaic_smp;

    int dynsmp_max_cpu; /* max avaliable cpu for this dn */

    int dynsmp_avail_cpu; /* max avaliable cpu for this dn */

    int dynsmp_cpu_util;

    int dynsmp_active_statement;

    double dynsmp_query_estimate_cpu_usge;

    int dynsmp_plan_optimal_dop; /* the final optimized dop for the plan */

    int dynsmp_plan_original_dop;

    int dynsmp_dop_mem_limit; /* memory will put a limit on dop */

    int dynsmp_min_non_spill_dop; /* optimal dop cannot greater than this */

    int num_bucketmaps; /* Num of special-bucketmap stored in plannedstmt */

    uint2* bucketMap[MAX_SPECIAL_BUCKETMAP_NUM]; /* the map information need to be get */

    char* query_string; /* convey the query string to backend/stream thread of DataNode for debug purpose */

    List* subplan_ids; /* in which plan id subplan should be inited */

    List* initPlan; /* initplan in top plan node */
    /* data redistribution for DFS table.
     * dataDestRelIndex is index into the range table. This variable
     * will take effect on data redistribution state.
     */
    Index dataDestRelIndex;

    int MaxBloomFilterNum;

    int query_dop; /* Dop of current query. */

    double plannertime; /* planner execute time */

    /* set true in do_query_for_planrouter() for PlannedStmt sent to
     * the compute pool
     */
    bool in_compute_pool;

    /* true if there is/are ForeignScan node(s) of OBS foreign table
     * in plantree.
     */
    bool has_obsrel;

    List* plan_hint_warning; /* hint warning during plan generation, only used in CN */

    List* noanalyze_rellist; /* relations and attributes that have no statistics, only used in CN */

    int ng_num;                     /* nodegroup number */
    NodeGroupQueryMem* ng_queryMem; /* each nodegroup's query mem */
    bool ng_use_planA;              /* true means I am a planA, default false */

    bool isRowTriggerShippable; /* true if all row triggers are shippable. */
    bool is_stream_plan;
    bool multi_node_hint;

    uint64 uniqueSQLId;
} PlannedStmt;

commandType:存储操作类型 select, insert, update, delete
rtable:存储关系列表条目
relationOids:存储查询相关表的oid
plantree:存储由计划节点组成的计划树,每个节点有特定的操作,如顺序扫描,排序,索引扫描

计划节点的基础节点是Plan结构,其他计划结构始终包含它。

typedef struct Plan {
    NodeTag type;

    int plan_node_id;   /* node id */
    int parent_node_id; /* parent node id */
    RemoteQueryExecType exec_type;

    /*
     * estimated execution costs for plan (see costsize.c for more info)
     */
    Cost startup_cost; /* cost expended before fetching any tuples */
    Cost total_cost;   /* total cost (assuming all tuples fetched) */

    /*
     * planner's estimate of result size of this plan step
     */
    double plan_rows; /* number of global rows plan is expected to emit */
    double multiple;
    int plan_width; /* average row width in bytes */
    int dop;        /* degree of parallelism of current plan */

    /*
     * machine learning model estimations
     */
    double pred_rows;
    double pred_startup_time;
    double pred_total_time;
    long pred_max_memory;
    /*
     * MPPDB Recursive-Union Support
     *
     * - @recursive_union_plan_nodeid
     *      Pointing to its belonging RecursiveUnion's plan node id to indate if we are
     *      under RecursiveUnion
     *
     * - @recursive_union_controller
     *      Indicate if current Plan node is controller node in recursive-union steps
     *
     * - @control_plan_nodeid
     *      Normally, set on the top-plan node of a producer thread, to indicate which
     *      control-plan we need syn-up with
     *
     * - @is_sync_planode
     *      Indicate the current producer thread is the sync-up thread in recursive union,
     *      normally set on produer's top plan node
     *
     * Please note the above four variables is meaningless if a plan node is not under
     * recursive-union's recursive part
     */
    /*
     * plan node id of RecursiveUnion node where current plan node belongs to, 0 for
     * not under recursive-union
     */
    int recursive_union_plan_nodeid;

    /* flag to indicate if it is controller plan node */
    bool recursive_union_controller;

    /* plan node id of Controller plan node, 0 for not in control */
    int control_plan_nodeid;

    /* flag indicate if the current plan node is the sync node (for multi-stream case) */
    bool is_sync_plannode;

    /*
     * Common structural data for all Plan types.
     */
    List* targetlist;      /* target list to be computed at this node */
    List* qual;            /* implicitly-ANDed qual conditions */
    struct Plan* lefttree; /* input plan tree(s) */
    struct Plan* righttree;

    bool ispwj;  /* is it special for partitionwisejoin? */
    int paramno; /* the partition'sn that it is scaning */

    List* initPlan;    /* Init Plan nodes (un-correlated expr
                        * subselects) */

    List* distributed_keys; /* distributed on which key */
    ExecNodes* exec_nodes;  /*  List of Datanodes where to execute this plan    */

    /*
     * Information for management of parameter-change-driven rescanning
     *
     * extParam includes the paramIDs of all external PARAM_EXEC params
     * affecting this plan node or its children.  setParam params from the
     * node's initPlans are not included, but their extParams are.
     *
     * allParam includes all the extParam paramIDs, plus the IDs of local
     * params that affect the node (i.e., the setParams of its initplans).
     * These are _all_ the PARAM_EXEC params that affect this node.
     */
    Bitmapset* extParam;
    Bitmapset* allParam;

    // For vectorized engine, plan produce vector output
    //
    bool vec_output;
    /*
     * @hdfs
     * Mark the foreign scan whether has unique results on one of its
     * output columns.
     */
    bool hasUniqueResults;
    /*
     * Mark the plan whether includes delta table or not.
     */
    bool isDeltaTable;

    /* used to replace work_mem, maxmem in [0], and minmem in [1] */
    int operatorMemKB[2];
    /* allowed max mem after spread */
    int operatorMaxMem;

    bool parallel_enabled; /* Is it run in parallel? */
    bool hasHashFilter;    /* true for this plan has a hashfilter */

    List* var_list;        /* Need bloom filter var list. */
    List* filterIndexList; /* Need used bloomfilter array index. */

    /* used to replace work_mem */
    int** ng_operatorMemKBArray; /* for multiple logic cluster */
    int ng_num;
    double innerdistinct; /* join inner rel distinct estimation value */
    double outerdistinct; /* join outer rel distinct estimation value */
} Plan;

startup_cost, total_cost:操作的成本估算
plan_rows:Planner估算会扫描的行数
targetlist:存储查询树中包含的目标对象列表
qual:存储qual条件的列表
lefttree, righttree:添加子节点的节点

  • 示例1
    这里介绍上一节示例1的计划树,上一节示例1最廉价的访问路径树是sort path和scan path的集合,根路径是sort path,子路径是scan path。而计划树则是几乎是从最廉价的路径中生成。由下图(a) 可以看到,SortNode被加入到PlannedStmt的plantree中,而SeqScanNode则添加到SortNode的lefttree中。
    SortNode中,lefttree指向SeqScanNode。
    SeqScanNode中,qual持有where条件。


    计划树.png

    使用EXPLAIN查看执行计划:

postgres=# explain select * from tbl_1 where id < 300 order by data;
                           QUERY PLAN
----------------------------------------------------------------
 Sort  (cost=1704.33..1705.03 rows=279 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_1  (cost=0.00..1693.00 rows=279 width=8)
         Filter: (id < 300)
(4 rows)
  • 示例2
    这里介绍上一节示例2的计划树,示例2最廉价的访问路径树是index path。所以计划树中只有IndexScanNode。该示例中,where条件是访问谓词,存储在IndexScanNode的indexqual中。
    访问谓词:Index Cond,表达叶节点遍历的开始和停止条件。
    使用EXPLAIN查看执行计划:
postgres=# explain select * from tbl_2 where id < 240;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using tbl_2_pkey on tbl_2  (cost=0.29..13.25 rows=226 width=8)
   Index Cond: (id < 240)
(2 rows)
  1. 执行器如何执行
    在单表查询中,执行器以从计划树的末尾到根部的顺序使用计划节点,然后调用执行相应节点处理的函数。
    每个计划节点具有用于执行相应操作的函数,它们位于src/backend/executor/目录中。
    了解执行器如何执行的最佳方法是读取EXPLAIN命令的输出,因为PostgreSQL的EXPLAIN几乎展示了计划树的内容。 如示例1的EXPLAIN输出:
postgres=# explain select * from tbl_1 where id < 300 order by data;
                           QUERY PLAN
----------------------------------------------------------------
 Sort  (cost=1704.33..1705.03 rows=279 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_1  (cost=0.00..1693.00 rows=279 width=8)
         Filter: (id < 300)
(4 rows)

从末尾往上读,首先读到SeqScan节点,执行器执行顺序扫描操作。然后到Sort节点,执行器对顺序扫描的结果执行排序操作。

你可能感兴趣的:(查询处理 - 代价估计和计划树)