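B-tree Indexes in PostgreSQL

Structure

A B-tree index is suitable for data that can be sorted. In other words, the "greater than", "greater than or equal", "less than", "less than or equal", and "equal" operators must be defined for the data type.

Index records of a B-tree are stored in pages. Records in leaf pages contain the indexed data (keys) and references to table rows (TIDs). Each record in an internal page references a child page of the index and holds the minimum value found in that child page.

B-trees have a few important properties:

1. A B-tree is balanced: every leaf page is separated from the root by the same number of internal pages. Searching for any value therefore takes about the same time.

2. A B-tree is multi-branched: each page (usually 8 KB) holds many TIDs. As a result, the tree is shallow; 4 to 5 levels are enough to index a very large number of rows.

3. Data in the index is stored in nondecreasing order (both between pages and within each page), and pages on the same level are connected by a doubly linked list. An ordered data set can therefore be obtained by walking the list, without returning to the root each time.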
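To get a feel for why a handful of levels is enough, here is a rough back-of-the-envelope sketch in Python. The fan-out of 300 entries per page is an illustrative assumption, not a PostgreSQL constant; the real figure depends on key size and fill factor. The function name btree_levels is made up for this sketch.

```python
def btree_levels(num_rows: int, fanout: int) -> int:
    """Number of page levels needed to index num_rows entries, assuming
    every page holds up to `fanout` entries (a simplification)."""
    levels = 1          # a single leaf page, which is also the root
    capacity = fanout   # entries reachable with the current number of levels
    while capacity < num_rows:
        capacity *= fanout  # each extra level multiplies capacity by the fan-out
        levels += 1
    return levels


for rows in (10**6, 10**9):
    print(rows, "rows ->", btree_levels(rows, 300), "levels")
```

Under this assumption, growing the table a thousandfold adds only one level, which is the practical meaning of "the tree is shallow".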

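Below is a simplified example of an index on a single field with integer keys:

The first page of the index is the metapage, which references the root of the index. Internal nodes are located below the root, and leaf pages are in the bottom row. The down arrows represent references from leaf nodes to table rows (TIDs).

Search by equality

Consider searching the tree for the value 49 using a condition of the form "indexed-field = expression".

The root node has three records: (4, 32, 64). The search starts at the root. Since 32 ≤ 49 < 64, we descend to the child node referenced by 32. The same procedure is repeated level by level until we reach a leaf node, where the value 49 is found.

In reality, the search algorithm is not as simple as it looks. For example, a non-unique index may hold many records with the same value, and these records may not fit on a single page. How does the search work then? Returning to the example above, consider the second-level node (32, 43, 49). If we chose 49 and descended into its child node, we would skip the value 49 stored in the preceding leaf page. Therefore, when an equality search finds an exactly matching key in an internal page, it steps one position to the left (to 43 here) and descends from there. On the bottom level it then scans the records from left to right.

(Another complication is that the tree can change while the search is in progress, for example when a page is split.)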
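The "step one position to the left" rule can be sketched in a few lines of Python. This is a toy in-memory model, not PostgreSQL's implementation: the Leaf and Internal classes, the sample keys, and the letter TIDs are all invented for illustration, and concurrent page splits are ignored.

```python
from bisect import bisect_left


class Leaf:
    def __init__(self, entries):
        self.entries = entries                  # sorted (key, tid) pairs
        self.keys = [k for k, _ in entries]
        self.next = None                        # right sibling on the leaf level


class Internal:
    def __init__(self, children):
        self.children = children
        # each separator is the minimum key of the corresponding child
        self.keys = [child.keys[0] for child in children]


def search_equal(root, target):
    """Return the TIDs of all entries whose key equals target."""
    node = root
    while isinstance(node, Internal):
        # Pick the last separator strictly less than target. On an exact
        # match this lands one position to the left, so duplicates at the
        # end of the previous page are not skipped.
        i = max(bisect_left(node.keys, target) - 1, 0)
        node = node.children[i]
    tids = []
    while node is not None:                     # walk the leaf level rightwards
        for key, tid in node.entries:
            if key > target:
                return tids
            if key == target:
                tids.append(tid)
        node = node.next
    return tids


# Three leaf pages; the value 49 spans the second and third page.
l1 = Leaf([(32, "a"), (40, "b")])
l2 = Leaf([(43, "c"), (45, "d"), (49, "e")])
l3 = Leaf([(49, "f"), (49, "g"), (50, "h")])
l1.next, l2.next = l2, l3
root = Internal([l1, l2, l3])                   # separators: 32, 43, 49

print(search_equal(root, 49))
```

Descending through the separator 49 would return only the TIDs "f" and "g"; descending through 43 also finds "e" on the preceding page.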

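Search by inequality

When searching by a condition "indexed-field ≤ expression" (or "indexed-field ≥ expression"), we first locate the value in the index (if it exists) with the equality condition "indexed-field = expression". Having reached the leaf level, we then walk through the leaf pages to the left or to the right.

The figure illustrates this process for n ≤ 35:

The "greater than" and "less than" operators are handled the same way, except that the value found by the equality search must be excluded.

Search by range

When searching by a range "expression1 ≤ indexed-field ≤ expression2", we first find a value with the condition "indexed-field = expression1" and then walk through the leaf pages from left to right for as long as the condition "indexed-field ≤ expression2" holds. Alternatively, we can start from the second expression: locate that value on the leaf level and walk from right to left until the first condition no longer holds.

The figure shows this process for the condition 23 ≤ n ≤ 64: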
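The left-to-right variant of the range search can be sketched as follows. As a simplification, the whole leaf level is modeled as one ascending Python list, and bisect_left stands in for the descent from the root; the key values are invented for illustration and are not taken from the figure.

```python
from bisect import bisect_left


def range_scan(sorted_keys, lo, hi):
    """Yield every key k with lo <= k <= hi from an ascending list."""
    i = bisect_left(sorted_keys, lo)        # plays the role of the tree descent
    while i < len(sorted_keys) and sorted_keys[i] <= hi:
        yield sorted_keys[i]                # walk right while the upper bound holds
        i += 1


leaf_level = [4, 11, 23, 23, 35, 49, 64, 70, 88]
print(list(range_scan(leaf_level, 23, 64)))
```

For a strict lower bound ("indexed-field > expression1"), bisect_right would be used instead of bisect_left so that entries equal to the bound are skipped.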

Example

Let's see what the query plans look like. We use the aircrafts table of the demo database. It contains only nine rows, and since the whole table fits on a single page, the planner would normally not use an index. The table is small enough, however, to make a convenient illustration.

demo=# select * from aircrafts;
 aircraft_code |        model        | range
---------------+---------------------+-------
 773           | Boeing 777-300      | 11100
 763           | Boeing 767-300      |  7900
 SU9           | Sukhoi SuperJet-100 |  3000
 320           | Airbus A320-200     |  5700
 321           | Airbus A321-200     |  5600
 319           | Airbus A319-100     |  6700
 733           | Boeing 737-300      |  4200
 CN1           | Cessna 208 Caravan  |  1200
 CR2           | Bombardier CRJ-200  |  2700
(9 rows)

demo=# create index on aircrafts(range);
demo=# set enable_seqscan = off;

(More precisely: create index on aircrafts using btree(range). A B-tree is what gets built by default when an index is created.)

Execution plan for a search by equality:

demo=# explain(costs off) select * from aircrafts where range = 3000;
                    QUERY PLAN
---------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: (range = 3000)
(2 rows)

Execution plan for a search by inequality:

demo=# explain(costs off) select * from aircrafts where range < 3000;
                    QUERY PLAN
---------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: (range < 3000)
(2 rows)

Execution plan for a search by range:

demo=# explain(costs off) select * from aircrafts
where range between 3000 and 5000;
                     QUERY PLAN
-----------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: ((range >= 3000) AND (range <= 5000))
(2 rows)

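Sorting

To emphasize the point once more: with any kind of scan (index, index-only, or bitmap), the btree access method returns data in sorted order. Therefore, if a table has an index on the sort condition, the optimizer considers two options: an index scan of the table, which returns the rows already sorted, and a sequential scan of the table followed by a sort of the result.

Sort order

When creating an index, we can specify the sort order explicitly. For example, this creates an index on the range column in descending order:

demo=# create index on aircrafts(range desc);

In this case, larger values appear on the left of the tree and smaller values on the right. Why would this be needed? The purpose is multicolumn indexes. Let's create a view over the aircrafts table that divides aircraft into three classes by range: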

demo=# create view aircrafts_v as
select model,
       case
           when range < 4000 then 1
           when range < 10000 then 2
           else 3
       end as class
from aircrafts;

demo=# select * from aircrafts_v;
        model        | class
---------------------+-------
 Boeing 777-300      |     3
 Boeing 767-300      |     2
 Sukhoi SuperJet-100 |     1
 Airbus A320-200     |     2
 Airbus A321-200     |     2
 Airbus A319-100     |     2
 Boeing 737-300      |     2
 Cessna 208 Caravan  |     1
 Bombardier CRJ-200  |     1
(9 rows)

Now let's create an index on the same expression:

demo=# create index on aircrafts(
  (case when range < 4000 then 1 when range < 10000 then 2 else 3 end),
  model);

We can now use this index to get the data sorted by both columns in ascending order:

demo=# select class, model from aircrafts_v order by class, model;
 class |        model
-------+---------------------
     1 | Bombardier CRJ-200
     1 | Cessna 208 Caravan
     1 | Sukhoi SuperJet-100
     2 | Airbus A319-100
     2 | Airbus A320-200
     2 | Airbus A321-200
     2 | Boeing 737-300
     2 | Boeing 767-300
     3 | Boeing 777-300
(9 rows)

demo=# explain(costs off)
select class, model from aircrafts_v order by class, model;
                       QUERY PLAN
--------------------------------------------------------
 Index Scan using aircrafts_case_model_idx on aircrafts
(1 row)

Similarly, we can get the data sorted in descending order:

demo=# select class, model from aircrafts_v order by class desc, model desc;
 class |        model
-------+---------------------
     3 | Boeing 777-300
     2 | Boeing 767-300
     2 | Boeing 737-300
     2 | Airbus A321-200
     2 | Airbus A320-200
     2 | Airbus A319-100
     1 | Sukhoi SuperJet-100
     1 | Cessna 208 Caravan
     1 | Bombardier CRJ-200
(9 rows)

demo=# explain(costs off)
select class, model from aircrafts_v order by class desc, model desc;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan BACKWARD using aircrafts_case_model_idx on aircrafts
(1 row)

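However, this index cannot be used to get the data sorted by one column in ascending order and by the other in descending order. A separate sort is required:

demo=# explain(costs off)
select class, model from aircrafts_v order by class ASC, model DESC;
                   QUERY PLAN
-------------------------------------------------
 Sort
   Sort Key: (CASE ... END), aircrafts.model DESC
   ->  Seq Scan on aircrafts
(3 rows)

(Note that the planner ended up choosing a sequential scan despite the earlier setting enable_seqscan = off. This setting does not forbid table scans; it only assigns them a very high cost. To see this, look at the plan with costs on.)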
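The reason one index serves "ASC, ASC" and "DESC, DESC" but not "ASC, DESC" can be reproduced with plain Python lists. In this sketch a sorted list stands in for the leaf level of the index; the rows are the (class, model) pairs shown by the view above, and the variable names are made up.

```python
rows = [
    (3, "Boeing 777-300"), (2, "Boeing 767-300"), (1, "Sukhoi SuperJet-100"),
    (2, "Airbus A320-200"), (2, "Airbus A321-200"), (2, "Airbus A319-100"),
    (2, "Boeing 737-300"), (1, "Cessna 208 Caravan"), (1, "Bombardier CRJ-200"),
]

# Index on (class, model), both ascending: entries are stored in this order.
forward = sorted(rows)          # what a forward index scan returns
backward = forward[::-1]        # what a backward index scan returns

# ORDER BY class ASC, model DESC: sort by model descending first, then a
# stable sort by class keeps that model order inside each class.
wanted = sorted(sorted(rows, key=lambda r: r[1], reverse=True),
                key=lambda r: r[0])

print(wanted == forward, wanted == backward)   # neither direction matches
```

An index declared with the matching directions stores its entries in exactly the "wanted" order, so a forward scan of that index needs no extra sort.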

For this query to use an index, the index must be created with the required sort directions:

demo=# create index aircrafts_case_asc_model_desc_idx on aircrafts(
  (case
     when range < 4000 then 1
     when range < 10000 then 2
     else 3
   end) ASC,
  model DESC);

demo=# explain(costs off)
select class, model from aircrafts_v order by class ASC, model DESC;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using aircrafts_case_asc_model_desc_idx on aircrafts
(1 row)

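Order of columns

Another issue that arises with multicolumn indexes is the order in which the columns are listed. For a B-tree this order is critical: data inside pages is sorted by the first field, then by the second, and so on.

The figure below shows the index built on the range class and the model:

Of course, an index this small fits on a single root page. It is deliberately spread over several pages here for clarity.

The figure makes it clear that a search by predicates such as "class = 3" (search by the first field only) or "class = 3 and model = 'Boeing 777-300'" (search by both fields) is efficient.

A search by the predicate "model = 'Boeing 777-300'", however, is far less efficient: starting from the root, we cannot tell which child node to descend into, so we have to descend into all of them. This does not mean that such an index can never be used; the question is one of efficiency. For example, if there were three classes of aircraft and a great many models in each class, we would have to look through about one third of the index, which might still be cheaper than a full table scan.

But if we create an index like this:

demo=# create index on aircrafts(
  model,
  (case when range < 4000 then 1 when range < 10000 then 2 else 3 end));

the order of the fields changes:

With this index, a search by the predicate "model = 'Boeing 777-300'" is efficient, but a search by "class = 3" is not.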
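The effect of column order can be illustrated with a sorted list of (class, model) tuples standing in for the index entries. This is only a sketch under that assumption; the helper names by_class and by_model are invented.

```python
from bisect import bisect_left

entries = sorted([
    (3, "Boeing 777-300"), (2, "Boeing 767-300"), (1, "Sukhoi SuperJet-100"),
    (2, "Airbus A320-200"), (2, "Airbus A321-200"), (2, "Airbus A319-100"),
    (2, "Boeing 737-300"), (1, "Cessna 208 Caravan"), (1, "Bombardier CRJ-200"),
])


def by_class(entries, cls):
    """Leading column: binary search finds the contiguous run of matches."""
    lo = bisect_left(entries, (cls,))       # (cls,) sorts before any (cls, model)
    hi = bisect_left(entries, (cls + 1,))
    return entries[lo:hi]


def by_model(entries, model):
    """Second column only: matches may be anywhere, so every entry is checked."""
    return [e for e in entries if e[1] == model]


print(by_class(entries, 3))
print(by_model(entries, "Boeing 777-300"))
```

Both calls return the same entry here, but by_class inspects only a logarithmic number of entries while by_model has to read all of them. Swapping the tuple order to (model, class) reverses the roles of the two predicates.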

NULLs

The btree access method indexes NULLs and supports searching by the conditions IS NULL and IS NOT NULL.

Consider the flights table, where NULLs occur:

demo=# create index on flights(actual_arrival);

demo=# explain(costs off) select * from flights where actual_arrival is null;
                      QUERY PLAN
-------------------------------------------------------
 Bitmap Heap Scan on flights
   Recheck Cond: (actual_arrival IS NULL)
   ->  Bitmap Index Scan on flights_actual_arrival_idx
         Index Cond: (actual_arrival IS NULL)
(4 rows)

NULLs are located at one end or the other of the leaf level, depending on how the index was created (NULLS FIRST or NULLS LAST). This matters when a query involves sorting: the index can be used if the ORDER BY clause of the SELECT statement specifies the same placement of NULLs as the one the index was built with (NULLS FIRST or NULLS LAST).

In the following example the orders are the same, so the index can be used:

demo=# explain(costs off)
select * from flights order by actual_arrival NULLS LAST;
                       QUERY PLAN
--------------------------------------------------------
 Index Scan using flights_actual_arrival_idx on flights
(1 row)

In the next example the orders differ, so the optimizer chooses a sequential scan followed by a sort:

demo=# explain(costs off)
select * from flights order by actual_arrival NULLS FIRST;
               QUERY PLAN
----------------------------------------
 Sort
   Sort Key: actual_arrival NULLS FIRST
   ->  Seq Scan on flights
(3 rows)

For an index to be used here, it must be created with NULLs at the beginning:

demo=# create index flights_nulls_first_idx on flights(actual_arrival NULLS FIRST);

demo=# explain(costs off)
select * from flights order by actual_arrival NULLS FIRST;
                     QUERY PLAN
-----------------------------------------------------
 Index Scan using flights_nulls_first_idx on flights
(1 row)

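Issues like this arise because NULLs are not sortable: the result of comparing NULL with any other value is undefined.

demo=# \pset null NULL

demo=# select null < 42;
 ?column?
----------
 NULL
(1 row)

This runs counter to the concept of a B-tree and does not fit the general pattern. NULLs, however, play such an important role in databases that special provisions have to be made for them.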
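The special placement of NULLs can be mimicked in Python, where None plays the role of NULL and a sorted list plays the role of the leaf level. This is only an analogy; the sample values are invented.

```python
values = [5, None, 3, None, 8]

# ASC NULLS LAST: non-NULL values in ascending order, NULLs at the right end.
nulls_last = sorted(values, key=lambda v: (v is None, 0 if v is None else v))

# ASC NULLS FIRST: NULLs at the left end, then ascending values.
nulls_first = sorted(values, key=lambda v: (v is not None, 0 if v is None else v))

print(nulls_last)         # [3, 5, 8, None, None]
print(nulls_first)        # [None, None, 3, 5, 8]
print(nulls_last[::-1])   # [None, None, 8, 5, 3]
```

Reading the NULLS LAST order backward does put the NULLs first, but it also reverses the values. That is why an ascending scan with NULLS FIRST needs an index built that way rather than a backward scan of the default one.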

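Since NULLs can be indexed, an index can be used even when no conditions at all are imposed on the table, because the index is certain to contain information on every table row. This may make sense if the query requires ordered data and the index guarantees the needed order. In that case the planner may prefer index access in order to save on a separate sort.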

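Properties

Let's look at the properties of the btree access method.

 amname |     name      | pg_indexam_has_property
--------+---------------+-------------------------
 btree  | can_order     | t
 btree  | can_unique    | t
 btree  | can_multi_col | t
 btree  | can_exclude   | t

As we can see, a B-tree can order data and supports uniqueness. Multicolumn indexes are supported as well, although other access methods can also support such indexes. We will discuss EXCLUDE constraints next time.

     name      | pg_index_has_property
---------------+-----------------------
 clusterable   | t
 index_scan    | t
 bitmap_scan   | t
 backward_scan | t

The btree access method supports both ways of fetching values: index scan and bitmap scan. And as we have seen, the tree can be walked both forward and backward.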

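        name        | pg_index_column_has_property
--------------------+------------------------------
 asc                | t
 desc               | f
 nulls_first        | f
 nulls_last         | t
 orderable          | t
 distance_orderable | f
 returnable         | t
 search_array       | t
 search_nulls       | t

The first four properties describe exactly how the values of a particular column are ordered. In this example, values are sorted in ascending order (asc) and NULLs come last (nulls_last). Other combinations are possible too.

The search_array property indicates support for expressions like this one:

demo=# explain(costs off)
select * from aircrafts where aircraft_code in ('733','763','773');
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using aircrafts_pkey on aircrafts
   Index Cond: (aircraft_code = ANY ('{733,763,773}'::bpchar[]))
(2 rows)

The returnable property indicates support for index-only scans, which is reasonable because index rows store the indexed values themselves. This is a good place to say a few words about covering indexes based on B-trees.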

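Unique indexes with additional columns

As discussed earlier, a covering index contains all the values a query needs, so the table itself does not have to be accessed. A unique index can be a covering index too.

But suppose we add the columns a query needs to a unique index. The new composite key may no longer be unique, and then two indexes on the same column are required: a unique one to support the integrity constraint and a non-unique one to serve as the covering index. This is certainly inefficient.

At our company, Anastasiya Lubennikova (@lubennikovaav) improved the btree method so that additional, non-unique columns can be included in a unique index. We hoped that the community would adopt this patch, and it was in fact committed in PostgreSQL 11.

Consider the bookings table: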

demo=# \d bookings
              Table "bookings.bookings"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 book_ref     | character(6)             | not null
 book_date    | timestamp with time zone | not null
 total_amount | numeric(10,2)            | not null
Indexes:
    "bookings_pkey" PRIMARY KEY, btree (book_ref)
Referenced by:
    TABLE "tickets" CONSTRAINT "tickets_book_ref_fkey" FOREIGN KEY (book_ref) REFERENCES bookings(book_ref)

In this table, the primary key (book_ref, the booking code) is backed by a regular btree index. Let's create a new unique index with an additional column:

demo=# create unique index bookings_pkey2 on bookings(book_ref) INCLUDE (book_date);

Now we replace the existing index with the new one:

demo=# begin;
demo=# alter table bookings drop constraint bookings_pkey cascade;
demo=# alter table bookings add primary key using index bookings_pkey2;
demo=# alter table tickets add foreign key (book_ref) references bookings (book_ref);
demo=# commit;

This is the resulting table structure:

demo=# \d bookings
              Table "bookings.bookings"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 book_ref     | character(6)             | not null
 book_date    | timestamp with time zone | not null
 total_amount | numeric(10,2)            | not null
Indexes:
    "bookings_pkey2" PRIMARY KEY, btree (book_ref) INCLUDE (book_date)
Referenced by:
    TABLE "tickets" CONSTRAINT "tickets_book_ref_fkey" FOREIGN KEY (book_ref) REFERENCES bookings(book_ref)

Now one and the same index works both as a unique index and as a covering index:

demo=# explain(costs off)
select book_ref, book_date from bookings where book_ref = '059FC4';
                    QUERY PLAN
--------------------------------------------------
 Index Only Scan using bookings_pkey2 on bookings
   Index Cond: (book_ref = '059FC4'::bpchar)
(2 rows)

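Creation of the index

It is well known that a large table is best loaded without indexes, with the indexes created afterwards. This is not only faster; the resulting index is likely to be smaller as well.

The reason is that building a btree index uses a more efficient process than inserting values into the tree row by row. Roughly speaking, all the data in the table is sorted and the leaf pages are created from it. Internal pages are then built on top of this base, level by level, until the whole structure converges to the root.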
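The bottom-up build described above can be sketched as follows. This is a toy model, not PostgreSQL's actual build code: pages are plain Python lists, page_size is an arbitrary small number chosen so the result is easy to read, and an internal page stores just the minimum key of each child.

```python
def build_bottom_up(keys, page_size):
    """Return the tree as a list of levels: leaf pages first, root level last."""
    data = sorted(keys)                                  # one big sort up front
    level = [data[i:i + page_size] for i in range(0, len(data), page_size)]
    levels = [level]
    while len(level) > 1:
        minima = [page[0] for page in level]             # one separator per child page
        level = [minima[i:i + page_size] for i in range(0, len(minima), page_size)]
        levels.append(level)                             # build the next level up
    return levels


levels = build_bottom_up(range(1, 28), page_size=3)
for depth, pages in enumerate(reversed(levels)):
    print("level", depth, pages)
```

Every page is written once and filled completely, whereas row-by-row insertion descends the tree for each key and splits pages along the way, typically leaving them partly empty.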

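The speed of this process depends on the amount of RAM available, which is limited by the maintenance_work_mem parameter, so increasing this value can speed up the build. For unique indexes, memory of size work_mem is allocated in addition to maintenance_work_mem.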

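Comparison semantics

Earlier we mentioned that PostgreSQL needs to know which hash functions to call for values of different types, and that this association is stored with the hash access method. Likewise, the system must figure out how to order values. This is needed for sorting, grouping (sometimes), merge joins, and so on. PostgreSQL does not bind itself to operator names (such as >, <, =), because users can define their own data types and give the corresponding operators different names. Instead, the operators to use are listed in the operator family used by the btree access method.

For example, these are the comparison operators in the bool_ops operator family:

postgres=# select amop.amopopr::regoperator as opfamily_operator,
                  amop.amopstrategy
from pg_am am,
     pg_opfamily opf,
     pg_amop amop
where opf.opfmethod = am.oid
and amop.amopfamily = opf.oid
and am.amname = 'btree'
and opf.opfname = 'bool_ops'
order by amopstrategy;
  opfamily_operator  | amopstrategy
---------------------+--------------
 <(boolean,boolean)  |            1
 <=(boolean,boolean) |            2
 =(boolean,boolean)  |            3
 >=(boolean,boolean) |            4
 >(boolean,boolean)  |            5
(5 rows)

Here we see five comparison operators, but as noted, we should not rely on their names. To tell which comparison each operator performs, the concept of a strategy is introduced. Five strategies are defined to describe operator semantics: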

1 — less

2 — less or equal

3 — equal

4 — greater or equal

5 — greater

An operator family can contain several operators that implement the same strategy. For example, the integer_ops operator family contains the following operators for strategy 1:

postgres=# select amop.amopopr::regoperator as opfamily_operator
from pg_am am,
     pg_opfamily opf,
     pg_amop amop
where opf.opfmethod = am.oid
and amop.amopfamily = opf.oid
and am.amname = 'btree'
and opf.opfname = 'integer_ops'
and amop.amopstrategy = 1
order by opfamily_operator;
  opfamily_operator
----------------------
 <(integer,bigint)
 <(smallint,smallint)
 <(integer,integer)
 <(bigint,bigint)
 <(bigint,integer)
 <(smallint,integer)
 <(integer,smallint)
 <(smallint,bigint)
 <(bigint,smallint)
(9 rows)

Thanks to this, the optimizer can avoid type casts when comparing values of different types that belong to the same operator family.

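Index support for a new data type

The documentation provides an example of creating a new data type for complex numbers together with an operator class for sorting values of this type. That example is written in C, but nothing prevents us from running the same experiment in pure SQL.

Let's create a new composite type with two fields, the real and imaginary parts:

postgres=# create type complex as (re float, im float);

Then we create a table with a field of the new type and add a few values:

postgres=# create table numbers(x complex);

postgres=# insert into numbers values ((0.0, 10.0)), ((1.0, 3.0)), ((1.0, 1.0));

Now a question arises: how can complex numbers be ordered if no order relation is defined for them in the mathematical sense?

As it turns out, comparison operators are already defined:

postgres=# select * from numbers order by x;
   x
--------
 (0,10)
 (1,1)
 (1,3)
(3 rows)

By default, a composite type is sorted field by field: the first fields are compared, then the second, and so on, much like text strings are compared. But we can define a different order. For example, complex numbers can be treated as vectors and ordered by modulus (length). To define such an order, we first need an auxiliary function that computes the modulus: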

postgres=# create function modulus(a complex) returns float as $$
  select sqrt(a.re*a.re + a.im*a.im);
$$ immutable language sql;

Using this auxiliary function, we systematically define a function for each of the five comparison operators:

postgres=# create function complex_lt(a complex, b complex) returns boolean as $$
  select modulus(a) < modulus(b);
$$ immutable language sql;

postgres=# create function complex_le(a complex, b complex) returns boolean as $$
  select modulus(a) <= modulus(b);
$$ immutable language sql;

postgres=# create function complex_eq(a complex, b complex) returns boolean as $$
  select modulus(a) = modulus(b);
$$ immutable language sql;

postgres=# create function complex_ge(a complex, b complex) returns boolean as $$
  select modulus(a) >= modulus(b);
$$ immutable language sql;

postgres=# create function complex_gt(a complex, b complex) returns boolean as $$
  select modulus(a) > modulus(b);
$$ immutable language sql;

Then we create the corresponding operators:

postgres=# create operator #<#(leftarg=complex, rightarg=complex, procedure=complex_lt);

postgres=# create operator #<=#(leftarg=complex, rightarg=complex, procedure=complex_le);

postgres=# create operator #=#(leftarg=complex, rightarg=complex, procedure=complex_eq);

postgres=# create operator #>=#(leftarg=complex, rightarg=complex, procedure=complex_ge);

postgres=# create operator #>#(leftarg=complex, rightarg=complex, procedure=complex_gt);

At this point we can compare numbers:

postgres=# select (1.0,1.0)::complex #<# (1.0,3.0)::complex;
 ?column?
----------
 t
(1 row)

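In addition to these five operators, the btree access method requires one more function to be defined: it must return -1, 0, or 1 when the first value is less than, equal to, or greater than the second. Other access methods may require other support functions.

postgres=# create function complex_cmp(a complex, b complex) returns integer as $$
  select case when modulus(a) < modulus(b) then -1
              when modulus(a) > modulus(b) then 1
              else 0
         end;
$$ language sql;

Now we can create an operator class (an operator family with the same name is created automatically):

postgres=# create operator class complex_ops
default for type complex
using btree as
    operator 1 #<#,
    operator 2 #<=#,
    operator 3 #=#,
    operator 4 #>=#,
    operator 5 #>#,
    function 1 complex_cmp(complex, complex);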

Sorting now works as intended:

postgres=# select * from numbers order by x;
   x
--------
 (1,1)
 (1,3)
 (0,10)
(3 rows)
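The difference between the two orderings can be reproduced outside the database. The following Python sketch mirrors the SQL function complex_cmp with a three-way comparison function of the same name; tuples stand in for the composite type, and math.hypot computes the modulus.

```python
from functools import cmp_to_key
from math import hypot


def complex_cmp(a, b):
    """Three-way comparison by modulus: -1, 0, or 1, like the SQL support function."""
    ma, mb = hypot(*a), hypot(*b)
    if ma < mb:
        return -1
    if ma > mb:
        return 1
    return 0


numbers = [(0.0, 10.0), (1.0, 3.0), (1.0, 1.0)]

default_order = sorted(numbers)                               # field by field
modulus_order = sorted(numbers, key=cmp_to_key(complex_cmp))  # by modulus

print(default_order)   # [(0.0, 10.0), (1.0, 1.0), (1.0, 3.0)]
print(modulus_order)   # [(1.0, 1.0), (1.0, 3.0), (0.0, 10.0)]
```

A single three-way function is enough to derive all five comparisons, which is why the btree method asks for it in addition to the operators.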

The support functions can be listed with this query:

postgres=# select amp.amprocnum,
                  amp.amproc,
                  amp.amproclefttype::regtype,
                  amp.amprocrighttype::regtype
from pg_opfamily opf,
     pg_am am,
     pg_amproc amp
where opf.opfname = 'complex_ops'
and opf.opfmethod = am.oid
and am.amname = 'btree'
and amp.amprocfamily = opf.oid;
 amprocnum |   amproc    | amproclefttype | amprocrighttype
-----------+-------------+----------------+-----------------
         1 | complex_cmp | complex        | complex
(1 row)

Internals

We can explore the internal structure of a B-tree using the pageinspect extension:

demo=# create extension pageinspect;

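The index metapage:

demo=# select * from bt_metap('ticket_flights_pkey');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  164 |     2 |      164 |         2
(1 row)

The most interesting item here is the index level: for a table with one million rows, the index needs only 2 levels, not counting the root.

Statistics for block 164, the root page:

demo=# select type, live_items, dead_items, avg_item_size, page_size, free_size
from bt_page_stats('ticket_flights_pkey',164);
 type | live_items | dead_items | avg_item_size | page_size | free_size
------+------------+------------+---------------+-----------+-----------
 r    |         33 |          0 |            31 |      8192 |      6984
(1 row)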

And the data in this page:

demo=# select itemoffset, ctid, itemlen, left(data,56) as data
from bt_page_items('ticket_flights_pkey',164) limit 5;
 itemoffset |  ctid   | itemlen |                           data
------------+---------+---------+----------------------------------------------------------
          1 | (3,1)   |       8 |
          2 | (163,1) |      32 | 1d 30 30 30 35 34 33 32 33 30 35 37 37 31 00 00 ff 5f 00
          3 | (323,1) |      32 | 1d 30 30 30 35 34 33 32 34 32 33 36 36 32 00 00 4f 78 00
          4 | (482,1) |      32 | 1d 30 30 30 35 34 33 32 35 33 30 38 39 33 00 00 4d 1e 00
          5 | (641,1) |      32 | 1d 30 30 30 35 34 33 32 36 35 35 37 38 35 00 00 2b 09 00
(5 rows)

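The first element is a service entry that carries no key data; the actual keys start with the second element. On most pages the first element stores the upper bound of the page, but the root is the rightmost page of its level and has no such bound, so here the first element is the key-less reference to the leftmost child node, block 3. The next child nodes are blocks 163, 323, and so on. They, in turn, can be explored with the same functions.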
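The hexadecimal dump in the data column can be partly decoded by hand. The sketch below makes two assumptions that the output above does not confirm by itself: that the key starts with a string stored with a one-byte length header (low bit set, remaining seven bits holding the total length, as on little-endian builds), and that the string is plain ASCII. The function name decode_short_text is invented for this sketch.

```python
def decode_short_text(hex_dump: str) -> str:
    """Decode the leading short string from a bt_page_items data dump."""
    raw = bytes.fromhex(hex_dump)      # fromhex skips the spaces between bytes
    header = raw[0]
    # Assumption: a set low bit marks a 1-byte length header; the other
    # 7 bits give the total length, including the header byte itself.
    if header & 0x01 == 0:
        raise ValueError("not a 1-byte length header")
    total_len = header >> 1
    return raw[1:total_len].decode("ascii")


print(decode_short_text("1d 30 30 30 35 34 33 32 33 30 35 37 37 31 00 00 ff 5f 00"))
print(decode_short_text("1d 30 30 30 35 34 33 32 34 32 33 36 36 32 00 00 4f 78 00"))
```

Under these assumptions, each dump starts with a 13-character digit string, presumably the first column of the key. The bytes that follow are padding and the beginning of the second key column, which left(data,56) cuts short.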

PostgreSQL 10 added the amcheck extension, which checks the logical consistency of data in B-trees and helps detect faults in advance.

Original article

https://habr.com/en/company/postgrespro/blog/443284/
