CHAPTER 7 Designing and Tuning the Indexes
第七章:设计和调整索引
Clustered Index Design Considerations
聚集索引设计注意事项
Uniqueidentifier:唯一标识符
Every time you change the value of a clustered index key, two things happen. First, SQL Server moves the row to a different place in the clustered index page chain and in the data files. Second, it updates the row-id, which is the clustered index key. The row-id is stored and needs to be updated in all nonclustered indexes. That can be expensive in terms of I/O, especially in the case of batch updates. Moreover, it can increase the fragmentation of the clustered index and, in cases of row-id size increase, of the nonclustered indexes. Thus, it is better to have a static clustered index where key values do not change.
每次更改聚集索引键的值时,都会发生两件事。 首先,SQL Server将行移动到聚集索引页链和数据文件中的不同位置。 其次,它更新了row-id,它是聚集索引键。 存储了行id,需要在所有非聚集索引中更新。 就I / O而言,这可能是昂贵的,特别是在批量更新的情况下。 此外,它可以增加聚集索引的碎片,并且在行ID大小增加的情况下,可以增加非聚集索引的碎片。 因此,最好有一个静态聚集索引,其中键值不会改变。
All nonclustered indexes use a clustered index key as the row-id. A too-wide clustered index key increases the size of nonclustered index rows and requires more space to store them. As a result, SQL Server needs to process more data pages during index- or range-scan operations, which makes the index less efficient.
所有非聚集索引都使用聚集索引键作为row-id。 过宽的聚集索引键会增加非聚集索引行的大小,并且需要更多空间来存储它们。 因此,SQL Server需要在索引或范围扫描操作期间处理更多数据页,这会降低索引的效率。
In cases of non-unique nonclustered indexes, the row-id is also stored at non-leaf index levels, which, in turn, reduces the number of index records per page and can lead to extra intermediate levels in the index. Even though non-leaf index levels are usually cached in memory, this introduces additional logical reads every time SQL Server traverses the nonclustered index B-Tree.
在非唯一非聚集索引的情况下,row-id也存储在非叶索引级别,这反过来会减少每页索引记录的数量,并可能导致索引中的额外中间级别。 尽管非叶索引级别通常缓存在内存中,但每次SQL Server遍历非聚集索引B-Tree时,这都会引入额外的逻辑读取。
Finally, larger nonclustered indexes use more space in the buffer pool and introduce more overhead during index maintenance. Obviously, it is impossible to provide a generic threshold value that defines the maximum acceptable size of a key that can be applied to any table. However, as a general rule, it is better to have a narrow clustered index key, with the index key as small as possible.
最后,较大的非聚集索引在缓冲池中占用更多空间,并在索引维护期间引入更多开销。 显然,不可能提供一个通用阈值来定义可应用于任何表的密钥的最大可接受大小。 但是,作为一般规则,最好使用窄聚集索引键,索引键尽可能小。
It is also beneficial to have the clustered index be defined as unique. The reason this is important is not obvious. Consider a scenario in which a table does not have a unique clustered index and you want to run a query that uses a nonclustered index seek in the execution plan. In this case, if the row-id in the nonclustered index were not unique, SQL Server would not know what clustered index row to choose during the key lookup operation.
将聚集索引定义为唯一的也是有益的。 这很重要的原因并不明显。 考虑一种情况,其中表没有唯一的聚集索引,并且您希望在执行计划中运行使用非聚集索引查找的查询。 在这种情况下,如果非聚集索引中的row-id不是唯一的,则SQL Server将不知道在键查找操作期间要选择哪个聚集索引行。
SQL Server solves such problems by adding another nullable integer column called uniquifier to nonunique clustered indexes. SQL Server populates uniquifiers with NULL for the first occurrence of the key value, autoincrementing it for each subsequent duplicate inserted into the table.
SQL Server通过向非唯一聚集索引添加另一个名为uniquifier的可为空的整数列来解决此类问题。 对于第一次出现的键值,SQL Server使用NULL填充uniquifier,为插入表中的每个后续副本自动增加它。
■ Note The number of possible duplicates per clustered index key value is limited by integer domain values. You cannot have more than 2,147,483,648 rows with the same clustered index key. This is a theoretical limit, and it is clearly a bad idea to create indexes with such poor selectivity.
■注意每个聚集索引键值的可能重复项数量受整数域值的限制。 具有相同聚集索引键的行不能超过2,147,483,648。 这是理论上的限制,创建具有如此差的选择性的索引显然是个坏主意。
Let’s look at the overhead introduced by uniquifiers in non-unique clustered indexes. The code shown in Listing 7-1 creates three different tables of the same structure and populates them with 65,536 rows each. Table dbo.UniqueCI is the only table with a unique clustered index defined. Table dbo.NonUniqueCINoDups does not have any duplicated key values. Finally, table dbo.NonUniqueCDups has a large number of duplicates in the index.
让我们看看非唯一聚集索引中的uniquifiers引入的开销。 清单7-1中显示的代码创建了三个具有相同结构的不同表,并使用每个65,536行填充它们。 表dbo.UniqueCI是唯一定义了唯一聚集索引的表。 表dbo.NonUniqueCINoDups没有任何重复的键值。 最后,表dbo.NonUniqueCDups在索引中有大量重复项。
Listing 7-1. Nonunique clustered index: Table creation
例7-1。 非唯一聚集索引:表创建
create table dbo.UniqueCI
(
KeyValue int not null,
ID int not null,
Data char(986) null,
VarData varchar(32) not null
constraint DEF_UniqueCI_VarData
default 'Data'
);
create unique clustered index IDX_UniqueCI_KeyValue
on dbo.UniqueCI(KeyValue);
create table dbo.NonUniqueCINoDups
(
KeyValue int not null,
ID int not null,
Data char(986) null,
VarData varchar(32) not null
constraint DEF_NonUniqueCINoDups_VarData
default 'Data'
);
create /*unique*/ clustered index IDX_NonUniqueCINoDups_KeyValue
on dbo.NonUniqueCINoDups(KeyValue);
create table dbo.NonUniqueCIDups
(
KeyValue int not null,
ID int not null,
Data char(986) null,
VarData varchar(32) not null
constraint DEF_NonUniqueCIDups_VarData
default 'Data'
);
create /*unique*/ clustered index IDX_NonUniqueCIDups_KeyValue
on dbo.NonUniqueCIDups(KeyValue);
-- Populating data
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueCI(KeyValue, ID)
select ID, ID from IDs;
insert into dbo.NonUniqueCINoDups(KeyValue, ID)
select KeyValue, ID from dbo.UniqueCI;
insert into dbo.NonUniqueCIDups(KeyValue, ID)
select KeyValue % 10, ID from dbo.UniqueCI;
Now, let’s look at the clustered indexes’ physical statistics for each table. The code for this is shown in
Listing 7-2, and the results are shown in Figure 7-1.
现在,让我们看一下每个表的聚集索引的物理统计信息。 代码如例7-2所示,结果如图7-1所示。
Listing 7-2. Nonunique clustered index : Checking clustered indexes’ row sizes
例7-2 非唯一聚集索引:检查聚集索引的行大小
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.UniqueCI'), 1, null ,'DETAILED');
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCINoDups'), 1, null
,'DETAILED');
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCIDups'), 1, null
,'DETAILED');
图7-1。 非唯一聚集索引:聚集索引的行大小
Even though there are no duplicated key values in the dbo.NonUniqueCINoDups table, there are still two
extra bytes added to the row. SQL Server stores a uniquifier in the variable-length section of the data, and
those two bytes are added by yet another entry in a variable-length data offset array.
即使dbo.NonUniqueCINoDups表中没有重复的键值,仍然有两个额外的字节添加到该行。 SQL Server将一个uniquifier存储在数据的可变长度部分中,并且这两个字节由可变长度数据偏移数组中的另一个条目添加。
In the case, when a clustered index has duplicate values, uniquifiers add yet another four bytes, which
makes for an overhead of six bytes total.
在这种情况下,当聚集索引具有重复值时,uniquifiers会再添加另外四个字节,总共产生六个字节的开销。
It is worth mentioning that in some edge cases, the extra storage space used by the uniquifier can reduce the number of rows that can fit onto the data page. Our example demonstrates such a condition. As you can see, dbo.UniqueCI uses about 15 percent fewer data pages than the other two tables. Now, let’s see how the uniquifier affects nonclustered indexes. The code shown in Listing 7-3 creates nonclustered indexes in all three tables. Figure 7-2 shows the physical statistics for those indexes.
值得一提的是,在某些边缘情况下,uniquifier使用的额外存储空间可以减少可以放入数据页面的行数。 我们的例子说明了这种情况。 如您所见,dbo.UniqueCI使用的数据页数比其他两个表少15%。 现在,让我们看看uniquifier如何影响非聚集索引。例7-3中显示的代码在所有三个表中创建非聚集索引。 图7-2显示了这些索引的物理统计信息。
Listing 7-3. Nonunique clustered index : Checking nonclustered indexes’ row size
例7-3 非唯一聚集索引:检查非聚集索引的行大小
create nonclustered index IDX_UniqueCI_ID
on dbo.UniqueCI(ID);
create nonclustered index IDX_NonUniqueCINoDups_ID
on dbo.NonUniqueCINoDups(ID);
create nonclustered index IDX_NonUniqueCIDups_ID
on dbo.NonUniqueCIDups(ID);
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.UniqueCI'), 2, null
,'DETAILED');
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCINoDups'), 2, null
,'DETAILED');
select index_level, page_count, min_record_size_in_bytes as [min row size]
,max_record_size_in_bytes as [max row size]
,avg_record_size_in_bytes as [avg row size]
from
sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCIDups'), 2, null
,'DETAILED');
Figure 7-2. Nonunique clustered index: Nonclustered indexes’ row size
图7-2 非唯一聚集索引:非聚集索引的行大小
There is no overhead in the nonclustered index in the dbo.NonUniqueCINoDups table. As you will recall, SQL Server does not store offset information in a variable-length offset array for trailing columns storing NULL data. Nonetheless, the uniquifier introduces eight bytes of overhead in the dbo.NonUniqueCIDups table. Those eight bytes consist of a four-byte uniquifier value, a two-byte variable-length data offset array entry,
and a two-byte entry storing the number of variable-length columns in the row.
dbo.NonUniqueCINoDups表中的非聚集索引没有开销。 您可能还记得,SQL Server不会将偏移量信息存储在可变长度偏移数组中,以用于存储NULL数据的尾随列。 尽管如此,uniquifier在dbo.NonUniqueCIDups表中引入了8个字节的开销。 这八个字节由一个四字节的unquifier值,一个双字节的可变长度数据偏移数组条目和一个存储行中可变长度列数的双字节条目组成。
We can summarize the storage overhead of the uniquifier in the following way. For the rows that have a uniquifier as NULL, there is a two-byte overhead if the index has at least one variable-length column that stores a NOT NULL value. That overhead comes from the variable-length offset array entry for the uniquifier column. There is no overhead otherwise.
我们可以通过以下方式总结uniquifier的存储开销。 对于具有uniquifier为NULL的行,如果索引至少有一个存储NOT NULL值的ariable-length列,则会产生两个字节的开销。 该开销来自uniquifier列的可变长度偏移数组条目。 否则没有开销。
In cases where the uniquifier is populated, the overhead is six bytes if there are variable-length columns that store NOT NULL values. Otherwise, the overhead is eight bytes.
在填充uniquifier的情况下,如果存在存储NOT NULL值的可变长度列,则开销为六个字节。 否则,开销是八个字节。
■ Tip If you expect a large number of duplicates in the clustered index values, you can add an integer identity column as the rightmost column to the index, thereby making it unique. This adds a four-byte predictable storage overhead to every row as compared to an unpredictable up to eight-byte storage overhead introduced by uniquifiers. This can also improve the performance of individual lookup operations when you reference the row by all of its clustered index columns.
■提示如果预计聚集索引值中存在大量重复项,则可以将整数标识列作为索引的最右列添加,从而使其唯一。 与由uniquifiers引入的不可预测的高达8字节的存储开销相比,这为每一行增加了四字节可预测的存储开销。 当您通过其所有聚集索引列引用该行时,这还可以提高单个查找操作的性能。
It is beneficial to design clustered indexes in a way that minimizes index fragmentation caused by inserting new rows. One of the methods to accomplish this is by making clustered index values ever increasing. The index on the identity column is one such example. Another example is a datetime column populated with the current system time at the moment of insertion.
以最小化插入新行导致的索引碎片的方式设计聚集索引是有益的。 实现此目标的方法之一是使聚集索引值不断增加。 标识列上的索引就是一个这样的例子。 另一个示例是使用插入时的当前系统时间填充的日期时间列。
There are two potential issues with ever-increasing indexes, however. The first relates to statistics. As you learned in Chapter 3, the legacy cardinality estimator in SQL Server underestimates cardinality when parameter values are not present in the histogram. You should factor such behavior into your statistics maintenance strategy for the system, unless you are using the new SQL Server 2014-2016 cardinality estimators, which assume that data outside of the histogram has distributions similar to those of other data in the table.
然而,不断增加的指数存在两个潜在的问题。 第一个涉及统计。 正如您在第3章中学到的,当直方图中不存在参数值时,SQL Server中的遗留基数估计器会低估基数。 您应该将此类行为纳入系统的统计信息维护策略,除非您使用新的SQL Server 2014-2016基数估算器,该估算器假定直方图之外的数据具有与表中其他数据类似的分布。
The next problem is more complicated. With ever-increasing indexes, the data is always inserted at the end of the index. On the one hand, it prevents page splits and reduces fragmentation. On the other hand, it can lead to hot spots, which are serialization delays that occur when multiple sessions are trying to modify the same data page and/or allocate new pages or extents. SQL Server does not allow multiple sessions to update the same data structures, and instead serializes those operations.
下一个问题更复杂。 随着索引不断增加,数据总是插入到指数结束。 一方面,它可以防止页面拆分并减少碎片。 另一方面,它可能导致热点,这是当多个会话试图修改相同数据页和/或分配新页面或范围时发生的序列化延迟。 SQL Server不允许多个会话更新相同的数据结构,而是序列化这些操作。
Hot spots are usually not an issue unless a system collects data at a very high rate and the index handles hundreds of inserts per second. We will discuss how to detect such an issue in Chapter 27, “System Troubleshooting.”
除非系统以非常高的速率收集数据并且索引每秒处理数百个插入,否则热点通常不是问题。 我们将在第27章“系统故障排除”中讨论如何检测此类问题。
Finally, if a system has a set of frequently executed and important queries, it might be beneficial to consider a clustered index, which optimizes them. This eliminates expensive key lookup operations and improves the performance of the system.
最后,如果系统具有一组频繁执行且重要的查询,则考虑聚合索引可能是有益的,这会优化它们。 这消除了昂贵的密钥查找操作并提高了系统的性能。
Even though such queries can be optimized by using covering nonclustered indexes, it is not always the ideal solution. In some cases, it requires you to create very wide nonclustered indexes, which will use up a lot of storage space both on disk and in the buffer pool.
即使可以使用覆盖非聚集索引来优化此类查询,但它并不总是理想的解决方案。 在某些情况下,它需要您创建非常宽的非聚集索引,这将占用磁盘和缓冲池中的大量存储空间。
Another important factor is how often columns are modified. Adding frequently modified columns to nonclustered indexes requires SQL Server to change data in multiple places, which negatively affects the update performance of the system and increases blocking.
另一个重要因素是修改列的频率。 将经常修改的列添加到非聚集索引需要SQL Server在多个位置更改数据,这会对系统的更新性能产生负面影响并增加阻塞。
With all that being said, it is not always possible to design clustered indexes that will satisfy all of these guidelines. Moreover, you should not consider these guidelines to be absolute requirements. You should analyze the system, business requirements, workload, and queries and choose clustered indexes that would benefit you, even if they violate some of those guidelines.
尽管如此,并不总是能够设计满足所有这些准则的聚集索引。 此外,您不应将这些指南视为绝对要求。 您应该分析系统,业务需求,工作负载和查询,并选择对您有益的聚集索引,即使它们违反了某些指导原则。
Identities, Sequences, and Uniqueidentifiers
身份,序列和唯一标识符
People often choose identities, sequences, and uniqueidentifiers as clustered index keys. As always, that approach has its own set of pros and cons.
人们通常选择身份,序列和唯一标识符作为聚集索引键。 与往常一样,这种方法有其自身的优缺点。
Clustered indexes defined on such columns are unique, static, and narrow. Moreover, identities and sequences are ever increasing, which reduces index fragmentation. One of the ideal use cases for them is catalog entity tables. You can think about tables, which store lists of customers, articles, or devices, as an example. Those tables store thousands, or maybe even a few million, rows, although the data is relatively static, and, as a result, hot spots are not an issue. Moreover, such tables are usually referenced by foreign keys and used in joins. Indexes on integer or bigint columns are very compact and efficient, which will improve the performance of queries.
在此类列上定义的聚集索引是唯一的,静态的和窄的。 此外,身份和序列不断增加,这减少了索引碎片。 其中一个理想的用例是目录实体表。 作为示例,您可以考虑存储客户,文章或设备列表的表。 这些表存储数千甚至数百万行,尽管数据相对静态,因此热点不是问题。 此外,这些表通常由外键引用并用于连接。 integer或bigint列上的索引非常紧凑和高效,这将提高查询的性能。
■ Note We will discuss foreign key constraints in greater detail in Chapter 8, “Constraints.” Clustered indexes on identity or sequence columns are less efficient in the case of transactional tables, which collect large amounts of data at a very high rate, due to the potential hot spots they introduce.
Uniqueidentifiers, on the other hand, are rarely a good choice for indexes, both clustered and nonclustered. Random values generated with the NEWID() function greatly increase index fragmentation. Moreover, indexes on uniqueidentifiers decrease the performance of batch operations. Let’s look at an example and create two tables: one with clustered indexes on identity columns and one with clustered indexes on uniqueidentifier columns. In the next step, we will insert 65,536 rows into both tables. You can see the code for doing this in Listing 7-4.
■注意我们将在第8章“约束”中更详细地讨论外键约束。在事务表的情况下,标识或序列列上的聚集索引效率较低,因为事务表以非常高的速率收集大量数据 他们引入的潜在热点。 另一方面,Uniqueidentifiers很少是集群和非集群索引的理想选择。 使用NEWID()函数生成的随机值极大地增加了索引碎片。 此外,uniqueidentifiers上的索引会降低批处理操作的性能。 让我们看一个示例并创建两个表:一个表在标识列上有聚集索引,另一个在uniqueidentifier列上有聚集索引。 在下一步中,我们将在两个表中插入65,536行。 您可以在例7-4中看到执行此操作的代码。
Listing 7-4. Uniqueidentifiers: Table creation
例7-4 Uniqueidentifiers:表创建
create table dbo.IdentityCI
(
ID int not null identity(1,1),
Val int not null,
Placeholder char(100) null
);
create unique clustered index IDX_IdentityCI_ID
on dbo.IdentityCI(ID);
create table dbo.UniqueidentifierCI
(
ID uniqueidentifier not null
constraint DEF_UniqueidentifierCI_ID
default newid(),
Val int not null,
Placeholder char(100) null,
);
create unique clustered index IDX_UniqueidentifierCI_ID
on dbo.UniqueidentifierCI(ID)
go
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.IdentityCI(Val)
select ID from IDs;
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueidentifierCI(Val)
select ID from IDs;
The execution time on my computer and number of reads are shown in Table 7-1. Figure 7-3 shows execution plans for both queries.
我的计算机上的执行时间和读取次数如表7-1所示。 如图7-3所示
两个查询的执行计划。
Table 7-1. Inserting Data into the Tables: Execution Statistics
表7-1 将数据插入表:执行统计
Figure 7-3. Inserting data into the tables: Execution plans
图7-3 将数据插入表中:执行计划
As you can see, there is another sort operator in the case of the index on the uniqueidentifier column. SQL Server sorts randomly generated uniqueidentifier values before the insert, which decreases the performance of the query.
如您所见,uniqueidentifier列上的索引有另一个排序运算符。 SQL Server在插入之前对随机生成的uniqueidentifier值进行排序,这会降低查询的性能。
Let’s insert another batch of rows into the table and check index fragmentation. The code for doing this
is shown in Listing 7-5. Figure 7-4 shows the results of the queries.
让我们在表中插入另一批行并检查索引碎片。 执行此操作的代码如清单7-5所示。 图7-4显示了查询的结果。
Listing 7-5. Uniqueidentifiers : Inserting rows and checking fragmentation
例7-5 Uniqueidentifiers:插入行并检查碎片
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.IdentityCI(Val)
select ID from IDs;
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueidentifierCI(Val)
select ID from IDs;
select page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.IdentityCI'),1,null,'DETAILED');
select page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(),object_id(N'dbo.UniqueidentifierCI'),1,null
,'DETAILED');
Figure 7-4. Fragmentation of the indexes
图7-4 索引碎片
As you can see, the index on the uniqueidentifier column is heavily fragmented, and it uses about 40
percent more data pages as compared to the index on the identity column.
如您所见,uniqueidentifier列上的索引严重碎片化,与标识列上的索引相比,它使用的数据页数大约多40%。
A batch insert into the index on the uniqueidentifier column inserts data at different places in the data file, which leads to heavy, random physical I/O in the case of large tables. This can significantly decrease the performance of the operation.
在uniqueidentifier列的索引中的批量插入会在数据文件的不同位置插入数据,这会导致在大型表的情况下出现繁重的随机物理I / O. 这可能会显着降低操作性能。
PERSONAL EXPERIENCE
个人经验
Some time ago, I had been involved in the optimization of a system that had a 250 GB table with one clustered and three nonclustered indexes. One of the nonclustered indexes was the index on the niqueidentifier column. By removing this index, we were able to speed up a batch insert of 50,000 rows from 45 seconds down to 7 seconds.
前段时间,我参与了一个系统的优化,该系统具有250 GB的表,其中包含一个聚集索引和三个非聚集索引。 其中一个非聚集索引是uniqueidentifier列上的索引。 通过删除此索引,我们能够将50,000行的批量插入从45秒加速到7秒。
There are two common use cases for when you would want to create indexes on uniqueidentifier columns. The first one is for supporting the uniqueness of values across multiple databases. Think about a distributed system where rows can be inserted into every database. Developers often use uniqueidentifiers to make sure that every key value is unique system wide.
当您想要在uniqueidentifier列上创建索引时,有两种常见用例。 第一个是支持跨多个数据库的值的唯一性。 想想可以将行插入每个数据库的分布式系统。 开发人员经常使用uniqueidentifiers来确保每个键值在系统范围内都是唯一的。
The key element in such an implementation is how key values were generated. As you have already seen, the random values generated with the NEWID() function or in the client code negatively affect system performance. However, you can use the NEWSEQUENTIALID() function, which generates unique and generally ever-increasing values (SQL Server resets their base value from time to time). Indexes on uniqueidentifier columns generated with the NEWSEQUENTIALID() function are similar to indexes on identity and sequence columns; however, you should remember that the uniqueidentifier data type uses 16 bytes of storage space, compared to the 4-byte int or 8-byte bigint data types.
此类实现中的关键元素是如何生成键值。 正如您已经看到的,使用NEWID()函数或客户端代码生成的随机值会对系统性能产生负面影响。 但是,您可以使用NEWSEQUENTIALID()函数,该函数生成唯一且通常不断增加的值(SQL Server会不时重置其基值)。 使用NEWSEQUENTIALID()函数生成的uniqueidentifier列的索引类似于identity和sequence列的索引; 但是,您应该记住,uniqueidentifier数据类型使用16字节的存储空间,而4字节的int或8字节的bigint数据类型。
As an alternative solution, you may consider creating a composite index with two columns (InstallationId, Unique_Id_Within_Installation). The combination of these two columns guarantees uniqueness across multiple installations and databases and uses less storage space than uniqueidentifiers do. You can use an integer identity or sequence to generate the Unique_Id_Within_Installation value, which will reduce the fragmentation of the index.
作为替代解决方案,您可以考虑创建具有两列(InstallationId,Unique_Id_Within_Installation)的复合索引。 这两列的组合保证了多个安装和数据库的唯一性,并且比独特标识符使用更少的存储空间。 您可以使用整数标识或序列来生成Unique_Id_Within_Installation值,这将减少索引的碎片。
In cases where you need to generate unique key values across all entities in the database, you can consider using a single sequence object across all entities. This approach fulfils the requirement but uses a smaller data type than uniqueidentifiers.
如果需要在数据库中的所有实体上生成唯一键值,则可以考虑在所有实体中使用单个序列对象。 此方法满足要求,但使用比uniqueidentifier更小的数据类型。
Another common use case is security, where a uniqueidentifier value is used as a security token or a random object ID. Unfortunately, you cannot use the NEWSEQUENTIALID() function in this scenario, because it is possible to guess the next value returned by that function.
另一个常见用例是安全性,其中uniqueidentifier值用作安全性令牌或随机对象ID。 不幸的是,您无法在此方案中使用NEWSEQUENTIALID()函数,因为可以猜测该函数返回的下一个值。
One possible improvement in this scenario is creating a calculated column using the CHECKSUM() function, indexing it afterward without creating the index on the uniqueidentifier column. The code is shown in Listing 7-6.
在这种情况下,一种可能的改进是使用CHECKSUM()函数创建计算列,然后对其进行索引,而不在uniqueidentifier列上创建索引。 代码如例7-6所示。
Listing 7-6. Using CHECKSUM(): Table structure
例7-6 使用CHECKSUM():表结构
create table dbo.Articles
(
ArticleId int not null identity(1,1),
ExternalId uniqueidentifier not null
constraint DEF_Articles_ExternalId
default newid(),
ExternalIdCheckSum as checksum(ExternalId),
/* Other Columns */
);
create unique clustered index IDX_Articles_ArticleId
on dbo.Articles(ArticleId);
create nonclustered index IDX_Articles_ExternalIdCheckSum
on dbo.Articles(ExternalIdCheckSum);
■ Tip You can index a calculated column without persisting it. Even though the IDX_Articles_ExternalIdCheckSum index is going to be heavily fragmented, it will be more compact as compared to the index on the uniqueidentifier column (a 4-byte key versus 16 bytes). It also improves the performance of batch operations because of faster sorting, which also requires less memory to proceed.
■提示您可以索引计算列而不保留它。 尽管IDX_Articles_ExternalIdCheckSum索引将严重分段,但与uniqueidentifier列上的索引(4字节密钥与16字节)相比,它将更紧凑。 它还提高了批处理操作的性能,因为更快的排序,这也需要更少的内存来进行。
One thing that you must keep in mind is that the result of the CHECKSUM() function is not guaranteed to be unique. You should include both predicates to the queries, as shown in Listing 7-7.
您必须记住的一件事是CHECKSUM()函数的结果不保证是唯一的。 您应该在查询中包含两个谓词,如例7-7所示。
Listing 7-7. Using CHECKSUM(): Selecting data
例7-7 使用CHECKSUM():选择数据
select ArticleId /* Other Columns */
from dbo.Articles
where checksum(@ExternalId) = ExternalIdCheckSum and ExternalId = @ExternalId
■ Tip You can use the same technique in cases where you need to index string columns larger than 900/1,700 bytes, which is the maximum size of a nonclustered index key. Even though such an index would not support range scan operations, it could be used for point lookups.
■提示如果需要索引大于900 / 1,700字节的字符串列(这是非聚集索引键的最大大小),则可以使用相同的技术。 即使这样的索引不支持范围扫描操作,它也可以用于点查找。