SQL Server中的查询优化技术:基础

描述 (Description)

Fixing and preventing performance problems is critical to the success of any application. We will use a variety of tools and best practices to provide a set of techniques that can be used to analyze and speed up any performance problem!

修复和防止性能问题对于任何应用程序的成功都是至关重要的。 我们将使用各种工具和最佳实践来提供可用于分析和加速任何性能问题的技术!

This is one of my personal favorite areas of research and discussion as it is inherently satisfying. Taking a performance nightmare and tuning it into something fast and sleek feels great and will undoubtedly make others happy.

这是我个人最喜欢的研究和讨论领域之一,因为它本质上令人满意。 参加一场表演噩梦并将其调整为快速而时尚的感觉真是太好了,并且无疑会让其他人感到高兴。

I often view optimization as a detective mystery. Something terrible has happened and you need to follow clues to locate and apprehend the culprit! This series of articles is all about these clues, how to identify them, and how to use them in order to find the root cause of a performance problem.

我经常将优化视为侦探之谜。 发生了一件可怕的事情,您需要遵循线索来找到并逮捕罪魁祸首! 本系列文章全部涉及这些线索,如何识别它们以及如何使用它们以查找性能问题的根本原因。

  • For more information about Query optimization, see the SQL Query Optimization — How to Determine When and If It’s Needed article

    有关查询优化的更多信息,请参见“ SQL查询优化-如何确定何时以及是否需要”一文。

定义优化 (Defining Optimization)

What is “optimal”? The answer to this will also determine when we are done with a problem and can move onto the next one. Often, a query can be sped up through many different means, each of which has an associated time and resource cost.

什么是“最佳”? 答案也将决定我们何时解决问题,并可以继续解决下一个问题。 通常,可以通过许多不同的方式加快查询的速度,每种方式都有相关的时间和资源成本。

We usually cannot spend the resources needed to make a script run as fast as possible, nor should we want to. For the sake of simplicity, we will define “optimal” as the point at which a query performs acceptably and will continue to do so for a reasonable amount of time in the future. This is as much as a business definition as it is a technical definition. With infinite money, time, and computing resources, anything is possible, but we do not have the luxury of unlimited resources, and therefore must define what “done” is whenever we chase any performance problem.

我们通常不能花费使脚本尽可能快地运行所需的资源,我们也不应该这样做。 为了简单起见,我们将“最佳”定义为查询可接受的执行点,并且在将来的合理时间内将继续执行该操作。 这既是业务定义,又是技术定义。 有了无限的金钱,时间和计算资源,一切皆有可能,但是我们没有无限资源的奢侈,因此,当我们追求任何性能问题时,必须定义“完成”是什么。

This provides us with several useful checkpoints that will force us to re-evaluate our progress as we optimize:

这为我们提供了几个有用的检查点,这些检查点将迫使我们在优化时重新评估我们的进度:

  1. The query now performs adequately.

    查询现在可以正常执行。
  2. The resources needed to optimize further are very expensive.

    进一步优化所需的资源非常昂贵。
  3. We have reached a point of diminishing returns for any further optimization.

    对于任何进一步的优化,我们已经达到了收益递减的地步。
  4. A completely different solution is discovered that renders this unneeded.

    发现了一个完全不同的解决方案,从而不需要此解决方案。

Over-optimization sounds good, but in the context of resource management is generally wasteful. A giant (but unnecessary) covering index will cost us computing resources whenever we write to a table for the rest of eternity (a long time). A project to rewrite code that was already acceptable might cost days or weeks of development and QA time. Trying to further tweak an already good query may net a gain of 3%, but take a week of sweating to get there.

过度优化听起来不错,但是在资源管理的情况下通常是浪费的。 每当我们在永恒的余下时间(很长一段时间)中向表写入数据时,一个庞大的(但不必要的)覆盖索引将使我们浪费计算资源。 重写已经可以接受的代码的项目可能会花费数天或数周的开发和质量检查时间。 尝试进一步调整本来不错的查询可能会获得3%的收益,但要花上一周的时间才能达到目标。

Our goal is to solve a problem and not over-solve it.

我们的目标是解决问题,而不是过度解决。

查询做什么? (What Does the Query Do?)

Question #1 that we must always answer is: What is the purpose of a query?

我们必须始终回答的问题#1:查询的目的是什么?

  • What is its purpose?

    目的是什么?
  • What should the result set look like?

    结果集应该是什么样?
  • What sort of code, report, or UI is generating the query?

    什么样的代码,报告或UI会生成查询?

It is first-nature for us to want to dive in with a sword in hand and slay the dragon as quickly as humanly possible. We have a trace running, execution plans in hand, and a pile of IO and timing statistics collected before realizing that we have no idea what we are doing

我们想用手中的剑跳入水中并尽快杀死人,这是我们的天性。 在意识到我们不知道自己在做什么之前,我们有一个跟踪运行,手头的执行计划以及一堆IO和时序统计信息

Step #1 is to step back and understand the query. Some helpful questions that can aid in optimization:

步骤#1是退后一步并了解查询。 一些有助于优化的有用问题:

  • How large is the result set? Should we brace ourselves for a million rows returned, or just a few? 结果集有多大? 我们应该为返回的一百万行还是为几行做好准备?
  • Are there any parameters that have limited values? Will a given parameter always have the same value, or are there other limitations on values that can simplify our work by eliminating avenues of research. 是否有任何值有限的参数? 给定的参数将始终具有相同的值,还是在值上存在其他限制,可以通过消除研究途径来简化我们的工作。
  • How often is the query executed? Something that occurs once a day will be treated very differently than one that is run every second. 查询多久执行一次? 每天发生一次的事情与每秒运行一次的事情将有很大不同。
  • Are there any invalid or unusual input values that are indicative of an application problem? Is one input set to NULL, but never should be NULL? Are any other inputs set to values that make no sense, are contradictory, or otherwise go against the use-case of the query? 是否有任何指示应用程序问题的无效或异常输入值? 一个输入是否设置为NULL,但永远不应为NULL? 是否将其他任何输入设置为没有意义,自相矛盾或违反查询用例的值?
  • Are there any obvious logical, syntactical, or optimization problems staring us in the face? Do we see any immediate performance bombs that will always perform poorly, regardless of parameter values or other variables? More on these later when we discuss optimization techniques. 有没有明显的逻辑,句法或优化问题盯着我们? 我们是否看到任何即时性能炸弹,无论参数值或其他变量如何,总是会表现不佳? 稍后,当我们讨论优化技术时,将详细介绍这些内容。
  • What is acceptable query performance? How fast must the query be for its consumers to be happy? If server performance is poor, how much do we need to decrease resource consumption for it to be acceptable? Lastly, what is the current performance of the query? This will provide us with a baseline so we know how much improvement is needed. 可接受的查询性能是什么? 查询必须多快才能使消费者满意? 如果服务器性能不佳,我们需要减少多少资源消耗才能使服务器可接受? 最后,查询的当前性能如何? 这将为我们提供基线,因此我们知道需要多少改进。

By stopping and asking these questions prior to optimizing a query, we avoid the uncomfortable situation in which we spend hours collecting data about a query only to not fully understand how to use it. In many ways, query optimization and database design force us to ask many of the same questions.

通过在优化查询之前停止并提出这些问题,我们避免了这种不舒服的情况,在这种情况下,我们花费数小时来收集有关查询的数据只是为了不完全了解如何使用它。 在许多方面,查询优化和数据库设计迫使我们提出许多相同的问题。

The results of this additional foresight will often lead us to more innovative solutions. Maybe a new index isn’t needed and we can break a big query into a few smaller ones. Maybe one parameter value is incorrect and there is a problem in code or the UI that needs to be resolved. Maybe a report is run once a week, so we can pre-cache the data set and send the results to an email, dashboard, or file, rather than force a user wait 10 minutes for it interactively.

这种额外的远见卓识往往会导致我们获得更多创新的解决方案。 也许不需要新索引,我们可以将一个大查询分解为几个较小的查询。 可能一个参数值不正确,并且代码或UI中存在问题需要解决。 也许报告每周运行一次,所以我们可以预先缓存数据集并将结果发送到电子邮件,仪表板或文件,而不必强迫用户以交互方式等待10分钟。

工具类 (Tools)

To keep things simple, we’ll use only a handful of tools in this article:

为简单起见,本文中将仅使用少数工具:

执行计划 (Execution Plans)

An execution plan provides a graphical representation of how the query optimizer chose to execute a query:

执行计划提供了查询优化器如何选择执行查询的图形表示:

SQL Server中的查询优化技术:基础_第1张图片

The execution plan shows us which tables were accessed, how they were accessed, how they were joined together, and any other operations that occurred along the way. Included are query costs, which are estimates of the overall expense of any query component. A treasure trove of data is also included, such as row size, CPU cost, I/O cost, and details on which indexes were utilized.

执行计划向我们显示访问了哪些表,如何访问它们,如何将它们连接在一起以及在此过程中发生的任何其他操作。 其中包括查询成本,这是对任何查询组件的总费用的估计。 还包括大量数据,例如行大小,CPU成本,I / O成本以及使用索引的详细信息。

In general, what we are looking for are scenarios in which large numbers of rows are being processed by any given operation within the execution plan. Once we have found a high cost component, we can zoom in on what the cause is and how to resolve it.

通常,我们要寻找的是执行计划中的任何给定操作正在处理大量行的场景。 一旦找到了高成本的组成部分,我们就可以放大原因并解决问题。

统计IO (STATISTICS IO)

This allows us to see how many logical and physical reads are made when a query is executed and may be turned on interactively in SQL Server Management Studio by running the following TSQL:

这使我们可以查看执行查询时进行了多少逻辑和物理读取,并且可以通过运行以下TSQL在SQL Server Management Studio中以交互方式打开它:

SET STATISTICS IO ON;

将统计信息IO设置为ON;

Once on, we will see additional data included in the Messages pane:

启用之后,我们将在“消息”窗格中看到其他数据:

Logical reads tell us how many reads were made from the buffer cache. This is the number that we will refer to whenever we talk about how many reads a query is responsible for, or how much IO it is causing.

逻辑读取告诉我们从缓冲区高速缓存进行了多少次读取。 每当谈论一个查询负责多少次读取或引起多少IO时,我们都将使用此数字。

Physical reads tell us how much data was read from a storage device as it was not yet present in memory. This can be a useful indication of buffer cache/memory capacity problems if data is very frequently being read from storage devices, rather than memory.

物理读取告诉我们从存储设备读取了多少数据,因为它们尚未出现在内存中。 如果经常从存储设备而非内存中读取数据,这可能是缓冲区高速缓存/内存容量问题的有用指示。

In general, IO will be the primary cause of latency and bottlenecks when analyzing slow queries. The unit of measurement of STATISTICS IO = 1 read = a single 8kb page = 8192 bytes.

通常,在分析慢查询时,IO将成为延迟和瓶颈的主要原因。 STATISTICS IO的度量单位= 1读取=一个8kb页面= 8192字节。

查询时长 (Query Duration)

Typically, the #1 reason we will research a slow query is because someone has complained and told us that it is too slow. The time it takes a query to execute is going to often be the smoking gun that leads us to a performance problem in need of a solution.

通常,我们研究缓慢查询的第一原因是因为有人抱怨并告诉我们它太慢了。 执行查询所花费的时间通常是抽烟,这导致我们遇到需要解决方案的性能问题。

For our work here, we will measure duration manually using the timer found in the lower-right hand corner of SSMS:

对于此处的工作,我们将使用SSMS右下角的计时器手动测量持续时间:

There are other ways to accurately measure query duration, such as setting on STATISTICS TIME, but we’ll focus on queries that are slow enough that such a level of accuracy will not be necessary. We can easily observe when a 30 second query is improved to run in sub-second time. This also reinforces the role of the user as a constant source of feedback as we try to improve the speed of an application.

还有其他一些方法可以精确地测量查询持续时间,例如在STATISTICS TIME上进行设置,但是我们将重点放在足够慢的查询上,以至于不需要这样的准确性。 我们可以轻松地观察到30秒的查询何时可以改进以在亚秒内运行。 当我们试图提高应用程序速度时,这也加强了用户作为不断反馈源的作用。

我们的眼睛 (Our Eyes)

Many performance problems are the result of common query patterns that we will become familiar with below. This pattern recognition allows us to short-circuit a great deal of research when we see something that is clearly poorly written.

许多性能问题是我们将在下面熟悉的常见查询模式的结果。 当我们看到明显写得不好的东西时,这种模式识别使我们可以将大量研究短路。

As we optimize more and more queries, quickly identifying these indicators becomes more second-nature and we’ll get the pleasure of being able to fix a problem quickly, without the need for very time-consuming research.

随着我们对越来越多的查询进行优化,快速识别这些指标变得更加自然,我们将很高兴能够快速解决问题,而无需进行非常耗时的研究。

In addition to common query mistakes, we will also look out for any business logic hints that may tell us if there is an application problem, parameter issue, or some other flaw in how the query was generated that may require involvement from others aside from us.

除了常见的查询错误外,我们还将寻找可能告诉我们是否存在应用程序问题,参数问题或查询生成方式中是否存在其他一些缺陷的业务逻辑提示,这些缺陷可能需要我们以外的其他人参与。

查询优化器做什么? (What Does the Query Optimizer Do?)

Every query follows the same basic process from TSQL to completing execution on a SQL Server:

从TSQL到在SQL Server上完成执行,每个查询都遵循相同的基本过程:

Parsing is the process by which query syntax is checked. Are keywords valid and are the rules of the TSQL language being followed correctly. If you made a spelling error, named a column using a reserved word, or forgot a semicolon before a common table expression, this is where you’ll get error messages informing you of those problems.

解析是检查查询语法的过程。 关键字是否有效,是否正确遵循TSQL语言的规则? 如果您犯了拼写错误,使用保留字命名了列或在公用表表达式之前忘记了分号,那么您将在此处收到错误消息,以通知您这些问题。

Binding checks all objects referenced in your TQL against the system catalogs and any temporary objects defined within your code to determine if they are both valid and referenced correctly. Information about these objects is retrieved, such as data types, constraints, and if a column allows NULL or not. The result of this step is a query tree that is composed of a basic list of the processes needed to execute the query. This provides basic instructions, but does not yet include specifics, such as which indexes or joins to use.

绑定检查系统目录中TQL中引用的所有对象以及代码中定义的任何临时对象,以确定它们是否有效并正确引用。 检索有关这些对象的信息,例如数据类型,约束以及列是否允许为NULL。 此步骤的结果是一个查询树,该树由执行查询所需的基本过程列表组成。 这提供了基本说明,但尚未包括具体说明,例如要使用的索引或联接。

Optimization is the process that we will reference most often here. The optimizer operates similarly to a chess (or any gaming) computer. It needs to consider an immense number of possible moves as quickly as possible, remove the poor choices, and finish with the best possible move. At any point in time, there may be millions of combinations of moves available for the computer to consider, of which only a handful will be the best possible moves. Anyone that has played chess against a computer knows that the less time the computer has, the more likely it is to make an error.

优化是我们在这里最常引用的过程。 优化器的操作类似于象棋(或任何游戏)计算机。 它需要尽快考虑大量可能的动作,消除错误的选择,并以最佳的动作完成。 在任何时间点,计算机都会考虑数百万种动作组合,其中只有极少数是最佳动作。 对计算机下过象棋的任何人都知道,计算机拥有的时间越短,出错的可能性就越大。

In the world of SQL Server, we will talk about execution plans instead of chess moves. The execution plan is the set of specific steps that the execution engine will follow to process a query. Every query has many choices to make to arrive at that execution plan and must do so in a very short span of time.

在SQL Server的世界中,我们将讨论执行计划而不是象棋棋步。 执行计划是执行引擎处理查询所遵循的一组特定步骤。 每个查询都有很多选择可以到达执行计划,并且必须在很短的时间内完成。

These choices include questions such as:

这些选择包括以下问题:

  • What order should tables be joined?

    表应该以什么顺序连接?
  • What joins should be applied to tables?

    什么联接应应用于表?
  • Which indexes should be used?

    应该使用哪些索引?
  • Should a seek or scan be used against a given table?

    是否应针对给定的表使用搜索或扫描?
  • Is there a benefit in caching data in a worktable or spooling data for future use?

    将数据缓存在工作表中或假脱机数据以备将来使用是否有好处?

Any execution plan that is considered by the optimizer must return the same results, but the performance of each plan may differ due to those questions above (and many more!).

优化程序考虑的任何执行计划都必须返回相同的结果,但是每个计划的性能可能会由于上述问题(甚至更多!)而有所不同。

Query optimization is a CPU-intensive operation. The process to sift through plans requires significant computing resources and to find the best plan may require more time than is available. As a result, a balance must be maintained between the resources needed to optimize the query, the resources required to execute the query, and the time we must wait for the entire process to complete. As a result, the optimizer is not built to select the best execution plan, but instead to search and find the best possible plan after a set amount of time passes. It may not be the perfect execution plan, but we accept that as a limitation of how a process with so many possibilities must operate.

查询优化是一项占用大量CPU的操作。 筛选计划的过程需要大量的计算资源,而找到最佳计划可能需要比可用时间更多的时间。 因此,必须在优化查询所需的资源,执行查询所需的资源以及必须等待整个过程完成的时间之间保持平衡。 结果,优化器不是为了选择最佳执行计划而构建的,而是经过一定时间后搜索并找到最佳可能的计划。 这可能不是一个完美的执行计划,但我们接受这是一个限制,要求必须处理具有多种可能性的流程。

The metric used to judge execution plans and decide which to consider or not is query cost. The cost has no unit and is a relative measure of the resources required to execute each step of an execution plan. The overall query cost is the sum of the costs of each step within a query. You can view these costs in any execution plan:

用于判断执行计划并决定考虑或不考虑的指标是查询成本。 成本没有单位,是执行计划的每个步骤所需资源的相对度量。 总查询成本是查询中每个步骤的成本之和。 您可以在任何执行计划中查看这些成本:

Subtree costs for each component of a query are calculated and used to either:

计算查询每个组成部分的子树成本,并将其用于以下任一情况:

  1. Remove a high-cost execution plan and any similar ones from the pool of available plans.

    从可用计划池中删除高成本的执行计划以及任何类似的计划。
  2. Rank the remaining plans based on how low their cost is.

    根据剩余计划的成本降低其排名。

While query cost is a useful metric to understand how SQL Server has optimized a particular query, it is important to remember that its primary purpose is to aid the query optimizer in choosing good execution plans. It is not a direct measure of IO, CPU, memory, duration, or any other metric that matters to an application user waiting for query execution to complete. A low query cost may not indicate a fast query or the best plan. Alternatively, a high query cost may sometimes be acceptable. As a result, it’s best to not rely heavily on query cost as a metric of performance.

虽然查询成本是了解SQL Server如何优化特定查询的有用指标,但重要的是要记住,它的主要目的是帮助查询优化器选择良好的执行计划。 它不是IO,CPU,内存,持续时间或任何其他对等待查询执行完成的应用程序用户重要的指标的直接度量。 低查询成本可能并不表示快速查询或最佳计划。 备选地,有时可以接受高查询成本。 因此,最好不要过分依赖查询成本作为性能指标。

As the query optimizer churns through candidate execution plans, it will rank them from lowest cost to highest cost. Eventually, the optimizer will reach one of the following conclusions:

当查询优化器遍历候选执行计划时,它将把它们从最低成本到最高成本进行排名。 最终,优化器将得出以下结论之一:

  • Every execution plan has been evaluated and the best one chosen.

    每个执行计划都经过评估,并选择了最佳执行计划。
  • There isn’t enough time to evaluate every plan, and the best one thus far is chosen.

    没有足够的时间来评估每个计划,因此选择了迄今为止最好的计划。

Once an execution plan is chosen, the query optimizer’s job is complete and we can move to the final step of query processing.

一旦选择了执行计划,查询优化器的工作便完成了,我们可以进入查询处理的最后一步。

Execution is the final step. SQL Server takes the execution plan that was identified in the optimization step and follows those instructions in order to execute the query.

执行是最后一步。 SQL Server采用在优化步骤中确定的执行计划,并遵循这些指令以执行查询。

A note on plan reuse: Because optimizing is an inherently expensive process, SQL Server maintains an execution plan cache that stores details about each query executed on a server and the plan that was chosen for it. Typically, databases experience the same queries executed over and over again, such as a web search, order placement, or social media post. Reuse allows us to avoid the expensive optimization process and rely on the work we have previously done to optimize a query.

关于计划重用的注释:由于优化是一个固有的昂贵过程,因此SQL Server维护一个执行计划缓存,该缓存存储有关在服务器上执行的每个查询以及为其选择的计划的详细信息。 通常,数据库会经历一遍又一遍执行的相同查询,例如Web搜索,订单放置或社交媒体帖子。 重用使我们避免了昂贵的优化过程,而依靠我们之前完成的工作来优化查询。

When a query is executed that already has a valid plan in cache, that plan will be chosen, rather than going through the process of building a new one. This saves computing resources and speeds up query execution immensely. We’ll discuss plan reuse more in a future article when we tackle parameter sniffing.

当查询已经在缓存中具有有效计划的查询执行时,将选择该计划,而不是执行构建新计划的过程。 这样可以节省计算资源并极大地加快查询的执行速度。 在处理参数嗅探时,我们将在以后的文章中讨论计划重用。

查询优化中的常见主题 (Common Themes in Query Optimization)

With the introduction out of the way, let’s dive into optimization! The following is a list of the most common metrics that will assist in optimization. Once the basics are out of the way, we can use these basic processes to identify tricks, tips, and patterns in query structure that can be indicative of poor performance.

随着介绍的进行,让我们开始进行优化! 以下是有助于优化的最常见指标列表。 一旦不了解基础知识,我们就可以使用这些基本过程来识别查询结构中的技巧,技巧和模式,这些技巧,技巧和模式可能表明性能不佳。

索引扫描 (Index Scans)

Data may be accessed from an index via either a scan or a seek. A seek is a targeted selection of rows from the table based on a (typically) narrow filter. A scan is when an entire index is searched to return the requested data. If a table contains a million rows, then a scan will need to traverse all million rows to service the query. A seek of the same table can traverse the index’s binary tree quickly to return only the data needed, without the need to inspect the entire table.

可以通过扫描或查找从索引访问数据。 搜索是基于(通常)窄过滤器从表中选择行的目标。 扫描是指搜索整个索引以返回所请求的数据。 如果一个表包含一百万行,则扫描将需要遍历所有一百万行以服务查询。 对同一表的搜索可以快速遍历索引的二叉树,从而仅返回所需的数据,而无需检查整个表。

If there is a legitimate need to return a great deal of data from a table, then an index scan may be the correct operation. If we needed to return 950,000 rows from a million row table, then an index scan makes sense. If we only need to return 10 rows, then a seek would be far more efficient.

如果确实有必要从表中返回大量数据,则索引扫描可能是正确的操作。 如果我们需要从一百万行表中返回950,000行,那么索引扫描就很有意义。 如果我们只需要返回10行,那么查找将更加有效。

Index scans are easy to spot in execution plans:

索引扫描很容易在执行计划中发现:

SELECT
	*
FROM Sales.OrderTracking
INNER JOIN Sales.SalesOrderHeader
ON SalesOrderHeader.SalesOrderID = OrderTracking.SalesOrderID
INNER JOIN Sales.SalesOrderDetail
ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID
WHERE OrderTracking.EventDateTime = '2014-05-29 00:00:00';

We can quickly spot the index scan in the top-right corner of the execution plan. Consuming 90% of the resources of the query, and being labeled as a clustered index scan quickly lets us know what is going on here. STATISTICS IO also shows us a large number of reads against the OrderTracking table:

我们可以在执行计划的右上角快速发现索引扫描。 消耗查询的90%的资源,并被快速标记为聚集索引扫描,这使我们知道这里发生了什么。 STATISTICS IO还向我们显示了对OrderTracking表的大量读取:

Many solutions are available when we have identified an undesired index scan. Here is a quick list of some thoughts to consider when resolving an index scan problem:

当我们确定了不需要的索引扫描时,可以使用许多解决方案。 以下是解决索引扫描问题时要考虑的一些想法的快速列表:

    • EventDateTime? EventDateTime上是否有索引?
    • Is this query executed often enough to warrant this change? Indexes improve read speeds on queries, but will reduce write speeds, so we should add them with caution.

      是否经常执行此查询以保证进行此更改? 索引可提高查询的读取速度,但会降低写入速度,因此我们应谨慎添加它们。
    • Should we discuss this with those responsible for the app to determine a better way to search for this data?

      我们是否应该与负责该应用程序的人员讨论此问题,以便确定搜索此数据的更好方法?
  • EventDataTime in this example), then there may be some other shenanigans here that require our attention! EventDataTime ),则此处可能还有其他需要我们注意的恶作剧!
    • EventDateTIme happens to equal “5-29-2014” in every row in Sales.OrderTracking的每一行中Sales.OrderTracking, then a scan is expected. Similarly, if we were performing a fuzzy string search, an index scan would be difficult to avoid without implementing a Full-Text Index, or some similar feature. EventDateTIme恰好等于“ 5-29-2014”,则应该进行扫描。 同样,如果执行模糊字符串搜索,那么如果不实施全文索引或某些类似功能,就很难避免索引扫描。

As we walk through more examples, we’ll find a wide variety of other ways to identify and resolve undesired index scans.

当我们遍历更多示例时,我们将找到多种其他方法来标识和解决不希望的索引扫描。

联接和WHERE子句周围的函数 (Functions Wrapped Around Joins and WHERE Clauses)

A theme in optimization is a constant focus on joins and the WHERE clause. Since IO is generally our biggest cost, and these are the query components that can limit IO the most, we’ll often find our worst offenders here. The faster we can slice down our data set to only the rows we need, the more efficient query execution will be!

优化的主题是不断关注联接和WHERE子句。 由于IO通常是我们最大的成本,而这些查询组件可能会最大程度地限制IO,因此我们经常在这里找到最糟糕的违规者。 我们将数据集切成仅需要的行的速度越快,查询执行的效率就越高!

When evaluating a WHERE clause, any expressions involved need to be resolved prior to returning our data. If a column contains functions around it, such as DATEPART, SUBSTRING, or CONVERT, then these functions will also need to be resolved. If the function must be evaluated prior to execution to determine a result set, then the entirety of the data set will need to be scanned to complete that evaluation.

在评估WHERE子句时,需要先解决所有涉及的表达式,然后再返回我们的数据。 如果列周围包含函数,例如DATEPART,SUBSTRING或CONVERT,则这些函数也需要解析。 如果必须在执行之前评估功能以确定结果集,则将需要扫描整个数据集以完成该评估。

Consider the following query:

考虑以下查询:

SELECT
	Person.BusinessEntityID,
	Person.FirstName,
	Person.LastName,
	Person.MiddleName
FROM Person.Person
WHERE LEFT(Person.LastName, 3) = 'For';

This will return any rows from Person.Person that have a last name beginning in “For”. Here is how the query performs:

这将返回Person.Person中姓氏以“ For”开头的所有行。 查询的执行方式如下:

Despite only returning 4 rows, the entire index was scanned to return our data. The reason for this behavior is the use of LEFT on Person.LastName. While our query is logically correct and will return the data we want, SQL Server will need to evaluate LEFT against every row in the table before being able to determine which rows fit the filter. This forces an index scan, but luckily one that can be avoided!

尽管仅返回4行,但整个索引仍被扫描以返回我们的数据。 此行为的原因是对Person.LastName使用LEFT。 虽然我们的查询在逻辑上是正确的,并且将返回我们想要的数据,但是SQL Server将需要针对表中的每一行评估LEFT,然后才能确定哪些行适合过滤器。 这会强制进行索引扫描,但幸运的是可以避免!

When faced with functions in the WHERE clause or in a join, consider ways to move the function onto the scalar variable instead. Also think of ways to rewrite the query in such a way that the table columns can be left clean (that is: no functions attached to them!)

当在WHERE子句或联接中遇到函数时,请考虑将函数移至标量变量的方法。 还请考虑以可以使表列保持整洁的方式重写查询的方法(即:没有附加的函数!)

The query above can be rewritten to do just this:

上面的查询可以重写为执行以下操作:

SELECT
	Person.BusinessEntityID,
	Person.FirstName,
	Person.LastName,
	Person.MiddleName
FROM Person.Person
WHERE Person.LastName LIKE 'For%';

By using LIKE and shifting the wildcard logic into the string literal, we have cleaned up the LastName column, which will allow SQL Server full access to seek indexes against it. Here is the performance we see on the rewritten version:

通过使用LIKE并将通配符逻辑转换为字符串文字,我们清理了LastName列,该列将允许SQL Server完全访问权限来为其寻找索引。 这是我们在重写版本上看到的性能:

The relatively minor query tweak we made allowed the query optimizer to utilize an index seek and pull the data we wanted with only 2 logical reads, instead of 117.

我们进行的相对较小的查询调整允许查询优化器利用索引查找并仅通过2个逻辑读取(而不是117个)读取所需的数据。

The theme of this optimization technique is to ensure that columns are left clean! When writing queries, feel free to put complex string/date/numeric logic onto scalar variables or parameters, but not on columns. If you are troubleshooting a poorly performing query and notice functions (system or user-defined) wrapped around column names, then begin thinking of ways to push those functions off into other scalar parts of the query. This will allow SQL Server to seek indexes, rather than scan, and therefore make the most efficient decisions possible when executing the query!

此优化技术的主题是确保列保持干净! 编写查询时,可以将复杂的字符串/日期/数字逻辑放到标量变量或参数上,但不能放到列上。 如果要对性能不佳的查询和通知功能(系统或用户定义的)包裹在列名周围进行故障排除,请开始考虑将这些功能推入查询的其他标量部分的方法。 这将使SQL Server可以查找索引,而不是扫描索引,因此可以在执行查询时做出最有效的决策!

隐式转换 (Implicit Conversions)

Earlier, we demonstrated how wrapping functions around columns can result in unintended table scans, reducing query performance and increasing latency. Implicit conversions behave the exact same way but are far more hidden from plain sight.

之前,我们演示了如何在列周围包装功能会导致意外的表扫描,从而降低查询性能并增加延迟。 隐式转换的行为方式完全相同,但对普通人而言隐藏得多。

When SQL Server compares any values, it needs to reconcile data types. All data types are assigned a precedence in SQL Server and whichever is of the lower precedence will be automatically converted to the data type of higher precedence. For more info on operator precedence, see the link at the end of this article containing the complete list.

当SQL Server比较任何值时,它需要协调数据类型。 在SQL Server中为所有数据类型分配了优先级,并且优先级较低的那个将自动转换为优先级较高的数据类型。 有关运算符优先级的更多信息,请参见本文末尾的包含完整列表的链接。

Some conversions can occur seamlessly, without any performance impact. For example, a VARCHAR(50) and VARCHAR(MAX) can be compared no problem. Similarly, a TINYINT and BIGINT, DATE and DATETIME, or TIME and a VARCHAR representation of a TIME type. Not all data types can be compared automatically, though.

某些转换可以无缝进行,而不会影响性能。 例如,可以比较VARCHAR(50)和VARCHAR(MAX)。 同样,TINYINT和BIGINT,DATE和DATETIME或TIME以及TIME类型的VARCHAR表示形式。 但是,并非所有数据类型都可以自动比较。

Consider the following SELECT query, which is filtered against an indexed column:

考虑以下SELECT查询,该查询是根据索引列进行过滤的:

SELECT
	EMP.BusinessEntityID,
	EMP.LoginID,
	EMP.JobTitle
FROM HumanResources.Employee EMP
WHERE EMP.NationalIDNumber = 658797903;

A quick glance and we assume that this query will result in an index seek and return data to us quite efficiently. Here is the resulting performance:

快速浏览一下,我们假设此查询将导致索引查找并将数据非常有效地返回给我们。 这是产生的性能:

Despite only looking for a single row against an indexed column, we got a table scan for our efforts. What happened? We get a hint from the execution plan in the yellow exclamation mark over the SELECT operation:

尽管只针对索引列只查找一行,但是我们还是进行了表格扫描以查找我们的工作。 发生了什么? 我们从执行计划中的SELECT操作的黄色感叹号中得到了提示:

Hovering over the operator reveals a CONVERT_IMPLICIT warning. Whenever we see this, it is an indication that we are comparing two data types that are different enough from each other that they cannot be automatically converted. Instead, SQL Server converts every single value in the table prior to applying the filter.

将鼠标悬停在运算符上会显示CONVERT_IMPLICIT警告。 每当我们看到这种情况时,就表明我们正在比较两个彼此完全不同以至于无法自动转换的数据类型。 而是,SQL Server在应用筛选器之前转换表中的每个单个值。

SQL Server中的查询优化技术:基础_第2张图片

When we hover over the NationalIDNumber column in SSMS, we can confirm that it is in fact an NVARCHAR(15). The value we are comparing it to is a numeric. The solution to this problem is very similar to when we had a function on a column: Move the conversion over to the scalar value, instead of the column. In this case, we would change the scalar value 658797903 to the string representation, ‘658797903’:

当我们将鼠标悬停在SSMS中的NationalIDNumber列上时,我们可以确认它实际上是NVARCHAR(15)。 我们要与之比较的值是一个数字。 此问题的解决方案与在列上具有函数时非常相似:将转换移至标量值而不是列。 在这种情况下,我们将标量值658797903更改为字符串表示形式'658797903':

SELECT
	EMP.BusinessEntityID,
	EMP.LoginID,
	EMP.JobTitle
FROM HumanResources.Employee EMP
WHERE EMP.NationalIDNumber = '658797903'

This simple change will completely alter how the query optimizer handles the query:

这个简单的更改将完全改变查询优化器处理查询的方式:

SQL Server中的查询优化技术:基础_第3张图片

The result is an index seek instead of a scan, less IO, and the implicit conversion warning is gone from our execution plan.

结果是索引查找而不是扫描,从而减少了IO,并且隐式转换警告已从我们的执行计划中删除。

Implicit conversions are easy to spot as you’ll get a prominent warning from SQL Server in the execution plan whenever it happens. Once you’ve been tipped off to this problem, you can check the data types of the columns indicated in the warning and resolve the issue.

隐式转换很容易发现,因为您会在执行计划中从SQL Server得到明显的警告。 解决此问题后,可以检查警告中指示的列的数据类型并解决问题。

结论 (Conclusion)

Query optimization is a huge topic that can easily become overwhelming without a good dose of focus. The best way to approach a performance problem is to find specific areas of focus that are most likely the cause of latency. A stored procedure could be 10,000 lines long, but only a single line needs to be addressed to resolve the problem. In these scenarios, finding the suspicious, high-cost, high resource-consuming parts of a script can quickly narrow down the search and allow us to solve a problem rather than hunt for it.

查询优化是一个巨大的主题,如果没有足够的重点,很容易变得不知所措。 解决性能问题的最佳方法是找到最有可能导致延迟的特定关注领域。 存储过程的长度可能为10,000行,但是只需一行即可解决该问题。 在这些情况下,找到脚本中可疑,高成本,高资源消耗的部分可以Swift缩小搜索范围,并允许我们解决问题而不是寻找问题。

The information in this article should provide a good starting point to tackling latency and performance problems. Query optimization sometimes requires additional resources, such as adding a new index but often can end up as a freebie. When we can improve performance solely by rewriting a query, we reduce resource consumption at no cost (aside from our time). As a result, query optimization can be a direct source of cost-savings! In addition to saving money, resources, and the sanity of those waiting for queries to complete, there is a great deal of satisfaction to be gained by improving a process at no further cost to anyone else.

本文中的信息应为解决延迟和性能问题提供一个很好的起点。 查询优化有时需要额外的资源,例如添加新索引,但通常最终可以成为免费赠品。 当我们仅通过重写查询就可以提高性能时,我们会免费(除了我们的时间)减少资源消耗。 结果,查询优化可以直接节省成本! 除了节省金钱,资源和等待查询完成的人员的理智之外,通过改进流程也可以获得很多满足,而其他任何人都无需承担任何其他费用。

Thanks for reading, and let’s keep on making things go faster!

感谢您的阅读,让我们继续前进!

目录 (Table of contents)

Query optimization techniques in SQL Server: the basics
Query optimization techniques in SQL Server: tips and tricks
Query optimization techniques in SQL Server: Database Design and Architecture
Query Optimization Techniques in SQL Server: Parameter Sniffing
SQL Server中的查询优化技术:基础
SQL Server中的查询优化技术:提示和技巧
SQL Server中的查询优化技术:数据库设计和体系结构
SQL Server中的查询优化技术:参数嗅探

翻译自: https://www.sqlshack.com/query-optimization-techniques-in-sql-server-the-basics/

你可能感兴趣的:(大数据,编程语言,数据库,python,机器学习)