Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out. This pattern can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.
Many solutions in the cloud involve running tasks that invoke services. In this environment, if a service is subjected to intermittent heavy loads, it can cause performance or reliability issues.
A service could be a component that is part of the same solution as the tasks that utilize it, or it could be a third-party service providing access to frequently used resources such as a cache or a storage service. If the same service is utilized by a number of tasks running concurrently, it can be difficult to predict the volume of requests to which the service might be subjected at any given point in time.
It is possible that a service might experience peaks in demand that cause it to become overloaded and unable to respond to requests in a timely manner. Flooding a service with a large number of concurrent requests may also result in the service failing if it is unable to handle the contention that these requests could cause.
Refactor the solution and introduce a queue between the task and the service. The task and the service run asynchronously. The task posts a message containing the data required by the service to a queue. The queue acts as a buffer, storing the message until it is retrieved by the service. The service retrieves the messages from the queue and processes them. Requests from a number of tasks, which can be generated at a highly variable rate, can be passed to the service through the same message queue. Figure 1 shows this structure.
![Figure 1](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589783.23297652914b37b1fd1c46dce590171b(en-us,pandp.10).png>)
Figure 1 - Using a queue to level the load on a service
The queue effectively decouples the tasks from the service, and the service can handle the messages at its own pace irrespective of the volume of requests from concurrent tasks. Additionally, there is no delay to a task if the service is not available at the time it posts a message to the queue.
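As an illustration of this structure, the sketch below posts a request message to a queue and returns immediately, while the service drains the queue at its own pace. An Azure storage queue and the Microsoft.WindowsAzure.Storage client library are assumed here purely to make the example concrete; any durable message queue could fill the same role, and the queue name and payload format are arbitrary.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public static class LoadLevelingSketch
{
    // Task side: post a request message to the queue instead of calling the service directly.
    public static void PostRequest(string connectionString, string payload)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("service-requests");
        queue.CreateIfNotExists();

        // The task returns as soon as the message is durably stored; it does not
        // wait for the service to process the request.
        queue.AddMessage(new CloudQueueMessage(payload));
    }

    // Service side: drain the queue at the service's own pace.
    public static void ProcessPendingRequests(string connectionString, Action<string> handler)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("service-requests");

        CloudQueueMessage message;
        while ((message = queue.GetMessage()) != null)
        {
            handler(message.AsString);      // Perform the work that the message describes.
            queue.DeleteMessage(message);   // Remove the message only after successful processing.
        }
    }
}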
This pattern provides the following benefits:
Note
Some services may implement throttling if demand reaches a threshold beyond which the system could fail. Throttling may reduce the functionality available. You might be able to implement load leveling with these services to ensure that this threshold is not reached.
Consider the following points when deciding how to implement this pattern:
This pattern is ideally suited to any type of application that uses services that may be subject to overloading.
This pattern might not be suitable if the application expects a response from the service with minimal latency.
A Microsoft Azure web role stores data by using a separate storage service. If a large number of instances of the web role run concurrently, it is possible that the storage service could be overwhelmed and be unable to respond to requests quickly enough to prevent these requests from timing out or failing. Figure 2 highlights this issue.
![Figure 2](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589783.7b5bc0860e63b63f6cc46411f1857135(en-us,pandp.10).png>)
Figure 2 - A service being overwhelmed by a large number of concurrent requests from instances of a web role
To resolve this issue, you can use a queue to level the load between the web role instances and the storage service. However, the storage service is designed to accept synchronous requests and cannot be easily modified to read messages and manage throughput. Therefore, you can introduce a worker role to act as a proxy service that receives requests from the queue and forwards them to the storage service. The application logic in the worker role can control the rate at which it passes requests to the storage service to prevent the storage service from being overwhelmed. Figure 3 shows this solution.
![Figure 3](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589783.ee8b20330d85bb546f581b90ad337680(en-us,pandp.10).png>)
Figure 3 - Using a queue and a worker role to level the load between instances of the web role and the service
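The following sketch suggests one way the proxy worker role might pace its forwarding. The queue name, the pacing interval, and the forwardToStorage delegate (which stands in for the real call to the storage service) are assumptions made for illustration only.

using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage.Queue;

public static class StorageProxy
{
    // Drains the request queue and forwards each message to the storage service,
    // pausing between requests so that the storage service is never flooded.
    public static void Run(CloudQueue queue, Action<string> forwardToStorage, CancellationToken cancel)
    {
        TimeSpan pacing = TimeSpan.FromMilliseconds(100);   // At most ~10 forwarded requests per second.

        while (!cancel.IsCancellationRequested)
        {
            CloudQueueMessage message = queue.GetMessage();
            if (message == null)
            {
                Thread.Sleep(TimeSpan.FromSeconds(1));      // Queue is empty; back off briefly.
                continue;
            }

            forwardToStorage(message.AsString);             // Hand the request to the storage service.
            queue.DeleteMessage(message);

            Thread.Sleep(pacing);                           // Throttle the rate of forwarded requests.
        }
    }
}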
The following patterns and guidance may also be relevant when implementing this pattern:
Enable an application to handle anticipated, temporary failures when it attempts to connect to a service or network resource by transparently retrying an operation that has previously failed in the expectation that the cause of the failure is transient. This pattern can improve the stability of the application.
An application that communicates with elements running in the cloud must be sensitive to the transient faults that can occur in this environment. Such faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that arise when a service is busy.
These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it is likely to be successful. For example, a database service that is processing a large number of concurrent requests may implement a throttling strategy that temporarily rejects any further requests until its workload has eased. An application attempting to access the database may fail to connect, but if it tries again after a suitable delay it may succeed.
In the cloud, transient faults are not uncommon and an application should be designed to handle them elegantly and transparently, minimizing the effects that such faults might have on the business tasks that the application is performing.
If an application detects a failure when it attempts to send a request to a remote service, it can handle the failure by using the following strategies:
For the more common transient failures, the period between retries should be chosen so as to spread requests from multiple instances of the application as evenly as possible. This can reduce the chance of a busy service continuing to be overloaded. If many instances of an application are continually bombarding a service with retry requests, it may take the service longer to recover.
If the request still fails, the application can wait for a further period and make another attempt. If necessary, this process can be repeated with increasing delays between retry attempts until some maximum number of requests have been attempted and failed. The delay time can be increased incrementally, or a timing strategy such as exponential back-off can be used, depending on the nature of the failure and the likelihood that it will be corrected during this time.
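A simple way to produce such increasing delays is an exponential back-off calculation with a small random jitter, so that multiple application instances do not retry in lockstep. The one-minute cap and the jitter range below are illustrative choices rather than prescribed values.

using System;

public static class RetryDelay
{
    private static readonly Random Jitter = new Random();

    // Doubles the delay on each attempt (2, 4, 8, ... seconds), caps it, and adds a
    // small random jitter so that many application instances spread out their retries.
    public static TimeSpan ForAttempt(int retryAttempt)
    {
        double seconds = Math.Min(Math.Pow(2, retryAttempt), 60);
        return TimeSpan.FromSeconds(seconds)
             + TimeSpan.FromMilliseconds(Jitter.Next(0, 1000));
    }
}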
Figure 1 illustrates this pattern. If the request is unsuccessful after a predefined number of attempts, the application should treat the fault as an exception and handle it accordingly.
![Figure 1](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589788.f67c15d0bbd1904bcd7493ae870920a2(en-us,pandp.10).png>)
Figure 1 - Invoking an operation in a hosted service using the Retry pattern
The application should wrap all attempts to access a remote service in code that implements a retry policy matching one of the strategies listed above. Requests sent to different services can be subject to different policies, and some vendors provide libraries that encapsulate this approach. These libraries typically implement policies that are parameterized, and the application developer can specify values for items such as the number of retries and the time between retry attempts.
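The sketch below is not any particular vendor's library; it simply illustrates the kind of parameterized policy such libraries expose, where the retry count, the delay between attempts, and the transient-fault test are all supplied by the caller.

using System;
using System.Threading.Tasks;

// Illustrative stand-in for a vendor-supplied retry policy.
public class SimpleRetryPolicy
{
    private readonly int maxRetries;
    private readonly TimeSpan delayBetweenRetries;
    private readonly Func<Exception, bool> isTransient;

    public SimpleRetryPolicy(int maxRetries, TimeSpan delayBetweenRetries, Func<Exception, bool> isTransient)
    {
        this.maxRetries = maxRetries;
        this.delayBetweenRetries = delayBetweenRetries;
        this.isTransient = isTransient;
    }

    public async Task ExecuteAsync(Func<Task> operation)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            catch (Exception ex) when (attempt < this.maxRetries && this.isTransient(ex))
            {
                // The fault looks transient and retries remain, so wait and try again.
                await Task.Delay(this.delayBetweenRetries);
            }
        }
    }
}

A caller could then wrap a remote call with, for example, policy.ExecuteAsync(() => TransientOperationAsync()), reusing a transient-fault check such as the IsTransient method shown later in this pattern.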
The code in an application that detects faults and retries failing operations should log the details of these failures. This information may be useful to operators. If a service is frequently reported as unavailable or busy, it is often because the service has exhausted its resources. You may be able to reduce the frequency with which these faults occur by scaling out the service. For example, if a database service is continually overloaded, it may be beneficial to partition the database and spread the load across multiple servers.
Note
Microsoft Azure provides extensive support for the Retry pattern. The patterns & practices Transient Fault Handling Block enables an application to handle transient faults in many Azure services using a range of retry strategies. The Microsoft Entity Framework version 6 provides facilities for retrying database operations. Additionally, many of the Azure Service Bus and Azure Storage APIs implement retry logic transparently.
You should consider the following points when deciding how to implement this pattern:
Use this pattern:
This pattern might not be suitable:
This example illustrates an implementation of the Retry pattern. The OperationWithBasicRetryAsync method, shown below, invokes an external service asynchronously through the TransientOperationAsync method (the details of this method will be specific to the service and are omitted from the sample code).
private int retryCount = 3;
...

public async Task OperationWithBasicRetryAsync()
{
    int currentRetry = 0;

    for (; ;)
    {
        try
        {
            // Calling external service.
            await TransientOperationAsync();

            // Return or break.
            break;
        }
        catch (Exception ex)
        {
            Trace.TraceError("Operation Exception");

            currentRetry++;

            // Check if the exception thrown was a transient exception
            // based on the logic in the error detection strategy.
            // Determine whether to retry the operation, as well as how
            // long to wait, based on the retry strategy.
            if (currentRetry > this.retryCount || !IsTransient(ex))
            {
                // If this is not a transient error
                // or we should not retry re-throw the exception.
                throw;
            }
        }

        // Wait to retry the operation.
        // Consider calculating an exponential delay here and
        // using a strategy best suited for the operation and fault.
        await Task.Delay(TimeSpan.FromSeconds(1));
    }
}

// Async method that wraps a call to a remote service (details not shown).
private async Task TransientOperationAsync()
{
    ...
}
The statement that invokes this method is encapsulated within a try/catch block wrapped in a for loop. The for loop exits if the call to the TransientOperationAsync method succeeds without throwing an exception. If the TransientOperationAsync method fails, the catch block examines the reason for the failure, and if it is deemed to be a transient error the code waits for a short delay before retrying the operation.
The for loop also tracks the number of times that the operation has been attempted, and if the code fails three times the exception is assumed to be more long-lasting. If the exception is not transient or it is long-lasting, the catch handler throws an exception. This exception exits the for loop and should be caught by the code that invokes the OperationWithBasicRetryAsync method.
The IsTransient method, shown below, checks for a specific set of exceptions that are relevant to the environment in which the code is run. The definition of a transient exception may vary according to the resources being accessed and the environment in which the operation is being performed.
private bool IsTransient(Exception ex)
{
    // Determine if the exception is transient.
    // In some cases this may be as simple as checking the exception type, in other
    // cases it may be necessary to inspect other properties of the exception.
    if (ex is OperationTransientException)
        return true;

    var webException = ex as WebException;
    if (webException != null)
    {
        // If the web exception contains one of the following status values
        // it may be transient.
        return new[] { WebExceptionStatus.ConnectionClosed,
                       WebExceptionStatus.Timeout,
                       WebExceptionStatus.RequestCanceled }.
                Contains(webException.Status);
    }

    // Additional exception checking logic goes here.
    return false;
}
The following pattern may also be relevant when implementing this pattern:
Design an application so that it can be reconfigured without requiring redeployment or the application being restarted. This helps to maintain availability and minimize downtime.
A primary aim for important applications such as commercial and business websites is to minimize downtime and the consequent interruption to customers and users. However, at times it is necessary to reconfigure the application to change specific behavior or settings while it is deployed and in use. Therefore, it is an advantage for the application to be designed in such a way as to allow these configuration changes to be applied while it is running, and for the components of the application to detect the changes and apply them as soon as possible.
Examples of the kinds of configuration changes to be applied might be adjusting the granularity of logging to assist in debugging a problem with the application, swapping connection strings to use a different data store, or turning on or off specific sections or functionality of the application.
The solution for implementing this pattern depends on the features available in the application hosting environment. Typically, the application code will respond to one or more events that are raised by the hosting infrastructure when it detects a change to the application configuration. This is usually the result of uploading a new configuration file, or in response to changes in the configuration through the administration portal or by accessing an API.
Code that handles the configuration change events can examine the changes and apply them to the components of the application. It is necessary for these components to detect and react to the changes, and so the values they use will usually be exposed as writable properties or methods that the code in the event handler can set to new values or execute. From this point, the components should use the new values so that the required changes to the application behavior occur.
If it is not possible for the components to apply the changes at runtime, it will be necessary to restart the application so that these changes are applied when the application starts up again. In some hosting environments it may be possible to detect these types of changes, and indicate to the environment that the application must be restarted. In other cases it may be necessary to implement code that analyses the setting changes and forces an application restart when necessary.
Figure 1 shows an overview of this pattern.
![Figure 1](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589785.3bec791b019e315e0f28943d7421fd26(en-us,pandp.10).png>)
Figure 1 - A basic overview of this pattern
Most environments expose events raised in response to configuration changes. In those that do not, a polling mechanism that regularly checks for changes to the configuration and applies these changes will be necessary. It may also be necessary to restart the application if the changes cannot be applied at runtime. For example, it may be possible to compare the date and time of a configuration file at preset intervals, and run code to apply the changes when a newer version is found. Another approach would be to incorporate a control in the administration UI of the application, or expose a secured endpoint that can be accessed from outside the application, that executes code that reads and applies the updated configuration.
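Where no change event is available, a polling check along the following lines could be used. The configuration file path and the applyConfiguration callback are assumptions made for the sake of this sketch.

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Polls a configuration file for changes when the hosting environment does not
// raise configuration-change events.
public class ConfigurationPoller
{
    private readonly string configFilePath;
    private readonly Action<string> applyConfiguration;
    private DateTime lastSeenWriteTime;

    public ConfigurationPoller(string configFilePath, Action<string> applyConfiguration)
    {
        this.configFilePath = configFilePath;
        this.applyConfiguration = applyConfiguration;
        this.lastSeenWriteTime = File.GetLastWriteTimeUtc(configFilePath);
    }

    public async Task RunAsync(TimeSpan interval, CancellationToken cancel)
    {
        while (!cancel.IsCancellationRequested)
        {
            try
            {
                await Task.Delay(interval, cancel);
            }
            catch (TaskCanceledException)
            {
                break;    // Polling was stopped.
            }

            var writeTime = File.GetLastWriteTimeUtc(this.configFilePath);
            if (writeTime > this.lastSeenWriteTime)
            {
                this.lastSeenWriteTime = writeTime;
                // A newer configuration file was found; read it and apply the settings.
                this.applyConfiguration(File.ReadAllText(this.configFilePath));
            }
        }
    }
}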
Alternatively, the application could react to some other change in the environment. For example, occurrences of a specific runtime error might change the logging configuration to automatically collect additional information, or the code could use the current date to read and apply a theme that reflects the season or a special event.
Consider the following points when deciding how to implement this pattern:
This pattern is ideally suited for:
This pattern might not be suitable if the runtime components are designed so they can be configured only at initialization time, and the effort of updating those components cannot be justified in comparison to restarting the application and enduring a short downtime.
Microsoft Azure Cloud Services roles detect and expose two events that are raised when the hosting environment detects a change to the ServiceConfiguration.cscfg files:
When you cancel a change in the RoleEnvironment.Changing event you are indicating to Azure that a new setting cannot be applied while the application is running, and that it must be restarted in order to use the new value. Effectively you will cancel a change only if your application or component cannot react to the change at runtime, and requires a restart in order to use the new value.
Note
For more information see RoleEnvironment.Changing Event and Use the RoleEnvironment.Changing Event on MSDN.
To handle the RoleEnvironment.Changing and RoleEnvironment.Changed events you will typically add a custom handler to the event. For example, the following code from the Global.asax.cs class in the Runtime Reconfiguration solution of the examples you can download for this guide shows how to add a custom function named RoleEnvironment_Changed to the event handler chain. This is from the Global.asax.cs file of the example.
Note
The examples for this pattern are in the RuntimeReconfiguration.Web project of the RuntimeReconfiguration solution.
protected void Application_Start(object sender, EventArgs e)
{
ConfigureFromSetting(CustomSettingName);
RoleEnvironment.Changed += this.RoleEnvironment_Changed;
}
In a web or worker role you can use similar code in the OnStart event handler of the role to handle the RoleEnvironment.Changing event. This is from the WebRole.cs file of the example.
public override bool OnStart()
{
// Add the trace listener. The web role process is not configured by web.config.
Trace.Listeners.Add(new DiagnosticMonitorTraceListener());
RoleEnvironment.Changing += this.RoleEnvironment_Changing;
return base.OnStart();
}
Be aware that, in the case of web roles, the OnStart event handler runs in a separate process from the web application process itself. This is why you will typically handle the RoleEnvironment.Changed event handler in the Global.asax file so that you can update the runtime configuration of your web application, and the RoleEnvironment.Changing event in the role itself. In the case of a worker role, you can subscribe to both the RoleEnvironment.Changing and RoleEnvironment.Changed events within the OnStart event handler.
Note
You can store custom configuration settings in the service configuration file, in a custom configuration file, in a database such as Azure SQL Database or SQL Server in a Virtual Machine, or in Azure blob or table storage. You will need to create code that can access the custom configuration settings and apply these to the application—typically by setting the properties of components within the application.
For example, the following custom function reads the value of a setting, whose name is passed as a parameter, from the Azure service configuration file and then applies it to the current instance of a runtime component named SomeRuntimeComponent. This is from the Global.asax.cs file of the example.
private static void ConfigureFromSetting(string settingName)
{
var value = RoleEnvironment.GetConfigurationSettingValue(settingName);
SomeRuntimeComponent.Instance.CurrentValue = value;
}
Note
Some configuration settings, such as those for Windows Identity Framework, cannot be stored in the Azure service configuration file and must be in the App.config or Web.config file.
In Azure, some configuration changes are detected and applied automatically. This includes the configuration of the Windows Azure diagnostics system in the Diagnostics.wadcfg file, which specifies the types of information to collect and how to persist the log files. Therefore, it is only necessary to write code that handles the custom settings you add to the service configuration file. Your code should either:
For example, the following code from the WebRole.cs class in the Runtime Reconfiguration solution of the examples you can download for this guide shows how you can use the RoleEnvironment.Changing event to cancel the update for all settings except the ones that can be applied at runtime without requiring a restart. This example allows a change to the settings named “CustomSetting” to be applied at runtime without restarting the application (the component that uses this setting will be able to read the new value and change its behavior accordingly at runtime). Any other change to the configuration will automatically cause the web or worker role to restart.
private void RoleEnvironment_Changing(object sender,
RoleEnvironmentChangingEventArgs e)
{
var changedSettings = e.Changes.OfType<RoleEnvironmentConfigurationSettingChange>()
.Select(c => c.ConfigurationSettingName).ToList();
Trace.TraceInformation("Changing notification. Settings being changed: "
+ string.Join(", ", changedSettings));
if (changedSettings
.Any(settingName => !string.Equals(settingName, CustomSettingName,
StringComparison.Ordinal)))
{
Trace.TraceInformation("Cancelling dynamic configuration change (restarting).");
// Setting this to true will restart the role gracefully. If Cancel is not
// set to true, and the change is not handled by the application, the
// application will not use the new value until it is restarted (either
// manually or for some other reason).
e.Cancel = true;
}
else
{
Trace.TraceInformation("Handling configuration change without restarting. ");
}
}
Note
This approach demonstrates good practice because it ensures that a change to any setting that the application code is not aware of (and so cannot be sure that it can be applied at runtime) will cause a restart. If any one of the changes is cancelled, the role will be restarted.
Updates that are not cancelled in the RoleEnvironment.Changing event handler can then be detected and applied to the application components after the new configuration has been accepted by the Azure framework. For example, the following code in the Global.asax file of the example solution handles the RoleEnvironment.Changed event. It examines each configuration setting and, when it finds the setting named “CustomSetting”, calls a function (shown earlier) that applies the new setting to the appropriate component in the application.
private void RoleEnvironment_Changed(object sender,
RoleEnvironmentChangedEventArgs e)
{
Trace.TraceInformation("Updating instance with new configuration settings.");
foreach (var settingChange in
e.Changes.OfType<RoleEnvironmentConfigurationSettingChange>())
{
if (string.Equals(settingChange.ConfigurationSettingName,
CustomSettingName,
StringComparison.Ordinal))
{
// Execute a function to update the configuration of the component.
ConfigureFromSetting(CustomSettingName);
}
}
}
Note that if you fail to cancel a configuration change, but do not apply the new value to your application component, then the change will not take effect until the next time that the application is restarted. This may lead to unpredictable behavior, particularly if the hosting role instance is restarted automatically by Azure as part of its regular maintenance operations—at which point the new setting value will be applied.
The following pattern may also be relevant when implementing this pattern:
Coordinate a set of actions across a distributed set of services and other remote resources, attempt to handle faults transparently if any of these actions fail, or undo the effects of the work performed if the system cannot recover from a fault. This pattern adds resiliency to a distributed system by enabling it to recover and retry actions that fail due to transient exceptions, long-lasting faults, and process failures.
An application performs tasks that comprise a number of steps, some of which may invoke remote services or access remote resources. The individual steps may be independent of each other, but they are orchestrated by the application logic that implements the task.
Whenever possible, the application should ensure that the task runs to completion and resolve any failures that might occur when accessing remote services or resources. These failures could occur for a variety of reasons. For example, the network might be down, communications could be interrupted, a remote service may be unresponsive or in an unstable state, or a remote resource might be temporarily inaccessible—perhaps due to resource constraints. In many cases these failures may be transient and can be handled by using the Retry pattern.
If the application detects a more permanent fault from which it cannot easily recover, it must be able to restore the system to a consistent state and ensure integrity of the entire end-to-end operation.
The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate the steps (individual items of work) to be performed as part of the task (the overall process):
The Scheduler arranges for the individual steps that comprise the overall task to be executed and orchestrates their operation. These steps can be combined into a pipeline or workflow, and the Scheduler is responsible for ensuring that the steps in this workflow are performed in the appropriate order. The Scheduler maintains information about the state of the workflow as each step is performed (such as “step not yet started,” “step running,” or “step completed”) and records information about this state. This state information should also include an upper limit of the time allowed for the step to finish (referred to as the Complete By time). If a step requires access to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically communicates with an Agent by using asynchronous request/response messaging. This can be implemented by using queues, although other distributed messaging technologies could be used instead.
Note
The Scheduler performs a similar function to the Process Manager in the Process Manager pattern. The actual workflow is typically defined and implemented by a workflow engine that is controlled by the Scheduler. This approach decouples the business logic in the workflow from the Scheduler.
The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later). If the steps in the workflow being run by the Scheduler utilize several services and resources across different steps, each step might reference a different Agent (this is an implementation detail of the pattern).
The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency will be system-specific) and examines the status of the steps maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or execute the appropriate remedial action (this may involve modifying the status of a step). Note that the recovery or remedial actions are typically implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.
The Scheduler, Agent, and Supervisor are logical components and their physical implementation depends on the technology being used. For example, several logical agents may be implemented as part of a single web service.
The Scheduler maintains information about the progress of the task and the state of each step in a durable data store, referred to as the State Store. The Supervisor can use this information to help determine whether a step has failed. Figure 1 illustrates the relationship between the Scheduler, the Agents, the Supervisor, and the State Store.
![Figure 1](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589780.54fe416e9fb02df01f2ffbccc63e651a(en-us,pandp.10).png>)
Figure 1 - The actors in the Scheduler Agent Supervisor pattern
Note
This diagram shows a simplified illustration of the pattern. In a real implementation, there may be many instances of the Scheduler running concurrently, each handling a subset of tasks. Similarly, the system could run multiple instances of each Agent, or even multiple Supervisors. In this case, Supervisors must coordinate their work with each other carefully to ensure that they don't compete to recover the same failed steps and tasks. The Leader Election pattern provides one possible solution to this problem.
When an application wishes to run a task, it submits a request to the Scheduler. The Scheduler records initial state information about the task and its steps (for example, “step not yet started”) in the State Store and then commences performing the operations defined by the workflow. As the Scheduler starts each step, it updates the information about the state of that step in the State Store (for example, “step running”).
If a step references a remote service or resource, the Scheduler sends a message to the appropriate Agent. The message may contain the information that the Agent needs to pass to the service or access the resource, in addition to the Complete By time for the operation. If the Agent completes its operation successfully, it returns a response to the Scheduler. The Scheduler can then update the state information in the State Store (for example, “step completed”) and perform the next step. This process continues until the entire task is complete.
An Agent can implement any retry logic that is necessary to perform its work. However, if the Agent does not complete its work before the Complete By period expires the Scheduler will assume that the operation has failed. In this case, the Agent should stop its work and not attempt to return anything to the Scheduler (not even an error message), or attempt any form of recovery. The reason for this restriction is that, after a step has timed out or failed, another instance of the Agent may be scheduled to run the failing step (this process is described later).
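One way an Agent could honor this rule is to derive a cancellation timeout from the Complete By time and return nothing once it expires, as in the sketch below; callRemoteServiceAsync is a hypothetical wrapper around the remote service and is passed in by the caller.

using System;
using System.Threading;
using System.Threading.Tasks;

public static class AgentStep
{
    public static async Task<string> ExecuteAsync(
        Func<CancellationToken, Task<string>> callRemoteServiceAsync,
        DateTime completeByUtc)
    {
        TimeSpan remaining = completeByUtc - DateTime.UtcNow;
        if (remaining <= TimeSpan.Zero)
        {
            return null;    // Already past the Complete By time; do nothing and send no reply.
        }

        using (var cts = new CancellationTokenSource(remaining))
        {
            try
            {
                // Any retry logic could wrap this call, provided it also observes the token.
                return await callRemoteServiceAsync(cts.Token);
            }
            catch (OperationCanceledException)
            {
                // The Complete By time expired: stop silently. The Supervisor will notice the
                // expired record in the State Store and arrange recovery.
                return null;
            }
        }
    }
}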
If the Agent itself fails, the Scheduler will not receive a response. The pattern may not make a distinction between a step that has timed out and one that has genuinely failed.
If a step times out or fails, the State Store will contain a record that indicates that the step is running (“step running”), but the Complete By time will have passed. The Supervisor looks for steps such as this and attempts to recover them. One possible strategy is for the Supervisor to update the Complete By value to extend the time available to complete the step, and then send a message to the Scheduler identifying the step that has timed out. The Scheduler can then attempt to repeat this step. However, such a design requires the tasks to be idempotent.
It may be necessary for the Supervisor to prevent the same step from being retried if it continually fails or times out. To achieve this, the Supervisor could maintain a retry count for each step, along with the state information, in the State Store. If this count exceeds a predefined threshold the Supervisor can adopt a strategy such as waiting for an extended period before notifying the Scheduler that it should retry the step, in the expectation that the fault will be resolved during this period. Alternatively, the Supervisor can send a message to the Scheduler to request the entire task be undone by implementing a Compensating Transaction (this approach will depend on the Scheduler and Agents providing the information necessary to implement the compensating operations for each step that completed successfully).
Note
It is not the purpose of the Supervisor to monitor the Scheduler and Agents, and restart them if they fail. This aspect of the system should be handled by the infrastructure in which these components are running. Similarly, the Supervisor should not have knowledge of the actual business operations that the tasks being performed by the Scheduler are running (including how to compensate should these tasks fail). This is the purpose of the workflow logic implemented by the Scheduler. The sole responsibility of the Supervisor is to determine whether a step has failed and arrange either for it to be repeated or for the entire task containing the failed step to be undone.
If the Scheduler is restarted after a failure, or the workflow being performed by the Scheduler terminates unexpectedly, the Scheduler should be able to determine the status of any in-flight task that it was handling when it failed, and be prepared to resume this task from the point at which it failed. The implementation details of this process are likely to be system specific. If the task cannot be recovered, it may be necessary to undo the work already performed by the task. This may also require implementing a Compensating Transaction.
The key advantage of this pattern is that the system is resilient in the event of unexpected temporary or unrecoverable failures. The system can be constructed to be self-healing. For example, if an Agent or the Scheduler crashes, a new one can be started and the Supervisor can arrange for a task to be resumed. If the Supervisor fails, another instance can be started and can take over from where the failure occurred. If the Supervisor is scheduled to run periodically, a new instance may be automatically started after a predefined interval. The State Store may be replicated to achieve an even greater degree of resiliency.
You should consider the following points when deciding how to implement this pattern:
Use this pattern when a process that runs in a distributed environment such as the cloud must be resilient to communications failure and/or operational failure.
This pattern might not be suitable for tasks that do not invoke remote services or access remote resources.
A web application that implements an ecommerce system has been deployed on Microsoft Azure. Users can run this application to browse the products available from an organization, and place orders for these products. The user interface runs as a web role, and the order processing elements of the application are implemented as a set of worker roles. Part of the order processing logic involves accessing a remote service, and this aspect of the system could be prone to transient or more long-lasting faults. For this reason, the designers used the Scheduler Agent Supervisor pattern to implement the order processing elements of the system.
When a customer places an order, the application constructs a message that describes the order and posts this message to a queue. A separate Submission process, running in a worker role, retrieves this message, inserts the details of the order into the Orders database, and creates a record for the order process in the State Store. Note that the inserts into the Orders database and the State Store are performed as part of the same operation. The Submission process is designed to ensure that both inserts complete together.
The state information that the Submission process creates for the order includes:
OrderID: The ID of the order in the Orders database.
LockedBy: The instance ID of the worker role handling the order. There may be multiple current instances of the worker role running the Scheduler, but each order should only be handled by a single instance.
CompleteBy: The time by which the order should be processed.
ProcessState: The current state of the task handling the order. The possible states are Pending, Processing, Processed, and Error.
FailureCount: The number of times that processing has been attempted for the order.
In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy and CompleteBy fields are set to null, the ProcessState field is set to Pending, and the FailureCount field is set to 0.
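One possible shape for this record in the State Store is sketched below; the class and property names are illustrative rather than part of the pattern.

using System;

// One possible shape for the order-processing record held in the State Store.
public class OrderProcessRecord
{
    public Guid OrderID { get; set; }          // ID of the order in the Orders database.
    public string LockedBy { get; set; }       // Instance ID of the worker role handling the order (null when unclaimed).
    public DateTime? CompleteBy { get; set; }  // Deadline by which the order should be processed (null when unclaimed).
    public string ProcessState { get; set; }   // Pending, Processing, Processed, or Error.
    public int FailureCount { get; set; }      // Number of processing attempts so far.
}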
Note
In this example, the order handling logic is relatively simple and only comprises a single step that invokes a remote service. In a more complex multi-step scenario, the Submission process would likely involve several steps, and so several records would be created in the State Store—each one describing the state of an individual step.
The Scheduler also runs as part of a worker role and implements the business logic that handles the order. An instance of the Scheduler polling for new orders examines the State Store for records where the LockedBy field is null and the ProcessState field is Pending. When the Scheduler finds a new order, it immediately populates the LockedBy field with its own instance ID, sets the CompleteBy field to an appropriate time, and sets the ProcessState field to Processing. The code that does this is designed to be exclusive and atomic to ensure that two concurrent instances of the Scheduler cannot attempt to handle the same order simultaneously.
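If the State Store were a SQL database, the claim could be made exclusive and atomic with a single conditional UPDATE, as in the following sketch. The table and column names mirror the state record described above and are assumptions made for illustration.

using System;
using System.Data.SqlClient;

public static class SchedulerClaim
{
    // Attempts to claim one pending order for this Scheduler instance. The single
    // conditional UPDATE makes the claim atomic: two instances cannot both see the
    // row as Pending and unlocked and both succeed.
    public static bool TryClaimOrder(string connectionString, Guid orderId, string instanceId, TimeSpan processingWindow)
    {
        const string sql =
            @"UPDATE OrderProcessState
                 SET LockedBy = @instanceId,
                     CompleteBy = @completeBy,
                     ProcessState = 'Processing'
               WHERE OrderID = @orderId
                 AND ProcessState = 'Pending'
                 AND LockedBy IS NULL";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@instanceId", instanceId);
            command.Parameters.AddWithValue("@completeBy", DateTime.UtcNow.Add(processingWindow));
            command.Parameters.AddWithValue("@orderId", orderId);

            connection.Open();
            // If another instance claimed the order first, no rows are affected.
            return command.ExecuteNonQuery() == 1;
        }
    }
}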
The Scheduler then runs the business workflow to process the order asynchronously, passing it the value in the OrderID field from the State Store. The workflow handling the order retrieves the details of the order from the Orders database and performs its work. When a step in the order processing workflow needs to invoke the remote service, it uses an Agent. The workflow step communicates with the Agent by using a pair of Azure Service Bus message queues acting as a request/response channel. Figure 2 shows a high-level view of the solution.
![Figure 2](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589780.4227b7f0d0c87afdd6571a7d743ab806(en-us,pandp.10).png>)
Figure 2 - Using the Scheduler Agent Supervisor pattern to handle orders in an Azure solution
The message sent to the Agent from a workflow step describes the order and includes the CompleteBy time. If the Agent receives a response from the remote service before the CompleteBy time expires, it constructs a reply message that it posts on the Service Bus queue on which the workflow is listening. When the workflow step receives the valid reply message, it completes its processing and the Scheduler sets the ProcessState field of the order state to Processed. At this point, the order processing has completed successfully.
If the CompleteBy time expires before the Agent receives a response from the remote service, the Agent simply halts its processing and terminates handling the order. Similarly, if the workflow handling the order exceeds the CompleteBy time, it also terminates. In both of these cases, the state of the order in the State Store remains set to Processing, but the CompleteBy time indicates that the time for processing the order has passed and the process is deemed to have failed. Note that if the Agent that is accessing the remote service, or the workflow that is handling the order (or both) terminate unexpectedly, the information in the State Store will again remain set to Processing and eventually will have an expired CompleteBy value.
If the Agent detects an unrecoverable non-transient fault while it is attempting to contact the remote service, it can send an error response back to the workflow. The Scheduler can set the status of the order to Error and raise an event that alerts an operator. The operator can then attempt to resolve the reason for the failure manually and resubmit the failed processing step.
The Supervisor periodically examines the State Store looking for orders with an expired CompleteBy value. If the Supervisor finds such a record, it increments the FailureCount field. If the FailureCount value is below a specified threshold value, the Supervisor resets the LockedBy field to null, updates the CompleteBy field with a new expiration time, and sets the ProcessState field to Pending. An instance of the Scheduler can pick up this order and perform its processing as before. If the FailureCount value exceeds a specified threshold, the reason for the failure is assumed to be non-transient. The Supervisor sets the status of the order to Error and raises an event that alerts an operator, as previously described.
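A single Supervisor pass over the State Store might look like the following sketch, which reuses the OrderProcessRecord shape outlined earlier; the IStateStore interface, the failure threshold, and the retry window are assumptions made for illustration.

using System;
using System.Collections.Generic;

public interface IStateStore
{
    IEnumerable<OrderProcessRecord> GetExpiredProcessingRecords(DateTime utcNow);
    void Update(OrderProcessRecord record);
}

public class Supervisor
{
    private const int MaxFailures = 3;
    private static readonly TimeSpan RetryWindow = TimeSpan.FromMinutes(5);
    private readonly IStateStore stateStore;

    public Supervisor(IStateStore stateStore)
    {
        this.stateStore = stateStore;
    }

    // One pass over the State Store, typically run on a timer or by a scheduler service.
    public void RunOnce()
    {
        foreach (var record in this.stateStore.GetExpiredProcessingRecords(DateTime.UtcNow))
        {
            record.FailureCount++;

            if (record.FailureCount <= MaxFailures)
            {
                // Hand the order back so that a Scheduler instance can retry it.
                record.LockedBy = null;
                record.CompleteBy = DateTime.UtcNow.Add(RetryWindow);
                record.ProcessState = "Pending";
            }
            else
            {
                // Assume the fault is not transient; flag the order for an operator.
                record.ProcessState = "Error";
            }

            this.stateStore.Update(record);
        }
    }
}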
Note
In this example, the Supervisor is implemented in a separate worker role. You can utilize a variety of strategies to arrange for the Supervisor task to be run, including using the Azure Scheduler service (not to be confused with the Scheduler component in this pattern). For more information about the Azure Scheduler service, visit the Scheduler page.
Although it is not shown in this example, the Scheduler may need to keep the application that submitted the order in the first place informed about the progress and status of the order. The application and the Scheduler are isolated from each other to eliminate any dependencies between them. The application has no knowledge of which instance of the Scheduler is handling the order, and the Scheduler is unaware of which specific application instance posted the order.
To enable the order status to be reported, the application could use its own private response queue. The details of this response queue would be included as part of the request sent to the Submission process, which would include this information in the State Store. The Scheduler would then post messages to this queue indicating the status of the order (“request received,” “order completed,” “order failed,” and so on). It should include the Order ID in these messages so that they can be correlated with the original request by the application.
The following patterns and guidance may also be relevant when implementing this pattern:
Divide a data store into a set of horizontal partitions or shards. This pattern can improve scalability when storing and accessing large volumes of data.
A data store hosted by a single server may be subject to the following limitations:
Scaling vertically by adding more disk capacity, processing power, memory, and network connections may postpone the effects of some of these limitations, but it is likely to be only a temporary solution. A commercial cloud application capable of supporting large numbers of users and high volumes of data must be able to scale almost indefinitely, so vertical scaling is not necessarily the best solution.
Divide the data store into horizontal partitions or shards. Each shard has the same schema, but holds its own distinct subset of the data. A shard is a data store in its own right (it can contain the data for many entities of different types), running on a server acting as a storage node.
This pattern offers the following benefits:
When dividing a data store up into shards, decide which data should be placed in each shard. A shard typically contains items that fall within a specified range determined by one or more attributes of the data. These attributes form the shard key (sometimes referred to as the partition key). The shard key should be static. It should not be based on data that might change.
Sharding physically organizes the data. When an application stores and retrieves data, the sharding logic directs the application to the appropriate shard. This sharding logic may be implemented as part of the data access code in the application, or it could be implemented by the data storage system if it transparently supports sharding.
Abstracting the physical location of the data in the sharding logic provides a high level of control over which shards contain which data, and enables data to migrate between shards without reworking the business logic of an application should the data in the shards need to be redistributed later (for example, if the shards become unbalanced). The tradeoff is the additional data access overhead required in determining the location of each data item as it is retrieved.
To ensure optimal performance and scalability, it is important to split the data in a way that is appropriate for the types of queries the application performs. In many cases, it is unlikely that the sharding scheme will exactly match the requirements of every query. For example, in a multi-tenant system an application may need to retrieve tenant data by using the tenant ID, but it may also need to look up this data based on some other attribute such as the tenant’s name or location. To handle these situations, implement a sharding strategy with a shard key that supports the most commonly performed queries.
If queries regularly retrieve data by using a combination of attribute values, it may be possible to define a composite shard key by concatenating attributes together. Alternatively, use a pattern such as Index Table to provide fast lookup to data based on attributes that are not covered by the shard key.
Three strategies are commonly used when selecting the shard key and deciding how to distribute data across shards. Note that there does not have to be a one-to-one correspondence between shards and the servers that host them—a single server can host multiple shards. The strategies are:
The Lookup strategy. In this strategy the sharding logic implements a map that routes a request for data to the shard that contains that data by using the shard key. In a multi-tenant application all the data for a tenant might be stored together in a shard by using the tenant ID as the shard key. Multiple tenants might share the same shard, but the data for a single tenant will not be spread across multiple shards. Figure 1 shows an example of this strategy.
![Figure 1](<https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589797.8b9ba4351b0a38c364b862ac9cb1a744(en-us,pandp.10).png>)
Figure 1 - Sharding tenant data based on tenant IDs
图1-基于租户 ID 的租户数据分片
The mapping between the shard key and the physical storage may be based on physical shards where each shard key maps to a physical partition. Alternatively, a technique that provides more flexibility when rebalancing shards is to use a virtual partitioning approach where shard keys map to the same number of virtual shards, which in turn map to fewer physical partitions. In this approach, an application locates data by using a shard key that refers to a virtual shard, and the system transparently maps virtual shards to physical partitions. The mapping between a virtual shard and a physical partition can change without requiring the application code to be modified to use a different set of shard keys.
碎片密钥和物理存储之间的映射可能基于物理碎片,其中每个碎片密钥映射到一个物理分区。或者,在重新平衡碎片时提供更多灵活性的一种技术是使用虚拟分区方法,在这种方法中,碎片键映射到相同数量的虚拟碎片,而这些虚拟碎片又映射到更少的物理分区。在这种方法中,应用程序通过使用引用虚拟碎片的碎片键来定位数据,系统透明地将虚拟碎片映射到物理分区。虚拟碎片和物理分区之间的映射可以更改,而无需修改应用程序代码以使用不同的碎片键集。
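A minimal sketch of this lookup approach for a multi-tenant system (the class and map names are illustrative, not taken from the sample code): the first map resolves a tenant ID to a virtual shard, and a second map, typically loaded from a root database, resolves the virtual shard to the physical partition that currently hosts it.
using System.Collections.Generic;
public class LookupShardResolver
{
    private readonly IDictionary<string, int> tenantToVirtualShard;
    private readonly IDictionary<int, string> virtualShardToConnectionString;
    public LookupShardResolver(
        IDictionary<string, int> tenantToVirtualShard,
        IDictionary<int, string> virtualShardToConnectionString)
    {
        this.tenantToVirtualShard = tenantToVirtualShard;
        this.virtualShardToConnectionString = virtualShardToConnectionString;
    }
    public string ResolveConnectionString(string tenantId)
    {
        // Rebalancing only changes the virtual-to-physical map; application code
        // continues to use the same tenant IDs as shard keys.
        var virtualShard = this.tenantToVirtualShard[tenantId];
        return this.virtualShardToConnectionString[virtualShard];
    }
}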
The Range strategy. This strategy groups related items together in the same shard, and orders them by shard key—the shard keys are sequential. It is useful for applications that frequently retrieve sets of items by using range queries (queries that return a set of data items for a shard key that falls within a given range). For example, if an application regularly needs to find all orders placed in a given month, this data can be retrieved more quickly if all orders for a month are stored in date and time order in the same shard. If each order was stored in a different shard, they would have to be fetched individually by performing a large number of point queries (queries that return a single data item). Figure 2 shows an example of this strategy.
Range 策略。这种策略将相关项目组合在同一个碎片中,并按照碎片键——碎片键是连续的——对它们进行排序。对于通过使用范围查询(查询返回属于给定范围的碎片键的一组数据项)频繁检索项集的应用程序来说,它非常有用。例如,如果应用程序经常需要查找给定月份的所有订单,那么将一个月的所有订单按日期和时间顺序存储在同一碎片中,就可以更快地检索这些数据。如果每个订单都存储在不同的碎片中,那么就必须通过执行大量的点查询(返回单个数据项的查询)来单独获取订单。图2显示了此策略的一个示例。
![Figure 2 - Storing sequential sets (ranges) of data in shards](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589797.b81c120e4f92520569ce46d25dafc6d9%28en-us,pandp.10%29.png)
Figure 2 - Storing sequential sets (ranges) of data in shards
图2-在碎片中存储数据的顺序集(范围)
In this example, the shard key is a composite key comprising the order month as the most significant element, followed by the order day and the time. The data for orders is naturally sorted when new orders are created and appended to a shard. Some data stores support two-part shard keys comprising a partition key element that identifies the shard and a row key that uniquely identifies an item within the shard. Data is usually held in row key order within the shard. Items that are subject to range queries and need to be grouped together can use a shard key that has the same value for the partition key but a unique value for the row key.
在此示例中,碎片键是一个复合键,其中订单月份是最重要的元素,其次是订单日期和时间。当创建新订单并将其追加到一个分片时,订单数据将自然进行排序。一些数据存储支持由两部分组成的碎片键,包括标识碎片的分区键元素和唯一标识碎片内项的行键。数据通常按行键顺序保存在分片中。受范围查询影响并需要组合在一起的项目可以使用一个分片键,该分片键对于分区键具有相同的值,但对于行键具有唯一的值。
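As a hedged sketch of such a two-part key (the entity and formats are illustrative, not the guidance's), an order might use the month as the partition key and a chronologically sortable row key:
using System;
public static class OrderShardKeys
{
    // Partition key: one partition per calendar month, for example "201406".
    public static string GetPartitionKey(DateTime orderDateUtc)
    {
        return orderDateUtc.ToString("yyyyMM");
    }
    // Row key: day and time first so rows sort chronologically within the
    // partition, with the order ID appended to keep the key unique.
    public static string GetRowKey(DateTime orderDateUtc, Guid orderId)
    {
        return string.Format("{0:ddHHmmss}-{1:N}", orderDateUtc, orderId);
    }
}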
The Hash strategy. The purpose of this strategy is to reduce the chance of hotspots in the data. It aims to distribute the data across the shards in a way that achieves a balance between the size of each shard and the average load that each shard will encounter. The sharding logic computes the shard in which to store an item based on a hash of one or more attributes of the data. The chosen hashing function should distribute data evenly across the shards, possibly by introducing some random element into the computation. Figure 3 shows an example of this strategy.
哈希策略。这种策略的目的是减少数据中出现热点的机会。它的目标是在每个碎片之间分布数据,以实现每个碎片的大小和每个碎片将遇到的平均负载之间的平衡。分片逻辑根据数据的一个或多个属性的哈希值计算要在其中存储项的分片。所选的散列函数应该将数据均匀地分布在碎片上,可能需要在计算中引入一些随机元素。图3显示了此策略的一个示例。
![Figure 3 - Sharding tenant data based on a hash of tenant IDs](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589797.41432a1160682c42ed3ee1dc6dbc3dc0%28en-us,pandp.10%29.png)
Figure 3 - Sharding tenant data based on a hash of tenant IDs
图3-基于租户 ID 散列的租户数据分片
To understand the advantage of the Hash strategy over other sharding strategies, consider how a multi-tenant application that enrolls new tenants sequentially might assign the tenants to shards in the data store. When using the Range strategy, the data for tenants 1 to n will all be stored in shard A, the data for tenants n+1 to m will all be stored in shard B, and so on. If the most recently registered tenants are also the most active, most data activity will occur in a small number of shards—which could cause hotspots. In contrast, the Hash strategy allocates tenants to shards based on a hash of their tenant ID. This means that sequential tenants are most likely to be allocated to different shards, as shown in Figure 3 for tenants 55 and 56, which will distribute the load across these shards.
为了理解 Hash 策略相对于其他分片策略的优势,请考虑一下按顺序登记新租户的多租户应用程序如何将租户分配给数据存储区中的分片。在使用 Range 策略时,租户1到 n 的数据将全部存储在分片 A 中,租户 n + 1到 m 的数据将全部存储在分片 B 中,依此类推。如果最近注册的租户也是最活跃的,那么大多数数据活动将发生在少量碎片中,这可能会导致热点。相反,Hash 策略根据租户 ID 的散列将租户分配给碎片。这意味着顺序租户最有可能被分配到不同的分片,如图3中的55和56租户所示,它们将负载分配到这些分片上。
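A minimal sketch of this kind of hash-based placement (illustrative only; the guidance does not prescribe a specific hash function). An explicit, deterministic hash is used rather than string.GetHashCode, which is not guaranteed to be stable across processes or runtime versions.
public static class HashShardingLogic
{
    public static int GetShardIndex(string tenantId, int shardCount)
    {
        // Simple deterministic hash; any stable, well-distributed hash will do.
        unchecked
        {
            int hash = 23;
            foreach (var c in tenantId)
            {
                hash = (hash * 31) + c;
            }
            return (hash & 0x7FFFFFFF) % shardCount;
        }
    }
}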
The following table lists the main advantages and considerations for these three sharding strategies.
下表列出了这三种分片策略的主要优点和注意事项。
Strategy策略 | Advantages好处 | Considerations考虑因素 |
---|---|---|
Lookup查找 | More control over the way that shards are configured and used.更多地控制碎片的配置和使用方式。Using virtual shards reduces the impact when rebalancing data because new physical partitions can be added to even out the workload. The mapping between a virtual shard and the physical partitions that implement the shard can be modified without affecting application code that uses a shard key to store and retrieve data.使用虚拟碎片可以减少重新平衡数据时的影响,因为可以添加新的物理分区来均衡工作负载。可以修改虚拟碎片和实现碎片的物理分区之间的映射,而不会影响使用碎片键存储和检索数据的应用程序代码。 | Looking up shard locations can impose an additional overhead.查找碎片位置会增加额外的开销。 |
Range范围 | Easy to implement and works well with range queries because they can often fetch multiple data items from a single shard in a single operation.易于实现并且可以很好地处理范围查询,因为它们通常可以在单个操作中从单个碎片中获取多个数据项。Easier data management. For example, if users in the same region are in the same shard, updates can be scheduled in each time zone based on the local load and demand pattern.更容易的数据管理。例如,如果相同区域中的用户位于相同的分片中,则可以根据本地负载和需求模式在每个时区中调度更新。 | May not provide optimal balancing between shards.可能无法在碎片之间提供最佳平衡。Rebalancing shards is difficult and may not resolve the problem of uneven load if the majority of activity is for adjacent shard keys.如果大部分活动是针对相邻的碎片键,那么重新平衡碎片是困难的,并且可能无法解决负载不均匀的问题。 |
Hash哈希 | Better chance of a more even data and load distribution.数据和负载分布更均匀的机会更大。Request routing can be accomplished directly by using the hash function. There is no need to maintain a map.请求路由可以通过使用散列函数直接完成。不需要维护映射。 | Computing the hash may impose an additional overhead.计算散列可能会增加额外的开销。Rebalancing shards is difficult.重新平衡碎片是困难的。 |
Most common sharding schemes implement one of the approaches described above, but you should also consider the business requirements of your applications and their patterns of data usage. For example, in a multi-tenant application:
大多数常见的分片方案实现了上述方法之一,但是您还应该考虑应用程序的业务需求及其数据使用模式。例如,在多租户应用程序中:
Each of the sharding strategies implies different capabilities and levels of complexity for managing scale in, scale out, data movement, and maintaining state.
每个分片策略都意味着管理扩展、扩展、数据移动和维护状态的能力和复杂程度不同。
The Lookup strategy permits scaling and data movement operations to be carried out at the user level, either online or offline. The technique is to suspend some or all user activity (perhaps during off-peak periods), move the data to the new virtual partition or physical shard, change the mappings, invalidate or refresh any caches that hold this data, and then allow user activity to resume. Often this type of operation can be centrally managed. The Lookup strategy requires state to be highly cacheable and replica friendly.
查找策略允许在用户级别(在线或离线)执行伸缩和数据移动操作。该技术是暂停部分或全部用户活动(可能在非高峰期) ,将数据移动到新的虚拟分区或物理分片,更改映射,使保存这些数据的任何缓存失效或刷新,然后允许用户活动恢复。这种类型的操作通常可以集中管理。查找策略要求状态具有高度可缓存性并且对副本友好。
The Range strategy imposes some limitations on scaling and data movement operations, which must typically be carried out when a part or all of the data store is offline because the data must be split and merged across the shards. Moving the data to rebalance shards may not resolve the problem of uneven load if the majority of activity is for adjacent shard keys or data identifiers that are within the same range. The Range strategy may also require some state to be maintained in order to map ranges to the physical partitions.
Range 策略对缩放和数据移动操作施加了一些限制,这些操作通常必须在部分或全部数据存储脱机时执行,因为数据必须通过分片进行拆分和合并。如果大部分活动是针对相邻的碎片键或相同范围内的数据标识符,那么将数据移动到重新平衡碎片可能无法解决负载不均衡的问题。Range 策略还可能需要维护某些状态,以便将范围映射到物理分区。
The Hash strategy makes scaling and data movement operations more complex because the partition keys are hashes of the shard keys or data identifiers. The new location of each shard must be determined from the hash function, or the function modified to provide the correct mappings. However, the Hash strategy does not require maintenance of state.
哈希策略使缩放和数据移动操作更加复杂,因为分区键是碎片键或数据标识符的哈希。每个分片的新位置必须通过散列函数确定,或者通过修改函数来提供正确的映射。但是,Hash 策略不需要维护状态。
Consider the following points when deciding how to implement this pattern:
在决定如何实现此模式时,请考虑以下几点:
Sharding is complementary to other forms of partitioning, such as vertical partitioning and functional partitioning. For example, a single shard may contain entities that have been partitioned vertically, and a functional partition may be implemented as multiple shards. For more information about partitioning, see the Data Partitioning Guidance.
分片与其他形式的分区(如垂直分区和功能分区)相辅相成。例如,单个碎片可能包含已经垂直分区的实体,而功能分区可以实现为多个碎片。有关分区的详细信息,请参阅数据分区指南。
Keep shards balanced so that they all handle a similar volume of I/O. As data is inserted and deleted, it may be necessary to periodically rebalance the shards to guarantee an even distribution and to reduce the chance of hotspots. Rebalancing can be an expensive operation. To reduce the frequency with which rebalancing becomes necessary you should plan for growth by ensuring that each shard contains sufficient free space to handle the expected volume of changes. You should also develop strategies and scripts that you can use to quickly rebalance shards should this become necessary.
保持碎片的平衡,以便它们都能处理相似的 I/O 量。随着数据的插入和删除,可能需要定期重新平衡碎片,以保证均匀分布,并减少出现热点的机会。再平衡可能是一项代价高昂的操作。为了减少重新平衡变得必要的频率,您应该通过确保每个碎片包含足够的可用空间来处理预期的更改量来规划增长。您还应该开发策略和脚本,以便在必要时能够快速重新平衡碎片。
Use stable data for the shard key. If the shard key changes, the corresponding data item may have to move between shards, increasing the amount of work performed by update operations. For this reason, avoid basing the shard key on potentially volatile information. Instead, look for attributes that are invariant or that naturally form a key.
对碎片键使用稳定的数据。如果碎片键发生更改,则相应的数据项可能必须在碎片之间移动,从而增加更新操作执行的工作量。出于这个原因,请避免将碎片键基于可能不稳定的信息。相反,应该寻找不变的属性或者自然形成键的属性。
Ensure that shard keys are unique. For example, avoid using auto-incrementing fields as the shard key. In some systems, auto-incremented fields may not be coordinated across shards, possibly resulting in items in different shards having the same shard key.
确保碎片键是唯一的。例如,避免使用自动递增字段作为碎片键。在某些系统中,自动递增的字段可能无法跨碎片进行协调,这可能导致不同碎片中的项具有相同的碎片键。
Note 注意
Auto-incremented values in fields that do not comprise the shard key can also cause problems. For example, if you use auto-incremented fields to generate unique IDs, then two different items located in different shards may be assigned the same ID.
在不包含碎片键的字段中自动增加值也可能导致问题。例如,如果使用自动递增的字段来生成唯一的 ID,那么位于不同碎片中的两个不同项可能被分配相同的 ID。
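One common way to avoid this, sketched below with a hypothetical entity type, is to generate identifiers that are unique regardless of the shard they are written to (for example, GUIDs) instead of relying on per-shard auto-incremented columns.
using System;
public class Order
{
    public Order(string tenantId)
    {
        // Unique across all shards, unlike a per-shard auto-incremented value.
        this.OrderId = Guid.NewGuid();
        // Stable shard key, for example the tenant that owns the order.
        this.TenantId = tenantId;
    }
    public Guid OrderId { get; private set; }
    public string TenantId { get; private set; }
}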
It may not be possible to design a shard key that matches the requirements of every possible query against the data. Shard the data to support the most frequently performed queries, and if necessary create secondary index tables to support queries that retrieve data by using criteria based on attributes that are not part of the shard key. For more information, see the Index Table pattern.
设计一个能够匹配每个可能查询对数据的需求的分片密钥可能是不可能的。分片数据以支持执行频率最高的查询,并在必要时创建辅助索引表以支持通过使用基于不属于分片键的属性的条件来检索数据的查询。有关更多信息,请参见索引表模式。
Queries that access only a single shard will be more efficient than those that retrieve data from multiple shards, so avoid implementing a sharding scheme that results in applications performing large numbers of queries that join data held in different shards. Remember that a single shard can contain the data for multiple types of entities. Consider denormalizing your data to keep related entities that are commonly queried together (such as the details of customers and the orders that they have placed) in the same shard to reduce the number of separate reads that an application performs.
只访问单个碎片的查询比从多个碎片检索数据的查询效率更高,因此要避免实现分片方案,因为该方案会导致应用程序执行大量连接不同碎片中存储的数据的查询。请记住,单个碎片可以包含多种实体类型的数据。考虑将数据反规范化,以便将通常被查询的相关实体(例如客户的详细信息和他们下的订单)保存在同一个分片中,从而减少应用程序执行的单独读取的次数。
Note 注意
If an entity in one shard references an entity stored in another shard, include the shard key for the second entity as part of the schema for the first entity. This can help to improve the performance of queries that reference related data across shards.
如果一个分片中的实体引用存储在另一个分片中的实体,则将第二个实体的分片密钥作为第一个实体的架构的一部分包含在内。这有助于提高跨碎片引用相关数据的查询的性能。
If an application must perform queries that retrieve data from multiple shards, it may be possible to fetch this data by using parallel tasks. Examples include fan-out queries, where data from multiple shards is retrieved in parallel and then aggregated into a single result. However, this approach inevitably adds some complexity to the data access logic of a solution.
如果应用程序必须执行从多个碎片检索数据的查询,则可以通过使用并行任务来获取这些数据。示例包括扇出查询,即并行检索来自多个碎片的数据,然后将其聚合为单个结果。然而,这种方法不可避免地会给解决方案的数据访问逻辑增加一些复杂性。
For many applications, creating a larger number of small shards can be more efficient than having a small number of large shards because they can offer increased opportunities for load balancing. This approach can also be useful if you anticipate the need to migrate shards from one physical location to another. Moving a small shard is quicker than moving a large one.
对于许多应用程序来说,创建大量小碎片可能比拥有少量大碎片更有效,因为它们可以提供更多的负载平衡机会。如果您预期需要将碎片从一个物理位置迁移到另一个物理位置,那么这种方法也很有用。移动一个小的碎片比移动一个大的要快。
Make sure that the resources available to each shard storage node are sufficient to handle the scalability requirements in terms of data size and throughput. For more information, see the section “Designing Partitions for Scalability” in the Data Partitioning Guidance.
确保每个碎片存储节点可用的资源足以处理数据大小和吞吐量方面的可伸缩性需求。有关更多信息,请参见数据分区指南中的“为可伸缩性设计分区”部分。
Consider replicating reference data to all shards. If an operation that retrieves data from a shard also references static or slow-moving data as part of the same query, add this data to the shard. The application can then fetch all of the data for the query easily, without having to make an additional round trip to a separate data store.
考虑将引用数据复制到所有碎片。如果从分片检索数据的操作也引用静态或缓慢移动的数据作为同一查询的一部分,则将此数据添加到分片中。然后,应用程序可以很容易地获取查询的所有数据,而无需到单独的数据存储区进行额外的往返。
Note 注意
If reference data held in multiple shards changes, the system must synchronize these changes across all shards. The system may experience a degree of inconsistency while this synchronization occurs. If you follow this approach, you should design your applications to be able to handle this inconsistency.
如果多个碎片中保存的引用数据发生更改,系统必须同步所有碎片中的这些更改。当这种同步发生时,系统可能会遇到一定程度的不一致性。如果遵循这种方法,应该将应用程序设计为能够处理这种不一致性。
It can be difficult to maintain referential integrity and consistency between shards, so you should minimize operations that affect data in multiple shards. If an application must modify data across shards, evaluate whether complete data consistency is actually a requirement. Instead, a common approach in the cloud is to implement eventual consistency. The data in each partition is updated separately, and the application logic must take responsibility for ensuring that the updates all complete successfully, as well as handling the inconsistencies that can arise from querying data while an eventually consistent operation is running. For more information about implementing eventual consistency, see the Data Consistency Primer.
维护碎片之间的参照完整性和一致性可能很困难,因此应该尽量减少影响多个碎片中数据的操作。如果应用程序必须跨碎片修改数据,请评估是否实际上需要完全的数据一致性。相反,云计算中的一种常见方法是实现最终一致性。每个分区中的数据都是单独更新的,应用程序逻辑必须负责确保所有更新都成功完成,并处理在运行最终一致的操作时查询数据可能产生的不一致性。有关实现最终一致性的详细信息,请参阅数据一致性入门。
Configuring and managing a large number of shards can be a challenge. Tasks such as monitoring, backing up, checking for consistency, and logging or auditing must be accomplished on multiple shards and servers, possibly held in multiple locations. These tasks are likely to be implemented by using scripts or other automation solutions, but scripting and automation might not be able to completely eliminate the additional administrative requirements.
配置和管理大量碎片可能是一个挑战。必须在多个碎片和服务器上完成监视、备份、一致性检查以及日志记录或审计等任务,这些任务可能保存在多个位置。这些任务可能通过使用脚本或其他自动化解决方案来实现,但是脚本和自动化可能无法完全消除额外的管理需求。
Shards can be geo-located so that the data that they contain is close to the instances of an application that use it. This approach can considerably improve performance, but requires additional consideration for tasks that must access multiple shards in different locations.
可以对碎片进行地理定位,使其包含的数据接近使用它的应用程序的实例。这种方法可以显著提高性能,但是对于必须访问不同位置的多个碎片的任务,需要进一步考虑。
Use this pattern:
使用以下模式:
Note 注意
The primary focus of sharding is to improve the performance and scalability of a system, but as a by-product it can also improve availability by virtue of the way in which the data is divided into separate partitions. A failure in one partition does not necessarily prevent an application from accessing data held in other partitions, and an operator can perform maintenance or recovery of one or more partitions without making the entire data for an application inaccessible. For more information, see the Data Partitioning Guidance.
分片的主要重点是提高系统的性能和可伸缩性,但作为副产品,它也可以通过将数据划分为单独分区的方式提高可用性。一个分区中的故障不一定会阻止应用程序访问其他分区中保存的数据,操作员可以对一个或多个分区进行维护或恢复,而不会使应用程序的整个数据无法访问。有关更多信息,请参见数据分区指南。
The following example uses a set of SQL Server databases acting as shards. Each database holds a subset of the data used by an application. The application retrieves data that is distributed across the shards by using its own sharding logic (this is an example of a fan-out query). The details of the data that is located in each shard are returned by a method called GetShards. This method returns an enumerable list of ShardInformation objects, where the ShardInformation type contains an identifier for each shard and the SQL Server connection string that an application should use to connect to the shard (the connection strings are not shown in the code example).
下面的示例使用一组 SQL Server 数据库作为碎片。每个数据库都包含应用程序使用的数据的一个子集。应用程序通过使用自己的分片逻辑(这是扇出查询的一个示例)检索分布在碎片之间的数据。每个碎片中数据的详细信息由一个名为 GetShards 的方法返回。此方法返回 ShardInformation 对象的可枚举列表,其中 ShardInformation 类型包含每个碎片的标识符,以及应用程序连接到该碎片时应使用的 SQL Server 连接字符串(代码示例中未显示连接字符串)。
private IEnumerable<ShardInformation> GetShards(){
// This retrieves the connection information from a shard store
// (commonly a root database).
return new[] {
new ShardInformation
{
Id = 1,
ConnectionString = ...
},
new ShardInformation
{
Id = 2,
ConnectionString = ...
}
};
}
The code below shows how the application uses the list of ShardInformation objects to perform a query that fetches data from each shard in parallel. The details of the query are not shown, but in this example the data that is retrieved comprises a string which could hold information such as the name of a customer if the shards contain the details of customers. The results are aggregated into a ConcurrentBag collection for processing by the application.
下面的代码显示了应用程序如何使用 ShardInformation 对象列表来执行查询,该查询并行地从每个碎片中获取数据。没有显示查询的详细信息,但是在这个示例中,检索到的数据包含一个字符串,如果碎片包含客户的详细信息,该字符串可以包含客户的名称等信息。结果聚合到 ConcurrentBag 集合中,供应用程序处理。
// Retrieve the shards as a ShardInformation[] instance.
var shards = GetShards();
var results = new ConcurrentBag<string>();
// Execute the query against each shard in the shard list.
// This list would typically be retrieved from configuration
// or from a root/master shard store.
Parallel.ForEach(shards, shard =>{
// NOTE: Transient fault handling is not included,
// but should be incorporated when used in a real world application.
using (var con = new SqlConnection(shard.ConnectionString)) {
con.Open();
var cmd = new SqlCommand("SELECT ... FROM ...", con);
Trace.TraceInformation("Executing command against shard: {0}", shard.Id);
var reader = cmd.ExecuteReader();
// Read the results in to a thread-safe data structure.
while (reader.Read())
{
results.Add(reader.GetString(0));
}
}
});
Trace.TraceInformation("Fanout query complete - Record Count: {0}", results.Count);
The following patterns and guidance may also be relevant when implementing this pattern:
下列模式和指南在实现此模式时也可能有用:
将静态内容部署到基于云的存储服务,该服务可以将静态内容直接交付给客户机。这种模式可以减少对可能开销很大的计算实例的需求。
Web applications typically include some elements of static content. This static content may include HTML pages and other resources such as images and documents that are available to the client, either as part of an HTML page (such as inline images, style sheets, and client-side JavaScript files) or as separate downloads (such as PDF documents).
Web 应用程序通常包含静态内容的一些元素。这些静态内容可能包括 HTML 页面和其他资源,比如客户端可用的图片和文档,或者作为 HTML 页面的一部分(比如内联图片、样式表和客户端 JavaScript 文件) ,或者作为单独的下载(比如 PDF 文档)。
Although web servers are well tuned to optimize requests through efficient dynamic page code execution and output caching, they must still handle requests to download static content. This absorbs processing cycles that could often be put to better use.
尽管 Web 服务器通过高效的动态页面代码执行和输出缓存对请求进行了良好的优化,但它们仍然必须处理下载静态内容的请求。这会消耗本可以得到更好利用的处理周期。
In most cloud hosting environments it is possible to minimize the requirement for compute instances (for example, to use a smaller instance or fewer instances), by locating some of an application’s resources and static pages in a storage service. The cost for cloud-hosted storage is typically much less than for compute instances.
在大多数云托管环境中,通过在存储服务中定位应用程序的一些资源和静态页面,可以最大限度地减少对计算实例的需求(例如,使用更小的实例或更少的实例)。云托管存储的成本通常比计算实例低得多。
When hosting some parts of an application in a storage service, the main considerations are related to deployment of the application and to securing resources that are not intended to be available to anonymous users.
在存储服务中承载应用程序的某些部分时,主要考虑与应用程序的部署和保护匿名用户无法使用的资源有关。
Consider the following points when deciding how to implement this pattern:
在决定如何实现此模式时,请考虑以下几点:
This pattern is ideally suited for:
这种模式非常适合:
This pattern might not be suitable in the following situations:
这种模式可能不适用于下列情况:
Note 注意
It is sometimes possible to store a complete website that contains only static content such as HTML pages, images, style sheets, client-side JavaScript files, and downloadable documents such as PDF files in a cloud-hosted storage. For more information see An efficient way of deploying a static web site on Microsoft Azure on the Infosys blog.
有时可以在云存储中存储完整的网站,其中只包含静态内容,如 HTML 页面、图像、样式表、客户端 JavaScript 文件和可下载文档,如 PDF 文件。要了解更多信息,请看 Infosys 博客上 Microsoft Azure 上部署静态网站的有效方法。
Static content located in Azure blob storage can be accessed directly by a web browser. Azure provides an HTTP-based interface over storage that can be publicly exposed to clients. For example, content in an Azure blob storage container is exposed using a URL of the form:
Azure blob 存储中的静态内容可以通过 Web 浏览器直接访问。Azure 在存储之上提供了一个可以公开给客户端的基于 HTTP 的接口。例如,Azure blob 存储容器中的内容通过以下形式的 URL 公开:
http://[storage-account-name].blob.core.windows.net/[container-name]/[file-name]
When uploading the content for the application it is necessary to create one or more blob containers to hold the files and documents. Note that the default permission for a new container is Private, and you must change this to Public to allow clients to access the contents. If it is necessary to protect the content from anonymous access, you can implement the Valet Key pattern so users must present a valid token in order to download the resources.
在上传应用程序的内容时,需要创建一个或多个 blob 容器来保存文件和文档。注意,新容器的默认权限是 Private,必须将其更改为 Public,以允许客户端访问内容。如果需要保护内容不受匿名访问,可以实现 Valet Key 模式,这样用户必须提供有效的令牌才能下载资源。
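A minimal sketch of this step, assuming the same (legacy) Microsoft.WindowsAzure.Storage client library used by the samples for this guidance, an illustrative container name, and an existing storageConnectionString value:
// Requires: using Microsoft.WindowsAzure.Storage; using Microsoft.WindowsAzure.Storage.Blob;
var account = CloudStorageAccount.Parse(storageConnectionString);
var blobClient = account.CreateCloudBlobClient();
var container = blobClient.GetContainerReference("static-content");
container.CreateIfNotExists();
// The default access level is Private; allow anonymous read access to the blobs
// (but not to the container listing) so that browsers can download the content directly.
container.SetPermissions(new BlobContainerPermissions
{
    PublicAccess = BlobContainerPublicAccessType.Blob
});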
Note 注意
The page Blob Service Concepts on the Azure website contains information about blob storage, and the ways that you can access it and use it.
Azure 网站上的 Blob 服务概念页面包含有关 Blob 存储的信息,以及访问和使用它的方法。
The links in each page will specify the URL of the resource and the client will access this resource directly from the storage service. Figure 1 shows this approach.
每个页面中的链接将指定资源的 URL,客户端将直接从存储服务访问该资源。图1显示了这种方法。
![Figure 1 - Delivering static parts of an application directly from a storage service](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589776.906db286e5909f599dfade8f32c22152%28en-us,pandp.10%29.png)
Figure 1 - Delivering static parts of an application directly from a storage service
图1-从存储服务直接交付应用程序的静态部分
The links in the pages delivered to the client must specify the full URL of the blob container and resource. For example, a page that contains a link to an image in a public container might contain the following.
传递给客户端的页面中的链接必须指定 blob 容器和资源的完整 URL。例如,包含指向公共容器中图像的链接的页面可能包含以下内容。
<img src="http://mystorageaccount.blob.core.windows.net/myresources/image1.png"
alt="My image" />
Note 注意
If the resources are protected by using a valet key, such as an Azure Shared Access Signature (SAS), this signature must be included in the URLs in the links.
如果资源是通过使用代客密钥(如 Azure 共享访问签名(Shared Access Signature,SAS))来保护的,那么这个签名必须包含在链接的 URL 中。
The examples available for this guide contain a solution named StaticContentHosting that demonstrates using external storage for static resources. The StaticContentHosting.Cloud project contains configuration files that specify the storage account and container that holds the static content.
本指南提供的示例包含一个名为 StaticContentHosting 的解决方案,该解决方案演示了如何对静态资源使用外部存储。StaticContentHosting.Cloud 项目包含指定存储帐户和存放静态内容的容器的配置文件。
<Setting name="StaticContent.StorageConnectionString"
value="UseDevelopmentStorage=true" />
<Setting name="StaticContent.Container" value="static-content" />
The Settings class in the file Settings.cs of the StaticContentHosting.Web project contains methods to extract these values and build a string value containing the cloud storage account container URL.
StaticContentHosting.Web 项目的 Settings.cs 文件中的 Settings 类包含一些方法,用于提取这些值并构建包含云存储帐户容器 URL 的字符串。
public class Settings{
public static string StaticContentStorageConnectionString {
get {
return RoleEnvironment.GetConfigurationSettingValue( "StaticContent.StorageConnectionString");
}
}
public static string StaticContentContainer {
get {
return RoleEnvironment.GetConfigurationSettingValue("StaticContent.Container");
}
}
public static string StaticContentBaseUrl {
get {
var account = CloudStorageAccount.Parse(StaticContentStorageConnectionString);
return string.Format("{0}/{1}", account.BlobEndpoint.ToString().TrimEnd('/'), StaticContentContainer.TrimStart('/'));
}
}
}
The StaticContentUrlHtmlHelper class in the file StaticContentUrlHtmlHelper.cs exposes a method named StaticContentUrl that generates a URL containing the path to the cloud storage account if the URL passed to it starts with the ASP.NET root path character (~).
文件 StaticContentUrlHtmlHelper.cs 中的 StaticContentUrlHtmlHelper 类公开了一个名为 StaticContentUrl 的方法:如果传递给它的路径以 ASP.NET 根路径字符(~)开头,该方法会生成一个包含云存储帐户路径的 URL。
public static class StaticContentUrlHtmlHelper{
public static string StaticContentUrl(this HtmlHelper helper, string contentPath) {
if (contentPath.StartsWith("~")) {
contentPath = contentPath.Substring(1);
}
contentPath = string.Format("{0}/{1}", Settings.StaticContentBaseUrl.TrimEnd('/'), contentPath.TrimStart('/'));
var url = new UrlHelper(helper.ViewContext.RequestContext);
return url.Content(contentPath);
}
}
The file Index.cshtml in the Views\Home folder contains an image element that uses the StaticContentUrl method to create the URL for its src attribute.
Views\Home 文件夹中的文件 Index.cshtml 包含一个图像元素,该元素使用 StaticContentUrl 方法为其 src 属性创建 URL。
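The exact markup from the sample is not reproduced here, but a minimal sketch of such an element (the image path is hypothetical) looks like the following; the helper rewrites the ~-rooted path into the blob storage URL built from the settings shown above.
<img src="@Html.StaticContentUrl("~/Images/example-image.png")" alt="Example image" />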
The following pattern may also be relevant when implementing this pattern:
在实现此模式时,下列模式也可能是相关的:
控制应用程序实例、单个租户或整个服务使用的资源的消耗。这种模式可以使系统继续运行并满足服务水平协议,即使需求的增加对资源造成了极大的负担。
The load on a cloud application typically varies over time based on the number of active users or the types of activities they are performing. For example, more users are likely to be active during business hours, or the system may be required to perform computationally expensive analytics at the end of each month. There may also be sudden and unanticipated bursts in activity. If the processing requirements of the system exceed the capacity of the resources that are available, it will suffer from poor performance and may even fail. The system may be obliged to meet an agreed level of service, and such failure could be unacceptable.
云应用程序的负载通常随着时间的推移而变化,这取决于活动用户的数量或他们正在执行的活动类型。例如,更多的用户可能在工作时间活动,或者系统可能需要在每个月底执行计算代价高昂的分析。也可能有突然和意想不到的活动爆发。如果系统的处理要求超过了可用资源的容量,它将受到性能差的影响,甚至可能失败。系统可能有义务满足商定的服务水平,这种故障可能是不可接受的。
There are many strategies available for handling varying load in the cloud, depending on the business goals for the application. One strategy is to use autoscaling to match the provisioned resources to the user needs at any given time. This has the potential to consistently meet user demand, while optimizing running costs. However, while autoscaling may trigger the provisioning of additional resources, this provisioning is not instantaneous. If demand grows quickly, there may be a window of time where there is a resource deficit.
根据应用程序的业务目标,有许多策略可用于处理云中不同的负载。一种策略是在任何给定时间使用自动伸缩来匹配所提供的资源以满足用户的需求。这有可能持续满足用户需求,同时优化运行成本。然而,尽管自动伸缩可能会触发额外资源的配置,但这种配置并不是即时的。如果需求快速增长,可能会出现资源短缺。
An alternative strategy to autoscaling is to allow applications to use resources only up to some soft limit, and then throttle them when this limit is reached. The system should monitor how it is using resources so that, when usage exceeds some system-defined threshold, it can throttle requests from one or more users to enable the system to continue functioning and meet any service level agreements (SLAs) that are in place. For more information on monitoring resource usage, see the Instrumentation and Telemetry Guidance.
自动伸缩的另一种策略是允许应用程序只使用某些软限制的资源,然后在达到这个限制时限制它们。系统应该监视它是如何使用资源的,以便当使用量超过某个系统定义的阈值时,它可以限制来自一个或多个用户的请求,从而使系统能够继续运行并满足任何已经到位的服务水平协议(SLA)。有关监视资源使用情况的详细信息,请参阅仪器和遥测指南。
The system could implement several throttling strategies, including:
该系统可实施若干节流策略,包括:
Figure 1 shows an area graph for resource utilization (a combination of memory, CPU, bandwidth, and other factors) against time for applications that are making use of three features. A feature is an area of functionality, such as a component that performs a specific set of tasks, a piece of code that performs a complex calculation, or an element that provides a service such as an in-memory cache. These features are labeled A, B, and C.
图1显示了使用三个特性的应用程序的资源利用率(内存、 CPU、带宽和其他因素的组合)随时间变化的区域图。特性是功能的一个区域,例如执行特定任务集的组件、执行复杂计算的代码段或提供服务(如内存缓存)的元素。这些特征被标记为 A、 B 和 C。
![Figure 1 - Graph showing resource utilization against time for applications running on behalf of three users](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589798.a1a3b082c679b2531f312b7fb4ff23e9%28en-us,pandp.10%29.png)
Figure 1 - Graph showing resource utilization against time for applications running on behalf of three users
图1-显示代表三个用户运行的应用程序的资源利用率与时间的关系图
Note 注意
The area immediately below the line for a feature indicates the resources used by applications when they invoke this feature. For example, the area below the line for Feature A shows the resources used by applications that are making use of Feature A, and the area between the lines for Feature A and Feature B indicates the resources by used by applications invoking Feature B. Aggregating the areas for each feature shows the total resource utilization of the system.
特性线下面的区域表示应用程序在调用该特性时使用的资源。例如,特性 A 线下面的区域显示使用特性 A 的应用程序所使用的资源,特性 A 和特性 B 线之间的区域表示调用特性 B 的应用程序所使用的资源。将每个特性的区域汇总起来,即可得到系统的总资源利用率。
The graph in Figure 1 illustrates the effects of deferring operations. Just prior to time T1, the total resources allocated to all applications using these features reach a threshold (the soft limit of resource utilization). At this point, the applications are in danger of exhausting the resources available. In this system, Feature B is less critical than Feature A or Feature C, so it is temporarily disabled and the resources that it was using are released. Between times T1 and T2, the applications using Feature A and Feature C continue running as normal. Eventually, the resource use of these two features diminishes to the point when, at time T2, there is sufficient capacity to enable Feature B again.
图1中的图表说明了延迟操作的影响。就在 T1之前,分配给使用这些特性的所有应用程序的总资源达到了一个阈值(资源利用率的软限制)。此时,应用程序有耗尽可用资源的危险。在这个系统中,特征 B 的关键性不如特征 A 或特征 C,因此它被暂时禁用,它所使用的资源被释放。在 T1和 T2之间,使用 FeatureA 和 FeatureC 的应用程序继续正常运行。最终,这两个特性的资源使用减少到一定程度,在时间 T2时,有足够的容量再次启用特性 B。
The autoscaling and throttling approaches can also be combined to help keep the applications responsive and within SLAs. If the demand is expected to remain high, throttling may provide a temporary solution while the system scales out. At this point, the full functionality of the system can be restored.
还可以将自动缩放和节流方法结合起来,以帮助保持应用程序的响应性并满足 SLA。如果预计需求将持续处于高位,节流可以在系统向外扩展期间提供临时解决方案。等到扩展完成后,即可恢复系统的全部功能。
Figure 2 shows an area graph of the overall resource utilization by all applications running in a system against time, and illustrates how throttling can be combined with autoscaling.
图2显示了系统中运行的所有应用程序对时间的总体资源利用率的面积图,并说明了如何将节流与自动缩放结合起来。
![Figure 2 - Graph showing the effects of combining throttling with autoscaling](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589798.9063247c16e21dfefe12b06e3f42d9ce%28en-us,pandp.10%29.png)
Figure 2 - Graph showing the effects of combining throttling with autoscaling
图2-图表显示了节流与自动缩放相结合的效果
At time T1, the threshold specifying the soft limit of resource utilization is reached. At this point, the system can start to scale out. However, if the new resources do not become available sufficiently quickly then the existing resources may be exhausted and the system could fail. To prevent this from occurring, the system is temporarily throttled, as described earlier. When autoscaling has completed and the additional resources are available, throttling can be relaxed.
在 T1时,达到指定资源利用软限制的阈值。此时,系统可以开始向外扩展。但是,如果新的资源不能足够快地变得可用,那么现有的资源可能会耗尽,系统可能会失败。为了防止这种情况发生,如前所述,系统被临时节流。当自动伸缩完成并且附加资源可用时,可以放松节流。
You should consider the following points when deciding how to implement this pattern:
在决定如何实现此模式时,应考虑以下几点:
Use this pattern:
使用以下模式:
Figure 3 illustrates how throttling can be implemented in a multi-tenant system. Users from each of the tenant organizations access a cloud-hosted application where they fill out and submit surveys. The application contains instrumentation that monitors the rate at which these users are submitting requests to the application.
图3说明了如何在多租户系统中实现节流。每个租户组织的用户都可以访问云托管的应用程序,填写并提交调查表。应用程序包含监视这些用户向应用程序提交请求的速率的仪器。
In order to prevent the users from one tenant affecting the responsiveness and availability of the application for all other users, a limit is applied to the number of requests per second that the users from any one tenant can submit. The application blocks requests that exceed this limit.
为了防止某一个租户的用户影响应用程序对所有其他用户的响应性和可用性,对任何一个租户的用户每秒可以提交的请求数施加了限制。应用程序会阻止超过此限制的请求。
![Figure 3 - Implementing throttling in a multi-tenant application](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn589798.7aef73084f6873caf839a57e1f13fb3b%28en-us,pandp.10%29.png)
Figure 3 - Implementing throttling in a multi-tenant application
图3-在多租户应用程序中实现节流
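A minimal sketch of the kind of per-tenant limit described above (the class name and the fixed one-second window are illustrative assumptions; the sample application's actual implementation is not shown here):
using System;
using System.Collections.Concurrent;
public class TenantRateLimiter
{
    private class Window
    {
        public long Second;
        public int Count;
    }
    private readonly int maxRequestsPerSecond;
    private readonly ConcurrentDictionary<string, Window> windows =
        new ConcurrentDictionary<string, Window>();
    public TenantRateLimiter(int maxRequestsPerSecond)
    {
        this.maxRequestsPerSecond = maxRequestsPerSecond;
    }
    public bool TryAcceptRequest(string tenantId)
    {
        var currentSecond = DateTime.UtcNow.Ticks / TimeSpan.TicksPerSecond;
        var window = this.windows.GetOrAdd(tenantId, key => new Window { Second = currentSecond });
        lock (window)
        {
            if (window.Second != currentSecond)
            {
                // A new one-second window has started for this tenant.
                window.Second = currentSecond;
                window.Count = 0;
            }
            if (window.Count >= this.maxRequestsPerSecond)
            {
                return false;
            }
            window.Count++;
            return true;
        }
    }
}
A request-handling layer would call TryAcceptRequest with the caller's tenant ID and reject the request (for example, with an HTTP 429 response) whenever it returns false.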
The following patterns and guidance may also be relevant when implementing this pattern:
下列模式和指南在实现此模式时也可能有用:
使用令牌或密钥为客户端提供对特定资源或服务的受限制的直接访问,以便从应用程序代码中卸载数据传输操作。此模式在使用云托管存储系统或队列的应用程序中特别有用,可以最小化成本并最大限度地提高可伸缩性和性能。
Client programs and web browsers often need to read and write files or data streams to and from an application’s storage. Typically, the application will handle the movement of the data—either by fetching it from storage and streaming it to the client, or by reading the uploaded stream from the client and storing it in the data store. However, this approach absorbs valuable resources such as compute, memory, and bandwidth.
客户端程序和 Web 浏览器通常需要从应用程序的存储中读取文件或数据流,或向其写入文件或数据流。通常,应用程序会处理数据的移动——要么从存储中提取数据并将其流式传输到客户端,要么从客户端读取上传的流并将其存储在数据存储中。但是,这种方法会占用计算、内存和带宽等宝贵资源。
Data stores have the capability to handle upload and download of data directly, without requiring the application to perform any processing to move this data, but this typically requires the client to have access to the security credentials for the store. While this can be a useful technique to minimize data transfer costs and the requirement to scale out the application, and to maximize performance, it means that the application is no longer able to manage the security of the data. Once the client has a connection to the data store for direct access, the application cannot act as the gatekeeper. It is no longer in control of the process and cannot prevent subsequent uploads or downloads from the data store.
数据存储能够直接处理数据的上传和下载,而不需要应用程序执行任何处理来移动这些数据,但是这通常需要客户端访问存储的安全凭据。虽然这是一种有用的技术,可以最小化数据传输成本和扩展应用程序的需求,并最大限度地提高性能,但这意味着应用程序不再能够管理数据的安全性。一旦客户端连接到数据存储区进行直接访问,应用程序就不能充当网关守护者。它不再控制进程,并且不能阻止数据存储区的后续上传或下载。
This is not a realistic approach in modern distributed systems that may need to serve untrusted clients. Instead, applications must be able to securely control access to data in a granular way, but still reduce the load on the server by setting up this connection and then allowing the client to communicate directly with the data store to perform the required read or write operations.
在现代分布式系统中,这种方法并不现实,因为它可能需要为不受信任的客户机提供服务。相反,应用程序必须能够以粒度的方式安全地控制对数据的访问,但是仍然可以通过建立这种连接,然后允许客户端直接与数据存储通信来执行所需的读或写操作,从而减少服务器的负载。
To resolve the problem of controlling access to a data store where the store itself cannot manage authentication and authorization of clients, one typical solution is to restrict access to the data store’s public connection and provide the client with a key or token that the data store itself can validate.
为了解决对数据存储的访问控制问题,在数据存储本身不能管理客户端的身份验证和授权的情况下,一个典型的解决方案是限制对数据存储的公共连接的访问,并向客户端提供数据存储本身可以验证的密钥或令牌。
This key or token is usually referred to as a valet key. It provides time-limited access to specific resources and allows only predefined operations such as reading and writing to storage or queues, or uploading and downloading in a web browser. Applications can create and issue valet keys to client devices and web browsers quickly and easily, allowing clients to perform the required operations without requiring the application to directly handle the data transfer. This removes the processing overhead, and the consequent impact on performance and scalability, from the application and the server.
此密钥或令牌通常称为代客密钥。它提供了对特定资源的有时间限制的访问,并且只允许预定义的操作,比如对存储或队列的读写,或者在 Web 浏览器中的上传和下载。应用程序可以快速、方便地创建并向客户端设备和 Web 浏览器发出代客密钥,从而允许客户端执行所需的操作,而无需应用程序直接处理数据传输。这就从应用程序和服务器上消除了处理开销以及随之而来的对性能和可伸缩性的影响。
The client uses this token to access a specific resource in the data store for only a specific period, and with specific restrictions on access permissions, as shown in Figure 1. After the specified period, the key becomes invalid and will not allow subsequent access to the resource.
客户端使用此令牌仅在特定时间内访问数据存储区中的特定资源,并对访问权限进行特定限制,如图1所示。在指定的时间段之后,密钥变为无效,并且不允许后续访问资源。
![Figure 1 - Overview of the pattern](https://docs.microsoft.com/en-us/previous-versions/msp-n-p/images/dn568102.8604c297d0e997a7481901d7533b8aa1%28en-us,pandp.10%29.png)
Figure 1 - Overview of the pattern
图1-模式概述
It is also possible to configure a key that has other dependencies, such as the scope of the location of the data. For example, depending on the data store capabilities, the key may specify a complete table in a data store, or only specific rows in a table. In cloud storage systems the key may specify a container, or just a specific item within a container.
还可以配置具有其他依赖项(如数据位置的范围)的密钥。例如,根据数据存储功能,键可以指定数据存储中的完整表,或者仅指定表中的特定行。在云存储系统中,密钥可以指定一个容器,或者仅指定容器中的一个特定项。
The key can also be invalidated by the application. This is a useful approach if the client notifies the server that the data transfer operation is complete. The server can then invalidate that key to prevent its use for any subsequent access to the data store.
应用程序也可以使密钥失效。如果客户机通知服务器数据传输操作已完成,那么这是一种有用的方法。然后,服务器可以使该密钥失效,以防止其用于对数据存储区的任何后续访问。
Using this pattern can simplify managing access to resources because there is no requirement to create and authenticate a user, grant permissions, and then remove the user again. It also makes it easy to constrain the location, the permission, and the validity period—all by simply generating a suitable key at runtime. The important factors are to limit the validity period, and especially the location of the resource, as tightly as possible so that the recipient can use it for only the intended purpose.
使用此模式可以简化对资源访问的管理,因为不需要创建和验证用户、授予权限,然后再删除用户。它还可以很容易地约束位置、权限和有效期——只需在运行时生成一个合适的密钥即可。重要的是尽可能严格地限制有效期,特别是资源的位置,以便接收者只能将其用于预期目的。
Consider the following points when deciding how to implement this pattern:
在决定如何实现此模式时,请考虑以下几点:
Other issues to be aware of when implementing this pattern are:
在实现此模式时需要注意的其他问题包括:
This pattern is ideally suited for the following situations:
这种模式非常适合下列情况:
This pattern might not be suitable in the following situations:
这种模式可能不适用于下列情况:
Microsoft Azure supports Shared Access Signatures (SAS) on Azure storage for granular access control to data in blobs, tables, and queues, and for Service Bus queues and topics. An SAS token can be configured to provide specific access rights such as read, write, update, and delete to a specific table; a key range within a table; a queue; a blob; or a blob container. The validity can be a specified time period or with no time limit.
微软 Azure 支持 Azure 存储上的共享访问签名(SAS) ,用于对 blobs、表和队列中的数据进行粒度访问控制,以及对服务总线队列和主题进行访问控制。可以将 SAS 令牌配置为提供特定的访问权限,如对特定表的读、写、更新和删除; 表中的键范围; 队列; blob 或 blob 容器。有效期可以是指定的时间段,也可以没有时间限制。
Azure SAS also supports server-stored access policies that can be associated with a specific resource such as a table or blob. This feature provides additional control and flexibility compared to application-generated SAS tokens, and should be used whenever possible. Settings defined in a server-stored policy can be changed and are reflected in the token without requiring a new token to be issued, but settings defined in the token itself cannot be changed without issuing a new token. This approach also makes it possible to revoke a valid SAS token before it has expired.
Azure SAS 还支持服务器存储的访问策略,这些策略可以与特定的资源(如表或 blob)相关联。与应用程序生成的 SAS 令牌相比,此特性提供了额外的控制和灵活性,应尽可能使用。可以更改在服务器存储策略中定义的设置,并将其反映在令牌中,而不需要发出新令牌,但是如果不发出新令牌,则不能更改在令牌本身中定义的设置。这种方法还可以在有效的 SAS 令牌过期之前撤销它。
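A hedged sketch of using a server-stored policy with the legacy Microsoft.WindowsAzure.Storage client library (the policy name is illustrative, and the container and blob variables are assumed to have been obtained in the same way as in the code sample that follows):
// Requires: using System; using Microsoft.WindowsAzure.Storage.Blob;
var permissions = container.GetPermissions();
permissions.SharedAccessPolicies["download-policy"] = new SharedAccessBlobPolicy
{
    Permissions = SharedAccessBlobPermissions.Read,
    SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(30)
};
container.SetPermissions(permissions);
// The SAS token only references the stored policy by name, so changing or removing
// the policy on the server affects all tokens that were issued against it.
var sasToken = blob.GetSharedAccessSignature(null, "download-policy");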
Note 注意
For more information see Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob SAS in the Azure Storage Team blog and Shared Access Signatures, Part 1: Understanding the SAS Model on MSDN.
有关详细信息,请参阅 Azure 存储团队博客中的表 SAS (共享访问签名)、队列 SAS 和对 Blob SAS 的更新,以及共享访问签名,第1部分: 理解 MSDN 上的 SAS 模型。
The following code demonstrates how to create a SAS that is valid for five minutes. The GetSharedAccessReferenceForUpload method returns a SAS that can be used to upload a file to Azure Blob Storage.
下面的代码演示如何创建有效时间为5分钟的 SAS。GetSharedAccessReferenceForUpload 方法返回可用于将文件上传到 Azure Blob Storage 的 SAS。
public class ValuesController : ApiController{
private readonly CloudStorageAccount account;
private readonly string blobContainer;
...
/// <summary>
/// Return a limited access key that allows the caller to upload a file
/// to this specific destination for a defined period of time.
/// </summary>
private StorageEntitySas GetSharedAccessReferenceForUpload(string blobName) {
var blobClient = this.account.CreateCloudBlobClient();
var container = blobClient.GetContainerReference(this.blobContainer);
var blob = container.GetBlockBlobReference(blobName);
var policy = new SharedAccessBlobPolicy {
Permissions = SharedAccessBlobPermissions.Write,
// Specify a start time five minutes earlier to allow for client clock skew.
SharedAccessStartTime = DateTime.UtcNow.AddMinutes(-5),
// Specify a validity period of five minutes starting from now.
SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5)
};
// Create the signature.
var sas = blob.GetSharedAccessSignature(policy);
return new StorageEntitySas {
BlobUri = blob.Uri,
Credentials = sas,
Name = blobName
};
}
public struct StorageEntitySas {
public string Credentials;
public Uri BlobUri;
public string Name;
}
}
Note 注意
The complete sample containing this code is available in the ValetKey solution available for download with this guidance. The ValetKey.Web project in this solution contains a web application that includes the ValuesController class shown above. A sample client application that uses this web application to retrieve a SAS key and upload a file to blob storage is available in the ValetKey.Client project.
包含此代码的完整示例可以在随本指南一起下载的 ValetKey 解决方案中获得。该解决方案中的 ValetKey.Web 项目包含一个 Web 应用程序,其中包括上面显示的 ValuesController 类。ValetKey.Client 项目中提供了一个示例客户端应用程序,它使用此 Web 应用程序检索 SAS 密钥并将文件上传到 blob 存储。
The following patterns and guidance may also be relevant when implementing this pattern:
下列模式和指南在实现此模式时也可能有用: