5 、强制故障转移(可能丢失数据) |
5 、Forced Failover (with Possible Data Loss) |
||||||||||||||||||||||||
强制执行可用性组的故障转移(可能丢失数据)是一种灾难恢复方法,可使你使用次要副本作为热备用服务器。因为强制执行故障转移可能面临数据丢失的风险,因此应审慎使用它。建议仅当您必须立即将服务还原到可用性数据库并愿意承担数据丢失的风险时,才执行强制故障转移。有关强制故障转移的先决条件和建议,以及使用强制故障转移从灾难性故障中恢复的示例应用场景的详细信息,请参阅 执行可用性组的强制手动故障转移(SQL Server)。 |
Forcing failover of an availability group (with possible data loss) is a disaster recovery method that allows you to use a secondary replica as a warm standby server. Because forcing failover risks possible data loss, it should be used cautiously and sparingly. We recommend forcing failover only if you must restore service to your availability databases immediately and are willing to risk losing data. For more information about the prerequisites and recommendations for forcing failover, and for an example scenario that uses a forced failover to recover from a catastrophic failure, see Perform a Forced Manual Failover of an Availability Group (SQL Server). |
||||||||||||||||||||||||
警告 |
Warning |
||||||||||||||||||||||||
强制故障转移要求WSFC群集具有仲裁。有关配置仲裁和强制仲裁的信息,请参阅 Windows Server 故障转移群集(WSFC) 与SQL Server。 |
Forcing failover requires that the WSFC cluster have quorum. For information about configuring quorum and forcing quorum, see Windows Server Failover Clustering (WSFC) with SQL Server. |
||||||||||||||||||||||||
5.1 、强制故障转移的原理 |
5.1 、How Forced Failover Works |
||||||||||||||||||||||||
强制故障转移会启动一个将主角色转换为角色处于辅助或正在解析状态的目标副本的过程。故障转移目标成为新的主副本,并立即将其数据库副本提供给客户端。当以前的主副本变得可用时,它将转换为辅助角色,并且其数据库将成为辅助数据库。 |
Forcing failover initiates a transition of the primary role to a target replica whose role is in the SECONDARY or RESOLVING state. The failover target becomes the new primary replica and immediately serves its copies of the databases to clients. When the former primary replica becomes available, it will transition to the secondary role and its databases will become secondary databases. |
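
As a minimal sketch of how this transition is started with Transact-SQL, the statement below would be run on the server instance that hosts the failover target. The availability group name `MyAg` is a placeholder chosen for illustration and does not come from this article.

```sql
-- Run on the instance that hosts the target replica
-- (its role must be SECONDARY or RESOLVING).
-- [MyAg] is a hypothetical availability group name.
ALTER AVAILABILITY GROUP [MyAg] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

Because the statement can discard committed transactions, it is usually worth confirming that you are connected to the intended instance before running it.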
||||||||||||||||||||||||
所有辅助数据库(包括现在变得可用的以前的主数据库)将挂起。根据挂起的辅助数据库以前的数据同步状态,它可能适合于补救该主数据库的未能提交的数据。在配置为只读访问的辅助副本上,您可以查询辅助数据库以手动发现丢失的数据。然后,您可以对新的主数据库发出Transact-SQL语句来进行必要的更改。 |
All secondary databases (including the former primary databases, when they become available) are SUSPENDED. Depending on the previous data synchronization state of a suspended secondary database, it might be suitable for salvaging missing committed data for that primary database. On a secondary replica that is configured for read-only access, you can query the secondary databases to manually discover missing data. Then you can issue Transact-SQL statements on the new primary databases to make any necessary changes. |
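
One way to see which databases are currently suspended is to query the database replica state view. This is only a sketch: on the primary replica the view returns a row for every replica, while on a secondary replica it returns rows for local databases only.

```sql
-- List availability databases visible from this instance and show
-- whether data movement is suspended, and why.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id) AS database_name,
    drs.synchronization_state_desc,
    drs.is_suspended,
    drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```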
||||||||||||||||||||||||
5.2 、强制故障转移的风险 |
5.2 、Risks of Forcing Failover |
||||||||||||||||||||||||
一定要注意,强制故障转移可能会造成数据丢失。这是因为目标副本无法与主副本进行通信,从而不能保证两个数据库同步。强制故障转移启动新的恢复分叉。因为原始主数据库和辅助数据库位于不同的恢复分叉上,所以每个数据库现在包含另一个数据库不包含的数据:每个原始主数据库包含任何尚未从其发送队列发送到以前的辅助数据库的更改(未发送的日志);以前的辅助数据库包含任何强制故障转移之后发生的更改。 |
It is essential to understand that forcing failover can cause data loss. Data loss is possible because the target replica cannot communicate with the primary replica and, therefore, cannot guarantee that the databases are synchronized. Forcing failover starts a new recovery fork. Because the original primary databases and secondary databases are on different recovery forks, each of them now contains data that the other database does not contain: each original primary database contains whatever changes were not yet sent from its send queue to the former secondary database (the unsent log); the former secondary databases contain whatever changes occur after failover was forced. |
||||||||||||||||||||||||
如果因为主副本出现故障而强制进行故障转移,则潜在的数据丢失取决于是否在出现故障之前已将所有事务日志发送到辅助副本。在异步提交模式下,可能会始终存在累积的未发送日志。在同步提交模式下,可能仅在辅助数据库同步之前会出现这种情况。 |
If failover is forced because the primary replica has failed, potential data loss depends on whether all transaction log records had been sent to the secondary replica before the failure. Under asynchronous-commit mode, accumulated unsent log is always a possibility. Under synchronous-commit mode, this is possible only until the secondary databases become synchronized. |
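
As an illustrative check, assuming it is run on the primary replica while that replica is still healthy (for example, as part of routine monitoring), the following query shows each secondary database's availability mode, its synchronization state, and the amount of log not yet sent to it.

```sql
-- Run on the primary replica. is_local = 0 restricts the output
-- to databases hosted by the secondary replicas.
SELECT
    ag.name                      AS availability_group,
    ar.replica_server_name,
    ar.availability_mode_desc,   -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
    DB_NAME(drs.database_id)     AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size      -- KB of log not yet sent to this secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
WHERE drs.is_local = 0;
```

A synchronous-commit secondary database that reports SYNCHRONIZED with an empty send queue is the case in which forcing failover to it should not lose data.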
||||||||||||||||||||||||
下表总结了在强制故障转移到该副本上时特定数据库丢失数据的可能性。 |
The following table summarizes the possibility of data loss for a particular database on the replica to which you force failover. |
||||||||||||||||||||||||
|
|
||||||||||||||||||||||||
辅助数据库仅跟踪两个恢复分叉,因此,如果您执行多个强制故障转移,则确实已与先前的强制故障转移启动数据同步的任何辅助数据库都可能无法恢复运行。如果发生这种情况,则需要从可用性组中删除无法恢复的所有辅助数据库,还原到正确的时间点,然后重新加入可用性组。在此方案中,可能会发生状态为103的错误1408(错误:1408,严重性:16,状态:103)。还原不能跨多个恢复分叉执行,因此请确保在执行多个强制故障转移后执行日志备份。 |
Secondary databases track only two recovery forks, so if you perform multiple forced failovers, any secondary database that did start data synchronization with the previous forced failover might not be able to resume. If this occurs, any secondary databases that cannot be resumed will need to be removed from the availability group, restored to the correct point in time, and rejoined to the availability group. Error 1408 with state 103 may be observed in this scenario (Error: 1408, Severity: 16, State: 103). A restore will not work across multiple recovery forks; therefore, be sure to perform a log backup after performing more than one forced failover. |
||||||||||||||||||||||||
5.3 、强制仲裁后需要强制故障转移的原因 |
5.3 、Why Forced Failover is Required After Forcing Quorum |
||||||||||||||||||||||||
在对WSFC群集强制执行仲裁(强制仲裁)后,你需要在每个可用性组上执行强制故障转移(可能会丢失数据)。强制故障转移是必需的,因为WSFC群集的真实状态值可能已丢失。在强制仲裁后需要防止常规故障转移,因为在重新配置的WSFC群集上未同步的辅助副本很可能显示为“已同步”。 |
After quorum is forced on the WSFC cluster (forced quorum), you need to perform a forced failover (with possible data loss) on each availability group. The forced failover is required because the real state of the WSFC cluster values might have been lost. Preventing normal failovers after a forced quorum is required because of the possibility that an unsynchronized secondary replica would appear to be synchronized on the reconfigured WSFC cluster. |
||||||||||||||||||||||||
例如,考虑在3个节点上承载可用性组的WSFC群集:节点A承载主要副本,而节点B和节点C分别承载一个次要副本。节点C断开了与WSFC群集的连接,而此时该节点上的本地辅助副本处于同步状态。但是节点A和节点B仍可以正常仲裁,可用性组仍处于联机状态。在节点A上,主副本继续接受更新,在节点B上,辅助副本继续与主副本同步。节点C上的辅助副本就会变得不同步,并且越来越滞后于主副本。但是,由于节点C已断开连接,该副本仍错误地处于同步状态。 |
For example, consider a WSFC cluster that hosts an availability group on three nodes: Node A hosts the primary replica, and Node B and Node C each host a secondary replica. Node C gets disconnected from the WSFC cluster while the local secondary replica is SYNCHRONIZED. But Node A and Node B retain a healthy quorum and the availability group remains online. On Node A, the primary replica continues to accept updates, and on Node B, the secondary replica continues to synchronize with the primary replica. The secondary replica on Node C becomes unsynchronized and falls increasingly behind the primary replica. However, because Node C is disconnected, the replica remains, incorrectly, in the SYNCHRONIZED state. |
||||||||||||||||||||||||
如果仲裁丢失,然后在节点A上强制执行,则WSFC群集上可用性组的同步状态应是正确的(节点C上的辅助副本显示为未同步状态)。但是,如果在节点C上强制执行仲裁,则可用性组的同步状态将是不正确的。群集上的同步状态将恢复为节点C断开连接时所处的状态(节点C上的辅助副本“错误地”显示为同步状态)。由于计划的手动故障转移确保了数据的安全性,在强制仲裁后它们不允许将可用性组恢复为联机状态。 |
If quorum is lost and is then forced on Node A, the synchronization state of the availability group on the WSFC cluster should be correct, with the secondary replica on Node C shown as UNSYNCHRONIZED. However, if quorum is forced on Node C, the synchronization state of the availability group will be incorrect. The synchronization state on the cluster will have reverted to what it was when Node C was disconnected, with the secondary replica on Node C incorrectly shown as SYNCHRONIZED. Since planned manual failovers guarantee the safety of the data, they are disallowed for bringing an availability group back online after quorum is forced. |
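
Before choosing a failover method, it can help to check what the local instance reports about the cluster. The sketch below uses the cluster dynamic management views; it only reports the quorum state as seen from the instance where it runs.

```sql
-- Quorum type and state as seen by this instance.
-- quorum_state_desc is NORMAL_QUORUM, FORCED_QUORUM, or UNKNOWN_QUORUM_STATE.
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Cluster members, their state, and their quorum votes.
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```

If the first query reports FORCED_QUORUM, the synchronization states stored in the cluster should not be trusted, for the reasons described in the Node C example above.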
||||||||||||||||||||||||
5.4 、跟踪可能的数据丢失 |
5.4 、Tracking Potential Data Loss |
||||||||||||||||||||||||
WSFC 群集正常仲裁时,您可以估计数据库上当前可能的数据丢失量。对于给定的辅助副本,当前可能的数据丢失量取决于本地辅助数据库滞后相应主数据库的程度。因为滞后程度随时间而变化,我们建议您定期跟踪未同步的辅助数据库可能的数据丢失情况。跟踪滞后情况涉及比较每个主数据库和辅助数据库的上次提交LSN和上次提交时间,如下所示: |
When the WSFC cluster has a healthy quorum, you can estimate the current potential for data loss on databases. For a given secondary replica, the current potential for data loss depends on how far the local secondary databases are lagging behind the corresponding primary databases. Because the amount of lag varies over time, we recommend that you periodically track potential data loss for your unsynchronized secondary databases. Tracking lag involves comparing the Last Commit LSN and Last Commit Time for each primary database and its secondary databases, as follows: |
||||||||||||||||||||||||
1. 连接到主副本。 |
1. Connect to the primary replica. |
||||||||||||||||||||||||
2. 查询 sys.dm_hadr_database_replica_states动态管理视图的 last_commit_lsn(上次提交事务的LSN)和 last_commit_time(上次提交时间)列。 |
2. Query the last_commit_lsn (LSN of the last committed transaction) and last_commit_time (time of the last commit) columns of the sys.dm_hadr_database_replica_states dynamic management view. |
||||||||||||||||||||||||
3. 比较为每个主数据库和它的每个辅助数据库返回的值。它们的上次提交LSN的差值指示滞后的程度。 |
3. Compare the values returned for each primary database and each of its secondary databases. The difference between their Last Commit LSNs indicates the amount of lag. |
||||||||||||||||||||||||
4. 当某个或某组数据库上的滞后程度超过指定时间段的最大滞后程度时,您可以触发一个警报。例如,可以通过每分钟在每个主数据库上执行的一个作业来运行查询。如果自上次执行该作业以来,主数据库的 last_commit_time和任意辅助数据库的相应值的差值超过恢复点目标(RPO)(例如,5分钟),该作业可能引发一个警报。 |
4. You can trigger an alert when the amount of lag on a database or set of databases exceeds your desired maximum lag for a given period of time. For example, the query can be run by a job that executes every minute on each primary database. If the difference between the last_commit_time of a primary database and any of its secondary databases has exceeded the recovery point objective (RPO) (for example, 5 minutes) since the last time the job executed, the job can raise an alert. |
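
The steps above can be sketched as a single batch that a scheduled job might run on the primary replica. The 300-second threshold, the variable names, and the error message are assumptions made for illustration; the columns and views are the ones named in the steps.

```sql
DECLARE @rpo_seconds int = 300;   -- hypothetical RPO of 5 minutes
DECLARE @max_lag_seconds int;

-- Pair each primary database with its secondary copies and take the
-- largest gap between their last commit times.
WITH states AS (
    SELECT
        drs.group_database_id,
        ars.role_desc,
        drs.last_commit_lsn,
        drs.last_commit_time
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.dm_hadr_availability_replica_states AS ars
        ON ars.replica_id = drs.replica_id
       AND ars.group_id = drs.group_id
)
SELECT @max_lag_seconds =
       MAX(DATEDIFF(SECOND, s.last_commit_time, p.last_commit_time))
FROM states AS p
JOIN states AS s
    ON s.group_database_id = p.group_database_id
WHERE p.role_desc = N'PRIMARY'
  AND s.role_desc = N'SECONDARY';

-- A NULL result (no rows, or NULL commit times) raises nothing here.
IF @max_lag_seconds > @rpo_seconds
    RAISERROR(N'Secondary database lag of %d seconds exceeds the RPO.',
              16, 1, @max_lag_seconds);
```

This sketch compares commit times at a single moment rather than tracking whether the threshold was exceeded since the previous run, so a production job would likely persist results between executions.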
||||||||||||||||||||||||
重要 |
Important |
||||||||||||||||||||||||
当WSFC群集缺少仲裁或已强制执行仲裁时, last_commit_lsn和 last_commit_time为NULL。有关在强制仲裁后如何避免数据丢失的信息,请参阅 执行可用性组的强制手动故障转移(SQL Server)。 |
When the WSFC cluster lacks quorum or quorum has been forced, last_commit_lsn and last_commit_time are NULL. For information about how you might be able to avoid data loss after you have forced quorum, see "Potential Ways to Avoid Data Loss After Quorum is Forced" in Perform a Forced Manual Failover of an Availability Group (SQL Server). |
||||||||||||||||||||||||
5.5 、管理潜在的数据丢失 |
5.5 、Managing the Potential Data Loss |
||||||||||||||||||||||||
强制故障转移后,所有辅助数据库都将挂起。这包括以前的主数据库(在以前的主副本返回到联机状态并且发现它现在是辅助副本后)。您必须单独在每个辅助副本上手动恢复每个挂起的数据库。 |
After failover is forced, all secondary databases are suspended. This includes the former primary databases, after the former primary replica comes back online and discovers that it is now a secondary replica. You must manually resume each suspended database individually on each secondary replica. |
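
Resuming is done per database on the instance that hosts the secondary replica. In this sketch, `SalesDb` is a placeholder database name, not one taken from this article.

```sql
-- Run on each instance that hosts a secondary replica,
-- once for every suspended availability database.
-- [SalesDb] is a hypothetical database name.
ALTER DATABASE [SalesDb] SET HADR RESUME;
```

Resuming a former primary database discards any log that was never sent to the new primary, so this step is best taken only after deciding that the loss is acceptable.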
||||||||||||||||||||||||
以前的主副本可用后,假设其数据库没有损坏,则可以尝试管理可能的数据丢失。管理潜在数据丢失的可用方法取决于原始主副本是否已连接到新的主副本。假设原始主副本可以访问新的主实例,则会自动透明地进行重新连接。 |
Once the former primary replica is available, assuming that its databases are undamaged, you can attempt to manage the potential data loss. The available approach for managing potential data loss depends on whether the original primary replica has connected to the new primary replica. Assuming that the original primary replica can access the new primary instance, reconnecting occurs automatically and transparently. |
||||||||||||||||||||||||
已重新连接原始主副本 |
The Original Primary Replica Has Reconnected |
||||||||||||||||||||||||
通常,出现故障后,原始主副本在重新启动时便会迅速重新连接到其伙伴。重新连接后,原始主副本将成为辅助副本。其数据库将成为辅助数据库,然后进入挂起状态。除非您恢复新的辅助数据库,否则不会回滚它们。 |
Typically, after a failure, when the original primary replica restarts, it quickly reconnects to its partner. On reconnecting, the original primary replica becomes the secondary replica. Its databases become secondary databases and enter the SUSPENDED state. The new secondary databases will not be rolled back unless you resume them. |
||||||||||||||||||||||||
但是,无法访问挂起的数据库;因此,不能对其进行检查以确定恢复给定数据库时可能丢失的数据。因此,确定是恢复还是删除辅助数据库取决于您是否能够完全接受数据丢失,如下所示: |
However, the suspended databases are inaccessible, so you cannot inspect them to evaluate what data would be lost if you were to resume a given database. Therefore, the decision on whether to resume or remove a secondary database depends on whether you are willing to accept any data loss, as follows: |
||||||||||||||||||||||||
· 如果数据丢失不可接受,则应该从可用性组中删除数据库以对数据进行补救。 |
· If losing any data would be unacceptable, you should remove the databases from the availability group to salvage them. |
||||||||||||||||||||||||
数据库管理员现在可以恢复以前的主数据库,并尝试恢复可能已丢失的数据。但是,当以前的主数据库处于联机状态后,它与当前主数据库存在偏差,因此,数据库管理员需要使客户端无法访问删除的数据库或当前主要数据库,以免数据库之间出现更大偏差并防止出现客户端故障转移问题。 |
The database administrator can now recover the former primary databases and attempt to recover the data that would have been lost. However, when a former primary database comes online, it is divergent from the current primary database, so the database administrator needs to make either the removed database or the current primary database inaccessible to clients to avoid further divergence of the databases and to prevent client-failover issues. |
||||||||||||||||||||||||
· 如果数据丢失对于您的业务目标是可以接受的,您可以恢复辅助数据库。 |
· If losing data would be acceptable to your business goals, you can resume the secondary databases. |
||||||||||||||||||||||||
恢复辅助数据库会导致它如同步数据库第一步所述那样回滚。如果出现故障时日志记录在发送队列中等待,则相应的事务将会丢失,即使已提交这些事务也会如此。 |
Resuming a new secondary database causes it to be rolled back as the first step in synchronizing the database. If any log records were waiting in the send queue at the time of failure, the corresponding transactions are lost, even if they were committed. |
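
For the first option above (data loss is unacceptable), a minimal sketch of removing a suspended former primary database from the availability group and bringing it online for inspection might look like the following. `SalesDb` is a hypothetical name, and client access to one of the two divergent copies is assumed to be blocked separately.

```sql
-- Run on the instance that hosts the former primary (now secondary) replica.
-- Removes the local secondary database from the availability group.
ALTER DATABASE [SalesDb] SET HADR OFF;

-- The removed database is left in the RESTORING state;
-- recovering it brings it online so its data can be examined.
RESTORE DATABASE [SalesDb] WITH RECOVERY;
```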
||||||||||||||||||||||||
未重新连接原始主副本 |
The Original Primary Replica Has Not Reconnected |
||||||||||||||||||||||||
如果可以暂时阻止原始主副本通过网络重新连接到新的主副本,则可以检查原始主数据库以确定恢复它们时可能丢失的数据。 |
If you can temporarily prevent the original primary replica from reconnecting over the network to the new primary replica, you can inspect the original primary databases to evaluate what data would be lost if they were resumed. |
||||||||||||||||||||||||
· 如果潜在的数据丢失可以接受 |
· If the potential data loss is acceptable |
||||||||||||||||||||||||
允许原始主副本重新连接到新的主副本。重新连接会导致新的辅助数据库被挂起。要启动数据库的数据同步,只需恢复它。新的辅助副本会删除该数据库的原始恢复分叉,从而丢失从未发送到以前的辅助副本或由其接收的所有事务。 |
Allow the original primary replica to reconnect to the new primary replica. Reconnecting causes the new secondary databases to be suspended. To start data synchronization on a database, simply resume it. The new secondary replica drops the original recovery fork for that database, losing any transactions that were never sent to or received by the former secondary replica. |
||||||||||||||||||||||||
· 如果数据丢失不可接受 |
· If the data loss is unacceptable |
||||||||||||||||||||||||
如果原始主数据库包含在恢复挂起的数据库时可能丢失的重要数据,则可以从可用性组中删除它,以保留原始主数据库中的数据。这样会导致数据库进入“正在还原”状态。此时,我们建议您尝试备份已删除数据库的日志尾部。然后,通过从原始主数据库中导出要补救的数据,并将其导入当前主数据库来更新当前主数据库(以前的辅助数据库)。建议尽快对已更新的主数据库执行完整数据库备份。 |
If the original primary database contains critical data that would be lost if you resumed the suspended database, you can preserve the data on the original primary database by removing it from the availability group. This causes the database to enter the RESTORING state. At this point, we recommend that you attempt to back up the tail of the removed database's log. Then, you can update the current primary (the former secondary database) by exporting the data you want to salvage from the original primary database and importing it into the current primary database. We recommend taking a full database backup of the updated primary database as quickly as possible. |
||||||||||||||||||||||||
然后,在承载新的辅助副本的服务器实例上,您可以使用RESTORE WITH NORECOVERY来还原此备份(以及至少一个后续日志备份),从而删除挂起的辅助数据库并创建新的辅助数据库。我们建议延迟当前主数据库的其他日志备份,直到恢复相应的辅助数据库。 |
Then, on the server instance that hosts the new secondary replica, you can delete the suspended secondary database and create a new secondary database by restoring this backup (and at least one subsequent log backup) using RESTORE WITH NORECOVERY. We recommend delaying additional log backups of the current primary databases until the corresponding secondary databases are resumed. |
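
The reseeding sequence described above can be sketched as follows. The database name, availability group name, and backup paths are all placeholders, and the sketch assumes the old suspended copy has already been removed from the group and dropped on the secondary instance.

```sql
-- On the current primary: full backup plus one subsequent log backup.
-- [SalesDb], [MyAg], and the share path are hypothetical.
BACKUP DATABASE [SalesDb]
    TO DISK = N'\\backupshare\SalesDb_full.bak' WITH INIT;
BACKUP LOG [SalesDb]
    TO DISK = N'\\backupshare\SalesDb_log.trn' WITH INIT;

-- On the instance that hosts the new secondary replica:
-- restore both backups without recovery, then join the group.
RESTORE DATABASE [SalesDb]
    FROM DISK = N'\\backupshare\SalesDb_full.bak' WITH NORECOVERY;
RESTORE LOG [SalesDb]
    FROM DISK = N'\\backupshare\SalesDb_log.trn' WITH NORECOVERY;
ALTER DATABASE [SalesDb] SET HADR AVAILABILITY GROUP = [MyAg];
```

The two halves run on different server instances, so in practice they would be separate connections or scripts rather than one batch.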
||||||||||||||||||||||||
警告 |
Warning |
||||||||||||||||||||||||
在其任何辅助数据库被挂起时,事务日志截断在主数据库上被延迟。此外,只要任何本地数据库保持挂起状态,同步提交辅助副本的同步运行状况就无法转换到“正常”。 |
Transaction log truncation is delayed on a primary database while any of its secondary databases is suspended. Also, the synchronization health of a synchronous-commit secondary replica cannot transition to HEALTHY as long as any local database remains suspended. |
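
Both effects can be observed with catalog and dynamic management views. The queries below are a sketch intended to be run on the primary replica.

```sql
-- Databases whose log truncation is currently held up by an availability replica.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = N'AVAILABILITY_REPLICA';

-- Synchronization health of each replica
-- (NOT_HEALTHY, PARTIALLY_HEALTHY, or HEALTHY).
SELECT ar.replica_server_name, ars.role_desc, ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id;
```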
Source: ITPUB Blog (ITPUB博客), http://blog.itpub.net/81227/viewspace-2655027/. If you repost this article, please credit the source; otherwise legal liability will be pursued.