A friend of mine recently interviewed at Evergrande (恒大), and the interviewer put forward the following view:

Under synchronous commit, data can be lost in two ways: loss caused by blocking, and synchronization failure.

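His proposed remedies: for loss caused by blocking, kill the blocking process or restart the instance; for synchronization failure, the only option is to rebuild the node.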

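Let's look at how Microsoft's official documentation describes the AlwaysOn synchronous-commit mode. Reading the original English is the most precise way to understand it: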

https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/availability-modes-always-on-availability-groups?view=sql-server-2017#SyncCommitAvMode

In Always On availability groups, the availability mode is a replica property that determines whether a given availability replica can run in synchronous-commit mode. For each availability replica, the availability mode must be configured for either synchronous-commit mode, asynchronous-commit, or configuration only mode. If the primary replica is configured for asynchronous-commit mode, it does not wait for any secondary replica to write incoming transaction log records to disk (to harden the log). If a given secondary replica is configured for asynchronous-commit mode, the primary replica does not wait for that secondary replica to harden the log. If both the primary replica and a given secondary replica are both configured for synchronous-commit mode, the primary replica waits for the secondary replica to confirm that it has hardened the log (unless the secondary replica fails to ping the primary replica within the primary's session-timeout period).

Note:
If primary's session-timeout period is exceeded by a secondary replica, the primary replica temporarily shifts into asynchronous-commit mode for that secondary replica. When the secondary replica reconnects with the primary replica, they resume synchronous-commit mode.


How Synchronization Works on a Secondary Replica

Under the synchronous-commit mode, after a secondary replica joins the availability group and establishes a session with the primary replica, the secondary replica writes incoming log records to disk (hardens the log) and sends a confirmation message to the primary replica. Once the hardened log on the secondary database has caught up the end of log on the primary database, the state of the secondary database is set to SYNCHRONIZED. The time required for synchronization depends essentially on how far the secondary database was behind the primary database at the start of the session (measured by the number of log records initially received from the primary replica), the work load on the primary database, and the speed of the computer of the server instance that hosts the secondary replica.

Synchronous operation is maintained in the following manner:

    1.On receiving a transaction from a client, the primary replica writes the log for the transaction to the transaction log and concurrently sends the log record to the secondary replicas.

    2.Once a log record is written to the transaction log of the primary database, the transaction can be undone only if there is a failover at this point to a secondary that did not receive the log. The primary replica waits for confirmation from the synchronous-commit secondary replica.

    3.The secondary replica hardens the log and returns an acknowledgement to the primary replica.

    4.On receiving the confirmation from the secondary replica, the primary replica finishes the commit processing and sends a confirmation message to the client.

    Note:
    If a synchronous-commit secondary replica times out without confirming that it has hardened the log, the primary marks that secondary replica as failed. The connected state of the secondary replica changes to DISCONNECTED, and the primary replica stops waiting for confirmation from the secondary replica. This behavior ensures that a failed synchronous-commit secondary replica does not prevent hardening of the transaction log on the primary replica.

    Synchronous-commit mode protects your data by requiring the data to be synchronized between two places, at the cost of somewhat increasing the latency of the transaction.
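
To make the quoted steps concrete, here is a minimal Python sketch of the commit path. It is only a toy model, not how SQL Server is implemented: the `Primary` and `Secondary` classes, the `ack_delay` field, and the replica name are all hypothetical, and the network wait is simulated by comparing a number against the session timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Secondary:
    name: str
    ack_delay: float              # simulated seconds until the hardening ack arrives
    log: list = field(default_factory=list)
    state: str = "CONNECTED"


@dataclass
class Primary:
    session_timeout: float        # simulated session-timeout period, in seconds
    secondaries: list
    log: list = field(default_factory=list)

    def commit(self, record: str) -> bool:
        # Step 1: write the log record locally and send it to the secondaries.
        self.log.append(record)
        for sec in self.secondaries:
            if sec.state == "DISCONNECTED":
                continue          # the primary no longer waits for this replica
            if sec.ack_delay > self.session_timeout:
                # No ack within the timeout: mark the replica failed and stop waiting.
                sec.state = "DISCONNECTED"
                continue
            # Steps 2-3: the secondary hardens the record and acknowledges it.
            sec.log.append(record)
        # Step 4: finish the commit and confirm to the client.
        return True


secondary = Secondary(name="replica-2", ack_delay=0.01)
primary = Primary(session_timeout=10.0, secondaries=[secondary])

primary.commit("tx-1")            # acknowledged in time
secondary.ack_delay = 30.0        # simulate a stalled network or secondary
primary.commit("tx-2")            # still commits on the primary

print(secondary.state)            # DISCONNECTED
print(primary.log)                # ['tx-1', 'tx-2']
print(secondary.log)              # ['tx-1']
```

In this model the primary keeps committing after the timeout, so the secondary falls behind: a forced failover to `replica-2` at that point would not contain `tx-2`.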


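I thought this over for a while. With delayed transaction durability or memory-optimized tables in use, if the primary goes down and you fail over to the secondary, data can be lost. Physical bad blocks on the primary also mean lost data. When there is no failover to the secondary, the only thing "loss" can mean is lag on the secondary. Lag depends on how disciplined your operations are; even database mirroring produces lag, for example when you rebuild an index online on a large table. I generally avoid doing that and perform index maintenance repeatedly instead.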

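The "synchronization loss" he described is probably seen from the application's point of view. When there is a network problem between the primary and the secondary, or the secondary misbehaves, the application times out, gets an error back, and the write simply fails. That is not data loss, and certainly not loss on the database side.

What he calls blocking and synchronization both come down to lag causing a handshake timeout, after which the application gets a failure back. All I can say is that his application-side transaction control is not rigorous: the logic was not committed as a single transaction.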

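When I used to write games, many of our stored procedures (SPs) had no transactions at all.

For example, an SP would first write the character data, then add equipment, then write a log entry.
These three steps were not wrapped in a transaction.
If a failure hit at just the wrong moment, the first two statements were committed but the log entry was never written.
We did run into this in practice.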

So there are two cases: one on his application side and one on the database side. I suspect the data loss he ran into was one of these two.


For troubleshooting latency in synchronous-commit mode, see:

https://blogs.msdn.microsoft.com/psssql/2018/04/05/troubleshooting-data-movement-latency-between-synchronous-commit-always-on-availability-groups/

https://blogs.msdn.microsoft.com/alwaysonpro/2018/02/06/analyze-synchronous-commit-impact-on-high-commit-rate-workloads/

https://channel9.msdn.com/Series/SQL-Workshops/AlwaysOn-Availability-Groups-Synchronous-Replica-Readable-Secondary-Data-Access-Latency

https://blogs.msdn.microsoft.com/sql_server_team/new-in-ssms-always-on-availability-group-latency-reports/