Incident Analysis: A ServiceStack.Redis Deadlock Triggered by a Brief Network Outage

Background

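One day a process in production raised an alert for an excessive number of active threads, and two nodes raised the same alert on the same day. The program had not raised this kind of alert in nearly a year, so something must have caused it. This article records the root-cause analysis.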

Thread count over time, as seen in monitoring:

[Figure 1: monitoring chart of the thread count]

Initial analysis and findings

A dump was saved from each abnormal process when it was restarted, which is the prerequisite for the WinDbg analysis that follows.

List the threads and any exceptions on them with !threads:

0:000> !threads
PDB symbol for clr.dll not loaded
ThreadCount:      297
UnstartedThread:  0
BackgroundThread: 228
PendingThread:    0
DeadThread:       60
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 11744 000002bb7332f750  2026020 Preemptive  000002BC775885C0:000002BC7758A518 000002bb73302f40 0     STA 
  10    2 e988 000002bb733dd610    2b220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Finalizer) 
.....omitted.....
  97  428 2cf40 000002bb05df9240  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
  98  410 1c3d8 000002bb05ee44d0  602b221 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA ServiceStack.Redis.RedisException 000002bbf5127940
  99  420 29718 000002bb05ee0e20  202b220 Preemptive  000002BE75FE6398:000002BE75FE76F8 000002bb73302f40 0     MTA 
 100  442 2e76c 000002bb05edd770    2b020 Preemptive  000002BBF5967E88:000002BBF5969E20 000002bb73302f40 1     MTA 
 101  421 108ec 000002bb05edeee0    2b020 Preemptive  000002BB75A3A208:000002BB75A3C1A0 000002bb73302f40 1     MTA 
 102  403 2d260 000002bb05eeb230    2b020 Preemptive  000002BDF5580228:000002BDF55821C0 000002bb73302f40 1     MTA 
 103  472 f008 000002bb06c28950    2b220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA 
 104  395 4f64 000002bb06b45b80  202b220 Preemptive  000002BBF6213018:000002BBF62144D0 000002bb73302f40 0     MTA 
 105  398 21be8 000002bb06b44be0    2b020 Preemptive  000002BE758DB828:000002BE758DD7C0 000002bb73302f40 1     MTA 
 117  355 113d4 000002bb06302a70  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 115  392 2852c 000002bb06a167d0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 111  399 8b4c 000002bb0633e6c0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 106  318 3de0 000002bb06b43c40  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 110  333 20530 000002bb06300b30  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 116  329 2458 000002bb06a16fa0  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 109  343 1b5f8 000002bb05cd5910  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker)  
 112  344 26bdc 000002bb06a17770  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 3     MTA (Threadpool Worker) ServiceStack.Redis.RedisException 000002bc751983b8
 107  320 1a508 000002bb0633ee90  3029220 Preemptive  0000000000000000:0000000000000000 000002bb73302f40 0     MTA (Threadpool Worker) 
 .....omitted.....

The last column shows that threads 98 and 112 have exceptions. Inspect each in turn:

0:000> ~98s

0:098> !pe
Exception object: 000002bbf5127940
Exception type:   ServiceStack.Redis.RedisException
Message:          Host:10.201.107.18,Port:26379 Exceeded timeout of 00:00:03
InnerException:   System.Net.Sockets.SocketException, Use !PrintException 000002bef5278f08 to see more.
StackTrace (generated):
    SP               IP               Function
    000000D8D77FC260 00007FFB92F54063 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)+0x2b3
    000000D8D77FE510 00007FFB92F596C4 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])+0xd4
    000000D8D77FE580 00007FFB92F595B5 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)+0x85
    000000D8D77FE5C0 00007FFB92F594F3 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)+0x13
    000000D8D77FE5F0 00007FFB92F5940D UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetSentinelInfo()+0x1d
    000000D8D77FE640 00007FFB92F5937B UNKNOWN!ServiceStack.Redis.RedisSentinel.GetSentinelInfo()+0x4b
    000000D8D77FE6A0 00007FFB9394441A UNKNOWN!ServiceStack.Redis.RedisSentinel.ResetClients()+0x2a
    000000D8D77FE740 00007FFB93944395 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)+0x1d5
    000000D8D77FE7A0 00007FFB93944127 UNKNOWN!ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.b__0(System.String, System.String)+0x267
    000000D8D77FE810 00007FFB92F5E455 UNKNOWN!ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])+0x2c5
    000000D8D77FE8A0 00007FFB92F5D681 UNKNOWN!ServiceStack.Redis.RedisSubscription.SubscribeToChannelsMatching(System.String[])+0x51
    000000D8D77FE8D0 00007FFB92F5CFC5 UNKNOWN!ServiceStack.Redis.RedisPubSubServer.RunLoop()+0x1d5
0:098> !PrintException 000002bef5278f08
Exception object: 000002bef5278f08
Exception type:   System.Net.Sockets.SocketException
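Message:          A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.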
InnerException:   
StackTrace (generated):
    SP               IP               Function
    000000D8D77FE3C0 00007FFBEE35C842 System_ni!System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)+0xe2

StackTraceString: 
HResult: 80004005

 

0:000> ~112s
0:112> !pe
Exception object: 000002bc751983b8
Exception type:   ServiceStack.Redis.RedisException
Message:          Host:10.201.107.18,Port:26379 Exceeded timeout of 00:00:03
InnerException:   ServiceStack.Redis.RedisRetryableException, Use !PrintException 000002bcf51e92e0 to see more.
StackTrace (generated):
    SP               IP               Function
    000000F42BCFBBD0 00007FFB92F54063 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)+0x2b3
    000000F42BCFDE80 00007FFB92F596C4 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])+0xd4
    000000F42BCFDEF0 00007FFB92F595B5 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)+0x85
    000000F42BCFDF30 00007FFB92F594F3 UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)+0x13
    000000F42BCFDF60 00007FFB92F5F17B UNKNOWN!ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)+0x1b

StackTraceString: 
HResult: 80131500
0:112> !pe 000002bcf51e92e0 
Exception object: 000002bcf51e92e0
Exception type:   ServiceStack.Redis.RedisRetryableException
Message:          Socket is not connected
InnerException:   
StackTrace (generated):
    SP               IP               Function
    000000F42BCFDE20 00007FFB92F53F13 UNKNOWN!ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)+0x163

StackTraceString: 
HResult: 80131500

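The exceptions point to a failed socket connection, so I checked the network during the incident and made the first valuable finding: there had been a brief network outage!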

[Figure 2: network status during the incident]

 

 

Further analysis and findings

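In principle the process should recover on its own after a brief network outage, and the thread count should not stay high. Since there are so many active threads, let's see what they are all doing: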

0:000> ~*e !clrstack

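The full stacks cannot be shown because they contain company source code. The conclusion from them, however, is that almost every thread is stuck at the step that obtains a Redis client, ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient(), and is waiting to acquire a lock.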

Since the threads are waiting on locks, look at the locks in the dump:

0:036> !syncblk
Index SyncBlock       MonitorHeld Recursion       Owning Thread Info                  SyncBlock Owner
  510 000002bb0513d4a8           79   1           000002bb073568c0 f9e8 177          000002be7520adb0 System.Object
  530 000002bb0513a1f8          219   1           000002bb06a17770 26bdc 112         000002bd74f07d30 System.Object
  567 000002bb0513e148            5   1           000002bb06ce6f40 24ce0 113         000002be74f320b8 System.Object
 1343 000002bb0513c2b8            7   1           000002bb06a17770 26bdc 112         000002bd7515b158 ServiceStack.Redis.RedisSentinelWorker
-----------------------------
Total           1369
CCW             3
RCW             2
ComClassFactory 0
Free            507

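From this output we can conclude:

  • The vast majority of threads are waiting for the objects 000002be7520adb0 and 000002bd74f07d30 to be released;
  • These two lock objects are currently held by threads 177 and 112 respectively.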
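
As a rough cross-check of "the vast majority of threads", the MonitorHeld column can be decoded: SOS counts 1 for the owner and 2 for each waiting thread, so the number of waiters is (MonitorHeld - 1) / 2. The sketch below applies that to the four rows above. The function name parse_syncblk and the regular expression are illustrative assumptions, not part of any tool.

```python
import re

# The four data rows copied from the !syncblk output above.
SYNCBLK_OUTPUT = """\
  510 000002bb0513d4a8           79   1           000002bb073568c0 f9e8 177          000002be7520adb0 System.Object
  530 000002bb0513a1f8          219   1           000002bb06a17770 26bdc 112         000002bd74f07d30 System.Object
  567 000002bb0513e148            5   1           000002bb06ce6f40 24ce0 113         000002be74f320b8 System.Object
 1343 000002bb0513c2b8            7   1           000002bb06a17770 26bdc 112         000002bd7515b158 ServiceStack.Redis.RedisSentinelWorker
"""

# Index, SyncBlock, MonitorHeld, Recursion, Thread object, OS id, debugger thread, owner object, type
ROW = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<syncblock>[0-9a-f]+)\s+(?P<held>\d+)\s+(?P<recursion>\d+)\s+"
    r"(?P<thread_obj>[0-9a-f]+)\s+(?P<osid>[0-9a-f]+)\s+(?P<thread>\d+)\s+"
    r"(?P<owner>[0-9a-f]+)\s+(?P<type>\S+)\s*$"
)


def parse_syncblk(text):
    """Return one dict per held lock, with the waiter count derived from MonitorHeld."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, separator, or summary line
        held = int(match.group("held"))
        rows.append({
            "owner_thread": int(match.group("thread")),
            "lock_object": match.group("owner"),
            "type": match.group("type"),
            "waiters": (held - 1) // 2,  # owner counts 1, each waiter counts 2
        })
    return rows


locks = parse_syncblk(SYNCBLK_OUTPUT)
for lock in locks:
    print(f"{lock['lock_object']} owner={lock['owner_thread']} waiters={lock['waiters']}")
```

For the first row, MonitorHeld = 79 gives 39 waiters on the lock owned by thread 177, and the two locks owned by thread 112 account for 109 and 3 waiters.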

So use !clrstack to look at the stacks of these two threads:

OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
000000f42bcf9870 00007ffb92f57dcb ServiceStack.DisposableExtensions.Dispose(System.Collections.Generic.IEnumerable`1, ServiceStack.Logging.ILog)
000000f42bcf98d0 00007ffb92f59223 ServiceStack.Redis.RedisSentinel.GetNextSentinel()
000000f42bcf9950 00007ffb93945d77 ServiceStack.Redis.RedisSentinel.OnSentinelError(System.Exception)
000000f42bcf9990 00007ffb92f5f1b8 ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfbae8 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfbae8] 
000000f42bcfbbd0 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000f42bcfdd38 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfdd38] 
000000f42bcfde20 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000f42bcfde80 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000f42bcfdef0 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000f42bcfdf30 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000f42bcfdf60 00007ffb92f5f17b ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfdfa0 00007ffb92f5f067 ServiceStack.Redis.RedisSentinel.GetMaster()
000000f42bcfe020 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
000000f42bcfe060 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
000000f42bcfe090 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
000000f42bcfe100 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
    ....omitted....
    
    
OS Thread Id: 0xf9e8 (177)
        Child SP               IP Call Site
0000014977dfccf8 00007ffc025967c4 [GCFrame: 0000014977dfccf8] 
0000014977dfcef0 00007ffc025967c4 [GCFrame: 0000014977dfcef0] 
0000014977dfcf48 00007ffc025967c4 [HelperMethodFrame: 0000014977dfcf48] System.Threading.Monitor.Enter(System.Object)
0000014977dfd040 00007ffb92f5f059 ServiceStack.Redis.RedisSentinel.GetMaster()
0000014977dfd0c0 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
0000014977dfd100 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
0000014977dfd130 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
0000014977dfd1a0 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
        ....omitted....

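This gives two more conclusions:

  • Thread 112 is in the middle of aborting another thread.
  • Thread 177, like the vast majority of threads seen earlier, is waiting for a lock to be released.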

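This is where things start to smell odd: remember that thread 177 itself holds the lock object 000002be7520adb0, with 39 threads waiting on it. A deadlock looks quite likely. The natural next step is to find out which lock object thread 177 is waiting for: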

0:098> ~177s
ntdll!NtWaitForMultipleObjects+0x14:
00007ffc`025967c4 c3              ret
0:177> !clrstack -l
OS Thread Id: 0xf9e8 (177)
        Child SP               IP Call Site
0000014977dfccf8 00007ffc025967c4 [GCFrame: 0000014977dfccf8] 
0000014977dfcef0 00007ffc025967c4 [GCFrame: 0000014977dfcef0] 
0000014977dfcf48 00007ffc025967c4 [HelperMethodFrame: 0000014977dfcf48] System.Threading.Monitor.Enter(System.Object)
0000014977dfd040 00007ffb92f5f059 ServiceStack.Redis.RedisSentinel.GetMaster()
    LOCALS:
         = 0x000002bd7515b158
        0x0000014977dfd080 = 0x000002bd7515b158
        
        
        
        0x0000014977dfd078 = 0x0000000000000000
            
....omitted....

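The output above shows the stack together with the local variables. To lock an object, the code has to keep a reference to it in a local variable, so 000002bd7515b158 can be taken as the address of the lock object. According to the earlier !syncblk output, this lock is currently held by thread 112. Thread 112's stack, however, shows that it is not waiting to acquire any lock, so on the evidence so far this is not yet a deadlock.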

Confirming the deadlock

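To summarize what is known so far: the vast majority of threads are blocked because of the two lock objects held by thread 112, so thread 112 deserves a closer look. The earlier analysis showed that it is aborting a thread, i.e. calling Abort(). Let's see what can be learned about the thread being aborted: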

0:112> !clrstack -a
OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
    PARAMETERS:
        this = 

000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
    PARAMETERS:
        this () = 0x000002be7520a900

000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
    PARAMETERS:
        this (0x000000f42bcf9870) = 0x000002be7520a900
    LOCALS:
        
        
....omitted....
            
            
            
0:177> !do2 000002be7520a900
0x000002be7520a900 ServiceStack.Redis.RedisPubSubServer
[statics]
  0000  startedAt                               : 000002be7511e4e8 (System.Diagnostics.Stopwatch)
  0008  heartbeatTimer                          : NULL
  0010  k__BackingField                 : NULL
  0018  k__BackingField                : NULL
  0020  k__BackingField        : NULL
  0028  k__BackingField    : NULL
  0030  k__BackingField                 : NULL
  0038  k__BackingField              : NULL
  0040  k__BackingField              : 000002be7520abe0 (System.Action)
  0048  k__BackingField       : NULL
  0050  k__BackingField          : NULL
  0058  k__BackingField                : NULL
  0060  k__BackingField             : NULL
  0068  rand                                    : 000002be7520aa30 (System.Random)
  0070  lastExMsg                               : NULL
  0078  bgThread                                : 000002be7511e550 (System.Threading.Thread)
  0080  k__BackingField         : 000002be7520a7d0 (ServiceStack.Redis.BasicRedisClientManager)
  0088  k__BackingField               : 000002be7520aa18 (System.String[]) [Length: 0]
  0090  k__BackingField       : 000002be7520abc0 (System.String[]) [Length: 1]
  0098  masterClient                            : 000002bef51d8288 (ServiceStack.Redis.RedisClient)
  00a0  lastHeartbeatTicks                      : 637793659924009545 (System.Int64)
  00a8  timesStarted                            : 1 (System.Int64)
  00b0  noOfErrors                              : 0 (System.Int64)
  00b8  bgThreadCount                           : 1 (System.Int64)
  00c0  doOperation                             : 1 (System.Int32)
  00c4  noOfContinuousErrors                    : 0 (System.Int32)
  00c8  status                                  : -1 (System.Int32)
  00cc  autoRestart                             : 0 (System.Int32)
  00d0  k__BackingField : True (System.Boolean)
  00d8  serverTimeAtStart                       : 000002be7520a9e0 2022/2/2 2:33:12 (System.DateTime)
  00e0  HeartbeatInterval                       : 000002be7520a9e8 (System.Nullable)
  00f0  HeartbeatTimeout                        : 000002be7520a9f8 00:00:30 (System.TimeSpan)
  00f8  k__BackingField  : 000002be7520aa00 (System.Nullable)
      
      
0:177> !do2 000002be7511e550
0x000002be7511e550 System.Threading.Thread
[statics]
  0000  m_Context                             : 000002bdf4f025d8 (System.Runtime.Remoting.Contexts.Context)
  0008  m_ExecutionContext                    : 000002be7511e7b0 (System.Threading.ExecutionContext)
  0010  m_Name                                : 000002be7511e618  "RedisPubSubServer 1" [19] (System.String)
  0018  m_Delegate                            : NULL
  0020  m_CurrentCulture                      : NULL
  0028  m_CurrentUICulture                    : NULL
  0030  m_ThreadStartArg                      : NULL
  0038  DONT_USE_InternalThread               : 000002bb05ee44d0 (System.IntPtr)
  0040  m_Priority                            : 2 (System.Int32)
  0044  m_ManagedThreadId                     : 410 (System.Int32)
  0048  m_ExecutionContextBelongsToOuterScope : False (System.Boolean)

Combining the dump above with the decompiled source, the relationship is: thread 112 called RedisPubSubServer.KillBgThreadIfExists(), which in turn called this.bgThread.Abort().

The dump shows that the RedisPubSubServer object is 000002be7520a900 and its bgThread object is 000002be7511e550. The managed thread ID of bgThread is 410, which the !threads output maps to debugger thread 98.
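
The lookup from managed thread ID to debugger thread number can be sketched as follows. This is only an illustration: the helper name index_by_managed_id is hypothetical, and only the first four columns of a few !threads rows are reproduced. It also cross-checks that DONT_USE_InternalThread in the Thread object equals the ThreadOBJ column of the matched row.

```python
# First four columns (debugger number, managed ID, OSID, ThreadOBJ) of a few
# rows copied from the !threads output earlier; other columns are dropped.
THREADS_ROWS = """\
  97  428 2cf40 000002bb05df9240
  98  410 1c3d8 000002bb05ee44d0
  99  420 29718 000002bb05ee0e20
 112  344 26bdc 000002bb06a17770
"""


def index_by_managed_id(text):
    """Map managed thread ID -> debugger thread number, OSID and ThreadOBJ."""
    threads = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and any row whose first two columns are not plain numbers.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        threads[int(fields[1])] = {
            "debugger_number": int(fields[0]),
            "osid": fields[2],
            "thread_obj": fields[3],
        }
    return threads


threads = index_by_managed_id(THREADS_ROWS)

# Values read from the !do2 output of the bgThread object.
managed_thread_id = 410
internal_thread = "000002bb05ee44d0"  # DONT_USE_InternalThread

bg_thread = threads[managed_thread_id]
print(bg_thread["debugger_number"], bg_thread["thread_obj"] == internal_thread)
```

Both pieces of evidence agree: managed ID 410 and the internal thread pointer each identify debugger thread 98.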


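So the conclusion is that thread 112 is trying to abort thread 98. Here something smells odd again: the earlier analysis concluded that the vast majority of threads are blocked on locks held by thread 112. What about thread 98? Let's look at its stack: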

0:098> !clrstack -l
OS Thread Id: 0x1c3d8 (98)
        Child SP               IP Call Site
000000d8d77f87a8 00007ffc025967c4 [GCFrame: 000000d8d77f87a8] 
000000d8d77f89a0 00007ffc025967c4 [GCFrame: 000000d8d77f89a0] 
000000d8d77f89f8 00007ffc025967c4 [HelperMethodFrame: 000000d8d77f89f8] System.Threading.Monitor.Enter(System.Object)
000000d8d77f8af0 00007ffb92f59371 ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
    LOCALS:
         = 0x000002bd7515b158
        0x000000d8d77f8b28 = 0x000002bd7515b158
        
        0x000000d8d77f8b20 = 0x0000000000000000

000000d8d77f8b50 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
    LOCALS:
        
        
        
....omitted....

Sure enough, thread 98 is also waiting for the lock object 000002bd7515b158 to be released, and that object is held by thread 112!

To restate the situation established so far:

  • Thread 112 is aborting thread 98
  • Thread 98 is waiting for a lock held by thread 112 to be released

This certainly smells like a deadlock, but I had never actually verified this scenario before. A quick Google search made everything clear:

[Figure 3: screenshot of the search result]
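
The two facts restated above form a cycle once the abort is treated as a wait. The .NET Framework documentation for Thread.Abort notes that the caller may block while the target thread is in a protected region such as a catch or finally block, and that a deadlock can occur if the caller holds a lock the target needs. Under that reading, the situation can be written as a small wait-for graph. The sketch below is illustrative only; the name find_cycle and the graph encoding are assumptions made for this example.

```python
# Wait-for graph taken from the dump: key waits for value.
WAITS_FOR = {
    98: 112,   # Monitor.Enter on 000002bd7515b158, which thread 112 owns
    112: 98,   # Thread.Abort() has not returned; it is waiting on thread 98
    177: 112,  # Monitor.Enter on the same lock object
}


def find_cycle(waits_for, start):
    """Follow the wait chain from `start`; return the cycle as a list, or None."""
    position = {}  # thread -> index in path
    path = []
    current = start
    while current is not None and current not in position:
        position[current] = len(path)
        path.append(current)
        current = waits_for.get(current)
    if current is None:
        return None  # the chain ends at a thread that is not waiting
    return path[position[current]:]


print(find_cycle(WAITS_FOR, 177))
```

Starting from thread 177, the chain runs into the cycle between threads 112 and 98, which is why thread 177 and everything queued behind it can never make progress.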

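I also checked when the deadlocked thread 112 was executing, by using !clrstack -a to inspect the parameters and locals on its stack. Combined with the business system's logs, this confirmed that the deadlock occurred at exactly the time of the network outage.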

Cause of the deadlock

With the deadlock confirmed, the next step is to work out the logic that produced it.

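Based on the stack and the ServiceStack.Redis source, thread 98 did the following:

  • The process starts and runs RedisPubSubServer.RunLoop();
  • The network drops briefly, so the client connections have to be reset: RedisSentinel.ResetClients();
  • RedisSentinel.GetSentinelInfo() then needs to acquire a lock and blocks waiting for it;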
OS Thread Id: 0x1c3d8 (98)
        Child SP               IP Call Site
000000d8d77f87a8 00007ffc025967c4 [GCFrame: 000000d8d77f87a8] 
000000d8d77f89a0 00007ffc025967c4 [GCFrame: 000000d8d77f89a0] 
000000d8d77f89f8 00007ffc025967c4 [HelperMethodFrame: 000000d8d77f89f8] System.Threading.Monitor.Enter(System.Object)
000000d8d77f8af0 00007ffb92f59371 ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
000000d8d77f8b50 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
000000d8d77f8bf0 00007ffb93944395 ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)
000000d8d77f8c50 00007ffb93944127 ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.b__0(System.String, System.String)
000000d8d77f8cc0 00007ffb92f5e455 ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])
000000d8d77f8d50 00007ffb938dcd72 ServiceStack.Redis.RedisSubscription.UnSubscribeFromAllChannelsMatchingAnyPatterns()
000000d8d77f8d90 00007ffb92f5d0f9 ServiceStack.Redis.RedisPubSubServer.RunLoop()
000000d8d77fc178 00007ffbf225a626 [HelperMethodFrame: 000000d8d77fc178] 
000000d8d77fc260 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000d8d77fe3c8 00007ffbf225a626 [HelperMethodFrame: 000000d8d77fe3c8] 
000000d8d77fe4b0 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000d8d77fe510 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000d8d77fe580 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000d8d77fe5c0 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000d8d77fe5f0 00007ffb92f5940d ServiceStack.Redis.RedisSentinelWorker.GetSentinelInfo()
000000d8d77fe640 00007ffb92f5937b ServiceStack.Redis.RedisSentinel.GetSentinelInfo()
000000d8d77fe6a0 00007ffb9394441a ServiceStack.Redis.RedisSentinel.ResetClients()
000000d8d77fe740 00007ffb93944395 ServiceStack.Redis.RedisSentinelWorker.SentinelMessageReceived(System.String, System.String)
000000d8d77fe7a0 00007ffb93944127 ServiceStack.Redis.RedisPubSubServer+<>c__DisplayClass98_0.b__0(System.String, System.String)
000000d8d77fe810 00007ffb92f5e455 ServiceStack.Redis.RedisSubscription.ParseSubscriptionResults(Byte[][])
000000d8d77fe8a0 00007ffb92f5d681 ServiceStack.Redis.RedisSubscription.SubscribeToChannelsMatching(System.String[])
000000d8d77fe8d0 00007ffb92f5cfc5 ServiceStack.Redis.RedisPubSubServer.RunLoop()
000000d8d77fe940 00007ffbeff6df12 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d8d77fea10 00007ffbeff6dd95 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000d8d77fea40 00007ffbeff6dd65 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000d8d77fea90 00007ffbf0013e85 System.Threading.ThreadHelper.ThreadStart()
000000d8d77fece0 00007ffbf2216913 [GCFrame: 000000d8d77fece0] 
000000d8d77ff040 00007ffbf2216913 [DebuggerU2MCatchHandlerFrame: 000000d8d77ff040]

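Thread 112 did the following:

  • Business code requests a Redis client instance, which calls PooledRedisClientManager.GetClient()
  • The brief network outage raises an exception, which triggers the OnSentinelError event
  • The OnSentinelError handler calls Dispose()
  • Dispose() includes aborting the existing background thread bgThread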
OS Thread Id: 0x26bdc (112)
        Child SP               IP Call Site
000000f42bcf96a8 00007ffc025967c4 [HelperMethodFrame_1OBJ: 000000f42bcf96a8] System.Threading.Thread.AbortInternal()
000000f42bcf97a0 00007ffbf06a912a System.Threading.Thread.Abort()
000000f42bcf97e0 00007ffb92f5c8e9 ServiceStack.Redis.RedisPubSubServer.KillBgThreadIfExists()
000000f42bcf9820 00007ffb938dc511 ServiceStack.Redis.RedisPubSubServer.Dispose()
000000f42bcf9870 00007ffb92f57dcb ServiceStack.DisposableExtensions.Dispose(System.Collections.Generic.IEnumerable`1, ServiceStack.Logging.ILog)
000000f42bcf98d0 00007ffb92f59223 ServiceStack.Redis.RedisSentinel.GetNextSentinel()
000000f42bcf9950 00007ffb93945d77 ServiceStack.Redis.RedisSentinel.OnSentinelError(System.Exception)
000000f42bcf9990 00007ffb92f5f1b8 ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfbae8 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfbae8] 
000000f42bcfbbd0 00007ffb92f54063 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000f42bcfdd38 00007ffbf225a626 [HelperMethodFrame: 000000f42bcfdd38] 
000000f42bcfde20 00007ffb92f53f13 ServiceStack.Redis.RedisNativeClient.SendReceive[[System.__Canon, mscorlib]](Byte[][], System.Func`1, System.Action`1>, Boolean)
000000f42bcfde80 00007ffb92f596c4 ServiceStack.Redis.RedisNativeClient.SendExpectMultiData(Byte[][])
000000f42bcfdef0 00007ffb92f595b5 ServiceStack.Redis.RedisNativeClient.SentinelGetMasterAddrByName(System.String)
000000f42bcfdf30 00007ffb92f594f3 ServiceStack.Redis.RedisSentinelWorker.GetMasterHostInternal(System.String)
000000f42bcfdf60 00007ffb92f5f17b ServiceStack.Redis.RedisSentinelWorker.GetMasterHost(System.String)
000000f42bcfdfa0 00007ffb92f5f067 ServiceStack.Redis.RedisSentinel.GetMaster()
000000f42bcfe020 00007ffb92f5efc6 ServiceStack.Redis.RedisSentinelResolver.GetReadWriteHost(Int32)
000000f42bcfe060 00007ffb92f5ef70 ServiceStack.Redis.RedisSentinelResolver.CreateMasterClient(Int32)
000000f42bcfe090 00007ffb92f5ee0d ServiceStack.Redis.PooledRedisClientManager.CreateRedisClient()
000000f42bcfe100 00007ffb92f5eb3c ServiceStack.Redis.PooledRedisClientManager.GetClient()
....omitted....

A simplified diagram of the logic described above:

[Figure 4: simplified logic diagram of the deadlock]
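
The same interaction can be modelled in a few lines. This is only a model under stated assumptions, not ServiceStack.Redis code: Python has no Thread.Abort, so the abort is represented by waiting on an event that the background thread sets once it has finished its error handling, and timeouts stand in for "blocked forever" so that the script terminates. All names (simulate, sentinel_lock, and so on) are hypothetical. The second variant, which stops the background thread only after releasing the lock, is a generic way to break the cycle and is not necessarily what the official fix does.

```python
import threading


def simulate(stop_inside_lock):
    """Model the caller (thread 112) and the background thread (thread 98)."""
    sentinel_lock = threading.Lock()       # stands in for the lock owned by thread 112
    lock_taken = threading.Event()         # orders the two threads deterministically
    finished_handling = threading.Event()  # set when the background thread can be stopped
    outcome = {}

    def background():
        lock_taken.wait()
        # Error handling needs the lock; the timeout stands in for waiting forever.
        got_lock = sentinel_lock.acquire(timeout=0.1)
        outcome["background_got_lock"] = got_lock
        if got_lock:
            sentinel_lock.release()
            finished_handling.set()

    def caller():
        sentinel_lock.acquire()
        lock_taken.set()
        if stop_inside_lock:
            # Like calling Abort() while still holding the lock.
            outcome["stop_completed"] = finished_handling.wait(timeout=0.5)
            sentinel_lock.release()
        else:
            # Release first, then wait for the background thread to finish.
            sentinel_lock.release()
            outcome["stop_completed"] = finished_handling.wait(timeout=0.5)

    workers = [threading.Thread(target=background), threading.Thread(target=caller)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return outcome


deadlocked = simulate(stop_inside_lock=True)
safe = simulate(stop_inside_lock=False)
print("stop while holding the lock:", deadlocked)
print("stop after releasing the lock:", safe)
```

In the first run neither side makes progress until its timeout expires, which mirrors the state captured in the dump. In the second run both sides complete.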

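With the cause of the deadlock established, a quick search turns up the same issue reported on the official forum at https://forums.servicestack.net/t/application-appears-to-be-deadlocked-after-sentinel-failover/5782, and the maintainers have made a corresponding fix.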

Summary

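The analysis reads smoothly in this article only because I have reorganized the steps according to how they relate to one another. The actual investigation took a fair amount of time.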

Still, familiarity with WinDbg and a careful reading of the ServiceStack.Redis source made it possible to pin down the root cause of this incident.

 
