用windbg分析一个dead lock的问题

难得Winform项目中碰到dead lock,记录一下。

QA报告说,有时候晚上跑完自动化脚本,第二天早上来发现系统hang在屏保界面没反应,从日志看也没有报错。这种属于很少才会发生,也不知道怎么重现,但是很严重的bug,于是抓个dump来研究一下。

# Windbg加载dump文件后的一些文件信息

Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64

Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [F:\DeadLock\Hang.DMP]

User Mini Dump File with Full Memory: Only application data is available

Symbol search path is: *** Invalid ***

****************************************************************************

* Symbol loading may be unreliable without a symbol search path. *

* Use .symfix to have the debugger choose a symbol path. *

* After setting your symbol path, use .reload to refresh symbol locations. *

****************************************************************************

Executable search path is:

Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64

Product: WinNt, suite: EmbeddedNT SingleUserTS

Machine Name:

Debug session time: Thu Feb 27 02:47:42.000 2014 (UTC + 8:00)

System Uptime: 4 days 16:41:57.807

Process Uptime: 4 days 16:41:28.000

................................................................

Loading unloaded module list

................................................................

*** ERROR: Symbol file could not be found. Defaulted to export symbols for ntdll.dll -

ntdll!NtWaitForMultipleObjects+0xa:

00000000`7706186a c3 ret

# 设置symbol path

0:000> .sympath srv*c:\MyServerSymbols*http://msdl.microsoft.com/download/symbols

Symbol search path is: srv*c:\MyServerSymbols*http://msdl.microsoft.com/download/symbols

Expanded Symbol search path is: srv*c:\myserversymbols*http://msdl.microsoft.com/download/symbols

#手动load SOS.dll,因为我们的调试机和测试机上的SOS不是同一个版本,所以不用.loadby sos clr

0:000> .load F:\SOS.dll\1008\SOS.dll

#看看线程都在干啥

0:000> ~* e!clrstack

OS Thread Id: 0xfe8 (0)

Child SP IP Call Site

00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]

00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]

00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)

00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)

00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])

00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)

00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)

00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)

00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)

00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)

00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)

......<cut to save space>......

OS Thread Id: 0x2180 (132)

Child SP IP Call Site

0000000043cad940 000000007706186a [GCFrame: 0000000043cad940]

0000000043cadb98 000000007706186a [GCFrame: 0000000043cadb98]

0000000043cadb48 000000007706186a [HelperMethodFrame: 0000000043cadb48] System.Threading.Monitor.Enter(System.Object)

0000000043cadca0 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()

0000000043cadd30 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)

0000000043cadda0 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)

0000000043cade20 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)

0000000043cae548 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 0000000043cae548] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)

0000000043cae670 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

0000000043cae6d0 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)

0000000043cae948 000007fef7ba44c4 [GCFrame: 0000000043cae948]

0000000043caeb20 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 0000000043caeb20]

......<cut to save space>......

OS Thread Id: 0x363c (1076)

Child SP IP Call Site

00000000d190e7a0 000000007706186a [GCFrame: 00000000d190e7a0]

00000000d190e9f8 000000007706186a [GCFrame: 00000000d190e9f8]

00000000d190e9a8 000000007706186a [HelperMethodFrame: 00000000d190e9a8] System.Threading.Monitor.Enter(System.Object)

00000000d190eb00 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()

00000000d190eb90 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)

00000000d190ec00 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)

00000000d190ec80 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)

00000000d190f3a8 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 00000000d190f3a8] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)

00000000d190f4d0 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)

00000000d190f530 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)

00000000d190f7a8 000007fef7ba44c4 [GCFrame: 00000000d190f7a8]

00000000d190f980 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 00000000d190f980]

......<cut to save space>......

OS Thread Id: 0x11a0 (1120)

Failed to start stack walk: 80004005

结果发现Windbg异常繁忙,打印了很久才停下来,1120个线程…正常情况下应该是百来个线程。而且从堆栈上来看中间绝大多数(好几百个)线程的call stack都是和132号线程是一样的,看上去都在等一个什么锁。

  #等啥呢?

0:000> !syncblk

Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner

1453 0000000039f170a8 21 1 000000002e74f8e0 1a3c 82 00000000143cccd0 System.Object

3711 000000002dba2568 1823 1 0000000067884fe0 1920 151 0000000011271e58 System.Collections.Hashtable

-----------------------------

Total 4562

CCW 551

RCW 21

ComClassFactory 14

Free 1196

也就是说151号线程拥有0000000011271e58这个hashtable对象的锁,然后还有 823个线程在等这个锁!!!

(可以参考Tess对!syncblk的解释http://blogs.msdn.com/b/tess/archive/2006/01/09/a-hang-scenario-locks-and-critical-sections.aspx)

#先看看151号线程

0:000> ~151 e!clrstack

OS Thread Id: 0x1920 (151)

Child SP IP Call Site

000000005ec9dbb8 000000007706186a [HelperMethodFrame_1OBJ: 000000005ec9dbb8] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)

000000005ec9dcf0 000007fef6c1c094 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)*** WARNING: Unable to verify checksum for mscorlib.ni.dll

000000005ec9dd30 000007fef18b01ad System.Windows.Forms.Control.WaitForWaitHandle(System.Threading.WaitHandle)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

000000005ec9de20 000007fef12a41e2 System.Windows.Forms.Control.MarshaledInvoke(System.Windows.Forms.Control, System.Delegate, System.Object[], Boolean)

000000005ec9dfa0 000007fef18b2d99 System.Windows.Forms.Control.Invoke(System.Delegate, System.Object[])

000000005ec9e020 000007ff02ca8b34 TheSystem.CCDialog.CloseModalDialogs()***

000000005ec9e0b0 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)*** ERROR: Module load completed but symbols could not be loaded for TheSystem.dll

000000005ec9e120 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)*** WARNING: Unable to verify checksum for System.ni.dll

......<cut to save space>......

#等UI线程呢,看看UI线程在干啥。

0:000> !clrstack

OS Thread Id: 0xfe8 (0)

Child SP IP Call Site

00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]

00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]

00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)

00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)

00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)*** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll

00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])

00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)

00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)

00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)

00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)

00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)

00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)

......<cut to save space>......

同时!do了一下hashtable对象0000000011271e58,发现里面有三个dialog,和log里面的一致。接下来看代码:

拥有锁的151号线程 - TheSystem.CCDialog.CloseModalDialogs()

source1

 

UI线程  TheSystem.CCDialog.OnClosed(System.EventArgs)

2

按照系统的逻辑,一段时间没有用户操作后会自动logout,如果自动logout前存在modal dialog,那么先自动close modal dialog,然后再logout。正常情况下logout前最多只有1个modal dialog,问题就出在出问题时logout前有3个model dialog,

 

那么Hang的过程是怎么产生的呢? 151号线程获取了锁,然后Invoke UI线程去Close了第一个Dialog,注意Invoke完了之后还是继续回到本线程执行,同时这个时候UI线程去响应WM_CLOSE消息然后调用OnClosed () 方法等锁, 占着锁的 151号线程接着去Invoke UI线程Close下一个dialog.

 

也就是说151号线程占有了锁,然后调Invoke—同步调用,去调用UI线程。但是UI线程呢,在等151号线程占着的锁然后才能往下执行,但是151号线程一直占着锁不释放,所以UI线程也一直在那里等着 – 死锁, dead lock。 正常情况下,只有一个dialog的时候,Invoke()完了就直接退出了,不用等下次Invoke(),所以不会有死锁。

 

那么1120个线程又是怎么回事呢?因为TheSystem.CCDialog.CloseModalDialogs()是被timer触发的,所以每过一段时间就有一个新的线程起来挂在那里等那个锁,所以我们才会看到有那么多数量的线程。

 

解决方案有多种,最简单的就是把Invoke改成BeginInvoke,这样非UI线程可以很快退出,把锁让给UI线程。当然也可以把CloseModalDialogs()放在UI线程里面调用。

你可能感兴趣的:(Lock)