虫趣:抓一个Intel显卡驱动的臭虫

引用注明>> 【作者:张佩】【原文:www.yiiyee.cn/blog

同事给我发过来一个dump文件,说是在做S4压力测试的时候挂掉的。蓝屏号0x9F。因为前几天刚好解决了一个0x9F的蓝屏,怀疑是不是类似问题?拿过来一分析,风牛马不相及。可见同样是简称LYF,对应的可能是刘亦菲,也可能是罗玉凤。

0: kd> !analyze -v
************************************************************** 
*                        Bugcheck Analysis  
*                                    
***************************************************************

DRIVER_POWER_STATE_FAILURE (9f)
A driver has failed to complete a power IRP within a specific time (usually 10 minutes).
Arguments:
Arg1: 0000000000000004, The power transition timed out waiting to synchronize with the Pnp subsystem.
Arg2: 000000000000012c, Timeout in seconds.
Arg3: ffffe00000d52040, The thread currently holding on to the Pnp lock.
Arg4: fffff8016ac79980, nt!TRIAGE_9F_PNP on Win7

Debugging Details:
------------------

Implicit thread is now ffffe000`00d52040

DRVPOWERSTATE_SUBCODE:  4

FAULTING_THREAD:  ffffe00000d52040

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0x9F

PROCESS_NAME:  System

CURRENT_IRQL:  2

自动分析结果中,第一个参数是4(Arg1:4),说明部分讲:某个设备在进行电源状态切换的时候,系统对它的等待超时了(超时值为12c秒,5分钟)。

对于所有的电源问题错误,一个必胜的法宝是:先找到IRP,次找到调用栈,最后定位问题。所以IRP对于解决这类问题,是多么地重要!

但这个问题却让我比较头痛,自动分析掉链子了,给出的一堆云里雾里的分析,全无章法,竟然没有把IRP地址给列出来!这里给MSFT的Windbg组的大佬们提一个建议:所有0x9F的BSOD,请自动分析的时候,自觉显示IRP!

既然MSFT没有把IRP给我,爷只好靠自己动手丰衣足食了。所以我的第一步是找到IRP!怎么找呢?说真的,常规武器还真不行。要是把Windbg帮助文档里面的几千个命令一一试一遍,怎样?负责任地告诉你,没有用的。因为这里要用到的一个命令,是未文档的(!poaction)。

搞什么飞机!这么重要的命令,这么多年了,MSFT你怎么还不文档化!

0: kd> !poaction
PopAction: fffff801692b9c40
  State..........: 2 - Winlogon callout
  Updates........: 0 
  Action.........: Sleep
  Lightest State.: Hibernate
  Flags..........: 80000003 QueryApps|UIAllowed|Critical
  Irp minor......: ??
  System State...: Unspecified
  Hiber Context..: 0000000000000000

Allocated power irps (PopIrpList - fffff801692c37f0)
  IRP: ffffe00000f33bd0 (wait-wake/S4), PDO: ffffe000003d56a0
  IRP: ffffe00000ef0c10 (wait-wake/S4), PDO: ffffe0000499a060
  IRP: ffffe00000b82010 (wait-wake/S0), PDO: ffffe000042ee620
  IRP: ffffe0000182f010 (wait-wake/S4), PDO: ffffe00004970060
  IRP: ffffe00000b76010 (wait-wake/S4), PDO: ffffe0000433e050
  IRP: ffffe00004979860 (set/D3,), PDO: ffffe0000119c060, CURRENT: ffffe0000119c060

Irp worker threads (PopIrpThreadList - fffff801692c2a70)
  THREAD: ffffe0000039b880 (static)
  THREAD: ffffe0000039b040 (static), IRP: ffffe00004979860, DEVICE: ffffe00003161200
  THREAD: ffffe00000c23040 (dynamic)
  THREAD: ffffe000017ae040 (dynamic)
  THREAD: ffffe00000ddb880 (dynamic)
  THREAD: ffffe0000170e880 (dynamic)
  THREAD: ffffe000016ec880 (dynamic)
  THREAD: ffffe000017c7880 (dynamic)

这个命令列出了所有当前执行中的电源IRP,以及处理这些IRP的各个线程。一共六个IRP,前面五个都是IRP_MN_WAIT_WAKE子类型的。IRP_MN_WAIT_WAKE这个类型的IRP比较特殊,相当于一个门卫一样,用来检测系统状态,并及时唤醒设备用的。当某个设备进入休眠状态的时候,它时刻考虑着,我不能睡得太久,是不是要醒过来工作了。所以就发送此IRP给物理设备对象(PDO,由总线驱动创建),一旦物理设备对象觉察到了特定的外部事件,就会通过这个IRP来通知功能驱动及时醒来。这也从反面证明了发送这五个IRP的相关设备已经进入了休眠状态,否则不会发送IRP_MN_WAIT_WAKE这个IRP。

目光落在最后一个IRP,它独具一格地落在末尾,子类型是IRP_MN_SET_POWER,用来设置新的电源状态,新状态是D3,也就是要休眠、或掉电、或Power Off。虽然不能马上断定它就是中了彩票的IRP,但因为现在疑犯只剩它一个,所以不怀疑除非是傻掉了。

相对应地,在线程列表中,能找到处理这个IRP的线程,以及发送此IRP的设备对象(ffffe00003161200)。下面分别针对这些指针,简单列出信息。

0: kd> !irp ffffe00004979860
Irp is active with 6 stacks 3 is current (= 0xffffe000049799c0)
 No Mdl: No System Buffer: Thread 00000000:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    
      Args: 00000000 00000000 00000000 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-00000000    
      Args: 00000000 00000000 00000000 00000000
>[ 16, 2]   0 e0 ffffe0000119c060 00000000 fffff800004071e4-fffff800004064a8 Success Error Cancel 
	       \Driver\pci	ACPI!ACPIDeviceIrpDeviceFilterRequest
      Args: 00000000 00000001 00000004 00000000
 [ 16, 2]   0 e1 ffffe00001194c60 00000000 fffff80001af2f9c-ffffd000201089d0 Success Error Cancel pending
	       \Driver\ACPI	dxgkrnl!DpiFdoPowerCompletionRoutine 
      Args: 00000000 00000001 00000004 00000000
 [ 16, 2]   0 e1 ffffe00003161200 00000000 fffff8016912d0e0-ffffe00004926770 Success Error Cancel pending
	      *** ERROR: Symbol file could not be found.  Defaulted to export symbols for igdkmd64.sys - 
 \Driver\igfx	nt!PopRequestCompletion
      Args: 00000000 00000001 00000004 00000000
 [  0, 0]   0  0 00000000 00000000 00000000-ffffe00004926770          Args: 00000000 00000000 00000000 00000000

0: kd> !devstack ffffe00003161200
  !DevObj   !DrvObj            !DevExt   ObjectName
> ffffe00003161200  \Driver\igfx       ffffe00003161350  
  ffffe00001194c60  \Driver\ACPI       ffffe00001095010  
  ffffe0000119c060  \Driver\pci        ffffe0000119c1b0  NTPNP_PCI0001
!DevNode ffffe00001193010 :
  DeviceInst is "PCI\VEN_8086&DEV_0A16&SUBSYS_397817AA&REV_0B\3&11583659&1&10"
  ServiceName is "IGFX"

把上面的信息汇总之后,就能得到阶段性的结论。这是一个Intel集成显卡的设备栈(iGfx->ACPI->PCI),当前的电源IRP正由PCI总线处理,iGfx和ACPI都在等待状态。下面是线程的调用栈。

0: kd> !thread ffffe0000039b040
THREAD ffffe0000039b040  Cid 0004.0010  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (DelayExecution) KernelMode Non-Alertable
    ffffffffffffffff  NotificationEvent
Not impersonating
DeviceMap                 ffffc0000000c350
Owning Process            ffffe0000027c2c0       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      3249762        Ticks: 1 (0:00:00:00.015)
Context Switch Count      22848          IdealProcessor: 0             
UserTime                  00:00:00.000
KernelTime 00:00:45.500
Win32 Start Address nt!PopIrpWorker (0xfffff8016912d9a8)
Stack Init ffffd00020108c90 Current ffffd000201083d0
Base ffffd00020109000 Limit ffffd00020103000 Call 0
Priority 13 BasePriority 13 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
ffffd000`20108410 fffff801`69054b1e : fffff801`69306180 ffffe000`0039b040 00000000`fffffffe 00000000`fffffffe : nt!KiSwapContext+0x76 [d:\wbrtm\minkernel\ntos\ke\amd64\ctxswap.asm @ 109]
ffffd000`20108550 fffff801`690545b7 : ffffe000`0039b040 fffff801`690a30ba 00000000`2dfb8602 00000000`00194603 : nt!KiSwapThread+0x14e [d:\wbrtm\minkernel\ntos\ke\thredsup.c @ 5899]
ffffd000`201085f0 fffff801`6907b054 : ffffe000`0039b040 ffffffff`ffffffff 00000000`00000072 00000000`00000000 : nt!KiCommitThreadWait+0x127 [d:\wbrtm\minkernel\ntos\ke\waitsup.c @ 612]
ffffd000`20108650 fffff800`00201b88 : ffffe000`0119c1b0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeDelayExecutionThread+0x434 [d:\wbrtm\minkernel\ntos\ke\wait.c @ 154]
ffffd000`201086d0 fffff800`00204601 : 00000000`0022a401 00000000`00000004 ffffffff`fffc1707 00000000`00000004 : pci!PciStallForPowerChange+0x68 [d:\wbrtm\drivers\busdrv\pci\devpower.c @ 2193]
ffffd000`20108730 fffff800`00203e5f : ffffe000`01190003 00000000`00000000 ffffe000`0119c1b0 00000000`00000000 : pci!PciSetHardwarePowerStateDispatch+0x121 [d:\wbrtm\drivers\busdrv\pci\devpower.c @ 2076]
ffffd000`20108780 fffff800`0020429d : ffffe000`04979860 ffffe000`04979800 ffffe000`04979860 fffff801`6912e04b : pci!PciPowerDownDevice+0x2ef [d:\wbrtm\drivers\busdrv\pci\devpower.c @ 880]
ffffd000`20108830 fffff800`00201656 : ffffe000`049799c0 fffff801`6912dfa3 ffffe000`04979860 ffffe000`0119c1b0 : pci!PciDevice_SetPower+0xce [d:\wbrtm\drivers\busdrv\pci\devpower.c @ 484]
ffffd000`20108890 fffff800`0040712a : ffffe000`01095010 00000000`00000000 ffffe000`04979a08 ffffe000`04979860 : pci!PciDispatchPnpPower+0x96 [d:\wbrtm\drivers\busdrv\pci\dispatch.c @ 209]
ffffd000`201088d0 fffff800`00401205 : ffffe000`01095010 00000000`00000000 00000000`00000000 00000000`00000000 : ACPI!ACPIFilterIrpSetPower+0xf2 [d:\wbrtm\minkernel\busdrv\acpi\driver\nt\filter.c @ 2055]
ffffd000`20108930 fffff800`01b792af : 00000000`00000007 00000000`00000001 ffffe000`04979860 00000000`00000000 : ACPI!ACPIDispatchIrp+0x1e5 [d:\wbrtm\minkernel\busdrv\acpi\driver\nt\dispatch.c @ 912]
ffffd000`201089a0 fffff800`01b78c7e : ffffe000`04979801 ffffd000`20108ae9 ffffe000`04979860 fffff801`6912e2b6 : dxgkrnl!DpiFdoHandleDevicePower+0x1ef [d:\wbrtm\windows\core\dxkernel\dxgkrnl\port\dppower.cxx @ 449]
ffffd000`20108a30 fffff801`6912dc87 : 00000000`00000001 ffffd000`20108bd0 00000000`00000001 ffffe000`04979860 : dxgkrnl!DpiDispatchPower+0x8e [d:\wbrtm\windows\core\dxkernel\dxgkrnl\port\dpport.cxx @ 2205]
ffffd000`20108b50 fffff801`690fc664 : 00000000`00000000 ffffe000`0039b040 ffffe000`0039b040 ffffe000`0027c2c0 : nt!PopIrpWorker+0x2df [d:\wbrtm\minkernel\ntos\po\irpwork.c @ 599]
ffffd000`20108c00 fffff801`6916b6c6 : ffffd000`20726180 ffffe000`0039b040 ffffd000`20732100 00000000`00000000 : nt!PspSystemThreadStartup+0x58 [d:\wbrtm\minkernel\ntos\ps\psexec.c @ 5883]
ffffd000`20108c60 00000000`00000000 : ffffd000`20109000 ffffd000`20103000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16 [d:\wbrtm\minkernel\ntos\ke\amd64\threadbg.asm @ 75]

这个线程的内核时间有45秒之多(占有CPU的执行时间),考虑到系统是时间片轮询的,这个线程会不断地被换入换出的,所以一般的内核时间在10s以内才是正常的。

线程栈显示,PCI驱动并没有消极怠工,它的函数pci!PciDevice_SetPower在不断地尝试完成这个iGfx驱动发送过来的IRP,只是始终完成不了。好啦,现在问题就要转到iGfx的主人——Intel去啦。

你可能感兴趣的:(windbg,BSOD,0x9f)