文章目录

  • DPC_WATCHDOG_VIOLATION蓝屏问题分析
    • 1. 原理
      • 1.1 DPC WATCHDOG
      • 1.2 DPC超时时间获取
    • 2. 实例
      • 2.1 分析
    • 3. 总结

DPC_WATCHDOG_VIOLATION蓝屏问题分析

在Windows 10之下,引入了DPC_WATCHDOG_VIOLATION蓝屏错误,这个错误引入的原因是为了防止内核代码处于错误状态在DPC以及以上级别执行太久,影响系统的稳定性。

对于DPC_WATCHDOG_VIOLATION这个蓝屏错误码,存在两种类型(使用第一个参数来区分):

  1. 第一个参数为0表示单个DPC例程执行超时了(A single DPC or ISR exceeded its time allotment)。
  2. 第一个参数为1表示系统在DPC上面执行超时了(The system cumulatively spent an extended period of time at IRQL DISPATCH_LEVEL or above)。

这两种蓝屏代码分别如下:

DPC_WATCHDOG_VIOLATION (133)The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVELor above.Arguments:Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offendingcomponent can usually be identified with a stack trace.Arg2: 0000000000000501, The DPC time count (in ticks).Arg3: 0000000000000500, The DPC time allotment (in ticks).Arg4: fffff8077ff73358, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which containsadditional information regarding this single DPC timeoutDPC_WATCHDOG_VIOLATION (133)The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVELor above.Arguments:Arg1: 0000000000000001, The system cumulatively spent an extended period of time atDISPATCH_LEVEL or above. The offending component can usually beidentified with a stack trace.Arg2: 0000000000001e00, The watchdog period.Arg3: fffff80080705330, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which containsadditional information regarding the cumulative timeoutArg4: 0000000000000000

下面我们通过要给实例来探讨一下DPC_WATCHDOG_VIOLATION的基本原理以及分析过程。

1. 原理

1.1 DPC WATCHDOG

Windows对于系统运行在DPC和中断的时间做了相关限定,一般来说DPC不能运行超过100MS,IRS不能运行超过25MS(但是实际值却比这个要高)。

在Windows中,通过KeAccumulateTicks对运行时间做相关限定,如下:

__int64 __fastcall KeAccumulateTicks(...){//...if (pcrb->NestingLevel == 2 && pcrb->DpcRoutineActive){DpcTimeCount = pcrb->DpcTimeCount;pcrb->DpcTime += KernelTime;NewDpcTimeCount = DpcTimeCount + 1;DpcTimeLimit = pcrb->DpcTimeLimit;pcrb->DpcTimeCount = NewDpcTimeCount;if (DpcTimeLimit){//...if (NewDpcTimeCount > DpcTimeLimit){if (KeEnableWatchdogTimeout){HvlInvokeHypervisorDebugger(3i64, 0i64);if (!(_BYTE)KdDebuggerEnabled || (_BYTE)KdDebuggerNotPresent)KeBugCheckEx(0x133u,0i64,pcrb->DpcTimeCount,pcrb->DpcTimeLimit,(ULONG_PTR)&KeDpcWatchdogProfileGlobalTriageBlock);}pcrb->DpcTimeCount = 0;KiResetGlobalDpcWatchdogProfiler(pcrb);}}}//...if (Irql < 2u){DpcWatchdogProfile = pcrb->DpcWatchdogProfile;pcrb->DpcWatchdogCount = 0;//...goto Over;}DpcWatchdogPeriod = pcrb->DpcWatchdogPeriod;NewDpcWatchdogCount = pcrb->DpcWatchdogCount + 1;pcrb->DpcWatchdogCount = NewDpcWatchdogCount;if (DpcWatchdogPeriod){//...if (DpcWatchdogCount >= DpcWatchdogPeriod){if (KeEnableWatchdogTimeout){HvlInvokeHypervisorDebugger(2i64, 0i64);if (!(_BYTE)KdDebuggerEnabled || (_BYTE)KdDebuggerNotPresent)KeBugCheckEx(0x133u, 1ui64, pcrb->DpcWatchdogPeriod, (ULONG_PTR)&KeDpcWatchdogProfileGlobalTriageBlock,0i64);}}}//...}

从这里看,如果超过时间并且KeEnableWatchdogTimeout为真的话,那么引发133蓝屏。那么KeEnableWatchdogTimeout这个是什么时候初始化的呢?我们可以看到初始化的代码如下:

char __fastcall KeInitSystem(int a1){//...if (!HviIsAnyHypervisorPresent())KeEnableWatchdogTimeout = 1;//...}

HviIsAnyHypervisorPresent这个函数返回cpuid指令是否由hypervisor实现,如果不是的话,那么开启WatchDog。

在上述中KeBugCheckEx存在两种情况:

  1. 如果是pcrb->DpcRoutineActive在DPC例程中,那么判断是否DPC例程超时,DPC例程判断使用的是PRCB中的DpcTimeCountDpcTimeLimitTicks比较。
  2. 接着是判断整体是否运行在DPC基本下面,使用的是PRCB中的DpcWatchdogCountDpcWatchdogPeriodTicks

由于DpcTimeCount表示的是单个DPC例程的执行时间,因此每一个DPC执行的时候都会设置改值为0。

__int64 __fastcall KiExecuteAllDpcs(...){//...a1->DpcTimeCount = 0;//...}

1.2 DPC超时时间获取

通过KeQueryDpcWatchdogInformation可以获取到当前处理器的DPC超时时间信息,如下:

NTSTATUS KeQueryDpcWatchdogInformation(PKDPC_WATCHDOG_INFORMATION WatchdogInformation);typedef struct _KDPC_WATCHDOG_INFORMATION {ULONG DpcTimeLimit;ULONG DpcTimeCount;ULONG DpcWatchdogLimit;ULONG DpcWatchdogCount;ULONG Reserved;} KDPC_WATCHDOG_INFORMATION, *PKDPC_WATCHDOG_INFORMATION;

其中:

  1. DpcTimeLimit : 当前处理器单个延迟过程调用的时间限制。如果已禁用DPC超时,则此值设置为0。
  2. DpcTimeCount : 如果已启用DPC超时,则本次延迟过程调用的使用时间。
  3. DpcWatchdogLimit : 一系列延迟过程调用所允许的总时间限制。如果已禁用DPC看门狗,则此值设置为零。
  4. DpcWatchdogCount : 如果启用DPC看门狗,则当前连续延迟过程调用序列的剩余时间值。

例如我们可以看到如下:

0: kd> dt nt!_KPRCBfffff8007bfe9180 DPC* +0x3310 DpcRuntimeHistoryHashTable : 0xffffaf0c`1f0d5690 _RTL_HASH_TABLE +0x3318 DpcRuntimeHistoryHashTableCleanupDpc : 0xffffaf0c`1f122c40 _KDPC +0x3338 DpcDelegateThread : 0xffffaf0c`1f150140 _KTHREAD +0x3340 DpcData : [2] _KDPC_DATA +0x33a0 DpcStack : 0xfffff800`7c395fb0 Void +0x33ac DpcRequestRate : 0 +0x33b4 DpcLastCount : 0x67518e +0x33ba DpcRoutineActive : 0 '' +0x33bc DpcRequestSummary : 0n131174 +0x33bc DpcRequestSlot : [2] 0n102 +0x33bc DpcNormalProcessingActive : 0y0 +0x33bc DpcNormalProcessingRequested : 0y1 +0x33bc DpcNormalThreadSignal : 0y1 +0x33bc DpcNormalTimerExpiration : 0y0 +0x33bc DpcNormalDpcPresent : 0y0 +0x33bc DpcNormalLocalInterrupt : 0y1 +0x33bc DpcNormalPriorityAntiStarvation : 0y1 +0x33bc DpcNormalSwapToDpcDelegate : 0y0 +0x33bc DpcNormalSpare : 0y00000000 (0) +0x33bc DpcThreadActive : 0y0 +0x33bc DpcThreadRequested : 0y1 +0x33bc DpcThreadSpare : 0y00000000000000 (0) +0x7e40 DpcGate : _KGATE +0x7ea8 DpcWatchdogPeriodTicks : 0n7680 +0x7eac DpcWatchdogCount : 0n7680 +0x7eb4 DpcWatchdogProfileCumulativeDpcThresholdTicks : 0x1b80 +0x7eec DpcWatchdogSequenceNumber : 0x18794c +0x814c DpcTime : 0x4c56 +0x815c DpcTimeCount : 0 +0x8160 DpcTimeLimitTicks : 0x500 +0x8240 DpcWatchdogProfileSingleDpcThresholdTicks : 0x496 +0x85d8 DpcWatchdogDpc : _KDPC +0x8618 DpcWatchdogTimer : _KTIMER +0x88a8 DpcWatchdogProfile : 0xffffaf0c`1f16a000-> 0x00000000`00000011 Void +0x88b0 DpcWatchdogProfileCurrentEmptyCapture : 0xffffaf0c`1f181cf8-> (null) 

综上,针对DPC的监控有两个:

  1. 使用DpcTimeCountDpcTimeLimitTicks来监控单个DPC例程执行的时间,防止DPC例程执行过长。
  2. 使用DpcWatchdogCountDpcWatchdogPeriodTicks来监控CPU执行的DPC时间总和,防止CPU一直执行一系列DPC函数。

2. 实例

下面我们通过一个DPC_WATCHDOG_VIOLATION蓝屏转储文件来进行实例分析。

2.1 分析

我们先来看一下蓝屏代码DPC_WATCHDOG_VIOLATION,如下:

DPC_WATCHDOG_VIOLATION (133)The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVELor above.Arguments:Arg1: 0000000000000001, The system cumulatively spent an extended period of time atDISPATCH_LEVEL or above. The offending component can usually beidentified with a stack trace.Arg2: 0000000000001e00, The watchdog period.Arg3: fffff80080705330, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which containsadditional information regarding the cumulative timeoutArg4: 0000000000000000

此时调用堆栈如下:

0: kd> k # Child-SPRetAddr Call Site00 fffff800`7c39cdd8 fffff800`7fc31c8f nt!KeBugCheckEx01 fffff800`7c39cde0 fffff800`7fc318a1 nt!KeAccumulateTicks+0x20f02 fffff800`7c39ce50 fffff800`7fc2fa8a nt!KiUpdateRunTime+0x6103 fffff800`7c39ceb0 fffff800`7fc2f876 nt!KeClockInterruptNotify+0x11a04 fffff800`7c39cf40 fffff800`7fc14650 nt!HalpTimerClockIpiRoutine+0x1605 fffff800`7c39cf70 fffff800`7fe1e05a nt!KiCallInterruptServiceRoutine+0xa006 fffff800`7c39cfb0 fffff800`7fe1e8c7 nt!KiInterruptSubDispatchNoLockNoEtw+0xfa07 fffff882`a7a57430 fffff800`7fd67972 nt!KiInterruptDispatchNoLockNoEtw+0x3708 fffff882`a7a575c0 fffff800`7fc6cb64 nt!KiAcquireKobjectLockSafe+0x3209 fffff882`a7a575f0 fffff800`7fc44136 nt!KeSetEvent+0x640a fffff882`a7a57680 fffff800`7fc18496 nt!IopCompleteRequest+0x3960b fffff882`a7a57760 fffff800`7fc70846 nt!KiDeliverApc+0x1b60c fffff882`a7a57820 fffff800`7fc72169 nt!KiSwapThread+0x9360d fffff882`a7a57900 fffff800`7fc75e26 nt!KiCommitThreadWait+0x1590e fffff882`a7a579a0 fffff800`8cce1cc1 nt!KeDelayExecutionThread+0x4160f fffff882`a7a57a30 fffff800`8cce3b7a xxxDrv_x64+0x1cc110 fffff882`a7a57a60 fffff800`7fd54295 xxxDrv_x64+0x3b7a11 fffff882`a7a57af0 fffff800`7fe20e34 nt!PspSystemThreadStartup+0x5512 fffff882`a7a57b40 00000000`00000000 nt!KiStartSystemThread+0x34

这里我们可以得出的是蓝屏并不是执行某个DPC例程导致的时间够长,而是该CPU在DPC模式下面执行过长。

此时我们看到DCP的时间如下:

0: kd> dt nt!_KPRCBfffff8007bfe9180 DPC* +0x7ea8 DpcWatchdogPeriodTicks : 0n7680 +0x7eac DpcWatchdogCount : 0n7680 +0x815c DpcTimeCount : 0 +0x8160 DpcTimeLimitTicks : 0x500

从这里我们也可以发现:

  1. DpcTimeCount为0,说明我们没有执行DPC例程。
  2. DpcWatchdogCountDpcWatchdogPeriodTicks相等,说明CPU在DPC下面执行时间过长。

我们可以看一下当前DPC的信息,如下:

0: kd> !dpcs 0CPU TypeKDPC Function 0: Normal: 0xffffaf0c32787278 0xfffff8007fe56420 nt!IopPassiveInterruptDpc 0: Normal: 0xffffaf0c21e7b520 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk 0: Normal: 0xffffaf0c23ce9808 0xfffff80082c314d0 ndis!ndisInterruptDpc 0: Normal: 0xffffaf0c21eb26e8 0xfffff800858666c0 dxgkrnl!DpiFdoDpcForIsr 0: Normal: 0xfffff882acf91f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff8007bff15d8 0xfffff8007fc26b60 nt!PpmPerfAction 0: Normal: 0xffffaf0c2173a0c8 0xfffff80082ec51d0 tcpip!TcpPeriodicTimeoutHandler 0: Normal: 0xffffaf0c211f71e8 0xfffff80082732000 stornvme!NVMeCompletionDpcRoutine 0: Normal: 0xfffff8007bff1c68 0xfffff8007fd4c800 nt!KiEntropyDpcRoutine 0: Normal: 0xfffff882a8ab6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff882ad376f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xffffaf0c1f122c40 0xfffff8007fd3fd30 nt!KiDpcRuntimeHistoryHashTableCleanupDpcRoutine 0: Normal: 0xffffaf0c231d0de0 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk 0: Normal: 0xffffaf0c21d93db0 0xfffff800b332b470 i8042prt!I8042KeyboardIsrDpc 0: Normal: 0xffffaf0c23ce9a48 0xfffff80082c314d0 ndis!ndisInterruptDpc 0: Normal: 0xfffff882a8476f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xffffaf0c21233000 0xfffff8008278c0f0 storport!RaidpAdapterRedirectDpcRoutine 0: Normal: 0xffffaf0c23081270 0xfffff800ae096d90 Ndu!NduQueryStatisticsDpc 0: Normal: 0xfffff882a7f26f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff882ac1a6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Threaded: 0xfffff8007bff1758 0xfffff8007fd59dc0 nt!KiDpcWatchdog

发现大量DPC,为什么呢?导致DPC执行不了的一个原因是CPU一直处于高IRQL下面,例如中断,是不是CPU一直处于中断下面呢?

我们可以使用!dpcwatchdog查看dpcwatchdog的情况,如下:

0: kd> !dpcwatchdog All durations are in seconds (1 System tick = 15.625000 milliseconds)Circular Kernel Context Logger history: !logdump 0x2DPC and ISR stats: !intstats /d--------------------------------------------------CPU#0--------------------------------------------------Current DPC: No Active DPCPending DPCs:----------------------------------------CPU TypeKDPC Function 0: Normal: 0xffffaf0c32787278 0xfffff8007fe56420 nt!IopPassiveInterruptDpc 0: Normal: 0xffffaf0c21e7b520 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk 0: Normal: 0xffffaf0c23ce9808 0xfffff80082c314d0 ndis!ndisInterruptDpc 0: Normal: 0xffffaf0c21eb26e8 0xfffff800858666c0 dxgkrnl!DpiFdoDpcForIsr 0: Normal: 0xfffff882acf91f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff8007bff15d8 0xfffff8007fc26b60 nt!PpmPerfAction 0: Normal: 0xffffaf0c2173a0c8 0xfffff80082ec51d0 tcpip!TcpPeriodicTimeoutHandler 0: Normal: 0xffffaf0c211f71e8 0xfffff80082732000 stornvme!NVMeCompletionDpcRoutine 0: Normal: 0xfffff8007bff1c68 0xfffff8007fd4c800 nt!KiEntropyDpcRoutine 0: Normal: 0xfffff882a8ab6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff882ad376f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xffffaf0c1f122c40 0xfffff8007fd3fd30 nt!KiDpcRuntimeHistoryHashTableCleanupDpcRoutine 0: Normal: 0xffffaf0c231d0de0 0xfffff80082092550 Wdf01000!FxInterrupt::_InterruptDpcThunk 0: Normal: 0xffffaf0c21d93db0 0xfffff800b332b470 i8042prt!I8042KeyboardIsrDpc 0: Normal: 0xffffaf0c23ce9a48 0xfffff80082c314d0 ndis!ndisInterruptDpc 0: Normal: 0xfffff882a8476f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xffffaf0c21233000 0xfffff8008278c0f0 storport!RaidpAdapterRedirectDpcRoutine 0: Normal: 0xffffaf0c23081270 0xfffff800ae096d90 Ndu!NduQueryStatisticsDpc 0: Normal: 0xfffff882a7f26f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Normal: 0xfffff882ac1a6f60 0xfffff8007fc13ff0 nt!PopExecuteProcessorCallback 0: Threaded: 0xfffff8007bff1758 0xfffff8007fd59dc0 nt!KiDpcWatchdogc0000094 Exception in kdexts.dpcwatchdog debugger extension.PC: 00007ffa`507625c0VA: 00000000`00000000R/W: 0Parameter: 00000000`00000000

同样也可以看到大量Pending DPCs,Current DPC: No Active DPC。

3. 总结

DPC_WATCHDOG_VIOLATION有两种蓝屏情况:

  1. Parameter 1 = 0说明当前正在执行DPC例程,并且超时了,可以直接看到DPC超时的原因(或者由于其他线程导致该DPC一直无法退出)。
  2. Parameter 1 = 1说明此时该CPU一直处于DPC状态执行,这个原因需要具体分析。

有两个WinDBG命令可以协助分析该蓝屏:

  1. !dpcwatchdog查看所有cpu的dpcwatchdog信息。
  2. !dpcs查看指定cpu的dpc例程信息。