Analysis of a Deadlock Caused by a Crash on Windows

Contents
  1. 0x00 Problem description
  2. 0x01 First look at the deadlock
  3. 0x02 A second look at the deadlock
  4. 0x03 Solutions
  5. 0x04 References

0x00 Problem description

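QA reported that during testing the program's process was running but its UI never loaded. Looking at the machine, it quickly became clear that a version mismatch had crashed the program, and that the process had then deadlocked while writing the dmp file. Because the process was hung rather than exited, the watchdog process did not restart it, so the UI never appeared.
That is the symptom. But why would writing a dmp deadlock the process?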

0x01 First look at the deadlock

First, let's look at the thread states at the time of the deadlock and work out the cause:

0:065> ~0kv
ChildEBP RetAddr Args to Child
0061c8cc 751a15bf 00000138 00000000 00000000 ntdll!NtWaitForSingleObject+0x15 (FPO: [3,0,0])
0061c938 770e1194 00000138 ffffffff 00000000 KERNELBASE!WaitForSingleObjectEx+0x98 (FPO: [Non-Fpo])
0061c950 770e1148 00000138 ffffffff 00000000 kernel32!WaitForSingleObjectExImplementation+0x75 (FPO: [Non-Fpo])
0061c964 6c710cbb 00000138 ffffffff 6c71092f kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
0061c978 6c710983 0061ca54 00000000 6c71092f xxx!google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread+0x64 (FPO: [2,0,0]) (CONV: thiscall)
0061c994 7712030d 0061ca24 7712031f 0061ca54 xxx!google_breakpad::ExceptionHandler::HandleException+0x54 (FPO: [Non-Fpo]) (CONV: thiscall)
0061ca24 77686637 0061ca54 77686514 00000000 kernel32!UnhandledExceptionFilter+0x119 (FPO: [Non-Fpo])
0061ca2c 77686514 00000000 0061fa04 7763c6b0 ntdll!__RtlUserThreadStart+0x62 (FPO: [SEH])
0061ca40 776863b1 00000000 00000000 00000000 ntdll!_EH4_CallFilterFunc+0x12 (FPO: [Uses EBP] [0,0,4])
0061ca68 7766b81d fffffffe 0061f9f4 0061cba4 ntdll!_except_handler4+0x8e (FPO: [Non-Fpo])
0061ca8c 7766b7ef 0061cb54 0061f9f4 0061cba4 ntdll!ExecuteHandler2+0x26 (FPO: [Uses EBP] [5,3,1])
0061cab0 7766b790 0061cb54 0061f9f4 0061cba4 ntdll!ExecuteHandler+0x24 (FPO: [5,0,3])
0061cb3c 77620163 0061cb54 0061cba4 0061cb54 ntdll!RtlDispatchException+0x127 (FPO: [Non-Fpo])
0061cb3c 00000000 0061cb54 0061cba4 0061cb54 ntdll!KiUserExceptionDispatcher+0xf (FPO: [2,0,0]) (CONTEXT @ 00000008)
0:065> .exr 0061cb54
ExceptionAddress: 7519c52f (KERNELBASE!RaiseException+0x00000058)
ExceptionCode: e06d7363 (C++ EH exception)
ExceptionFlags: 00000001
NumberParameters: 3
Parameter[0]: 19930520
Parameter[1]: 0061d0a0
Parameter[2]: 6d376990
pExceptionObject: 0061d0a0
_s_ThrowInfo : 6d376990
Type : class std::bad_alloc
Type : class std::exception
0:065> .cxr 0061cba4
eax=0061d008 ebx=0061d0e4 ecx=00000003 edx=00000000 esi=3fffffef edi=00000000
eip=7519c52f esp=0061d008 ebp=0061d058 iopl=0 nv up ei pl nz ac po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000212
KERNELBASE!RaiseException+0x58:
7519c52f c9 leave
*** ERROR: Symbol file could not be found. Defaulted to export symbols for Qt5Core.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for Desktop.exe -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for Qt5Widgets.dll -
0:065> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0061d058 6fe4872d e06d7363 00000001 00000003 KERNELBASE!RaiseException+0x58 (FPO: [Non-Fpo])
0061d090 6d0a6956 0061d0a0 6d376990 6d2516d4 msvcr100!_CxxThrowException+0x48 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
0061d0ac 6d0b888b 07cb37f8 07cb37f8 00000000 Qt5Core!qBadAlloc+0x1c
0061d0f4 00285fbe 0061d11c 3fffffef 00000000 Qt5Core!QByteArray::resize+0x96
0:065> !handle 0x138 0xf
Handle 00000138
Type Semaphore
Attributes 0
GrantedAccess 0x1f0003:
Delete,ReadControl,WriteDac,WriteOwner,Synch
QueryState,ModifyState
HandleCount 2
PointerCount 4
Name <none>
No object specific information available

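Start with the main thread. It raised a bad_alloc exception, which breakpad caught. Breakpad is now creating the dmp file and is blocked in WaitForSingleObject on the semaphore with handle 0x138. The breakpad source shows that it uses semaphores to drive the ExceptionHandlerThreadMain thread that generates the dmp, and at this point the main thread is in WaitForSingleObject waiting for ExceptionHandlerThreadMain to finish generating it.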

// This causes the handler thread to call WriteMinidumpWithException.
ReleaseSemaphore(handler_start_semaphore_, 1, NULL);

// Wait until WriteMinidumpWithException is done and collect its return value.
WaitForSingleObject(handler_finish_semaphore_, INFINITE);
bool status = handler_return_value_;
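
To make the consequence of this handshake concrete, here is a minimal portable sketch of the same pattern. It is a model, not breakpad code: the Semaphore class and WriteDumpOnHandlerThread are hypothetical names introduced for illustration, and a mutex plus condition variable stands in for the Win32 semaphore handles.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Minimal counting semaphore (C++11), standing in for a Win32 semaphore handle.
class Semaphore {
 public:
  void release() {
    std::lock_guard<std::mutex> lock(m_);
    ++count_;
    cv_.notify_one();
  }
  void acquire() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return count_ > 0; });
    --count_;
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  int count_ = 0;
};

// Hypothetical model of the handshake: the crashing thread signals `start`,
// the handler thread runs `write_dump`, then signals `finish`.
bool WriteDumpOnHandlerThread(const std::function<bool()>& write_dump) {
  assert(static_cast<bool>(write_dump));  // a callable must be supplied
  Semaphore start, finish;
  bool result = false;

  std::thread handler([&] {
    start.acquire();        // plays the role of handler_start_semaphore_
    result = write_dump();  // plays the role of WriteMinidumpWithException
    finish.release();       // plays the role of handler_finish_semaphore_
  });

  start.release();   // wake the handler thread
  finish.acquire();  // blocks forever if write_dump never returns
  handler.join();
  return result;
}
```

The wait on the finish semaphore has no timeout, just like the INFINITE wait in the snippet above, so the crashing thread depends entirely on the handler thread returning. If write_dump blocks, the caller blocks with it.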

So what is the ExceptionHandlerThreadMain thread doing? Why hasn't it finished writing the dmp file and released the semaphore?

0:065> ~4kv
ChildEBP RetAddr Args to Child
02f3f1fc 7764db13 00000730 00000000 00000000 ntdll!NtWaitForSingleObject+0x15 (FPO: [3,0,0])
02f3f260 7764d9f7 00000000 00000000 03d20000 ntdll!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
02f3f288 7764dc78 03d20138 759b9beb 00078000 ntdll!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
02f3f364 77643541 00001ff8 00002000 00000000 ntdll!RtlpAllocateHeap+0x159 (FPO: [Non-Fpo])
02f3f3e8 77647d7b 03d20000 00800000 00001ff8 ntdll!RtlAllocateHeap+0x23a (FPO: [Non-Fpo])
02f3f434 77647271 00000388 759b9c4b 0767d8e0 ntdll!RtlpAllocateUserBlock+0xae (FPO: [Non-Fpo])
02f3f4c4 7763e262 0767d8e0 02f3faac 02f3faac ntdll!RtlpLowFragHeapAllocFromContext+0x802 (FPO: [Non-Fpo])
02f3f538 73e1bad5 03d20000 00000008 00000374 ntdll!RtlAllocateHeap+0x206 (FPO: [Non-Fpo])
02f3f54c 73e16d52 00000374 0767d8e0 02f3faac dbghelp!Win32LiveAllocationProvider::Alloc+0x13 (FPO: [Non-Fpo])
02f3f560 73e16e96 02f3faac 00000374 0767d8e0 dbghelp!AllocMemory+0x15 (FPO: [Non-Fpo])
02f3f598 73e1a261 02f3faac 0767d8e0 000018b4 dbghelp!GenAllocateThreadObject+0x2d (FPO: [Non-Fpo])
02f3f9c0 73e15b81 02f3faac 02f3fb60 02f3fb78 dbghelp!GenGetProcessInfo+0xf2 (FPO: [Non-Fpo])
02f3fb40 73e15e2a ffffffff 00001bd0 03d207e8 dbghelp!MiniDumpProvideDump+0x16b (FPO: [Non-Fpo])
02f3fba8 6c71101a ffffffff 00001bd0 00000728 dbghelp!MiniDumpWriteDump+0xf2 (FPO: [Non-Fpo])
02f3fc70 6c710db2 00000720 00000000 ffffffff xxx!google_breakpad::ExceptionHandler::WriteMinidumpWithExceptionForProcess+0x1e3 (FPO: [Non-Fpo])
02f3fc8c 6c710875 00000720 0061ca54 00000000 xxx!google_breakpad::ExceptionHandler::WriteMinidumpWithException+0x43 (FPO: [Non-Fpo])
02f3fca0 770e338a 02811098 02f3fcec 77649a02 xxx!google_breakpad::ExceptionHandler::ExceptionHandlerThreadMain+0x39 (FPO: [1,0,0])
02f3fcac 77649a02 02811098 759b9463 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
02f3fcec 776499d5 6c71083c 02811098 00000000 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
02f3fd04 00000000 6c71083c 02811098 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

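ExceptionHandlerThreadMain is indeed in the middle of writing the dmp. So why doesn't it finish? It is waiting to enter the critical section at 0x03d20138. Let's see who owns that critical section: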

0:065> !cs 03d20138 
-----------------------------------------
Critical section = 0x03d20138 (+0x3D20138)
DebugInfo = 0x00810238
LOCKED
LockCount = 0x1
WaiterWoken = No
OwningThread = 0x00001770
RecursionCount = 0x1
LockSemaphore = 0x730
SpinCount = 0x00000fa0
0:065> ~~[0x1770]
12 Id: 1bd0.1770 Suspend: 2 Teb: 7ef93000 Unfrozen
Priority: 0 Priority class: 32
0:065> ~12kv
ChildEBP RetAddr Args to Child
03fcf388 776cfbc7 00720000 00000005 776b4acb ntdll!RtlpQueryExtendedInformationHeap+0x4ec (FPO: [Non-Fpo])
03fcf408 776d079f 00000005 776b4acb 03fcf590 ntdll!RtlpQueryExtendedInformationAllHeaps+0xe5 (FPO: [Non-Fpo])
03fcf4f8 7769e2b6 03fcf564 776b4acb 00000000 ntdll!RtlpQueryExtendedHeapInformation+0xe7 (FPO: [Non-Fpo])
03fcf538 776b5163 00000000 00000002 03fcf564 ntdll!RtlQueryHeapInformation+0x4a (FPO: [Non-Fpo])
03fcf5dc 7769374a 0a0a0000 770d0000 03fcf6a4 ntdll!RtlQueryProcessHeapInformation+0x288 (FPO: [Non-Fpo])
03fcf658 77166093 00001bd0 00000014 0a0a0000 ntdll!RtlQueryProcessDebugInformation+0x28a (FPO: [Non-Fpo])
*** WARNING: Unable to verify checksum for libeay32.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for libeay32.dll -
03fcf688 6c38a953 0a0a0000 5205b472 00000001 kernel32!Heap32Next+0x4d (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
03fcf730 7764b83d 03fcf7cc 00000001 751d11e4 libeay32!RAND_poll+0x583
03fcf7dc 6c331cef 0000000a 00000001 6c408268 ntdll!SbpTraceSbImpl+0x4e (FPO: [Non-Fpo])
03fcf7f4 6c331d3b 0000000a 00000001 6c408268 libeay32!CRYPTO_lock+0x6f
00000000 00000000 00000000 00000000 00000000 libeay32!CRYPTO_add_lock+0x3b

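So critical section 03d20138 is still owned by thread 12. Why is it never released? Thread 12 is not waiting for any resource; judging by its stack, it is just querying heap information. That was puzzling. Now look at the thread state: Suspend: 2 means the thread has been suspended twice (see the MSDN documentation for SuspendThread if this is unfamiliar). One of those suspensions comes from attaching the debugger. What about the other? My guess was that breakpad or a system function did it while writing the dmp, since suspending threads makes it easier to save each thread's context into the dmp file.
It turned out that the suspension happens inside the system function MiniDumpWriteDump: GenAllocateThreadObject calls SuspendThread to suspend the thread, and GenFreeProcessObject calls ResumeThread to resume it.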
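
The suspend count semantics can be sketched with a tiny model. SuspendCountModel is a hypothetical class written for this illustration, not a Windows API. It only mirrors the documented behavior that SuspendThread and ResumeThread each return the previous suspend count, and that a thread runs only when its count is zero.

```cpp
#include <cassert>

// Hypothetical model of one thread's suspend count (not a Windows API).
class SuspendCountModel {
 public:
  // Like SuspendThread: increments the count, returns the previous value.
  unsigned Suspend() { return count_++; }

  // Like ResumeThread: decrements the count if it is non-zero,
  // returns the previous value.
  unsigned Resume() {
    unsigned previous = count_;
    if (count_ > 0) --count_;
    return previous;
  }

  // The thread is scheduled only when nobody holds it suspended.
  bool Runnable() const { return count_ == 0; }
  unsigned count() const { return count_; }

 private:
  unsigned count_ = 0;
};
```

In terms of this model, thread 12 was suspended once by the dump writer and once by the debugger, giving a count of 2. Detaching the debugger would bring the count back to 1, which still leaves the thread frozen.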




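At this point the cause of the deadlock is clear. The sequence is:
1. The main thread (thread 0) raises an exception that breakpad catches. Thread 0 uses a semaphore to tell thread 4 to generate the dmp, then waits for thread 4 to finish MiniDumpWriteDump and release the finish semaphore.
2. Thread 4 runs MiniDumpWriteDump, which suspends the other threads as it goes. After suspending them it performs a heap allocation, which needs critical section 0x03d20138.
3. Critical section 0x03d20138 is owned by thread 12, but thread 12 has been suspended by thread 4 and so can never release it.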
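
The steps above can be reproduced in a small portable model. This is a sketch under stated assumptions: DumperCanTakeHeapLock is a hypothetical function, a std::timed_mutex stands in for the heap's critical section, the worker "suspends" itself by spinning on a flag (portable C++ cannot suspend another thread asynchronously), and a 200 ms timeout replaces the infinite wait so that the model terminates.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the "dumper" could take the heap lock while the worker
// was frozen. The worker is frozen either inside or outside the lock.
bool DumperCanTakeHeapLock(bool worker_frozen_inside_lock) {
  std::timed_mutex heap_lock;  // stands in for the heap critical section
  std::atomic<bool> worker_in_position{false};
  std::atomic<bool> resume{false};

  std::thread worker([&] {
    if (worker_frozen_inside_lock) {
      // Like thread 12: frozen in the middle of a heap walk, lock held.
      std::lock_guard<std::timed_mutex> guard(heap_lock);
      worker_in_position = true;
      while (!resume) std::this_thread::yield();
    } else {
      // Frozen at a point where it holds no lock.
      worker_in_position = true;
      while (!resume) std::this_thread::yield();
    }
  });

  // Wait until the worker is "suspended" where we want it.
  while (!worker_in_position) std::this_thread::yield();

  // Like thread 4: allocate from the heap while the worker is frozen.
  bool acquired = heap_lock.try_lock_for(std::chrono::milliseconds(200));
  if (acquired) heap_lock.unlock();

  resume = true;  // like ResumeThread after the dump is finished
  worker.join();
  return acquired;
}
```

Whether the dumper succeeds depends only on where the worker happened to be when it was frozen. This is why the hang is intermittent: it needs a thread to be suspended at the exact moment it holds the heap lock.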

0x02 A second look at the deadlock

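This is a classic deadlock: thread 4 waits for a lock held by thread 12, while thread 12 cannot run again until thread 4 resumes it, so neither makes progress. I thought the analysis was finished, but then it occurred to me: thread 12 is running libeay32 code, so why is it querying the heap? A quick Google search turned up, as the very first hit, a post about a deadlock caused by libeay32 (see reference [1]).
The post says that upgrading OpenSSL from 0.9.8g to 1.0.0e, the latest version at the time, fixes the problem. Let's check our version: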

0:065> lmvm libeay32
start end module name
6c330000 6c45e000 libeay32 C (export symbols) libeay32.dll
Loaded symbol image file: libeay32.dll
Image name: libeay32.dll
Timestamp: Fri Jul 24 12:42:20 2015 (55B1C22C)
CheckSum: 00000000
ImageSize: 0012E000
File version: 1.0.2.4
Product version: 1.0.2.4
File flags: 0 (Mask 3F)
File OS: 4 Unknown Win32
File type: 2.0 Dll
File date: 00000000.00000000
Translations: 0409.04b0
CompanyName: The OpenSSL Project, http://www.openssl.org/
ProductName: The OpenSSL Toolkit
InternalName: libeay32
OriginalFilename: libeay32.dll
ProductVersion: 1.0.2d
FileVersion: 1.0.2d
FileDescription: OpenSSL Shared Library
LegalCopyright: Copyright ?1998-2005 The OpenSSL Project. Copyright ?1995-1998 Eric A. Young, Tim J. Hudson. All rights reserved.

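Our version is already 1.0.2d, newer than the one mentioned in the post. So I downloaded the source to take a look. The relevant code is in the function int RAND_poll(void) in openssl-1.0.2d\crypto\rand\rand_win.c. The comments there show that the heap walk is used to gather entropy for the random number generator. RAND_poll also gathers entropy in several other ways, such as through the CryptoAPI, and adds whatever it collects to the random pool via RAND_add.
So can the heap-walk entropy source simply be removed? Walking the heap is slow to begin with, and it also increases the risk of deadlock.
OpenSSL is very widely used, so let's see how a major vendor handles this. Looking at the libeay32.dll shipped with Tim, the heap-walk entropy source has indeed been removed, though not completely: it still does one pointless LoadLibrary/FreeLibrary of kernel32.dll.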

0x03 Solutions

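1. Thorough fix: Microsoft itself advises against generating a dmp from inside the crashing process and warns that deadlock is a major potential risk of doing so. Generating the dump out of process is therefore the most thorough fix, and breakpad already supports a client/server mode for this.
2. Mitigation: remove the heap-walk entropy source from libeay32!RAND_poll. This reduces the risk of deadlock, and the heap walk performs poorly anyway.
3. Ultimate fix: improve code quality. If the program never throws and never crashes, no dmp is written and nothing deadlocks. That is the ideal, of course, and not achievable in practice.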

0x04 References

[1] A deadlock caused by libeay32!RAND_poll in OpenSSL
[2] OpenSSL 1.0.2 source code
[3] MiniDumpWriteDump, MSDN