deadlock on SMP machines - perhaps caused by wldap32.dll?
From: Marc Elliott (marce808_at_hotmail.com)
Date: 07/09/04
- Next message: Irish4Arm: "class template member functions"
- Previous message: Antonio: "Can rebasing affect correctness of information in the map file generated before?"
- Next in thread: William DePalo [MVP VC++]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Reply: William DePalo [MVP VC++]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Reply: Ivan Brugiolo [MSFT]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Messages sorted by: [ date ] [ thread ]
Date: 9 Jul 2004 10:19:33 -0700
Hello,
Apologies in advance, this is a very long post, but I'm trying to be
as complete as possible. I'm also not sure if this is the correct
group, but it seemed appropriate.
We have a process that runs as part of our product that has been
working fine on Windows. However, we recently had an issue onsite
where these processes (there are usually multiple instances) would
hang and become completely unresponsive on SMP (Compaq 8-way) machines
running Windows NT and 2000. I've managed to recreate the issue in our
test lab on a dual Xeon (hyperthreaded) box running Windows 2003.
There is definitely some sort of deadlock situation here, but I'm at a
loss how to proceed after banging my head against this for a week now.
There is some evidence that suggests the DllInit function of wldap32
is somehow a player in this - the details are further down.
First, let me provide a little context: This service (it's not really
an application) is written in VC++ 6.0, and is spawned via a daemon
that is also part of our application. In a typical deployment there is
one of these service processes for each processor on the local
machine. The service instantiates a Java Virtual Machine via JNI, and
most of our core code executes within the VM.
One of the functions of this service is to spawn processes and feed
them input and capture their output. This is acheived by the VM
loading a VC++ 6.0 bridge library, which in turn loads another VC++
6.0 library (we'll call this the "work library") which does the actual
work. When all of the assigned work is through, the work library is
unloaded until the next time. The work library is fairly simple, and
the core behavior involves invoking CreateProcess (usually cmd.exe,
which in my test setup just echoes some msg), and creating threads to
write to the input handle, and read from the output and error handles.
This has been working on all flavors of Windows (uniprocessor) for
quite some time.
Now, to the details. Here is the basic sequence leading up to the
hang:
1) Process loads the work library
2) Process performs some number of brief "work units" (CreateProcess()
+ accompanying threads, and then cleanup) happily and healthily
3) Process CPU utilization drops to 0, and there is no more activity,
ever. This lockup appears to happen faster when more instances are
running.
Attaching cdb to these hung processes consistently reveals two
scenarios:
Scenario A - the "Is wldap32.dll doing something bad?" scenario
----------
The work library thread which called CreateProcess has successfully
waited on the process to exit. It waits on an Event that will be set
by the thread which reads the output from the process, but that Event
is never set. Mainly because those helper threads are stuck. The
debugger always has to forcibly break into the process b/c the loader
lock is held.
!locks reveals two locks in contention:
CritSec ntdll!LdrpLoaderLock+0 at 77FC2340
LockCount 8
RecursionCount 1
OwningThread b80
EntryCount 2a
ContentionCount 2a
*** Locked
CritSec WLDAP32!PerThreadListLock+0 at 76F34010
LockCount 1
RecursionCount 1
OwningThread d64
EntryCount 1
ContentionCount 1
*** Locked
---- Stack trace for thread b80: 0f5dfba8 77f43741 77f5d64e 000005a0 00000000 SharedUserData!SystemCallStub+0x4 0f5dfbac 77f5d64e 000005a0 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0f5dfbe8 77f42044 000005a0 76f1129f 76f34010 ntdll!RtlpWaitForCriticalSection+0x126 0f5dfbf0 76f1129f 76f34010 00000000 0f5dfc1c ntdll!RtlEnterCriticalSection+0x46 0f5dfc00 76f11257 00000b80 77f47dc9 76f10000 WLDAP32!AddPerThreadEntry+0x2f 0f5dfc08 77f47dc9 76f10000 00000002 00000000 WLDAP32!LdapDllInit+0x3e 0f5dfc28 77f5d067 76f11180 76f10000 00000002 ntdll!LdrpCallInitRoutine+0x14 0f5dfcbc 77f5cef6 0f5dfd30 0f5dfd30 00000000 ntdll!LdrpInitializeThread+0xd4 0f5dfd1c 77f45241 0f5dfd30 77f40000 00000000 ntdll!LdrpInitialize+0x171 00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7 To me, this begs the question: why is this library referred to in the first place? We certainly don't use LDAP and we aren't using ActiveDirectory, either. 'lm' reveals that it is an unloaded module. My only guess is that the library init routine is called in order to see if LDAP is enabled in case it's needed to check process/thread creation security tokens. The interesting thing about WLDAP32!PerThreadListLock+0 is that the owning thread (d64) is nowhere to be found! So, no stack trace. There are several other threads waiting on locks. None of them are in my input/output/error read/write user code, they are all in LdrpInitialize or kernel32!ExitThread. I suspect the exiting threads are mine. 5 threads have the same trace (with differing addresses, but waiting on the same critical section - 77fc2340): 0f6dfbec 77f43741 77f5d64e 00000330 00000000 SharedUserData!SystemCallStub+0x4 0f6dfbf0 77f5d64e 00000330 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0f6dfc2c 77f42044 00000330 77f5cfcc 77fc2340 ntdll!RtlpWaitForCriticalSection+0x126 0f6dfc34 77f5cfcc 77fc2340 00000000 7ffdf000 ntdll!RtlEnterCriticalSection+0x46 0f6dfcbc 77f5cef6 0f6dfd30 0f6dfd30 00000000 ntdll!LdrpInitializeThread+0x2f 0f6dfd1c 77f45241 0f6dfd30 77f40000 00000000 ntdll!LdrpInitialize+0x171 00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7 And here are the exiting threads: c18: ---- 0e0dfe70 77f43741 77f5d64e 00000330 00000000 SharedUserData!SystemCallStub+0x4 0e0dfe74 77f5d64e 00000330 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0e0dfeb0 77f42044 00000330 77f7c658 77fc2340 ntdll!RtlpWaitForCriticalSection+0x126 0e0dfeb8 77f7c658 77fc2340 00000000 7ffa4000 ntdll!RtlEnterCriticalSection+0x46 0e0dff44 77e57eb3 00000000 0c0bd418 00a87b88 ntdll!LdrShutdownThread+0x30 0e0dff7c 77bc917d 00000000 77bc91f3 00000000 kernel32!ExitThread+0x41 0e0dff84 77bc91f3 00000000 00000000 00000000 msvcrt!_endthreadex+0x25 0e0dffb8 77e4a990 00a87b88 00000000 00000000 msvcrt!_endthreadex+0x9b 0e0dffec 00000000 77bc917e 00a87b88 00000000 kernel32!BaseThreadStart+0x34 9cc: ---- 0e1dfe70 77f43741 77f5d64e 00000330 00000000 SharedUserData!SystemCallStub+0x4 0e1dfe74 77f5d64e 00000330 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0e1dfeb0 77f42044 00000330 77f7c658 77fc2340 ntdll!RtlpWaitForCriticalSection+0x126 0e1dfeb8 77f7c658 77fc2340 00000000 7ffa3000 ntdll!RtlEnterCriticalSection+0x46 0e1dff44 77e57eb3 00000000 0b2e6c48 0b38b590 ntdll!LdrShutdownThread+0x30 0e1dff7c 77bc917d 00000000 77bc91f3 00000000 kernel32!ExitThread+0x41 0e1dff84 77bc91f3 00000000 00000000 00000000 msvcrt!_endthreadex+0x25 0e1dffb8 77e4a990 0b38b590 00000000 00000000 msvcrt!_endthreadex+0x9b 0e1dffec 00000000 77bc917e 0b38b590 00000000 kernel32!BaseThreadStart+0x34 I can provide further information if it is needed. ------------------------------------------------------------------------------- Scenario B - the "what happened to the malloc lock?" scenario ---------- As in scenario A, the work library thread which called CreateProcess has successfully waited on the process to exit. It waits on an Event that will be set by the thread which reads the output from the process, but that Event is never set. These helper threads are also stuck, but this time they are actually in user code. The debugger does not have to forcibly break into this process. !locks only reveals one lock in contention: CritSec MSVCRTD!__app_type+34 at 10264780 LockCount 1 RecursionCount 0 OwningThread 0 EntryCount 4 ContentionCount 4 *** Locked The owning thread is 0. Uh oh. My two helper threads also happen to be waiting on this lock: ea8: ---- 0f2eed90 77f43741 77f5d64e 00000680 00000000 SharedUserData!SystemCallStub+0x4 0f2eed94 77f5d64e 00000680 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0f2eedd0 77f42044 00000680 1020b5a3 10264780 ntdll!RtlpWaitForCriticalSection+0x126 0f2eedd8 1020b5a3 10264780 0f2eeec4 0f2eee18 ntdll!RtlEnterCriticalSection+0x46 0f2eede8 1021352a 00000009 0f2efed8 0f2eee74 MSVCRTD!_lock+0x93 0f2eee18 102134ce 0de09de0 00000001 0f2eee34 MSVCRTD!_free_dbg+0x2a 0f2eee28 104bd0ed 0de09de0 0f2eee44 104ad0f0 MSVCRTD!free+0xe 0f2eee34 104ad0f0 0de09de0 0f2eeec4 0f2eee5c MSVCP60D!operator delete+0xd 0f2eee44 1048be9e 0de09de0 00000021 ffffffff MSVCP60D!std::allocator<char>::deallocate+0x10 0f2eee5c 10489b71 00000001 0f2eeec4 0f2efee4 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Tidy+0x6e 0f2eee6c 0c0e3c99 0f2eff74 0f2efeec 0de09ce0 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string<char,std::char_traits<char>,std::allocator<char> >+0x11 0f2efee4 0c0d9ddb 00000000 00000000 0de09ce0 worklib!NullCopyRunner::run+0x119 0f2eff84 1020bf53 0de09c98 00000000 00000000 worklib!JobThread::execute+0x4b 0f2effb8 77e4a990 0de09ce0 00000000 00000000 MSVCRTD!_beginthreadex+0x133 0f2effec 00000000 1020bee0 0de09ce0 00000000 kernel32!BaseThreadStart+0x34 7d0: ---- 0f1ee64c 77f43741 77f5d64e 00000680 00000000 SharedUserData!SystemCallStub+0x4 0f1ee650 77f5d64e 00000680 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc 0f1ee68c 77f42044 00000680 1020b5a3 10264780 ntdll!RtlpWaitForCriticalSection+0x126 0f1ee694 1020b5a3 10264780 00000021 0f1ee6d8 ntdll!RtlEnterCriticalSection+0x46 0f1ee6a4 102129da 00000009 00000000 0f1ee7cc MSVCRTD!_lock+0x93 0f1ee6d8 102129a6 00000021 00000001 00000001 MSVCRTD!_nh_malloc_dbg+0x2a 0f1ee6f4 1020e2cf 00000021 00000001 0f1eff44 MSVCRTD!_nh_malloc+0x16 0f1ee708 104b7a69 00000021 0f1ee728 104ad0d2 MSVCRTD!operator new+0xf 0f1ee714 104ad0d2 00000021 00000000 0f1eff44 MSVCP60D!std::_Allocate+0x19 0f1ee728 1048ba18 00000021 00000000 00000000 MSVCP60D!std::allocator<char>::allocate+0x12 0f1ee770 1048bccc 00000017 00000000 0f1eff44 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy+0x58 0f1ee784 1048a005 00000017 00000001 0f1eff44 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Grow+0x10c 0f1ee798 1048a060 0c136674 00000017 0f1eff44 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign+0x15 0f1ee7ac 10489ad7 0c136674 0f1eff44 0f1eff84 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign+0x20 0f1ee7bc 0c0d9e1f 0c136674 0f1eff40 1020d797 MSVCP60D!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::basic_string<char,std::char_traits<char>,std::allocator<char> >+0x27 0f1eff84 1020bf53 0de09ba8 00000000 00000000 worklib!JobThread::execute+0x8f 0f1effb8 77e4a990 0de09bf0 00000000 00000000 MSVCRTD!_beginthreadex+0x133 0f1effec 00000000 1020bee0 0de09bf0 00000000 kernel32!BaseThreadStart+0x34 So it appears that the malloc/heap/whatever-you-call-it lock is somehow munged. None of the other threads reveal anything interesting. ------------------------------------------------------------------------------- Summary: I considered the possibility of heap contamination, but using gflags I enabled full page heap and every other form of heap checking, but there appear to be no problems there. I am suspicious that there is something else at work because these scenarios happen exactly the same every time. As I mentioned before, this code has been working perfectly on Windows uniprocessor systems for more than a year now. The only variable is the multiprocessor environment. I am seriously thinking that this is system related. I've noticed that there have been some deadlock related fixes for wldap32.dll, including a recent hotfix that postdates the release of Windows 2003. I'm wondering if this would fix our problem? I eagerly await any new ideas on this issue! I am pretty much out of them at this point. If I can provide further information on this, please ask. Thanks for reading! Regards, Marc
- Next message: Irish4Arm: "class template member functions"
- Previous message: Antonio: "Can rebasing affect correctness of information in the map file generated before?"
- Next in thread: William DePalo [MVP VC++]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Reply: William DePalo [MVP VC++]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Reply: Ivan Brugiolo [MSFT]: "Re: deadlock on SMP machines - perhaps caused by wldap32.dll?"
- Messages sorted by: [ date ] [ thread ]