Re: WaitForSingleObject() will not deadlock



It seems that "memory visibility" has something to do with making sure that results are in
memory. I would have thought this would have been more commonly referred to as the "cache
coherency" problem. Because of the Interlocked operation used on the x86 and x64, all
results in caches and pipes are guaranteed to be in a consistent state throughout the
entire multiprocessor cache structure. Therefore, once you have acquired any locking
object, you are guaranteed that all processors are seeing the same memory contents, one
way or another.

Strictly speaking, shared variables should be declared 'volatile' to ensure the compiler
has not kept a copy around, and this is completely orthogonal to the concept of locking;
it appears that the Microsoft C/C++ compiler is very conservative about code motions
across functions and therefore seems less subject to this kind of error, although the
formal requirements of the C/C++ language would dictate that the use of 'volatile' is
mandatory to ensure the compiler has not done something bad. (Most optimizing compilers I
worked on over the years in fact would have required 'volatile', had that been part of the
languages we were compiling; instead, we had to add "optimization fences", which were
different from what are now known as "memory fences", to break the optimizer, so our lock
calls were macros that both acquired the lock and broke the optimizations by injecting an
optimization fence into the intermediate instruction stream inside the compiler).

[Historical footnote: Steve Hobbs wrote the optimizer for the Bliss-11 compiler, and the
optimization fence injected an inline assembly-code 'instruction' which was essentially a
coimment with no code attached; looking at the generated compiler source/object listings,
they were sprinkled with the line that today we would write as
__asm { ; Buzz off, Hobbs}
When the optimizer hit an inline assembly code node, it stopped all common subexpression,
alpha motion, omega motion, value propagation, register-targeting, instruction-folding,
and peephole optimizations (and perhaps others), and when the final .obj file writer came
along, it saw an empty node and emittted no instruction.
]
joe
On Tue, 3 Jul 2007 09:13:59 -0700, "Alexander Grigoriev" <alegr@xxxxxxxxxxxxx> wrote:

A synchronization primitive without memory visibility guarantee CANNOT be
used to synchronize access to some data set, thus it's pretty much useless.
You might as well not use it at all, with the same result.

Atomic functions (at least on x86 and x64), DO provide memory fence.
CRITICAL_SECTION simply uses InterlockedCompareExchange. I think
pthread_mutex is no different.

Having said that, implementation of reentrant intra-thread mutex primitive
is VERY trivial. You can cook such a thing in 10 minutes:

struct MY_CRITICAL_SECTION
{
LONG OwningThreadID;
LONG ReentranceCount;
LONG ContentionCount;
HANDLE hEvent;
};

void MyEnterCriticalSection(MY_CRITICAL_SECTION * p)
{
LONG CurrentThreadID = GetCurrentThreadId();
while (1)
{
InterlockedIncrement( & p->ContentionCount);
LONG OwnerID = InterlockedCompareExchangeAcquire( &
p->OwningThreadID, CurrentThreadID, 0);
if (CurrentThreadID == OwnerID)
{
// recursive entrance
InterlockedDecrement( & p->ContentionCount);
p->ReentranceCount++;
return;
}
if (OwnerID == 0)
{
// was not owned. We own it now
InterlockedDecrement( & p->ContentionCount);
p->ReentranceCount++;
return;
}
KeWaitForSingleObject(p->hEvent, INFINITE);
}
}

void MyLeaveCriticalSection(MY_CRITICAL_SECTION * p)
{
ASSERT(GetCurrentThreadId() == p->OwningThreadID);

p->ReentranceCount--;

if (p->ReentranceCount== 0)
{
// we don't own it anymore
p->OwningThreadID= 0;
if (0 != InterlockedExchange(p->ContentionCount, 0))
{
SetEvent(p->hEvent);
}
}
}

Note that it DOES privide memory visibility.

"Frank Cusack" <fcusack@xxxxxxxxxxx> wrote in message
news:m24pkn2szi.fsf@xxxxxxxxxxxxxxxxxx
On Sun, 1 Jul 2007 21:35:36 -0700 "Alexander Grigoriev"
<alegr@xxxxxxxxxxxxx> wrote:
CRITICAL_SECTION (EnterCriticalSection, LeaveCriticalSection) provides a
memory fence, too.

Seems expensive then. A pthreads semaphore (typically used to guard a
critical section) has no memory visibility guarantees and as such can
be implemented with so-called atomic ops instead of memory barriers.

CRITICAL_SECTION is fast, intra-process, recursive mutual exclusion
synchronization object.

Not as fast as a pthreads semaphore, since the CRITICAL_SECTION has
to execute a memory barrier (fence).

Kernel mutex (CreateMutex) can be used to synchronize _between_
processes,
too, but since it requires a roundtrip to kernel mode, its overhead is
more
than of CRITICAL_SECTION. A kernel mutex also takes care of priority
inversion, which CRITICAL_SECTION does not.

posix mutexes (good implementations, anyway) only require entry into
the kernel when they are contested. Uncontested mutexes are extremely
fast. Solaris has an "adaptive mutex" which only enters the kernel
after spinning for a bit first, which is consistent with the design
philosophy that you should only hold on to mutexes for very short
periods of time. (There are more rules but it's not important to my
point.)

It almost doesn't make sense that CRITICAL_SECTIONs execute a membar
since Windows really only runs on x86, which is TSO. This is probably
just a nod towards more widespread use (ie, less skilled programmers)
who might otherwise get the semantics wrong.

-frank

Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
.



Relevant Pages

  • Re: lpc2138 taking commands merely as a suggestion?
    ... when I try other types of optimization, some work better in making the ... The compiler will do this wherever it can see that the results should not ... manipulating system control registers that are not just memory. ...
    (comp.sys.arm)
  • Re: Runtime memory effect on compile and link
    ... I suspect that the problem is that the LTCG optimization, does a different job if it has more memory use at link time. ... and some different code is generated by the compiler. ...
    (microsoft.public.vc.language)
  • Re: C#, Threads, Events, and DataGrids/DataSets
    ... JIT compiler just compiles IL to x86 (or whatever platform it is developed ... optimization theory if you are interesting). ... then volatile could be important to guarantee that all memory writes ...
    (microsoft.public.dotnet.general)
  • Re: Defeating Optimisation for memcmp()
    ... You need to expand your description of the problem to make that point clear; as soon as you do so, we can tell you what to do to prevent the optimization. ... The detail is simply that the code is writing to CMOS. ... The only reason I can think of for your comments is that you were assuming I wanted to check the write because I wasn't confident that memcpy was implemented correctly. ... You said that the memory being pointed at was not volatile. ...
    (comp.lang.c)
  • Re: Implement memcmp function with help from gccs inline asm
    ... I would expect an optimising compiler to generate ... memcmp needs to compare memory as unsigned. ... > optimization, both snippets of code should be identical on P4. ... > 2 when comparing cached data. ...
    (comp.lang.asm.x86)