Re: CMultiLock example
- From: Joseph M. Newcomer <newcomer@xxxxxxxxxxxx>
- Date: Wed, 19 Aug 2009 10:44:24 -0400
[Your wish is my command...]
Seriously, though, locking is *mandatory* if two or more threads are accessing non-scalar
data in ways that involve one or more of the threads doing potential non-monotonic
non-robust concurrent modification of the data.
By "removing all locking", I implied that you create situations in which there is no
concurrent modification of data. Deep and fundamental to the architecture of your system,
there is no concurrent modification of the data. One technique is to have one thread "set
up" the structure so it is "ready-to-go" and then fire off all the computational threads,
which can use the data but never modify it. So if you have a set of linked list objects
that are directing the computation, it is essential that under no circumstances will those
lists ever be modified while the threads are running. Furthermore, if there is any data
referenced by the lists, that data must not be modified except in a synchronized fashion
during the computation. Note that I said "in a synchronized fashion", not "under control
of a lock", because in some cases (e.g., the partial differential equation solver I
mentioned earlier) the hardware synchronized access to the memory (so you couldn't get a
half-float value fetched while the other half-float was stil being stored). But critical
to this design was the realization that hardware synchronization existed to protect the
atomic values, and the logic of the program did not require that the value be The Very
Latest Value; the algorithm would work correctly if it operated on obsolete data, it just
might take an extra iteration to converge to the same value. This worked because the
modifications were monotonic and robust. By "robust" I mean that if a thread did a
computation with obsolete data, it *still* got a correct answer. This means that
operations like a++ or a--, if a is a shared variable, *must* have locking, because the
computation is non-robust (hint: use InterlockedIncrement/InterlockedDecrement and let the
hardware handle it). But setting a variable to FALSE in one thread, where no other thread
can ever set it back to TRUE, is monotonic, and in many contexts robust (e.g., thread
shutdown by setting a "threads should continue to execute" BOOL to FALSE).
If you just take out all the locks and you have complex data (more than one value
interacting, say, for example, the pointers in a linked list) that is modified, then your
code is again incorrect. This is because linked list manipulation of a 2-way linked list
(as is CList) is non-monotonic and non-robust. Or, more simply put, it's wrong to do this
without a lock.
But note that you only have to apply the lock when you CHANGE the list. You don't lock
the entire list at the start of the computation and hold it for the entire computation
(that just serializes all the computations, as you already observed). And you cannot lock
more than one list at a time, unless you guarantee that the lists are *always* locked in
the same order, which must be monotonic (this is the deadlock avoidance issue Doug was
talking about). If you have four lists, call them lists 1, 2, 3 and 4, the valid locking
orders are 1; 2; 3; 4; (that is, never more than one list at a time!) or 1-2; 1-2-3;
1-2-3-4; 1-3; 1-3-4; 1-4; 2-3; 2-3-4; 2-4; 3-4; AND THAT'S ALL! Any locking sequence that
locks a list such that the number of the list is not strictly greater than all previous
lists locked along that execution path means you will deadlock and your threads will hang.
To make the "no lock" code work correctly, you need certain kinds of atomic operations;
for example, I often use an I/O Completion Port as an interthread queue, and sometimes a
PostMessage queue as an interthread queue. Note these actually do use locking internally
so they work correctly, *but I don't have to reason about those locks being correct*. They
are also "leaf locks" meaning that you cannot get deadlock by using them; they are the
leaves of any complex locking tree. They do not themselves do any secondary locking so
they can't deadlock with other threads using those same internal locks, so a whole class
of lock reasoning disappears from your concern.
So you have to do some evalutations. You have not said what the lists represent, for
example. If the lists represent "work lists" of things to do, you can replace them
completely by an I/O Completion Port and a worker thread, putting things in with
PostQueuedCompletionStatus and removing them with GetQueuedCompletionStatus. Or, if there
are a very small number of these, it might be simpler to create a secondary UI thread,
with a Message Map to dispatch the requests, and use PostThreadMessage.
But given you have told us practically nothing about the nature of the data structures, it
is hard to make suggestions for a specific implementation strategy.
Hints:
(a) you can't take existing single threaded code and use it without locks
(b) you can't sprinkle mutexes/CRITICAL_SECTIONs around like pixie dust
and expect your code to fly
Strategy:
Wait a minute, we've all been saying this in nearly every reply: you have to
REDESIGN your code to work in a multithreaded environment.
What surprised me was this comment:
I should point out that I have many years of experience working on
multi-processor systems. My difficulties now are because I was spoilt
by those systems as they did all the concurrency and data access
checking by hardware.
I have been using multiprocessors since 1975. At no time in the entire history of
multiprocessors am I aware of *any* architecture that would do "concurrency and data
access checking by hardware" in a way that would allow any arbitrary piece of code to
survive without explicit locking. They did not lock arrays, or linked lists, or queues,
or anything else I am aware of. Now some would do streaming data instructions on vectors
and/or arrays, but that code was carefully crafted so no other processor would do
streaming data instructions on the *same* data. Compilers like FORTRAN could easily
decompose problems into disjoint concurrent streams by doing the partitioning we've been
telling you about; it worked because the computations followed extremely rigid patterns
which the compiler could recognize.
Hardware locking can protect at most one scalar value from concurrent access. Sometimes
that "scalar" is wide, e.g., LOCK CMPXCHG8B (32-bit) or LOCK CMPXCHG16B (64-bit) which can
manipulate wide values atomically; for example the management of queued spin locks in the
kernel involves using one of these instructions to clear the lock and unlink the queue
element in a single hardware-atomic operation), but it is still a "scalar" value from the
viewpoint of the instruction involved.
I could suggest looking at OpenMP as one of the options, but it only works correctly in
very limited domains, usually dealing with mathematical computations. And what it often
does is handle the fork/join logic that is tricky in C/MFC. But without any understanding
of the basic nature of your data, it is hard to tell if OpenMP is of positive or negative
value in solving your problem.
P.S., there is new support in Vista for multiple-reader-single-writer queueing. But it
uses spin locks for the exclusion, and doesn't guarantee FIFO ordering or have
antistarvation mechanisms.
joe
On Wed, 19 Aug 2009 03:41:42 -0700 (PDT), Goran <goran.pusic@xxxxxxxxx> wrote:
On Aug 19, 2:11 am, Stephen Wolstenholme <st...@xxxxxxxxxxxxxxxxxxxx>Joseph M. Newcomer [MVP]
wrote:
Since my original question I have read all the suggestions and
comments and decided to try the application with four calculating
threads started and no locking. It ran all night in debug. In release
mode it ran for about a minute before hanging up.
Wait, what!?
Am I right in concluding that you have data (a CList<> or some such
object, but that's irrelevant) that you read/modify from multiple
threads at once, and you just tried to see if that would work? And you
wonder why it worked in debug, but not in release!?
If so, I really hope Joe N. pulls one of his rants on you.
You just can't do that and it's quite bad that you seem not to
understand why (that's assuming that if you did understand, you
wouldn't even tried).
You must prevent that one piece of data is modified from multiple
threads at once. You use a mutex for that (in MFC parlance, that's
CCriticalSection or perhaps CMutex).
You also must prevent that one piece of data is modified in one thread
while it is being read in another. You use a thing called read-write
mutex for that (see e.g.
http://www.boost.org/doc/libs/1_34_1/doc/html/thread/concepts.html#thread.concepts.ReadWriteMutex).
No MFC equivalent for that; I saw some MFC shots at that on the net
that looked dubious, but it's been a long time.
These are two basic things you must not do. Dare I say: absolutely all
the theory and practice stems from there. Observe them and you'll be
fine. Assume that if nothing is done, assaulting data with more than
one thread at a time will break said data. There are only very
specific situations where it would not happen (e.g. Joe said something
about reading a scalar variable that was DWORD aligned - that AFAIK is
true on i386, but not a general truth).
You can of course read data from as many threads as you want
simultaneously. That's easier said than done IMO (read-write mutex
helps).
Also... For very small amounts of data, you have APIs that are in fact
wrappers for CPU test-and-set instructions (see http://en.wikipedia.org/wiki/Test-and-set
and "Interlocked Variable Access" in MSDN). These are sometimes
helpful (doesn't seem to be your case, but you have provided little
specifics).
The threads are multiple instances of the same code. I'm trying to get
an application to run a background calculation on two and four
processor systems. The data is referenced by multiple lists. It works
fine on a single processor but needs to exhibit a performance
improvement with multiple processors.
That's of course the hard bit. You must somehow make your threads not
work on the same data at the same time (otherwise they will compete
for access and speed will go poof to the point of being considerably
slower than a single thread solution). Without knowing any specifics,
I can offer only the most general advice: isolate "calculation" and
"store" parts of the code. Do your calculation in any given thread in
isolation, using a copy of the data, to avoid contention (copying goes
a long way in parallel work - it's a classic speed vs. memory usage
trade-off right there). Store results when done. If you can make your
threads work on separate data well, you're done! BTW, are you doing
neural network "training"? If so, I would guess this split should be
possible because your network structure should be viewable as isolated
parts. (That's what I seem to remember from the days of yore. Don't
take my word on that, though.)
HTH,
Goran.
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
.
- Follow-Ups:
- Re: CMultiLock example
- From: Stephen Wolstenholme
- Re: CMultiLock example
- References:
- CMultiLock example
- From: Stephen Wolstenholme
- Re: CMultiLock example
- From: Stephen Wolstenholme
- Re: CMultiLock example
- From: Goran
- CMultiLock example
- Prev by Date: Re: CSV files with MFC excel automation
- Next by Date: Re: CSV files with MFC excel automation
- Previous by thread: Re: CMultiLock example
- Next by thread: Re: CMultiLock example
- Index(es):
Relevant Pages
|