Re: Assembling Visual Studio generated listing files



See below...
On Sun, 09 Dec 2007 15:54:41 -0800, Geoff <geoff@xxxxxxxxxxxxxxx> wrote:

On Sun, 09 Dec 2007 08:47:47 -0500, Joseph M. Newcomer
<newcomer@xxxxxxxxxxxx> wrote:

No, you NEVER use MASM for driver development, EVER. (I teach a driver course and that is
one of the points we emphasize!)


OK. NEVER is a pretty strong word. Let's try an easy one.

You have a proprietary interface card. Among other things it has sets
of 32-bit position registers and associated 32-bit timestamp
registers. Each set consists of OldPosn and OldTime, and NewPos and
NewTime and a position request called CommandedPos. One set of five
registers for each of 32 positioners. When the hardware gets a new
position it automatically puts the last position and timestamp in the
Old* set, and the current position and time are placed in the New* set. The
card issues an interrupt each time it has a new position for any one
of the 32 positioners. The service routine must pick up these
registers and deposit a new target position in the CommandedPos register
for that positioner. The maximum expected interrupt rate is 80 kHz, or one
interrupt every 12.5 us. Assume memory-mapped I/O.
****
An ISR is constrained to act in < 10us, and for most modern machines, 12.5us is a very
generous window of time.
****

Without MASM and in pure C or C++ and without any Windows API/MFC,
write an ISR that will:
****
This is a device driver. The Windows API and MFC are completely irrelevant to the
discussion. C++ cannot be used in the kernel (in spite of the fact that streaming drivers
are written in C++, there is a suppressed internal memo I've heard rumors about that
discusses why C++ cannot possibly work in the kernel), so we are left with pure C. Note
that even in pure C, I would not need assembly code to write anything in a device driver,
because in the very few places where special code (such as LOCK prefixes) is required, there
are HAL routines to do the work, and I would call those from C.

I have never discovered a reason to use MASM in a device driver.
****

1. Get old and new registers and store their contents in a global
location for pickup by other tasks.
****
Global location? Device drivers don't use global variables, for very good reasons.
However, suppose I have two registers, I would write

BOOLEAN MyISR(PKINTERRUPT Interrupt, PVOID context)
   {
    // this assumes the device extension is passed as the context
    // (the parameter cannot be named "int"; that is a C keyword)
    PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)context;
    if(!(is-it-my-interrupt?))
        return FALSE;   // not our device; let other ISRs on the line try
    ...dismiss interrupt
    ... for example, compute the register set index i...
    ext->oldpos = READ_REGISTER_ULONG(&ext->registeraddress[i]);
    ext->oldtime = READ_REGISTER_ULONG(&ext->registeraddress[i+1]);
    ...etc...
    return TRUE;        // interrupt was ours and has been serviced
   }

I see no assembly code here...optimized, this code is about as fast as you can write in
assembler.
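
For readers who want to experiment outside the kernel, here is a minimal user-mode sketch of the same pattern. It is not driver code: read_register_ulong and write_register_ulong are hypothetical stand-ins for the HAL's READ_REGISTER_ULONG and WRITE_REGISTER_ULONG, the device_extension layout is invented for illustration, and the order of the five registers within a set is an assumption, since the original question does not give the card's register map.

```c
#include <assert.h>
#include <stdint.h>

#define POSITIONER_COUNT 32
#define REGS_PER_SET     5

/* ASSUMED register order within one set; the real card's layout is unknown. */
enum { REG_OLD_POS, REG_OLD_TIME, REG_NEW_POS, REG_NEW_TIME, REG_COMMANDED_POS };

/* Hypothetical user-mode stand-ins for the HAL register accessors. */
static inline uint32_t read_register_ulong(volatile uint32_t *reg)
{
    return *reg;
}

static inline void write_register_ulong(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* One captured sample per positioner. */
typedef struct {
    uint32_t old_pos;
    uint32_t old_time;
    uint32_t new_pos;
    uint32_t new_time;
} position_sample;

/* Per-device state lives here instead of in global variables. */
typedef struct {
    volatile uint32_t *registers;                 /* base of the register block */
    position_sample    samples[POSITIONER_COUNT]; /* latest captured values     */
    uint32_t           commanded[POSITIONER_COUNT]; /* next target positions    */
} device_extension;

/*
 * Body of the ISR for one positioner: capture the four status registers
 * into the extension, then write the new commanded position.
 * Returns 1 if the index was valid and serviced, 0 otherwise.
 */
int service_positioner(device_extension *ext, unsigned index)
{
    assert(ext != 0 && ext->registers != 0);
    if (index >= POSITIONER_COUNT)
        return 0;

    volatile uint32_t *set = ext->registers + index * REGS_PER_SET;
    position_sample *s = &ext->samples[index];

    s->old_pos  = read_register_ulong(&set[REG_OLD_POS]);
    s->old_time = read_register_ulong(&set[REG_OLD_TIME]);
    s->new_pos  = read_register_ulong(&set[REG_NEW_POS]);
    s->new_time = read_register_ulong(&set[REG_NEW_TIME]);

    write_register_ulong(&set[REG_COMMANDED_POS], ext->commanded[index]);
    return 1;
}
```

The whole routine is four reads and one write plus index arithmetic, which is the point being made: there is nothing here for hand-written assembly to improve on. In a real driver the extension would also need whatever synchronization the non-ISR code paths require.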
****
2. Obtain a new commanded position value from global storage and put
it in the CommandedPos register.
****
WRITE_REGISTER_ULONG(ext->CommandedPos, ext->newvalue);

I see no assembly code here. Note that this will involve a LOCK instruction, because
otherwise the value will not be forced from the CPU's write pipe to the device for some
time. Without a proper memory fence, you can't be guaranteed the data will get out, even
if the page is mapped as non-cached.

There would be no global variables. Global variables would be a mistake in a driver in
this context. "Global" state would be kept in the device extension. Global variables are
used VERY rarely, and VERY carefully, and never for a purpose like this. This is clearly
device-specific state and therefore would be kept in the device extension. Some of the
data might even come from the IRP, and the IRP pointer(s) would be kept in the device
extension as well.
****
3. This ISR may not be preempted.
****
This is impossible in Windows; an ISR can always be preempted by another interrupt, and
there's no way to prevent that. It is not permitted to disable interrupts on the current
CPU, and in any case, in a multiprocessor, you have to deal with the interrupt routing
issues, deal with the KeSynchronizeExecution locks, deal with the fact that the HAL and
Kernel own the interrupt system and you cannot bypass them, and a few other details. You
have no access to the IDT. If you manage to get all of this to work, you have bypassed so
much of Windows that to call the result "unmaintainable" is to grossly understate the
problem.
****
4. Maximum allowable execution time of the ISR: 300 nS
****
Once IN the ISR, 300ns on a 3GHz machine is several hundred instructions, not counting the
overhead of LOCK prefixes to force the write pipe flush. If there were a lot of writes,
you could write directly, as long as there is ultimately a LOCK cycle to force the write
pipe flush at the end of the sequence, and that is one of the known optimizations you can
do. But you can't control the path to your ISR, or control preemption, and one of the
failures of the current I/O system is that it is VERY difficult to force a given interrupt
to a specific priority, which is why it is usually a Very Bad Idea to program a
hard-real-time device with very tight time constraints on a Windows platform. There is no
predictable bound on interrupt latency, no predictable bound on DPC dispatch, and no
predictable bound on dispatch to user space. You'd need a realtime platform such as
www.ardence.com, which runs Windows as a low-priority thread, to really get the kind of
performance this requires.

Note that between the IDT and your ISR is a fair amount of code, including
KiInterruptDispatch, so you have no guarantees about interrupt latency. You cannot really
bypass all this; Windows is a delicate ecology and you can't just poke values into the
interrupt vectors (you can't even FIND the IDT, in general), and if you do, you can't be
guaranteed that your driver will work in every chipset configuration. The HAL handles all
this for you (how DO you flush the caches? Answer: call KeFlushIoBuffers).

I've worked with some fairly high-performance drivers, doing code walkthroughs and
debugging, and colleagues have written SERIOUSLY high-performance drivers, and we never
used a single line of assembly code anywhere. We're not even sure how we'd do it for an
ISR.

(A 2.8GHz machine has a CPU clock cycle of 350ps, and can, as a pipelined superscalar
processor, dispatch one instruction every 175ps. Under optimal conditions, assuming most
data is in the L1 cache, this would work out to slightly over 1700 instructions in 300ns.
But an L1 cache miss to an L2 cache hit would cost 1-2 CPU clock cycles, an L2 cache miss
would be horrendous and take 40-100 CPU clock cycles, so as a practical matter, you've
probably got a few hundred instructions at most that you can guarantee will fit into a
300ns handler. An ISR is nominally limited to 10us, and that's about 57,000 instructions
under optimal pipelining/cache conditions; as a practical matter, probably 10,000
instructions could be comfortable, and that's a HUGE piece of ISR code! If you work with
a design rule of 1ns/instruction on a 2.8GHz machine you are usually in a comfortable
position).
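
The arithmetic in the paragraph above can be checked with a few lines of C. This is only a sketch using the post's own rounded figures (175 ps per instruction under ideal pipelining, 1 ns per instruction as the conservative design rule); the function names are made up for illustration, and the model deliberately ignores cache misses and LOCK overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Picoseconds between interrupts at a given interrupt rate in Hz. */
uint64_t interrupt_period_ps(uint64_t rate_hz)
{
    assert(rate_hz > 0);
    return UINT64_C(1000000000000) / rate_hz;
}

/*
 * Whole instructions that fit in a time window, assuming one instruction
 * is issued every ps_per_instruction picoseconds (an idealized model).
 */
uint64_t instruction_budget(uint64_t window_ps, uint64_t ps_per_instruction)
{
    assert(ps_per_instruction > 0);
    return window_ps / ps_per_instruction;
}
```

With these assumptions, interrupt_period_ps(80000) is 12,500,000 ps (12.5 us), instruction_budget(300000, 175) is 1714 and instruction_budget(10000000, 175) is 57142, matching the "slightly over 1700" and "about 57,000" figures. The 1 ns design rule gives 300 instructions for a 300 ns handler and 10,000 for a 10 us ISR.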

Do you have a counterexample in a real Windows driver that you can show? Note that Win9x
VxDs don't count; they were badly designed kludges on top of MS-DOS. It has to be in the
context of real Windows.
joe

****


What the OP hasn't done is explain why this step is required at all. I'm sure there are
better ways to solve the problem, and this essentially creates a completely unmaintainable
mess.

I agree. I am suspicious of his motives and am beginning to regret
helping him.
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm