Re: Direct Copying To Share Memory In NDIS ProtocolReceive




Thomas,

Do make an estimate of the event interactions that must be used to
synchronize each "slot". Events that cross the user-kernel boundary will be
a performance bottleneck. That's the number that I believe must be minimized
to achieve decent performance. (Others may have different thoughts...).

Actually, this is the main reason why I said that the performance improvement
may be rather slim, if there is any at all.

No matter how you look at it, apps still have to synchronize their access to
the shared buffer, which means that they will inevitably have to call
WaitXXX() functions that involve a user-to-kernel transition before they can
access the buffer. If you just make your driver pend IRP_MJ_READ, the
asynchronous I/O completion routine will have to re-submit the request, i.e.
call ReadFileEx(), upon every invocation. Is ReadFileEx() much more expensive
than the WaitXXX() functions? I don't think so. After all, when you make a
call that involves a transition to kernel mode, the lion's share of the time
is spent on parameter validation, rather than on processing the call itself.
Therefore, sharing a buffer does not seem to offer an advantage so
significant that it is worth all the pain of dealing with things that you
don't need to think about when using "standard" IO.

Consider having your "Prime" application allocate user-mode memory that will
eventually be shared amongst all your user and kernel components. Pass it to
one driver using an IRP that is pended, etc. Standard stuff...

Another concern is locking memory. If you use METHOD_BUFFERED, the system
has to copy all data from the system buffer to the user one, and vice
versa. If you use direct I/O (METHOD_IN_DIRECT or METHOD_OUT_DIRECT), the
system does not have to copy data. However, the same question arises again:
is copying data really such an expensive operation? You save a few machine
cycles but, as a result, have to keep memory locked for extended periods of
time (i.e. for as long as the "prime" app is active). This is not a big deal
if the shared section is just a few pages in size, but if it is reasonably
large, keeping it locked for extended periods may lead to overall performance
degradation. You get the pain of dealing with all the additional complexity,
and end up with performance degradation rather than improvement!

If I were in the OP's place, I would first try to do everything "stupid and
simple" and see how it all works. I would start thinking about optimization
if and only if I were not happy with the performance. However, the OP is
thinking about it at the very beginning of his project.

According to Knuth, we should forget about small efficiencies about 97% of
the time: "premature optimization is the root of all evil". I think that
there is a good chance that the OP's case falls into that 97%.

Anton Bassov

"Thomas F. Divine" wrote:


"Le Chaud Lapin" <jaibuduvin@xxxxxxxxx> wrote in message
news:1172849765.066871.323980@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi Thomas & Anton,

On Mar 2, 1:02 am, "Thomas F. Divine" <tdivine@NOpcausaSPAM> wrote:
-Le Chaud Lapin-

Surely with enough work and debugging you can get your scheme to work. Do
sit back and seriously consider my personal motto: "Keep it simple,
stupid!".

I'm considering, I'm considering. :)

I guess I asked the question in the wrong way. What I should have
done was give the context of the unchangeable requirements, and ask what
the best way to do the packet copies is.

I have several processes that share a section in RAM, say 27 processes,
and one of the processes (the prime process) loads a driver that binds to
a miniport to send and receive frames. The system is designed so that every
frame that is transmitted to the miniports via the protocol must
originate in the shared section, and every frame received from a
miniport via the protocol must be placed in the shared section. I
currently have 128 frame slots in the shared section, and whenever the
prime process needs to send a frame, it might not have the opportunity to
batch the send, as there might be only 1 frame to send. On receives,
I would like to batch the receives in ProtocolReceive for bursty
traffic, but bypass the copy into a queue, copying directly to the shared
section. Since the frame slots are limited, I do not want to try to
guess the optimum preallocation count in ReadFile to batch reads, or I
will starve the other processes of frames.
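
A minimal sketch of the slot arrangement described above, written as portable C rather than driver code. The names (slot_ring, ring_put, ring_get) and the 1514-byte frame limit are assumptions made for illustration; only the count of 128 slots comes from the post. It models a single producer (standing in for ProtocolReceive) copying each frame straight into the next free slot, and refusing the frame when no slot is free. It deliberately leaves out cross-process and user/kernel synchronization, which is the hard part in a real design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 128   /* number of frame slots, as stated in the post */
#define FRAME_MAX  1514  /* assumption: largest frame a slot must hold */

typedef struct {
    uint32_t length;            /* number of valid bytes in data[] */
    uint8_t  data[FRAME_MAX];
} frame_slot;

typedef struct {
    uint32_t   head;            /* count of frames consumed so far */
    uint32_t   tail;            /* count of frames produced so far */
    frame_slot slots[SLOT_COUNT];
} slot_ring;

static void ring_init(slot_ring *r)
{
    assert(r != NULL);
    r->head = 0;
    r->tail = 0;
}

/* Slots currently occupied. Unsigned subtraction stays correct across
   counter wrap-around because SLOT_COUNT divides 2^32. */
static uint32_t ring_used(const slot_ring *r)
{
    return r->tail - r->head;
}

/* Producer side: copy one frame directly into the next free slot.
   Returns 1 on success, 0 if the ring is full or the length is invalid. */
static int ring_put(slot_ring *r, const void *frame, uint32_t len)
{
    frame_slot *s;

    if (len == 0 || len > FRAME_MAX || ring_used(r) == SLOT_COUNT)
        return 0;
    s = &r->slots[r->tail % SLOT_COUNT];
    memcpy(s->data, frame, len);
    s->length = len;
    r->tail++;
    return 1;
}

/* Consumer side: copy the oldest frame out and free its slot.
   Returns the frame length, or 0 if the ring is empty or out is too small. */
static uint32_t ring_get(slot_ring *r, void *out, uint32_t cap)
{
    const frame_slot *s;
    uint32_t len;

    if (ring_used(r) == 0)
        return 0;
    s = &r->slots[r->head % SLOT_COUNT];
    len = s->length;
    if (len > cap)
        return 0;
    memcpy(out, s->data, len);
    r->head++;
    return len;
}
```

With a layout like this, a burst of frames can be copied into consecutive slots and the consumers woken once for the whole burst, instead of once per frame.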

So I guess my question is: given that the target of all miniport-indicated
frames must be that shared section, what will be the
performance penalty of copying full packets from the large buffers
set up by ReadFile to the corresponding buffers in the shared
section? Is it much less significant than the kernel transitions?
Also note that this scheme will require a kernel transition anyway,
when ReadFile attempts to lower the semaphore on the frame slots in
the shared section.

-Le Chaud Lapin-

Do make an estimate of the event interactions that must be used to
synchronize each "slot". Events that cross the user-kernel boundary will be
a performance bottleneck. That's the number that I believe must be minimized
to achieve decent performance. (Others may have different thoughts...).
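
One rough way to make that estimate, as a sketch only: count how many signals are needed to hand over a given number of frames when a single signal can cover a batch of them. The function name signals_needed and the batch sizes used in the comments are hypothetical; real costs depend on how the events are actually waited on and set.

```c
#include <assert.h>
#include <stdint.h>

/* Number of signalling round trips needed to hand over `frames` frames
   when one signal covers up to `batch` frames (ceiling division).
   Each signal implies at least one wait call, and so at least one
   user-to-kernel transition, on the application side. */
static uint64_t signals_needed(uint64_t frames, uint64_t batch)
{
    assert(batch > 0);
    return (frames + batch - 1) / batch;
}

/* Example: filling all 128 slots costs 128 signals when signalling
   per frame, but signals_needed(128, 16) == 8 with batches of 16. */
```

The point of the calculation is that the per-frame copy cost stays the same in either case; what changes with batching is how often the user-kernel boundary is crossed.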

You should still be able to preserve the "safety net" provided by the OS
with the inverted call mechanism.

Consider having your "Prime" application allocate user-mode memory that will
eventually be shared amongst all your user and kernel components. Pass it to
one driver using an IRP that is pended, etc. Standard stuff...

I don't see why your "Prime" application couldn't pass the same memory to a
second driver also. Having done so, the "Prime" application and two drivers
have access to a common memory area. In addition, the system will
automatically call both drivers' Cleanup/Close routines when the "Prime"
application exits. This is the feature that you really don't want to
re-invent yourself.

Now the "Prime" application is all set. What about the other processes?

You can share memory across processes using "named shared memory" and "named
events", etc. It takes a little work, but it is all in user mode, with much
less chance of crashing the system.

Just a thought...

Thomas F. Divine

