Re: Advice using the movsb, movsw, movsd instructions

Andy Champ wrote:

Tim's right, use memcpy. If you pull it apart you'll find it does
clever things like copying a few bytes at the beginning and the end of
the block to make sure that the big lump in the middle is dword aligned,
which means each read and write takes only one cycle. If they are not
aligned they take two (although the second one will come from cache, you
still want to avoid it). The other thing to do is to make sure both
your buffers have the same alignment: if one buffer is at an address
ending in zero and the other at an address ending in one, there is no
way to make both the read and the write aligned.

Yes, I saw that too.
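
For readers who have not pulled memcpy apart, the head/body/tail idea Andy describes can be sketched roughly as below. This is only an illustration, not the actual memcpy of any runtime; the names copy_dword_aligned and same_dword_phase are made up for this example, and the buffers are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT the real memcpy: copies n bytes between
   non-overlapping buffers, aligning the destination to 4 bytes first. */
static void *copy_dword_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Head: copy single bytes until the destination is dword aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: move one dword per iteration. A fixed-size memcpy of 4 bytes
       is the portable way to express a 32-bit load and store. The source
       side is aligned here only if both buffers started with the same
       offset inside a dword. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4);
        memcpy(d, &w, 4);
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: the last 0 to 3 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

/* Nonzero when both addresses share the same offset inside a dword, i.e.
   when the head copy above aligns the reads as well as the writes. */
static int same_dword_phase(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 3u) == 0;
}
```

The second helper is the "same alignment" check from the last sentence of the quote: if it returns zero for your two buffers, only one side of the dword loop can be aligned, whatever the head copy does.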

That aside, are you sure you are optimising the right bit of your code?
Also, the clever trick will be to use MMX to pull the data out of the
source block and apply some sort of processing in one go.

Yes, there are a couple of graphs going on and memory movements are
used for bridging between them. It will save just a couple of percent
of CPU, but that already almost doubles the free CPU time (it runs at
over 80%). But as you say, memcpy (compiled with the right
optimization) does the job.


Optimising got a whole lot harder when they invented cache, and more so
with MMX. In my experience you'll get more from improved algorithms
than polishing the code.

Actually, Alessandro's reply (post 2) made me realise that I was
testing with a non-optimized build. Flipping the right compiler /O
switch gave me those few percent and brought CPU usage down to a bit
over 80%, compared to almost 90% before.

Regarding MMX, Jeremy suggests using IPP instead. I have to take a look
at it, because my understanding is that it is a higher level of
abstraction (over MMX?): more services, easier to program.

As I said earlier in the thread, I suspect the YUV to ARGB32 conversion
(actually done by the DV Decoder filter) consumes a significant amount
of CPU. Is it worth replacing this conversion with a custom filter
using IPP?
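
To give an idea of the per-pixel work involved, here is a scalar sketch of a YUV to ARGB32 conversion. It assumes the common BT.601 studio-range integer approximation; I do not know which matrix or code path the DV Decoder actually uses, and yuv_to_argb32 is a made-up name. An MMX or IPP version would do the same arithmetic on several pixels at once.

```c
#include <stdint.h>

/* Scale back from 8.8 fixed point and clamp to 0..255. Negative values
   are handled before shifting, so no signed right shift is relied on. */
static uint8_t clamp_shift8(int v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical helper: converts one YUV sample triple to a packed ARGB32
   pixel (0xAARRGGBB) with opaque alpha. Assumes BT.601 studio range:
   Y in 16..235, U and V centered on 128. */
static uint32_t yuv_to_argb32(uint8_t y, uint8_t u, uint8_t v)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;

    uint8_t r = clamp_shift8(298 * c + 409 * e + 128);
    uint8_t g = clamp_shift8(298 * c - 100 * d - 208 * e + 128);
    uint8_t b = clamp_shift8(298 * c + 516 * d + 128);

    return 0xFF000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}
```

That is three multiply-accumulate chains and three clamps per pixel, before counting the chroma upsampling, so the suspicion that this step is expensive at video rates seems reasonable. Profiling is still the only way to know.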

Also, I need to rotate the image 90 degrees (around the Z axis) and I
sometimes need to capture the VMR9 output to a file. Both can be solved
with a custom allocator/presenter. Some threads on this newsgroup
suggest that a custom allocator/presenter is the best (if not the only)
solution for capturing VMR9 output. I am working on that now. Another
advantage for me is that applying the rotation at this later stage in
the graph removes the need to handle rotation in the upstream
processing (which makes things a lot easier for me).
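
For reference, the index mapping of a 90 degree rotation on a 32-bit pixel buffer looks like the sketch below. It assumes a clockwise rotation and tightly packed rows (no stride padding), and rotate90_cw is a made-up name. In an allocator/presenter the rotation would more likely be done by Direct3D when the video surface is drawn, so this CPU version only illustrates the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rotates a w x h image of 32-bit pixels by 90
   degrees clockwise into dst, which must hold w * h pixels and is laid
   out as h pixels per row, w rows. Rows are assumed tightly packed. */
static void rotate90_cw(const uint32_t *src, uint32_t *dst, size_t w, size_t h)
{
    assert(src != dst); /* not an in-place algorithm */

    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            /* Source pixel (x, y) lands in row x, column h - 1 - y. */
            dst[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```

Note that the writes jump h pixels at a time, which is unfriendly to the cache for large frames. That is one more reason to leave the rotation to the presenter stage.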

Thanks for helping.
Jean-Marc.


Andy
