Re: MMX Copy

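The reason is your loop. Assuming you have a CPU with 2MB of L2
cache, a source and a destination buffer of less than 1MB each can
both fit in it, so after the first pass through the loop you are
simply copying from one area of the cache into another.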

--
=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@xxxxxxxx
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

<gsudeesh@xxxxxxxxx> wrote in message
news:1140148822.236940.235400@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hi All,
I am analyzing memory copy performance, comparing the memcpy
function with an MMX method. I made this interesting observation:

MMX is slower than memcpy when I copy small blocks of less than
1MB. But when I copy more than 1MB, MMX is faster.

Also please note that this happens when I run the performance test
in a loop. That is, when I copy the data using the memcpy/MMX method
10 times in a loop continuously.

Can anyone give me the reason?

Thank You and Best Regards,
Sudeesh G.


