Re: /CLR floating point performance, inter-assembly function call performance
From: Bern McCarty (bern.mccarty_at_bentley.com)
Date: 05/06/04
Date: Thu, 6 May 2004 08:59:11 -0400
From reading various things, I had already recognized the points that you
state as the current conventional wisdom. I went to the trouble of posting my
results in the hope of getting some feedback on why they run so strongly
against that conventional wisdom. Please consider:
1) Floating point performance of managed code. At least in this little
test scenario, floating point performance of managed code doesn't seem to be
a problem at all. In the first call out of the 8 in a test run, the
DMatrix3d_multiplyDPoint3dArray function is asked to apply the matrix to a
whopping 5,000,000 3D points per call. So it is just sitting there doing
floating point operations in a 5,000,000 iteration loop, and there are no
function calls in that loop at all. The managed version took only 3% longer
in that case than the all-native version. It seems logical, then, to rule out
floating point performance as the culprit when things quickly change for the
worse in the later calls, where the call granularity to
DMatrix3d_multiplyDPoint3dArray becomes very fine. It makes more sense to
attribute the slowdown observed in the fine-grained call cases to function
call overhead, not to floating point performance.
2) The expense of transitions. What am I doing wrong? The version of my
test program that involves a transition in the call from
test_applyMatrixToDPoints->DMatrix3d_multiplyDPoint3dArray is actually
FASTER than the all managed version (true for both the intra-assembly and
inter-assembly call cases). Furthermore, the more finely grained the calls
are, the more the native->managed version outperforms the managed->managed
versions. Since we already established that the raw floating point performance
of the loop inside the DMatrix3d_multiplyDPoint3dArray function is nearly
equivalent between the managed and native versions, and the conventional
wisdom is that native->managed transitions are expensive and bad, what
is to blame for the poor relative performance of the managed->managed
versions? The managed->managed version is flat-out beaten by the version
that does a transition for each and every call. It would seem that there is
some serious penalty associated with making regular managed->managed
function calls - not managed->native calls. What might be responsible for
it and is it something I have any control over?
3) The surprising difference in cost between inter-assembly and
intra-assembly managed->managed calls. Can someone explain this difference
and is there anything that can be done about it besides making my program
one enormous executable?
4) How can I step through JIT compiled code in assembly language in a
debugger for a release executable so that I can see what is going on? I
want the JIT to produce "non debug" x86 instructions and yet I want to step
through them to see what they do. Tips appreciated. Can I do this with the
VS.NET debugger? Windbg? How?
"Yan-Hong Huang[MSFT]" <yhhuang@online.microsoft.com> wrote in message
news:kGLwODzMEHA.3808@cpmsftngxa10.phx.gbl...
> Hello Bern,
>
> Generally speaking, the v1 JIT does not currently perform all the
> FP-specific optimizations that the VC++ backend does, making floating point
> operations more expensive for now. That may be why managed->managed is more
> expensive than managed->unmanaged in your test.
>
> So for areas which make heavy use of floating point arithmetic, please use
> profilers to pick the fragments where the overhead is costing you most, and
> keep the whole fragment in unmanaged space.
>
> Also, work to minimize the number of transitions you make. If you have some
> unmanaged code or an interop call sitting in a loop, make the entire loop
> unmanaged. That way you'll only pay the transition cost twice, rather than
> for each iteration of the loop.
>
> By looking into ILCode, we can see that when InterOping, there are some
> extra IL instructions. So minimizing the number of transitions can save
> many IL instructions and improve performance.
>
> For some more information, you can refer to this chapter online:
> "Chapter 7 - Improving Interop Performance"
>
> http://msdn.microsoft.com/library/en-us/dnpag/html/scalenetchapt07.asp?frame=true#scalenetchapt07_topic12
>
> Hope that helps.
>
> Best regards,
> Yanhong Huang
> Microsoft Community Support
>
> Get Secure! - www.microsoft.com/security
> This posting is provided "AS IS" with no warranties, and confers no rights.
>