Re: using ref keyword performance



Why does the 'without ref' test run faster than the 'using ref' test?
As I understand it, in the 'without ref' test the parameter is passed by value, so a new storage location is created each time we enter the Test method.
In the 'using ref' test we pass the parameter by reference, so rather than creating a new storage location for the variable in the function member declaration, the same storage location is used.
From my point of view it must work faster.

There are two points of overhead that keep ref parameters from working faster:

1. The address of the value must be retrieved before passing it as a ref parameter.
2. That pointer has to be dereferenced in the called method before the value can be used.

Can you give some comments about this situation plz. Thanx.

The only real difference between passing a parameter by value and passing parameter by reference is that a pointer is used to pass by reference. So, the overhead is getting the address of the value to pass and dereferencing the pointer to get the value in the method that is called.
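
To make the by-value versus by-reference distinction concrete, here is a minimal sketch. The class and method names (RefSemanticsDemo, IncrementCopy, IncrementInPlace) are made up for illustration and are not part of the test code in this post. It only shows the semantics: the by-value method works on its own copy, while the ref method works on the caller's storage location through a managed pointer.

```csharp
using System;

// Hypothetical demo class; the names are illustrative only.
public static class RefSemanticsDemo
{
    // By value: 'k' is a copy, so the caller's variable is not affected.
    public static void IncrementCopy(int k)
    {
        k++;
    }

    // By reference: 'k' aliases the caller's variable, so the increment
    // is applied to the caller's storage location.
    public static void IncrementInPlace(ref int k)
    {
        k++;
    }

    public static void Main()
    {
        int value = 10;

        IncrementCopy(value);
        Console.WriteLine(value); // caller's variable is unchanged

        IncrementInPlace(ref value);
        Console.WriteLine(value); // caller's variable was incremented
    }
}
```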

The tests that you ran were not optimal for getting accurate timings. Here is the code that I used.

First, this is my HighResolutionTimer class:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Security;

namespace RefTest
{
public class HighResolutionTimer
{
// private fields...
private long m_Frequency;
private long m_StartCounter;
private long m_StopCounter;

// constructors...
public HighResolutionTimer() : this(false) { }
public HighResolutionTimer(bool start)
{
if (!QueryPerformanceFrequency(out m_Frequency))
{
Debug.WriteLine("HighResolutionTimer.ctor(): Error occurred while calling QueryPerformanceFrequency.");
return;
}

if (start)
Start();
}

// win32 api methods...
[SuppressUnmanagedCodeSecurity]
[DllImport("kernel32.dll")]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool QueryPerformanceCounter(
[Out] out long lpPerformanceCount);

[SuppressUnmanagedCodeSecurity]
[DllImport("kernel32.dll")]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool QueryPerformanceFrequency(
[Out] out long lpFrequency);

// private methods...
private double CalcDuration()
{
return ((double)(m_StopCounter - m_StartCounter)) / (double)m_Frequency;
}

// public methods...
public void Reset()
{
m_StartCounter = 0;
m_StopCounter = 0;
}
public void Start()
{
Reset();
if (!QueryPerformanceCounter(out m_StartCounter))
Debug.WriteLine("HighResolutionTimer.Start(): Error occurred while calling QueryPerformanceCounter.");
}
public double Stop()
{
if (!QueryPerformanceCounter(out m_StopCounter))
{
Debug.WriteLine("HighResolutionTimer.Stop(): Error occurred while calling QueryPerformanceCounter.");
return Double.NaN;
}

return Duration;
}

// public overridden methods...
public override string ToString()
{
return CalcDuration().ToString("0.######") + " seconds";
}

// public properties...
public double Duration
{
get
{
return CalcDuration();
}
}
}
}
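
As an aside, on .NET 2.0 and later the built-in System.Diagnostics.Stopwatch class can play a similar role without any P/Invoke declarations; it uses a high-resolution performance counter when one is available (check Stopwatch.IsHighResolution). The sketch below is not what I used for the timings in this post. The class name StopwatchTimingSketch and the method TimeSeconds are made up for illustration, and the non-generic Action delegate assumes .NET 3.5 or later.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; not part of the RefTest code in this post.
public static class StopwatchTimingSketch
{
    // Runs 'code' once and returns the elapsed wall-clock time in seconds.
    public static double TimeSeconds(Action code)
    {
        if (code == null)
            throw new ArgumentNullException("code");

        Stopwatch watch = Stopwatch.StartNew();
        code();
        watch.Stop();

        return watch.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        double seconds = TimeSeconds(delegate
        {
            long sum = 0;
            for (int i = 0; i < 1000000; i++)
                sum += i;
        });

        Console.WriteLine("Elapsed: {0} seconds", seconds);
        Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
    }
}
```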

Second, here is a helper CodeTimer class that I use for timing code:

using System;

namespace RefTest
{
public static class CodeTimer
{
private static double Average(double[] values)
{
if (values == null)
throw new ArgumentNullException("values");

int valueCount = values.Length;

if (valueCount == 0)
return 0.0d;

double sum = 0.0d;
for (int i = 0; i < valueCount; i++)
sum += values[i];

return sum / valueCount;
}

public delegate void TimingCode();

public static double Execute(TimingCode code)
{
if (code == null)
throw new ArgumentNullException("code");

const int NUM_SAMPLES = 100;

double[] timings = new double[NUM_SAMPLES];
HighResolutionTimer timer = new HighResolutionTimer();
for (int i = 0; i < NUM_SAMPLES; i++)
{
timer.Reset();

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

timer.Start();
code();
timer.Stop();

timings[i] = timer.Duration;
}

return Average(timings);
}
}
}

And finally, here's the Program class for my test console application:

using System;
using System.Runtime.CompilerServices;

namespace RefTest
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("TestWithoutRef: {0:###,###,##0.000000}", CodeTimer.Execute(TestWithoutRefLoop));
Console.WriteLine("TestWithRef: {0:###,###,##0.000000}", CodeTimer.Execute(TestWithRefLoop));

Console.ReadLine();
}

static void TestWithRefLoop()
{
int result;
for (int i = 0; i < 50000000; i++)
result = TestWithRef(ref i);
}
static void TestWithoutRefLoop()
{
int result;
for (int i = 0; i < 50000000; i++)
result = TestWithoutRef(i);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static int TestWithRef(ref int k)
{
return k;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int TestWithoutRef(int k)
{
return k;
}
}
}

In VS 2005, create a new console application and add those files to get a more optimal test. Here are the timings that I get:

TestWithoutRef: 0.192421 seconds
TestWithRef: 0.194921

So, according to my results, passing a parameter by reference 50,000,000 times adds approximately 2.5 milliseconds in total (0.194921 - 0.192421 = 0.0025 seconds). Yee-ha! This is not something to worry about. :-)
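
To put that in per-call terms, the difference between the two timings above can be divided by the number of calls. The sketch below does that arithmetic; the class name PerCallOverhead and the method NanosecondsPerCall are made up for illustration. With the numbers from my run it works out to roughly 0.05 nanoseconds per call, which is well below what a timer like this can measure for a single call, so treat it as an average and not as a precise figure.

```csharp
using System;

// Hypothetical helper for the arithmetic; not part of the RefTest code.
public static class PerCallOverhead
{
    // Converts the difference between two total timings (in seconds)
    // into an average extra cost per call, in nanoseconds.
    public static double NanosecondsPerCall(
        double withRefSeconds, double withoutRefSeconds, int calls)
    {
        return (withRefSeconds - withoutRefSeconds) / calls * 1e9;
    }

    public static void Main()
    {
        // Timings and call count taken from the results reported above.
        double ns = NanosecondsPerCall(0.194921, 0.192421, 50000000);
        Console.WriteLine("Average extra cost per call: {0} ns", ns);
    }
}
```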

----

If you're interested in seeing what is going on under the covers, let's take a look at the IL that is generated:

static void TestWithoutRefLoop()
{
int result;
for (int i = 0; i < 50000000; i++)
result = TestWithoutRef(i);
}

.method private hidebysig static void TestWithoutRefLoop() cil managed
{
.maxstack 2
.locals init (
[0] int32 i)
L_0000: ldc.i4.0
L_0001: stloc.0
L_0002: br.s L_000f
L_0004: ldloc.0
L_0005: call int32 RefTest.Program::TestWithoutRef(int32)
L_000a: pop
L_000b: ldloc.0
L_000c: ldc.i4.1
L_000d: add
L_000e: stloc.0
L_000f: ldloc.0
L_0010: ldc.i4 50000000
L_0015: blt.s L_0004
L_0017: ret
}

static void TestWithRefLoop()
{
int result;
for (int i = 0; i < 50000000; i++)
result = TestWithRef(ref i);
}

.method private hidebysig static void TestWithRefLoop() cil managed
{
.maxstack 2
.locals init (
[0] int32 i)
L_0000: ldc.i4.0
L_0001: stloc.0
L_0002: br.s L_0010
L_0004: ldloca.s i
L_0006: call int32 RefTest.Program::TestWithRef(int32&)
L_000b: pop
L_000c: ldloc.0
L_000d: ldc.i4.1
L_000e: add
L_000f: stloc.0
L_0010: ldloc.0
L_0011: ldc.i4 50000000
L_0016: blt.s L_0004
L_0018: ret
}

These methods differ by only one byte in length, and the reason is found at L_0004. In TestWithoutRefLoop, the "ldloc.0" instruction is used. This simply loads the local variable at index 0 ('i') onto the stack. Because we're passing by value, that's all that's needed to make the call to TestWithoutRef(int32). However, in TestWithRefLoop, the "ldloca.s i" instruction is used. This is one byte larger because there is a byte for the instruction and a byte to indicate the index of the local to use. And, instead of loading the specified local variable onto the stack, it loads the *address* of said local variable in order to set up the TestWithRef(int32&) method call.

On my machine, when I look at the optimized JITted code for these methods, I see the following x86:

TestWithoutRefLoop:

00000000 push esi
00000001 xor esi,esi
00000003 mov ecx,esi
00000005 call dword ptr ds:[00913070h]
0000000b inc esi
0000000c cmp esi,2FAF080h
00000012 jl 00000003
00000014 pop esi
00000015 ret

TestWithRefLoop:

00000000 push eax
00000001 xor eax,eax
00000003 mov dword ptr [esp],eax
00000006 xor edx,edx
00000008 mov dword ptr [esp],edx
0000000b cmp dword ptr [esp],2FAF080h
00000012 jge 00000029
00000014 lea ecx,[esp]
00000017 call dword ptr ds:[0091306Ch]
0000001d inc dword ptr [esp]
00000020 cmp dword ptr [esp],2FAF080h
00000027 jl 00000014
00000029 pop ecx
0000002a ret

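Obviously, a lot more work is necessary at the x86 level when the address of the local is needed. In the by-value version, 'i' lives in a register (esi) for the whole loop. In the by-reference version, 'i' has its address taken, so it is kept in memory at [esp]: the increment and the comparison both go through memory, and "lea ecx,[esp]" computes the address that is passed to TestWithRef.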

Now, let's look at the methods that get called.

static int TestWithoutRef(int k)
{
return k;
}

.method private hidebysig static int32 TestWithoutRef(int32 k) cil managed noinlining
{
.maxstack 8
L_0000: ldarg.0
L_0001: ret
}

static int TestWithRef(ref int k)
{
return k;
}

.method private hidebysig static int32 TestWithRef(int32& k) cil managed noinlining
{
.maxstack 8
L_0000: ldarg.0
L_0001: ldind.i4
L_0002: ret
}


In this case, TestWithRef has one additional instruction: "ldind.i4". This instruction takes the managed pointer on the top of the evaluation stack and loads the int32 value indirectly from it (hence "ldind"). IOW, this is the pointer dereference that needs to happen before the value can be used (in this case, returned).

For completeness, here's the x86 of the optimized JITted code:

TestWithoutRef

00000000 mov eax,ecx
00000002 ret

TestWithRef

00000000 mov eax,dword ptr [ecx]
00000002 ret

Obviously, there is a lot less going on here than at the call site. The only difference is the pointer dereference. So, most of the overhead that we observed takes place at the call site. But, IMO, it is negligible. There's nothing to get worked up about. Take a deep breath. If you need to be concerned about performance at this low a level, you probably shouldn't be working in a garbage-collected environment. :-)

Best Regards,
Dustin Campbell
Developer Express Inc.


