Re: Fast string operations



Nicholas,

Looping: I thought looping over arrays in managed code was "slow"
(relatively speaking) because of all the bounds checking and whatnot. This
is why people use unsafe code now and then to use pointer arithmetic to loop
over arrays without all the unnecessary bounds checking.

I'm aware that looping has to occur one way or the other, but with Trim(),
the looping is happening in unmanaged code (COMString::TrimHelper to be
exact) and is much faster since it isn't required to do all the bloated .NET
array handling and such.

The problem with TrimHelper is that it always returns a new string instance.
There's no way to say "Trim and tell me what the length is when you're done,
don't return the trimmed string".

There are no performance issues that I'm aware of (yet). The concern is
rapid memory growth. The customer perceives this as a memory leak. I'm 99%
sure that it's just the GC being lazy/stand-offish until there's something
to worry about (we approach the 2GB limit of a process), but I wanted to
double-check before I unknowingly fed a line of B.S. to the customer.

The customer will eventually learn to understand this and accept it, but I
wanted to make sure that I understood it completely.

We have done extensive performance counting, so we're well aware of the
memory picture. I have followed numerous profiling guides and established
that the majority of allocations are of System.String's and that the
majority of the process's memory is being taken up with the Gen2 heap. This
is what concerns me. My understanding is that you have to survive several
successive garbage collections in order to make it to the Gen2 heap. How are
temporary strings making it to Gen2?

That's the only reason for the 1% of doubt I have left.

Thanks again,
Chad

P.S.- the previous COM stuff was written in VB6 and was STA. We had
customers running on our legacy stuff (it was written before I got here,
don't blame me! ;) ) with many concurrent users. Some were starting to run
into the STA limitations, but, for the most part, it was running
surprisingly (and I mean surprisingly!) well. Eventually some hit a wall.

When our .NET rewrite/rearchitecture was finished, we wrote a COM
facade/compatibility layer so that existing COM- or ASP/SCRIPT-based code
could run against the new stuff.

We saw several orders of magnitude difference in performance, even with the
COM interop overhead, not to mention the now highly scalable multi-threaded
interface. Score +1 for .NET, again :)


"Nicholas Paldino [.NET/C# MVP]" <mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in
message news:efo6bwuZFHA.3040@xxxxxxxxxxxxxxxxxxxxxxx
> Chad,
>
> Looping over characters like that can't slow things down that much. No
> matter what you do, you will have to perform some sort of loop operation
> to check the string. There is no other way to do it.
>
> Also, the char[] that is enumerated through is not created for every
> iteration through the string. Rather, the string implements IEnumerable,
> and then the IEnumerator implementation returned will return a new char
> for each iteration.
>
> I don't think that using unsafe code is going to make things any
> better, only because it's going to do the same thing you are going to do,
> maybe with one or two operations eliminated in between (and I mean IL
> operations, not function calls).
>
> When you call Trim, a loop is going to start from the beginning of the
> string, counting the whitespace characters that are at the beginning.
> Then it is going to perform another loop to scan the end of the stirng for
> whitespace characters. Once that is done, it will get the substring,
> which will have to loop through the characters to copy them into a new
> string (on some level or another, a loop is going to execute).
>
> Also, the original issue was the amount of memory that is being
> consumed (which in reality, it is not, but it is a customer education
> issue). If the performance of the application is suffering, it is not
> because of these operations. I would look elsewhere. The fact that you
> are using COM interop means that for every call you make across that
> boundary, you are adding something on the order of 40 extra operations.
> Depending on how chunky your calls are, this could be a factor.
>
> In the end, the GC is going to take up as much memory as possible, and
> give it up only when the OS tells it (from a high level view). That's
> part of what you sign up for when you use .NET. I'd work on educating
> your customers to NOT look at task manager in order to determine whether
> or not there is a memory leak. Rather, they should look at the
> performance counters (many of which exist for .NET) which give a MUCH more
> clear performance picture.
>
> --
> - Nicholas Paldino [.NET/C# MVP]
> - mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> "Chad Myers" <cmyers@xxxxxxxxxxxxxxxxxxxxx> wrote in message
> news:P3pne.13797$PR6.10264@xxxxxxxxxxxxxxxxxxxxxxx
>> Nicholas,
>>
>> Thanks for the quick reply. Unfortunately I'm not using .NET 2.0 (yet!),
>> so I can't use Generics.
>>
>> Would looping over chars like that slow things down significantly? Also,
>> is the char[] for each string cached with the string, or is a new one
>> created when you call things like ToCharArray() or foreach() on the
>> string (not every loop iteration, but on the first iteration)? Wouldn't I
>> just be replacing a new string instance with a new char[] and not get any
>> net gain over just calling .Trim()?
>>
>> In your opinion, if I weren't against unsafe code, could I make this
>> significantly faster, or would it not afford me much difference?
>>
>> As far as performance, on some of our clients' instances, memory growth
>> is rapid. It seems the more memory they have, the faster it grows which
>> leads me to believe that the GC is being lax since it has so much free
>> memory and doesn't see the need to aggressively collect memory. But it
>> bothers our clients and they perceive this to be a memory leak.
>>
>> I realize it's an education issue, but I want to make sure that I'm
>> educating them correctly, as opposed to just making up a B.S. excuse and
>> Jedi hand-waving about the GC stuff.
>>
>> Also, it's not an ASP.NET application, it's an ASP app that used to call
>> into VB6 COM objects. We've replaced the VB6 objects with .NET objects
>> exposing a "compatibility layer" that has a ComVisible API that is
>> identical (though not binary compatible) with the old VB6 stuff.
>> Late-bound clients don't know the difference other than a different
>> ProgID for the COM objects.
>>
>> So we're dealing with the wkst GC, as far as I know (since only ASP.NET
>> uses svr unless you host the CLR yourself, from what I understand). I'm
>> not sure how I'd even do that in an ASP/COM-interop situation, but,
>> assuming it's possible, would writing our own CLR host to use the svr GC
>> help matters at all?
>>
>> Most of our clients' servers are dual-or-more processor boxes.
>>
>> Thanks again,
>> Chad Myers
>>
>> "Nicholas Paldino [.NET/C# MVP]" <mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote
>> in message news:Ozam7ZuZFHA.580@xxxxxxxxxxxxxxxxxxxxxxx
>>> Chad,
>>>
>>> For the first scenario, your solution should give you an increase.
>>>
>>> For the second scenario, you should use reflection once to get a
>>> reference to the internal static character array WhitespaceChars on the
>>> string class. Then, you can write a method which will cycle through a
>>> string passed to it, like so:
>>>
>>> public static bool TrimIsNullOrEmpty(string value)
>>> {
>>> // If null, then get out.
>>> if (value == null)
>>> {
>>> // Return true.
>>> return true;
>>> }
>>>
>>> // Cycle through the characters in the string. If the character is
>>> not found
>>> // in the whitespace array, return false, otherwise, when done,
>>> return true.
>>> foreach (char c in value)
>>> {
>>> // If the character is not found in the WhitespaceArray, then
>>> return
>>> // false.
>>> if (Array.IndexOf<char>(WhitespaceArray, char) == -1)
>>> {
>>> // Return false.
>>> return false;
>>> }
>>> }
>>>
>>> // Return true, the string is full of whitespace.
>>> return true;
>>> }
>>>
>>> I used the generic version of the IndexOf method on the Array class
>>> in order to eliminate boxing. Also, if you really want to squeeze out
>>> every last bit of performance from this, you can take the
>>> WhitespaceArray and use the characters as keys in a dictionary. The
>>> number of whitespace characters is 25 (right now, that is). However, if
>>> your strings typically are padded with spaces, then you could get a big
>>> speed boost by copying the array initially, and then placing the space
>>> character as the first element in the array (which would cause most of
>>> the calls to IndexOf to return very quickly, probably quicker than a
>>> lookup in a dictionary).
>>>
>>> I am curious though, are you seeing a performance issue, or do you
>>> just see the numbers and are worried about them? ASP.NET applications
>>> tend to get in a nice groove with the GC over time.
>>>
>>> Hope this helps.
>>>
>>> --
>>> - Nicholas Paldino [.NET/C# MVP]
>>> - mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>>>
>>> "Chad Myers" <cmyers@xxxxxxxxxxxxxxxxxxxxx> wrote in message
>>> news:2rone.13780$PR6.9706@xxxxxxxxxxxxxxxxxxxxxxx
>>>> I've been perf testing an application of mine and I've noticed that
>>>> there are a lot (and I mean A LOT -- megabytes and megabytes of 'em)
>>>> System.String instances being created.
>>>>
>>>> I've done some analysis and I'm led to believe (but can't yet
>>>> quantitatively establish as fact) that the two basic culprits are a lot
>>>> of calls to:
>>>>
>>>> 1.) if( someString.ToLower() == "somestring" )
>>>>
>>>> and
>>>>
>>>> 2.) if( someString != null && someString.Trim().Length > 0 )
>>>>
>>>>
>>>> ToLower() generates a new string instance as does Trim().
>>>>
>>>> I believe that these are getting called many times and churning up a
>>>> bunch of strings faster than the GC can collect them, or perhaps
>>>> there's some weird interning/caching thing going on. Regardless, the
>>>> number of string instances grows and grows. It gets bumped down
>>>> occasionally, but it's basically 5 steps forward, 1 back.
>>>>
>>>> For reference, this is an ASP application calling into .NET ComVisible
>>>> objects. So I assume this uses the workstation GC, right?
>>>>
>>>>
>>>> Anyhow, so I think that I can solve problem (1) with String.Compare()
>>>> which can perform in-place case-insensitive comparisons without
>>>> generating new string instances.
>>>>
>>>> Problem (2), however, is more complicated. There doesn't appear to be a
>>>> TrimmedLength or any type of method or property that can give me the
>>>> length of a string, minus whitespace and without generating a new
>>>> string instance, in the BCL.
>>>>
>>>> I suppose I could do some unsafe, or even unmanaged code (which is what
>>>> MSFT did for all their string handling stuff inside System.String and
>>>> using the COMString stuff), but I'd like to try to avoid that, or at
>>>> least use a library that's already written and well tested.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks in advance,
>>>> Chad Myers
>>>>
>>>>
>>>
>>>
>>
>>
>
>


.



Relevant Pages

  • Re: Fast string operations
    ... worried about string performance. ... > is why people use unsafe code now and then to use pointer arithmetic to loop ... > I'm aware that looping has to occur one way or the other, but with Trim, ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Substring
    ... A common scenario is in fact taking an initial string and chopping it into smaller pieces. ... It's more efficient to do this using a single common source array and just maintaining indices into the array, both in terms of memory usage and in terms of speed. ... Worrying about the shared array is a premature optimization at best, and ignores an important "common case" optimization at worst. ...
    (comp.lang.java.programmer)
  • Re: Performance Question
    ... Split is more efficient than actually looping through the string. ... second time (parsing it yet again) to assign the data to the array. ...
    (microsoft.public.vb.general.discussion)
  • Re: fscanf issues, please help.
    ... those pointers point to strings of whatever length. ... An array of pointers is an array of pointer, ... they pont to some random places in memory. ... be an integer, the next one a string, followed by a hash etc., ...
    (comp.lang.c)
  • Re: Concat some string not ended...
    ... > i try to concat some string not ended ... What you're dealing here with are simple char arrays ... 'size' isn't compile-time constant and thus the length of the array ... a char pointer and than allocate enough memory. ...
    (comp.unix.programmer)