Re: Fast string operations
- From: "KH" <KH@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 1 Jun 2005 15:01:03 -0700
Array looping: The CLR has an optimization to remove bounds checking under
certain conditions, mainly your basic for loop using Array.Length as the
condition:
for (int i=0; i < myarray.Length; ++i)
{
// presumably playing with the indexer or re-assigning the array variable
// here would disable the optimization, but I don't really know.
}
Anyways I don't know what your app is but it must be mighty big to be so
worried about string performance. It's usually over-use of strings that
causes problems, like building strings by conditinally concatenating them,
stuff like that that people don't realize causes a new instance of string to
be created with EACH operation:
string str1 = " ABC";
string str2 = "DEF ";
string str3 = (str1 + str2).ToLower().Trim(); // 3 Strings created here
If that's the real issue you might look into the StringBuilder class, which
is mutable unlike String.
- KH
"Chad Myers" wrote:
> Nicholas,
>
> Looping: I thought looping over arrays in managed code was "slow"
> (relatively speaking) because of all the bounds checking and whatnot. This
> is why people use unsafe code now and then to use pointer arithmetic to loop
> over arrays without all the unnecessary bounds checking.
>
> I'm aware that looping has to occur one way or the other, but with Trim(),
> the looping is happening in unmanaged code (COMString::TrimHelper to be
> exact) and is much faster since it isn't required to do all the bloated .NET
> array handling and such.
>
> The problem with TrimHelper is that it always returns a new string instance.
> There's no way to say "Trim and tell me what the length is when you're done,
> don't return the trimmed string".
>
> There are no performance issues that I'm aware of (yet). The concern is
> rapid memory growth. The customer perceives this as a memory leak. I'm 99%
> sure that it's just the GC being lazy/stand-offish until there's something
> to worry about (we approach the 2GB limit of a process), but I wanted to
> double-check before I unknowingly fed a line of B.S. to the customer.
>
> The customer will eventually learn to understand this and accept it, but I
> wanted to make sure that I understood it completely.
>
> We have done extensive performance counting, so we're well aware of the
> memory picture. I have followed numerous profiling guides and established
> that the majority of allocations are of System.String's and that the
> majority of the process's memory is being taken up with the Gen2 heap. This
> is what concerns me. My understanding is that you have to survive several
> successive garbage collections in order to make it to the Gen2 heap. How are
> temporary strings making it to Gen2?
>
> That's the only reason for the 1% of doubt I have left.
>
> Thanks again,
> Chad
>
> P.S.- the previous COM stuff was written in VB6 and was STA. We had
> customers running on our legacy stuff (it was written before I got here,
> don't blame me! ;) ) with many concurrent users. Some were starting to run
> into the STA limitations, but, for the most part, it was running
> surprisingly (and I mean surprisingly!) well. Eventually some hit a wall.
>
> When our .NET rewrite/rearchitecture was finished, we wrote a COM
> facade/compatibility layer so that existing COM- or ASP/SCRIPT-based code
> could run against the new stuff.
>
> We saw several orders of magnitude difference in performance, even with the
> COM interop overhead, not to mention the now highly scalable multi-threaded
> interface. Score +1 for .NET, again :)
>
>
> "Nicholas Paldino [.NET/C# MVP]" <mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in
> message news:efo6bwuZFHA.3040@xxxxxxxxxxxxxxxxxxxxxxx
> > Chad,
> >
> > Looping over characters like that can't slow things down that much. No
> > matter what you do, you will have to perform some sort of loop operation
> > to check the string. There is no other way to do it.
> >
> > Also, the char[] that is enumerated through is not created for every
> > iteration through the string. Rather, the string implements IEnumerable,
> > and then the IEnumerator implementation returned will return a new char
> > for each iteration.
> >
> > I don't think that using unsafe code is going to make things any
> > better, only because it's going to do the same thing you are going to do,
> > maybe with one or two operations eliminated in between (and I mean IL
> > operations, not function calls).
> >
> > When you call Trim, a loop is going to start from the beginning of the
> > string, counting the whitespace characters that are at the beginning.
> > Then it is going to perform another loop to scan the end of the stirng for
> > whitespace characters. Once that is done, it will get the substring,
> > which will have to loop through the characters to copy them into a new
> > string (on some level or another, a loop is going to execute).
> >
> > Also, the original issue was the amount of memory that is being
> > consumed (which in reality, it is not, but it is a customer education
> > issue). If the performance of the application is suffering, it is not
> > because of these operations. I would look elsewhere. The fact that you
> > are using COM interop means that for every call you make across that
> > boundary, you are adding something on the order of 40 extra operations.
> > Depending on how chunky your calls are, this could be a factor.
> >
> > In the end, the GC is going to take up as much memory as possible, and
> > give it up only when the OS tells it (from a high level view). That's
> > part of what you sign up for when you use .NET. I'd work on educating
> > your customers to NOT look at task manager in order to determine whether
> > or not there is a memory leak. Rather, they should look at the
> > performance counters (many of which exist for .NET) which give a MUCH more
> > clear performance picture.
> >
> > --
> > - Nicholas Paldino [.NET/C# MVP]
> > - mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > "Chad Myers" <cmyers@xxxxxxxxxxxxxxxxxxxxx> wrote in message
> > news:P3pne.13797$PR6.10264@xxxxxxxxxxxxxxxxxxxxxxx
> >> Nicholas,
> >>
> >> Thanks for the quick reply. Unfortunately I'm not using .NET 2.0 (yet!),
> >> so I can't use Generics.
> >>
> >> Would looping over chars like that slow things down significantly? Also,
> >> is the char[] for each string cached with the string, or is a new one
> >> created when you call things like ToCharArray() or foreach() on the
> >> string (not every loop iteration, but on the first iteration)? Wouldn't I
> >> just be replacing a new string instance with a new char[] and not get any
> >> net gain over just calling .Trim()?
> >>
> >> In your opinion, if I weren't against unsafe code, could I make this
> >> significantly faster, or would it not afford me much difference?
> >>
> >> As far as performance, on some of our clients' instances, memory growth
> >> is rapid. It seems the more memory they have, the faster it grows which
> >> leads me to believe that the GC is being lax since it has so much free
> >> memory and doesn't see the need to aggressively collect memory. But it
> >> bothers our clients and they perceive this to be a memory leak.
> >>
> >> I realize it's an education issue, but I want to make sure that I'm
> >> educating them correctly, as opposed to just making up a B.S. excuse and
> >> Jedi hand-waving about the GC stuff.
> >>
> >> Also, it's not an ASP.NET application, it's an ASP app that used to call
> >> into VB6 COM objects. We've replaced the VB6 objects with .NET objects
> >> exposing a "compatibility layer" that has a ComVisible API that is
> >> identical (though not binary compatible) with the old VB6 stuff.
> >> Late-bound clients don't know the difference other than a different
> >> ProgID for the COM objects.
> >>
> >> So we're dealing with the wkst GC, as far as I know (since only ASP.NET
> >> uses svr unless you host the CLR yourself, from what I understand). I'm
> >> not sure how I'd even do that in an ASP/COM-interop situation, but,
> >> assuming it's possible, would writing our own CLR host to use the svr GC
> >> help matters at all?
> >>
> >> Most of our clients' servers are dual-or-more processor boxes.
> >>
> >> Thanks again,
> >> Chad Myers
> >>
> >> "Nicholas Paldino [.NET/C# MVP]" <mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote
> >> in message news:Ozam7ZuZFHA.580@xxxxxxxxxxxxxxxxxxxxxxx
> >>> Chad,
> >>>
> >>> For the first scenario, your solution should give you an increase.
> >>>
> >>> For the second scenario, you should use reflection once to get a
> >>> reference to the internal static character array WhitespaceChars on the
> >>> string class. Then, you can write a method which will cycle through a
> >>> string passed to it, like so:
> >>>
> >>> public static bool TrimIsNullOrEmpty(string value)
> >>> {
> >>> // If null, then get out.
> >>> if (value == null)
> >>> {
> >>> // Return true.
> >>> return true;
> >>> }
> >>>
> >>> // Cycle through the characters in the string. If the character is
> >>> not found
> >>> // in the whitespace array, return false, otherwise, when done,
> >>> return true.
> >>> foreach (char c in value)
> >>> {
> >>> // If the character is not found in the WhitespaceArray, then
> >>> return
> >>> // false.
> >>> if (Array.IndexOf<char>(WhitespaceArray, char) == -1)
> >>> {
> >>> // Return false.
> >>> return false;
> >>> }
> >>> }
> >>>
> >>> // Return true, the string is full of whitespace.
> >>> return true;
> >>> }
> >>>
> >>> I used the generic version of the IndexOf method on the Array class
> >>> in order to eliminate boxing. Also, if you really want to squeeze out
> >>> every last bit of performance from this, you can take the
> >>> WhitespaceArray and use the characters as keys in a dictionary. The
> >>> number of whitespace characters is 25 (right now, that is). However, if
> >>> your strings typically are padded with spaces, then you could get a big
> >>> speed boost by copying the array initially, and then placing the space
> >>> character as the first element in the array (which would cause most of
> >>> the calls to IndexOf to return very quickly, probably quicker than a
> >>> lookup in a dictionary).
> >>>
> >>> I am curious though, are you seeing a performance issue, or do you
> >>> just see the numbers and are worried about them? ASP.NET applications
> >>> tend to get in a nice groove with the GC over time.
> >>>
> >>> Hope this helps.
> >>>
> >>> --
> >>> - Nicholas Paldino [.NET/C# MVP]
> >>> - mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> >>>
> >>> "Chad Myers" <cmyers@xxxxxxxxxxxxxxxxxxxxx> wrote in message
> >>> news:2rone.13780$PR6.9706@xxxxxxxxxxxxxxxxxxxxxxx
> >>>> I've been perf testing an application of mine and I've noticed that
> >>>> there are a lot (and I mean A LOT -- megabytes and megabytes of 'em)
> >>>> System.String instances being created.
> >>>>
> >>>> I've done some analysis and I'm led to believe (but can't yet
> >>>> quantitatively establish as fact) that the two basic culprits are a lot
> >>>> of calls to:
> >>>>
> >>>> 1.) if( someString.ToLower() == "somestring" )
> >>>>
> >>>> and
> >>>>
> >>>> 2.) if( someString != null && someString.Trim().Length > 0 )
> >>>>
> >>>>
> >>>> ToLower() generates a new string instance as does Trim().
> >>>>
> >>>> I believe that these are getting called many times and churning up a
> >>>> bunch of strings faster than the GC can collect them, or perhaps
> >>>> there's some weird interning/caching thing going on. Regardless, the
> >>>> number of string instances grows and grows. It gets bumped down
> >>>> occasionally, but it's basically 5 steps forward, 1 back.
> >>>>
> >>>> For reference, this is an ASP application calling into .NET ComVisible
> >>>> objects. So I assume this uses the workstation GC, right?
> >>>>
> >>>>
> >>>> Anyhow, so I think that I can solve problem (1) with String.Compare()
> >>>> which can perform in-place case-insensitive comparisons without
> >>>> generating new string instances.
> >>>>
> >>>> Problem (2), however, is more complicated. There doesn't appear to be a
> >>>> TrimmedLength or any type of method or property that can give me the
> >>>> length of a string, minus whitespace and without generating a new
> >>>> string instance, in the BCL.
> >>>>
> >>>> I suppose I could do some unsafe, or even unmanaged code (which is what
> >>>> MSFT did for all their string handling stuff inside System.String and
> >>>> using the COMString stuff), but I'd like to try to avoid that, or at
> >>>> least use a library that's already written and well tested.
> >>>>
> >>>> Any thoughts?
> >>>>
> >>>> Thanks in advance,
> >>>> Chad Myers
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>
>
.
- Follow-Ups:
- Re: Fast string operations
- From: Chad Myers
- Re: Fast string operations
- References:
- Fast string operations
- From: Chad Myers
- Re: Fast string operations
- From: Nicholas Paldino [.NET/C# MVP]
- Re: Fast string operations
- From: Chad Myers
- Re: Fast string operations
- From: Nicholas Paldino [.NET/C# MVP]
- Re: Fast string operations
- From: Chad Myers
- Fast string operations
- Prev by Date: Re: absolute disk read write
- Next by Date: EnumChildWindows and EM_GETSELTEXT
- Previous by thread: Re: Fast string operations
- Next by thread: Re: Fast string operations
- Index(es):
Relevant Pages
|