Re: Replace strings in a text file and get the number ofreplacementsmade




There is a difference between matching and replacing.

Someone can say "ohoh" is matched twice in "ohohoh", once starting at position 0 and once starting at position 2,

but if you speak to replace (consume) it, you have only one possible 'action'.

I haven't tried, but I assume Regex would find 2 matches, while replace will replace just once the pattern.

And again, (InitialStrring.Length-InitialString.Replace(pattern, String.Empty).Length) / pattern.Length is 'safe', as far as I know, for all cases, and use no external loop, to count the number of replacements where will be of pattern into InitialString (by whatever newPattern, which is irrelevant).



Vanderghast, Access MVP


"Jesse Houwing" <jesse.houwing@xxxxxxxxxxxxxxxx> wrote in message news:e5317a7e78012e2d8cbcde395c74afa@xxxxxxxxxxxxxxxxxxxxx
Hello Peter,

Agreed on the readability part, but using regex.replace opens up a new can of worms, which people aren't usually prepared for. Say this search/replace action can be entered from the UI, then adding . or * or { into your search pattern can lead to unexpected behaviour, or worse a regex parse error. The regex will also be expensive, because it will have to be parsed/compiled every time a new pattern is used (and if it is a user defined replacement, that would be more often than not).

So this would have to be extended with a Regex.Escape call first. The same applies for the replacement pattern. Say I want to search $2 and replace it with $0.1 you'd get funny things... ($2.1 actually)... So it isn't just using a different call to get the same results.

That said, I'd opt for an extention method on string and write an efficient version (could use mine as an example) of a Replace method that returns the number of matches. And from that moment on, use that. Just as readable (or even better) and no crazy unexpected regex problems due to not exactly understanding what is involved.

Jesse

On Tue, 07 Jul 2009 14:45:02 -0700, Jesse Houwing
<jesse.houwing@xxxxxxxxxxxxxxxx> wrote:

Looping through the string in a stringbuilder is probably the safest
way to do this:

[...]
int count = 0;
for (int i = 0; i < sb.Length; )
{
if (AreEqual(sb, search, i))
{
sb.Remove(i, search.Length);
sb.Insert(i, replacement);
i += replacement.Length;
count++;
}
else
{
i++;
}
}
[...]
It might be faster to use sb.Replace(search, replacement, i,
search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't
differ that much.

If you simply call StringBuilder.Replace(string, string, int, int)
instead of having your own AreEqual() method followed by a call to
Remove() and Insert(), the performance should be practically
identical, but you wouldn't get any information about how many
replacements occurred.

Alternatively, if you still call AreEqual() and then call
StringBuilder.Replace(string, string, int, int), you're duplicating
effort (which costs performance), because
StringBuilder.Replace(string, string, int, int) has to actually do
the string comparison again. That would actually be _slower_ than
your original code.

You could do a little hack by searching for the first character that
differs between the search and replacement strings (as an
initialization, not as part of the loop), and then bumping a counter
after each call to StringBuilder.Replace() based on whether the
character at the same offset within the current StringBuilder has
changed. That would be only slightly slower than just calling
StringBuilder.Replace(string, string, int, int), but would include
the count.

That said, I would hope that any Replace() method in .NET, including
Regex.Replace(), String.Replace(), or StringBuilder.Replace() would be
faster than the code you posted. The main reason being that all of
those methods have the opportunity to optimize the construction of
the new string, whereas your example doesn't optimize at all.

At the very least, I would not use the Remove()/Insert() pattern
you've shown. Instead, I would use a String as input, and a
StringBuilder as output, appending text segments to the output
StringBuilder as I scan the input String. That way the code avoids
having to repeatedly shift your character buffer in the StringBuilder
(which happens _twice_ for each replacement in your code). That's
exactly the kind of optimization I'd expect to find inside the .NET
classes.

It might even be worthwhile to defer creation of the output
StringBuilder until you detect the first match that needs to be
replaced, if there's an expectation that for a significant frequency
of input, no replacements would be needed.

It is probably a lot faster than using a regex, though I haven't done
any measurements.

I would expect Regex to be on par with other explicit mechanisms like
that, especially given the need to count the replacements (which for
non-Regex solutions requires replacing the search text twice). If
performance is an issue, then a "scan and build" approach as I suggest
above is probably slightly faster than using built-in Replace()
methods simply because you can incorporate a count into the
replacement logic.

All that said, if performance is an issue (and there's nothing in the
OP to suggest it is), the only way to know for sure what the best
solution is would be to try the different alternatives and measure
them. Even theoretical advantages and disadvantages may be
irrelevant for a typical data set, and intuition is a terrible way to
measure performance. :)

For best performance, it may be that none of the suggestions offered
so far are probably suitable. There's an optimized text search
algorithm, the name of which I can't recall at the moment, that can
probably be adapted, but if not then a degenerate state-graph
implementation (since there's only one string to search for) would
probably work too. Either approach would avoid having to keep
performing full string comparisons at each character index in the
original string (consider an original string
"aaaaaaaaaaaaaaaaaaaaaaaaaa" where you want to replace all occurrences
of "aaaaaab" with something :) ).

But even there, as I said, there's no way to know for sure without
measuring. Performance of the various choices is to some extent going
to be data dependent; liabilities that exist in the general case
might not really be that much of a problem. For example, if dealing
with essentially random data, it's not too terrible to just keep
comparing over and over at an incremented index, because those
comparisons will normally terminate quickly when there's no match.

In other words, even the theoretically worst-case implementation might
not turn out to be much different than the more optimized ones.

Until there's a performance problem shown, the OP should stick with
whatever solution _reads_ the best, and is the most maintainable. And
if there is a performance problem shown, measuring each viable
alternative is the only way to know for sure which will be fastest.

Pete

--
Jesse Houwing
jesse.houwing at sogeti.nl



.



Relevant Pages

  • Re: [PHP] ereg_replace with user defined function?
    ... With a preg_replace_callback I am able to look for a pattern like: a number followed by whitespace followed by one, two, three or more characters, followed by a closing character. ... In that function I evaluate the unit, see if it is in my array containing the conversion "table". ... The metric / imperial calculations / replacements take between 0.00054 and 0.005 seconds per table cell / text string. ... replacement with preg_replace_callback compared to the function I wrote? ...
    (php.general)
  • multiple pattern replacement using regular expressions
    ... I needed to write an efficient pattern replacement class which could use ... Replaces multiple patterns within given string based on regular ... @param pattern a simple string or a regular expression ...
    (comp.lang.java.programmer)
  • Re: Replace strings in a text file and get the number ofreplacementsmade
    ... The regex will also be expensive, because it will have to be parsed/compiled every time a new pattern is used (and if it is a user defined replacement, that would be more often than not). ... I'd opt for an extention method on string and write an efficient version of a Replace method that returns the number of matches. ... If you simply call StringBuilder.Replace(string, string, int, int) ... character at the same offset within the current StringBuilder has ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: sentence
    ... function to replace all instances of a word in a string ... int init_kmn(int* next, int snext, char* p) ...
    (comp.lang.c)
  • Re: Replace strings in a text file and get the number of replacementsmade
    ... It might be faster to use sb.Replace(search, replacement, i, search.Length) ... If you simply call StringBuilder.Replace(string, string, int, int) instead of having your own AreEqualmethod followed by a call to Removeand Insert, the performance should be practically identical, but you wouldn't get any information about how many replacements occurred. ... You could do a little hack by searching for the first character that differs between the search and replacement strings, and then bumping a counter after each call to StringBuilder.Replacebased on whether the character at the same offset within the current StringBuilder has changed. ...
    (microsoft.public.dotnet.languages.csharp)

Loading