Re: Regular Expression taking excessive CPU



Brian Kitt <BrianKitt@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> I have a process where I do some minimal reformating on a TAB delimited
> document to prepare for DTS load. This process has been running fine, but I
> recently made a change. I have a Full Text index on one column, and
> punctuation in the column was causing some problems down the line. This
> column is used only for full text indexing, and otherwise ignored. I decided
> to use the following regular expression to remove all punctuation (actually
> anything but alphanumeric). This is the only change I made to my code, and
> it is now taking more than 4 times as long as it used to. Why is this
> regular expression adding so much time to the process, and is there a better
> way to strip out all non-alphanumeric data?
>
> ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);
>
> ftIndex is a string variable that typically won't exceed 100 characters. It
> is nothing more than keywords associated with the data that I am loading.

Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compile, which can help
performance

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;

// How many iterations to run
const int Iterations = 1000;

// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();

static void Main()
{
string[] strings = GenerateTestData();

int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}

static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}

static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}

static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];

for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;

if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret[i] = builder.ToString();
}
return ret;
}
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds


Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.

--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
.



Relevant Pages

  • Re: RegExp irregularity in JScript
    ... we believe the VBScript Regular Expression class (version 1.0 through ... It does not however, limit the string minimum 4, maximum 8 characters. ... Obviously the first test should test the length of the string, minimum 4, ...
    (microsoft.public.scripting.jscript)
  • Re: Usename regex
    ... Think of a string, ... Regular expression benchmark ... MS MAX AVG MIN DEV INPUT ... If the textbox in question is limited to say 16 characters you'd ...
    (microsoft.public.dotnet.framework.aspnet)
  • Re: Fast search for all positions in a string
    ... It addition to running timing tests in different browsers and on ... direct string comparison (which is unintuitive given the relative ... The otherwise often problematic characteristic of Regular expression ... turns all characters that are significant in regular expressions ...
    (comp.lang.javascript)
  • Re: Extracting Strings
    ... regular expression and php function that does it. ... I want to extract the data in the following string: ... The characters in the square brackets are the characters to match ...
    (alt.php)
  • Re: Regular Expression taking excessive CPU
    ... I agree about the regular expression, ... >> ftIndex is a string variable that typically won't exceed 100 characters. ... > static string RemoveNonAlpha1 ...
    (microsoft.public.dotnet.languages.csharp)