Re: Usename regex



Hello Alexey,

On Aug 1, 11:42 pm, Jesse Houwing <Jesse.houw...@xxxxxxxxxxxxxxxx>
wrote:

The problem is in the fact that it allows for excessive backtracking.
Think of a string, preferably very long that contains only
alphanumeric characters, but end in a # sign. This regex will try
every combination of the first and the second part until all options
are exhausted. This can take quite a while.

Jesse,

I understand now why I forget about the (?) in my first post. I
misread your code. I thought you use the (?) in the third part of your
expression to find only one (_). I'm sorry I just can't believe I was
so inattentive.

Regarding backtracking. I understand now the difference between the
two expressions, and I believe what you are saying, but I did a small
quick test of how our expressions work for short and very long strings
in reality and it looks like there is no big difference.

Here's the result of the test

----------------------------------------------------------------------
------
Regular expression benchmark
----------------------------
Regular expressions : 2
Test strings : 3
Iterations : 10000
Total regex calls : (10000 * 3 * 2) = 60000
RE: ^([a-zA-Z0-9]+_?|_[a-zA-Z0-9]+|[a-zA-Z0-9]+_[a-zA-Z0-9]+)$
MS MAX AVG MIN DEV INPUT
61 16,759 0,0061 0,0031 0,1768 'aaaaa#'
101 16,778 0,0101 0,0067 0,1771 'some_realusername'
178 16,786 0,0178 0,0137 0,178
'verylongusernamestringwithnumbersandlet
tersandwithoutunderscoresymbolbutveryveryverylong1234verylong'
RE: ^([a-zA-Z0-9]+_?[a-zA-Z0-9]+)$
MS MAX AVG MIN DEV INPUT
51 3,251 0,0051 0,0039 0,038 'aaaaa#'
82 3,255 0,0082 0,007 0,038 'some_realusername'
161 3,263 0,0161 0,0142 0,047
'verylongusernamestringwithnumbersandlet
tersandwithoutunderscoresymbolbutveryveryverylong1234verylong'
----------------------------------------------------------------------
------

As you can see sometimes my expression is quicker :-)

The code of the benchmark tool is here
http://www.codinghorror.com/files/code/regexbenchmark.zip
I compiled it in .NET2 and ran without debugger. If I run the
benchmark 4-5 times very often I see that my expression has the best
reported score, even for a long strings. So, I would say that

^([a-zA-Z0-9]+_?[a-zA-Z0-9]+)$

is better, because it's shorter




Yours is probably faster I believe that, less alternatives for correct strings. :) but have you tried a very long incorrect input? Something like (the longer the better)

aaaaaaaaaaaavvvvvbbbbbbbbbbbbbbbbbbbbbbbeeeeeeeeeeeeeeeeeeeAAAAAAAAAAAAAAA666666666666666666666666666666666666666666666&

If the textbox in question is limited to say 16 characters you'd probably never trigger the problem, but a hacker could just bypass the length of the textbox and supply a very very long string and if he'd send such strings in quick succession he'd essentially cause a denial of service.

There are some optimizations that are possible for thsi regex if you'd want to try them. The enige will try the first option first, if that fails it will try the second. Keeping that in mind, the order of expressions is of importance. Say that you usually have no '_' in usernames, but if they're there you mostly have them somewhere in between. Then you'd have to change the order to improve performance.
^([a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?|[a-zA-Z0-9]+_|_[a-zA-Z0-9]+)$

Also my expression allows an underscore both at the start, the end and in between. Yours only covers the last option. I would probably rewrite my expression to this to make it do the same as your expression. It is almost as short, but does not allow excessive backtracking.
^[a-zA-Z0-9]+(_[a-zA-Z0-9]+)?$

And then there is the option to handle upper and lower case characters in the character groups, or in a regexoption. I'm actually not sure which is faster, but it's worth a try... Wrapping the whole expression with (?i: ... ) should do the trick. all [a-zA-Z] can then be replaced with just [a-z]
^(?i:....)$

And finally to suppress grouping & capturing we can improve performance by replacing the normal ( ... ) with (?: ... ).

To optimize your expression further you could completely remove the ( ... ) they have no use.

So a fair showdown of expressions would be:
Yours: ^[a-zA-Z0-9]+_?[a-zA-Z0-9]+$
Mine: ^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$

Or with case insensitivity:
Yours: ^(?i:[a-z0-9]+_?[a-z0-9]+)$
Mine: ^(?i:[a-z0-9]+(?:_[a-z0-9]+)?)$

I ran it using the same tool you used on my Opteron 185 (2x2.6Ghz) under Windows Vista x64. I made a small change to the code of the benchmark though, the original benchmark takes the Compile hit in the results. I did one call using the specific expression outside of the loop so that the results can be compared more easily.

Input's used:
0 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",

1 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", _
2 "aaaa", _
3 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%", _
4 "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%", _
5 "aaaa%", _
6 "aaaaaaaaaaaaaaaaaaaa_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", _
7 "aaaaaaaaaaaaaaaaaaaa_aaaaaaaaaaaaaaaaaaaaa", _
8 "aa_aa", _
9 "aaaaaaaaaaaaaaaaaaaa_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%", _
10 "aaaaaaaaaaaaaaaaaaaa_aaaaaaaaaaaaaaaaaaaaa%", _
11 "aa_aa%" _

Results:
Regular expression benchmark
----------------------------
Regular expressions : 4
Test strings : 12
Iterations : 10000
Total regex calls : (10000 * 12 * 4) = 480000

Pass 1, measure every regex as it runs (slower due to timing code)...
------------------------------------------
Regular expression library: System.Text.RegularExpressions

Total time taken: 58905
------------------------------------------
Pass 2, measure total time only..
------------------------------------------
Regular expression library: System.Text.RegularExpressions

RE: ^[a-zA-Z0-9]+_?[a-zA-Z0-9]+$
MS MAX AVG MIN DEV INPUT
110 0,365 0,011 0,0101 0,0073 '0'
145 0,387 0,0145 0,0134 0,0088 '1'
167 0,391 0,0167 0,0154 0,0095 '2'
17804 10,683 1,7804 1,6751 0,3004 '3'
18291 10,756 1,8291 1,7206 0,3042 '4'
18319 10,78 1,8319 1,7231 0,3046 '5'
18429 10,791 1,8429 1,7335 0,3055 '6'
18464 10,794 1,8464 1,7365 0,3058 '7'
18487 10,797 1,8487 1,7388 0,3059 '8'
18741 10,842 1,8741 1,7625 0,3079 '9'
18891 10,856 1,8891 1,7762 0,3129 '10'
18918 10,88 1,8918 1,7784 0,3135 '11'
RE: ^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$
MS MAX AVG MIN DEV INPUT
107 0,229 0,0107 0,0101 0,0049 '0'
141 0,484 0,0141 0,0134 0,0076 '1'
163 0,562 0,0163 0,0154 0,0085 '2'
512 1,015 0,0512 0,0489 0,0166 '3'
585 1,022 0,0585 0,0559 0,018 '4'
611 1,025 0,0611 0,0584 0,0185 '5'
715 1,035 0,0715 0,0682 0,0204 '6'
749 1,038 0,0749 0,0715 0,0207 '7'
773 1,604 0,0773 0,0735 0,0258 '8'
1010 1,634 0,101 0,0961 0,0299 '9'
1072 1,64 0,1072 0,102 0,0304 '10'
1097 1,643 0,1097 0,1045 0,0305 '11'
RE: ^(?i:[a-z0-9]+_?[a-z0-9]+)$
MS MAX AVG MIN DEV INPUT
241 2,098 0,0241 0,0204 0,0231 '0'
299 2,108 0,0299 0,0254 0,0241 '1'
325 2,11 0,0325 0,0279 0,0242 '2'
34692 17,108 3,4692 3,0303 0,4154 '3'
35653 17,191 3,5653 3,113 0,4274 '4'
35694 17,195 3,5694 3,1163 0,4284 '5'
35932 17,215 3,5932 3,137 0,4298 '6'
35990 17,22 3,599 3,1417 0,4304 '7'
36018 17,223 3,6018 3,1443 0,4305 '8'
36505 17,29 3,6505 3,1867 0,4337 '9'
36782 17,493 3,6782 3,2107 0,438 '10'
36815 17,496 3,6815 3,2138 0,4387 '11'
RE: ^(?i:[a-z0-9]+(?:_[a-z0-9]+)?)$
MS MAX AVG MIN DEV INPUT
236 1,798 0,0236 0,0201 0,0195 '0'
291 1,811 0,0291 0,0251 0,0201 '1'
320 4,578 0,032 0,0274 0,0497 '2'
946 4,635 0,0946 0,0827 0,0546 '3'
1067 4,645 0,1067 0,093 0,0611 '4'
1097 4,648 0,1097 0,0958 0,0613 '5'
1333 4,669 0,1333 0,1162 0,0631 '6'
1388 4,674 0,1388 0,1212 0,0633 '7'
1415 4,677 0,1415 0,1235 0,0634 '8'
1791 4,711 0,1791 0,1576 0,0654 '9'
1890 4,989 0,189 0,1659 0,0814 '10'
1920 4,999 0,192 0,1687 0,0816 '11'
Total time taken: 58088
------------------------------------------
Press ENTER to continue...

I wouldn't have thought that case insensitivity would make such a big difference. Learned something again today :).

My expression, without the case insensitivity is clearly the fastest:
RE: ^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$

So I'd say this one is better, as it's only 3 characters longer than yours, but much faster and without a possible DoS problem.

But as the original post didn't say that the underscore could only be between alphanumeric characters. It said that there was one '_' allowed in teh string (read anywhere in the string) my first expression best covers the problem.
^(?:[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?|[a-zA-Z0-9]+_|_[a-zA-Z0-9]+)$

Jesse


.



Relevant Pages

  • Re: Usename regex
    ... Think of a string, ... Regular expression benchmark ... MS MAX AVG MIN DEV INPUT ... If the textbox in question is limited to say 16 characters you'd ...
    (microsoft.public.dotnet.framework.aspnet)
  • Re: Prothon should not borrow Python strings!
    ... """It does not make sense to have a string without knowing what encoding ... same cul de sac as Python. ... Prothon_String_As_ASCII // raises error if there are high characters ... Python's split between byte strings and Unicode strings is ...
    (comp.lang.python)
  • Re: Letter to US Sen. Byron Dorgan re unpaid overtime
    ... put them in stupid places. ... Programming is difficult (as you must surely appreciate, ... > strings will be in the range 1...1000 characters. ... impose an artificially small limit on string length." ...
    (comp.programming)
  • Re: Byte Array to String
    ... retrieved text will mismatch the original characters. ... encoding the characters. ... Dim strFileData as String ...
    (microsoft.public.dotnet.framework.aspnet)
  • Re: A note on personal corruption as a result of using C
    ... impossible to write effective string validation routines by definition ... (Note that a string literal may contain embedded null characters; ... without resorting to abusive language. ... In practice, programmers typically use "struct" ...
    (comp.programming)