Re: Search for multiple things in a string
- From: Jon Skeet [C# MVP] <skeet@xxxxxxxxx>
- Date: Tue, 20 Sep 2005 19:20:36 +0100
tshad <tscheiderich@xxxxxxxxxxxxxxx> wrote:
> > It is - the regular expression *language* is a different language to
> > C#, in the same way that XPath is. That's why under "regular
> > expressions" in MSDN, there's a "language elements" section.
>
> I think calling it a language is a stretch, although I know it is called a
> language in places (it's all in what you define as a language).
In plenty of places. It's a language with a defined syntax etc.
> It really is
> a text/string processor, as is: IndexOf, Substring, Right, Replace etc used
> by various languages.
>
> You don't build pages with it. It isn't procedural.
Neither of those is required for it to be a language.
> It is a tool used by the other languages.
Sure - so is XPath, but that's a language too.
(See http://www.w3.org/TR/xpath)
> You don't use VB.Net in C# or Vice versa but both use
> Regular expressions (as they both use Substring, Replace etc).
None of those state that regular expressions aren't a language.
> > But not as instantly clear, I believe. Can you really say that you find
> > the regex version doesn't take you *any* longer to understand than the
> > non-regex version?
>
> Depends on the C# code as well as the Regex code.
The C# code in question would be:
if (someVariable.IndexOf ("firstliteral") != -1 ||
    someVariable.IndexOf ("secondliteral") != -1 ||
    someVariable.IndexOf ("thirdliteral") != -1)
If I did it regularly, I'd write a short method which took a params
string array.
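A minimal sketch of such a method, assuming an ordinal (character-by-character) comparison is what's wanted. The names StringSearch and ContainsAny are made up for illustration; they aren't framework methods:

```csharp
using System;

public static class StringSearch
{
    // Hypothetical helper: returns true if 'text' contains at least one
    // of the candidate strings.
    public static bool ContainsAny(string text, params string[] candidates)
    {
        foreach (string candidate in candidates)
        {
            // Assumption: ordinal matching. The one-argument IndexOf
            // overload used above is culture-sensitive instead.
            if (text.IndexOf(candidate, StringComparison.Ordinal) != -1)
            {
                return true;
            }
        }
        return false;
    }
}
```

The call site then reads
if (StringSearch.ContainsAny(someVariable, "firstliteral", "secondliteral", "thirdliteral"))
with each literal still visible on its own.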
> Again, are we talking about the best tool for the job or the most
> readability.
Unless there's another compelling argument in favour of one tool or
another, readability is a very important part of choosing the best
tool.
> As was mentioned before, you set up loops and temporary
> variables to do what you can do in a simple Regular Expression.
>
> Again, I am not pushing Regular Expressions here, just that they are just as
> valid as C# (or VB.Net) string handlers.
But you're effectively pushing them in the situation described by the
OP when you say that the solution using regular expressions is as
readable as the solution without.
> I do use them when convenient.
>
> For example, I was creating a simple text search engine and wanted to modify
> what the user put in and found it simpler to do the following than in VB or
> C:
>
> ' The following replaces all multiple blanks with " ". It then takes
> ' out the anomalies, such as "and not and" and replaces them with "and"
>
> keywords = trim(Regex.Replace(keywords, "\s{2,}", " "))
> keywords = Regex.Replace(keywords, "( )", " or ")
> keywords = Regex.Replace(keywords," or or "," ")
> keywords = Regex.Replace(keywords,"or and or","and")
> keywords = Regex.Replace(keywords,"or near or","near")
> keywords = Regex.Replace(keywords,"and not or","and not")
>
> Fairly straight forward and easy to follow.
Reasonably, although apart from the first regex, I'd suggest doing the
rest with straight calls to String.Replace. As an example of why I
think that would be more readable, what exactly does the second line do?
In some flavours of regular expressions, brackets form capturing
groups. Do they in .NET? I'd have to look it up. If it's really just
trying to replace the string "( )" with " or ", a call to
String.Replace would mean I didn't need to look anything up.
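For what it's worth, parentheses do form capturing groups in .NET regular expressions, so "( )" is a group containing a single space rather than the literal text "( )". A small sketch comparing the two calls, assuming the intent of the second line was to turn each remaining space into " or " (BracketDemo and its method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // "( )" is parsed as a capturing group holding one space, so this
    // replaces every single space, not the three characters "( )".
    public static string ViaRegex(string keywords)
    {
        return Regex.Replace(keywords, "( )", " or ");
    }

    // Same effect, with no regex syntax for the reader to decode.
    public static string ViaStringReplace(string keywords)
    {
        return keywords.Replace(" ", " or ");
    }
}
```

Both methods turn "cats dogs" into "cats or dogs"; only the second makes that obvious without looking anything up.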
> >> Also, you have the same problem when dealing with web pages or getting a
> >> file from the disk. You still use the escape character there (and as you
> >> say, is a little confusing) - but you still do it.
> >
> > You have to know the C# escaping, but not the regular expression
> > escaping.
>
> But you do NEED to know the C# escaping (readability not high - unless you
> understand it).
Yes, but I *already* need to know that in order to write C#. Choosing
to use String.IndexOf doesn't add to what I need to remember - choosing
regular expressions does. In addition, there aren't many things which
need escaping compared with those which need escaping in regular
expressions. In addition to *that*, whenever you need to escape in
regular expressions, you also need to escape in C# (or remember to use
verbatim string literals) - yet another piece of headache.
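To illustrate the two layers of escaping, here is a sketch that looks for a dollar sign three ways. The class and method names are hypothetical; the framework calls are String.IndexOf and Regex.IsMatch:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Plain search: only C# escaping applies, and "$" needs none.
    public static bool HasDollarPlain(string text)
    {
        return text.IndexOf("$", StringComparison.Ordinal) != -1;
    }

    // Regex in a regular literal: the regex escape \$ has to be written
    // as "\\$" so that the backslash survives C# escaping.
    public static bool HasDollarRegex(string text)
    {
        return Regex.IsMatch(text, "\\$");
    }

    // Regex in a verbatim literal: only the regex escape is left.
    public static bool HasDollarVerbatim(string text)
    {
        return Regex.IsMatch(text, @"\$");
    }

    // Forgetting the escape altogether: "$" is an end-of-input anchor,
    // so this matches any string at all.
    public static bool HasDollarUnescaped(string text)
    {
        return Regex.IsMatch(text, "$");
    }
}
```

The last method compiles and runs without complaint, which is rather the problem: it just gives the wrong answer.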
> > To me, a lot of readability comes from decent naming and commenting,
> > which fortunately are available in pretty much any language. I'd
> > certainly agree that object orientation (and exceptions, automatic
> > memory management etc) makes it a lot easier to write readable code
> > though.
>
> But writing objects and the objects themselves are not easily readable. But
> you would advocate not writing them, would you?
No, but I don't see how that's relevant.
> >> But if you know both and as I (and you) mentioned regex is part of .net
> >> as is C# - so it is already in the mix.
> >
> > No, it's not. It's not already used in every single C# program, any
> > more than SQL is.
>
> Nor are all the objects you use.
>
> But if you are using .Net, it is part of the mix.
It's not necessarily part of the mix I have to use. I suspect *very*
few programs don't do any string manipulation - knowing the string
methods well is *far* more fundamental to .NET programming than knowing
regular expressions.
> > In what way is it 6 of one or half a dozen of the other when one
> > solution requires knowing more than the other? I would expect *any* C#
> > programmer to know what String.IndexOf does. I wouldn't expect all C#
> > programmers to know by heart which regex language elements require
> > escaping - and if you don't know that off the top of your head, then
> > changing the code to search for a different string involves an extra
> > bit of brainpower.
>
> Why? Ever heard of references or cheat sheets? And what is wrong with a
> little extra brainpower - if you don't use it, you lose it :)
If you truly think that given two solutions which are otherwise equal,
the solution which is easiest to write, read and maintain doesn't win
hands down, we'll definitely never agree.
If you want to keep your hand in with respect to regular expressions,
do it in a test project, or with a regular expressions workbench. Keep
it out of code which needs to be read and maintained, probably by other
people who don't want to waste time because you wanted to keep your
skill set up to date.
> I don't know all of the possible combinations of calls to every Object, but
> that doesn't preclude me from using them.
Exactly - and you wouldn't go out of your way to use methods you don't
need, just to get into the habit of using them, would you?
> My position has always been, don't memorize. You will remember what you
> use. But if you know how to get it (where to look), then you have
> everything you need.
Absolutely - so why are you so keen on making people either memorise or
look up the characters which need escaping for regular expressions
every time they read or modify your code?
> I happen to use .Net. Regex is part of .Net. I would be limiting myself if
> I didn't use Regex in places where it is appropriate.
I seem to be having difficulty making myself clear on this point: I
have never stated and will never state that you shouldn't use regular
expressions where they're appropriate. But they are *not* appropriate
in this case, as they are a more complex and less readable way of
solving the problem.
Show me a problem where the regex way of solving it is simpler than
using simple string operations (and there are plenty of problems like
that) and I'll plump for the regex in a heartbeat.
> If I happen to know a good way in Regex to solve a problem, I am not
> going use *extra brainpower* to try to solve the problem in C#.
In what way is using the method which is designed for *precisely* the
task in hand (finding something in a string) using extra brainpower? If
you're not familiar with String.IndexOf, you've got *much* bigger
things to worry about than whether or not your regular expression
skills are getting rusty.
> > It was *less* readable though - and would have been *significantly*
> > less readable if the string being searched for had included dots,
> > brackets etc.
>
> But it didn't. But if it did, it is no different than having to deal with
> escapes in C (less readable)
>
> If you are talking about
>
> if ((someString.IndexOf("something1",0) >= 0) ||
> (someString.IndexOf("something2",0) >= 0) ||
> (someString.IndexOf("something3",0) >= 0))
> {
> // Do something
> }
>
> vs
>
> if (Regex.IsMatch(myString, @"something1|something2|something3"))
>
> If you know absolutely nothing about Regular expressions, I would agree that
> this is less readable.
>
> But I would also contend that IndexOf could be just as confusing. What is
> the first 0 for? What about the 2nd? It is readable because you know C.
Well, for a start the 0s aren't necessary, and I wouldn't include them.
> I would maintain that if even if you knew nothing about Regex, you would
> assume that you are doing a Match (can't tell that from the word "IndexOf")
> and it probably has something to do with the words "something1",
> "something2" and "something3". Now if you know C then I would assume you
> would pick up that "|" is "or" (not so clear to a VB programmer). And that
> would be to someone not familiar with regular expressions doing a quick
> perusal
Okay - now suppose I need to change it from searching for "something1"
to "something.1" or "something[1]". How long does it take to change in
each version? How easy is it to read afterwards?
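A sketch of that change for "something[1]", under the assumption that the maintainer simply edits the literal in place (ChangeDemo and its method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChangeDemo
{
    // IndexOf version: change the literal and you're done.
    public static bool PlainFind(string text)
    {
        return text.IndexOf("something[1]", StringComparison.Ordinal) != -1;
    }

    // Regex version with the literal edited in place: [1] is now a
    // character class, so this looks for "something1" instead.
    public static bool NaiveRegexFind(string text)
    {
        return Regex.IsMatch(text, "something[1]");
    }

    // Regex version done properly: both brackets escaped.
    public static bool EscapedRegexFind(string text)
    {
        return Regex.IsMatch(text, @"something\[1\]");
    }
}
```

Given the input "something[1]", the naive regex version fails to find it, and given "something1" it reports a match that shouldn't be there. The "something.1" case is quieter still: the unescaped dot matches any character, so the bug only shows up as false positives.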
> So I am at a loss as to how this regular expression is more unreadable than
> the C# counterpart. That is not to say that you couldn't make it more
> unreadable - but you could do the same with C# if you wanted to.
You could start by making the C# more readable, as I've shown...
However, the regex is already less readable:
1) It's got "|" as a "magic character" in there.
2) It's got all the strings concatenated, so it's harder to spot each
of them separately.
And that's before you need to actually *maintain* the code.
Furthermore, suppose you didn't just want to search for literals -
suppose one of the strings you wanted to search for was contained in a
variable. How sure are you that *no-one* on your team would use:
x+"|something2|something3"
as the regular expression?
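A sketch of what goes wrong there, and of the Regex.Escape call which would be needed to do it safely. The class and method names are hypothetical, and x stands for whatever the variable happens to hold:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VariablePatternDemo
{
    // The tempting version: x is treated as a pattern, not as text.
    public static bool NaiveMatch(string text, string x)
    {
        return Regex.IsMatch(text, x + "|something2|something3");
    }

    // The safe regex version: escape the variable before concatenating.
    public static bool EscapedMatch(string text, string x)
    {
        return Regex.IsMatch(text, Regex.Escape(x) + "|something2|something3");
    }

    // The IndexOf version: nothing to escape in the first place.
    public static bool PlainMatch(string text, string x)
    {
        return text.IndexOf(x, StringComparison.Ordinal) != -1 ||
               text.IndexOf("something2", StringComparison.Ordinal) != -1 ||
               text.IndexOf("something3", StringComparison.Ordinal) != -1;
    }
}
```

With x set to "a.c" and the text "abc", NaiveMatch reports a match while the other two correctly don't. A value such as "(" is worse: it doesn't even parse as a pattern, so the naive version throws at runtime.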
> > I suspect not all programmers would though. Don't forget that the
> > person who writes the code is very often not the one to maintain it.
> > Can you guarantee that *everyone* who touches the code will find
> > regexes as readable as String.IndexOf?
>
> As was said, you can make readable and unreadable C or Regex code. Are you
> going to tell your programmers they "cannot" use Regex for the same reason?
I would tell programmers on my team not to use regular expressions
where the alternative is simpler and more readable, yes.
> Are you going to leave out some objects that programmers may not be familiar
> with?
Absolutely, where there are simpler and more familiar ways of solving
the same problem.
> > Which is why I've said repeatedly that I'm not trying to suggest that
> > regexes are bad, or should never be used. I'm just saying that in this
> > case it's using a sledgehammer to crack a nut.
>
> And I don't in this case, as I think I've shown. Less typing, easy to read,
> straight forward - in this case.
You've shown nothing of the kind - whereas I think I've given plenty of
examples of how using regular expressions makes the code less easily
maintainable, even if you consider it equally readable to start with
(which I don't).
> >> SalaryMax.Text =
> >> String.Format("{0:c}",CalculateYearly(Regex.Replace(WagesMax.Text,"\$|\,","")))
> >>
> >> At the time, I couldn't seem to find as simple a solution as this in
> >> VB.Net
> >> so I use this (not saying there isn't one).
> >
> > And of course there is:
> > SalaryMax.Text =
> > String.Format ("{0:c}",CalculateYearly(WagesMax.Text.Replace("$", "")
> > .Replace(",", "")));
> >
> > I know which version I'd rather read...
>
> I can read either (although, I didn't know you could string multiple
> "Replace"s together).
Yes, I can read either too. The point is that in reading my version, I
didn't need to wade through various special characters, working out
exactly what each was there for. Of course, your version wasn't even valid
C#, as it didn't escape the backslashes and you didn't specify a
verbatim literal. I assume it was originally VB.NET. I wonder which
version would be easier to convert to valid C#? Mine, perhaps?
> > But I suspect you're more used to regular expressions than many other
> > programmers - and making the code less readable for other programmers
> > for no benefit is what makes it unwarranted here, even in the simple
> > case where there's nothing to escape.
>
> First of all, I am not. I don't use it much at all, but I find it easy to
> figure out and straightforward (but you can make it really complex). I use
> it to validate phone numbers, credit card numbers, zip codes etc.
And in all of those cases, regular expressions are really useful.
> Which are very well documented and when there are a myriad of ways a
> user can input these types of data, I prefer to use Regular
> expressions which are all over the place (easy to find) than try to
> come up with some complex set of loops and temporary variables which
> make it far easier to make a mistake and much more unreadable than the
> Regex equivalent.
Where exactly are the complex loops and temporary variables in this
specific case? After all, you have been arguing for using regular
expressions in *this specific case*, haven't you?
> > Because it's more complicated! You can't deny that there's more to
> > consider due to the escaping. There's more to know, more to consider,
> > and it doesn't get the job done any more cleanly.
>
> Escaping seems to be your main complaint with it.
It's the main potential source of problems, yes. It's a potential
source of problems which simply doesn't exist when you use
String.IndexOf.
> I have the same problem with C or VB when trying to remember when to use "\"
> vs "/" in paths or do I need to add "\" in front of my slash or quote.
> These are inherent problems with pretty much all of them.
You already need to know that when writing C# though - my use of
String.IndexOf doesn't add to the volume of knowledge required.
> > As is using the power of regular expressions when there is an easier
> > way - using IndexOf, which is *precisely* there to find one string
> > within another.
>
> I am not discounting IndexOf, I am just saying that both work fine and are
> just as readable (in this case). In other cases, that may not be the case
> (with either C or Regex).
Just because they're as readable *to you* doesn't mean they're as
readable to everyone. How sure are you that the next engineer to read
this code will be familiar with regular expressions? How sure are you
that when you need to change it to look for a different string, you'll
check whether any of the characters need to be escaped? Why would you
even want to force that check on yourself?
> > Do you really think it would take you that long to refamiliarise
> > yourself with it? I don't see why it's a good idea to make some poor
> > maintenance engineer who hasn't used regular expressions before try to
> > figure out that *actually* you were just trying to find strings within
> > each other just so you can keep your skill set current.
>
> So you would prefer to code to the lowest common denominator.
When there's no good reason not to, absolutely.
> I am not going to code to the level of a junior programmer. I prefer that
> he learn to code to a higher level.
Learning to solve problems as simply as possible *is* learning to code
to a higher level.
> I am not saying that you shouldn't still write decent, readable, commented
> code. But I am not going to limit myself because another programmer may not
> be able to read well written code. If that were the case, I would not be
> writing objects (abstract classes, interfaces, etc).
If it's not the simplest code for the situation, it's not well written
IMO. If it introduces risk for no reward (the risk of maintenance
failing to notice that they might need to escape something, versus no
reward) then it's not well written.
> > I've never had a problem with reading the documentation when I've
> > needed to use regular expressions, without putting it in projects in
> > places where I *don't* need it.
>
> "Need" is a personal question. I don't think it applies here. You prefer
> IndexOf and I might prefer IsMatch.
I bet if I showed my code to a random sample of a hundred C# developers
and asked them to change it to search for "hello[there]", virtually all
of them would get it right. I also bet that if I showed your code to
them and asked them for the same change, some would fail to escape it
appropriately. Do you disagree?
> > Because it makes things more complicated for no benefit. The reflection
> > example was a good one - that allows you to get a property value, so do
> > you think it's a good idea to write:
> >
> > string x = (string) something.GetType()
> > .GetProperty("Name")
> > .GetValue(something, null);
> > or
> >
> > string x = something.Name;
> >
> > ?
> >
> > Maybe I should use the former. After all, I wouldn't want to forget how
> > to use reflection, would I?
>
> Lost me on that one.
Both are ways of finding the value of a property. The first is harder
to maintain and harder to read, just like your use of regular
expressions in this instance. Now, which of the above snippets of code
would you use, and why?
--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too