Re: Search for multiple things in a string
- From: "tshad" <tscheiderich@xxxxxxxxxxxxxxx>
- Date: Tue, 20 Sep 2005 15:31:12 -0700
"Jon Skeet [C# MVP]" <skeet@xxxxxxxxx> wrote in message
news:MPG.1d9a634f148e765998c781@xxxxxxxxxxxxxxxxxxxxxxx
> tshad <tscheiderich@xxxxxxxxxxxxxxx> wrote:
>> > It is - the regular expression *language* is a different language to
>> > C#, in the same way that XPath is. That's why under "regular
>> > expressions" in MSDN, there's a "language elements" section.
>>
>> I think calling it a language is a stretch, although I know it is called
>> a language in places (it's all in what you define as a language).
>
> In plenty of places. It has a language with a defined syntax etc.
Yes, but so are dolphin sounds.
When I talk about a Programming Language - I am talking about a Procedural
Language (C, Fortran, VB, Pascal, etc.).
>
>> It really is
>> a text/string processor, as are IndexOf, Substring, Right, Replace etc.
>> used by various languages.
>>
>> You don't build pages with it. It isn't procedural.
>
> Neither of those are required for it to be a language.
>
>> It is a tool used by the other languages.
>
> Sure - so is XPath, but that's a language too.
> (See http://www.w3.org/TR/xpath)
>
>> You don't use VB.Net in C# or vice versa but both use
>> Regular expressions (as they both use Substring, Replace etc).
>
> None of those state that regular expressions aren't a language.
>
>> > But not as instantly clear, I believe. Can you really say that you find
>> > the regex version doesn't take you *any* longer to understand than the
>> > non-regex version?
>>
>> Depends on the C# code as well as the Regex code.
>
> The C# code in question would be:
>
> if (someVariable.IndexOf ("firstliteral") != -1 ||
> someVariable.IndexOf ("secondliteral") != -1 ||
> someVariable.IndexOf ("thirdliteral") != -1)
>
And the Regex version:
if (Regex.IsMatch(myString, @"something1|something2|something3"))
> If I did it regularly, I'd write a short method which took a params
> string array.
>
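For illustration, the "short method which took a params string array" mentioned above might look something like the following. This is only a sketch under assumptions: the class name StringSearch and the method name ContainsAny are made up here and do not come from the thread, and ordinal (character-by-character) comparison is assumed.

```csharp
using System;

public static class StringSearch
{
    // Hypothetical helper (name not from the thread): returns true if
    // 'text' contains at least one of the 'candidates' as a literal substring.
    public static bool ContainsAny(string text, params string[] candidates)
    {
        foreach (string candidate in candidates)
        {
            // Plain literal search; no characters have special meaning.
            if (text.IndexOf(candidate, StringComparison.Ordinal) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

With a helper like this, the call site would read roughly as: if (StringSearch.ContainsAny(someVariable, "firstliteral", "secondliteral", "thirdliteral")).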
>> Again, are we talking about the best tool for the job or the most
>> readability.
>
> Unless there's another compelling argument in favour of one tool or
> another, readability is a very important part of choosing the best
> tool.
Again, why do I need a compelling reason? If I have the solution and it
happens to be Regex, I would use it, I wouldn't necessarily say to myself -
"Is there perhaps a more readable way to write this? I wonder if Jim will
be able to read this or not."
>
>> As was mentioned before, you set up loops and temporary
>> variables to do what you can do in a simple Regular Expression.
>>
>> Again, I am not pushing Regular Expressions here, just that they are just
>> as valid as C# (or VB.Net) string handlers.
>
> But you're effectively pushing them in the situation described by the
> OP when you say that the solution using regular expressions is as
> readable as the solution without.
No.
No pushing. No more than you're pushing not using it.
>
>> I do use them when convenient.
>>
>> For example, I was creating a simple text search engine and wanted to
>> modify what the user put in and found it simpler to do the following
>> than in VB or C:
>>
>> ' The following replaces all multiple blanks with " ". It then takes
>> ' out the anomalies, such as "and not and" and replaces them with "and"
>>
>> keywords = trim(Regex.Replace(keywords, "\s{2,}", " "))
>> keywords = Regex.Replace(keywords, "( )", " or ")
>> keywords = Regex.Replace(keywords," or or "," ")
>> keywords = Regex.Replace(keywords,"or and or","and")
>> keywords = Regex.Replace(keywords,"or near or","near")
>> keywords = Regex.Replace(keywords,"and not or","and not")
>>
>> Fairly straight forward and easy to follow.
>
> Reasonably, although apart from the first regex, I'd suggest doing the
> rest with straight calls to String.Replace. As an example of why I
> think that would be more readable, what exactly does the second line do?
Actually, nothing. It is grouping a " ", which isn't necessary. I think I
used to have something else there and took it out and didn't realize I
didn't need the ().
> In some flavours of regular expressions, brackets form capturing
> groups. Do they in .NET? I'd have to look it up. If it's really just
> trying to replace the string "( )" with " or ", a call to
> String.Replace would mean I didn't need to look anything up.
Obviously, you didn't need to look this one up either - as you were correct.
It is just grouping a blank.
>
>> >> Also, you have the same problem when dealing with web pages or getting
>> >> a file from the disk. You still use the escape character there (and as
>> >> you say, it is a little confusing) - but you still do it.
>> >
>> > You have to know the C# escaping, but not the regular expression
>> > escaping.
>>
>> But you do NEED to know the C# escaping (readability not high - unless
>> you understand it).
>
> Yes, but I *already* need to know that in order to write C#. Choosing
> to use String.IndexOf doesn't add to what I need to remember - choosing
> regular expressions does. In addition, there aren't many things which
> need escaping compared with those which need escaping in regular
> expressions. In addition to *that*, whenever you need to escape in
> regular expressions, you also need to escape in C# (or remember to use
> verbatim string literals) - yet another piece of headache.
>
>> > To me, a lot of readability comes from decent naming and commenting,
>> > which fortunately are available in pretty much any language. I'd
>> > certainly agree that object orientation (and exceptions, automatic
>> > memory management etc) makes it a lot easier to write readable code
>> > though.
>>
>> But writing objects and the objects themselves are not easily readable.
>> But you would not advocate not writing them, would you?
>
> No, but I don't see how that's relevant.
Just that you don't want to use Regex as it is not easily readable. Neither are
objects.
But the fact a junior programmer might not understand Objects as you do
would not prevent you from writing them, would it?
>
>> >> But if you know both and as I (and you) mentioned regex is part of
>> >> .net as is C# - so it is already in the mix.
>> >
>> > No, it's not. It's not already used in every single C# program, any
>> > more than SQL is.
>>
>> Nor are all the objects you use.
>>
>> But if you are using .Net, it is part of the mix.
>
> It's not necessarily part of the mix I have to use.
You don't have to use lots of things. That doesn't make them invalid.
Neither does the fact that you use Foreach vs For {}. They are there and are
part of the mix as is Regex. I might agree with you more if Regex were some
component that you picked up and added. Or if Regex were some obscure
technique that few knew about. Regular expressions have been around for quite
a long time and are just another gun in your arsenal. If I thought that MS were
deprecating it, I would also think twice about using it. But it is part of
.Net that all the languages can make use of and I would never tell a
programmer, who may be really comfortable with it and uses it responsibly
(not obscure cryptic non-commented code), that he should be using IndexOf
instead.
> I suspect *very*
> few programs don't do any string manipulation - knowing the string
> methods well is *far* more fundamental to .NET programming than knowing
> regular expressions.
I agree with part of that and think that regular expressions are just as
important to know. As we have been saying, it is here and many people use
it, so to not understand it is to limit yourself. You don't have to use it,
but you should at least understand the basics of how it works. What are you
going to do when someone uses a RegularExpressionValidator and you don't
understand what the expression is? The fact that it is not C# (neither is a
textbox, datagrid, etc), doesn't mean you shouldn't understand them. Whether
you use them is up to you.
As you point out, you are not the only programmer and many programmers like
to use Regex and that doesn't make them any lesser programmers. What are
you going to do when you run into their code?
I see code all the time (much of the time it is mine) and wonder why the
programmer didn't do it another way. There are many ways to skin a cat.
Sometimes it is just style, sometimes it is all they know. But if they
follow whatever standards are set up (and in your case maybe you forbid
Regex) then as long as the code is well written and clean - I have no
problem with it.
>
>> > In what way is it 6 of one or half a dozen of the other when one
>> > solution requires knowing more than the other? I would expect *any* C#
>> > programmer to know what String.IndexOf does. I wouldn't expect all C#
>> > programmers to know by heart which regex language elements require
>> > escaping - and if you don't know that off the top of your head, then
>> > changing the code to search for a different string involves an extra
>> > bit of brainpower.
>>
>> Why? Ever heard of references or cheat sheets? And what is wrong with a
>> little extra brainpower - if you don't use it, you lose it :)
>
> If you truly think that given two solutions which are otherwise equal,
> the solution which is easiest to write, read and maintain doesn't win
> hands down, we'll definitely never agree.
>
I agree there.
Which is easier to write is obviously your perception. I found my example
as easy as yours to write and just as readable.
> If you want to keep your hand in with respect to regular expressions,
> do it in a test project, or with a regular expressions workbench. Keep
> it out of code which needs to be read and maintained, probably by other
> people who don't want to waste time because you wanted to keep your
> skill set up to date.
>
Keep regular expressions out of my code?????
So now you are saying there is no use for it?
>> I don't know all of the possible combinations of calls to every Object,
>> but that doesn't preclude me from using them.
>
> Exactly - and you wouldn't go out of your way to use methods you don't
> need, just to get into the habit of using them, would you?
Sure.
If it is valid. As I said there are many ways to skin ..., depending on the
situation I may do it one way and the next time another way. Gives me many
options. I don't do it willy nilly, as you seem to suggest, as a test
bench.
>
>> My position has always been, don't memorize. You will remember what you
>> use. But if you know how to get it (where to look), then you have
>> everything you need.
>
> Absolutely - so why are you so keen on making people either memorise or
> look up the characters which need escaping for regular expressions
> every time they read or modify your code?
>
I am not. I don't memorize. But I still use it.
>> I happen to use .Net. Regex is part of .Net. I would be limiting myself
>> if I didn't use Regex in places where it is appropriate.
>
> I seem to be having difficulty making myself clear on this point: I
> have never stated and will never state that you shouldn't use regular
> expressions where they're appropriate. But they are *not* appropriate
> in this case, as they are a more complex and less readable way of
> solving the problem.
No, you are very clear. If you are so concerned with others being able to
read your code and problems with escape characters - why would you EVER want
them to use them? You can't have it both ways.
If they would have a hard time with a nothing expression like "if
(Regex.IsMatch(myString, @"something1|something2|something3"))" - they are
never going to get some of the other standard Regex solutions I
mentioned before.
As you said, the two solutions are equal. Your solution is that you MUST go
with IndexOf. Mine is you can use either.
>
> Show me a problem where the regex way of solving it is simpler than
> using simple string operations (and there are plenty of problems like
> that) and I'll plump for the regex in a heartbeat.
>
>> If I happen to know a good way in Regex to solve a problem, I am not
>> going use *extra brainpower* to try to solve the problem in C#.
>
> In what way is using the method which is designed for *precisely* the
> task in hand (finding something in a string) using extra brainpower?
I wasn't referring to this particular issue when I said this.
> If
> you're not familiar with String.IndexOf, you've got *much* bigger
> things to worry about than whether or not your regular expression
> skills are getting rusty.
I never said I was not familiar with IndexOf.
As a matter of fact, the original question was whether you could "do a
search for more than one string in another string".
****************************************************************
Can you do a search for more than one string in another string?
Something like:
someString.IndexOf("something1","something2","something3",0)
or would you have to do something like:
if ((someString.IndexOf("something1",0) >= 0) ||
(someString.IndexOf("something2",0) >= 0) ||
(someString.IndexOf("something3",0) >= 0))
{
// Do something
}
***************************************************************************
IndexOf doesn't do it. This was the original question. You have to do
multiple calls as is said in the original question. Nicholas was correct in
his assessment. One Regex call would work.
>
>> > It was *less* readable though - and would have been *significantly*
>> > less readable if the string being searched for had included dots,
>> > brackets etc.
>>
>> But it didn't. But if it did, it is no different than having to deal
>> with escapes in C (less readable)
>>
>> If you are talking about
>>
>> if ((someString.IndexOf("something1",0) >= 0) ||
>> (someString.IndexOf("something2",0) >= 0) ||
>> (someString.IndexOf("something3",0) >= 0))
>> {
>> // Do something
>> }
>>
>> vs
>>
>> if (Regex.IsMatch(myString, @"something1|something2|something3"))
>>
>> If you know absolutely nothing about Regular expressions, I would agree
>> that this is less readable.
>>
>> But I would also contend that IndexOf could be just as confusing. What
>> is the first 0 for? What about the 2nd? It is readable because you know C.
>
> Well, for a start the 0s aren't necessary, and I wouldn't include them.
You're right.
>
>> I would maintain that even if you knew nothing about Regex, you would
>> assume that you are doing a Match (can't tell that from the word
>> "IndexOf") and it probably has something to do with the words
>> "something1", "something2" and "something3". Now if you know C then I
>> would assume you would pick up that "|" is "or" (not so clear to a VB
>> programmer). And that would be to someone not familiar with regular
>> expressions doing a quick perusal
>
> Okay - now suppose I need to change it from searching for "something1"
> to "something.1" or "something[1]". How long does it take to change in
> each version? How easy is it to read afterwards?
That wasn't the question.
What if you wanted to change "something1" to "something\". Same problem.
And if escapes were a problem (if it were me) I would have a little ***
that showed them at my desk within easy reach.
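To make the escaping point concrete, here is a minimal sketch of what happens when the literal being searched for changes to "something[1]". The class and method names are invented for this illustration and are not from the thread. In a regular expression, an unescaped [1] is a character class matching the single character 1, so the unescaped pattern finds "something1" rather than the literal text with brackets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Literal search: the brackets have no special meaning to IndexOf.
    public static bool FoundWithIndexOf(string text)
    {
        return text.IndexOf("something[1]", StringComparison.Ordinal) >= 0;
    }

    // Unescaped pattern: [1] is a character class, so this looks for "something1".
    public static bool FoundWithRawPattern(string text)
    {
        return Regex.IsMatch(text, @"something[1]");
    }

    // Escaped pattern: \[ and \] match the literal bracket characters.
    public static bool FoundWithEscapedPattern(string text)
    {
        return Regex.IsMatch(text, @"something\[1\]");
    }
}
```

Given the input "something[1]", the IndexOf version and the escaped pattern both report a match, while the unescaped pattern does not; given "something1", only the unescaped pattern matches.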
>
>> So I am at a loss as to how this regular expression is more unreadable
>> than the C# counterpart. That is not to say that you couldn't make it
>> more unreadable - but you could do the same with C# if you wanted to.
>
> You could start by making the C# more readable, as I've shown...
As you can with Regular Expressions.
>
> However, the regex is already less readable:
> 1) It's got "|" as a "magic character" in there.
| = or (same as C)
> 2) It's got all the strings concatenated, so it's harder to spot each
> of them separately.
You are kidding, right?
>
> And that's before you need to actually *maintain* the code.
>
> Furthermore, suppose you didn't just want to search for literals -
> suppose one of the strings you wanted to search for was contained in a
> variable. How sure are you that *no-one* on your team would use:
>
> x+"|something2|something3"
>
> as the regular expression?
>
You are now leaving the original question. I never said that Regular
Expressions were better (or not better) in all cases.
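For the case where one of the search strings comes from a variable, one possible approach (a sketch only, not something either poster proposed) is to pass each literal through Regex.Escape before joining the alternatives with "|". The class name PatternBuilder and the method name BuildAlternation are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical helper: builds "a|b|c" from literal strings, escaping each
    // one so that characters such as '.', '[' or '|' are matched literally.
    public static string BuildAlternation(params string[] literals)
    {
        string[] escaped = new string[literals.Length];
        for (int i = 0; i < literals.Length; i++)
        {
            escaped[i] = Regex.Escape(literals[i]);
        }
        return string.Join("|", escaped);
    }
}
```

It could then be used as Regex.IsMatch(myString, PatternBuilder.BuildAlternation(x, "something2", "something3")), so that whatever x holds is treated as plain text rather than as pattern syntax.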
>> > I suspect not all programmers would though. Don't forget that the
>> > person who writes the code is very often not the one to maintain it.
>> > Can you guarantee that *everyone* who touches the code will find
>> > regexes as readable as String.IndexOf?
>>
>> As was said, you can make readable and unreadable C or Regex code. Are
>> you going to tell your programmers they "cannot" use Regex for the same
>> reason?
>
> I would tell programmers on my team not to use regular expressions
> where the alternative is simpler and more readable, yes.
Why use them at all? It isn't readable.
And if your programmers can't maintain the simple Regexes, they definitely
won't be able to handle the more complicated ones.
>
>> Are you going to leave out some objects that programmers may not be
>> familiar with?
>
> Absolutely, where there are simpler and more familiar ways of solving
> the same problem.
>
>> > Which is why I've said repeatedly that I'm not trying to suggest that
>> > regexes are bad, or should never be used. I'm just saying that in this
>> > case it's using a sledgehammer to crack a nut.
>>
>> And I don't in this case, as I think I've shown. Less typing, easy to
>> read, straightforward - in this case.
>
> You've shown nothing of the kind - whereas I think I've given plenty of
> examples of how using regular expressions make the code less easily
> maintainable, even if you consider it equally readable to start with
> (which I don't).
Not in this specific case. I was never maintaining or pushing Regex for all
or any situations.
But I am not going to force my programmers to come to me to find out whether
or not Regex is the easiest way. That is up to the programmer. If
there is a problem with their code and I feel the programmer is way off base
in his coding we would talk about it (that would be the case with his C#, VB or
Regex code).
>
>> >> SalaryMax.Text =
>> >> String.Format("{0:c}",CalculateYearly(Regex.Replace(WagesMax.Text,"\$|\,","")))
>> >>
>> >> At the time, I couldn't seem to find as simple a solution as this in
>> >> VB.Net so I use this (not saying there isn't one).
>> >
>> > And of course there is:
>> > SalaryMax.Text =
>> > String.Format ("{0:c}",CalculateYearly(WagesMax.Text.Replace("$", "")
>> > .Replace(",", "")));
>> >
>> > I know which version I'd rather read...
>>
>> I can read either (although, I didn't know you could string multiple
>> "Replace"s together).
>
> Yes, I can read either too. The point is that in reading my version, I
> didn't need to wade through various special characters, understanding
> exactly what each was there for.
If you knew enough to know about Regex at all (which you said you would have
no problem with in some situations - so the programmers better be able to
read it), there should not be a problem with the 2 special characters, which
are the same as in C#. There is nothing obscure in this example - that I can
see.
> Of course, your version wasn't even valid
> C#, as it didn't escape the backslashes and you didn't specify a
> verbatim literal. I assume it was originally VB.NET. I wonder which
> version would be easier to convert to valid C#? Mine, perhaps?
Actually, it was VB.Net.
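For comparison, here is a rough sketch of how the two approaches to stripping "$" and "," might look as valid C#. The class and method names are invented for this illustration, and CalculateYearly and the text boxes from the original snippet are left out since their definitions are not shown in the thread.

```csharp
using System.Text.RegularExpressions;

public static class CurrencyText
{
    // Regex version: the verbatim literal (@"...") keeps the backslash as-is,
    // and \$ is needed because $ is an anchor in regular expressions.
    public static string StripWithRegex(string input)
    {
        return Regex.Replace(input, @"\$|,", "");
    }

    // String.Replace version: both arguments are plain literals.
    public static string StripWithReplace(string input)
    {
        return input.Replace("$", "").Replace(",", "");
    }
}
```

Both methods should turn an input such as "$1,234.50" into "1234.50"; the difference is only in what a reader has to know to verify that.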
>
>> > But I suspect you're more used to regular expressions than many other
>> > programmers - and making the code less readable for other programmers
>> > for no benefit is what makes it unwarranted here, even in the simple
>> > case where there's nothing to escape.
>>
>> First of all, I am not. I don't use it much at all, but I find it easy
>> to figure out and straightforward (but you can make it really complex). I
>> use it to validate phone numbers, credit card numbers, zip codes etc.
>
> And in all of those cases, regular expressions are really useful.
But according to you, you shouldn't use them as some of the programmers may
not be able to maintain it. Definitely if they would have a problem with
our example.
Can't have it both ways. If you allow Regular Expressions, you shouldn't
have a problem if a programmer used the Regex or IndexOf in our example.
Anyone maintaining the "USEFUL" ones would have zero problems with this one.
>
>> Which are very well documented and when there are a myriad of ways a
>> user can input these types of data, I prefer to use Regular
>> expressions which are all over the place (easy to find) than try to
>> come up with some complex set of loops and temporary variables which
>> make it far easier to make a mistake and much more unreadable than the
>> Regex equivalent.
>
> Where exactly are the complex loops and temporary variables in this
> specific case? After all, you have been arguing for using regular
> expressions in *this specific case*, haven't you?
>
I was obviously talking about Regular Expressions in general here as I was
referring to the standard ones you can get anywhere dealing with (Phone
numbers, credit card etc). There would be none in this case, obviously.
But there may be in more complicated cases.
>> > Because it's more complicated! You can't deny that there's more to
>> > consider due to the escaping. There's more to know, more to consider,
>> > and it doesn't get the job done any more cleanly.
>>
>> Escaping seems to be your main complaint with it.
>
> It's the main potential source of problems, yes. It's a potential
> source of problems which simply doesn't exist when you use
> String.IndexOf.
>
>> I have the same problem with C or VB when trying to remember when to use
>> "\" vs "/" in paths or do I need to add "\" in front of my slash or quote.
>> These are inherent problems with pretty much all of them.
>
> You already need to know that when writing C# though - my use of
> String.IndexOf doesn't add to the volume of knowledge required.
>
It is still an issue. Just as the Regular expressions are. And again, if
you are going to allow Regex at all, you would still need to know about the
escapes.
>> > As is using the power of regular expressions when there is an easier
>> > way - using IndexOf, which is *precisely* there to find one string
>> > within another.
>>
>> I am not discounting IndexOf, I am just saying that both work fine and
>> are just as readable (in this case). In other cases, that may not be the
>> case (with either C or Regex).
>
> Just because they're as readable *to you* doesn't mean they're as
> readable to everyone. How sure are you that the next engineer to read
> this code will be familiar with regular expressions? How sure are you
> that when you need to change it to look for a different string, you'll
> check whether any of the characters need to be escaped? Why would you
> even want to force that check on yourself?
Again - then don't allow them at all.
>
>> > Do you really think it would take you that long to refamiliarise
>> > yourself with it? I don't see why it's a good idea to make some poor
>> > maintenance engineer who hasn't used regular expressions before try to
>> > figure out that *actually* you were just trying to find strings within
>> > each other just so you can keep your skill set current.
>>
>> So you would prefer to code to the lowest common denominator.
>
> When there's no good reason not to, absolutely.
I guess that is where we disagree.
>
>> I am not going to code to the level of a junior programmer. I prefer
>> that he learn to code to a higher level.
>
> Learning to solve problems as simply as possible *is* learning to code
> to a higher level.
No argument there.
>
>> I am not saying that you shouldn't still write decent, readable,
>> commented code. But I am not going to limit myself because another
>> programmer may not be able to read well written code. If that were the
>> case, I would not be writing objects (abstract classes, interfaces, etc).
>
> If it's not the simplest code for the situation, it's not well written
> IMO. If it introduces risk for no reward (the risk of maintenance
> failing to notice that they might need to escape something, versus no
> reward) then it's not well written.
>
I see no risk in the example we are talking about. At least, no more than
in the IndexOf solution (in this situation).
>> > I've never had a problem with reading the documentation when I've
>> > needed to use regular expressions, without putting it in projects in
>> > places where I *don't* need it.
>>
>> "Need" is a personal question. I don't think it applies here. You
>> prefer IndexOf and I might prefer IsMatch.
>
> I bet if I showed my code to a random sample of a hundred C# developers
> and asked them to change it to search for "hello[there]", virtually all
> of them would get it right. I also bet that if I showed your code to
> them and asked them for the same change, some would fail to escape it
> appropriately. Do you disagree?
No. But then the same developers would have a problem with the more
complicated expressions you claim are useful.
>
>> > Because it makes things more complicated for no benefit. The reflection
>> > example was a good one - that allows you to get a property value, so do
>> > you think it's a good idea to write:
>> >
>> > string x = (string) something.GetType()
>> > .GetProperty("Name")
>> > .GetValue(something, null);
>> > or
>> >
>> > string x = something.Name;
>> >
>> > ?
>> >
>> > Maybe I should use the former. After all, I wouldn't want to forget how
>> > to use reflection, would I?
>>
>> Lost me on that one.
>
> Both are ways of finding the value of a property. The first is harder
> to maintain and harder to read, just like your use of regular
> expressions in this instance. Now, which of the above snippets of code
> would you use, and why?
Since I am not sure why you would use the first, I would do the 2nd.
But in our case, I would still use either - as I see the Regex version as
being as easy as the IndexOf one.
Tom