Re: Optimize an IEnumerable

Tech-Archive recommends: Repair Windows Errors & Optimize Windows Performance



You've gotten a couple of really useful replies, IMHO, regarding optimizing specifically via a cache, as you've requested. Though, personally it seems to me that unless you have a very specific need to reuse the enumeration in completely different places in the code, it might be better to just have the _client_ of the enumerator cache the results as needed.

But even so, I think the basic question has been answered pretty well. So here are a couple of comments that digress a bit, but which IMHO should still be useful...

First, the specific code you posted:

[...]
public IEnumerable<KeyValuePair<int, string>> GetValues()
{
foreach (string someThingToFind in _list)
{
string savData = _data;
int cpt = 0;
int pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
while (savData.Length > 0 && pos >= 0)
{
cpt++;
savData = savData.Substring(pos +
someThingToFind.Length);
pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
}
yield return new KeyValuePair<int, string>(cpt,
someThingToFind);
}
}

I may be missing something, but the above code appears to me to only return an iteration instance once for each word in your search list. This may be your intended semantics, but if so it's my opinion that returning a KeyValuePair isn't appropriate.

Actually, I don't think returning a KeyValuePair is appropriate in any case. The reason why is that semantically, a KeyValuePair implies a key that is unique and a value that is associated with that unique key. In this case, your "key" isn't unique at all as near as I can tell; it's simply the number of times a specific word showed up in your string.

I understand the temptation to use the KeyValuePair since it _seems_ to encapsulate the very information you're returning (a pair of related values). But IMHO it would be much better for you to just define your own generic struct that does basically the same thing.

Alternatively, at the very least it's my opinion that the key in the KeyValuePair should be the string and the count be the value. It seems much more likely to me that your string is the unique element, rather than the count.

Why would you care? Well, IMHO the biggest risk is that a KeyValuePair could be used to construct a Dictionary or similar object directly, and this would fail as soon as the key -- that is, the integer count -- showed up more than once.

I think there are other semantic reasons to fix this part of the code, but the fact that there's a genuine potential risk for future uses of the code is probably the only reason one would need to justify fixing it.


My second comment has to do with the "optimization" aspect itself. IMHO, caching the results is a "sort of" optimization. That is, yes it has the potential for improving performance, but it doesn't really change the efficiency of the algorithm.

I note that this algorithm searches the string once for each word in your list of words. For a very small number of words, this might not be too bad, but as the number of words goes up, it can dramatically inflate the cost of the algorithm.

The idea of caching the results can mitigate that problem somewhat, but given that the algorithm is basically O(N^2), if you can change the algorithm to something more efficient, you could wind up getting a HUGE performance increase even without caching.

It depends somewhat on the number of words, the length of the string, and the number of times you'd enumerate the results. But assuming that the number of words is more than the number of times you'd enumerate the results, an O(N) enumeration executed multiple times will still be faster than an O(N^2) enumeration executed once and cached for future use.

The exact break-even point would depend a little on the specific performance and how many times you need to do the enumeration for a given list of words. But it seems to me that in any situation where you are actually concerned about performance, that means the number of words, length of string, and/or number of enumerations is relatively high, and these are the situations where the disparity between an O(N) algorithm and an O(N^2) algorithm can be very significant.

Of course, if you cache the results _and_ fix the algorithm to be better than O(N^2), you stand to gain even more.

So, what's an O(N) algorithm to accomplish the same thing? A common one used for string searches such as yours would be a state machine implementation, where the nodes of the state graph are connected according to the letters in the search words.

You traverse the graph by extracting a single character at a time from the string being searched and using that character to determine which node to move to; if the character has no matching link to a subsequent node, you return to the root of the graph. The graph will have "terminal" nodes representing the end of a search word; if and when you reach one of those, you know you've found a word.

The exact implementation depends on your needs. The basic implementation suffices in many cases, but if it is possible for your search words to overlap, that can be addressed as well.

Using a state graph requires a small amount of initialization to take place, and for relatively small tasks this initialization could cost more than the overhead of a less-efficient algorithm. But in those cases, usually you don't care about performance anyway. For larger tasks where performance is actually a concern, the initialization is relatively minor compared to the benefits of being able to go through a string only once while still recovering every single word you're looking for.

So, that's a long way of saying, if you are really serious about optimizing the code, what you need is a different algorithm. You can mitigate the problems of the O(N^2) algorithm you've got somewhat, but in the end you've still got an O(N^2) algorithm. It will never scale well, no matter what you do to it.

For what it's worth, there are other forms of search algorithms that are similarly efficient. I just mention the state graph because IMHO it's relatively easy to understand _and_ implement. You might want to look at other search algorithms; pretty much anything designed specifically for matching search terms to an input string will be better than the algorithm you've got now, and it's entirely possible you could find something that would suit your specific needs better than the general-purpose state graph solution.

Pete
.



Relevant Pages

  • This Weeks Finds in Mathematical Physics (Week 226)
    ... The first week they had lots of talks on "higher-dimensional rewriting", ... to find two files that give the same bit string. ... Now, if you're a mathematician, the whole idea of a cryptographic ... You can see the algorithm for this ...
    (sci.math.research)
  • This Weeks Finds in Mathematical Physics (Week 226)
    ... The first week they had lots of talks on "higher-dimensional rewriting", ... to find two files that give the same bit string. ... Now, if you're a mathematician, the whole idea of a cryptographic ... You can see the algorithm for this ...
    (sci.physics.research)
  • Re: Regular Expression for validating a url field
    ... (Do not use the tab character for indentation, ... against a supposed string value, and since you do not do perform a strict ... If x is NaN, return false. ... `false' is returned to the calling algorithm, ...
    (comp.lang.javascript)
  • Re: Attention Sean - question about CSI
    ... the best compression algorithm you have. ... It should compress very ... which compresses that string to a single bit. ... from the binary code, and use it to decompress the data, then ...
    (talk.origins)
  • Why are reals uncountable yet algorithms countable (long)?
    ... If you are going to say: construct a number from one digit ... you cannot enumerate the infinite digits.I have read that, unlike reals, ... one of these algorithms would also be an algorithm ... enumeration of algorithms cannot produce all reals? ...
    (sci.math)