Re: Needleman-Wunsch - producing more than the optimal solution -- please help
- From: "Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx>
- Date: Wed, 17 Sep 2008 15:33:57 -0700
On Wed, 17 Sep 2008 07:53:41 -0700, almurph@xxxxxxxxxxxxx <almurph@xxxxxxxxxxxxx> wrote:
I want to modify the Needleman Wunsch algorithm (pseudo-code here:
http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm) so that it
produces more than the optimal solution.
What I want is the optimal solution, second optimal solution, third
and so on do the chain.
Problem is though I don't know how to do this. Can anyone help
please? Any comments/suggestions/code-sample would be most
appreciated.
Caveat: until your post, I'd never heard of this algorithm, nor do I have any specific experience with the application (well, I've got friends with doctorates in genetics, bio-chem, micro-biology, etc. but that doesn't make _me_ an expert :) ).
Anyway, an issue I see is that it's not clear to me that the algorithm actually compares solutions to find the "optimal" one so much as it detects the best alignment of two inputs based on a pre-computed weighting matrix. Once you have the final "F" matrix, you can work your way backwards to compute a new chain that includes pieces of the two inputs, plus gaps. But, the moment you start talking about an output that is something other than this one "maximally scored chain", the number of possibilities are huge, with no immediately obvious way to distinguish between them.
Consider a computation that takes a list of unique integers and returns the maximum value. Then, obviously for a list of length N, you can compute the maximum, and the highest value less than the maximum, and the highest value less than that, etc. Any such solution effectively winds up ordering the list, and in fact the most obvious solution is simply to sort the list.
But in the Needleman-Wunsch algorith, what would your ordering be? The final result isn't a single maximum value among multiple possibilities. It's a sequence of choices along a chain of possibilities. We can easily talk about the sequence that involves choosing the maximal choice at each possibility, because at each choice, there's only one answer. But the moment you allow for the choice of something other than the maximal choice, you have to deal with every permutation possible, and evaluate those permutations in some way.
At the very least, we'd need to know just what evaluation metric you intend to use to describe the "optimal", "next-optimal", "next-next-optimal", etc. solutions. Is it as simple as the score in the lower-right corner of the F matrix? Or is there some other criteria?
And frankly, I suspect that even if you have a well-defined metric, the solution could become computationally VERY expensive, if not impossibly so (think NP-complete :( ).
My instinct tells me that the problem probably avoids the NP-complete issue, by virtue of only connecting nodes in the problem space to adjacent nodes. But I haven't gone to the trouble of proving that for sure, and even if it's true, you're still looking at taking what's essentially an O(N^2) algorithm and making it even worse.
One naïve thought that comes to mind is that as you process the matrix, you could compute an "error" value between the three possible choices at each element, storing data about that element (i.e. location and the non-optimal choice associated with the specific computed error) in a list sorted by the error. Each element would generate two of these "error-sorted" list elements, each representing one of the choices _not_ taken for a given F matrix element.
In that way, you could after finding the maximal solution for the whole matrix, go back and replace the maximal choice made at the lowest-error element with the choice associated with that lowest-error. In other words, the ordering of solutions becomes the ordering defined by your list of errors for the non-maximal choices made.
The problem with that strategy that I see is that at every element, the choice you made affects later choices. So even if you start by replacing the lowest-error choice at one element, that could lead to higher-error choices later in the processing of the matrix. Not only would you, after redoing the choice, have to recompute the rest of the "F" matrix that depended on that choice, there's no guarantee that when you're done, the score in the lower-right corner that results from that replacement would indeed be the next-lowest score possible in the matrix.
If it's true that there is no such guarantee, then the only approach that is apparent to me is to compute _all_ of the possible F matrices for each possible choice at each element. Then, you can order the final matrices by the score in the lower right corner.
This is easy to code, but expensive to run and memory-intensive. The original algorithm is already relatively slow (O(N^2) as mentioned above), and it only outputs a single F matrix. If at each matrix element, we wind up creating three new possible F matrices and we have N^2 elements, then that's 3^(N^2) F matrices, each taking N^2 to compute. The exponential cost dominates the polynomial cost, so you wind up with O(3^(N^2)) (anyone feel free to check my math :) ). Granted, it's not O(N!), but it's still pretty bad.
So, to sum up:
-- It's possible you can do some ordering of "paths not taken" as the initial F matrix is generated, and then generate new F matrices after the fact based on that ordering. The problem with this approach is that I haven't mathematically proved that it'd actually generate what you seem to be asking for. I haven't even proved that doing so wouldn't produce wildly incorrect answers. After all, a small change in one step of the evaluation could result in a very large change later on in the evaluation.
-- Assuming you've got a well-defined ordering of the F matrices based on the computed score in the lower right corner, then the brute-force approach would work, albeit very expensively. For very short inputs, this might be fine. But inputs even with as many as a dozen elements could start to tax even a modern PC, and inputs with hundreds or thousands of elements could be so expensive to compute that it's just not practical to do so.
Because later choices depend on earlier choices, and because scores accumulate, it's possible that there's some way to take the first approach and guarantee that it produces the correct results. I just don't see an obvious route to that solution at the moment. For your sake, I hope that the above at least provides some food for thought, if not enough to chew on that can lead to a correct-but-computationally-practical solution.
Pete
.
- Prev by Date: Re: Covert C# to C
- Next by Date: Re: Filestream to Memorystream
- Previous by thread: DataView's ToTable
- Next by thread: RE: COM
- Index(es):
Relevant Pages
|