Longest Common Subsequence and NumPy

Perl may be crafty and efficient like a ninja, Ruby may be written like a prose or work of fiction, but, for most purposes, Python, with its simplicity and elegance, is usually my weapon of choice when it comes to programming languages. (To be frank, as long as it’s not some cryptic code like Fortran that should probably be waiting for the rain to wash it away, it floats my boat.) With my knack for mathematics, I had been reconstructing various equations and theorems from scratch in most of my scripts. Recently, I’ve begun to embrace NumPy to give me more functionality for purposes like matrices and arrays, but also that I can do all the things my MATLAB friends do without too much effort to learn extra languages.

(In fact, while we’re at it, let’s just put everything in Python! Python-Excel,  Python-sql, import everything!)

Back in my pancake post, I talked about how you can use a simple two-step algorithm for sorting out a string of numbers. In this post, however, I’m going to talk about sorting through two different string of letters to find their longest common subsequence. Unlike the challenge of the longest common substring, the subsequence need not consist solely of letters that are adjacent to one another, but can contain letters separated. So, this means that the longest common subsequence between “AACTTG” and “ACTGG” would be not be “ACT”, but “ACTG”.

I found this problem interesting because it gave me the chance to flex my NumPy muscles and look for more “indirect” ways of solving a problem rather than using a brute-force Ctrl+F-esque approach that wipes away your entire RAM when you search through strings longer than 30 characters.

Fret not! For we can construct a matrix of some sort to help us with this issue. In bioinformatics, we can solve this problem using a scoring matrix. By taking advantage of the set of possible DNA bases {A, C, T, G}:

Simply place your two DNA strings on the axes and move each number in the grid from the top-left to bottom-right . If there is a match in the base between the two strings at a certain location, add one to the number. Then follow your path form highest to lowest value. (source)

This is known as the Traceback approach, and we can optimize it further with hashes for lengths and in other ways.

I wrote a solution from this method (drawing heavily upon other sources) to solve this problem here, although I’m still fixing up some issues in it from converting between different syntaxes and formats.

Published by


Leave a comment