de Brujin graphs and Velvet Optimiser

I’m working with the Velvet Assembler as part of my virus identification project. When I’m not trying to write a perl module to complement the already-bulky script files that I’m working with,  I like to do something that I usually wish most other bioinformatics scientists I’ve met would do as well: I like to delve into the mathematics behind the program.

Velvet Assembler constructs de Brujin graphs in a manner like so:

source

Confusing, right? Velvet uses the separated small reads to make overlapping regions that be analyzed for errors and compressed for simplicity. These small reads are constructed into a de Brujin graph, which contains a list of all the overlapping segments. See those arrows in the picture above? Those are edges. And the rectangles are our nodes. We have to tell the computer the rules of how to play the game of read alignment before it can go crazy in the sandbox, so we’ve decided on a directed graph that can follow certain parameters. When the ending of one region connects with another, it forms a “bubble” that can be corrected using the Tour Bus algorithm, which is similar to the Dijkstra’s algorithm, which we’ll get to in another post.

And that’s basically all I need to know about it for my research project. But, as with most mathematical oddities, that’s not as much as I want to know.

With the help of my friend, Rosalind, I was introduced to de Brujin graphs. To construct one, we could easily write a program that iterates through a list of k-mers from each given segment and, then, record the ones that overlap by all but one of their base pairs. (My solution for the Rosalind problem can be found here, by the way.) This raises a lot of questions about how to reconstruct an entire genome from the smaller segments. The favorite problem of every computer scientist, the Traveling Salesman Problem, (and in some cases, something a bit more unique) arises when we have to decide what is the best way to connect each smaller segment into a larger picture.

I hope to optimize the Velvet Assembler portion of the Virus Detection program with a greater understanding of the mathematics behind it. So far, it seems that all my work simply relies on my ability to use different commands and put them in script files. And, recently, I’ve been bogged down with error message after error message in the implementation of a self-written subroutine module. Sometimes you need to relax with some puzzles. Even if the pieces are all so big that they overlap with one another. 

Suboptimal Alignment Algorithm

import distance
b=“GACTCCTTTGTTTGCCTTAAATAGATACATATTTACTCTTGACTCTTTTGTTGGCCTTAAATAGATACATATTTGTGCGACTCCACGAGTGATTCGTA”
a=“ATGGACTCCTTTGTTTGCCTTAAATAGATACATATTCAACAAGTGTGCACTTAGCCTTGCCGACTCCTTTGTTTGCCTTAAATAGATACATATTTG”
c=0
da=“”
db=“”
for j in range(32,41):
    for i in range(len(a)-j):
        c1=0
        for k in range(len(b)-j):
            if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                c1+=1
                if c1>c:
                    c=c1
print c
c=0
da=a
db=b
a=db
b=da
for j in range(32,41):
    for i in range(len(a)-j):
        c1=0
        for k in range(len(b)-j):
            if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                c1+=1
                if c1>c:
                    c=c1

 

print c

This code takes two DNA strings that share some short inexact repeat of 32-40 bp that may appear with slight modifications (each repeat differ by ≤3 changes/indels).

As you can see, this code is far from perfect. For one, it involves hardcoding in the sequence (a bad habit of mine). For another thing, anyone human being with at least half a brain will notice that this is essentially the same few lines of code repeated. What could be uglier than seeing the same thing twice? We could easily clean this up:

def func(a, b):
    for j in range(32,41):
        for i in range(len(a)-j):
            c1=0
            for k in range(len(b)-j):
                if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                    c1+=1
                    if c1>c:
                        c=c1
    return c
print func(a, b)
 
print func(b, a)

Ahh, much better. I wanted to post it as an example of the importance of using functions, even if a script will only be run once or, perhaps, never at all. Now, if only we could do something about that hardcoding habit of mine…

Many of my scripts can be found on my Github page.

Perl

Ever since I got to Ithaca at the Boyce Thompson Institute, I’ve had to learn how to program in Perl, a  less elegant version of my weapon of choice, Python.

An underscore? Next, you’ll be telling me to use goto and watch the world burn.

My computer-esque tasks have included debugging a script that is used in genome analysis and moving parameters from bwa-align to bowtie2. It’s very tedious and mind-numbing to sift through all the scripts that I’ve been provided. I simply cannot absorb all the information necessary. This past week was especially brutal, given that it was the first week of Ramadan and, therefore, I’ve been running on empty. No, literally, running on empty. But I don’t want to be a drama-queen about fasting. This happens every year. Though my productivity has dropped from ~40 hours/week to ~30 hours this past week, it’s worth mentioning that I have been able to concentrate on certain tasks more, even if it means working on them for a smaller number of hours.

Anyways, about the research, identifying the specific exonic regions of the tomato genome requires novel ways of sequencing, namely, using multi-exonic sites. (This article that I linked to is not part of our research, but, rather for reference about the exotic sites of RNA).

Where do the introns go in the winter?

But my work isn’t tedious because it’s on a computer, nor because I have to learn a lot of different protocols and techniques. My work is tedious because it involves a lot of careful, deliberate button-pressing in editing and running software until we can get our program to do what the thing to do what we wanted it to do in the first place. I crave something more analytical, more innovative, but it makes sense that you have to start somewhere with beta testing.

Some things on my bucket list for the summer:
-Begin studying mathematics to prepare for physics next year. (My math skills have deteriorated since I haven’t taken a math class since junior year of high school.)
-Begin studying for the MCAT.
-Grind my nose to the concrete in research.
-My conscience: “That’s it. no more commitments!” Me: “But…but, I want to write a novel!” My conscience: “No, you need to focus.”

Despite my lofty academic goals, I’ve already begun to struggle in my classes. This past semester, my GPA dropped to a 3.6, after a stellar 3.9 during my first semester. There were a few things going on this past semester that were out of my control and definitely caused unnecessarily stress and a few other things here and there that I really made decisions about, but, ultimately, I just want to focus on improving and getting better. For one thing, I had to miss a biology test when I was in the hospital due to breathing problems and chest pain. My professor wouldn’t let me retake it, despite the policy of his syllabus. I spoke to an ombudsman, and he told me that, though he believes I should be able to retake it, there was nothing I could do to fix it. It was adding insult to injury, and my biology grade dropped from an A/A- to a B+.

Goals for the fall semester:
-Begin the student organization (that I can’t talk more about right now)
-Begin the Medical Ethics Circle as part of MAPS
-Focus on a few classes, and get more work done
-“Re”,”search”, hippie, hippie to the hip hip hop you don’t stop!

Let the world know that Hussain Ather will rise to his true academic potential.