Unplug and Recharge – my poster presentation and what I’ve learned form my internship

I’ve officially finished my research internship at the Boyce Thompson Institute. After several stressful nights of analyzing my data, putting it into a readable form, and drawing conclusions, I whipped together a poster that shows my results. From my RNA-Seq analysis of the tomato genome, I collected a lot of information about lncRNAs and cisNATs and their involvement in the ripening of the tomato.

I’ve learned a lot of things about science and life over the past several weeks. Among the things that I’ve learned:
1. Scientists a bunch of crazy people. And that’s cool. Everyone I met was crazy about what they do. It takes a true genius to be excited by making problems more and more complex.
2. Everyone is very friendly. Scientists are quite happy to discuss any scientific work with you. And most people who are older than you only want to see you get better everyday.
3. Obsessing over work isn’t worth it. You do need to make time to have fun every now and then. A few students (myself included) would obsess over their work by putting as many hours as possible, but end up feeling dead both physically and mentally at the end of the day.
4. Apart from having fun, you need time to self-reflect. Meditate. Walk. Breathe.

At the beginning of this internship, I really didn’t have much of interest in plant biology. In fact, for biology as a whole, I was still iffy about it. I’d be lying if I said that I’ve done a complete 180-degree-turnaround and now I’ve been enlightened, but I have gotten a better idea of what biology and bioinformatics research is like. What I mean is, while I am much more interested in plant biology than I ever was before, I do not want to limit myself by specializing in it.

Here is my beautiful poster summarizing my work.

After doing some thought about bioinformatics and the direction the field is heading, I’ve decided that I don’t want to pursue a PhD or further study in bioinformatics after my college career. This is mostly because, among other reasons, bioinformatics is moving towards having people who are skilled at using the programs to study biology rather than people who want to build software and systems. I am much happier with the latter, so I will continue to jump from field to field in the pursuit of knowledge. As for happiness, I don’t think that’s something I need to pursue to have.

My sophomore year at Indiana University-Bloomington starts in two weeks and that’ll carry many, many surprises and new discoveries that are only waiting for me to be found.

Building a Heap

Binary heaps are binary trees that are sorted in such a way that the parent leaf is either greater or lesser than both of its child nodes. When the root node is the greatest and each parent is greater than its children, then it’s a max heap. In the opposite scenario, we have a min heap. When presented with an array, we simply line up each element in the array from left to right, top to bottom in a binary tree format. 

For example, the array [1, 3, 6, 5, 9, 8] would correspond to the following binary tree, sorted as a min heap:

When presented with the problem of turning an array into a max heap, my initial thought was “Why don’t you just sort the array from greatest to least? Surely that will satisfy the binary condition without any fancy schmancy algorithms or recursion.” 

Turns out, it doesn’t. 

In order to solve this problem, you need to individually check each node and guarantee it follows the heap rule (that is, whether it is greater or lesser than its children if it’s, respectively, max or min heap). 

Confused out of my mind, I consulted MIT Open Courseware’s lecture on the subject, and, after that, I was even more confused. Just to get the slightest idea of what I was dealing with, I decided to get a solution down in Xcode, whether it was great or faulty, and scripted out something like this:

1. Check every node in the tree if it follows the rule. If it doesn’t, switch the node with its parent.

2. Check the tree if each node follows the rule. If not, go back to step 1. If yes, then you win!

From a computational standpoint, the errors are fairly obvious. You’d have to check each node individually over and over again until you found the mistake, and then, you’d have to check the entire tree over again. 

What might make more sense would be to check each node if it follows the rule and, if it doesn’t, check the parents recursively until you climb to the top of the tree, checking for errors along the way. This will rearrange the nodes in such a way that it fixed more errors more efficiently. My original solution does not “bubble-up” through the entire tree. 

I’ll have to fix this sometime in the future. In the meantime, I’m still working on breadth-first searches, Dijkstra’s theorem, and sorting by reversals. Hopefully those will be the subjects I touch upon in the future as I get my gears in shape.

My solution to Building a Heap can be found at https://github.com/HussainAther/rosalind/HEA.py

Testing graphs for bipartiteness

Graphs are amazing.

Bipartiteness is a characteristic of certain graphs. A graph is bipartite if and only if you can color each node one of two two colors in such a way that no two adjacent nodes have the same color. Well, that’s the realization of the actual definition, which can be found here. The two-color definition is also the foundation for the algorithm I wrote that tests whether or not a graph is bipartite.

My solution can be found here:
https://github.com/HussainAther/rosalind/blob/master/BIP.py

de Brujin graphs and Velvet Optimiser

I’m working with the Velvet Assembler as part of my virus identification project. When I’m not trying to write a perl module to complement the already-bulky script files that I’m working with,  I like to do something that I usually wish most other bioinformatics scientists I’ve met would do as well: I like to delve into the mathematics behind the program.

Velvet Assembler constructs de Brujin graphs in a manner like so:

source

Confusing, right? Velvet uses the separated small reads to make overlapping regions that be analyzed for errors and compressed for simplicity. These small reads are constructed into a de Brujin graph, which contains a list of all the overlapping segments. See those arrows in the picture above? Those are edges. And the rectangles are our nodes. We have to tell the computer the rules of how to play the game of read alignment before it can go crazy in the sandbox, so we’ve decided on a directed graph that can follow certain parameters. When the ending of one region connects with another, it forms a “bubble” that can be corrected using the Tour Bus algorithm, which is similar to the Dijkstra’s algorithm, which we’ll get to in another post.

And that’s basically all I need to know about it for my research project. But, as with most mathematical oddities, that’s not as much as I want to know.

With the help of my friend, Rosalind, I was introduced to de Brujin graphs. To construct one, we could easily write a program that iterates through a list of k-mers from each given segment and, then, record the ones that overlap by all but one of their base pairs. (My solution for the Rosalind problem can be found here, by the way.) This raises a lot of questions about how to reconstruct an entire genome from the smaller segments. The favorite problem of every computer scientist, the Traveling Salesman Problem, (and in some cases, something a bit more unique) arises when we have to decide what is the best way to connect each smaller segment into a larger picture.

I hope to optimize the Velvet Assembler portion of the Virus Detection program with a greater understanding of the mathematics behind it. So far, it seems that all my work simply relies on my ability to use different commands and put them in script files. And, recently, I’ve been bogged down with error message after error message in the implementation of a self-written subroutine module. Sometimes you need to relax with some puzzles. Even if the pieces are all so big that they overlap with one another. 

Suboptimal Alignment Algorithm

import distance
b=“GACTCCTTTGTTTGCCTTAAATAGATACATATTTACTCTTGACTCTTTTGTTGGCCTTAAATAGATACATATTTGTGCGACTCCACGAGTGATTCGTA”
a=“ATGGACTCCTTTGTTTGCCTTAAATAGATACATATTCAACAAGTGTGCACTTAGCCTTGCCGACTCCTTTGTTTGCCTTAAATAGATACATATTTG”
c=0
da=“”
db=“”
for j in range(32,41):
    for i in range(len(a)-j):
        c1=0
        for k in range(len(b)-j):
            if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                c1+=1
                if c1>c:
                    c=c1
print c
c=0
da=a
db=b
a=db
b=da
for j in range(32,41):
    for i in range(len(a)-j):
        c1=0
        for k in range(len(b)-j):
            if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                c1+=1
                if c1>c:
                    c=c1

 

print c

This code takes two DNA strings that share some short inexact repeat of 32-40 bp that may appear with slight modifications (each repeat differ by ≤3 changes/indels).

As you can see, this code is far from perfect. For one, it involves hardcoding in the sequence (a bad habit of mine). For another thing, anyone human being with at least half a brain will notice that this is essentially the same few lines of code repeated. What could be uglier than seeing the same thing twice? We could easily clean this up:

def func(a, b):
    for j in range(32,41):
        for i in range(len(a)-j):
            c1=0
            for k in range(len(b)-j):
                if distance.hamming(a[i:i+j], b[k:k+j])<=3:
                    c1+=1
                    if c1>c:
                        c=c1
    return c
print func(a, b)
 
print func(b, a)

Ahh, much better. I wanted to post it as an example of the importance of using functions, even if a script will only be run once or, perhaps, never at all. Now, if only we could do something about that hardcoding habit of mine…

Many of my scripts can be found on my Github page.

Perl

Ever since I got to Ithaca at the Boyce Thompson Institute, I’ve had to learn how to program in Perl, a  less elegant version of my weapon of choice, Python.

An underscore? Next, you’ll be telling me to use goto and watch the world burn.

My computer-esque tasks have included debugging a script that is used in genome analysis and moving parameters from bwa-align to bowtie2. It’s very tedious and mind-numbing to sift through all the scripts that I’ve been provided. I simply cannot absorb all the information necessary. This past week was especially brutal, given that it was the first week of Ramadan and, therefore, I’ve been running on empty. No, literally, running on empty. But I don’t want to be a drama-queen about fasting. This happens every year. Though my productivity has dropped from ~40 hours/week to ~30 hours this past week, it’s worth mentioning that I have been able to concentrate on certain tasks more, even if it means working on them for a smaller number of hours.

Anyways, about the research, identifying the specific exonic regions of the tomato genome requires novel ways of sequencing, namely, using multi-exonic sites. (This article that I linked to is not part of our research, but, rather for reference about the exotic sites of RNA).

Where do the introns go in the winter?

But my work isn’t tedious because it’s on a computer, nor because I have to learn a lot of different protocols and techniques. My work is tedious because it involves a lot of careful, deliberate button-pressing in editing and running software until we can get our program to do what the thing to do what we wanted it to do in the first place. I crave something more analytical, more innovative, but it makes sense that you have to start somewhere with beta testing.

Some things on my bucket list for the summer:
-Begin studying mathematics to prepare for physics next year. (My math skills have deteriorated since I haven’t taken a math class since junior year of high school.)
-Begin studying for the MCAT.
-Grind my nose to the concrete in research.
-My conscience: “That’s it. no more commitments!” Me: “But…but, I want to write a novel!” My conscience: “No, you need to focus.”

Despite my lofty academic goals, I’ve already begun to struggle in my classes. This past semester, my GPA dropped to a 3.6, after a stellar 3.9 during my first semester. There were a few things going on this past semester that were out of my control and definitely caused unnecessarily stress and a few other things here and there that I really made decisions about, but, ultimately, I just want to focus on improving and getting better. For one thing, I had to miss a biology test when I was in the hospital due to breathing problems and chest pain. My professor wouldn’t let me retake it, despite the policy of his syllabus. I spoke to an ombudsman, and he told me that, though he believes I should be able to retake it, there was nothing I could do to fix it. It was adding insult to injury, and my biology grade dropped from an A/A- to a B+.

Goals for the fall semester:
-Begin the student organization (that I can’t talk more about right now)
-Begin the Medical Ethics Circle as part of MAPS
-Focus on a few classes, and get more work done
-“Re”,”search”, hippie, hippie to the hip hip hop you don’t stop!

Let the world know that Hussain Ather will rise to his true academic potential.