Tag: Science

The Uncertainty of Stochastic Models and Human Mortality

Stochastic models help us make predictions about events that involve uncertainty. We can use them to do cool things like predicting the levels of noise in gene expression [1]. The randomness of genetic mutation, epigenetic factors, and other biological mechanisms that influence gene expression isn’t something we have to treat as a black box that we can never know. Not only are these models a truly remarkable demonstration of concepts from theoretical physics in the messy world of biology, but I also love how they incorporate the epigenetic factors that we previously deemed “unpredictable” on the scale of gene expression.

There are some things we can’t really know with much certainty, though. Death is one of them.

My grandparents are the coolest people I know. Growing up in a household with my parents and grandparents is like living in a time capsule. But the message I get from them is not as clear as you might think. My grandpa and my grandma are about the same age, but, if you had met both of them, you would never have guessed they shared similar life stories. When you visit my house, you’ll find my grandpa in his room, where he spends most of his time watching TV and reading books. But you might not meet my grandma at all, because she’s always spending time with the neighbors, working in the garden, or swimming. (Yes, swimming. My grandma swims. Usually, for four hours a day.) To me, it always seemed like my grandpa accepted his poor health as a harbinger of the end of his life while my grandma wanted to punch death in the face. I always admired both of them.

My understanding of the world is shaped not only by what my parents have to offer, but also by what my grandparents have to offer, and I’ve always had a tremendous amount of respect for the elderly. Aside from the unique experience and wisdom that come from their long, meaningful lives, the contrast between the way my grandpa and my grandma view their roles in life raised questions about how we should address issues in elderly care. In particular, end-of-life issues, such as predicting the risks of certain treatments for fatal diseases and judging the quality of life of the patients who undergo them, were reflected in my own home.

After reading Atul Gawande’s new book, “Being Mortal,” it has become more and more apparent to me how much we need to re-evaluate the way we care for the elderly and address these end-of-life issues. (Dr. Gawande was actually one of the people who inspired me to take an interest in medical ethics.) From my own perspective of living with my grandparents, the idea of sending the elderly off to nursing homes and senior care facilities has always been completely foreign and horrendous to me. Still, both approaches have their own benefits that we should try to embrace, and these types of living facilities are becoming more common in most parts of the world. The markedly different cultures of my generation and the generation before have helped me realize that, as human beings, we can all view death not as something to be avoided with no regard for how we live in the meantime. I hope that, in the future, I can tackle these problems in an ever-changing world to make the world safe for the elderly. (But hopefully they won’t be the same problems that I will face when I approach the end of my life.)

[1] Raser JM, O’Shea EK (2004) Control of stochasticity in eukaryotic gene expression. Science 304: 1811–1814 http://www.sciencemag.org/content/304/5678/1811.full

November 30, 2014

Medicine, Science
Float like a butterfly; sting like a bee.

Your hands can’t hit what your eyes can’t see.

For this reason, we have to be careful when feeding numbers into our computers. Check out what happens when you ask Python a basic math question:

Pictured: the folly of man

Huh, that’s strange. Where did all those zeros come from? It turns out that machines use binary to represent numbers, fractions included. This means that, for the number .13, instead of summing 0/1 + 1/10 + 3/100, the computer must use 0/2 + 0/4 + 1/8 + … and so on in powers of two, and most decimal fractions can’t be written exactly that way with a finite number of bits. (Python calls this kind of number a float.) Make sure you keep this in mind when working on mathematically intensive projects.
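Here is a minimal sketch of that effect using only the standard library. The variable names are mine, and 0.13 is just the example number from above; the point is to look at the value the machine actually stores.

```python
import math
from decimal import Decimal
from fractions import Fraction

x = 0.13

# Decimal(x) prints the exact value of the binary float closest to 0.13.
stored = Decimal(x)
print(stored)

# The stored value is a fraction whose denominator is a power of two.
ratio = Fraction(x)
print(ratio.denominator)

# Rounding errors like this are why exact equality between floats is risky.
print(0.1 + 0.2 == 0.3)

# Comparing with a tolerance is the safer habit.
print(math.isclose(0.1 + 0.2, 0.3))
```

If you need exact decimal arithmetic (money, for example), Decimal built from a string such as Decimal("0.13") avoids the binary rounding altogether.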

Pi is an oddity. It’s never-ending just like our efforts to calculate it. At the same time, it’s always nice to appreciate how there are so many different ways to calculate pi. For example, you could drop needles. And, sure, we can always measure the ratio of the circumference of a circle to its diameter, but how do we tell a computer, a fundamentally deterministic and causal entity, how to calculate pi? Let’s ask two different programming languages and see what they have to say.

Python:

import numpy as np
pi=4*np.arctan2(1,1)

Fortran:

double precision pi
pi=4.d0*datan(1.d0)

(Notice the “d” in the Fortran statement. It marks the number as double precision, and the digits after it give the power of ten by which we multiply the number, so 4.d0 means 4.0 × 10^0.)

It appears Python and Fortran are in consensus about this one. If you use a bit of intuition when reading the two statements, you might be able to tell that they are both defining pi as four times the angle created from arctan 1. But what value of pi do they both actually give us?

Python: 3.1415926535897931
Fortran: 3.1415926535897931

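According to Value of Pi, pi is 3.1415926535897932384… by several different computational techniques. Both languages match it to sixteen significant digits, which is about all a double-precision number can hold. Looks like it’s a tie for this round.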
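If you want to see where the last digit goes astray, here is a small sketch with Python's standard math module instead of NumPy (my substitution, to keep it dependency-free). It prints the same arctan-based value and then the exact number the machine stores.

```python
import math
from decimal import Decimal

# Same definition as above: four times the angle whose tangent is 1.
pi_from_atan = 4 * math.atan(1.0)

# Sixteen digits after the decimal point, like the outputs above.
print(f"{pi_from_atan:.16f}")

# The exact value of the stored double; it drifts from the true pi
# after roughly sixteen significant digits.
print(Decimal(pi_from_atan))

# It agrees with the library constant to within floating-point tolerance.
print(math.isclose(pi_from_atan, math.pi))
```

So the trailing "1" is not a mistake in either language. It is simply the nearest double-precision number to pi.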

I’ve found that working with two different languages on the same project forces you to really understand the syntax and meaning behind each of the languages. With my example of Fortran and Python, we can really see the difference between the two.

October 22, 2014

Science
Programming for Particle Physics – Monte Carlo simulations and Markov Chains

Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

I’ll tell you a story from my ongoing adventure in physics.

Though I do a lot of bioinformatics research, I’ve always secretly loved physics more than any other subject. (Maybe it’s not much of a secret since I am a physics major.) I never really saw a reason why I couldn’t do research in both fields while I was an undergraduate, so I decided to just follow wherever my heart leads me in whatever field I enjoy the most. That’s why I do research in both fields. Recently, I’ve joined a project at a computational physics lab here at IU in the Fox Lab. As excited as I am to get into this research, it requires a very deep level of understanding of mathematics and abstract physics concepts, unlike my research in biology.

From my weeklong excursion to Canada last summer. (Kitchener uptown)

Imagine that, one fine, beautiful night, around 1 or 2 in the morning, you’re walking home from a party. Unfortunately, you had a bit too much fun, so you stumble down the street and struggle to see the road in front of you. Instead of doing the right thing and calling for a friend or trusted adult to help, you decide to latch yourself onto a streetlamp until you start really feeling it again. Now, as you stand by this streetlamp, every now and then, you gather your willpower and courage in order to let go and start walking again. But you find yourself meandering aimlessly without any control whatsoever, and occasionally you end up hitting your head on the streetlamp.

The moral of the story is to study mathematics. Especially if you want to be a physicist.

This drunken behavior is a random walk. And it’s similar to the basis for the Markov Chain, in which the probability of the next event depends only on the present conditions, not on the events that came before. It basically accounts for randomness in different sciences and in processes such as Brownian motion.
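To make the drunk's stroll concrete, here is a minimal sketch of a one-dimensional random walk. The function name, the unit step size, and the seed argument are my own choices for illustration. Each new position depends only on the current one plus a coin flip, which is exactly the Markov property described above.

```python
import random

def random_walk(steps, seed=None):
    """Return the list of positions visited, starting at the streetlamp (0)."""
    rng = random.Random(seed)  # a seed makes the walk reproducible
    position = 0
    path = [position]
    for _ in range(steps):
        # The next position depends only on where we are now,
        # not on how we got here.
        position += rng.choice((-1, 1))
        path.append(position)
    return path

walk = random_walk(20, seed=1)
print(walk)
```

Every time the walk returns to 0, our drunk friend has bumped into the streetlamp again.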

Brownian Motion (Source)

A while ago I made a blog post about Comparing Genome Alignments in which I touched on a few models for creating alignments from given genome inputs. What this algorithm uses is a Markov Model which creates nodes that string together to form full-fledged genomes. Markov Models work by assuming the Markov property that I just described, and the Genome Alignment algorithm uses it to, well, align genomes of different species. But why limit yourself to biology? Why not extend it to particle physics?

And what could be more exciting than the world of particle physics? Making Monte Carlo simulations to study the decay and interactions of subatomic particles, of course!

We can probabilistically determine the energies and momenta of the particles in the following reaction:

γ + p → π+ + π− + π0 + p

This is how a photon and a proton collide to form three pions (plus, minus, and naught), along with a proton. When this collision happens, we want to know how the energies and momenta of the particles change. We can do a Monte Carlo simulation (a computational experiment that uses many random trials to give us the probabilities of different results) of this Markov process to get the probabilities of various outcomes of this interaction. We run thousands of trials with different collisions and take a look at the energies and momenta of the particles throughout each collision. This means that, according to our Markov property, the process is like a bunch of drunk people running into each other on the street.
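The particle simulation itself is well beyond a blog snippet, so here is a toy stand-in that shows the Monte Carlo idea on the drunk walker instead. Everything in it (the function names, the unit step size, the seed) is assumed for illustration. We repeat the random walk thousands of times and average the squared distance from the streetlamp, which for this kind of walk should come out close to the number of steps.

```python
import random

def final_position(steps, rng):
    """Run one random walk of unit steps and return where it ends."""
    return sum(rng.choice((-1, 1)) for _ in range(steps))

def mean_square_displacement(steps, trials, seed=0):
    """Monte Carlo estimate of the average squared distance after `steps` steps."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = final_position(steps, rng)
        total += x * x
    return total / trials

estimate = mean_square_displacement(steps=100, trials=5000)
print(estimate)  # should land near 100, the number of steps
```

One walk tells you almost nothing, but thousands of them together settle on a stable answer, and the estimate tightens as you add more trials.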

I like to think of the randomness of particle physics almost like the tumultuous waves of the ocean that rock back and forth. A single event doesn’t tell us much, but, together, they show beautiful patterns about the world we live in.

From my stay at Cornell University last summer

I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me. – Sir Isaac Newton

October 1, 2014

Science
The Importance of Hashing

People who are just beginning to code make a lot of mistakes and do a lot of stupid things. I once used to struggle with parsing through every line of a file before I learned that it would be much easier to use split functions and lists. One common mistake is to store large amounts of data in lists and then search through all of it every time you need to access a piece of information. But, with a bit more expertise, those newbies can throw away those “for” loops and sort methods. There’s a new kid in town. And his name is “Hashing.”

for i in range(len(s)):
    c+=1
    if c in indices:
        line=int(indices_start.index(c))
        for i in s[i:indices_end[line]+1]:
            s2+="N"

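The problem is that this takes forever and a half to run, because every lookup in a list has to scan the list from the start. But, if we just replace the two lists with a single hash (or dictionary), which we will name indices and which maps each start position to its end position, then the program runs like clockwork.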

for i in range(len(s)):
    c+=1
    if c in indices:
        line=int(indices[c])
        for i in s[i:line+1]:
            s2+="N"

Just goes to show how much of a difference code optimization can make.
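For anyone who wants to try the idea, here is a self-contained sketch. It is my guess at what the loop above is meant to do (replace each listed region of a sequence with "N"), so treat mask_regions, starts, and ends as hypothetical names and the 0-based, inclusive coordinates as an assumption.

```python
def mask_regions(s, regions):
    """Replace every region of s with 'N'.

    regions is a dictionary mapping a 0-based start position to its
    inclusive end position (assumed to be valid and non-overlapping).
    """
    out = []
    i = 0
    while i < len(s):
        if i in regions:              # dictionary lookup: no scanning
            end = regions[i]
            out.append("N" * (end - i + 1))
            i = end + 1
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

# Two parallel lists, as in the slow version...
starts = [2, 7]
ends = [4, 8]
# ...zipped into one dictionary for the fast version.
regions = dict(zip(starts, ends))

print(mask_regions("ACGTACGTAC", regions))
```

Building the output with a list and a single join also avoids growing a string one character at a time, which is another quiet source of slowness.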

END

(Excuse me for that last line. My Fortran is leaking.)

September 24, 2014

Science

Longest Common Subsequence and NumPy

Perl may be crafty and efficient like a ninja, Ruby may be written like prose or a work of fiction, but, for most purposes, Python, with its simplicity and elegance, is usually my weapon of choice when it comes to programming languages. (To be frank, as long as it’s not some cryptic code like Fortran that should probably be waiting for the rain to wash it away, it floats my boat.) With my knack for mathematics, I had been reconstructing various equations and theorems from scratch in most of my scripts. Recently, I’ve begun to embrace NumPy, both to give me more functionality for things like matrices and arrays and so that I can do all the things my MATLAB friends do without too much effort spent learning extra languages.

(In fact, while we’re at it, let’s just put everything in Python! Python-Excel, Python-sql, import everything!)

Back in my pancake post, I talked about how you can use a simple two-step algorithm for sorting a string of numbers. In this post, however, I’m going to talk about searching through two different strings of letters to find their longest common subsequence. Unlike the longest common substring, the subsequence need not consist solely of letters that are adjacent to one another; it can contain letters that are separated, as long as they stay in order. So, this means that the longest common subsequence between “AACTTG” and “ACTGG” would not be “ACT”, but “ACTG”.

I found this problem interesting because it gave me the chance to flex my NumPy muscles and look for more “indirect” ways of solving a problem rather than using a brute-force Ctrl+F-esque approach that wipes away your entire RAM when you search through strings longer than 30 characters.

Fret not! For we can construct a matrix of some sort to help us with this issue. In bioinformatics, we can solve this problem using a scoring matrix. By taking advantage of the set of possible DNA bases {A, C, T, G}:

Simply place your two DNA strings on the axes and fill in each number in the grid from the top-left to the bottom-right. If there is a match in the base between the two strings at a certain location, add one to the number. Then follow your path from the highest to the lowest value. (source)

This is known as the Traceback approach, and we can optimize it further with hashes for lengths and in other ways.
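Here is a minimal sketch of the scoring matrix and the traceback. This is the textbook dynamic-programming version, not the solution I link to below, and it uses plain nested lists so it runs without any installs (a NumPy array of zeros would slot in the same way). The names lcs and score are just for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = length of the LCS of a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                # A match extends the best result on the diagonal by one.
                score[i][j] = score[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better neighbor.
                score[i][j] = max(score[i - 1][j], score[i][j - 1])

    # Traceback: walk from the bottom-right corner back to the top-left.
    i, j = len(a), len(b)
    letters = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            letters.append(a[i - 1])
            i -= 1
            j -= 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

print(lcs("AACTTG", "ACTGG"))
```

The grid has one cell per pair of positions, so the work grows with the product of the two string lengths instead of exploding the way a brute-force search does.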

I wrote a solution from this method (drawing heavily upon other sources) to solve this problem here, although I’m still fixing up some issues in it from converting between different syntaxes and formats.

September 24, 2014

Science

Comparing genome alignment methods

One of my current projects in the Matthew Hahn Lab is to investigate the effectiveness of a few different full-genome alignment methods. My mentor and I have been studying a new program called progressiveCactus, and comparing its output to other alignment methods. By comparing the number of indels (that is, insertions and deletions) between different species, we can compare the effectiveness of different genome-alignment methods. But my work has mostly been spent struggling to figure out how to get programs to run, and deciding the best way to parse output files.

How does progressiveCactus work, you ask? When I tried to answer that question the moment I began working in the Hahn Lab, I couldn’t figure a thing out. After gaining much more experience in bioinformatics and analysis of complex systems, though, it has made more sense to me.

Circular genome plot

In order to allow multiple genomes to align to one another in any possible way, we can arrange them in a circular pattern, as shown above. This lets us create threads of different colors, in which each color represents a different sequence. The ends of the boxes (A1 and A4) are the telomeres, that is, the ends of the chromosomes. It’s easy to find reverse complements, similarities, and other neat features. When we combine all of the different circular genome plots this way, we can create “cactus” graphs.

Pictured: a cactus

From these chains, we can create entire networks upon networks to give us fully aligned genomes. Another program, progressiveMauve, has been shown to be very quick and effective with a small number of different genomes, and it has a very attractive GUI, as well. We’re focusing on the output from this program to compare to that of progressiveCactus.

Ever since I finished my work at Cornell, I’ve been much more confident and focused in my research at IU. I look forward to moving on to bigger and better things in research and elsewhere.

September 11, 2014

Science
Genetic Inversions, Bill Gates, and Pancakes

Imagine that you are a waiter running back and forth in your breakfast restaurant. Your life is constantly moving between the kitchen and the seating area in your usual “flow”. Most days you have to work very hard to make ends meet, so you don’t have time to sit back and smell the roses or rose the smells. It’s a shame that your work prevents you from studying the world around you through mathematics and algorithms. Every now and then, a guest orders a stack of pancakes, but, when the cook hands you the plate of pancakes, you’re a bit disappointed because the pancakes aren’t stacked by size.

what is this madness

What self-respecting philopancakist would tolerate such blasphemy? The proper pancake stack must place the largest pancake on the bottom, the smallest pancake on top, and fill the space between with the remaining pancakes in order of size. This is the only way you can pour syrup on it so that the syrup touches each pancake. It should be the responsibility of the cook to flip the pancakes in such a way that lines them up from largest on the bottom to smallest on top with everything making sense in between.

This raises the question: if someone gives you a randomly assorted stack of pancakes, how do you sort them by flipping them? Namely, what’s the most efficient way for us to look at a set of random numbers and sort them from least to greatest (or vice versa) by reversing different segments of those numbers? This is the Pancake-Sort problem, and the minimum number of flips needed is known as the reversal distance.

A man named Bill Gates proposed a solution to the Pancake-Sort problem. You can read it here.
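Before reading the paper, it helps to see the simple approach in code. This sketch is the basic two-flip routine, not the method from the paper: bring the largest unsorted pancake to the top, then flip it down into place, and repeat. The function name and the convention that index 0 is the top of the stack are my own choices.

```python
def pancake_sort(stack):
    """Sort a stack using only flips of the top k pancakes.

    Index 0 is the top of the stack. Returns the sorted stack
    (smallest on top) and the list of flip sizes used.
    """
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):
        # Position of the largest pancake in the unsorted top part.
        biggest = max(range(size), key=stack.__getitem__)
        if biggest == size - 1:
            continue  # already where it belongs
        if biggest != 0:
            # Step 1: flip the largest pancake up to the top.
            stack[:biggest + 1] = reversed(stack[:biggest + 1])
            flips.append(biggest + 1)
        # Step 2: flip it down to the bottom of the unsorted part.
        stack[:size] = reversed(stack[:size])
        flips.append(size)
    return stack, flips

sorted_stack, flips = pancake_sort([3, 1, 4, 2])
print(sorted_stack, flips)
```

This always works, but it makes no attempt to use the fewest flips, which is the hard part of the problem.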

What makes this problem more interesting is that it has applications in biology in the study of genetic inversions. DNA experiences a type of mutation known as an inversion, in which a segment of bases is reversed. This can occur within a small segment of a gene or across multiple genes.

We want to know how many inversions a certain gene or region of the genome has undergone because that tells us how old the DNA is or how much it has evolved. Two species that share a large reversal distance may have evolved farther apart than two that share a small reversal distance.

Perhaps ironically, mother nature’s rules for molecular biological interactions actually make this problem much easier to solve. In biology, we don’t have more than four different DNA bases, and our bases are actually aligned in such a way that there is a “forward” and “reverse” direction to each string. This means that each base must be aligned in the forward or reverse direction in order for that string to function properly. Taking these into account makes the problem simpler because we can restrict ourselves to aligning the DNA strings so that these conditions are satisfied.

When I first approached a simple version of this problem, I wrote a solution that would take the input string that needs to be sorted and judge potential inversions by their Hamming distance from the desired end sequence. By constantly following the potential inversion that had the lowest Hamming distance, we would hope to find the end result. (Hamming distance is the number of positions at which two aligned strings do not match. So “AAAG” and “AAAA” would have a Hamming distance of 1 since they differ by one base.) Basically, this approach would try to find the shortest way to get from the beginning to the end by seeing which inversion would match the end result the most, and repeating this process until the end result is reached. But, even intuitively, this approach would not necessarily find the end result in all scenarios. It may end up creating loops and traversing inversions that have low Hamming distances but do not follow the optimal path from the first string to the end. (You can see some of my solution here.)
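A rough sketch of one greedy step looks like this. It is a reconstruction for illustration, not the solution linked above, and the function names are hypothetical. It treats an inversion as a plain reversal of a segment, as described earlier, and ignores the forward and reverse strand details.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def invert(s, i, j):
    """Reverse the segment s[i..j] (inclusive) and leave the rest alone."""
    return s[:i] + s[i:j + 1][::-1] + s[j + 1:]

def best_inversion(s, target):
    """Try every single inversion and keep the one closest to target."""
    best = (hamming_distance(s, target), s)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            candidate = invert(s, i, j)
            distance = hamming_distance(candidate, target)
            if distance < best[0]:
                best = (distance, candidate)
    return best

print(hamming_distance("AAAG", "AAAA"))
print(best_inversion("ACGTA", "ATGCA"))
```

Repeating best_inversion until the distance reaches zero gives the greedy search, and it is easy to see how it can stall: if no single inversion lowers the distance, the function just hands back the string it was given.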

September 4, 2014

Science
Helping other students with Undergraduate Research Awards and Opportunities

My university recently featured me on the webpage of its new Office of Competitive Awards and Research for my recent REU at Cornell University. It’s definitely exciting to get press coverage. And it looks like REUs are definitely the gift that keeps on giving.

This new office at Indiana University is actually keeping in touch with me to promote research opportunities for other students at my university.

To give a brief background, when I entered Indiana University, I was so obsessed with science that I was almost desperate to join a research lab. After emailing around a few professors, I was offered a spot in the Matthew Hahn Lab to study bioinformatics. Soon enough, I helped a few of my friends get into labs, too, by giving them advice and instructions about how I did it. Later, during my freshman year, I was accepted to a full-time summer internship at Cornell University that paid for transportation, housing, and food, and gave a $5,000 stipend. From all of these experiences, I’ve compiled my advice and instructions into a guide from the beginning to the end.

If you’re a college student reading this blog and looking for research opportunities at your university or advice and tips about undergraduate research or you just want to read more about my experience, check out my “Scientific Research Guide” under “My Work” over there on the right.

As for now, I’m currently working on a new approach to my project in my lab.

September 1, 2014

Science
How do I prepare myself for research internships (or how do I do well in my research lab)?

When I entered college, I was obsessed with science and getting involved in research, but I didn’t know how to join a lab, let alone what exactly it was I wanted to research. I was particularly fascinated by understanding biology through math, so I joined a bioinformatics lab.

When I joined, I was overwhelmed. I had never programmed anything before in my life. But I put hours and hours of effort into learning the programming skills and research techniques throughout the semester, and, before the end of my freshman year, I was accepted to a bioinformatics REU at Cornell University. It didn’t come easy, but I’d like to share some thoughts on excelling in your research so you can find new opportunities.

Once you’ve found your research area of interest, identify what it takes to get better and ambitiously develop the skills that will make you look attractive to internship admissions officers. Most students spend 10-15 hours a week doing research. Make sure that you are conscious of what you are learning from your research so that you can reflect on it later. This will allow you to write amazing essays when it comes time to apply for internships. Find whatever skills will make you better. Practice any relevant lab technique you can get your hands on. You don’t need some glorious goal of publishing a paper or winning a Nobel Prize when you first start out. Your first step is to gain the fundamental skills you need to build more advanced research skills.

But you can’t just have the skills and go anywhere with just that. To be truly successful, you need to develop a purpose. This is important, not only to keep you happy and motivated while you work, but also to give you something to discuss for your application and essays and interviews in the future. Finding a purpose in your research work can help you decide if it’s really right for you. But “finding a purpose” sounds so abstract and generic that it’s hard to know what it really means. When confronted with the harsh, unforgiving reality of research, your idealistic goals might get shot down. One thing that I’ve found to be effective is to make the search for meaning something…meaningful.

I’ve found that self-reflection on my work between periods of intense focus has helped me actualize my own thoughts. When I say “self-reflection,” it could mean writing down your thoughts and information in a journal or blog as you work. That way, you can reflect on your experiences until you cultivate a purpose behind what you do. If you can, keep a notebook wherever you go (separate from your lab notebook) and write down any new ideas, questions, or hypotheses you might want to answer in the future. I work in a bioinformatics lab, meaning I do absolutely no work that doesn’t involve a computer. (My lab members don’t even use lab notebooks.) However, I still keep a physical lab notebook for penning down my ideas. When I write down whatever comes to my mind, I can look back on it later. This has not only helped me stay focused during the day, but also helped me put the work that I do on a daily basis into a broader context. I read back on my thoughts for inspiration and reflection for improvement. For example, you might say you want to study how species evolve over time, but, in your lab, you’ve only run a few different cell genomics assays. If you can write down some other ideas you have, you can maybe develop some new ideas and questions over time that help you create something yourself. Give it time. Give it thought. And, perhaps someday, the light bulb will spark.

Remember that careful and meticulous practice is always better than losing yourself in the “flow” of your daily routine. Remember that what you get out of your work isn’t necessarily just how many hours you spend working in your lab, but, rather, a combination of the number of hours and the intensity of your focus in those hours.

August 18, 2014

Science
Unplug and Recharge – my poster presentation and what I’ve learned from my internship

I’ve officially finished my research internship at the Boyce Thompson Institute. After several stressful nights of analyzing my data, putting it into a readable form, and drawing conclusions, I whipped together a poster that shows my results. From my RNA-Seq analysis of the tomato genome, I collected a lot of information about lncRNAs and cisNATs and their involvement in the ripening of the tomato.

I’ve learned a lot of things about science and life over the past several weeks. Among the things that I’ve learned:
1. Scientists are a bunch of crazy people. And that’s cool. Everyone I met was crazy about what they do. It takes a true genius to be excited by making problems more and more complex.
2. Everyone is very friendly. Scientists are quite happy to discuss any scientific work with you. And most people who are older than you only want to see you get better every day.
3. Obsessing over work isn’t worth it. You do need to make time to have fun every now and then. A few students (myself included) would obsess over their work by putting as many hours as possible, but end up feeling dead both physically and mentally at the end of the day.
4. Apart from having fun, you need time to self-reflect. Meditate. Walk. Breathe.

At the beginning of this internship, I really didn’t have much of an interest in plant biology. In fact, I was still iffy about biology as a whole. I’d be lying if I said that I’ve done a complete 180-degree turnaround and now I’ve been enlightened, but I have gotten a better idea of what biology and bioinformatics research is like. What I mean is, while I am much more interested in plant biology than I ever was before, I do not want to limit myself by specializing in it.

Here is my beautiful poster summarizing my work.

After giving some thought to bioinformatics and the direction the field is heading, I’ve decided that I don’t want to pursue a PhD or further study in bioinformatics after my college career. This is mostly because, among other reasons, bioinformatics is moving towards having people who are skilled at using the programs to study biology rather than people who want to build software and systems. I am much happier with the latter, so I will continue to jump from field to field in the pursuit of knowledge. As for happiness, I don’t think that’s something I need to pursue to have.

My sophomore year at Indiana University-Bloomington starts in two weeks, and it’ll carry many, many surprises and new discoveries that are only waiting for me to find them.

August 11, 2014

Science