Summer Research Webpage

Daniel Yehdego

                         Project Description

input sequence. The program conducts an exhaustive search for the optimal structure with the lowest free energy and has the capability to predict rather complex structures, even some non-planar structures, for short RNA segments up to 200 nucleotides in length. To overcome the tremendous demand on computing resources needed by pseudoknot prediction, alternative algorithms have been proposed (e.g., Pknots-RG, ILM , and HotKnots) that tend either to restrict the types of pseudoknots or the length of the secondary structure to be predicted to keep runtime and memory size under control. For instance, the program Pknots-RG [9] limits the types of pseudoknots to simpler structures for longer segments, up to 800 nucleotides. However, a large variety of pseudoknots have been shown to occur in nature. Their omission from computational methods may significantly affect the prediction accuracy. Furthermore, even the simplified programs are not able to predict secondary structures on the order of thousands of nucleotides.

   Our analysis of the length of pseudoknots in a widely used database known as PseudoBase, which collects 245 pseudoknots, shows that most pseudoknots known to date are formed by RNA segments whose lengths are less than 200 nucleotides, i.e., 95% of the segments in the database have lengths that range between 20 and 200 nucleotides. The range of lengths between 30 (lower quartile) and 67.5 (upper quartile) nucleotides covers 50% of all segments. This observation leads to the idea of overcoming the computing constraints presented above by developing a strategy for cutting long RNA sequences into segments not longer than 200 bases in length and distributing the task of structure prediction of each segment to be done simultaneously on different computers. Ideally, if two segments that are cut from the same RNA sequence overlap each other, then the predicted structures of their overlapping part should be consistent with one another. Such consistency is important for the final structure assembly. In preliminary work, we observed that arbitrarily cutting the RNA sequence into overlapping segments is not advantageous for consistency. It is well conceivable that when an arbitrary cut goes through the middle of a close inversion, the bases forming the pairings do not get into the same segment, resulting in the omission of the structure from that prediction. For instance, consider a 100 base segment of the Severe Acute Respiratory Syndrome (SARS) coronavirus genome, which is one of the coronavirus genomes analyzed, from position 25,884 to 25,983 and another segment from position 25,923 to 26,022. When the program Pknots-RE is applied to these segments, two predictions are produced which are shown in Fig. 3. Note that, over the stretch of the 62 bases that the two segments overlap one another, the two predictions are different. This kind of inconsistency poses a serious problem when the predicted structures of the segments need to be assembled. If the prediction of secondary structures for long RNA sequences is not feasible with thermodynamic methods even with powerful supercomputers, arbitrarily cutting an RNA sequence into shorter overlapping segments makes the single segment predictions feasible but not advantageous for consistency, unless the cutting algorithm uses the locations of high concentration of close inversions. In addition, once motifs are predicted from sampled segments, they have still to be assembled together into complete secondary structures. Both predictions and rebuilding can significantly benefit from the combined prediction capability of different codes, as opposed to using the single codes separately. Prediction of large numbers of short, overlapping segments is still computationally demanding but it also allows massive task parallelism that can be addressed with grid computing resources.

 

   The above bioinformatics application requires both processing of huge amounts of data and heavy computation. Fulfilling these requirements calls for simple ways to implement parallel computing. MapReduce is a general purpose parallelization model that seems particularly well-suited to the task of RNA secondary structure prediction of thousands nucleotide lengths. With an open source implementation (Hadoop), my objective is to implement a MapReduce/ Hadoop application for RNA secondary structures prediction.


Pages 1 2

 

                    a href="journal.html#week1">Week One

 

                    Week Two

 

                    Week Three

 

                   Week Four

 

                   Week Five

 

                   Week Six

 

                   Week Seven

 

                   Week Eight

 

                   Week Nine