Project Description
input sequence. The program conducts an exhaustive search for the
optimal structure with the lowest free energy and has the capability
to predict rather complex structures, even some non-planar
structures, for short RNA segments up to 200 nucleotides in length.
To overcome the tremendous demand on computing resources needed by
pseudoknot prediction, alternative algorithms have been proposed
(e.g., Pknots-RG, ILM , and HotKnots) that tend either to restrict
the types of pseudoknots or the length of the secondary structure to
be predicted to keep runtime and memory size under control. For
instance, the program Pknots-RG [9] limits the types of pseudoknots
to simpler structures for longer segments, up to 800 nucleotides.
However, a large variety of pseudoknots have been shown to occur in
nature. Their omission from computational methods may significantly
affect the prediction accuracy. Furthermore, even the simplified
programs are not able to predict secondary structures on the order
of thousands of nucleotides.
Our analysis of the length of
pseudoknots in a widely used database known as PseudoBase, which
collects 245 pseudoknots, shows that most pseudoknots known to date
are formed by RNA segments whose lengths are less than 200
nucleotides, i.e., 95% of the segments in the database have lengths
that range between 20 and 200 nucleotides. The range of lengths
between 30 (lower quartile) and 67.5 (upper quartile) nucleotides
covers 50% of all segments. This observation leads to the idea of
overcoming the computing constraints presented above by developing a
strategy for cutting long RNA sequences into segments not longer
than 200 bases in length and distributing the task of structure
prediction of each segment to be done simultaneously on different
computers. Ideally, if two segments that are cut from the same RNA
sequence overlap each other, then the predicted structures of their
overlapping part should be consistent with one another. Such
consistency is important for the final structure assembly. In
preliminary work, we observed that arbitrarily cutting the RNA
sequence into overlapping segments is not advantageous for
consistency. It is well conceivable that when an arbitrary cut goes
through the middle of a close inversion, the bases forming the
pairings do not get into the same segment, resulting in the omission
of the structure from that prediction. For instance, consider a 100
base segment of the Severe Acute Respiratory Syndrome (SARS)
coronavirus genome, which is one of the coronavirus genomes
analyzed, from position 25,884 to 25,983 and another segment from
position 25,923 to 26,022. When the program Pknots-RE is applied to
these segments, two predictions are produced which are shown in Fig.
3. Note that, over the stretch of the 62 bases that the two segments
overlap one another, the two predictions are different. This kind of
inconsistency poses a serious problem when the predicted structures
of the segments need to be assembled. If the prediction of secondary
structures for long RNA sequences is not feasible with thermodynamic
methods even with powerful supercomputers, arbitrarily cutting an
RNA sequence into shorter overlapping segments makes the single
segment predictions feasible but not advantageous for consistency,
unless the cutting algorithm uses the locations of high
concentration of close inversions. In addition, once motifs are
predicted from sampled segments, they have still to be assembled
together into complete secondary structures. Both predictions and
rebuilding can significantly benefit from the combined prediction
capability of different codes, as opposed to using the single codes
separately. Prediction of large numbers of short, overlapping
segments is still computationally demanding but it also allows
massive task parallelism that can be addressed with grid computing
resources.
The above bioinformatics application requires
both processing of huge amounts of data and heavy computation.
Fulfilling these requirements calls for simple ways to implement
parallel computing. MapReduce is a general purpose parallelization
model that seems particularly well-suited to the task of RNA
secondary structure prediction of thousands nucleotide lengths. With
an open source implementation (Hadoop), my objective is to implement
a MapReduce/ Hadoop application for RNA secondary structures
prediction.