
Efficient algorithms for RNA structure prediction, mRNA sequence design, and learning to fold RNAs

RNA structure prediction is a challenging problem, especially when pseudoknots are involved. Recently, there has been a general shift from classical minimum free energy (MFE)-based methods to partition function-based ones that assemble structures from base-pairing probabilities. Inspired by ProbKnot, a simple and fast heuristic algorithm that can predict structures with pseudoknots using base-pairing probabilities, we show that a simple thresholded version of ProbKnot, which we call ThreshKnot, leads to more accurate overall predictions by filtering out unlikely pairs whose probability falls under a given threshold.

Messenger RNA (mRNA) vaccines have emerged as a promising direction to combat the current COVID-19 pandemic. Such a vaccine requires an mRNA sequence that is stable and highly productive in protein expression, features that have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem because exponentially many synonymous mRNA sequences encode the same protein. We show that this design problem can be reduced to a classical problem in formal language theory and computational linguistics that can be solved in cubic time. Inspired by LinearFold, we further develop a linear-time approximate algorithm for mRNA sequence design. This algorithm also incorporates the Codon Adaptation Index (CAI) into the dynamic programming, which allows it to jointly optimize stability and codon usage.

We then return to the RNA secondary structure prediction problem. Machine learning-based methods can achieve higher accuracy than thermodynamics-based models, since statistical methods overcome the problem of inaccurately measured thermodynamic parameters by learning weights from known structures. A recent linear-time machine learning-based RNA folding system, learning-to-fold, uses LinearFold as its inference engine.
To further improve learning-to-fold, we plan to incorporate LinearPartition as its inference engine, since the overall accuracy of partition function-based methods is generally higher than that of MFE-based ones. Moreover, using structure-probing data generated by methods such as SHAPE (Selective 2′-Hydroxyl Acylation analyzed by Primer Extension) to guide RNA folding prediction has become popular. CONTRAfold-SE is one such model; it trains its parameters using SHAPE data. However, this model considers probing data only at individual nucleotides and therefore lacks motif information. We plan to first incorporate SHAPE data into our partition function-based learning-to-fold model, and then go one step further by bringing neighborhood information back into the model.
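
As an illustration of the thresholding idea behind ThreshKnot described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that ProbKnot-style selection keeps a pair (i, j) when i and j are each other's most probable partner, and that the thresholded variant additionally drops pairs whose probability falls under the threshold. The function name `threshknot_sketch`, the dictionary input format, and the example probabilities and threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def threshknot_sketch(bpp: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Select base pairs from base-pairing probabilities (illustrative sketch).

    bpp maps a position pair (i, j) to its pairing probability. A pair is kept
    when (1) each position is the other's most probable partner (the assumed
    ProbKnot-style rule) and (2) its probability is not below `threshold`.
    """
    # For every position, remember its most probable partner.
    best: Dict[int, Tuple[float, int]] = {}
    for (i, j), p in bpp.items():
        for a, b in ((i, j), (j, i)):
            if a not in best or p > best[a][0]:
                best[a] = (p, b)

    selected = set()
    for (i, j), p in bpp.items():
        mutual = best[i][1] == j and best[j][1] == i
        if mutual and p >= threshold:
            selected.add((min(i, j), max(i, j)))
    return sorted(selected)


# Hypothetical probabilities for a short sequence (not real data).
example_bpp = {
    (0, 9): 0.9,
    (1, 8): 0.8,
    (2, 7): 0.2,   # mutually best, but unlikely
    (4, 12): 0.6,  # crosses (0, 9), i.e. a pseudoknotted pair
    (1, 12): 0.1,  # not the best partner of either position
}

print(threshknot_sketch(example_bpp, threshold=0.3))
```

With a threshold of 0 the sketch reduces to the assumed mutual-best rule alone, so the low-probability pair (2, 7) would also be returned; raising the threshold removes it while keeping the crossing pair (4, 12). Because pairs are chosen independently rather than by nested dynamic programming, crossing pairs are not excluded.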

Co Advisor: Weng-Keen Wong
Co Advisor: Liang Huang
Committee: Prasad Tadepalli
Committee: Xiaoli Fern
GCR: Leonard Coop
