About this Event
Linear-time Algorithms of Partition Function, Stochastic Sampling and two-strand Folding for RNA Secondary Structures
Many RNAs fold into multiple structures at equilibrium. The partition function-based methods are proposed to compute folding ensembles and estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore a slow calculation for long sequences. We design a linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold. More interestingly, the resulting base pairing probabilities are even better correlated with the ground truth structures. With LinearPartition, the estimated base-pairing probabilities provide compact representations of the exponentially large ensemble, but they cannot provide direct and intuitive descriptions, and cannot be directly used for accessibility prediction. Stochastic sampling algorithm, which samples secondary structures according to their probabilities in the Boltzmann ensemble, is widely used, e.g., for accessibility prediction. However, the current sampling algorithm suffers from three limitations: (a) the formulation and implementation of the sampling phase are unnecessarily complicated; (b) much redundant work is repeatedly performed in the sampling phase; (c) the partition function runtime scales cubically with the sequence length. These issues prevent it from being used for full-length viral genomes such as SARS-CoV-2. To alleviate these problems, we first propose a hypergraph framework under which the sampling algorithm can be greatly simplified. We then present three sampli! ng algorithms under this framework of which redundant work is eliminated in the sampling phase. Finally, we propose LinearSampling, the first end-to-end linear-time stochastic sampling algorithm, which can be used to detect SARS-CoV-2 potential regions of diagnostics and treatment. Many RNAs function through RNA-RNA interactions. LinearSampling is able to provide single strand accessibilities of RNA binding, however, two-stand folding, which can directly predict the structures with consideration of RNA-RNA interaction, is also well-desired. Some existing tools, such as RNAhybrid and RNAduplex, are not only less informative but also less accurate due to omitting the competing between intermolecular and intramolecular base pairs. Another group of tools such as RNAup focus on predicting the binding region rather than predicting two-strand co-folding structure. Other tools like RNAcofold are too slow due to cubic runtime complexity. To address these issues, we propose LinearCoFold, which is able to predict two-strand folding structure, partition function and base pairing probabilities in linear runtime and space. LinearCoFold is a global co-folding approach without restriction on base pair length, and can outputs both intermolecular and intramolecular base pairs.
Co Advisor: Li-Jing (Larry) Cheng
Co Advisor: Liang Huang
Committee: Lizhong Chen
Committee: Prasad Tadepalli
GCR: Brett Tyler