
非编码RNA相关计算问题研究

Research on Relevant Computational Problems of Noncoding RNA

【Author】 Zhao Yingjie (赵英杰)

【Advisor】 Wang Zhengzhi (王正志)

【Author information】 National University of Defense Technology, Control Science and Engineering, 2010, Ph.D.

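【Abstract (translated from the Chinese)】 Non-coding RNAs (ncRNAs) are transcripts that do not encode proteins. They play important roles in many life processes, including gene regulation, chromatin remodeling, gene localization, gene modification, and DNA imprinting. Research on ncRNAs has significant theoretical and practical value and will provide indispensable tools for exploring the fundamental nature of life. Studying ncRNAs by experiment is usually costly, time-consuming, and poorly targeted. As genome sequencing of various organisms has been completed and the corresponding databases have been established and steadily enriched, applying computational methods to ncRNA research has become both possible and necessary. This dissertation takes classic computational problems related to ncRNAs as its subject, namely sequence-structure alignment, secondary structure prediction, and ncRNA gene identification, and studies them in depth using methods from pattern classification. Its main content and contributions are as follows:

1. ncRNA sequence-structure alignment. Sequence alignment is a classic topic in computational molecular biology. Because ncRNAs are more strongly conserved in structure than in sequence, the output of traditional sequence alignment programs does not meet the needs of ncRNA analyses. The alignment must therefore take more of the structural information into account, which multiplies the algorithmic complexity. This dissertation introduces the quantum genetic algorithm into ncRNA sequence-structure alignment. It exploits the superposition of quantum encoding, the diversity of the population, and the parallelism of evolution by quantum rotation gates, combined with the mutation and crossover operations of the conventional genetic algorithm. It proposes a full-interference, pairing-conserving crossover operator that accounts for both structural and sequence information, and defines an objective function that fully considers both. The result is a good balance between evolution speed and control over trapping in local optima. Compared with the conventional genetic algorithm, optimization time is shorter and alignment quality is higher.

2. ncRNA secondary structure prediction. For ncRNA genes, which follow the principle that structure determines function, secondary structure plays an important role in related studies. Traditional ncRNA secondary structure prediction methods are all based on optimization algorithms, with high computational complexity and poor timeliness. This dissertation treats secondary structure prediction as a classification problem: given a sequence alignment, decide from the information the alignment provides whether any pair of columns can form a base pair. After surveying the numerical measures used in existing structure prediction algorithms, it analyzes them quantitatively with several feature selection methods (earlier prediction methods analyzed them only qualitatively) and selects a feature subset suited to classification-based structure prediction. This optimal subset combines thermodynamic information (the average base-pairing matrix), covariation information (a covariation score that includes base-pair stacking), and evolutionary information (the R statistic used by Akmaev, which incorporates the evolutionary relationships among sequences). Using an SVM classifier with the selected features, together with stem combination rules, it gives a classification-based method for ncRNA secondary structure prediction, offering a new approach to structure prediction.

3. Identification of microRNA gene precursors. ncRNA genes lack the recognition signals of conventional genes, and they are widely distributed in the genome, diverse in type, and variable in length, so general-purpose ncRNA gene identification methods perform poorly. microRNAs, an important class of regulatory RNAs, play important roles in many life processes, and identifying microRNA precursors (pre-miRNAs) is a prerequisite for related analyses. Although the hairpin secondary structure is a distinctive feature of pre-miRNAs, genomes contain many non-pre-miRNA sequences that can also fold into hairpins. Centering on the hairpin secondary structure of pre-miRNAs, this dissertation studies how to pick out pre-miRNAs from sequences with similar structural features. First, by "stretching" the RNA secondary structure, we propose a new kind of local sequence-structure feature that covers not only the sequence information of the stem in the hairpin but also the bulges and interior loops. Tests show that these new local sequence-structure features classify better than the comparable 3SVM features. Next, to describe the pre-miRNA hairpin in detail, we combine graph theory with the topological indices of computational chemistry to construct new topological index features on free-energy-weighted graphs. Statistical analysis of these indices, and a comprehensive comparison with other existing pre-miRNA identification features, shows that the new features capture the topological relations among the elements of the hairpin well and, through the free-energy weights, reflect the base composition of the structure and the relative positions of the bases. Finally, using feature selection, we choose a subset of 23 global features suited to pre-miRNA identification from 52 candidate features, which include the 4 topological index features, and by handling the class-imbalanced data effectively we obtain a pre-miRNA identification model with good performance.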
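
The quantum genetic algorithm named in item 1 can be illustrated with a minimal sketch of its two generic ingredients: Q-bit individuals that are observed to obtain bit strings, and a rotation-gate update that moves each Q-bit toward the best solution found so far. This is the textbook form of the technique applied to a toy bit-counting problem, not the dissertation's alignment encoding, crossover operator, or objective function, none of which are specified in the abstract. All names (`observe`, `rotate`, `qga`) and parameter values are assumptions made for illustration.

```python
import math
import random

EPS = 0.05  # keep angles away from 0 and pi/2 so no bit is ever frozen (assumed value)

def observe(thetas, rng):
    """Collapse each Q-bit to 0 or 1; P(bit = 1) = sin(theta)^2."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in thetas]

def rotate(thetas, bits, best_bits, delta):
    """Rotation-gate update: turn each angle toward the best solution's bit."""
    updated = []
    for t, b, g in zip(thetas, bits, best_bits):
        if b != g:
            t += delta if g == 1 else -delta
        updated.append(min(max(t, EPS), math.pi / 2 - EPS))
    return updated

def qga(fitness, n_bits, pop_size=10, generations=50,
        delta=0.05 * math.pi, seed=0):
    """Maximise `fitness` over bit strings with a bare-bones QGA."""
    rng = random.Random(seed)
    # theta = pi/4 is an equal superposition: P(0) = P(1) = 0.5
    population = [[math.pi / 4] * n_bits for _ in range(pop_size)]
    best_bits, best_fit = None, float("-inf")
    for _ in range(generations):
        observed = [observe(ind, rng) for ind in population]
        for bits in observed:
            f = fitness(bits)
            if f > best_fit:
                best_bits, best_fit = bits, f
        population = [rotate(ind, bits, best_bits, delta)
                      for ind, bits in zip(population, observed)]
    return best_bits, best_fit

if __name__ == "__main__":
    # Toy objective (an assumption): count of 1-bits in a 16-bit string.
    print(qga(sum, n_bits=16))
```

In an alignment setting the bit string would have to encode gap placements and the fitness would score both sequence and structure agreement; the sketch leaves those parts out because the abstract does not describe them.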

【Abstract】 Non-coding RNAs (ncRNAs) are defined as all functional RNA transcripts other than protein-encoding messenger RNAs (mRNAs). ncRNAs play key roles in many life processes, including gene regulation, chromatin remodeling, gene localization, gene modification, and DNA imprinting. Research on ncRNAs is important both in theory and in application, and will provide necessary tools for exploring the nature of life. Experimental methods are essential for understanding ncRNAs, but they are usually expensive, time-consuming, and poorly targeted. With the successive completion of genome sequencing for various organisms and the establishment and enrichment of the corresponding databases, computational tools for ncRNA research have become both possible and sorely needed. This dissertation focuses on classic computational problems related to ncRNAs, including sequence-structure alignment, secondary structure prediction, and identification of ncRNA genes. The main contents and contributions of the dissertation are summarized as follows:

1. ncRNA sequence-structure alignment. Sequence alignment is one of the classic problems in computational molecular biology. ncRNA molecules are highly conserved in secondary structure but share little sequence similarity, so traditional multiple alignment methods fail to meet the needs of ncRNA analyses. Reliable ncRNA alignments must therefore take structural information into account, which markedly increases computational complexity. To deal with this problem, we employ the quantum genetic algorithm (QGA), which is based on concepts and principles of quantum computing such as the quantum bit and the superposition of states. Moreover, we design a new full-interference pair crossover operator and construct a fitness function, both of which consider sequence and structure information simultaneously.
Experiments on BRAliBase show that the QGA performs well without premature convergence, with a shorter optimization time and higher solution quality than the conventional genetic algorithm.

2. ncRNA secondary structure prediction. The secondary structures of ncRNAs, which determine their function, are crucial to related research. Most traditional methods for ncRNA secondary structure prediction use optimization algorithms, which suffer from high space and time complexity. Given aligned ncRNA sequences, we treat secondary structure prediction as a classification problem: deciding, from the information the alignment provides, whether any two columns in the alignment correspond to a base pair. After analyzing the computational measures used in existing prediction methods, we compared their classification capability quantitatively using filter and wrapper approaches combined with a support vector machine (SVM) classifier. As a result, an optimal subset of measures, covering thermodynamic, covariation, and phylogenetic information, was selected for predicting RNA secondary structure by classification. Our method uses an SVM classifier with the selected measures and stem combination rules to predict ncRNA secondary structure, which represents a new methodology for future prediction approaches.

3. Identification of microRNA gene precursors. General-purpose computational methods for identifying ncRNA genes are far from satisfactory, because ncRNA genes carry fewer signals than protein-coding genes, are widely distributed in the genome, and vary greatly in type and length. As an important class of regulatory ncRNAs, microRNAs play crucial roles in many life processes. Identifying microRNA precursors (pre-miRNAs) is a primary step in analyses involving microRNA genes.
While the hairpin secondary structure is a distinguishing feature of pre-miRNAs, a large number of sequences that are not pre-miRNAs also fold into hairpins. Focusing on the hairpin secondary structure, we study prediction methods that distinguish pre-miRNA hairpins from pre-miRNA-like pseudo hairpins. First, we propose 25 novel local features for identifying pre-miRNA hairpin structures, obtained by "stretching" the RNA hairpin; they capture characteristics not only of the stem but also of the bulges and interior loops in the structure. Tests show that the classifier with the new features outperforms 3SVM. Second, to characterize the pre-miRNA hairpin in detail, we define four topological indices weighted by free energy. Analysis of these indices shows that they characterize the topological connections among structural elements and also reflect the composition and relative positions of bases in the structure. Finally, we select 23 features from 52 candidates, which include the 4 new topological indices, as the feature set for identifying pre-miRNAs. By additionally handling the class imbalance problem in the datasets, we develop an effective classifier model for pre-miRNAs.
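
The column-pair formulation in item 2 can be made concrete with a small sketch that computes two per-pair quantities a classifier could take as input. Mutual information is used here as a common, generic covariation measure; it stands in for, and is not the same as, the stacking-aware covariation score the dissertation selects. The function names, the gap symbol `-`, and the toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter

# Watson-Crick pairs plus the G-U wobble pair
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}

def column_pairs(alignment, i, j):
    """Residue pairs from columns i and j, skipping rows with a gap in either."""
    return [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]

def mutual_information(alignment, i, j):
    """Mutual information in bits between alignment columns i and j."""
    pairs = column_pairs(alignment, i, j)
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (left[a] * right[b])
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi

def canonical_fraction(alignment, i, j):
    """Fraction of ungapped rows whose residues at (i, j) can base-pair."""
    pairs = column_pairs(alignment, i, j)
    if not pairs:
        return 0.0
    return sum(p in CANONICAL for p in pairs) / len(pairs)

if __name__ == "__main__":
    # Toy alignment (an assumption): columns 0 and 4 covary, columns 1 and 3 do not.
    aln = ["GCAUC", "CCAUG", "ACAUU", "UCAUA"]
    for i, j in [(0, 4), (1, 3)]:
        print(i, j, mutual_information(aln, i, j), canonical_fraction(aln, i, j))
```

Each column pair thus becomes a small feature vector; a classifier such as an SVM would then label the pair as base-paired or not, and a separate step would be needed to assemble the predicted pairs into consistent stems.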
