
The Design and Implementation of ncRNA Gene Finding Model (original title: 非编码RNA基因识别模型的设计与实现)

【Author】 Guan Naiyang (管乃洋)

【Advisor】 Luo Zhigang (骆志刚)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2006, Master's degree

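【Abstract (translated from the Chinese original)】 Bioinformatics is a research field formed by combining computer science with the life sciences. It applies the theories and algorithms of computer science to process, store, retrieve, and analyze life-science data. As biological sequence data grow rapidly, how to process them with efficient algorithms has drawn increasing attention. Gene finding is one focus of this work: it means identifying, in DNA sequences, all protein-coding regions and all non-protein-coding regions involved in regulating gene expression. This thesis studies gene finding for non-coding ribonucleic acid (ncRNA). The approach uses the context-sensitive hidden Markov model (csHMM), combined with species evolutionary relationships, in an attempt to find a new method for identifying ncRNA genes in genomes. The focus is on using csHMM and evolutionary relationships to build a secondary structure model of ncRNA and on achieving theoretical prediction of ncRNA genes. First, a basic ncRNA secondary structure model is built with csHMM. Second, the emission probabilities of the csHMM are derived from an amino acid substitution matrix that represents species evolutionary relationships, which yields a new ncRNA identification framework, pair-csHMM. Third, the Inside-Outside algorithm of csHMM is modified to optimize the model parameters, so that the model can extract secondary structure features from known sequences. Finally, the optimized model is used to predict ncRNA genes, and a prototype system is implemented. The difficulty lies in building a model that reflects the characteristics of ncRNA and in optimizing its parameters. The thesis integrates the secondary structure features of ncRNA and its conservation during species evolution into the ncRNA model, so that the model better reflects ncRNA characteristics, and it modifies the csHMM Inside-Outside algorithm to train the newly built model and make it more accurate. Test results show that the model reflects the characteristics of ncRNA reasonably well and, after optimization, can be used to identify ncRNA genes. The main contributions are: (1) the context-sensitive hidden Markov model is used for ncRNA identification, and experiments show that it improves the specificity of ncRNA gene finding; (2) species evolutionary relationships are introduced into the csHMM model, and experiments show that the closer the evolutionary distance between the two aligned genomes is to the evolutionary distance of the model, the better the identification results; (3) a prototype ncRNA gene finding system, RNA-cs, is implemented.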

【Abstract】 Bioinformatics is a field that combines computer science and life science. It uses the theories and algorithms of computer science to process, store, search, and analyze data produced in life science. As biological sequence data increase, more and more attention has been paid to processing these data with efficient algorithms. Gene finding is one such focus: its goal is to predict, from DNA sequences, both the regions that code for proteins and the regions that regulate gene expression but do not code for any protein. This thesis studies the problem of finding non-coding RNA (non-coding ribonucleic acid, ncRNA) genes. The method uses the csHMM (context-sensitive hidden Markov model) technique together with species evolutionary relationships to set up a new computational framework that can distinguish non-coding RNA genes within a genome. The emphasis of the thesis is on using the csHMM and species evolutionary relationships to build a secondary structure model of non-coding RNA. First, a basic secondary structure model of non-coding RNA was built using the csHMM. Second, the probabilities of emitting paired residues were computed from an amino acid substitution matrix that represents species evolutionary relationships, forming a new computational framework for non-coding RNA gene finding called pair-csHMM. Third, the Inside-Outside algorithm of the csHMM was modified to optimize pair-csHMM, with the aim of extracting RNA secondary structure features from known RNA sequences. Finally, a prototype system was implemented to find non-coding RNA genes. The main difficulties encountered were establishing the non-coding RNA model and optimizing its parameters. Both the secondary structure conservation of non-coding RNA and its sequence conservation across evolution were integrated into the non-coding RNA model using the csHMM, and the Inside-Outside algorithm of the csHMM was modified to train the model and make it more accurate. The test results indicate that the new framework can be used to find non-coding RNA genes. The contributions are summarized as follows: (1) The csHMM was used to predict non-coding RNA genes. The test results indicate that the model improves the specificity of non-coding RNA gene finding. (2) Species evolutionary relationships were introduced into the pair-csHMM model. The test results indicate that the closer the evolutionary distance between the aligned genomes is to that of the non-coding RNA model, the better the model predicts non-coding RNA genes. (3) A prototype system called RNA-cs was implemented to predict non-coding RNA genes.
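
To make the idea of a context-sensitive HMM concrete, the following is a minimal sketch of how such a model can generate a base-paired hairpin. It is not the thesis's pair-csHMM or the RNA-cs system: the function name `sample_hairpin`, its parameters, and the fixed three-stage layout are hypothetical, and the uniform base frequencies are an assumption. The state roles (a pairwise-emission state that pushes onto a stack, a single-emission state, and a context-sensitive state that pops from the stack) follow the usual csHMM terminology.

```python
import random

# Watson-Crick complements used by the context-sensitive state.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
BASES = "ACGU"


def sample_hairpin(stem_len, loop_len, pair_prob=1.0, rng=None):
    """Sample one hairpin from a toy csHMM-like generator (hypothetical).

    Returns (sequence, dot_bracket_structure).
    """
    rng = rng or random.Random()
    stack = []  # memory shared by the pairwise and context-sensitive states
    seq, struct = [], []

    # Pairwise-emission state: emit a base and push it onto the stack.
    for _ in range(stem_len):
        base = rng.choice(BASES)  # assumption: uniform base frequencies
        stack.append(base)
        seq.append(base)
        struct.append("(")

    # Single-emission state: unpaired loop bases, no memory access.
    for _ in range(loop_len):
        seq.append(rng.choice(BASES))
        struct.append(".")

    # Context-sensitive state: pop a base; the emission distribution
    # depends on the popped symbol (its complement with prob. pair_prob).
    while stack:
        partner = stack.pop()
        if rng.random() < pair_prob:
            base = COMPLEMENT[partner]
        else:
            base = rng.choice([b for b in BASES if b != COMPLEMENT[partner]])
        seq.append(base)
        struct.append(")")

    return "".join(seq), "".join(struct)


if __name__ == "__main__":
    sequence, structure = sample_hairpin(5, 4, pair_prob=0.9)
    print(sequence)
    print(structure)
```

Because the stack is last-in, first-out, the first stem base pairs with the last emitted base, which is the nested dependency an ordinary HMM cannot express. A trained model would replace the fixed `pair_prob` with learned pair-emission probabilities.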

【Key words】 bioinformatics; non-coding RNA; gene finding; hidden Markov model (HMM)
  • 【Classification number】TP301
  • 【Citation count】3
  • 【Download count】669