Similarity Analysis of Biological Sequences and Gene Identification Based on Signal Processing Techniques

【Author】 Wang Shiyuan (王世元)

【Advisor】 Tian Fengchun (田逢春)

【Author Information】 Chongqing University, Circuits and Systems, 2011, PhD

【Abstract】 Bioinformatics is an emerging interdisciplinary field. Using computers and networks as tools, it applies theories and methods from mathematics and information science to the study of biological macromolecules such as nucleic acids and proteins. Research in bioinformatics helps us explore fundamental questions about biological evolution and the nature of life, and the vast amount of information contained in living systems can in turn advance other disciplines. This dissertation explores the application of signal processing techniques in bioinformatics. The main research topics are the similarity analysis of biological sequences and gene identification. The main results can be summarized as follows.

① Since the structural features of an RNA secondary structure reside mainly in its base pairs, base sequences are extracted from RNA secondary structure sequences using base pairs as the starting point. Drawing on the ideas of orthogonal projection and the wavelet transform from signal processing, a base-pair transform is designed on the extracted base sequences, and a similarity function between sequences is constructed from it. The function combines the differences between the transformed values of two sequences with the differences between their corresponding positions, so it compares sequences comprehensively and can be applied to the similarity analysis of RNA secondary structures. The method has relatively low time complexity. In addition, the similarity values it produces are widely separated, which facilitates subsequent cluster analysis of the results.

② Based on the Hamming distance from information theory, a universal bilateral similarity function is proposed that applies to DNA sequences, RNA secondary structure sequences, and protein sequences. The method requires no numerical mapping of the biological sequences, extracts the information in the sequences well, and provides a unified, low-complexity approach to the similarity analysis of all three kinds of sequence. Simulation results demonstrate the validity and universality of the bilateral similarity function. For RNA secondary structure sequences in particular, the results obtained without considering structure information are approximately the same as those obtained with it, which simplifies the similarity analysis of RNA secondary structures.

③ Based on the principle of symbolic dynamics, a novel representation of DNA sequences is proposed. The representation has good numerical characteristics, helps reveal chaotic features of DNA sequences, and also allows the sequences to be visualized. Its visual form supports both graphical alignment and codon alignment of DNA sequences. A similarity percentage between sequences, constructed from the codon alignment results, effectively implements the similarity analysis of DNA sequences. A characteristic vector composed of geometric centers yields an equally effective similarity analysis. These results show that the principle of symbolic dynamics can be applied effectively to the analysis of DNA sequences.

④ Taking into account the differences between RNA secondary structure sequences and DNA sequences, the symbolic-dynamics representation of DNA sequences is modified to suit RNA secondary structure sequences. The starting point is that the structural stability of an RNA secondary structure is determined mainly by the free energy of its base pairs. The influence of the truncation length used in the modified representation on the similarity analysis results is discussed in detail. In the time domain, the modified representation is combined with matrix invariants to implement a quantitative similarity analysis of RNA secondary structure sequences. To further validate the modified representation, the discrete Fourier transform is applied to it, and the similarity of RNA secondary structure sequences is analyzed qualitatively in the frequency domain. Simulation results show that the principle of symbolic dynamics can also be applied effectively to the similarity analysis of RNA secondary structure sequences.

⑤ Combining the symbolic-dynamics representation and the Z-curve representation of DNA sequences, the period-3 property of protein-coding regions is used to design a gene identification model based on the extended Kalman filter. The model uses the predictive ability of the extended Kalman filter to identify the locations of exons effectively. To reduce background noise, a windowing operation is applied to the identification results, which further improves the discrimination between coding and noncoding regions.
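The period-3 property mentioned in item ⑤ can be illustrated with a classical spectral measure: the power of the four base indicator sequences at frequency 1/3, computed over a sliding window. The sketch below is only that baseline measure, not the extended Kalman filter model of the dissertation, which the abstract does not specify. The function name `period3_power` and the `window` and `step` defaults are assumptions chosen for illustration.

```python
import cmath
import math


def period3_power(seq, window=351, step=3):
    """Sliding-window spectral power at frequency 1/3.

    Illustrative baseline only (hypothetical helper, not the
    dissertation's EKF model). Returns a list of (start, power) pairs.
    """
    seq = seq.upper()
    if not seq:
        return []
    # Complex exponentials exp(-2*pi*j*i/3) repeat with period 3.
    twiddle = [cmath.exp(-2j * math.pi * i / 3) for i in range(3)]
    scores = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        total = 0.0
        for base in "ACGT":
            # DFT coefficient at f = 1/3 of this base's 0/1 indicator sequence.
            coeff = sum(twiddle[i % 3] for i, b in enumerate(chunk) if b == base)
            total += abs(coeff) ** 2
        # Normalize by window length so windows of different size are comparable.
        scores.append((start, total / len(chunk)))
    return scores
```

On a toy input such as `"ATG" * 50` with `window=60`, every window scores far higher than on a period-4 input such as `"ACGT" * 30`, whose power at frequency 1/3 is numerically zero. Real coding regions show a much weaker contrast, which is why the abstract describes additional filtering and windowing of the identification results.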

  • 【Online Publication Contributor】 Chongqing University
  • 【Online Publication Issue】 2011, No. 12