基于时间序列理论方法的生物序列特征分析

Analysis on the Characteristics of Biological Sequences Based on Time Series Theory Methods

【Author】 Gao Jie (高洁)

【Advisor】 Xu Zhenyuan (徐振源)

【Author information】 Jiangnan University, Light Industry Information Technology and Engineering, 2009, PhD

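【Abstract (Chinese version, translated)】 The main objects of study in bioinformatics are DNA, RNA and protein molecules, because these biological macromolecules carry all the information of heredity and species evolution. Now that DNA and proteins are being sequenced, extracting more biological information from these sequences is a challenging problem. As the number of bases and amino acids in gene databases grows exponentially, applying new theoretical methods to DNA and protein sequences becomes increasingly important, and many biologists, physicists, mathematicians and computer scientists have been drawn to the field.

After introducing the research background of bioinformatics, this thesis first presents the time series methods used to study the characteristics of biological sequences. It describes in detail the short-memory ARMA model and the long-memory ARFIMA model used throughout, as theoretical preparation for studying DNA and protein sequences.

Chaos Game Representation (CGR) is an iterative mapping technique that maps each unit of a sequence, such as a nucleotide in a DNA sequence or an amino acid in a protein sequence, into a continuous coordinate space. Based on CGR coordinates, we propose a method for converting a DNA sequence into a time series (the CGR-walk sequence) and introduce the long-memory ARFIMA(p, d, q) model to analyze it. We analyzed the CGR-walk sequences of ten DNA sequences and found that all of them are fitted by long-memory ARFIMA(p, d, q) models with high significance. As a classical time series model with well-developed algorithms, the ARFIMA model can help uncover unknown properties of DNA sequences.

Because the success rate in selecting a suitable ARFIMA model is low and maximum likelihood parameter estimation is computationally expensive, approximating a long-memory model by a short-memory one is a question of interest to researchers. We consider approximating a long-memory ARFIMA(p, d, q) process by a short-memory ARMA(1, 1) process, prove a mean square error criterion for this approach, and use ten CGR-walk sequences of DNA sequences to verify the effectiveness of the approximation, which yields an algorithmically simpler approximate model for long-memory DNA sequences. Building on this, we also consider approximating the ARFIMA(0, d, 0) model by an ARMA(2, 2) model. A comparison of the efficiency loss ratios of the ARMA(2, 2) and ARMA(1, 1) models shows that the ARMA(2, 2) approximation is better than the ARMA(1, 1) approximation. To verify this conclusion, we analyze CGR-walk sequence data obeying an ARFIMA(0, d, 0) model and compare how well the two models approximate it; the residual standard deviations again show that the ARMA(2, 2) approximation is better than the ARMA(1, 1) approximation.

We modify the Kalman filter recursions to solve the missing-data problem for the long-memory ARFIMA model, and verify the effectiveness of the method on the CGR-walk sequence of a DNA sequence.

Building on the CGR-walk model for DNA sequences, we construct an analogous CGR-walk model for linked protein sequences based on the detailed HP model and analyze it with the long-memory ARFIMA(p, d, q) model. We find that the CGR-walk sequences of the linked protein sequences from 12 complete bacterial genomes are significantly fitted by long-memory ARFIMA(p, d, q) models.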

【Abstract】 DNA, RNA and protein sequences are of fundamental importance in understanding living organisms, since all information on heredity and species evolution is contained in these macromolecules. Once DNA and proteins have been sequenced, extracting more biological information from the sequences is a challenging problem. The number of nucleotides and amino acids stored in GenBank has been growing exponentially, so developing new theoretical methods for DNA and protein sequence analysis has become important. Many biologists, physicists, mathematicians and computer specialists have been attracted to this research field.

After introducing the background of bioinformatics, this thesis first introduces the time series methods applied to the study of biological sequence characteristics: the short-memory ARMA model and the long-memory ARFIMA model, both of which are used throughout the thesis.

Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, to find coordinates for their positions in a continuous space. A CGR-walk model for DNA sequences is proposed based on CGR coordinates: the CGR coordinates are converted into a time series, and a long-memory ARFIMA(p, d, q) model is introduced for DNA sequence analysis. The model is applied to real CGR-walk sequence data from ten genomic sequences. Remarkably long-range correlations are uncovered in the data, and the CGR-walk sequences are fitted with high significance by ARFIMA(p, d, q) models.
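
The CGR mapping described above can be sketched in a few lines. This is a minimal illustration of the standard chaos game iteration, not the thesis's own code: the corner assignment (A, C, G, T at the corners of the unit square), the starting point (0.5, 0.5), and the function name `cgr_coordinates` are assumptions made for the example. The abstract does not spell out how the coordinates are turned into the CGR-walk series, so the sketch stops at the coordinates.

```python
# Corner assignment is an assumption (a common convention), not taken from the thesis.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the CGR point reached after each nucleotide of `sequence`.

    Each step moves halfway from the current point to the corner
    assigned to the next nucleotide.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Calling `cgr_coordinates("ACGT")` returns one (x, y) pair per nucleotide; any scalar series derived from these pairs would then be the input to the time series models.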
As a classical time series model with well-developed algorithms, the ARFIMA model can help reveal unknown characteristics of DNA sequences.

Since the success rate in selecting the right ARFIMA model is low and maximum likelihood parameter estimation is computationally demanding, approximating an ARFIMA model by a short-memory process for prediction is a topic of interest in the literature. We analyze the approximation of a general long-memory ARFIMA(p, d, q) process by a short-memory ARMA(1, 1) process. To validate this approximation, a mean square error forecast criterion is proved. The performance of the ARMA(1, 1) approximation to an ARFIMA model is illustrated with an application to ten DNA sequences, which gives an approximating model with a simpler algorithm.

We also study the approximation of a long-memory fractionally differenced ARFIMA(0, d, 0) model by a short-memory ARMA(2, 2) process. Comparing the efficiency loss ratios of the ARMA(2, 2) and ARMA(1, 1) models shows that ARMA(2, 2) approximates the ARFIMA(0, d, 0) model better than ARMA(1, 1) does. To validate this conclusion, both approximating models are applied to a CGR-walk sequence obeying an ARFIMA(0, d, 0) model. According to the prediction error standard deviation, the ARMA(2, 2) approximation is again better than the ARMA(1, 1) approximation.

By modifying the Kalman filter recursive equations, the proposed method allows efficient estimation of a long-memory ARFIMA process with missing values. To illustrate its application and effectiveness, we analyze a CGR-walk sequence of a DNA sequence and conclude that the proposed approach is efficient.

Building on the CGR-walk model of DNA sequences, a new CGR-walk model for the linked protein sequences from complete genomes is proposed based on the detailed HP model.
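
For proteins, the chaos game needs the twenty amino acids reduced to four classes so that each class can occupy one corner of the square. The sketch below assumes the four-class grouping usually associated with the detailed HP model (non-polar, negative polar, uncharged polar, positive polar). The grouping table, the corner assignment, and the names `HP_CLASS`, `CLASS_CORNER`, and `protein_cgr` are illustrative assumptions, not details given in this abstract.

```python
# Assumed four-class grouping of the 20 amino acids (one-letter codes).
HP_CLASS = {}
for letters, cls in (
    ("AILMFPWV", "nonpolar"),
    ("DE", "negative"),
    ("NCQGSTY", "uncharged"),
    ("RHK", "positive"),
):
    for letter in letters:
        HP_CLASS[letter] = cls

# Hypothetical corner assignment: one corner of the unit square per class.
CLASS_CORNER = {
    "nonpolar": (0.0, 0.0),
    "negative": (0.0, 1.0),
    "uncharged": (1.0, 1.0),
    "positive": (1.0, 0.0),
}


def protein_cgr(sequence, start=(0.5, 0.5)):
    """Return the CGR point reached after each residue of a protein sequence."""
    x, y = start
    points = []
    for residue in sequence.upper():
        cls = HP_CLASS.get(residue)
        if cls is None:
            continue  # skip symbols outside the 20 standard amino acids
        cx, cy = CLASS_CORNER[cls]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

A linked protein sequence, that is, the proteins of one genome joined end to end, would be passed to `protein_cgr` as a single string.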
A long-memory ARFIMA(p, d, q) model is introduced into the protein sequence analysis. The model is applied to real CGR-walk sequence data of twelve linked protein sequences from twelve complete bacterial genomes. Remarkably long-range correlations are uncovered in the data, and the CGR-walk sequences are fitted reasonably well by ARFIMA(p, d, q) models.
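
The contrast between long and short memory that motivates the ARMA approximations can be seen directly in the autocorrelation functions. The sketch below relies on two textbook results: for ARFIMA(0, d, 0) with -0.5 < d < 0.5 the autocorrelations satisfy ρ(k) = ρ(k-1)(k-1+d)/(k-d), which decays hyperbolically for d > 0, while for ARMA(1, 1) they decay geometrically at rate φ after lag 1. The function names and any parameter values are chosen for illustration and are not taken from the thesis.

```python
def arfima_0d0_acf(d, max_lag):
    """Autocorrelations rho(0..max_lag) of a stationary ARFIMA(0, d, 0) process."""
    if not -0.5 < d < 0.5:
        raise ValueError("d must lie in (-0.5, 0.5) for stationarity and invertibility")
    rho = [1.0]
    for k in range(1, max_lag + 1):
        # rho(k) = rho(k-1) * (k - 1 + d) / (k - d)
        rho.append(rho[-1] * (k - 1 + d) / (k - d))
    return rho


def arma11_acf(phi, theta, max_lag):
    """Autocorrelations of X_t = phi*X_{t-1} + e_t + theta*e_{t-1}, with |phi| < 1."""
    if not abs(phi) < 1.0:
        raise ValueError("|phi| must be below 1 for stationarity")
    rho = [1.0]
    if max_lag >= 1:
        rho.append((1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta ** 2))
    for _ in range(2, max_lag + 1):
        rho.append(phi * rho[-1])  # geometric decay beyond lag 1
    return rho
```

Comparing, say, `arfima_0d0_acf(0.3, 50)` with `arma11_acf(0.5, 0.2, 50)` shows the long-memory autocorrelation still clearly positive at lag 50 while the short-memory one has effectively vanished, which is why an ARMA model can only approximate an ARFIMA model.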

  • 【Online publication contributor】 Jiangnan University
  • 【Online publication year and issue】 2010, Issue 04