生物序列分析中的非比对方法及其应用

The Alignment-free Methods and Their Applications for Analysis of Biological Sequences

【Author】 刘迎照 (Liu Yingzhao)

【Advisor】 王天明 (Wang Tianming)

【Author information】 Dalian University of Technology, Applied Mathematics, 2008, PhD

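【Abstract (translation of the Chinese original)】 With the rapid development of mathematics and computer technology and the steady accumulation of vast amounts of biological data, a new and vigorous interdisciplinary field, computational molecular biology, has emerged. Computational molecular biology mainly studies computationally complex problems arising in biological applications, and it has drawn many computer scientists, molecular biologists, and mathematicians into active research. Biological sequence analysis is the core of the field. Traditional analysis relies mainly on sequence alignment, but with the arrival of the "post-genome" era, alignment-free methods, which complement and extend the traditional methods, have gradually become an active research area. After a brief review of traditional sequence alignment methods, this dissertation summarizes existing alignment-free methods fairly systematically, proposes several new ones, and then applies them to specific biological sequences. The main contributions are as follows. Based on the probability-vector representation of biological sequences, a new distance measure, the normalized Euclidean distance, is proposed; it is used to reconstruct the secondary-structure classification of two protein sequence sets, CK35 and SP86, and the results are compared, using ROC curves and AUC values, with those obtained from traditional alignment methods and other distance measures. With the L-tuples of biological sequences at the core, an 8-D vector representation and a high-dimensional vector representation of DNA sequences are given; related matrices are constructed from the different starting positions of a sliding window, and the normalized largest eigenvalue and the Frobenius norm of these matrices are chosen as numerical features for comparing sequence similarity. As applications, we compare the similarity of the first exon of the β-globin gene across eleven species; briefly simulate the use of the high-dimensional vector representation and related matrices in database search; and reconstruct the phylogenetic tree of the whole-genome coding sequences of H5N1 avian influenza viruses. Based on the number and positions of occurrences of an L-tuple in a biological sequence, and following the definition of the distribution function of a discrete random variable, the concept of the characteristic distribution of an L-tuple is proposed to reflect the distribution pattern of L-tuples and to reveal the biological information contained in the sequence. Using this characteristic distribution, we study the GC characteristic distribution plots of the first exon of β-globin from 11 species, and reconstruct phylogenetic trees for the whole genomes of 24 coronaviruses, the whole mitochondrial genomes of 34 mammals, and 40 transmembrane protein sequences.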

【Abstract】 With the rapid development of mathematics and computer technology and the continuous accumulation of biological data, a new and active interdisciplinary field, computational molecular biology, has come into being. Research in computational molecular biology, which has attracted many computer scientists, molecular biologists, and mathematicians, is mainly concerned with computationally complex problems in biological applications. Biological sequence analysis is the core of the field. Traditional methods are based chiefly on sequence alignment, but with the coming of the "post-genome" era, alignment-free methods, as a complement to and development of alignment methods, have become an active research area. In this dissertation, we first briefly review alignment methods, then summarize the existing alignment-free methods fairly systematically and propose some new ones, and finally apply the new methods to sequences from several species. The main contents are as follows. Based on vectors of L-tuple probabilities for biological sequences, we propose a novel distance measure, the normalized Euclidean distance, and use it to classify two sets of protein sequences, CK35 and SP86, according to protein secondary structure. We then compare our method with other metrics and with alignment methods via ROC (receiver operating characteristic) analysis, to assess the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. Using L-tuples, we construct three 8-component vectors and multivariate vectors for a DNA primary sequence, and from the different start positions of the sliding window we obtain a set of related matrices. The normalized leading eigenvalues and Frobenius norms of these matrices are selected as numerical characterizations. As applications, we compare the similarity and dissimilarity of exon 1 of the β-globin genes of eleven species; we simulate a search for sequences similar to a query sequence in a database of 39 library sequences, using the multivariate vector representation of DNA sequences; and we reconstruct the phylogenetic trees of H5N1 avian influenza virus genomes. From the frequency and positions of occurrence of an L-tuple in a biological sequence, we construct a characteristic distribution of the L-tuple to reflect the biological information contained in the sequence. The utility of the approach is illustrated by graphs of the characteristic distributions of the dinucleotide GC for the coding sequences of the first exon of the β-globin gene of eleven species, and by the construction of phylogenetic trees of twenty-four coronavirus genomes, thirty-four mitochondrial genomes, and 40 G protein-coupled receptors.
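
As a rough illustration of the L-tuple probability vectors mentioned in the abstract, the sketch below counts overlapping L-tuples in a DNA sequence, converts the counts to relative frequencies, and compares two sequences with the ordinary Euclidean distance. The function names `ltuple_probabilities` and `euclidean_distance` are hypothetical, and the plain Euclidean distance is only a stand-in: this record does not give the formula for the dissertation's normalized Euclidean distance, so no normalization is attempted here.

```python
from collections import Counter
from itertools import product
import math


def ltuple_probabilities(seq, L=2, alphabet="ACGT"):
    """Relative frequencies of overlapping L-tuples, in lexicographic word order.

    Hypothetical helper for illustration; tuples containing symbols outside
    `alphabet` (for example N) are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product(alphabet, repeat=L)]
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        # Sequence shorter than L, or no valid tuples at all.
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean_distance(p, q):
    """Ordinary Euclidean distance between two equal-length vectors.

    Assumption: used here in place of the thesis's normalized variant,
    whose definition is not stated in this record.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    p = ltuple_probabilities("ACGTACGTAC", L=2)
    q = ltuple_probabilities("AAGTTCGTAA", L=2)
    print(len(p), euclidean_distance(p, q))
```

Because each sequence is reduced to a fixed-length vector (4^L entries for DNA), sequences of different lengths can be compared without computing an alignment, which is the property the alignment-free methods described above rely on.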
