
Geometrical Characterization of Biological Sequences and Applications (生物序列的几何刻画及应用)

【Author】 Guo Ying (郭颖)

【Supervisor】 Wang Tianming (王天明)

【Author Information】 Dalian University of Technology, Applied Mathematics, 2008, PhD

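【Abstract (translated from the Chinese)】 With the vigorous development and successive completion of genome projects for various model organisms, and especially the successful completion of the Human Genome Project, the accumulation of biological data has made an unprecedented leap. Along with this rapid growth in data, bioinformatics has emerged as a new interdisciplinary field and developed quickly, and it is gradually becoming one of the core areas of natural science in the 21st century. It uses mathematics, statistics, and computer science as research tools and takes biological macromolecules such as nucleic acids and proteins as its main objects of study. It collects, stores, transmits, retrieves, and analyzes these data in order to explore major theoretical questions such as the origin of life, biological evolution, and the nature of life. The research content of bioinformatics is rich and mainly includes sequence comparison, phylogenetic analysis, gene prediction, protein structure prediction, drug design, biochemical simulation, whole-genome analysis, RNA structure prediction, sequence contig assembly, and public databases and data formats. This dissertation studies sequence comparison and molecular evolution analysis. The main results are as follows.

In Chapter 2, based on the idea of the chaos game representation (CGR), we give 2-D graphical representation methods for RNA secondary structure sequences and for protein sequences. These methods avoid some defects of previously proposed graphical representation models of biological macromolecular sequences. We use them to analyze the similarity of different sequences and to construct a phylogenetic tree of protein sequences.

In Chapter 3, we introduce the curvature of curves smoothed by cubic spline functions into the similarity analysis of biological sequences and propose the curvature as a new measure. Taking the β-globin genes of 11 species and the coding sequence of each of their exons as examples, we analyze the similarities among them and construct phylogenetic trees. We also examine each exon separately and find that the second exon carries more biological information than the others. The method is accurate and computationally simple.

In Chapter 4, to avoid the imprecision of the approximate results obtained after smoothing in the previous chapter, we propose a difference form of torsion. We use it as a new descriptor to characterize the TOPS strings of protein sequences, analyze the similarity of 34 TOPS strings, and compare the outcome with results obtained using Clustal X, with fairly good agreement. This method is likewise accurate and computationally simple.

In Chapter 5, rather than considering a single characteristic quantity of a curve, we combine the curvature and the torsion into one new measure for analyzing the similarity of DNA sequences. Applying this method to the β-globin genes of 11 species and each of their exons gives fairly good results. We also use it to analyze the relationships among various coronaviruses and to construct their phylogenetic tree. Finally, we compare it with the commonly used methods based on matrix invariants. In both running time and numerical results our method compares favorably: the procedure is simple and the computation is fast.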
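As a point of reference for the CGR idea mentioned for Chapter 2, the sketch below shows the classic chaos game representation for a DNA string. It is not the dissertation's representation for RNA secondary structures or proteins, which the abstract does not specify. The corner assignment, the starting point, and the name `cgr_points` are assumptions made for illustration.

```python
# Classic chaos game representation (CGR) for a DNA string.
# Assumption: one common corner convention; other assignments are in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return one 2-D point per recognized nucleotide in `sequence`.

    Each point is the midpoint between the previous point and the
    corner assigned to the current nucleotide.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip symbols without a corner, such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


# Usage: each sequence becomes a point set in the unit square.
example = cgr_points("ATGGTGCAC")
```

Because every step halves the distance to a corner, sequences that share a long suffix end at nearby points, which is what makes the plot usable for visual comparison.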

【Abstract】 With the active development and completion of genome projects for several model organisms, and especially the completion of the Human Genome Project, the volume of biological data has grown at an unprecedented rate. As these data have increased, bioinformatics has emerged as a new interdisciplinary field and has developed rapidly. It is becoming one of the core areas of natural science in this century. It uses mathematics, statistics, and computer science as its tools and takes nucleic acids, proteins, and other biological macromolecules as its objects of study. The field focuses on how to collect, store, transfer, search, and analyze these data in order to explore the origin of life, biological evolution, the nature of life, and other major theoretical problems. The research area of bioinformatics is very wide. It includes sequence comparison, phylogenetic analysis, gene prediction, protein structure prediction, drug design, biochemical simulation, whole-genome analysis, RNA structure prediction, sequence assembly, public databases, and database formats. This dissertation mainly studies sequence comparison and phylogenetic analysis. The main results can be summarized as follows.

In Chapter 2, based on the idea of CGR, a 2-D graphical representation method for RNA secondary structure sequences and protein sequences is given, which avoids some limitations of earlier graphical representation models of biological sequences. These methods are used to analyze the similarity and dissimilarity of different sequences, and a phylogenetic tree of protein sequences is constructed.

In Chapter 3, we use the curvature of curves smoothed by cubic spline functions to analyze the similarity of DNA sequences and propose the curvature as a new invariant. The proposed method is tested on two real data sets: the coding sequences of the β-globin gene and all of their exons. We also find that, for the β-globin gene of 11 species, the second exon contains more information than the other two exons. The method is simple and accurate.

In Chapter 4, to avoid imprecise approximate results, we propose a difference form of torsion. The torsion is then used as a new descriptor to numerically characterize TOPS strings. Our analysis of 34 TOPS strings indicates that TOPS strings can be introduced into evolutionary analysis successfully. This method is also simple and accurate.

In Chapter 5, instead of considering only one characteristic of a curve, we combine the curvature and torsion of curves into one descriptor to numerically characterize DNA sequences. The new method is tested on three data sets: the coding sequences of the β-globin gene, all of their exons, and coronavirus genomes, for which we also construct a phylogenetic tree. For comparison, we apply the matrix invariant method to the same data. The comparison shows that our method is faster and gives better results.
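To make the curvature and torsion descriptors of Chapters 3 to 5 concrete, here is a minimal sketch that estimates both quantities along a 3-D point sequence with central finite differences. It uses the textbook formulas κ = |r′ × r″| / |r′|³ and τ = (r′ × r″) · r‴ / |r′ × r″|². The dissertation's own difference form of torsion and its mapping from sequences to curves are not given in the abstract, so the stencils and the function name `curvature_torsion` are assumptions, and the points are assumed to be sampled at evenly spaced parameter values.

```python
import math


def _cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    """Dot product of two 3-D vectors."""
    return sum(x * y for x, y in zip(a, b))


def curvature_torsion(points):
    """Estimate (curvature, torsion) at interior points 2 .. n-3.

    Derivatives are taken with respect to the point index using central
    differences; both quantities are independent of parameterization,
    so no step size is needed.
    """
    results = []
    for i in range(2, len(points) - 2):
        pm2, pm1, p0, pp1, pp2 = points[i - 2:i + 3]
        d1 = [(b - a) / 2.0 for a, b in zip(pm1, pp1)]
        d2 = [a - 2.0 * b + c for a, b, c in zip(pm1, p0, pp1)]
        d3 = [(e - 2.0 * d + 2.0 * b - a) / 2.0
              for a, b, d, e in zip(pm2, pm1, pp1, pp2)]
        c = _cross(d1, d2)
        c_norm2 = _dot(c, c)
        speed = math.sqrt(_dot(d1, d1))
        # Degenerate (straight or stationary) stretches get 0 by convention.
        kappa = math.sqrt(c_norm2) / speed ** 3 if speed > 0 else 0.0
        tau = _dot(c, d3) / c_norm2 if c_norm2 > 0 else 0.0
        results.append((kappa, tau))
    return results


# Usage: a unit helix (cos t, sin t, t) has curvature and torsion 1/2,
# so both estimates should come out close to 0.5.
helix = [(math.cos(0.01 * k), math.sin(0.01 * k), 0.01 * k) for k in range(50)]
descriptors = curvature_torsion(helix)
```

Two sequences mapped to curves could then be compared through the distance between their descriptor lists, but the choice of distance is left open here because the abstract does not state one.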
