Relative Character Analysis and Burrows-Wheeler Methods for the Biological Sequence (生物序列的相对特征分析及Burrows-Wheeler方法)

【Author】 Yang Lianping (杨连平)

【Advisor】 Wang Tianming (王天明)

【Author information】 Dalian University of Technology, Applied Mathematics, 2011, PhD

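【Abstract (translated from the Chinese)】 With the arrival of the post-genomic era, and faced with a large number of completely sequenced genomes and the many questions they raise, researchers expect low-cost tools for comparative sequence analysis to analyze and predict the structure and function of sequences more accurately and more quickly, thereby reducing the high cost in time and money of experimental determination and analysis. This dissertation is devoted to biological sequence analysis and proposes comparative analysis models with distinctive features. Sequence comparison is usually divided into two classes of models: alignment models and alignment-free models. Viewing comparison models instead in terms of the topological framework of the comparison workflow, this dissertation proposes dividing them into character analysis models and relative character analysis models. Alignment models and comparison models based on information compression both belong to the relative character analysis models. In relative character analysis models, the similarity hypothesis is a core component, and analyzing it reveals the main strengths and weaknesses of a model. This dissertation focuses on two classes of relative character analysis models: comparison models based on substrings common to the sequences, and the Burrows-Wheeler method. The common-substring comparison model proposed here is derived from the relationship between the longest common substrings and the shortest absent words. Its main features are: the algorithm has linear time complexity and is therefore suited to very long genomes; its local distance measure analyzes local similarity between genomes well, even when the region considered contains recombination of some fragments; and the cumulative local distance derived from the local distance measure can also effectively analyze the overall similarity of genomes. We verified the effectiveness of the model by studying subtype classification of complete HIV-1 genomes and their fragments. The Burrows-Wheeler method is the other class of relative character analysis model studied in depth here. Its theory rests mainly on an important invertible transform from lossless compression theory, the Burrows-Wheeler transform. The extended Burrows-Wheeler transform built on it can effectively analyze the content of factors shared between sequences. This dissertation proposes a concept called the Burrows-Wheeler similarity distribution and uses it to describe the similarity between sequences. On this basis, we extract two numerical characteristics of the Burrows-Wheeler similarity distribution, the expectation and the information entropy, and adopt different strategies for comparing gene sequences, protein sequences, and their structure sequences according to the characteristics of each.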

【Abstract】 With the arrival of the post-genomic era, we face a vast number of completely sequenced genomes and many kinds of open questions. Inexpensive sequence analysis tools are expected to analyze and predict the structure and function of biological sequences faster and more accurately, reducing the high cost in time and money of experimental methods. This dissertation focuses on biological sequence analysis and proposes comparison models with distinctive features. Traditionally, sequence analysis tools are divided into two kinds: alignment models and alignment-free models. We instead classify the models by the topological structure of the basic comparison framework into two categories: character analysis and relative character analysis. Models based on alignment and models based on text compression are all relative character analysis models. The core of a relative character analysis model is its similarity hypothesis, and examining that hypothesis reveals the main merits and demerits of the model. The dissertation discusses two kinds of relative character comparison models, based on common substrings and on the Burrows-Wheeler method respectively. The common-substring model is derived by investigating the relationship between the longest common substrings and the shortest absent words. Its advantages are: the time complexity is linear, which suits the analysis of very large genomes; the local distance measure derived from the model can be used to find similar regions between genomes, even when those regions contain some gene recombination; and the local distance yields a cumulative local distance that can be used to analyze overall similarity efficiently. The validity of the model is confirmed by classifying the subtypes of complete HIV-1 genomes and their segments. Burrows-Wheeler methods are the other kind of relative character method. Their foundation is the invertible Burrows-Wheeler transform, which has important applications in lossless compression. The extended Burrows-Wheeler transform is the key generalization for the comparison framework and can measure the content of the common factors shared between biological sequences. We propose a concept called the Burrows-Wheeler similarity distribution to represent the similarity between sequences. Two numerical characteristics of this distribution, the expectation and the entropy, are then computed to compare different kinds of biological sequences, with strategies chosen according to the features of gene, protein, or structure sequences.
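
To make the invertibility of the Burrows-Wheeler transform concrete, here is a minimal Python sketch of the textbook single-sequence transform and its inverse. It is not the dissertation's implementation and does not cover the extended transform or the similarity distribution. The function names `bwt` and `inverse_bwt` and the `$` end marker are assumptions made for this illustration, and sorting all rotations is a naive construction chosen for readability rather than speed.

```python
def bwt(text: str, end: str = "$") -> str:
    """Return the last column of the sorted rotations of text + end."""
    if end in text:
        raise ValueError("end marker must not occur in the input")
    s = text + end
    # Naive construction: build every rotation, sort them, keep the last symbols.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, end: str = "$") -> str:
    """Recover the original text from the last column produced by bwt()."""
    n = len(last)
    # A stable sort of positions by symbol pairs the k-th occurrence of each
    # symbol in the first column with its k-th occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    # The row whose last symbol is the end marker is the unrotated text + end.
    row = last.index(end)
    out = []
    for _ in range(n):
        row = order[row]        # move to the rotation shifted left by one symbol
        out.append(last[row])   # its last symbol is the next symbol of the text
    return "".join(out)[:-1]    # drop the trailing end marker


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The end marker makes the transform uniquely invertible. The inverse relies on Python's `sorted` being stable, so equal symbols keep their relative order.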
