Analysis of Coding Features of DNA Sequences Based on Error-Correction Coding Theory

【Author】 Liu Xiao (刘晓)

【Advisor】 Tian Fengchun (田逢春)

【Author information】 Chongqing University, Circuits and Systems, 2010, Ph.D.

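【Abstract (translated from the Chinese)】 Modern biological research is no longer confined to a single discipline; it is interdisciplinary and integrative. Because biological systems are inherently complex, a variety of analytical theories and research methods must be applied to them. The rapid growth of gene data produced by genetic engineering has stimulated interest in analyzing these data with new methods, techniques, and tools. Because the transmission and coding of information in biological systems resemble the transmission and coding of information in modern communication systems, error-correction coding theory from communication engineering has been applied to the study of genetic sequences and to the design of test systems, with some encouraging progress. This thesis studies methods for analyzing the information in biological systems on the basis of error-correction coding theory and analyzes the sequences of a number of organisms, with the aim of finding new approaches for applying error-correction coding theory in biology. The work is as follows:

1. Given the important role of the triplet codon in the expression of genetic information, the codon (rather than the single base) is taken as the basic unit of genetic information, and the interaction between adjacent codons is taken into account. Drawing on the design and analysis methods for block-code encoding models in communication coding theory, a (6,3) block code model was selected by experiment. Twelve prokaryotes and nine eukaryotes with different GC contents were chosen as analysis objects. Their DNA sequences were analyzed with the (6,3) block code model, and code distance was used as the characteristic parameter for comparison with their biological features. The results show that, for both the prokaryotic and the eukaryotic objects, the average code distance changes markedly near the start codon and near the stop codon, and also in the SD (Shine-Dalgarno) region of the prokaryotes.

2. In error-correction coding, the convolutional code is a channel coding scheme with good performance. Both theory and practice have shown that convolutional codes perform at least as well as block codes, so it should be possible to find a better, convolutional model for analyzing the coding features of DNA sequences. With reference to the block code analysis method and its results, and drawing on the design and analysis methods for convolutional encoding models, a (6,3,1) convolutional code analysis model was designed on the basis of codon degeneracy, codon context dependence, and the dominance of short-range correlation between bases, with the codon as the basic information unit. The DNA sequences of the same twelve prokaryotes and nine eukaryotes were analyzed with the (6,3,1) model. The results again show marked changes in the average code distance near the start codon and near the stop codon for both groups, and in the SD region of the prokaryotes. In addition, the average code distance curves of all objects show a clear period-3 property in the coding region. Because the average code distance curves of objects with different GC contents are observed to separate (especially for prokaryotes), a new parameter was defined in the experiments: the characteristic average code distance (CACD). It is correlated with GC content and is approximately proportional to the GC content of prokaryotes. This gives a coding parameter a biological meaning and indicates that the convolutional code model has potential for further study and application in bioinformatics. Because the models are designed from general properties of genetic information, they do not depend on the object analyzed and can be applied to many kinds of objects without adjustment.

3. Focusing on the convolutional-code-based models, the model parameters were compared on the basis of the dominance of short-range correlation between bases. Since common analysis methods take the single base as the basic information unit, a (2,1,1) convolutional code model was selected for analysis. A (3,2,1) convolutional code model was selected as an intermediate case for comparison. By comparing parameters such as the encoder output length and the code length used for code distance calculation, the (6,3,1), (3,2,1), and (2,1,1) models were provisionally identified as the better-performing analysis models.

4. The analysis models based on error-correction coding were applied to sequence similarity analysis. The (6,3,1), (3,2,1), and (2,1,1) convolutional code models were used to analyze the similarity/dissimilarity of the coding sequences of the first exon of the β-globin gene of 11 species (human, goat, opossum, chicken, lemur, mouse, rat, rabbit, cattle, gorilla, and chimpanzee). An 8-component vector was constructed from the normalized leading eigenvalues of the L/L and M/M matrices, and the Euclidean distance between the end points of each pair of vectors was computed. The results reflect the strong similarity among the three primates (human, chimpanzee, and gorilla) that follows from their evolutionary relationship, and their weak similarity to the opossum (the species most remote from the other extant mammals) and the chicken (the only non-mammal among the objects). The results indicate that the proposed method can reflect important information in the DNA sequences analyzed.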
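The use of code distance as a characteristic parameter in item 1 can be illustrated with a minimal sketch. The abstract does not give the generator matrix, the mapping from bases to code symbols, or the window step of the (6,3) block code model, so the matrix `G`, the purine/pyrimidine mapping `BASE_TO_BIT`, and the codon-sized step below are all hypothetical choices made only for illustration; they are not the thesis's model.

```python
from itertools import product

# Hypothetical systematic generator matrix G = [I | P] of a binary (6,3) code.
# The thesis's actual generator is not stated in the abstract.
G = [
    (1, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 1),
    (0, 0, 1, 1, 0, 1),
]

# Hypothetical base-to-bit mapping (purine = 0, pyrimidine = 1).
BASE_TO_BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def codewords(generator):
    """Enumerate all 2^k codewords of the binary linear code spanned by the rows."""
    k = len(generator)
    n = len(generator[0])
    words = []
    for message in product((0, 1), repeat=k):
        word = tuple(
            sum(m * row[j] for m, row in zip(message, generator)) % 2
            for j in range(n)
        )
        words.append(word)
    return words


CODEBOOK = codewords(G)


def code_distance(window):
    """Minimum Hamming distance between a 6-base window and any codeword."""
    bits = tuple(BASE_TO_BIT[base] for base in window)
    return min(sum(x != y for x, y in zip(bits, word)) for word in CODEBOOK)


def distance_profile(sequence, n=6, step=3):
    """Code distance of each length-n window, advancing one codon (3 bases) at a time."""
    return [
        code_distance(sequence[i:i + n])
        for i in range(0, len(sequence) - n + 1, step)
    ]
```

Averaging such profiles over many genes aligned at the start or stop codon would give a curve of the kind the abstract calls the average code distance; how the thesis actually aligns and averages its sequences is not stated here.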

【Abstract】 Research in modern biology is interdisciplinary rather than confined to a single field. The complexity of biological systems requires a combination of theories and methods. The rapidly growing volume of data obtained from genetic engineering has aroused interest in studying biological systems as information transmission systems. Because information transmission and coding in biological systems resemble those in modern communication engineering, error-correction coding theory has been applied to the study of genetic sequences and to the design of biological test systems, with some encouraging progress. In this thesis we study methods for analyzing the information in biological systems on the basis of error-correction coding theory, and we analyze the sequences of a number of organisms. The aim is to explore new ways of applying communication coding theory in biology. The work is as follows:

1. Given the importance of codons in the expression of genetic information, the codon, rather than the single nucleotide, is treated as the basic unit of genetic information. Taking the interaction between adjacent codons into account, and using the design methods for block-code encoding models in communication coding theory as a reference, we selected a (6,3) block code model for the analysis. DNA sequences of twelve prokaryotic and nine eukaryotic organisms with different GC contents were analyzed with the (6,3) block code model. Code distance was used as the characteristic parameter and compared with the corresponding biological features. The average code distance changes markedly near the initiation codon and the termination codon. Marked changes also appear in the SD (Shine-Dalgarno) region of the prokaryotic organisms.

2. Convolutional codes have been shown, in theory and in practice, to perform at least as well as block codes, which motivated us to look for a better, convolutional code model for analyzing DNA sequences. Drawing on convolutional encoding models and on the results of our block code model, we designed a (6,3,1) convolutional code model based on the degeneracy of codons, the context dependence of codons, and the dominance of short-range correlation between bases, with the codon as the information unit. We then analyzed the DNA sequences of the same twelve prokaryotic and nine eukaryotic organisms with the (6,3,1) convolutional code model. Again the average code distance changes markedly near the initiation codon and the termination codon, and in the SD region of the prokaryotic organisms. We also observe a clear period-3 feature in the coding region of all objects. We defined a new parameter, the characteristic average code distance (CACD), to describe the separation of the average code distance curves of objects with different GC contents (especially for prokaryotic organisms). CACD is correlated with GC content and is approximately proportional to the GC content of prokaryotic organisms, so this code parameter carries biological information. This suggests that the model deserves further study and use in bioinformatics. Because the models are built on general features of genetic information, they are species-independent and can be applied to various kinds of objects without adjustment.

3. Focusing on the convolutional code models, we compared model parameters on the basis of the dominance of short-range correlation between bases. Treating the single nucleotide as the information unit, as is commonly done, we selected a (2,1,1) convolutional code model, and we selected a (3,2,1) model as an intermediate case. We compared the encoder output length and the code length used for code distance calculation, and provisionally concluded that the (6,3,1), (3,2,1), and (2,1,1) models give good results.

4. The analysis models based on error-correction coding theory were applied to the similarity analysis of DNA sequences. We studied the similarities/dissimilarities among the coding sequences of the first exon of the β-globin gene of 11 species (human, goat, opossum, chicken (Gallus), lemur, mouse, rabbit, rat, gorilla, bovine, and chimpanzee) with the (6,3,1), (3,2,1), and (2,1,1) convolutional code models. We constructed 8-component vectors whose components are the normalized leading eigenvalues of the L/L and M/M matrices. The Euclidean distances between the end points of these vectors show that the three primates (human, chimpanzee, and gorilla) are strongly similar to one another because of their evolutionary relationship, whereas opossum (the species most remote from the remaining mammals) and chicken (the only non-mammalian representative) are only weakly similar to the others. The results indicate that the approach can reflect important information in the DNA sequences considered.
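The final comparison step in item 4, Euclidean distances between the end points of the 8-component vectors, can be sketched as follows. This is only an illustration of the distance computation: the species labels and vector values below are placeholders, not data from the thesis, and the construction of the L/L and M/M matrices and their eigenvalues is not reproduced because the abstract does not describe it.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between the end points of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(vectors):
    """Map each unordered pair of labels (sorted alphabetically) to its distance."""
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Placeholder 8-component vectors (NOT values from the thesis).
example_vectors = {
    "species_a": (1.0,) * 8,
    "species_b": (1.0,) * 8,
    "species_c": (0.0,) * 8,
}

example_distances = pairwise_distances(example_vectors)
```

A smaller distance between two vectors is read as greater similarity between the two sequences, which is how the abstract interprets the primate cluster and the opossum and chicken outliers.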

  • 【Online publication contributor】 Chongqing University
  • 【Online publication issue】 2010, No. 12