
Recognition of Protein-coding Genes and Genomic Analysis of Prokaryotic and Eukaryotic Genomes

【Author】 Chen Lingling (陈玲玲)

【Supervisor】 Zhang Chunting (张春霆)

【Author information】 Tianjin University, Biophysics, 2004, PhD

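【Abstract (translated from the Chinese)】 With the rapid progress of the human, model-organism, and microbial genome projects, the complete genomes of nearly two hundred free-living organisms have been sequenced, and the number of bases in the three major international nucleotide sequence databases is growing exponentially. Once a genome has been sequenced, finding its protein-coding genes is the first step of genome analysis and occupies a very important place in bioinformatics research. This thesis is mainly devoted to recognizing protein-coding genes in prokaryotes, eukaryotes, and coronaviruses, and to genome analysis. The first part introduces the background and main research topics of bioinformatics, the structural features of prokaryotic and eukaryotic genes, the main algorithms for recognizing protein-coding genes, and the Z curve theory of DNA sequences and its applications. Because the Z curve theory is the main tool used here to analyze prokaryotic and eukaryotic genomes, it is described in some detail. The second part covers gene recognition and analysis in prokaryotes and coronaviruses. First, we propose a method that trains its parameters on the well-annotated known genes of a bacterial or archaeal genome and then identifies, among the poorly annotated ORFs, those that probably do not code for proteins; on this basis we developed ZCURVE_C, a gene recognition program for bacteria and archaea, and provide it as a web service. We also find that genomic GC content matters more than phylogenetic relatedness for gene recognition in bacteria and archaea. Second, exploiting the small number of parameters required by the Z curve method, we developed ZCURVE_CoV, a gene recognition program designed for coronaviruses (SARS coronavirus in particular), and used position weight matrices to predict the cleavage sites of the 3C-like and papain-like proteinases, producing a new version that predicts proteinase cleavage sites in coronavirus polyproteins. The third part covers eukaryotic gene recognition and genome structure analysis. First, using the windowless technique based on the Z curve, we analyzed the isochore structure of the Arabidopsis thaliana genome and plotted the Z' curves of its five chromosomes. Two isochores found on chromosome 2 are analyzed in detail: one lies in the nucleolar organizer region and the other is a mitochondrial DNA insertion, whose size and chromosomal position we can determine precisely. Second, we developed Zcurve_E, an ab initio eukaryotic gene recognition program based on the Z curve method. It focuses on extracting the global statistical features of protein-coding sequences at the three codon positions and has the advantages of few parameters and broad applicability. Used together with Genscan, one of the best-performing programs currently available, Zcurve_E can partly reduce Genscan's false-positive rate and give better recognition results.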
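
The Z curve and Z' curve mentioned in this abstract can be illustrated with a minimal Python sketch. It uses the standard cumulative Z curve definitions and assumes that the Z' curve subtracts a least-squares line through the origin from the z component. The function names `z_curve` and `z_prime` are hypothetical, and nothing here is taken from the thesis's own software.

```python
from typing import List, Tuple


def z_curve(seq: str) -> Tuple[List[int], List[int], List[int]]:
    """Cumulative Z curve components (x_n, y_n, z_n) for n = 1..N."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    xs, ys, zs = [], [], []
    for base in seq.upper():
        if base in counts:  # ambiguous bases such as N add nothing to the counts
            counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        xs.append((a + g) - (c + t))  # purines vs pyrimidines
        ys.append((a + c) - (g + t))  # amino vs keto bases
        zs.append((a + t) - (g + c))  # weak vs strong hydrogen bonding
    return xs, ys, zs


def z_prime(zs: List[int]) -> List[float]:
    """z'_n = z_n - k*n, where k is the slope of the fitted line z = k*n.

    Assumption: the line is fitted through the origin by least squares.
    """
    n_values = range(1, len(zs) + 1)
    denom = sum(n * n for n in n_values)
    if denom == 0:
        return []
    k = sum(n * z for n, z in zip(n_values, zs)) / denom
    return [z - k * n for n, z in zip(n_values, zs)]


if __name__ == "__main__":
    # Toy sequence: an AT-only half followed by a GC-only half.
    _, _, zs = z_curve("AATTGGCC")
    print(z_prime(zs))
```

On the toy sequence the z' values rise through the AT-only half and fall through the GC-only half, so the turning point marks the boundary between the two compositions. No sliding window is involved, which is the sense in which the technique is windowless.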

【Abstract】 The rapid pace of the human and other model-organism genome-sequencing projects has provided a large quantity of genome data, which creates a great need for automatic genome annotation. One of the important tasks of annotation is to recognize protein-coding genes in prokaryotic and eukaryotic genomes. This thesis describes new approaches, based on the Z curve method, for recognizing protein-coding genes in bacterial, archaeal, coronavirus, and eukaryotic genomes. The first part introduces the development of bioinformatics and the progress of computational gene-finding algorithms. It also presents the Z curve theory, which is the basic tool used here to analyze prokaryotic and eukaryotic genomic sequences. The second part proposes algorithms for recognizing protein-coding genes in prokaryotic genomes. Since false-positive predictions always exist in the annotation of microbial genomes, it is essential to confirm which ORFs are coding and which are not. Starting from the known genes in the annotation file, we describe a method based on the Z curve theory to recognize protein-coding genes among questionable ORFs. The average recognition accuracy over 57 bacterial and archaeal genomes is greater than 99%. A computer program, ZCURVE_C, has been developed and is provided as a web service. We also find that the genomic GC content of bacterial and archaeal genomes is more important than phylogenetic lineage in gene recognition. Finally, a new program for recognizing genes in coronavirus genomes, especially suitable for SARS-CoV genomes, is proposed. The improved system, ZCURVE_CoV 2.0, can predict the cleavage sites of viral proteinases in coronavirus polyproteins. The third part analyzes the genome structure of Arabidopsis thaliana and develops an ab initio eukaryotic gene recognition program. Using a windowless technique based on the Z curve method, the isochore structure of the Arabidopsis thaliana genome has been explored. The position and size of an isochore formed by a mitochondrial DNA insertion have been precisely predicted. Its amino acid usage and codon preference differ from those of genes in other regions. Furthermore, a new ab initio gene-finding program for eukaryotic organisms, Zcurve_E, is proposed in this part. The new algorithm addresses global statistical features of protein-coding sequences by taking the frequencies of bases at the three codon positions into account. Consequently, it gives better consideration to both typical and atypical cases. Compared with other gene-finding software, the present program has the merits of simplicity, universality, and reliability. Joint application of Zcurve_E with Genscan, which is probably the best software currently available for gene recognition in eukaryotic genomes, may lead to better results than either program alone.
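
The idea of describing a coding sequence by the frequencies of bases at the three codon positions can be illustrated with a short Python sketch. It applies the Z curve transform separately to each codon position, giving nine numbers per sequence. The function name `codon_position_features` is hypothetical, and the feature set actually used by Zcurve_E or ZCURVE_C may contain further variables not shown here.

```python
from typing import List


def codon_position_features(orf: str) -> List[float]:
    """Nine Z curve features: (x, y, z) for codon positions 1, 2 and 3.

    Assumes `orf` is already in reading frame, so index 0 is a first
    codon position. Bases other than A, C, G, T are ignored.
    """
    seq = orf.upper()
    features: List[float] = []
    for pos in range(3):
        bases = seq[pos::3]  # every base at this codon position
        total = sum(bases.count(b) for b in "ACGT")
        if total == 0:
            raise ValueError("no unambiguous bases at codon position %d" % (pos + 1))
        a, c, g, t = (bases.count(b) / total for b in "ACGT")
        features.append((a + g) - (c + t))  # x: purine vs pyrimidine
        features.append((a + c) - (g + t))  # y: amino vs keto
        features.append((a + t) - (g + c))  # z: weak vs strong H-bond
    return features


if __name__ == "__main__":
    # Toy in-frame sequence; real input would be a candidate ORF or exon.
    print(codon_position_features("ATGGCCAAGTAA"))
```

Each candidate sequence maps to a point in a nine-dimensional space, and a classifier trained on known coding and non-coding sequences can then separate the two groups. The choice of classifier is left open here because the abstract does not specify it.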

  • 【Online publication contributor】 Tianjin University
  • 【Online publication year and issue】 2004, Issue 04