Combining Multiple Sources of Evidence to Improve Eukaryotic Gene Structure Prediction (组合多重证据促进真核生物基因结构预测)

Improving Gene Structure Prediction by Combining Multiple Sources of Evidence

【Author】 Li Xiao (李校)

【Advisor】 Zhang Yizheng (张义正)

【Author information】 Sichuan University, Genetics, 2007, Ph.D.

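【Abstract (translated from the Chinese)】 The launch of the Human Genome Project signaled that modern biology had entered the era of omics. At present, genome sequencing has been completed or is under way for nearly 2,000 species. The genome sequence is the genetic and material basis of all the life activities of a species, and the first step in interpreting and understanding it is to annotate completely the genes that encode proteins. Many kinds of evidence can support genome annotation, including expressed sequence tags (ESTs), homologous proteins, the output of gene prediction software, and segments conserved between closely related species. These different types of evidence complement one another, but they also conflict. Manual genome annotation relies mainly on comparing ESTs with the genome sequence to produce a reliable annotation. However, manual annotation is costly in time and money, and the size and quality of the EST data strongly affect the completeness of the annotation. Computational gene prediction can provide an inexpensive, complementary initial annotation. It relies mainly on statistical machine learning methods, and although it has made major progress over the past 20 years, some problems remain to be solved. When applied to large-scale genomic sequences, current gene prediction programs still produce too many false positives, and for newly sequenced species that lack training data they give highly inaccurate results. This thesis presents a score-based method that combines different types of evidence to produce a representative genome annotation. The combined evidence comprises alignments to EST and protein databases and the output of four computational gene predictors (Genscan, Augustus, Fgenesh, and Geneid). First, nonparametric estimation is used to transform the raw score of each piece of evidence so that the transformed score accurately reflects how far that evidence can be trusted. We tested four nonparametric estimation methods (the empirical distribution, piecewise linear functions, kernel density estimation, and local polynomial estimation), and the results show that local polynomial estimation is the most reliable transformation. Next, all the evidence is combined and normalized using Dempster-Shafer evidence theory together with a voting method. Finally, dynamic programming assembles all the evidence into a complete eukaryotic gene structure. Because assembling gene structures by dynamic programming does not depend on training data, the method is also suitable for newly sequenced species. Based on this algorithm we developed a eukaryotic gene structure prediction program named SCGPred (Score-based Combinational Gene Predictor). The program is written in Perl and is open source. This thesis describes the implementation of the combination algorithm in detail and evaluates the program on three large datasets. Two of them (the complete human chromosome 22 and the ENCODE sequence set) are used to evaluate the supervised method, and the complete Ustilago maydis genome is used to evaluate the unsupervised method. The results show that, compared with other gene prediction programs, our method is considerably better in both sensitivity and precision, especially at the exon level. We also show that, when applied to newly sequenced species, our method outperforms other unsupervised methods. Besides protein-coding genes, recent research has found a class of genes that encode microRNAs. MicroRNAs bind mRNAs (usually those of transcription factor genes) through base complementarity and either block translation of the mRNA or trigger its degradation, so they constitute an important post-transcriptional regulatory mechanism. By comparing the Arabidopsis and rice genomes and incorporating RNA secondary structure analysis, we predicted 96 Arabidopsis microRNAs and showed that these microRNAs take part in multiple metabolic and genetic pathways by binding transcription factor mRNAs.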
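To make the evidence-combination step concrete, the following is a minimal sketch of Dempster's rule of combination for two sources. It is not SCGPred's code (SCGPred is written in Perl), and the hypothesis names, the example masses, and the function name `combine` are illustrative assumptions; the abstract does not state how SCGPred assigns masses or how the voting step enters.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing (overlapping) hypotheses reinforce their intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

# Hypothetical frame of discernment for one candidate region.
EXON = frozenset({"exon"})
NOT_EXON = frozenset({"not_exon"})
EITHER = EXON | NOT_EXON  # mass left on "unknown"

# Made-up masses for two evidence sources (assumed, not from the thesis).
predictor_a = {EXON: 0.6, EITHER: 0.4}
predictor_b = {EXON: 0.5, NOT_EXON: 0.2, EITHER: 0.3}

belief = combine(predictor_a, predictor_b)
for hypothesis, mass in belief.items():
    print(sorted(hypothesis), round(mass, 4))
```

In this toy case the two sources mostly agree, so the combined mass on "exon" ends up higher than either source assigned on its own, which is the behavior that makes the rule attractive for pooling partially reliable predictors.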

【Abstract】 The Human Genome Project (HGP) marked the start of an "omics" era in molecular biology. To date, the genomes of approximately 2,000 organisms have been sequenced or are being sequenced. The first stage in interpreting and annotating genomic data is to list the protein-coding genes and determine the exact exon-intron structure of every gene. Many sources can provide evidence for genome annotation, including expressed sequence tags (ESTs), homologous proteins, computational gene predictions, and sequence conservation among closely related organisms. Evidence from multiple sources is complementary but can also conflict. Although some model species have been annotated by manual curators, manual annotation is time-consuming and expensive, and it is limited to the genomes of model species. Therefore, computational gene finding, as the only solution, has been used to produce an initial annotation, especially for most newly sequenced species. Computational gene prediction has progressed considerably in the last few years in both methods and prediction accuracy, but the task remains a significant challenge, especially in eukaryotes, where coding exons are usually separated by introns of varying length. Current gene predictors can produce many false positives when applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because there is no training set of abundant validated genes. In this thesis, we present a score-based method for predicting eukaryotic gene structures by combining multiple pieces of evidence generated from a diverse set of sources. The evidence includes the predictions of four leading ab initio gene finders (Genscan, Augustus, Fgenesh, and Geneid) and alignments to EST and protein databases.
First, the raw scores of the evidence are transformed by nonparametric estimation into probabilistic scores that reflect the likelihood that each piece of evidence is correct. We tested four methods (empirical distribution, piecewise linear function, kernel density estimation, and local polynomial regression) and found that local polynomial regression is the best method for score transformation. The evidence is then integrated and normalized using the Dempster-Shafer theory of evidence and a voting algorithm. Lastly, the normalized evidence is combined into a frame-consistent gene model by dynamic programming. Because dynamic programming is an unsupervised method, it can be used to predict genes in newly sequenced organisms. Based on the models and algorithm described above, we designed a program named SCGPred (Score-based Combinational Gene Predictor). SCGPred is written in Perl and is open source under a GNU license. SCGPred can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. In tests on three datasets of large DNA sequences, from human (chromosome 22 and the ENCODE sequence set) and from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over the best ab initio gene predictors. We also demonstrate that SCGPred significantly improves prediction in novel genomes by combining several foreign gene finders with similarity alignments, and that it is superior to other unsupervised methods. SCGPred can therefore serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. Besides protein-coding genes, there is a large class of genes that encode microRNAs. MicroRNAs, an abundant class of tiny non-coding RNAs, have emerged as negative regulators that act in plants and animals through translational repression or cleavage of target mRNAs, mediated by complementary base pairing.
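The dynamic-programming assembly step mentioned in this abstract can be illustrated with a simplified sketch: given candidate exons with already-normalized scores, choose the non-overlapping set with the highest total score. This is a standard weighted-interval chaining, offered as an assumption about the general shape of such a step; it ignores reading-frame consistency and splice-site rules, which the thesis's method does handle, and the function name and candidate values are hypothetical.

```python
from bisect import bisect_left

def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain), with the chain in genomic order.
    """
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimum using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # p = number of earlier exons that end strictly before this one starts.
        p = bisect_left(ends, start, 0, i)
        take_score = best[p][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[p][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]

# Made-up candidates: two overlapping pairs with different scores.
candidates = [
    (100, 200, 0.9),
    (150, 260, 0.6),
    (300, 400, 0.8),
    (380, 450, 0.5),
]
total, chain = best_exon_chain(candidates)
print(chain)
```

Because the recurrence only needs scores and coordinates, nothing in it is trained on known genes, which matches the abstract's point that this step carries over to newly sequenced organisms.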
By searching for short complementary sequences between transcription factor open reading frames and intergenic regions, and by considering RNA secondary structure and sequence conservation between the genomes of Arabidopsis and Oryza sativa, we detected 96 candidate Arabidopsis microRNAs. These candidates were predicted to target 102 transcription factor genes classified into 28 transcription factor gene families, particularly DNA-binding transcription factor families, which implies that microRNAs might be involved in complex transcriptional regulatory networks that specify individual cell types in plant development.
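The first stage of the search described above, finding short intergenic segments that are complementary to a transcription factor open reading frame, might look roughly like the sketch below. It is a deliberately reduced illustration: it finds exact reverse-complement matches only, whereas real microRNA-target pairing tolerates some mismatches and G:U wobble pairs, and it omits the secondary-structure and cross-genome conservation filters. The function names, the window length `k`, and the toy sequences are assumptions, not values from the thesis.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def complementary_sites(orf, intergenic, k=20):
    """Find (orf_pos, intergenic_pos) pairs of exactly complementary k-mers.

    A hit means the k-mer starting at intergenic_pos is the reverse
    complement of the k-mer starting at orf_pos in the ORF.
    """
    # Index every k-mer of the ORF by its start positions.
    index = {}
    for i in range(len(orf) - k + 1):
        index.setdefault(orf[i:i + k], []).append(i)
    hits = []
    for j in range(len(intergenic) - k + 1):
        target = reverse_complement(intergenic[j:j + k])
        for i in index.get(target, []):
            hits.append((i, j))
    return hits

# Toy sequences; k is shortened to 8 so the example stays readable.
orf = "ATGGCCTTAGGCATCGA"
intergenic = "TTCCTAAGGCAA"
hits = complementary_sites(orf, intergenic, k=8)
print(hits)  # [(3, 2)]
```

Indexing the ORF k-mers once keeps the scan linear in the length of the intergenic sequence, which matters when many intergenic regions are screened against many open reading frames.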

  • 【Online publication contributor】 Sichuan University
  • 【Online publication year/issue】 2008, Issue 05