Research on DNA Signal Sequences Analysis for Gene Prediction (DNA信号序列分析的基因预测方法研究)
【Author】 Guo Shuo (郭烁);
【Advisor】 Zhu Yisheng (朱义胜);
【Author information】 Dalian Maritime University, Communication and Information Systems, 2010, PhD
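【Abstract (translated from the Chinese)】 Bioinformatics is one of the most promising sciences of the 21st century. It is devoted to interpreting the wealth of genetic information and to revealing and extracting the regularities within it, with the ultimate aim of a comprehensive understanding of life and its processes. The key to interpreting and understanding genome sequences is gene prediction, that is, identifying all functional units in a genome, including the DNA segments that encode proteins and other functional units. Because of the diversity of genes across organisms, the complexity of gene structure, and the relative youth of the discipline, existing biological recognition algorithms still have many problems in identification accuracy, computational load, and scope of application. To address these problems, this thesis studies three aspects of gene prediction:
1. Splice site prediction. Splice site identification is an important step in gene prediction. Because the Takagi-Sugeno (T-S) fuzzy model generalizes well, is robust, and has a simple structure, a T-S modeling method is proposed that combines fuzzy clustering based on a fuzzy likelihood function with least squares. T-S splice site prediction models are built separately from the statistical features of the sequences upstream and downstream of splice sites and from the way the base composition of those sequences varies with GC content, which effectively improves recognition accuracy. To improve accuracy further and reduce computation, an improved Bayesian splice site prediction model based on the composition and position information of the bases in a sequence is proposed. Drawing on kernel method theory, the algorithm introduces a Bayesian feature mapping: by mapping DNA sequences into a new feature space, a linear relationship is derived between the decision attribute and the logarithms of the condition attributes, the coefficients of this relationship are obtained by least squares, and a new Bayesian classifier is designed. Simulation results show that the algorithm is computationally efficient, structurally simple, and accurate, that it outperforms SVM-B and naive Bayes, and that it can handle structure identification on large volumes of DNA sequence data.
2. Protein coding region prediction. Identifying protein coding regions is an important research topic in gene prediction. This thesis proposes an integrated algorithm for identifying the precise position of exons. First, a binary support vector machine classifier is built from the conserved sequences of protein coding regions. Then, based on the "period-3 behavior" of the first base of a codon, the classifier outputs are analyzed with the short-time Fourier transform to identify the position of coding regions precisely. Because gene structure is complex and diverse, the positions of bases in a gene should be divided into three classes to improve accuracy. A binary support vector machine classifier cannot identify the position class of a base well, and a multi-class support vector machine has a relatively complex structure. A Takagi-Sugeno fuzzy model is therefore used to model the gene sequence; its output indicates whether the base at the center of the input window is a non-coding base, the first base of a codon in a coding region, or a non-first base of a codon in a coding region. The model outputs are then analyzed with the short-time Fourier transform to identify the position of coding regions precisely.
3. Human gene promoter prediction. Identifying eukaryotic gene promoters is a difficult part of gene prediction. This thesis proposes a promoter recognition method based on a model of the positional distribution density of oligonucleotides. First, a Gaussian mixture model (GMM) is used to model the positional distribution density of oligonucleotides in order to extract important motifs, which often play an important regulatory role for biological signals. The expectation-maximization (EM) algorithm estimates the GMM parameters, and fuzzy clustering guides the choice of the number of mixture components and the initial means, which helps ensure the accuracy of the GMM. Then, using the extracted oligonucleotide positional densities, a weighted Bayesian classifier based on least squares identifies human gene promoters. The algorithm has a low computational cost and is suitable for modeling massive data sets. To exploit the intrinsic signal features of promoter sequences more effectively and improve accuracy, a Bayesian feature mapping is proposed that projects the original promoter sequences into a high-dimensional space of oligonucleotide positional distribution densities; a new kernel function is constructed on this basis, and a least squares support vector machine model is built to identify human gene promoters. The feature transformation of the kernel combines the oligonucleotide composition information and the position information of promoter sequences, and reflects the actual transcriptional regulation mechanism well. The method generalizes well, and its computational cost is independent of the input dimension. The prediction method can be applied to several other biological problems.
Finally, the research work of the thesis is summarized and directions for future work are pointed out.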
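The period-3 analysis described in part 2 of the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the rectangular window of 90 bases, the step size, and the synthetic "classifier output" (high at every first codon position of a made-up coding region, low-level noise elsewhere) are all assumptions chosen for illustration. The sketch evaluates only the single short-time Fourier coefficient at frequency 1/3 for each sliding window.

```python
import cmath
import random


def period3_power(scores, window=90, step=3):
    """Return (window_start, power at frequency 1/3) for each sliding window.

    A rectangular window whose length is a multiple of 3 is used, so any
    constant offset in the scores cancels at this frequency.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    # One DFT basis vector at 1/3 cycles per base.
    kernel = [cmath.exp(-2j * cmath.pi * n / 3) for n in range(window)]
    result = []
    for start in range(0, len(scores) - window + 1, step):
        segment = scores[start:start + window]
        coeff = sum(s * k for s, k in zip(segment, kernel))
        result.append((start, abs(coeff) ** 2 / window))
    return result


def synthetic_scores(length=900, coding=(300, 600), seed=0):
    """Hypothetical classifier output: noise everywhere, plus a peak at
    every first codon position inside the assumed coding region."""
    rng = random.Random(seed)
    scores = [rng.uniform(0.0, 0.3) for _ in range(length)]
    for i in range(coding[0], coding[1], 3):
        scores[i] += 1.0
    return scores


if __name__ == "__main__":
    # Print a coarse profile; windows inside the coding region stand out.
    for start, power in period3_power(synthetic_scores(), step=30):
        print(f"window at {start:4d}: period-3 power {power:.3f}")
```

Thresholding this power profile gives a rough estimate of where the coding region begins and ends. A tapered window, or a full spectrogram, would be a reasonable refinement.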
【Abstract】 Bioinformatics is one of the most promising sciences of the 21st century. It is dedicated to interpreting genomic information, exploring the patterns hidden in genomes, and ultimately reaching a comprehensive understanding of life and its processes. The key to interpreting and understanding genome sequences is gene prediction, namely the identification of all functional units in the genome, including protein-coding DNA fragments and other functional units. Because of biodiversity and the large variation in gene structure, existing biological recognition algorithms still have many problems in accuracy, computational load, and scope of application. To deal with these problems, three aspects are studied as follows:
1. Research on splice site prediction. The recognition of splice sites is an important step in gene prediction. Because the Takagi-Sugeno (T-S) fuzzy model offers good generalization, robustness, and a simple structure, a T-S modeling algorithm based on least squares and fuzzy clustering with a fuzzy likelihood function is proposed. A GC-content-classified (high GC content and low GC content) modeling method is presented, based on the conserved signal sequences around splice sites and on the statistical observation that the base composition of the sequences upstream and downstream of a splice site depends on their GC content. This improves the identification accuracy. To improve the identification accuracy further and reduce computational complexity, an improved naive Bayesian splice site classifier is proposed that uses the composition and position information of the bases in the sequence. Based on kernel method theory, this method adopts a Bayesian feature function to map the sequences into a new feature space. A linear relationship between the logarithms of the condition attributes and the decision attribute is derived, and its coefficients are determined by the least squares method, which yields a new Bayesian classifier.
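The idea of mapping sequences to Bayesian log-probability features and then fitting linear coefficients by least squares can be sketched as follows. The abstract does not give the exact feature function or weighting formula, so everything specific here is an assumption: per-position log-odds features with pseudocounts, targets of +1 and -1, a small ridge term for numerical stability, and hypothetical toy data in which "sites" carry GT at fixed positions. With unit weights the score would reduce to an ordinary naive Bayes log-odds sum.

```python
import math
import random

BASES = "ACGT"


def position_log_odds(pos_seqs, neg_seqs, pseudo=1.0):
    """Per-position tables of log P(base | site) - log P(base | non-site)."""
    tables = []
    for j in range(len(pos_seqs[0])):
        table = {}
        for b in BASES:
            p = (sum(s[j] == b for s in pos_seqs) + pseudo) / (len(pos_seqs) + 4 * pseudo)
            q = (sum(s[j] == b for s in neg_seqs) + pseudo) / (len(neg_seqs) + 4 * pseudo)
            table[b] = math.log(p / q)
        tables.append(table)
    return tables


def feature_map(seq, tables):
    """Map a sequence to [1, log-odds at each position]; the leading 1 is a bias."""
    return [1.0] + [t[b] for t, b in zip(tables, seq)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_weights(features, targets, ridge=1e-3):
    """Least-squares weights from the ridge-stabilised normal equations."""
    d = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(d)]
    return solve(gram, rhs)


def score(seq, tables, weights):
    """Weighted sum of the mapped features; positive means 'site'."""
    return sum(w * f for w, f in zip(weights, feature_map(seq, tables)))


def toy_sites(n, seed=1, length=8):
    """Hypothetical data: 'sites' carry GT at positions 3-4, 'non-sites' are random."""
    rng = random.Random(seed)
    neg = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]
    raw = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]
    pos = [s[:3] + "GT" + s[5:] for s in raw]
    return pos, neg


if __name__ == "__main__":
    pos, neg = toy_sites(200)
    tables = position_log_odds(pos, neg)
    feats = [feature_map(s, tables) for s in pos + neg]
    targets = [1.0] * len(pos) + [-1.0] * len(neg)
    weights = fit_weights(feats, targets)
    for seq in ("ACAGTACC", "ACACAACC"):
        print(seq, round(score(seq, tables, weights), 3))
```

Training cost in this sketch grows linearly with the number of sequences, since both the count tables and the normal equations are accumulated in single passes over the data.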
Simulation results show that the computation time is directly proportional to the number of sequences and that the method has high classification accuracy. Its performance is better than that of SVM-B and the naive Bayesian classifier. The method is well suited to gene structure identification on large DNA sequence data sets.
2. Research on accurate localization of protein coding regions. The recognition of protein coding regions is an important research subject in gene prediction. An integrated algorithm for exon identification is proposed. First, based on the conserved sequences of DNA coding regions, a support vector machine classifier for the first nucleotide of a codon in coding regions is built. Then, exploiting the period-3 behavior of the first nucleotide of a codon, the output sequence of the classifier is analyzed with the short-time Fourier transform, so that the position of coding regions can be determined accurately. Because gene structure is complex and diverse, the positions of bases in a gene should be divided into three classes to improve identification accuracy. A binary SVM classifier cannot distinguish these positions well, and the structure of a multi-class SVM is complicated. A T-S fuzzy model is therefore used to model the gene sequence. Its single output indicates whether the nucleotide at the center of the input window belongs to a non-coding region, is the first nucleotide of a codon in a coding region, or is a non-first nucleotide of a codon in a coding region. The output sequence of the model is then analyzed with the short-time Fourier transform, and the position of coding regions can be determined accurately.
3. Research on human promoter prediction. The recognition of eukaryotic promoters is a difficult research subject in gene prediction. A promoter recognition algorithm based on a model of the positional densities of oligonucleotides is proposed.
First, a Gaussian mixture model (GMM) is adopted to model the positional densities of oligonucleotides in order to extract important motifs, which often play an important role in signal regulation. The expectation-maximization (EM) algorithm is used to estimate the parameters of the GMM. To improve modeling accuracy, the optimal number of GMM components and the initial means are determined through fuzzy clustering. From the resulting oligonucleotide positional densities, a weighted Bayesian classifier based on least squares is built to identify human promoters. The computational cost is small, so the algorithm is suitable for large DNA sequence data sets. To exploit the signal features of promoters more effectively and improve identification accuracy and efficiency, the original promoter DNA sequences are projected into the high-dimensional space of oligonucleotide positional densities using a Bayes feature mapping, a least squares support vector machine (LSSVM) is built on a new kernel function corresponding to this mapping, and human promoters are then identified by the LSSVM. Through the transformation of this kernel, both the content and the position information of oligonucleotides are integrated, which reflects the actual transcriptional regulation mechanism well. The algorithm generalizes well, and its computational cost is insensitive to the input dimension of the samples. These prediction methods can be generalized to several other biological problems.
Finally, the research work of this thesis is summarized, and directions for future work are pointed out.
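The positional density modeling step can be illustrated with a small one-dimensional GMM fitted by EM. This is a sketch under stated assumptions rather than the thesis method: the positions are synthetic (a tight cluster near -30 and a broad cluster near -150, relative to a hypothetical transcription start site), the number of components is fixed at two, and the initial means are passed in by hand, standing in for the fuzzy clustering step that the abstract uses to choose them. All function names are illustrative.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, init_means, iters=50, min_var=1.0):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    start_var = max(((max(xs) - min(xs)) / (2.0 * k)) ** 2, min_var)
    variances = [start_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each position.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean, and variance of each component.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            if mass < 1e-9:
                continue  # leave an empty component unchanged
            weights[c] = mass / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / mass
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / mass
            variances[c] = max(var, min_var)  # floor avoids collapse onto one point
    return weights, means, variances


def density(x, weights, means, variances):
    """Mixture density at position x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def synthetic_positions(seed=0):
    """Hypothetical occurrence positions of one oligonucleotide across promoters."""
    rng = random.Random(seed)
    tight = [rng.gauss(-30.0, 3.0) for _ in range(300)]
    broad = [rng.gauss(-150.0, 20.0) for _ in range(300)]
    return tight + broad


if __name__ == "__main__":
    xs = synthetic_positions()
    weights, means, variances = fit_gmm_1d(xs, init_means=[-200.0, 0.0])
    for w, m, v in zip(weights, means, variances):
        print(f"weight {w:.2f}  mean {m:7.1f}  sd {math.sqrt(v):5.1f}")
```

The fitted `density` values at the positions where an oligonucleotide occurs in a new sequence are the kind of quantity that a downstream classifier could take as features.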