蛋白质结构类与功能预测及物种亲缘分析问题的非线性方法研究

The Study of Nonlinear Methods for the Prediction of Protein Structural Classes and Functions and Phylogenetic Analysis

【Author】 Han Guosheng (韩国胜)

【Advisor】 Yu Zuguo (喻祖国)

【Author information】 Xiangtan University, Applied Mathematics, 2013, Ph.D.

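【Abstract (translated from the Chinese)】 With the continual progress of biotechnology and the deepening of bioinformatics research, biological data are growing exponentially every year. Relying solely on expensive and time-consuming biochemical experiments to analyze these massive data and the associated biological problems has become impractical. To meet this need, developing reliable and efficient computational methods and algorithms is urgent. This thesis mainly uses methods from nonlinear science as models to study several problems in protein structural class and function prediction and in the phylogenetic analysis of species. The specific work is as follows.

In Chapter 2, we study the structural class prediction of low-homology proteins. Based on predicted protein secondary structure information, we propose a new and simple kernel function method to predict protein structural classes. The secondary structure information is predicted by the popular protein secondary structure prediction tool PSIPRED. A linear kernel function is then constructed from secondary structure element alignment scores and used as a precomputed kernel to train a support vector machine classifier. Our method has no tunable parameters to train. Finally, the method is applied to two public low-homology training sets and achieves good classification performance. Compared with existing methods, ours not only improves the overall prediction accuracy but also shows higher accuracy in distinguishing the α+β class from the α/β class. This also indicates that the linear kernel based on secondary structure element alignment scores captures the similarity between protein secondary structure sequences better than statistical information derived from protein secondary structure.

In Chapter 3, we study protein subcellular localization. The subcellular location of a protein is closely related to its biological function. Amino acid composition is an important model for protein subcellular localization, but it ignores sequence-order information. To compensate for this shortcoming, we use recurrence quantification analysis and the Hilbert-Huang transform, which can extract recurrence patterns and information at different frequencies from time series, respectively. To apply these two methods, we convert each amino acid sequence into two time series using the hydrophobic free energy and solubility properties of amino acids. Together, the amino acid composition, recurrence quantification analysis, and Hilbert-Huang transform models produce 62 features, so each protein sequence is represented by a 62-dimensional feature vector. We rank these 62 features with the maximum relevance minimum redundancy method and again use an SVM as the classification model. The jackknife test is used to select the optimal feature subset and to evaluate the performance of the method. The method is tested on three apoptosis protein datasets, and the final results show that it achieves good prediction accuracy with relatively few features. This suggests that our method may complement existing methods.

In Chapter 4, we study protein subnuclear localization. Compared with subcellular localization, subnuclear localization is more challenging. We design a new two-stage multiclass support vector machine and successfully apply it to protein subnuclear prediction. We combine two types of feature extraction methods: those based on amino acid classification and those based on the physicochemical properties of amino acids. To reduce computational complexity and feature redundancy, we propose a "two-step optimal feature selection" method to find the optimal feature subset. In our system, all classifiers are built with support vector machines with probability output. We use the radial basis kernel function, whose parameter is determined by an automatic optimization method, which further speeds up our method. A weighting strategy is used to handle the problem of imbalanced datasets. Finally, comparisons between our method and existing methods on three test sets show that our method is more effective, and its results are better than those obtained using classifiers such as a support vector machine or a random forest alone.

In Chapter 5, we study the phylogenetic relationships of vertebrates, using mitochondrial genomes as our data. We first represent the mitochondrial genomes with the chaos game representation (CGR) of DNA sequences. We then use two Markov chain models to simulate the mitochondrial genomes and take them as candidate models for the noise background of the genome sequences. Next, we construct alignment-free methods based on these two models and apply them to the phylogenetic analysis of 64 vertebrates. Finally, we find that the second-order Markov chain model is finer than the first-order Markov chain model in simulating the CGRs of mitochondrial genomes; however, the CGR of the first-order Markov chain model is better suited to representing the random background, and removing this random background from the original CGRs enhances the evolutionary information in the mitochondrial genomes.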
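The limitation of amino acid composition noted for Chapter 3 can be illustrated with a minimal sketch. The function name and the two toy sequences below are hypothetical and are not taken from the thesis; the code only shows that two sequences containing the same residues in a different order receive identical composition vectors, which is why order-sensitive features are added.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the 20-dimensional vector of residue frequencies.

    Symbols outside the 20 standard residues are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two toy (hypothetical) sequences: seq_b is seq_a reversed, so the residue
# order differs while the residue counts are the same.
seq_a = "MKVLAAGIK"
seq_b = "KIGAALVKM"
same_composition = amino_acid_composition(seq_a) == amino_acid_composition(seq_b)
print(same_composition)
```

Because the composition vector depends only on residue counts, any permutation of a sequence maps to the same point in feature space.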

【Abstract】 With the development of biotechnology and bioinformatics, biological data have been growing exponentially every year. It is no longer practical to analyze such massive data solely through expensive and time-consuming biochemical experiments. To meet this need, it is urgent to develop reliable and efficient computational methods and algorithms. This thesis studies the prediction of protein structural classes and functions, and phylogenetic analysis, based on methods from nonlinear science. The work is summarized as follows.

In Chapter 2, we study the prediction of structural classes of low-homology proteins. Based on predicted secondary structures, we propose a new and simple kernel method to predict protein structural classes. The secondary structures of all amino acid sequences are obtained with the tool PSIPRED. A linear kernel based on secondary structure element alignment scores is then constructed and used as a precomputed kernel function to train a support vector machine classifier, with no parameters to adjust. The overall accuracies on two public low-homology datasets are higher than those obtained by other existing methods. In particular, our method achieves higher accuracies than other methods in differentiating the α+β class from the α/β class. We conclude that the linear kernel based on secondary structure element alignment scores captures the similarity between two secondary structural element sequences better than existing statistical information extracted from predicted secondary structures.

In Chapter 3, we study the subcellular localization of proteins. The function of a protein is closely related to its subcellular location. Amino acid composition is one of the important models for the subcellular localization of proteins, but it ignores sequence-order information. To make up for this deficiency, we add two methods, recurrence quantification analysis and the Hilbert-Huang transform, which can extract recurrence patterns and frequency information from time series. To make use of these two methods, we convert each amino acid sequence into two time series using the hydrophobic free energies and solvent accessibilities of the 20 amino acids. The ensemble model of amino acid composition, recurrence quantification analysis, and the Hilbert-Huang transform generates 62 features, so each amino acid sequence is represented by a 62-dimensional feature vector. All features are ranked by the maximum relevance and minimum redundancy method, and a support vector machine is again used as the classifier. The jackknife test is used to select the optimal feature subset and to evaluate and compare our method with other existing methods. Our method is tested on three apoptosis protein datasets. The final results show that our method achieves the best performance while using relatively few features. This suggests that our method may complement the existing methods.

In Chapter 4, we study the subnuclear localization of proteins, which is more challenging than subcellular localization. A novel two-stage multiclass support vector machine is proposed and successfully applied to predict the subnuclear localization of proteins. It uses only feature extraction methods based on amino acid classifications and physicochemical properties. To reduce computational complexity and feature redundancy, we propose a two-step optimal feature selection process to find the optimal feature subset. In our system, all classifiers are constructed using support vector machines with probability output. We use the radial basis kernel function, whose parameter is determined by an automatic optimization method to speed up our system. A weighting strategy is used to handle the unbalanced dataset. The results on three datasets show that our ensemble method is valuable and effective for predicting protein subnuclear locations compared with existing methods for the same problem, and that it performs better than popular machine learning classifiers (such as the support vector machine and random forest).

In Chapter 5, we study vertebrate phylogeny based on mitochondrial genomes. The mitochondrial genomes are represented by the chaos game representation (CGR), a tool for DNA sequence representation. Two Markov chain models are then used to simulate the CGRs of mitochondrial genomes and are considered as candidate models for the noise background. Alignment-free methods are constructed based on the two Markov chain models and are applied to analyze the phylogeny of 64 selected vertebrates. Finally, we conclude from the results that the second-order Markov chain model is more powerful than the first-order Markov chain model in simulating the CGRs of the mitochondrial genomes, while the CGR simulated by the first-order Markov chain model is more suitable for modeling the random background and can be subtracted from the original CGRs to enhance the phylogenetic information in the mitochondrial genomes.
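The chaos game representation used in Chapter 5 can be sketched as follows. This is a minimal illustration and not the thesis implementation: the corner assignment (A, C, G, T on the unit square) is one common convention and is assumed here, and the function names are hypothetical. The first function generates the CGR points; the second counts k-mers in a 2^k by 2^k grid, which is the discretized form of the same picture.

```python
# Assumed corner assignment (a common convention); the thesis may use another.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(dna):
    """Chaos game: start at the centre and move halfway to each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in dna.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(dna, k):
    """Count k-mers in a 2^k x 2^k grid, indexed as grid[row][col].

    The cell of a CGR point is fixed by the last k bases, so the cell index
    is computed with integer arithmetic: the most recent base supplies the
    most significant bit of the column and of the row.
    Ambiguous symbols are dropped before k-mers are formed.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = [b for b in dna.upper() if b in CORNERS]
    for i in range(len(seq) - k + 1):
        col = row = 0
        for j, base in enumerate(seq[i:i + k]):
            cx, cy = CORNERS[base]
            col += cx << j
            row += cy << j
        grid[row][col] += 1
    return grid


points = cgr_points("AG")
grid = cgr_grid("ACGT", 1)
```

Dividing such a grid by the number of k-mers gives a frequency matrix; a background matrix simulated from a Markov chain model could then be subtracted cell by cell, in the spirit of the background removal described above.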

  • 【Online publication contributor】 Xiangtan University
  • 【Online publication year and issue】 2014, Issue 03