
蛋白质相互作用预测方法的研究

The Study of the Method for Predicting Protein-protein Interactions

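【Author】 史明光 (Shi Mingguang)

【Advisor】 黄德双 (Huang Deshuang)

【Author information】 University of Science and Technology of China, Pattern Recognition and Intelligent Systems, 2009, PhD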

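【Abstract, translated from the Chinese】 With the completion of the full human genome sequence and its working draft, the focus of genomics research has gradually shifted from structural genomics to functional genomics, and biomedicine has entered a new era: the post-genome era. An important task of this era is the study of proteomics. Thanks to the emergence and maturation of more and more high-throughput experimental techniques, a large amount of proteomic data has been accumulated. The current problem is that the means and capacity to analyze these data lag seriously behind, so that data obtained at great cost in labour and money have not produced more biologically meaningful results. Developing advanced and efficient methods of information analysis and data mining that uncover the internal relationships in large and complex proteomic data, and thereby reveal protein functions and interactions, is therefore of great importance.

Protein-protein interaction is both a hot topic and a difficult problem in molecular biology. Proteins are the principal carriers of life activity and the executors of function, and in-depth study of their complex and diverse structures and functions, their interactions and their dynamic changes helps reveal the nature of life phenomena at the molecular, cellular and organism levels. Protein-protein interactions are an important part of many life processes in organisms and the basis of their biochemical reactions, and studying them is a main task of the post-genome era. Experimental methods provide large amounts of data but also bring many false positives and false negatives. This thesis therefore studies protein-protein interactions from a computational perspective, focusing on machine learning methods for predicting them. The main work is as follows.

1) A new method for predicting protein-protein interactions based on the evolutionary conservation of amino acids is proposed. Under natural selection, the amino acid residues related to molecular function within a protein family are conserved during evolution, and a protein's interaction with its environment depends on these key residues. Starting from the protein sequence, we propose a new encoding based on correlation coefficients of amino acid sequences, which considers both long-range interactions within a sequence and co-evolution between sequences. For training, positive samples are taken from the DIP, MIPS and BIND databases, and negative samples are constructed in four different ways: R-NEG, from randomly selected proteins; IS-NEG, from protein pairs located in the same subcellular compartment; BS-NEG, from protein pairs located in different subcellular compartments; and GO-NEG, from proteins with low RSSBP and RSSCC values derived from Gene Ontology information. Compared with the other eleven combinations, MIPS Core with GO-NEG gives the highest prediction accuracy and the smallest statistically significant P value, so MIPS Core and GO-NEG are called the gold standard positive and negative samples, respectively. Compared with the known autocorrelation encoding of amino acid residues, the correlation coefficient encoding gives better prediction results. Predictions across model organisms show that the SVM model based on the correlation coefficient encoding generalizes well.

2) An improved GPCA plus LDA model is constructed to predict protein-protein interactions, which effectively improves the prediction accuracy for membrane protein interactions. The basis vectors obtained by Greedy KPCA (GPCA) come directly from the sample data, whereas those obtained by KPCA are linear combinations of the sample data. Although the greedy KPCA algorithm is sub-optimal, it greatly reduces computational complexity compared with traditional KPCA. For yeast, a typical single-celled model organism, most integral membrane proteins cannot be verified by experimental methods. We propose using 21 structural and sequence features of membrane protein interactions to construct 56 positive samples and 150 negative samples. Experiments show that the GPCA plus LDA model yields 300 highly reliable predicted interactions involving 189 membrane proteins. The results also show that, although GPCA plus LDA performs feature reduction and effectively reduces redundancy in the data, its results are only slightly better than those of LDA; that, because GPCA plus LDA addresses the loss of high-order information, its results are greatly improved over those of KPCA plus LDA; and that GPCA plus LDA has the smallest variance in prediction accuracy, indicating little difference among the ten measurements and hence good robustness. Computation further reveals that the membrane protein interaction network has small-world and scale-free properties.

3) A new method for predicting protein-protein interactions based on the Bayesian additive regression trees (BART) model is proposed. BART is a novel ensemble learning method: using non-parametric Bayesian regression, the additive regression tree model is decomposed into a number of weak classifiers, which are then combined into a classifier ensemble. A BART model based on an integrated MCMC algorithm was applied to protein-protein interaction prediction and obtained good accuracy. Compared with standard MCMC, the proposed integrated MCMC algorithm effectively avoids local minima. The BART model also gives good results on an independent test set, indicating that it generalizes well.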

【Abstract】 With the completion of the human genome sequence and its working draft, the focus of genomics research has gradually shifted from structural genomics to functional genomics, and biomedicine has entered a new era: the post-genome era. In the post-genome era, an important task is the study of proteomics. Thanks to a growing number of increasingly mature high-throughput experimental technologies, a large amount of proteomic data has been accumulated. The current problem is that the means and ability to analyze and study these data lag seriously behind, so that data obtained with a great deal of labour and financial support fail to produce more biologically meaningful results. Therefore, developing advanced and efficient information analysis and data mining tools that find the internal links in large and complex proteomic data sets, so as to reveal protein functions and protein interactions, is of vital importance.

Protein-protein interaction is both a hot topic and a difficult problem in molecular biology. Proteins are the major carriers of life activity and executors of function, and in-depth study of their complex and diverse structures and functions, interactions and dynamic changes helps reveal the nature of life phenomena at the molecular, cellular and organism levels. Protein-protein interactions are an important part of diverse life processes in organisms and the basis of their biochemical reactions, and studying them is a main task of the post-genome era. While experimental methods provide large amounts of data, they also bring a large number of false positives and false negatives. Therefore, this thesis studies protein-protein interaction from a computational perspective, mainly exploring the application of machine learning methods to protein-protein interaction prediction. The thesis covers the following work.

1) A novel method for predicting protein-protein interactions based on the evolutionary conservation of amino acids is proposed. Under natural selection, amino acid residues involved in the function of a given protein family are more conserved, and the interaction between a protein and its environment depends on these important residues. Starting from the protein sequence, a new encoding scheme based on correlation coefficients of amino acid sequences is proposed. The scheme considers both long-range interactions within a sequence and co-evolution between sequences. For training, positive samples were taken from the DIP, MIPS and BIND databases, and negative samples were constructed in four different ways: (i) R-NEG, constructed from randomly selected proteins; (ii) IS-NEG, constructed from pairs of proteins located in the same subcellular compartment; (iii) BS-NEG, constructed from pairs of proteins located in different subcellular compartments; (iv) GO-NEG, constructed from proteins with low RSSBP and RSSCC values obtained from Gene Ontology information. Compared with the other eleven combinations, the combination of MIPS Core and GO-NEG gives the highest prediction accuracy and the smallest statistically significant P value. Thus MIPS Core and GO-NEG are called the gold standard positive samples and gold standard negative samples, respectively. In addition, compared with the known autocorrelation encoding of amino acid residues, the correlation coefficient encoding scheme yields better prediction results. Cross-species prediction results show that the SVM model based on the correlation coefficient encoding scheme has good generalization ability.

2) An improved GPCA plus LDA model was constructed to predict protein-protein interactions, which effectively improves the prediction accuracy for membrane protein interactions. The basis vectors obtained by the Greedy KPCA (GPCA) algorithm come directly from the sample data, while those obtained by the KPCA algorithm are linear combinations of the sample data. Although the greedy KPCA algorithm is sub-optimal, it greatly reduces computational complexity compared with the traditional KPCA algorithm. For the yeast Saccharomyces cerevisiae, a typical single-celled model organism, most integral membrane proteins cannot be verified by experiments. We propose using 21 structural and sequence features of membrane protein interactions to construct 56 positive samples and 150 negative samples. Experiments show that the GPCA plus LDA model yields 300 highly reliable predicted interactions involving 189 membrane proteins. The experimental results also show that although the GPCA plus LDA method performs feature reduction and reduces redundancy in the data, its results are only slightly better than those of the LDA method; that because the GPCA plus LDA method addresses the loss of high-order information, its prediction results are better than those of the KPCA plus LDA method; and that the variance of the prediction accuracy of the GPCA plus LDA approach is the smallest, which indicates that the ten measured values differ little and that the method is robust. Moreover, computation reveals that the membrane protein interaction network has small-world and scale-free properties.

3) A novel method based on the Bayesian additive regression trees (BART) model was proposed to infer protein-protein interactions. BART is a new ensemble learning approach: through non-parametric Bayesian regression, the additive regression tree model is decomposed into a number of weak classifiers, which are combined into a classifier ensemble. The BART prediction model based on an integrated backfitting MCMC algorithm obtained good prediction accuracy for protein-protein interactions. In particular, compared with standard MCMC methods, the proposed integrated backfitting MCMC algorithm effectively avoids local minima. The BART model also achieves good prediction results on an independent test set, which indicates that it has good generalization ability.
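
Three of the negative-sample constructions named in the abstract (R-NEG, IS-NEG and BS-NEG) can be illustrated with a minimal sketch. The function name `build_negative_pairs`, the toy protein identifiers and the compartment labels below are hypothetical, and excluding known positive pairs from the candidates is an assumption; the abstract does not give the thesis's actual procedure. GO-NEG is left out because it needs RSSBP and RSSCC scores that the abstract does not define.

```python
import itertools
import random


def build_negative_pairs(localization, positives, mode, n_pairs, seed=0):
    """Sample candidate non-interacting protein pairs (illustrative only).

    localization: dict mapping protein id -> subcellular compartment label
    positives:    iterable of known interacting pairs, excluded here (assumption)
    mode:         "R"  = any random pair,
                  "IS" = both proteins in the same compartment,
                  "BS" = proteins in different compartments
    """
    if mode not in ("R", "IS", "BS"):
        raise ValueError(f"unknown mode: {mode!r}")
    known = {frozenset(pair) for pair in positives}
    candidates = []
    for a, b in itertools.combinations(sorted(localization), 2):
        if frozenset((a, b)) in known:
            continue  # never label a known interaction as negative
        same = localization[a] == localization[b]
        if mode == "R" or (mode == "IS" and same) or (mode == "BS" and not same):
            candidates.append((a, b))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(candidates)
    return candidates[:n_pairs]


# Hypothetical toy data, not taken from DIP, MIPS or BIND.
toy_localization = {
    "P1": "nucleus", "P2": "nucleus", "P6": "nucleus",
    "P3": "mitochondrion", "P5": "mitochondrion",
    "P4": "cytoplasm",
}
toy_positives = [("P1", "P2"), ("P3", "P5")]

r_neg = build_negative_pairs(toy_localization, toy_positives, "R", n_pairs=4)
is_neg = build_negative_pairs(toy_localization, toy_positives, "IS", n_pairs=4)
bs_neg = build_negative_pairs(toy_localization, toy_positives, "BS", n_pairs=4)
print("R-NEG:", r_neg)
print("IS-NEG:", is_neg)
print("BS-NEG:", bs_neg)
```

In this toy set only two same-compartment pairs remain once the known interactions are removed, so the IS-NEG call returns fewer pairs than requested. Real data would also need a rule for proteins annotated to several compartments, which this sketch does not handle.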
