Predicting Protein-protein Interactions Based on Machine Learning Algorithms; Using a Logistic Regression Model to Improve the Accuracy of Peptide Identification in Mass Spectrometry Analysis

【Author】 Shao Chen (邵晨)

【Advisor】 Gao Youhe (高友鹤)

【Author information】 Peking Union Medical College (中国协和医科大学), Pathology and Pathophysiology, 2008, PhD

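【Abstract, translated from the Chinese】 Proteomics has become a focal discipline of the post-genomic era. The invention of high-throughput experimental technologies such as biological mass spectrometry and protein chips has greatly advanced the field. This thesis uses bioinformatics methods to further improve the efficiency and accuracy of current high-throughput experimental technologies, so that more comprehensive and accurate results can be obtained at lower experimental cost.

Protein-protein interactions play an important role in life processes. Years of biological experiments have accumulated a large body of interaction data, yet many interactions remain unknown. Current experimental screening methods are labor- and resource-intensive, and because of abundance suppression they rarely identify interactions between low-abundance proteins. A simpler route is to screen protein databases computationally first, predict candidate interactions, and then validate them experimentally. This strategy has far higher throughput than experimental approaches and avoids the abundance suppression problem. A considerable fraction of protein-protein interactions are mediated by a domain of one protein binding a short peptide on its ligand protein; such domains are called peptide recognition modules (PRMs). Chapter 1 of this thesis predicts protein-protein interactions by studying the binding properties of PRMs toward peptides. Taking the PDZ domain as an example and combining structure-based with sequence-based prediction, the thesis builds an integrated system for predicting domain-ligand interactions. In this system, the amino acid residues of the domain and the ligand that are in contact in the three-dimensional structure are extracted in place of the full-length sequences. Using three novel amino acid encoding schemes, three sub-predictors are built with two machine learning algorithms, support vector machines and artificial neural networks, and their predictions are then combined. Evaluated by cross-validation, the system has a specificity of 0.99 and a sensitivity of 0.60. However, the known ligands of a domain usually number only in the tens or hundreds, far fewer than the tens of thousands of proteins in a protein database, so cross-validation on so little data does not guarantee success when screening a whole database. To test this, ligand protein sequences were screened for three PDZ domains from the Swissprot human database. A considerable fraction of the predictions coincide with the results of a high-throughput in vitro experiment (peptide SPOT array), demonstrating the generalization ability of the prediction system.

Tandem mass spectrometry (MS/MS) is a widely used method in proteomics. In this method, a protein mixture is first digested into a peptide mixture, ionized in the mass spectrometer, and then fragmented to produce large numbers of MS/MS spectra. Database searching is a common way to process the resulting data: experimental spectra are compared with the theoretical spectra of digested peptides from a database, and a scoring algorithm selects the best-matching peptide. Owing to the complexity of the samples and of the experimental principles, the spectra are very noisy, which makes subsequent data processing difficult. Many algorithms have been proposed to optimize peptide identification, but correct and incorrect identifications still cannot be separated perfectly. To keep identifications reliable, stricter parameter thresholds have to be used to remove false positives, which unavoidably produces many false negatives and lowers the efficiency of proteomics research. Chapter 2 of this thesis establishes a new parameter, Oscore, for scoring the match between an experimental spectrum and a peptide. Oscore is built on a logistic regression model trained on a dataset from a mixture of 18 standard proteins, and it directly computes the probability that a spectrum-peptide match is correct. The independent variables of the regression model are: the parameters Xcorr, ΔCn, and Sp (preliminary score) output by SEQUEST; the parameters Rscore, Cont, and MatchPct output by the in-house software AMASS (Sun et al. Mol Cell Proteomics. 2004 Dec;3(12):1194-9); and the peptide charge state and the number of missed internal cleavage sites. The three AMASS parameters take into account fragment ion intensity and the continuity of the b/y ion series, which helps distinguish correct from incorrect identifications. Because these 8 parameters are related to one another in complex ways, combining them into Oscore can improve identification accuracy. Compared with the widely used software PeptideProphet, Oscore shows both better specificity (lower false positive rate) and better sensitivity (lower false negative rate) across several datasets. These datasets include standard protein mixture datasets and 3 proteome-scale datasets, covering different sample complexities, database sizes, and separation methods, which indicates to some extent that Oscore generalizes. Using a further model that is also based on logistic regression but uses only the parameters used by PeptideProphet, the thesis explores why Oscore discriminates better. At present Oscore applies to fully tryptic peptides, that is, peptides whose two ends were both produced by trypsin cleavage after the amino acid K or R; improving identification of peptides that are not fully tryptic is left for future work.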
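Two of the peptide-level quantities mentioned above, the number of missed internal cleavage sites and the fully tryptic requirement, can be illustrated with a minimal Python sketch. The function names, the `"-"` marker for a protein terminus, and the decision to accept protein termini as valid peptide ends are assumptions made for illustration; the thesis abstract does not describe its implementation. The sketch also ignores refinements such as the common exception for K or R followed by proline.

```python
TRYPTIC_RESIDUES = frozenset("KR")  # trypsin cleaves C-terminal to K or R


def count_missed_cleavages(peptide: str) -> int:
    """Count internal K/R residues, i.e. sites trypsin could have cut but did not."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    # The last residue is the peptide's own C-terminus, so it is not "internal".
    return sum(1 for residue in peptide[:-1] if residue in TRYPTIC_RESIDUES)


def is_fully_tryptic(peptide: str, prev_residue: str, next_residue: str) -> bool:
    """Return True if both peptide ends are consistent with tryptic cleavage.

    prev_residue and next_residue are the flanking residues in the parent
    protein; "-" marks a protein terminus (a convention assumed here).
    """
    if not peptide:
        raise ValueError("peptide must be non-empty")
    # N-terminal end: the residue before the peptide must be K or R.
    n_term_ok = prev_residue == "-" or prev_residue in TRYPTIC_RESIDUES
    # C-terminal end: the peptide itself must end in K or R.
    c_term_ok = next_residue == "-" or peptide[-1] in TRYPTIC_RESIDUES
    return n_term_ok and c_term_ok
```

For example, `count_missed_cleavages("AKLMR")` counts the single internal K, and `is_fully_tryptic("LMNK", "R", "A")` is true because the peptide follows an R and ends in K.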

【Abstract】 Proteomics has become a major field in the post-genomic era. In recent years, high-throughput technologies such as biological mass spectrometry and protein chips have greatly promoted the development of proteomics. This thesis aims to further improve the accuracy and efficiency of current experimental technologies through bioinformatics methods, in order to reduce the cost of biological experiments and to obtain more comprehensive and accurate data.

Protein-protein interactions play an essential role in life processes. Over the past years, large numbers of interactions have been found by various high-throughput biological experiments. However, many interactions remain unknown. Experimental screening for protein binding partners is not only labor intensive but also nearly futile for low-abundance binding species, because they are suppressed by high-abundance ones. A more practical way to study protein-protein interactions is to screen protein sequence databases with high-throughput computational predictions rather than experimental approaches, and then to direct the validation experiments towards the most promising peptides. Compared with traditional experimental assays, computational prediction offers a higher-throughput strategy for identifying interactions on a proteomic scale. It also offers a solution to the abundance suppression problem. A fairly large set of protein-protein interactions are mediated by families of peptide-binding domains (peptide recognition modules, PRMs). The first chapter of this thesis predicts protein-protein interactions by studying the binding selectivity of PRMs for their ligand peptides. Taking the PDZ domain family as an example, an integrated prediction system was set up to predict ligand peptides for PRMs based on both structural and sequence information. In this system, amino acid residues on the interface of the interacting domain-ligand pairs were extracted and used in place of the full-length sequences. Next, three novel coding methods were devised to represent different aspects of the interactions between amino acid residue pairs. Support vector machines and artificial neural networks were employed as machine learning algorithms, and three independent predictors were built to process the encoded data. The results of these three predictors were combined to make the final prediction. Evaluated by cross-validation, the specificity of the combined system was 0.99 and its sensitivity was 0.60. However, since the number of known ligands of a PRM is usually only a few dozen or a few hundred, which is far smaller than the size of a protein database (usually over ten thousand proteins), the performance in cross-validation cannot represent the real performance when a whole protein database is screened. We therefore screened the Swissprot protein database for potential ligands of 3 PDZ domains with the trained system. A large fraction of the predictions have already been confirmed experimentally by peptide SPOT array assays, indicating that the prediction system generalizes satisfactorily.

Tandem mass spectrometry (MS/MS) has been widely used in proteomics studies. In this approach, a protein mixture is first digested into a peptide mixture by enzymes, then ionized and fragmented to produce large numbers of MS/MS spectra. Database searching is a common method of processing MS/MS data: experimental spectra are compared with theoretical spectra predicted from peptides in a target protein database, and the best matches are found with a scoring method. Owing to the complexity of mass spectrometry experiments and of the samples tested, MS/MS spectra contain a high level of noise, which makes processing MS/MS data difficult. Various algorithms have been developed to improve peptide identification from MS/MS spectra. However, correct and incorrect matches between experimental spectra and database peptides still cannot be distinguished very well. To guarantee confidence in peptide identification, strict criteria have to be applied to the scoring functions, and the sensitivity of proteomics research has to be sacrificed. In the second chapter of this thesis, a new measure, Oscore, was developed by logistic regression on a training dataset produced from a mixture of 18 known proteins. Oscore directly estimates the probability of a correct peptide assignment for each MS/MS spectrum. The variables in the regression model were: the SEQUEST variables Xcorr, ΔCn, and Sp; the output variables MatchPct, Cont, and Rscore of the in-house software AMASS (Sun et al. Mol Cell Proteomics. 2004 Dec;3(12):1194-9); the peptide charge state; and the number of internal missed cleavage sites (NIMCS). The AMASS variables supplement the SEQUEST variables by considering fragment ion intensity and b/y ion continuity. Because of the complicated associations among the AMASS and SEQUEST variables, combining them in one model rather than applying a separate threshold to each improved the classification of correct and incorrect peptide identifications. Oscore achieved both a lower false negative rate and a lower false positive rate than PeptideProphet on datasets generated from the 18-protein mixture and from several proteome-scale samples differing in complexity, database size, and separation method. A three-way comparison among Oscore, PeptideProphet, and another logistic regression model that used only the PeptideProphet variables was used to discuss the main contributor to Oscore's improvement. At present, Oscore is restricted to identifying fully tryptic peptides. Extending Oscore to non-tryptic and partially tryptic peptides is left for future work.
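To make concrete how a logistic regression model turns the eight variables into a probability, the following is a minimal Python sketch, not the thesis's fitted model. A logistic model computes P(correct) = 1 / (1 + exp(-(b0 + Σ b_i·x_i))). The feature key names, the function names, and all coefficient values below are hypothetical placeholders chosen for illustration; the abstract does not report the fitted coefficients of Oscore.

```python
import math

# The eight variables named in the abstract; the dictionary keys are assumed names.
FEATURES = [
    "Xcorr", "DeltaCn", "Sp",          # SEQUEST outputs
    "Rscore", "Cont", "MatchPct",      # AMASS outputs
    "charge", "missed_cleavages",      # peptide properties
]


def logistic(z: float) -> float:
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def match_probability(psm: dict, coef: dict, intercept: float) -> float:
    """Probability that a spectrum-peptide match is correct under a logistic model.

    psm maps each feature name to its value for one match; coef maps each
    feature name to a regression coefficient (placeholders in this sketch).
    """
    z = intercept + sum(coef[name] * psm[name] for name in FEATURES)
    return logistic(z)


# Hypothetical coefficients: only Xcorr carries weight, purely for demonstration.
example_coef = {name: 0.0 for name in FEATURES}
example_coef["Xcorr"] = 1.0
example_intercept = -2.0

example_psm = {name: 0.0 for name in FEATURES}
example_psm["Xcorr"] = 4.0
example_probability = match_probability(example_psm, example_coef, example_intercept)
```

In practice the coefficients would be estimated from a labeled training set, such as the 18-protein mixture described above, with any standard logistic regression fitting routine. Because the model outputs a probability directly, a single cutoff on that value replaces separate thresholds on each score.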
