Evaluation of Different Protein Probability Calculating Methods Using a Semi-random Sampling Model and Other MS Bioinformatics Studies

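【Author】 Xue Xiaofang (薛晓芳)

【Advisors】 He Fuchu (贺福初); Zhu Yunping (朱云平)

【Author information】 Academy of Military Medical Sciences of the Chinese People's Liberation Army, Cell Biology, 2006, Master's thesis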

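【Abstract, translated from Chinese】 Shotgun technology is an important method for large-scale protein identification and can produce a large amount of data in a single experiment; the reliability of these data is an important issue in protein identification. Current research on quality control of identification results focuses mainly on the peptide level. Studies of reliability at the protein level are relatively few, and the data sets used to evaluate the methods in those studies are small, which is insufficient to demonstrate that the methods are valid. Based on a thorough understanding of the existing shotgun protein identification process, we built a semi-random sampling model that simulates the identification results obtained after searching large data sets, with the aim of using the model to evaluate the factors that may affect protein probability calculation and to evaluate existing protein probability calculation methods. To validate the reliability of the semi-random sampling model, we simulated a data set of standard proteins and compared the simulated and real numbers of proteins or peptides at different peptide counts. The two sets of results were broadly similar, showing that the model can largely represent the real protein identification process. Based on one data set from human liver, we used 34 data sets simulated with the semi-random sampling model to evaluate factors that affect protein probability calculation: the size of the identified data set, the size of the searched database, and the depletion of high-abundance proteins. We found that the overall positive rate of proteins decreases as the database and the data set grow, while removing high-abundance proteins can raise the true positive rate of proteins to some extent. Using the same simulated data, we also evaluated four commonly used protein probability calculation methods. PROT_PROBE distinguishes false positive from true positive proteins in the identification results reasonably well; the protein probabilities calculated by ProteinProphet are higher than the real results and discriminate poorly; accepting only proteins with two or more peptides (≥2 non-redundant peptides) works fairly well but is affected by the size of the database and of the data set; the Poisson model used by HPPP can calculate the number of false positive protein identifications fairly accurately to some extent, but this method depends strongly on the false positive rate of single peptides. Overall, therefore, each of the existing methods partly addresses quality control at the protein level but has its own shortcomings, and no mature and reliable method exists so far. In addition, with the development of proteomics and the gradual rise of systems biology research, high-throughput relative quantification of proteins is needed. Building on the label-free quantification methods currently in use, we improved the method with the EM algorithm and validated it using data from previously published studies. The results show that the improved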

【Abstract】 Shotgun technology is a widely used approach in proteomics that can produce a large amount of mass spectrometry data in a single experiment. The reliability of large-scale data sets in shotgun proteomics is an important problem, and until now most studies on the reliability of protein identification have concentrated on the peptide level. Quality control at the protein level remains an intractable problem: current methods for calculating protein probabilities have been evaluated only on small data sets of control protein samples or on manually validated data, which cannot support credible conclusions. Large and reliable data sets are therefore necessary to evaluate the reliability of calculated protein probabilities. The main objective of this study is to establish an efficient evaluation system for different methods of calculating protein probabilities and to examine the factors that affect protein probabilities. A semi-random sampling model was developed from a careful analysis of the protein identification process to simulate large-scale sets of identified peptides, which were used to evaluate the calculation methods for protein probabilities and the factors that affect them. The simulation was first performed on one data set from a control sample (18 proteins), and the numbers of peptides or proteins in the simulated results were compared with the real results at different numbers of peptide hits, demonstrating the effectiveness of our model. Based on an experimental data set from a human liver sample, 34 data sets were simulated. Three major factors were examined in our study: data set size, database size, and abundance distribution. According to these results, the true positive rate decreased as the size of the simulated data set or of the searched database increased, and the depletion of high-abundance proteins could increase the true positive rate of identified proteins. Finally, different methods for protein
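
The abstract does not give the details of the semi-random sampling model, so the following Python sketch is only an illustration of the general idea, not the thesis's implementation. It assumes that correct peptide hits are drawn from the proteins present in the sample in proportion to their abundance, and that incorrect hits fall uniformly at random on any protein in the searched database. The function names, the parameters, and the use of raw hit counts in place of non-redundant peptide counts are all hypothetical simplifications.

```python
import random
from collections import Counter


def simulate_identifications(sample_abundance, database_size, n_true, n_false, rng):
    """Simulate peptide hits per protein (illustrative assumption, not the thesis model).

    Proteins 0 .. len(sample_abundance)-1 are treated as present in the sample;
    the remaining database entries are absent. Returns a Counter mapping
    protein index -> number of peptide hits.
    """
    present = list(range(len(sample_abundance)))
    # Structured part: correct hits follow the abundance distribution.
    true_hits = rng.choices(present, weights=sample_abundance, k=n_true)
    # Random part: incorrect hits land uniformly on any database protein.
    false_hits = [rng.randrange(database_size) for _ in range(n_false)]
    hits = Counter(true_hits)
    hits.update(false_hits)
    return hits


def true_positive_rate(hits, n_present, min_peptides=1):
    """Fraction of accepted proteins that are really present in the sample.

    A protein is accepted when it has at least `min_peptides` hits, so
    min_peptides=2 mimics the "two or more peptides" rule.
    """
    accepted = [protein for protein, count in hits.items() if count >= min_peptides]
    if not accepted:
        return 0.0
    n_true = sum(1 for protein in accepted if protein < n_present)
    return n_true / len(accepted)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical abundances: a few dominant proteins and many rare ones.
    abundance = [100.0] * 5 + [1.0] * 195
    for db_size in (5_000, 50_000):
        hits = simulate_identifications(abundance, db_size, 2_000, 500, rng)
        for k in (1, 2):
            rate = true_positive_rate(hits, len(abundance), min_peptides=k)
            print(f"database={db_size} min_peptides={k} TPR={rate:.3f}")
```

Varying `database_size`, the total number of hits, or the abundance weights in such a sketch is one way to explore the three factors named in the abstract; the numbers it prints come from the toy assumptions above and should not be read as results of the thesis.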

  • 【Classification No.】Q51-3; Q811.4
  • 【Downloads】88