
Research and Application of Quality Control Methods for Peptide Identification in Proteomics

Investigation and Application of the Quality Control for Peptide Identification in Shotgun Proteomics

【Author】 Ma Jie

【Supervisors】 He Fuchu; Zhu Yunping

【Author Information】 Academy of Military Medical Sciences, PLA; Biochemistry and Molecular Biology; 2010; PhD

【Abstract (translated)】 With the completion of human genome sequencing, the life sciences turned to a comprehensive study of proteins, the true executors of biological function, in order to grasp the nature and laws of life as a whole; proteomics has thus become one of the focal points of life-science research in the post-genomic era. Advances in biological mass spectrometry have provided proteomics with a high-throughput, high-sensitivity and high-resolution analytical platform, making it one of the field's core supporting technologies and directly enabling large-scale proteome studies. Tandem mass spectrometry combined with database searching can identify proteins at the throughput and degree of automation that omics research requires, and has become a key technical route for human proteome expression profiling. Database searching has greatly improved the efficiency of interpreting mass spectrometry data, but because of the diversity of biological samples, the complexity of the experimental process, and the limitations of existing search algorithms, it cannot fully solve the protein identification problem; mass spectrometry data analysis has therefore remained a difficult part of proteome data processing. The problems of the database search strategy can be summarized in two points: how to guarantee the completeness and the correctness of the identification results.

This study addresses the correctness problem, focusing on quality control of peptide identifications produced by database searching, with the aim of effectively distinguishing correct from incorrect identifications while guaranteeing peptide confidence. In database searching, negative results arise mainly from two situations, ambiguous matches and random matches, and this study starts from both. It also considers the following challenges facing quality control of mass spectrometry data:

1. Mass spectrometry data are highly complex; database search results are easily affected by the instrument type, spectrum-generation parameters, search parameters, and the size and composition of the sequence database. Making full use of the information contained in the data helps to describe a data set completely.

2. How to establish an objective evaluation system that considers both the overall confidence level of a data set and the "individuality" of each peptide, providing experimenters with the probability that a single peptide or protein identification is correct.

3. Guaranteeing the generality and applicability of the developed models and methods, so that massive, complex data from multiple sources can be analyzed and integrated effectively.

4. High-accuracy mass spectrometry data have become the trend in biological mass spectrometry; how to exploit their characteristics in result interpretation will be a direction for mass spectrometry informatics.

Addressing these problems and the two peptide-level causes of negative results, this work develops quality-control methods based on the target-decoy database search strategy, covering data from instruments of different accuracy and the results of the two most widely used search engines, SEQUEST and Mascot. It improves the sensitivity and practicality of peptide filtering and builds a quality-control analysis pipeline for large-scale proteome data, providing more credible and more complete peptide and protein lists for subsequent biological studies.

First, using standard protein data sets and theoretically simulated spectra, we characterized the basic patterns of ambiguous matches in conventional database search results and their frequencies in data sets of different accuracy, and examined how different mass-tolerance settings in the search affect ambiguous matching. By constructing a protein sequence database containing human and non-human species, we also made a preliminary estimate of the probability of ambiguous matches in real sample data. We conclude that ambiguous matching is mainly affected by the precursor-ion accuracy of the data set; for standard protein data sets, a search database with low homology to the sample proteins gives a more realistic assessment of algorithm performance, while for real sample data sets the accuracy of protein assembly can be improved by merging indistinguishable peptide identifications rather than choosing among them.

Then, for the random-match problem, we improved peptide-level quality control for high-accuracy LTQ-FT data and for SEQUEST and Mascot search results by developing new search strategies and filtering methods. The LTQ-FT is a mass spectrometry platform that combines high accuracy with high throughput and is widely used in qualitative and quantitative proteome analysis, but its time-dependent systematic error makes it impossible to set a reasonable mass-tolerance range in database searching, greatly reducing its effective accuracy. We analyzed in detail the precursor mass-error distribution of the LTQ-FT platform, improved the existing recalibration formula, and developed an automated recalibration tool. We also proposed a new database search strategy, searching with a large tolerance and filtering with a small one, for tolerance specification and result validation; applications to standard protein and real sample data sets showed that the strategy significantly improves the sensitivity of peptide filtering.

Based on the target-decoy strategy and a nonparametric probability density model, we developed a Bayesian nonparametric (BNP) method for filtering SEQUEST peptide identifications from shotgun proteomics tandem mass spectrometry data. A total of 28 features describing the search results and their match information were extracted, and multiple linear regression, the expectation-maximization algorithm, and Bayes' formula were used to estimate the local false discovery rate of peptides and to provide filtering thresholds. The model was applied to SEQUEST results from three standard protein data sets and five real sample data sets (covering LCQ, LTQ and LTQ-FT instruments) and compared with a dynamic cutoff method, PeptideProphet, and a simple nonparametric model. At a given expected false positive rate, the BNP model retained the most peptides, demonstrating good sensitivity and generality; moreover, the probability scores it computes preserved a substantial number of high-confidence peptides that the other methods filtered out, greatly improving the utilization of mass spectrometry data.

Mascot, a search engine as widely used as SEQUEST, has received relatively little quality-control study. Its identity threshold can strictly control the false positive rate of the results, but its low sensitivity brings a high false negative rate and the loss of many true identifications. We classified and summarized existing methods for filtering and evaluating Mascot results and, based on the target-decoy search strategy, refined peptide-level quality control for Mascot by integrating new features into a probability model, effectively improving the sensitivity of Mascot result filtering, reducing the false negative rate, and increasing the number of high-confidence peptide identifications.

With the rapid development of the Human Proteome Project, experimental instruments and techniques have advanced continuously, and large amounts of heterogeneous data have been generated. To integrate experimental data from multiple sources effectively, we built an analysis pipeline with a unified quality-control standard for large-scale mass spectrometry data based on the Bayesian nonparametric model, completed a systematic analysis of the mouse liver organelle expression profile data set of the Chinese Human Liver Proteome Project, and improved on the identifications of the conventional profiling strategy.

In proteome research, obtaining highly confident identifications from mass spectrometry data is of great significance for subsequent biological and clinical applications, so effectively controlling the false positive rate of identified peptides remains one of the primary problems facing the database search strategy. This thesis focuses on the validation of peptide identifications, makes rational use of multiple mathematical and statistical models to integrate multivariate features and interpret mass spectrometry data in depth, develops and improves peptide filtering methods in terms of sensitivity, specificity, and generality, refines the quality control of proteome peptide identification, and successfully builds a quality-control analysis pipeline for large-scale proteome expression profile data.
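The target-decoy idea that underpins the quality-control methods summarized above can be sketched in a few lines. This is a minimal illustration with made-up scores, not the thesis's actual implementation:

```python
# Minimal sketch of target-decoy FDR estimation: decoy hits above a score
# threshold estimate how many target hits above that threshold are random.
# The PSM list and threshold below are invented for illustration.
def estimate_fdr(psms, threshold):
    """Estimate the FDR among PSMs scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs from a search against a
    concatenated target+decoy sequence database.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    # Each decoy hit is assumed to mirror one random target hit.
    return decoys / targets

psms = [(3.2, False), (2.9, False), (2.8, True), (2.5, False), (1.1, True)]
print(estimate_fdr(psms, 2.0))  # 1 decoy / 3 targets
```

In practice the threshold is swept until the estimated FDR reaches the desired level; the thesis's methods refine this basic scheme with probability models rather than a single score cutoff.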

【Abstract】 With the completion of the Human Genome Project, scientists launched a comprehensive analysis of gene products, the genome-encoded proteins, in order to explore the essential nature and laws of life. Proteomics has become one of the most active areas of life-science research in the post-genomic era. The development of mass spectrometry has provided a high-throughput, high-sensitivity and high-resolution analysis platform for proteomics. Tandem mass spectrometry is now one of the most powerful technologies for protein identification, and it makes global protein profiling possible.

Tandem mass spectrometry combined with the database searching strategy allows high-throughput identification of peptides and proteins in shotgun proteomics. However, it cannot solve the problem of protein identification completely, given the diversity of biological samples, the complexity of the experimental process, and the limitations of existing search algorithms. The problems of applying the database search strategy to mass spectrometry data can be summarized in two points: how to ensure the completeness and the correctness of the identified results. Selecting all those peptide-spectrum assignments that are actually correct is thus one of the most daunting tasks in mass spectrometry-based proteomics.

In this study, we focus on improving the quality control procedure for peptide identification in shotgun proteomics; the primary aim is to distinguish correct and incorrect matches effectively. The negative results among peptides identified from tandem mass spectra are mainly caused by ambiguous identifications and random identifications in the database search strategy. The challenges we face for quality control in mass spectrometry data analysis are as follows.

1. The analysis of digested proteins by mass spectrometry is a complex physical and chemical process. The database search results are likely to be affected by many factors, such as sample complexity, sequence databases, experimental protocols and types of instrumentation. Taking advantage of many new features would provide a means of improving the sensitivity of filtration methods.

2. It is necessary to establish an objective evaluation system that not only takes into account the specific data set as a whole but also reflects the "personality" of each identified peptide, by providing experimenters with a confidence level for each single peptide or protein identification.

3. Robust filter methods and models are vitally needed that can effectively analyze data from multiple sources.

4. Mass spectrometers that provide high-accuracy data are increasingly used in proteomic studies. Utilizing accurate mass measurements in the data analysis strategy will become a trend in proteomics applications.

In this thesis, on the basis of the target-decoy database search strategy, we conducted a comprehensive investigation of the two kinds of identifications that contribute to negative hits, focusing on improvements to the validation of identified peptides in terms of the sensitivity, specificity, and generalizability of the filter methods.

First of all, we evaluated the patterns and frequencies of ambiguous matches occurring in database search outputs, using standard data sets, theoretically simulated spectra and real sample data. We also conducted an in-depth study of how different mass error tolerance (MET) settings in database searches affect the occurrence of ambiguous matches. The observations indicated that the peptide MET was the main factor determining the number of ambiguous matches. Ambiguous matches are one of the effects that distort the calculated false positive rate of standard protein data sets; this can be mitigated by searching a database composed of low-homology sequences.
If the ambiguous matches of the same spectrum belong to different proteins, we recommend reporting all peptides as a peptide group and choosing the favored protein supported by other peptide identifications.

Then, we presented and evaluated filter methods for the peptide validation procedure, specifically for high-accuracy mass data and for the two most commonly used search engines, SEQUEST and Mascot. The hybrid linear trap quadrupole Fourier-transform ion cyclotron resonance mass spectrometer (LTQ-FT), an instrument with high accuracy and resolution, is widely used in the identification and quantification of peptides and proteins. However, time-dependent errors in the system may degrade the accuracy of these instruments, negatively influencing the determination of the MET in database searches. We investigated the parent-ion mass error distribution of the LTQ-FT mass spectrometer and applied an improved recalibration procedure to determine the statistical MET of different data sets. Based on the improved recalibration formula, we introduced a new tool, FTDR (Fourier-transform data recalibration), which employs a graphical user interface (GUI) for automatic calibration. We then presented a new strategy, LDSF (large-MET database search and small-MET filtration), for database search MET specification and validation of search results. As the name implies, a large-MET database search is conducted and the results are then filtered using the statistical MET estimated from high-confidence results.
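The LDSF idea (search with a wide tolerance, then filter with a statistical window estimated from high-confidence matches) can be sketched as follows. The field names, the 3-sigma window, and the example values are illustrative assumptions, not the thesis's exact procedure:

```python
import statistics

def small_met_filter(psms, high_conf, n_sigma=3.0):
    """Keep only PSMs whose precursor mass error falls inside a window
    derived from the high-confidence subset of the wide-tolerance search.

    psms / high_conf: lists of dicts with a 'ppm_error' key giving the
    precursor mass error in ppm (illustrative field name).
    """
    errors = [p["ppm_error"] for p in high_conf]
    mu = statistics.mean(errors)
    sigma = statistics.stdev(errors)
    lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
    # Matches far outside the systematic-error window are likely random.
    return [p for p in psms if lo <= p["ppm_error"] <= hi]

high_conf = [{"ppm_error": e} for e in (1.0, 1.2, 0.8, 1.0)]
candidates = [{"ppm_error": 1.1}, {"ppm_error": 5.0}]
print(small_met_filter(candidates, high_conf))  # keeps only the 1.1 ppm hit
```

The point of searching wide first is that the true systematic error (here centered near 1 ppm rather than 0) is only visible after the search; a tight tolerance set in advance would clip it.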
By applying this strategy to both a standard protein data set and a complex data set, we demonstrated that LDSF can significantly improve the sensitivity of the result validation procedure.

A Bayesian nonparametric (BNP) model was developed to improve the validation of SEQUEST database search results; it incorporates several popular techniques, including the linear discriminant function (LDF), a flexible nonparametric probability density function (PDF) and the Bayesian method. The BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on standard-protein and real complex-sample data sets from multiple MS platforms (LCQ, LTQ and LTQ-FT) and compared it with the cutoff-based method, PeptideProphet and a simple nonparametric method. The performance of the BNP model was superior for all data sets in both sensitivity and generalizability. Some high-quality matches that had been filtered out by other methods were detected and assigned high probability by the BNP model. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.

The probability-based search engine Mascot has been widely used to identify peptides and proteins in shotgun proteomic research. Most subsequent quality control methods filter out ambiguous assignments according to the ion score and threshold provided by Mascot. On the basis of the target-decoy database search strategy, we evaluated the performance of several filter methods on Mascot search results and demonstrated that using filter boundaries in a two-dimensional feature space, the Mascot ion score and its relative score, can improve the sensitivity of the filter process.
Furthermore, using a linear combination of several characteristics of the assigned peptides, including the Mascot score, 23 previously employed features, and three newly introduced features, we applied the Bayesian nonparametric model to Mascot search results and validated more correctly identified peptides in control and complex data sets than could be validated by empirical score thresholds, the cutoff-based method or the linear discriminant model.

With the rapid development of the Human Proteome Project, experimental instruments and techniques have made great progress. However, a huge amount of heterogeneous data has been generated by different laboratories using diverse analytical strategies. To integrate these multi-source data, we built a unified quality-control analysis procedure for large-scale mass spectrometry data on the basis of the Bayesian nonparametric model. Using this procedure, we reprocessed the mouse liver organelle expression data set of the Chinese Human Liver Proteome Project and greatly improved the peptide and protein identifications.

Making use of available information that is typically ignored can benefit the data analysis process in proteomics. Compared with early studies in which only a few characteristics were used in mass-data classifiers, more and more features will be involved in mass spectrum data mining. Combining new features within an appropriate framework plays an important role in obtaining good results. On the basis of these concepts, we have carried out several exploratory studies on the application of computational and statistical methods to high-throughput MS/MS data analysis, improving the quality control for peptide identification in shotgun proteomics.
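The linear-combination step mentioned above can be illustrated minimally. The feature names and weights below are hypothetical placeholders, not trained values from the thesis (which fits its weights by multiple linear regression over 20-plus features):

```python
# Toy linear discriminant: collapse several PSM features into one score
# that a downstream probability model or threshold can act on.
def discriminant_score(features, weights):
    """Weighted sum of named feature values; both arguments are dicts
    keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical PSM features and weights, for illustration only.
psm = {"ion_score": 45.0, "relative_score": 0.8, "ppm_error": 1.2}
weights = {"ion_score": 0.05, "relative_score": 2.0, "ppm_error": -0.1}
print(discriminant_score(psm, weights))
```

Collapsing many features into one discriminant before density estimation keeps the nonparametric step one-dimensional, which is what makes the KDE-plus-Bayes machinery tractable.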
