
蛋白质组学中串联质谱数据搜库结果质量控制方法研究

The Research on the Quality Control Methods of Database Search Results of Tandem Mass Spectrometry Data in Proteomics

【Author】 Zhang Jiyang (张纪阳)

【Supervisors】 He Fuchu (贺福初); Xie Hongwei (谢红卫)

【Author information】 National University of Defense Technology, Control Theory and Control Engineering, 2007, PhD

【Abstract (translated from Chinese)】 Proteomics attempts to systematically study the functional molecules of life, proteins, at the global level. Because the dynamic range of protein expression abundance in a biological system exceeds six orders of magnitude and the physical and chemical properties of proteins vary widely, proteome research requires high-throughput, highly sensitive analytical instruments. Biological mass spectrometry has these characteristics and has therefore become one of the supporting technologies of proteome research. Owing to the complexity of the samples and of the experimental principles, mass spectrometry data carry complex noise and are strongly affected by random factors in the experimental process, so mass spectrometry data analysis has long been a difficult part of proteome data processing.

Database searching is currently the main method of mass spectrometry data analysis. Its basic idea is to compare experimentally acquired spectra with the theoretical spectra of enzymatically digested peptides in a sequence database and, according to a scoring algorithm, find the peptide or protein that best matches each experimental spectrum. This match, together with the scores that the search software provides to measure match quality, constitutes the basic search result (also called a peptide identification). A search result is thus the best match within a candidate set, but it is not necessarily correct. Moreover, because of the heavy computational burden, automated search software interprets spectra rather coarsely and lacks effective methods for evaluating the confidence of its results, so data quality control is a prominent problem.

Quality control of mass spectrometry data currently faces the following difficulties: (1) many proteome studies must integrate results from multiple sources, multiple mass spectrometry platforms and multiple data processing programs, and therefore need a unified system for evaluating data confidence; (2) because of the complexity of the experimental principles, it is difficult to derive a probabilistic model of peptide-spectrum matching from theory, so many of the models used in quality control are obtained from data by observation, statistics, fitting and learning; such modeling depends on specific datasets, and the generality of the models needs broad validation; (3) one manifestation of the complexity of mass spectrometry data is that the statistical characteristics of spectra change with experimental conditions, environmental factors and the samples analyzed, which makes it difficult to build generalizable algorithmic models from data; (4) mass spectrometry experiments involve a great variety of physical and chemical mechanisms, so the data contain many "subclasses", and it is difficult to build a unified, simple model for evaluating search results; the many existing quality evaluation parameters measure result quality from different aspects, so multivariate information fusion and synthetic decision-making are problems that quality control research must face; (5) high-throughput experimental techniques produce very large volumes of data, which brings considerable engineering computation problems.

Addressing these difficulties, and guided by practical engineering needs, this thesis applies statistical analysis to the quality control of tandem mass spectrometry database search results, working on search parameter optimization; feature extraction, optimization and selection; and result validation based on randomized database searching. The aim is to improve the sensitivity and resolution of quality control methods, to address engineering problems such as model generality and universality, and to provide technical support and analysis results for the Human Liver Proteome Project (HLPP). The main work includes:

(1) Optimization of database search parameters. Database searching is the foundation of this research. Some user-specified search parameters determine the candidate peptide set of a spectrum and strongly affect the results, for example the parent-ion mass error tolerance and the enzyme digestion parameters. These parameters are determined by instrument characteristics and by the physical and chemical principles of the experiment, and depend on instrument condition, experimental design and sample complexity, so different datasets require carefully chosen, optimized values. At present many search parameters are set to empirical values or the instrument manufacturer's recommendations, and strategies for determining them from the user's own data are lacking. In practice, exploratory searches followed by statistical analysis of the results can be used to optimize parameters or derive rules for setting them; published reference datasets with rigorous experimental designs can also be used. Taking datasets of control proteins as the object of analysis, and repeating database searches with varied parameters, this thesis analyzes the influence of the parent-ion mass error tolerance, the fragment-ion m/z error tolerance and the digestion settings on the search results, and gives methods or recommended values for determining them. Specifically, the thesis proposes a method for estimating the parent-ion mass error tolerance and the fragment-ion m/z error tolerance from noisy data; improves the parent-ion mass recalibration formula for high-accuracy Fourier-transform mass spectrometers; discovers that the fragment-ion m/z error varies with signal intensity and proposes an empirical formula that sets the error tolerance according to relative signal intensity; analyzes the influence of the fragment-ion m/z error tolerance on the search scores and gives a method for determining it; and analyzes the influence of the number of missed cleavage sites and of enzymatic termini on the search results, providing references for setting these two parameters. The thesis also proposes a processing strategy that first searches with an enlarged mass tolerance, then determines a statistically meaningful parent-ion error tolerance by distribution fitting on filtered results, and finally filters the full result set. The analysis shows that this strategy effectively improves the discriminating power of the scores reported by the search software.

(2) Feature extraction for the quality control of search results. Quality control of search results is a typical pattern classification problem, and feature extraction and selection is its foundation. The thesis systematically surveys the commonly used parameters, dividing them into three classes: the scores reported by the widely used search engine SEQUEST, basic parameters of peptides and spectra, and empirical parameters proposed in the literature. Questions related to feature computation, such as the generation of theoretical spectra and the measurement of a feature's discriminating power, are also analyzed in depth. Drawing on the literature, on background knowledge of mass spectrometry experiments, and on data "experiments" with reference datasets, the computation of some features is optimized, and concrete solutions are given for practical problems such as applying peptide chromatographic retention-time prediction models. Clustering analysis and heuristic knowledge are used to analyze the relations among features, and on this basis recommended rules for feature selection are given for the different classification methods used in the thesis.

(3) Validation of search results based on randomized database searching. Validation methods based on randomized (decoy) database searching are now widely used in proteome experiments; they provide a unified quality control framework for data from different samples, search engines, mass spectrometry platforms and experimental conditions. However, several practical issues of these methods remain unresolved, and evaluations of their performance are lacking. This part of the thesis first proposes a method for constructing a randomized database; searches on real data show that it avoids the duplicated-peptide problem well and that the resulting score distribution closely models the distribution of random match scores in the normal database. On this basis, four classification decision methods of increasing complexity are studied: a linear discriminant function (LDF) method, a method based on multivariate nonparametric probability density fitting and a method based on a Bayesian nonparametric model are proposed, and a method based on fitting the marginal distributions of ln(Xcorr) and (ΔCn)^(1/2) is improved. Their common aim is to address the choice of discriminant function and the fusion of features in randomized-database methods, so as to improve the sensitivity of result filtering. The proposed LDF method performs well, is simple, and is readily accepted by experimentalists; it has been applied in the data analysis of the Chinese Human Liver Proteome Project. The method based on the marginal distributions of ln(Xcorr) and (ΔCn)^(1/2) yields results essentially consistent with the LDF method, with very similar decision boundaries. The multivariate nonparametric density method and the Bayesian nonparametric method use more features and improve sensitivity considerably. Validation on data from control samples and real samples shows that the four proposed or improved methods are more sensitive than existing validation methods and obtain fairly accurate false positive rate estimates on the control datasets. Comparison with PeptideProphet further shows that the randomized-database methods achieve good results on different datasets and generalize well.

In summary, addressing the large data volumes, variable feature distributions and complex noise of mass spectrometry quality control, this thesis reveals, through extensive statistical analysis, a series of problems and difficulties in the quality control of tandem mass spectrometry data. Through research on each stage of tandem mass spectrometry data processing, including search parameter optimization, feature extraction and selection for result validation, and validation methods based on multivariate feature fusion and nonparametric probability density estimation, the difficulties of quality control of search results are largely overcome, and the sensitivity and robustness of quality control methods are improved. The results of this work have been applied in HLPP data analysis.
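The linear discriminant function method described above combines score features such as ln(Xcorr) and (ΔCn)^(1/2) into a single discriminant value. The thesis's fitted coefficients are not given in this abstract, so the sketch below uses hypothetical weights and a hypothetical bias purely for illustration:

```python
import math

def ldf_score(xcorr, delta_cn, w=(1.0, 1.5), b=-1.0):
    """Linear discriminant function over ln(Xcorr) and sqrt(DeltaCn).

    The weights w and the bias b are placeholders; in practice they
    would be fitted on target/decoy identifications (e.g. by Fisher
    linear discriminant analysis).
    """
    features = (math.log(xcorr), math.sqrt(delta_cn))
    return sum(wi * fi for wi, fi in zip(w, features)) + b

def accept(xcorr, delta_cn, threshold=0.0):
    """Accept a match when its discriminant score exceeds a threshold
    chosen to meet a desired false positive rate."""
    return ldf_score(xcorr, delta_cn) > threshold
```

A fitted LDF of this form reduces the two-dimensional filtering problem (separate Xcorr and ΔCn cutoffs) to a single threshold on one score.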

【Abstract】 Proteomics aims to systematically investigate the functional molecules of life, proteins, at the global level. Because the dynamic range of protein expression in a biological system may exceed six orders of magnitude and the physical and chemical properties of proteins vary widely, proteomic research needs high-throughput and highly sensitive experiment platforms. Biological mass spectrometry (MS) has these characteristics and has thus become a supporting technique of proteomic research. Because of the complexity of the samples and of the chemical and physical mechanisms of the MS experiment, MS data contain complex noise, and MS data processing remains an open and difficult problem in proteomics. Database searching is a popular method of MS data processing: it compares each experimental mass spectrum with the predicted spectra of the digested peptides in a target protein sequence database and finds the best match, together with scores intended to measure the match quality. A database search result (also called a peptide identification) is the best match in a limited search space, which is not necessarily correct. Because of the huge computing burden, automatic database search software interprets the mass spectra roughly and offers no effective method to evaluate the confidence of the resulting matches. Therefore, the quality control of mass spectrometry data is a notable problem in the following respects: (1) Integrating MS data from multiple laboratories and multiple platforms is common practice in proteomic research, so a universal quality control framework is needed for large-scale studies. (2) It is difficult to set up a probability model based on the complex physical model of the MS experiment.
Many models used in the quality control of peptide identifications were obtained by observation, statistical fitting or training on standard datasets, so the universality of these models is doubtful, and validating the results they give is laborious work in proteomic research. (3) One source of the complexity of MS data is that the statistical characteristics of the data change with experimental conditions, environmental factors and the samples being analyzed, making it very difficult to build universal algorithms for MS data processing. (4) The various chemical and physical mechanisms involved in the MS experiment lead to many subclasses in the MS data, and it is difficult to model the database search problem with a one-size-fits-all algorithm. Hence, multiple parameters are used to validate database search results; these parameters measure the match quality between mass spectra and peptides in different respects. The integration and fusion of multi-source information and synthetic decision-making are needed for the quality control of peptide identifications. (5) The huge data volume of proteomic research brings notable computing problems. This thesis addresses these problems in database search result validation, focusing on the optimization of database search parameters; the extraction and selection of features for classifying correct and random peptide identifications; and algorithms and schemes for evaluating peptide identifications based on randomized database searching. The main work includes: (1) Optimization of database search parameters. Database searching is the basis of the quality control of peptide identifications. Many parameters must be specified by the user before database searching; some of them restrict the candidate peptides of a mass spectrum and greatly affect the search results.
These parameters depend on the characteristics of the instrument and on the physical and chemical principles of the experiment, and are affected by the working condition of the instrument, the experimental protocol and the complexity of the sample. In many studies they are simply set to values recommended by the instrument manufacturer or taken from the literature; statistical conclusions about their optimal values, which should be based on the user's own experimental data, are lacking. In fact, many database search parameters can be estimated from the results of an exploratory database search. Moreover, many reference datasets with rigorous experimental designs have been published and can be used to analyze and optimize the search parameters. In this thesis, the influence on the search results of the mass error of parent ions, the m/z error of fragment ions and the enzyme specificity was investigated using reference datasets and statistical methods. A robust method was proposed to estimate the mass error tolerance of parent ions and the m/z error tolerance of fragment ions from noisy data. An improved recalibration law was proposed for high-accuracy Fourier-transform mass spectrometry, based on the observation that the mass error increases with retention time. The m/z error of fragment ions was found to decrease with the signal intensity of the ions, and an empirical formula is provided to determine the m/z error tolerance from the signal intensity. The distribution of the number of missed cleavage sites of correct peptide identifications and the distribution of the number of identifications with different tryptic termini are also analyzed. Based on this work, we proposed a database search strategy that first enlarges the parent mass error tolerance used in the search and then filters the results with a statistically determined tolerance.
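The enlarge-then-filter tolerance strategy above can be sketched as follows. This is not the thesis's exact estimator; it is a minimal robust variant using the median and median absolute deviation, assuming the input is a list of parent-ion mass errors in ppm from an exploratory wide-tolerance search:

```python
import statistics

def robust_tolerance(mass_errors_ppm, k=3.0):
    """Estimate a parent-ion mass error tolerance from noisy data.

    The median and the median absolute deviation (MAD) are insensitive
    to the random matches that contaminate the error distribution;
    1.4826 * MAD approximates the standard deviation of the underlying
    Gaussian error component.  Returns (center, tolerance).
    """
    med = statistics.median(mass_errors_ppm)
    mad = statistics.median(abs(e - med) for e in mass_errors_ppm)
    sigma = 1.4826 * mad
    return med, k * sigma

def filter_by_tolerance(errors, center, tol):
    """Keep only identifications whose mass error lies within center +/- tol."""
    return [e for e in errors if abs(e - center) <= tol]
```

With real data one would fit the tolerance on a confidently identified subset and then filter the full result set, as the strategy in the text prescribes.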
This strategy was applied to a control dataset, and the results showed that it could improve the discriminating power of the database search scores. (2) Feature extraction and selection for the quality control of database search results. The quality control of database search results is a typical pattern classification problem, and feature extraction and selection is the essential groundwork of pattern classification. This thesis summarizes the parameters used in the quality control of database search results, including the database search scores, the basic characteristics of the mass spectrum and the peptide, and the empirical parameters proposed in the literature. It then discusses the generation of theoretical MS/MS spectra and the measurement of the discriminating power of these features. In this research, the discriminating power of some features was improved based on background knowledge and exploratory data analysis, and some practical problems in applying peptide retention time to the validation of peptide identifications were discussed and settled. A set of features proposed in the literature was summarized and defined. Finally, based on background knowledge and clustering analysis, a correlation analysis was performed on these features, and basic rules were provided for the feature selection of the different validation methods used in this thesis. (3) Validation of peptide identifications based on randomized database searching. Randomized-searching-based methods can provide a universal framework for the quality control of MS data across different samples, platforms, experimental conditions and database search software. However, many practical problems of these methods are not adequately solved, and research evaluating their performance is still preliminary.
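The discriminating power of a single feature, as discussed in part (2) above, can be quantified in several ways; the thesis's exact metric is not stated in this abstract, so the Fisher discriminant ratio below is an illustrative choice:

```python
import statistics

def fisher_ratio(correct_vals, random_vals):
    """Fisher discriminant ratio of one feature: the squared difference
    of the class means divided by the sum of the class variances.

    Larger values mean the feature separates correct identifications
    from random matches better.  Assumes both classes have nonzero
    variance (degenerate inputs would need a guard).
    """
    mc = statistics.fmean(correct_vals)
    mr = statistics.fmean(random_vals)
    vc = statistics.pvariance(correct_vals)
    vr = statistics.pvariance(random_vals)
    return (mc - mr) ** 2 / (vc + vr)
```

Ranking candidate features by such a ratio, computed on labeled target/decoy data, gives a simple starting point for the feature-selection rules mentioned in the text.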
This thesis proposed a method for constructing a randomized database that avoids the shared-peptide problem. Four methods were then proposed or improved to validate the database search results: a linear discriminant function based method; a method based on fitting the marginal distributions of ln(Xcorr) and (ΔCn)^(1/2); a multivariate nonparametric density estimation based method; and a Bayesian nonparametric model based method. These efforts aimed to provide solutions for the choice of discriminant function and for feature fusion in randomized-database-searching methods, and thus to improve the sensitivity of database search result validation. The linear discriminant function based method is easy to use and has been applied in the Human Liver Proteome Project (HLPP). The marginal distribution fitting method based on ln(Xcorr) and (ΔCn)^(1/2) obtained almost the same results as the linear discriminant function based method. The other two methods use more features, and their sensitivity is considerably improved. These methods were evaluated on control datasets and real sample datasets and were shown to be more sensitive than traditional randomized-database-searching methods; in addition, the false positive rate estimates were shown to be sufficiently accurate on the control datasets. We also compared the randomized-database-searching method with PeptideProphet and found that it performs well on datasets from different instruments and laboratories, demonstrating good generalization. In summary, by applying statistical analysis to the huge datasets of proteomic research, whose statistics vary and which inherently contain complex noise, this thesis reveals a series of problems in the quality control of tandem mass spectrometry data.
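The target-decoy logic behind randomized database searching can be sketched as follows. The decoy construction here is a plain residue shuffle and the FDR estimator is the standard decoy-to-target ratio; the thesis's construction additionally guards against decoy peptides that duplicate target peptides, which this sketch only notes in a comment:

```python
import random

def shuffle_protein(seq, rng):
    """Build one decoy entry by shuffling the residues of a target
    protein.  A production implementation, like the construction in
    the thesis, would also check that the shuffled sequence yields no
    peptides shared with the target database and reshuffle if needed."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def estimate_fdr(target_scores, decoy_scores, threshold):
    """Target-decoy FDR estimate: random matches are assumed to hit
    target and decoy databases equally often, so the number of decoy
    hits above the threshold estimates the number of false target hits."""
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return d / t if t else 0.0
```

Sweeping the threshold and reading off the estimated FDR is what lets such methods act as a unified quality control framework across instruments and search engines.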
Consequently, systematic research was carried out on the optimization of database search parameters; on the extraction and selection of features for classifying correct and random peptide identifications; and on algorithms and schemes for evaluating peptide identifications based on randomized database searching. The methods proposed in this thesis, which rest on multi-source feature fusion and practical nonparametric techniques, can largely improve the sensitivity of the validation of peptide identifications and cope with variation across datasets. They have been applied in the HLPP.
