Research on Data Processing and Search Quality Control in Biological Mass Spectrometry-Based Proteomics

【Author】 贠栋

【Advisor】 贺福初

【Author Information】 Fudan University, Chemical Biology, 2007, Ph.D.

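【Summary】 The main contributions of this dissertation are: 1. Data templates were established for the experimental platform of the Fudan University proteomics research center and applied to data exchange and management in the Human Liver Proteome Project (HLPP), providing practical experience and an implementation approach for standardizing the management and exchange of large-scale experimental data. 2. For 2-DE (two-dimensional electrophoresis)-MALDI-TOF/TOF (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry) and similar proteomics experiments in which the protein is the final unit of separation, an optimized search strategy was proposed that avoids false positives arising from shared m/z values, while iterative searching gives a fuller picture of the proteins present in each unit. 3. For MALDI-TOF/TOF data, evaluation models for PMF (Peptide Mass Fingerprint) and PFF (Peptide Fragment Fingerprint) spectra were built by extracting spectral feature variables and applying Linear Discriminant Analysis (LDA), and were applied to the study of search quality control. 4. To support data analysis of the human liver proteome, four reference datasets were constructed: a human and mouse liver proteome dataset, a healthy human plasma proteome dataset, a healthy human heart proteome dataset, and a dataset of liver disease-related genes and proteins. 5. Liquid-phase isoelectric focusing (LIEF) pre-enrichment was applied to the analysis of the mouse liver proteome, demonstrating the advantage of combining LIEF with multiple experimental routes in proteomic studies of complex samples.

In the complex processes of life, proteins are the concrete executors of the various life activities. In 1994 Williams formally proposed the concept of the proteome, and in 1995 Wilkins formally introduced the technical term "Proteome" and its definition. Over the past decade or so, proteomics has flourished and has become a research focus and interdisciplinary hot spot in the life sciences, chemistry, information science and other fields, and it has been widely applied to the study of model organisms and humans. With the maturation of biological mass spectrometry, and especially the breakthrough development in the late 20th century of interface techniques such as electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI), mass spectrometric detection met the requirements of high throughput and high resolution and gradually replaced earlier biochemical approaches such as Edman sequencing as the platform of choice for proteome identification. Judging from the current state of proteomics, however, separation and detection are still imperfect in many respects: experimental results are often affected by separation efficiency, abundance, chemical properties and other characteristics of the proteins, and the various search algorithms all have shortcomings to some degree. As a result, search results are commonly accompanied by false positives and false negatives. Solving this problem requires effort from both experimental science and information science. Besides the continuing improvement of experimental techniques, optimization of the data processing workflow and control of search quality are essential, yet there is still no generally accepted model for ideal data processing. Based on the biological mass spectrometry platform of the Fudan University proteomics research center, this dissertation makes a series of attempts on important aspects of the current proteome data processing workflow, namely standardized data management, optimization of search strategies, and quality control. It provides a theoretical basis and implementation approach for further improving the reliability of protein identification and for presenting the proteome of biological samples more accurately. The dissertation has six chapters, summarized as follows.

Chapter 1: Introduction. This chapter outlines the history of proteomics and reviews the main technical systems and directions of development. The data processing methods used in proteomics research are reviewed in detail, covering the data processing workflow, the classification and a brief introduction of database search software, the challenges facing data processing, and current research hot spots. The research direction and approach of this dissertation are derived from this review.

Chapter 2: A preliminary study of proteomics experimental workflows and standardized data management. In view of the characteristics of the Fudan University instruments and data output within HLPP, and following the principles of the PSI (The Proteomics Standards Initiative), information was extracted from the experimental workflows and data output, and corresponding data templates were established. The workflows cover the current mainstream experimental platforms, and the templates record in digital form the experimental and data parameters of separation techniques such as two-dimensional gel electrophoresis (2-DE) and multidimensional liquid chromatography (MDLC) and of MALDI and ESI mass spectrometry. The templates have been used in actual HLPP data exchange and management and have provided practical experience for the standardized management of large-scale experimental data.

Chapter 3: The iterative non-m/z-sharing search rule in large-scale mass spectrometry data analysis. Because current separation methods cannot guarantee that each separated unit contains only one protein, the subsequent search is often affected, producing false positive or false negative matches. In this chapter, database statistics and simulated matching of experimental data showed that the sharing of m/z values between search results is an important source of false positives. An optimized search strategy is therefore proposed. The mass spectra are first denoised using an improved half-decimal-place rule and a frequency limit. Credible results are then selected according to the match score and whether the match contains shared m/z values. The m/z values already matched are filtered out of the spectrum file to produce a new search file, and the search is iterated until no further credible result is obtained. To further ensure the credibility of the results, the reverse-database method is also used. Application to a standard protein experiment and to the French human liver proteome experiment showed that combining the iterative non-m/z-sharing rule with the reverse-database method reflects the components of the proteome more completely and gives the search results better credibility. This provides a systematic and credible data analysis method for 2-DE-MALDI TOF/TOF and similar experimental platforms.

Chapter 4: Quality evaluation of MALDI TOF/TOF mass spectra and its application to search quality control. Spectrum quality underlies all biological mass spectrometry data analysis and is closely tied to search results, but nearly all the data used in existing studies are tandem data from ion trap instruments (mainly LTQ and LCQ), and evaluations of PMF spectra and of MALDI TOF/TOF tandem data are very rare. Based on large-scale data produced by MALDI TOF/TOF, evaluation models for PMF and PFF data were built by extracting spectral feature variables and applying linear discriminant analysis. On the basis of these models, reverse-database analysis was used to examine how the quality of a PMF spectrum and that of its associated PFF spectra influence each other, and how spectrum quality relates to protein identification. Finally, a Well Quality index was defined to give a unified evaluation of the quality of the PMF and PFF spectra originating from the same separated unit. The results show that spectrum quality is the decisive factor in whether a protein match succeeds, and a good spectrum is usually a prerequisite for a high-quality match. The Well Quality index reflects well how good the mass spectrometric detection of a separated unit is, and it has a clear linear relationship with the identification success rate and with the score. In addition, the probability of random matching also increases for spectra of better quality, so a new score background subtraction method was adopted for quality control, with good results. Quality evaluation of MALDI TOF/TOF spectra offers a new approach to quality control in protein identification, and a new route to the optimization of proteomics experiments and to mechanistic studies.

Chapter 5: Construction and preliminary analysis of reference datasets for the liver proteome. Large-scale proteomic studies of tissue samples are growing rapidly and produce very large amounts of data. This chapter establishes reference datasets for proteomic studies of the liver, plasma and heart and for liver disease-related genes and proteins, and analyzes them in a preliminary way. To ensure completeness and reliability, relevant results published in recent years in well-known international journals were searched exhaustively in the NCBI PubMed literature database by manual searching and reading, and the relevant parameters of each proteomic study were extracted wherever possible. Four reference datasets were finally constructed: a human and mouse liver proteome dataset, a healthy human plasma proteome dataset, a healthy human heart proteome dataset, and a dataset of liver disease-related genes and proteins. Preliminary analysis shows that the datasets share a certain number of proteins and that the HLPP dataset covers a very large part of each reference dataset. This overlap probably reflects features common to the tissues and organs of the body. Moreover, the liver, as the body's most important metabolic organ, synthesizes a large proportion of the plasma proteins, and the high degree of overlap between the datasets points to the importance of the liver to the human body. The collection and curation of these data provide a systematic and credible data reference for related research and support its further development.

Chapter 6: Proteomic study and data analysis of mouse liver based on liquid-phase isoelectric focusing (LIEF) pre-enrichment. Prefractionation has been shown to improve protein identification considerably, especially for low-abundance proteins. In this chapter, proteins in mouse liver were pre-enriched by LIEF, and the enriched proteins were analyzed by two strategies: two-dimensional gel electrophoresis (2-DE) and one-dimensional reversed-phase chromatography (SDS-PAGE RPLC). The results show that LIEF greatly increases the number of protein spots in the subsequent 2-DE analysis and substantially improves the corresponding search quality. The LC route after LIEF enrichment and the 2-DE route complement each other well, which shows that applying new techniques together with existing experimental routes is an effective way to analyze the proteome of complex biological samples comprehensively and efficiently. LIEF also reveals protein spots carrying modification information more clearly, which is important for exploring the functional state behind expression profile data.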
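The iterative search summarized for Chapter 3 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: `toy_search`, the peak-count score, the tolerance `tol`, the threshold `min_matches`, and the toy database are all hypothetical stand-ins for a real search engine and real spectra, and the denoising and reverse-database steps are omitted. The sketch accepts only the best credible hit in each round, removes its matched m/z values from the peak list, and searches again, so no m/z value is credited to two proteins.

```python
from typing import Dict, List, Set, Tuple

def toy_search(peaks: List[float], database: Dict[str, List[float]],
               tol: float = 0.2) -> Dict[str, Set[float]]:
    """Hypothetical stand-in for a search engine: for each protein,
    return the observed peaks that lie within tol of a theoretical mass."""
    hits = {}
    for protein, masses in database.items():
        matched = {p for p in peaks if any(abs(p - m) <= tol for m in masses)}
        if matched:
            hits[protein] = matched
    return hits

def iterative_search(peaks: List[float], database: Dict[str, List[float]],
                     min_matches: int = 3, tol: float = 0.2,
                     max_rounds: int = 10) -> List[Tuple[str, List[float]]]:
    """Accept the best credible hit, filter out its matched m/z values,
    and repeat until no credible hit remains."""
    remaining = list(peaks)
    accepted = []
    for _ in range(max_rounds):
        hits = toy_search(remaining, database, tol)
        # Assumed credibility criterion: enough matched peaks in this round.
        credible = {p: m for p, m in hits.items() if len(m) >= min_matches}
        if not credible:
            break
        # Highest peak count wins; ties are broken by name for determinism.
        best = max(credible, key=lambda p: (len(credible[p]), p))
        accepted.append((best, sorted(credible[best])))
        # Matched m/z values are removed, so later hits cannot reuse them.
        remaining = [p for p in remaining if p not in credible[best]]
    return accepted

# Toy data (made up): A and B share 1000.5; C is supported only by a peak shared with A.
database = {
    "A": [1000.5, 1500.7, 1800.8, 1950.4],
    "B": [1000.5, 1200.6, 2100.9, 2300.1, 2500.2],
    "C": [1500.7, 3000.3],
}
peaks = [1000.5, 1200.6, 1500.7, 1800.8, 1950.4, 2100.9, 2300.1, 2500.2]
for name, matched in iterative_search(peaks, database):
    print(name, matched)
```

With this toy data, B is accepted in the first round, A is accepted in the second round on its three remaining exclusive peaks, and C is never reported because its only support is a shared peak.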

【Abstract】 The main contributions of this dissertation are as follows. 1. Construction and application of proteomics-standard templates for data management and exchange. 2. Development of an optimized search strategy, the 'Iterative Non-m/z-Sharing Rule', for confident and sensitive protein identification in 2-DE-based proteomics. 3. Spectral quality assessment and its application to the interpretation of gel-based MALDI-TOF/TOF MS data. 4. Establishment of four reference datasets for human liver proteome analysis. 5. Comprehensive proteome analysis of mouse liver by ampholyte-free liquid-phase isoelectric focusing.

Proteins are the executors of complex biological processes and are therefore closer to the core of biological systems than genes. In 1994, Dr. Williams presented the original concept of the proteome, and Dr. Wilkins defined the term "proteome". Over the past two decades, technical advances have turned high-throughput proteome analysis into reality. The development of proteomics has relied mainly on improvements in ESI-MS and MALDI-MS technology, and it has also given rise to proteomics-oriented bioinformatics. Proteomics has now emerged as an important field at the intersection of life science, chemistry and informatics. MS-based proteomics has replaced the classic Edman sequencing method as the main tool for large-scale protein identification. However, proteomics is still hindered by the deficiencies of separation and detection platforms. Although much effort has been devoted to the problem, false positive and false negative results remain inevitable. The solution depends on progress and/or breakthroughs in both analytical science (wet) and informatics (dry). The evaluation and development of the corresponding algorithms continue to advance, but many issues remain unaddressed.
Herein, based on the proteomics platforms of Fudan University, a series of efforts were made to resolve several key problems in comprehensive proteome data analysis, described briefly as follows.

Firstly, the current status of proteome data processing methods was reviewed, including an introduction to search engines, the progress and challenges of proteome data analysis, and important research objectives.

Secondly, the construction and application of proteomics-standard templates were described in detail. Following the principles of the HUPO PSI (Proteomics Standards Initiative), the templates were constructed through parameter extraction, minimization of parameter requirements, testing of template drafts, and application. The templates cover the most important proteomics platforms and have been successfully applied to data management and exchange in the Human Liver Proteome Project (HLPP).

Thirdly, a systematic search strategy named Iterative Non-m/z-Sharing (INMZS) analysis was proposed to address false matches in 2-DE-based proteomic data, many of which are caused by m/z values shared among several matches. The strategy therefore focuses primarily on validating matched m/z values. It uses a decimal rule and a frequency threshold to filter noise signals from the PMF and corresponding PFF peak lists. Search results are then screened by the sharing status of their matched m/z values: only proteins matched with exclusive m/z information are retained as final results. Iterative searching is then applied to improve the discovery of minor components in a spot. Finally, all identifications are confirmed by reverse-database evaluation. Simulation and application tests of INMZS were carried out on large datasets from the human liver proteome and standard protein cocktails.
These results showed that INMZS ensures both confidence and sensitivity in 2-DE-based protein identification.

Fourthly, a multivariate regression approach was used to assess spectral quality for both PMF and PFF spectra obtained by MALDI TOF/TOF mass spectrometry. The resulting quality index was then applied to the investigation of MASCOT search results. After analysis of the different MASCOT search modes, a validation method based on the score difference between normal and reference (reverse or random) database searches was proposed to define positive matches. Systematic examination of two large-scale datasets from human liver tissue showed that spectral quality is a key factor in successful matching. Further analysis showed that spectral quality assessment is also effective in representing the quality of a 2-DE gel spot and in promoting the discovery of potential post-translational modifications.

Fifthly, to construct comprehensive and reliable reference datasets, manual searching and analysis were carried out using the NCBI PubMed search engine. The liver disease-related dataset was collected from OMIM and GeneCards under strict quality control. Four reference datasets were constructed: Integrated Liver tissue Proteome (ILP), Human Heart Proteome (HHP), Human Plasma Proteome (HPP) and Liver Disease Genes and Proteins (LDGP). The overlaps between the constructed datasets and the Human Liver Proteome (HLP) are all considerable, indicating remarkable similarity and/or protein exchange between liver, plasma and other tissues. After annotation with HLP semi-quantitative information, many HLP proteins were found to be expressed at low, extra-low or trace abundance in liver. This abundance distribution suggests that the HLP presents a comprehensive protein profile of liver tissue.

Sixthly, ampholyte-free liquid-phase isoelectric focusing (LIEF) was combined with narrow-pH-range 2-DE and SDS-PAGE HPLC for comprehensive analysis of the mouse liver proteome.
Because LIEF prefractionation greatly reduces sample complexity and enhances the loading capacity of IEF strips, the number of visible protein spots on subsequent 2-DE gels increased significantly, facilitating the discovery of low-abundance proteins. In total, 6271 protein spots were detected after LIEF prefractionation and integration of five narrow-pH-range 2-DE gels covering pH 3~11. Furthermore, the LIEF fraction of pH 3~5 and the unfractionated sample were each separated by pH 3~6 2-DE and identified by MALDI-TOF/TOF. In parallel, the LIEF fraction of pH 3~5 was also analyzed with an SDS-PAGE RP-HPLC MS/MS strategy. More proteins of low abundance and/or with extreme physicochemical characteristics were identified than with the conventional 2-DE method. The combination of LIEF-hyphenated 2-DE and LC strategies is also effective in promoting the identification of new proteins and the investigation of post-translational modifications of mouse liver proteins.
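The score-difference validation mentioned in the fourth part of the abstract can be illustrated with a short Python sketch. The abstract does not give the actual criterion, so everything here is an assumption for illustration: the function names, the `REV_` prefix, the threshold `min_delta`, and the made-up scores, which stand in for search-engine scores such as those reported by MASCOT. A match counts as positive when its score against the normal database exceeds its score against the reversed database by at least `min_delta`.

```python
from typing import Dict

def reverse_database(db: Dict[str, str]) -> Dict[str, str]:
    """Build a reference database by reversing every protein sequence."""
    return {f"REV_{name}": seq[::-1] for name, seq in db.items()}

def positive_matches(normal_scores: Dict[str, float],
                     reverse_scores: Dict[str, float],
                     min_delta: float) -> Dict[str, float]:
    """Return spectrum id -> score difference for matches whose normal-database
    score beats the reversed-database score by at least min_delta.
    A spectrum with no reversed-database hit is treated as scoring 0 there."""
    result = {}
    for spectrum_id, score in normal_scores.items():
        delta = score - reverse_scores.get(spectrum_id, 0.0)
        if delta >= min_delta:
            result[spectrum_id] = delta
    return result

# Made-up scores for three gel spots searched against both databases.
normal_scores = {"spot1": 85.0, "spot2": 42.0, "spot3": 60.0}
reverse_scores = {"spot1": 30.0, "spot2": 38.0, "spot3": 20.0}

print(reverse_database({"P1": "MKTAY"}))
print(positive_matches(normal_scores, reverse_scores, min_delta=25.0))
```

In this toy example spot2 is rejected because its normal-database score is barely above its reversed-database score, which is the pattern expected for a random match.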

  • 【Online Publication Contributor】 Fudan University
  • 【Online Publication Year/Issue】 2009, Issue 03