
Construction and Application of a Management and Analysis System for Mass Spectrometry-Based Serum Peptidome Profiles

【Author】 Cao Yuan (曹源)

【Advisors】 Shao Ningsheng (邵宁生); Li Wuju (李伍举)

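【Author information】 Academy of Military Medical Sciences of the Chinese People's Liberation Army, Biochemistry and Molecular Biology, 2009, Doctoral dissertation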

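【Abstract (Chinese original, translated)】 In the post-genomic era, with the completion of genome sequencing for humans and other model organisms and major breakthroughs in mass spectrometry instruments and methods, proteomics has made great progress in both basic research and clinical application. Clinical proteomics is a recently emerged branch of proteomics that focuses on applying proteomic techniques in clinical medicine, including disease prevention, early diagnosis and adjuvant therapy. Clinical proteomics involves many data types, and the serum peptidome profile is one of the more important. It belongs to gel-free clinical proteomics: the accurate masses of the peptides in serum are measured by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) or surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF/MS), and the resulting data are then processed with bioinformatics methods. By comparing the serum peptidome profiles of patients and healthy controls, disease-specific proteins or peptides can be found, which helps in studying disease mechanisms at the protein level. Serum peptidome profiling has broad prospects in biomarker discovery, early diagnosis and personalized therapy.

However, several factors must be considered when applying serum peptidome profiling in clinical research. The first is sample selection. For the patient and normal control samples collected for a clinical study, both inter-individual and intra-individual variation must be taken into account. Inter-individual differences among normal controls include age, sex, ethnicity, family history and medical history. Patient samples should preferably cover the complete set of disease subtypes, and the information collected should be as complete as possible so that mathematical models can be built and validated. The second is sample collection, which is a source of pre-analytical variation: differences in environmental conditions during collection, storage and transport affect the samples. Because these differences are generally unrelated to disease, they may complicate the search for disease-related differential proteins or peptides and ultimately affect the results of the analysis. The last is variation in instrument analysis. The mass spectrometers required are mainly MALDI-TOF/MS and SELDI-TOF/MS. Because many factors influence a mass spectrometry experiment, the raw spectra contain a large amount of noise and must be preprocessed to remove interference.

Serum peptidome profiles have both a large number of variables and a large number of samples. For data this complex, only bioinformatics methods can identify a set of peptide peaks closely related to disease and reveal the disease-related features in the profiles. However, existing data management and analysis tools no longer meet current needs, and the high price of commercial software also limits, to some extent, the wide application of serum peptidome profiling. We therefore combined clinical proteomics with bioinformatics and developed BioSunMS, a management and analysis system for mass spectrometry-based serum peptidome profiles. The system is built on the Eclipse plug-in architecture and written in Java. It is easy to distribute and extend, has a friendly interface and runs across platforms. It supports the management of clinical samples and mass spectra as well as spectrum preprocessing and modeling, and thus helps researchers carry out disease classification and subtyping studies quickly and conveniently. Finally, we illustrate the functions of BioSunMS with a sample classification and subtyping study based on the serum peptidome profiles of lung cancer patients. The details are as follows.

1. Construction of the serum peptidome profile database
The database stores the serum peptide spectra, samples and related clinical information of healthy people and of patients with various tumors (including lung cancer, liver cancer, breast cancer, rectal cancer, prostate cancer and leukemia). It mainly records sample source, diagnostic method, sample processing procedure, mass spectrometry detection method and serum peptide mass spectrometry data. The database provides the following main functions. Profile query: users can obtain the marker peaks of the serum peptidome profile of a specific tumor and the corresponding peptide sequences. Data submission: researchers can submit disease profile data collected in their own laboratories, which enriches the range of diseases in the database. Disease information analysis: laboratory staff can query the database directly with a clinically obtained profile to retrieve disease-related information.

2. Development of software for processing and analyzing serum peptidome profile data
To carry out tumor classification and subtyping studies based on serum peptidome profile data quickly and accurately, we developed data processing and data analysis modules. The data processing module provides spectrum display, data import and export, format conversion and preprocessing for the acquired mass spectrometry data. The data analysis module performs statistical analysis on the preprocessed data, finds characteristic peaks, builds profile models and classifies blind samples, enabling rapid, automated biomarker discovery and related analyses.

3. Tumor classification and subtyping based on serum peptidome profile data
Using common statistical and machine learning methods such as support vector machines (SVM), principal component analysis (PCA), genetic algorithms (GA), naïve Bayes and partial least squares (PLS), and based on the data in the profile database, we built a tumor classification and subtyping module that also provides model parameter optimization, so that researchers can carry out tumor classification and subtyping studies.

4. Establishment of a tumor-specific serum peptidome profile model
This study was carried out in collaboration with the National Center of Biomedical Analysis. In preliminary work, the center had acquired high-resolution mass spectrometry profile data for 1000 healthy people and more than 2000 patients with lung cancer, liver cancer, breast cancer, rectal cancer, prostate cancer, leukemia and other tumors. On this basis, BioSunMS was used to analyze the profiles of 254 lung cancer cases and 257 normal controls in the database. First, the profiles of 150 lung cancer samples and 150 control samples formed the training set, and the remaining 104 lung cancer samples and 107 normal control samples formed the test set. Variables were selected by t-test with P < 0.005 as the criterion, which yielded 74 characteristic peaks. Based on these variables, we built a classification model for lung cancer profiles with the SVM method and validated it on the test set. On the test set, the classification accuracy, sensitivity and specificity were 92.3%, 96.3% and 94.3%, respectively. Through this analysis we found a number of characteristic mass spectral peaks for lung cancer and, using these peaks as features, built an early diagnosis model for lung cancer based on mass spectrometry serum peptidome profiles, a preliminary exploration of early lung cancer diagnosis.

In summary, this study built BioSunMS, a software package that integrates database management and analysis of mass spectrometry-based serum peptidome profiles, applied it to a preliminary analysis of lung cancer profile data, and built an early diagnosis model for lung cancer, providing bioinformatics support for research based on mass spectrometry serum peptidome profiles.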
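The t-test peak screening described above (P < 0.005 on the training set) can be sketched roughly as follows. This is only an illustration under stated assumptions, not BioSunMS code: the abstract does not say which t-test variant was used, so Welch's unequal-variance form is assumed, and the p-value is approximated with a normal distribution, which is reasonable only for large groups such as the 150 samples per class used here. The function names and the toy intensity matrix are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both groups are constant: no evidence if equal, infinite otherwise.
        return 0.0 if diff == 0.0 else copysign(inf, diff)
    return diff / se


def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large-sample assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def select_peaks(cases, controls, alpha=0.005):
    """Return indices of peaks whose intensity differs between the groups.

    cases, controls: lists of spectra, each a list of peak intensities
    aligned to the same peak positions (hypothetical data layout).
    """
    selected = []
    for j in range(len(cases[0])):
        a = [spectrum[j] for spectrum in cases]
        b = [spectrum[j] for spectrum in controls]
        if approx_two_sided_p(welch_t(a, b)) < alpha:
            selected.append(j)
    return selected


# Toy data: peak 0 is shifted between the groups, peak 1 is not.
cases = [[10.0 + (i % 5) * 0.1, 3.0 + (i % 4) * 0.2] for i in range(40)]
controls = [[5.0 + (i % 5) * 0.1, 3.0 + (i % 4) * 0.2] for i in range(40)]
print(select_peaks(cases, controls))  # [0]
```

In a real analysis the selected peak indices would then be passed as features to a classifier such as an SVM; that step is omitted here.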

【Abstract】 In the post-genomic era, with the completion of large-scale genome sequencing for humans and model organisms and major breakthroughs in mass spectrometry, proteomics has made great progress in both basic research and clinical application. As a branch of proteomics, clinical proteomics focuses on the application of proteomic techniques in clinical medicine, including disease prevention, early detection and adjuvant therapy. Many kinds of data are involved in clinical proteomics, and the serum peptidome profile is an important one. It is a profile of the proteins or peptides present in serum, obtained via matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) or surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF/MS). By comparing peptide profiles between patient and control groups, we can find differentially expressed proteins or peptides associated with the development of a disease at the protein level.

Serum peptidome profiling shows broad prospects in clinical studies such as biomarker discovery, early detection and personalized medicine. However, the following issues should be considered before applying it in clinical studies. First, sample selection, which affects the result of serum peptidome profiling, should be carefully assessed. We should consider individual differences among patients and control groups, such as age, sex, race, family history and medical history. Meanwhile, the different disease stages of individual patients must be identified. To construct mathematical models for disease diagnosis and to validate them, the sample information and the patient records linked to the samples should be comprehensive. Second, many factors in sample collection, transportation and storage can affect the samples, so it is important to evaluate their effect on diagnostic sensitivity. Recording detailed information on the collection, processing and storage of samples is crucial both for efficient reporting of biomedical studies and for subsequent data analysis. Finally, because there is much noise in the raw MS data from MALDI-TOF/MS or SELDI-TOF/MS, data preprocessing must be conducted.

Because the numbers of variables and samples in serum peptidome profiles are very large, bioinformatics tools play a key role in discovering a set of peaks related to disease. Several projects for MS data management and analysis exist, but few address both the management of patient information and the statistical analysis of MALDI-TOF or SELDI-TOF MS data. Here, we developed a flexible and compact software package, BioSunMS, for MALDI-TOF or SELDI-TOF MS-based clinical proteomics studies. BioSunMS was designed to support decision-making and allows patient information and spectra to be stored, managed, processed and analyzed. The software has been tested with MS files of serum samples from lung cancer patients and control groups. The thesis is divided into the following four parts.

1. Construction of the database for serum peptidome profiles
The database stores data from patients and control groups. The diseases include lung cancer, liver cancer, breast cancer, rectal cancer, prostate cancer and leukemia, among others. Tables record the information corresponding to each sample, with fields such as sample source, clinical diagnosis, sample preprocessing, detection method and MS data. Users can submit serum peptidome profile data to the database. The database can be queried for spectra meeting desired criteria, such as research group, user, sample state, sample type, patient and characteristic description.

2. Development of BioSunMS software for the analysis of serum peptidome profiles
BioSunMS includes two main modules, spectrum processing and MS profile analysis. The spectrum processing module performs spectrum import, spectrum export and preprocessing steps such as calibration, normalization and peak detection. The MS profile analysis module is designed for sample classification and identification of potential biomarkers. It includes feature selection and model construction to allow rapid, automated analysis.

3. Sample class discovery and sample class prediction based on serum peptidome profiles
To provide a platform for clinical researchers, we built a classification module based on the data in the database, using machine learning and statistical methods such as SVM, PCA, GA, naïve Bayes and PLS.

4. Construction of a serum peptidome profile-based model for lung cancer
The study was carried out in collaboration with the National Center of Biomedical Analysis. During the preliminary research period, the center collected and measured 1000 control samples and more than 2000 cancer samples by mass spectrometry. The dataset included 254 patients with lung cancer and 257 corresponding normal controls. To construct a model for the diagnosis of lung cancer using BioSunMS, we first used 150 lung cancer samples and 150 healthy control samples as the training dataset. The remaining 104 lung cancer samples and 107 healthy control samples formed the test dataset. Then, a t-test was used to screen for statistically significant peaks (P < 0.005) in the training dataset, and 74 peaks were found. Finally, a support vector machine (SVM) was used to construct the model. The accuracy, sensitivity and specificity of the model on the test dataset were 92.3%, 96.3% and 94.3%, respectively. The model has potential application in the early detection of lung cancer.

In summary, we have developed the software BioSunMS, which integrates patient information and MS data storage, processing, sample class discovery, sample classification and sample prediction in a single, user-friendly workbench. The project provides an additional solution for analyzing high-throughput MS data of serum peptidome profiles. Using BioSunMS, we also constructed an early detection model for lung cancer based on the serum peptidome profile. The study thus provides bioinformatics support for the application of serum peptidome profiling in clinical studies.
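For readers unfamiliar with the three figures reported for the test dataset, the following sketch shows how accuracy, sensitivity and specificity are conventionally computed from true and predicted labels. It is a generic illustration rather than the BioSunMS implementation. The function name, the label coding (1 for lung cancer, 0 for control) and the toy labels are assumptions made for this example, and the code assumes both classes are present.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity and specificity for a binary classifier.

    Assumes both classes occur in y_true; `positive` marks the disease class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),  # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),       # share of patients detected
        "specificity": tn / (tn + fp),       # share of controls correctly cleared
    }


# Toy labels (hypothetical): 4 patients and 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["sensitivity"])  # 0.8 0.75
```

Sensitivity depends only on the patient samples and specificity only on the control samples, while accuracy pools both groups.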
