非小细胞肺癌发生分子机制的生物信息学研究

Research on Molecular Mechanism of Non-small Cell Lung Cancer (NSCLC) Based on Bioinformatics

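【Author】 陈启龙 (Chen Qilong)

【Advisor】 郑文岭 (Zheng Wenling)

【Author information】 Shanghai University (上海大学), Electronic Biotechnology and Equipment, 2009, doctoral dissertation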

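【Abstract (Chinese original, translated)】 Lung cancer is a common malignant tumor of the lung. In recent years, under the influence of various environmental factors, its incidence and mortality have risen rapidly worldwide, particularly in industrialized countries. It now ranks first among malignant tumors in both incidence and mortality and poses a serious threat to human health. To date, however, the molecular mechanism of lung carcinogenesis remains unclear, which makes effective early diagnosis and treatment difficult. This dissertation therefore uses bioinformatics methods to investigate the molecular mechanism of non-small cell lung cancer (NSCLC) from three angles: data mining of differential gene expression, prediction of protein-protein interactions (PPI), and construction of PPI networks. Part of the bioinformatics results were also verified by molecular biology experiments. Because NSCLC is the main type of lung cancer, the data used come from the lung squamous carcinoma and lung adenocarcinoma datasets in the GEO database. The work is as follows.

First, BRB-Array Tools and MATLAB programs were combined to mine the lung squamous carcinoma dataset (GDS1312) and the lung adenocarcinoma dataset (GDS1650). The aim was to address two questions: how gene expression patterns change in NSCLC, and which metabolic pathways the differentially expressed genes take part in and what role those pathways may play in NSCLC development. GDS1312 contains genome-wide expression data for 5 lung squamous carcinoma tissues and the matched normal adjacent tissues. Data mining identified 409 up-regulated and 877 down-regulated genes in squamous carcinoma. GO classification matched 1730 genes to 95 GO categories, mainly involving the cytoskeleton, regulation of cell proliferation, programmed cell death, immune response, and proteases. The KEGG pathways mainly involved metabolism, the cell cycle, and disease-related pathways. The BioCarta pathways mainly involved cell adhesion, cell cycle regulation, cellular immunity, cell signaling, and metabolism. GDS1650 contains genome-wide expression data for 10 lung adenocarcinoma tissues and the matched normal adjacent tissues. Data mining identified 632 up-regulated and 975 down-regulated genes in adenocarcinoma. GO classification matched 1358 genes to 63 GO categories, mainly involving cytoskeleton biogenesis, cell adhesion, cell recognition, blood vessel development, and protein kinase binding. The KEGG pathways involved were the cell adhesion molecules pathway, the leukocyte transendothelial migration pathway, the VEGF signaling pathway, the mTOR signaling pathway, and the cell cycle pathway. The BioCarta pathways resembled those found for squamous carcinoma and involved cell adhesion, cell cycle regulation, cellular immunity, cell signaling, and metabolism.

Second, PPI prediction based on a support vector machine (SVM). The feature formed by any two consecutive amino acids is taken as one descriptor (a two-amino-acid feature unit), and the frequency with which each feature unit occurs in a protein sequence is computed. A binary vector space (V, F) is built from these values to describe each protein sequence, so that the PPI information of the sequences is mapped into a feature vector space. A PPI prediction model was built using SVM learning with a radial basis function kernel, and its reliability was checked with 10 repetitions of 10-fold cross-validation. The approach yields a stable PPI prediction model with an accuracy above 83%.

Third, lung cancer related protein data were assembled from the differentially expressed genes of squamous carcinoma and adenocarcinoma. A second round of screening yielded 95 proteins highly related to squamous carcinoma and 178 highly related to adenocarcinoma, of which 19 are co-expressed in both. These proteins were searched against the HPRD database to obtain all currently available PPI data, and the SVM-predicted interactions were integrated with them. After self-interactions and redundant records were removed, the lung cancer related PPI networks were built with Cytoscape. The central nodes (hub proteins) of each network were then computed: the squamous carcinoma network has 19 hub proteins and the adenocarcinoma network has 35. The possible roles of the hub proteins in lung carcinogenesis are discussed, and a "molecular group" hypothesis for lung carcinogenesis is proposed.

Fourth, to verify the bioinformatics results, 6 genes were selected from those co-expressed in squamous carcinoma and adenocarcinoma, and their expression in squamous carcinoma and adenocarcinoma cell lines was measured by semi-quantitative RT-PCR. Five of the genes were expressed in both cell lines, which indicates a degree of "correlation" in their expression. SOX4 was highly expressed, suggesting that it may be related to lung carcinogenesis. PCR-SSCP and DNA sequencing were therefore used to screen 90 lung cancer tissue specimens for SOX4 mutations, and mutations were found in some of them. MATLAB and SwissPdbViewer were combined to predict the tertiary structure of the mutant SOX4 protein. The results show that the mutation alters the side-chain structure of SOX4 and affects its ability to interact with other molecules. Because SOX4 is a development-related transcriptional regulator, this suggests that SOX4 mutation may be a potential factor in lung carcinogenesis.

In summary, lung carcinogenesis is not determined by a single gene or protein, or by a few of them. It may instead arise from a complex regulatory system formed by many tumor-related "molecular groups".

Main innovations of this dissertation:
1. MATLAB programs and BRB-Array Tools were combined to mine differential gene expression data for NSCLC, providing a new approach to microarray data mining and exploring the possible molecular mechanism of lung carcinogenesis at the level of gene expression.
2. A PPI prediction method based on SVM was designed, using the feature of any two consecutive amino acids as a descriptor. The method preserves the amino acid information of a protein pair as fully as possible, and using MATLAB as the experimental platform greatly reduces the difficulty of implementing the algorithm.
3. Protein data highly related to lung carcinogenesis were obtained from the gene expression mining results and combined with PPI information from databases to build lung cancer related PPI networks. A "molecular group" hypothesis of tumorigenesis, centered on the hub proteins of the PPI networks, is proposed, offering a new line of thinking for research on the molecular mechanism of lung carcinogenesis.
4. SOX4 gene mutations were found in lung cancer tissues. MATLAB and SwissPdbViewer were combined to predict the tertiary structure of the SOX4 protein, providing a new approach to homology modeling of protein tertiary structure.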
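
The two-amino-acid descriptor described in the abstract can be illustrated with a minimal sketch. The abstract states only that the frequency of each pair of consecutive amino acids is computed. The 20-letter alphabet, the overlapping sliding window, the normalization by the number of counted units, and the concatenation of two proteins into one pair vector are all assumptions made here for illustration, and the function names are hypothetical. This is Python rather than the MATLAB used in the dissertation, and it does not reproduce the author's (V, F) construction or the SVM model itself.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are considered.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered two-residue units, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return a 400-element list of overlapping two-residue unit frequencies."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        unit = sequence[i:i + 2]
        if unit in counts:  # units containing non-standard residues are skipped
            counts[unit] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[unit] / total for unit in DIPEPTIDES]


def pair_features(seq_a, seq_b):
    """Describe a protein pair by concatenating both vectors (an assumption)."""
    return dipeptide_frequencies(seq_a) + dipeptide_frequencies(seq_b)
```

A vector built this way for each candidate protein pair could then be passed to any SVM implementation that offers a radial basis function kernel.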

【Abstract】 Lung cancer is the most common malignant tumor of the lung. In recent years, under the influence of many environmental factors, the morbidity and mortality of lung cancer have risen rapidly worldwide, especially in industrially developed countries. However, the molecular mechanism of lung cancer remains unclear, which makes early diagnosis and therapy difficult. In view of this, this dissertation uses bioinformatics methods to examine the mechanism of non-small cell lung cancer (NSCLC) through data mining of differentially expressed genes, prediction of protein-protein interactions (PPI), and construction of PPI networks. Part of the bioinformatics results were validated by molecular biology experiments. Because NSCLC is the main type of lung cancer, the data for this dissertation were taken from the squamous carcinoma and adenocarcinoma datasets in GEO. The main work is as follows.

Firstly, the lung squamous carcinoma dataset (GDS1312) and the adenocarcinoma dataset (GDS1650) were mined with BRB-Array Tools and MATLAB in order to elucidate the changes in gene expression patterns, the metabolic pathways of the differentially expressed genes (DEGs), and the possible roles of those pathways in NSCLC development. GDS1312 includes 5 lung squamous carcinoma tissues and 5 normal paracancerous tissues. In squamous carcinoma, 409 DEGs were up-regulated and 877 were down-regulated. Gene Ontology (GO) comparison assigned 1730 genes to 95 GO categories, mainly involving the cytoskeleton, cell cycle regulation, programmed cell death, immune response, and proteases. The KEGG pathways mainly involved metabolism, the cell cycle, and disease-related pathways. The BioCarta pathways mainly involved cell adhesion, cell cycle regulation, immunology, cell signaling, and metabolism. GDS1650 includes 10 lung adenocarcinoma tissues and 10 normal paracancerous tissues. In adenocarcinoma, 632 DEGs were up-regulated and 975 were down-regulated. GO comparison assigned 1358 genes to 63 GO categories, mainly involving cytoskeleton biogenesis, regulation of cell adhesion, cell recognition, blood vessel development, and protein kinase binding. The KEGG pathways involved were the cell adhesion molecules (CAMs) pathway, the leukocyte transendothelial migration pathway, the VEGF signaling pathway, the mTOR signaling pathway, and the cell cycle pathway. The BioCarta pathways were similar to those for squamous carcinoma and also involved cell adhesion, cell cycle regulation, immunology, cell signaling, and metabolism.

Secondly, PPI was predicted with a support vector machine (SVM). The properties of any two consecutive amino acids were taken as a descriptor (a two-amino-acid unit), and the frequency of each unit was counted. A binary vector space (V, F) was then constructed to represent each protein sequence, so that the PPI information of the protein sequences was mapped into a vector space. The PPI prediction model was built by SVM learning with a radial basis function kernel. To validate its reliability, 10 repetitions of 10-fold cross-validation were used. The method gives a stable PPI prediction model with an accuracy above 83%.

Thirdly, the lung cancer PPI network was constructed. A lung cancer related protein dataset was formed from the up-regulated DEGs, giving 95 proteins highly related to squamous carcinoma and 178 highly related to adenocarcinoma. Comparison showed that 19 proteins were co-expressed in the two types of lung cancer. All available PPI data for these proteins were retrieved from the HPRD database and integrated with the PPI information predicted by the SVM. After self-interactions and redundant data were deleted, the PPI network of lung carcinogenesis was constructed with Cytoscape. The hub proteins of the network were obtained by ranking nodes by degree: 19 proteins in lung squamous carcinoma and 35 in adenocarcinoma. The possible role of the hub proteins in the molecular mechanism of lung cancer is discussed, and a "molecular group" hypothesis for lung carcinogenesis is proposed.

Finally, to validate the bioinformatics results, 6 genes were selected from the co-expressed genes, and semi-quantitative RT-PCR was used to measure their expression in squamous carcinoma and adenocarcinoma cell lines. Five genes were expressed in both types of lung cancer cell line, which indicates that the expression of these genes is likely "correlated" in cancer cell lines. Moreover, SOX4 was highly expressed, which indicates that this gene may be associated with lung carcinogenesis. SOX4 mutations were then detected in some of 90 NSCLC tissue samples using PCR-SSCP and DNA sequencing. MATLAB and SwissPdbViewer were combined to predict the tertiary structure of SOX4. The results indicate that the mutation changes the side-chain conformation of SOX4 and may affect its interaction with other molecules. This suggests that SOX4 mutation may be a potential factor in lung carcinogenesis.

In summary, a single gene or protein, or a few of them, cannot determine the molecular mechanism of lung carcinogenesis. The mechanism is more likely associated with a complex regulatory system formed by many carcinogenesis-related "molecular groups".

The main innovations of this dissertation:
1. MATLAB and BRB-Array Tools were combined to mine the differentially expressed gene data of NSCLC, which provides a new method for mining microarray data and explores the possible molecular mechanism of lung cancer at the gene expression level.
2. A PPI prediction method based on a support vector machine (SVM) was designed, with the properties of any two consecutive amino acids as a descriptor. The method preserves the amino acid information of protein pairs as fully as possible, and using MATLAB as the experimental platform greatly reduces the difficulty of implementing the algorithm.
3. Protein data highly related to lung cancer were obtained from the gene expression mining results and integrated with PPI information from databases to construct the PPI network of lung cancer. Based on the hub proteins of the PPI network, a "molecular group" hypothesis for carcinogenesis is proposed, which offers a new lead for studying the mechanism of lung carcinogenesis.
4. SOX4 mutations were found in NSCLC tissues. MATLAB and SwissPdbViewer were combined to model and predict the tertiary structure of the SOX4 protein, which provides a new method for homology modeling of proteins.
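
The network step in the abstract (delete self-interactions and redundant records, then rank proteins by degree to find hubs) can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the author's program: the abstract does not give the degree cutoff used to call a protein a hub, so `min_degree` is a hypothetical parameter, interactions are assumed to be undirected, and the function names are invented for this sketch.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs.

    Self-interactions are dropped, and duplicate or reversed pairs collapse
    into a single edge because neighbors are stored in sets.
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:  # delete self-interaction records
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


def hub_proteins(neighbors, min_degree):
    """Return proteins with degree >= min_degree, highest degree first.

    min_degree is a hypothetical cutoff; ties are broken alphabetically.
    """
    degrees = {protein: len(partners) for protein, partners in neighbors.items()}
    hubs = [protein for protein, degree in degrees.items() if degree >= min_degree]
    return sorted(hubs, key=lambda protein: (-degrees[protein], protein))
```

With placeholder identifiers, `hub_proteins(build_network(pairs), min_degree=2)` would list the most connected proteins in `pairs` first. Real input would be the merged HPRD and SVM-predicted interaction pairs.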

  • 【Online publication contributor】 Shanghai University (上海大学)
  • 【Online publication year and issue】 2010, Issue 05