基于数据挖掘技术的肺癌早期预警模型研究

Study of the Early Warning Model for Lung Cancer Based on Data Mining

【Author】 Wang Na (王娜)

【Supervisor】 Wu Yiming (吴逸明)

【Author information】 Zhengzhou University, Epidemiology and Health Statistics, 2012, PhD

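【Abstract (from the Chinese)】 Lung cancer is the most common malignant tumor worldwide; its incidence and mortality keep rising, and it poses a serious threat to human health and life. In China, lung cancer causes about 400,000 deaths each year and has become the malignancy with the highest incidence and mortality. Studies show that the 10-year postoperative survival rate for stage I lung cancer can reach 92%. However, lung cancer is hard to diagnose early and is highly malignant; by the time it is confirmed pathologically, most cases are already advanced and have missed the best window for surgery, so the overall 5-year survival rate is only about 15%. Reducing lung cancer mortality therefore depends on early detection, early diagnosis, and early treatment. Lung cancer develops through a complex multifactorial, multigene, multistage process. Because conventional imaging and bronchoscopy are limited in sensitivity, specificity, and applicability, researchers in China and abroad have in recent years explored molecular markers for early warning or diagnosis and the combined detection of multiple tumor biomarkers, hoping to find marker combinations that are more rational, more sensitive, and more specific.

Lung cancer results from the joint action of environmental and genetic factors, so biomarkers for early warning or diagnosis can be sought from two directions: susceptibility markers, which reflect an innate or acquired capacity to respond to exogenous substances, and effect markers, which reflect early biological effects, structural and/or functional changes, and disease. Genetic factors belong to the former: they appear as differences in tumor susceptibility among individuals with the same environmental exposure and are ultimately determined by the genetic background that gene polymorphisms represent. On the other hand, many molecular events often precede an overt malignant phenotype, so using molecular biology methods to detect early molecular events in lung carcinogenesis, and thereby to find precancerous lesions or early malignant change, is also regarded as one of the most promising approaches to early warning. The early biological effects of tumorigenesis include genetic and epigenetic changes such as DNA methylation and telomere damage.

Data mining, also called Knowledge Discovery from Database (KDD), is the complex process of extracting previously unknown, valuable knowledge such as patterns or rules from large amounts of data. It is usually associated with computer science and draws on statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition. Data mining differs fundamentally from traditional data analysis: it mines information and discovers knowledge without an explicit prior hypothesis, and the information it yields has three characteristics: it is previously unknown, valid, and usable. Decision trees and artificial neural networks (ANN), two data mining techniques, can process data in a massively parallel way with distributed storage, and offer good adaptivity and self-organization together with strong learning, association, and fault-tolerance capabilities. In tumor diagnosis they can not only detect and classify suspicious lesions but also mine latent features usable for detection and classification, making a constructive contribution to diagnosis.

This study measured, in peripheral blood, polymorphisms of the CYP1A1, GSTM1, GSTT1, mEH, and XRCC1 genes, methylation levels of the p16 and RASSF1A genes, and relative telomere length, and examined how the five gene polymorphisms relate to p16 and RASSF1A methylation and to relative telomere length. On that basis, data mining techniques were applied to assess the relevance of these molecular indicators to early warning of lung cancer, to extract effective features, to build a suitable prediction model, and to examine whether this improves the accuracy of early warning or diagnosis and what combined detection adds to auxiliary diagnosis. The aim was to automate early warning, diagnosis, and classification of lung cancer and to provide useful reference material for screening high-risk populations and for clinical diagnosis.

Objectives
1. To examine the relationship between lung cancer susceptibility and polymorphic genotypes, in the peripheral blood of lung cancer patients, of the phase I metabolic enzyme gene CYP1A1, the phase II metabolic enzyme genes GSTM1, GSTT1, and mEH, and the DNA repair enzyme gene XRCC1; to examine the relationship between lung cancer and methylation of the tumor suppressor genes p16 and RASSF1A and relative telomere length; and to screen effective molecular biomarkers associated with lung cancer, identifying those most informative for early warning, so as to provide baseline data for early warning of lung cancer.
2. To combine data mining techniques with these molecular markers to build an intelligent warning model that processes information "automatically", opening a new route toward an intelligent lung cancer warning system and improving the accuracy of early warning.

Materials and methods
1. The subjects were 251 lung cancer patients and 256 healthy individuals undergoing physical examination.
2. The CYP1A1-exon7 polymorphism was detected by allele-specific amplification (ASA); GSTM1 and GSTT1 polymorphisms by multiplex PCR; and the CYP1A1-MspI, mEH-exon3, mEH-exon4, XRCC1-194, XRCC1-280, and XRCC1-399 polymorphisms by polymerase chain reaction-restriction fragment length polymorphism (PCR-RFLP). Methylation levels of p16 and RASSF1A were measured by real-time quantitative methylation-specific PCR (qMSP), and relative telomere length by fluorescence quantitative PCR.
3. Using SPSS 12.0, the χ² test, t test, rank-sum test, and logistic regression were applied to the polymorphism, methylation, and relative telomere length results to examine how gene polymorphisms, DNA methylation, and changes in relative telomere length relate to lung cancer, and to screen indicators that might be useful in an early discriminant model.
4. The samples of each group were randomly divided 3:1 into a training set and a test set. The CYP1A1-exon7, GSTM1, mEH-exon3, XRCC1-194, and XRCC1-280 polymorphisms, p16 and RASSF1A methylation levels, telomere length, and smoking status served as input parameters. Fisher discriminant analysis, the C5.0 decision tree, and a back-propagation (BP) neural network were each trained on the training set; the trained models were then used for blind prediction on the corresponding test set to compare the discriminant models and finally to establish an intelligent early warning model for lung cancer.

Results
1. The distribution frequencies of the GSTM1-null genotype and of the homozygous mutant genotypes at CYP1A1-exon7, mEH-exon3, XRCC1-194, and XRCC1-280 differed significantly between the case and control groups (P<0.05). Lung cancer risk was higher in GSTM1-null individuals than in GSTM1-positive individuals (ORadj=1.727, 95% CI: 1.211-2.463); in carriers of CYP1A1-exon7 Ile/Val+Val/Val than in carriers of Ile/Ile (ORadj=1.727, 95% CI: 1.203-2.477); in carriers of mEH-exon3 mutant genotypes than in wild-type homozygotes (ORadj=1.758, 95% CI: 1.194-2.589); in carriers of XRCC1-194 Arg/Trp+Trp/Trp than in carriers of Arg/Arg (ORadj=1.542, 95% CI: 1.083-2.196); and in carriers of XRCC1-280 His/His than in carriers of Arg/Arg+Arg/His (ORadj=2.941, 95% CI: 1.427-6.060). The distributions of the CYP1A1-MspI, GSTT1, mEH-exon4, and XRCC1-399 genotypes did not differ significantly between cases and controls (P>0.05). For discriminant models built on the five gene polymorphisms, accuracy on the training and test sets was 63.59% and 63.25% for Fisher discriminant analysis, 95.64% and 82.61% for the decision tree, and 84.1% and 80.77% for the ANN; the areas under the ROC curve (AUC) were 0.627, 0.836, and 0.821, respectively.
2. In the lung cancer group, peripheral blood p16 methylation, RASSF1A methylation, and relative telomere length were 0.59 (0.16-4.50), 27.62 (9.09-52.86), and 0.93±0.32, respectively, all significantly different from the control group (P<0.05). Elevated promoter methylation of p16 and RASSF1A and shortened relative telomere length were associated with increased lung cancer risk. Sex, age, smoking status, lung cancer stage, and pathological type were not associated with p16 or RASSF1A methylation or with telomere length (P>0.05). For discriminant models built on these indicators, accuracy on the training and test sets was 66.34% and 65.82% for Fisher discriminant analysis, 77.26% and 75.45% for the decision tree, and 72.15% and 71.72% for the ANN; the AUCs were 0.660, 0.782, and 0.759, respectively.
3. p16 methylation levels differed among XRCC1-280 genotypes; RASSF1A methylation levels differed among genotypes at CYP1A1-exon7, GSTM1, mEH-exon3, and XRCC1-280; and relative telomere length differed between the mutant and wild-type genotypes of CYP1A1-exon7 and GSTM1. For discriminant models built on all of these indicators combined, accuracy on the training and test sets was 72.15% and 70.59% for Fisher discriminant analysis, 93.88% and 93% for the decision tree, and 92.96% and 89.62% for the ANN; the AUCs were 0.722, 0.929, and 0.894, respectively. For clinical early-stage (I+II) lung cancer, discrimination accuracy was 96.36% for the decision tree model and 89.09% for the ANN model.

Conclusions
1. Variation at the CYP1A1-exon7, GSTM1, mEH-exon3, XRCC1-194, and XRCC1-280 loci, abnormally elevated p16 and RASSF1A methylation, and shortened relative telomere length are associated with increased lung cancer risk; together these indicators form the molecular marker panel of the early warning model.
2. Models that combine data mining with molecular events from multiple aspects of lung carcinogenesis discriminate lung cancer more accurately than detection of a single type of molecular marker.
3. The early warning model built here, which combines multiple tumor molecular markers with decision tree and ANN techniques, discriminates lung cancer better than traditional Fisher discriminant analysis, is better suited to clinical data than conventional statistical methods, is fairly accurate, and can be used for early warning of lung cancer.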
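
The odds ratios in this abstract are adjusted values (ORadj) obtained by logistic regression, so they cannot be reproduced from the abstract alone. As a minimal sketch of the simpler unadjusted calculation, the Python below computes a crude odds ratio with a Woolf (log-based) 95% confidence interval from a 2×2 genotype-by-group table. The function name `odds_ratio_ci` and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio and Woolf confidence interval for a 2x2 table.

    "Exposed" here means carrying the risk genotype. All four counts
    must be positive, otherwise the log-based interval is undefined.
    """
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, NOT taken from the thesis.
or_value, ci_lower, ci_upper = odds_ratio_ci(20, 10, 10, 20)
print(f"OR={or_value:.3f}, 95% CI: {ci_lower:.3f}-{ci_upper:.3f}")
```

An interval that excludes 1 corresponds to a significant association at the chosen level; an adjusted estimate would additionally require the covariates used in the regression.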

【Abstract】 Lung cancer is one of the most frequent malignancies in the world; its morbidity and mortality are continuously rising, and it poses a grave threat to human health. In China about 400,000 people die of lung cancer every year, and its morbidity and mortality are the highest among malignant tumors. Studies show that 10-year postoperative survival can reach 92% in patients with stage I lung cancer. However, lung cancer is difficult to diagnose at an early stage and is highly malignant, so patients are usually diagnosed at an advanced stage and lose the best opportunity for surgery; the overall 5-year survival rate is only about 15%. Early detection, early diagnosis, and early treatment are therefore vital for reducing mortality. The occurrence of lung cancer is a complex process involving many factors, many genes, and multiple steps. Because traditional methods such as imaging and bronchoscopy are limited in sensitivity, specificity, and applicability, many researchers in recent years have explored new molecular markers and the combined detection of multiple tumor markers in order to find more rational and more sensitive marker combinations.

Lung cancer arises from both environmental and genetic factors, so biomarkers for early warning or diagnosis can be sought from two aspects: biomarkers of susceptibility and biomarkers of effect. Genetic factors belong to the former; they are reflected in differences in tumor susceptibility and are determined by genetic polymorphism. On the other hand, many molecular events often occur before an obvious malignant phenotype appears, so detecting early molecular events during lung carcinogenesis to discover precancerous lesions or early-stage cancer is also one of the most promising approaches. Early biological effects during tumorigenesis include genetic and epigenetic changes such as DNA methylation and telomere damage.

Data mining, also known as Knowledge Discovery from Database, is a complex process of extracting unknown and valuable knowledge, such as models or regular patterns, from large amounts of data. It is usually associated with computer science and discovers knowledge through statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition. Data mining differs essentially from traditional data analysis: it extracts information and discovers knowledge without a clear prior hypothesis, and the information it yields is previously unknown, valid, and practical. Decision tree and artificial neural network techniques can process large-scale data in parallel and store it in a distributed manner, and they show good self-adaptation and self-organization as well as strong learning, association, and fault-tolerance functions. In tumor diagnosis, data mining techniques can not only detect and classify suspicious lesions but also mine potential characteristic markers that contribute constructively to diagnosis.

In this study, polymorphisms of the CYP1A1, GSTM1, GSTT1, mEH, and XRCC1 genes, p16 and RASSF1A gene methylation, and telomere length were measured in the peripheral blood of lung cancer patients and healthy people to explore the correlations among them. Data mining techniques were then used to assess the relevance of these molecular indices to early warning or diagnosis of lung cancer, to extract effective features and construct a suitable prediction model, and to explore whether this can increase the accuracy of early diagnosis and what combined detection contributes to auxiliary diagnosis. The aim was to automate early warning, diagnosis, and classification of lung cancer and to provide valuable information for screening high-risk populations and for clinical diagnosis.

Objectives
1. To study the association between polymorphisms of metabolizing and DNA repair enzyme genes and susceptibility to lung cancer; to explore the association of p16 and RASSF1A methylation and telomere length with the occurrence of lung cancer; and to screen effective molecular biomarkers correlated with lung cancer and find the most significant indices, so as to provide baseline data for early warning or diagnosis.
2. To combine data mining techniques with the above indices to construct an intelligent diagnostic model that analyzes information automatically and increases the accuracy of early diagnosis of lung cancer.

Materials and methods
1. A total of 251 lung cancer patients and 256 healthy persons were enrolled as study subjects.
2. The CYP1A1-exon7 genotype was detected by allele-specific amplification (ASA); the GSTM1 and GSTT1 genotypes by multiplex PCR; and the CYP1A1-MspI, mEH-exon3, mEH-exon4, XRCC1-194, XRCC1-280, and XRCC1-399 genotypes by PCR-RFLP. Methylation levels of p16 and RASSF1A were detected by qMSP, and telomere length by real-time quantitative PCR.
3. Using SPSS 12.0, the chi-square test, t test, rank-sum test, and logistic regression were applied to explore the association between the above indices and lung cancer and to screen effective indices for an early discriminant model.
4. The samples of each group were divided into a training set and a test set at a ratio of 3:1. Fisher discriminant analysis, the C5.0 decision tree, and the BP algorithm were used to train models on the training set; the models were then tested blind on the test set to evaluate their performance, and an intelligent model was developed for early diagnosis of lung cancer.

Results
1. The frequencies of the GSTM1-null, CYP1A1-exon7 mt/mt, mEH-exon3 mt/mt, XRCC1-194 Trp/Trp, and XRCC1-280 His/His genotypes were significantly higher in the case group than in the control group (P<0.05). There was an increased risk of lung cancer for individuals carrying the GSTM1-null genotype (ORadj=1.727, 95% CI: 1.211-2.463), CYP1A1-exon7 Ile/Val+Val/Val (ORadj=1.727, 95% CI: 1.203-2.477), mEH-exon3 wt/mt+mt/mt (ORadj=1.758, 95% CI: 1.194-2.589), XRCC1-194 Arg/Trp+Trp/Trp (ORadj=1.542, 95% CI: 1.083-2.196), and XRCC1-280 His/His (ORadj=2.941, 95% CI: 1.427-6.060) compared with subjects carrying the GSTM1-positive, CYP1A1-exon7 Ile/Ile, mEH-exon3 wt/wt, XRCC1-194 Arg/Arg, and XRCC1-280 Arg/Arg+Arg/His genotypes, respectively. There was no significant difference in the CYP1A1-MspI, GSTT1, mEH-exon4, or XRCC1-399 genotypes between the two groups (P>0.05). For the discriminant models built on these indices, the accuracy of the Fisher, decision tree, and ANN models on the training and test sets was (63.59%, 63.25%), (95.64%, 82.61%), and (84.1%, 80.77%), respectively; the AUCs of the three models were 0.627 (Fisher), 0.836 (decision tree), and 0.821 (ANN).
2. The p16 methylation level, RASSF1A methylation level, and telomere length in the peripheral blood of the lung cancer group were 0.59 (0.16-4.50), 27.62 (9.09-52.86), and 0.93±0.32, respectively, and differed significantly from those of the control group. Hypermethylation of the p16 and RASSF1A genes and telomere shortening were correlated with increased risk of lung cancer. Sex, age, tobacco smoking, lung cancer stage, and pathological type were not significantly associated with hypermethylation of p16 or RASSF1A or with telomere length (P>0.05). For the discriminant models built on these indices, the accuracy of the Fisher, decision tree, and ANN models on the training and test sets was (66.34%, 65.82%), (77.26%, 75.45%), and (72.15%, 71.72%), respectively; the AUCs of the three models were 0.660 (Fisher), 0.782 (decision tree), and 0.759 (ANN).
3. The p16 methylation level differed significantly among XRCC1-280 genotypes; the RASSF1A methylation level differed significantly among genotypes of CYP1A1-exon7, GSTM1, mEH-exon3, and XRCC1-280; and telomere length differed among genotypes of CYP1A1-exon7 and GSTM1. For the discriminant models built on all of the above indices, the accuracy of the Fisher, decision tree, and ANN models on the training and test sets was (72.15%, 70.59%), (93.88%, 93%), and (92.96%, 89.62%), respectively; the AUCs of the three models were 0.722 (Fisher), 0.929 (decision tree), and 0.894 (ANN). The accuracy of the decision tree and ANN models for clinical early-stage (I+II) lung cancer was 96.36% and 89.09%, respectively.

Conclusions
1. Polymorphisms of CYP1A1-exon7, GSTM1, mEH-exon3, XRCC1-194, and XRCC1-280, hypermethylation of the p16 and RASSF1A genes, and telomere shortening may contribute to the risk of developing lung cancer; these indices make up the tumor marker panel of the early diagnosis model.
2. The discriminant model based on multidimensional molecular events related to the occurrence of lung cancer is superior to models based on a single type of molecular marker.
3. The diagnostic model based on multiple tumor markers and data mining techniques was superior to the traditional discriminant method, was more suitable for the analysis of clinical data than conventional statistics, and could be used for early warning of lung cancer.
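
The 3:1 per-group split and the two metrics reported throughout the results (accuracy and AUC) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's code: the helper names `stratified_split`, `accuracy`, and `auc` are hypothetical, the random seed is arbitrary, and the classifiers themselves (Fisher discriminant analysis, C5.0, BP network) are not implemented. Only the group sizes, 251 cases and 256 controls, come from the abstract.

```python
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split each class separately at about 3:1; return (train_idx, test_idx)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case (label 1) receives
    a higher score than a randomly chosen control (label 0); ties count 0.5.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Group sizes from the abstract: 251 cases (1) and 256 controls (0).
labels = [1] * 251 + [0] * 256
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 380 127
```

Splitting within each group keeps the case-to-control ratio nearly identical in the training and test sets, which is what "each group divided 3:1" implies; any model's test-set predictions and scores can then be passed to `accuracy` and `auc`.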

  • 【Submitted for online publication by】 Zhengzhou University
  • 【Online publication issue】 2012, No. 09